initial changes. (#507)

Added MathJax to display math symbols.

Added markdown to site processing.
Created markdown bloomFilters directory with multiple documentation files.
Claude Warren 2024-06-23 13:36:29 +02:00 committed by GitHub
parent 26e733fec0
commit 7c9e046d41
9 changed files with 927 additions and 1139 deletions

pom.xml
View File

@@ -170,6 +170,8 @@
<commons.jacoco.branchRatio>0.78</commons.jacoco.branchRatio>
<commons.jacoco.lineRatio>0.85</commons.jacoco.lineRatio>
<commons.jacoco.complexityRatio>0.78</commons.jacoco.complexityRatio>
<doxia.module.markdown.version>1.12.0</doxia.module.markdown.version>
<math.mathjax.version>2.7.2</math.mathjax.version>
</properties>
<build>
@@ -267,8 +269,21 @@
<artifactId>maven-javadoc-plugin</artifactId>
<configuration>
<source>8</source>
<!-- Enable MathJax -->
<additionalOptions>-Xdoclint:all --allow-script-in-comments -header '&lt;script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/${math.mathjax.version}/MathJax.js?config=TeX-AMS-MML_HTMLorMML"&gt;&lt;/script&gt;'</additionalOptions>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-site-plugin</artifactId>
<dependencies>
<dependency>
<groupId>org.apache.maven.doxia</groupId>
<artifactId>doxia-module-markdown</artifactId>
<version>${doxia.module.markdown.version}</version>
</dependency>
</dependencies>
</plugin>
</plugins>
</build>
@@ -812,4 +827,3 @@
</contributors>
</project>

View File

@@ -85,7 +85,7 @@ public class EnhancedDoubleHasher implements Hasher {
*</p>
* <ol>
* <li>If there is an odd number of bytes the excess byte is assigned to the increment value</li>
* <li>The bytes alloted are read in big-endian order any byte not populated is set to zero.</li>
* <li>The bytes allotted are read in big-endian order any byte not populated is set to zero.</li>
* </ol>
* <p>
* This ensures that small arrays generate the largest possible increment and initial values.

View File

@@ -0,0 +1,25 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Bloom filters
Bloom filters were added to Apache Commons Collections in version 4.5.
The documentation comprises four parts:
* [An introduction](bloomFilters/intro.html) to Bloom filters.
* The [Commons Collections implementations](bloomFilters/implementation.html).
* [Unusual usage and advanced implementations](bloomFilters/advanced.html).
* Using [Bloom filters for indexing](bloomFilters/multidimensional.html).

View File

@@ -0,0 +1,324 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Bloom Filters Part 3: Unusual Usage and Advanced Implementations
In the previous post we discussed the Apache Commons Collections® implementation of Bloom filters and showed how to use them to answer the most basic questions. In this post we will look at some unusual usage patterns and some advanced implementations.
## Unusual Usage Patterns
These usage patterns are unusual in the sense that they are not commonly seen in code bases. However, it is important for developers to realize the types of questions that can be answered with Bloom filters.
### Finding a filter in the intersection of sets
If you have two Bloom filters and you want to know if there are elements that are in both, it is possible to create a single Bloom filter to answer the question. The solution is to create a Bloom filter that is the intersection of the two original Bloom filters. To do this we are going to use the `BitMapExtractor.processBitMapPairs()` function. This function takes a `BitMapExtractor` and a `LongBiPredicate` as arguments. The `LongBiPredicate` takes two long values, performs some function, and returns true if the processing should continue and false otherwise.
In this example the `LongBiPredicate` creates a new bitmap by performing a bitwise “and” on the `BitMap` pairs presented by the `BitMapExtractor.processBitMapPairs()` function. The class has a method to present the result as a `BitMapExtractor`.
```java
import org.apache.commons.collections4.bloomfilter.BitMap;
import org.apache.commons.collections4.bloomfilter.BitMapExtractor;
import org.apache.commons.collections4.bloomfilter.LongBiPredicate;
import org.apache.commons.collections4.bloomfilter.Shape;
/**
* Calculates the intersection of two Bloom filters
*/
class Intersection implements LongBiPredicate {
private long[] newMaps;
private int idx;
/**
* Creates an intersection for a specified shape.
* @param shape the shape of the Bloom filters being compared.
*/
Intersection(Shape shape) {
newMaps = new long[BitMap.numberOfBitMaps(shape.getNumberOfBits())];
idx = 0;
}
/**
* Implements the LongBiPredicate test.
* @param a one BitMap
* @param b the other BitMap
* @return true always.
*/
public boolean test(long a, long b) {
newMaps[idx++] = a & b;
return true;
}
/**
* Returns the intersection as a BitMapExtractor.
* @return the intersection as a BitMapExtractor.
*/
BitMapExtractor asExtractor() {
return BitMapExtractor.fromBitMapArray(newMaps);
}
}
```
Now we can use the `Intersection` class to merge the two Bloom filters into a new filter as follows:
```java
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.Shape;
import org.apache.commons.collections4.bloomfilter.SimpleBloomFilter;
// uses the Intersection class defined above
class IntersectionExample {
public static void main(String[] args) {
/* start setup */
// get an array of Bloom filters to check
BloomFilter[] candidateFilters = getBloomFilterCandidates();
// use two populated Bloom filters.
BloomFilter collection1 = populateCollection1();
BloomFilter collection2 = populateCollection2();
/* end setup */
// create the Intersection instance
Intersection intersection = new Intersection(collection1.getShape());
// populate the intersection instance with data from the 2 filters.
collection1.processBitMapPairs(collection2, intersection);
// create a new Bloom filter from the intersection instance.
BloomFilter collection1And2 = new SimpleBloomFilter(collection1.getShape());
collection1And2.merge(intersection.asExtractor());
// now do the search for filters that are in both collections.
for (BloomFilter target : candidateFilters) {
if (collection1And2.contains(target)) {
// do something interesting
}
}
}
}
```
### Sharding by Bloom filter
In processing large volumes of data it is often desirable to fragment or shard the data into different repositories. Bloom filters provide a quick method for sharding the data.
Let's assume there are multiple locations to store files, and we want to distribute the data across the cluster in a way that minimizes collisions. The following code is an example of how to implement this.
```java
import java.io.Serializable;
import org.apache.commons.codec.digest.MurmurHash3;
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.collections4.bloomfilter.SetOperations;
import org.apache.commons.collections4.bloomfilter.Shape;
import org.apache.commons.collections4.bloomfilter.SimpleBloomFilter;
import org.apache.commons.lang3.SerializationUtils;
// Storage is a user-supplied class that implements the actual storage I/O
class ShardingExample {
private BloomFilter[] gatekeeper;
private Storage[] storage;
/**
* Constructor. Assumes 10K items in filter with false positive rate of 0.01
* @param storage The storage locations to store the objects in.
*/
public ShardingExample(Storage[] storage) {
this(storage, Shape.fromNP(10000, 0.01));
}
/**
* Constructor.
* @param storage The storage locations to store the objects in.
* @param shape The shape for the gatekeeper Bloom filters.
*/
public ShardingExample(Storage[] storage, Shape shape) {
gatekeeper = new BloomFilter[storage.length];
this.storage = storage;
for (int i = 0; i < gatekeeper.length; i++) {
gatekeeper[i] = new SimpleBloomFilter(shape);
}
}
/**
* Creates the BloomFilter for the key.
* @param objectKey The key to hash.
* @return the BloomFilter for the key.
*/
private BloomFilter createBloomFilter(Object objectKey) {
byte[] bytes = SerializationUtils.serialize((Serializable) objectKey);
long[] hash = MurmurHash3.hash128(bytes);
BloomFilter bloomFilter = new SimpleBloomFilter(gatekeeper[0].getShape());
bloomFilter.merge(new EnhancedDoubleHasher(hash[0], hash[1]));
return bloomFilter;
}
/**
* Write an object into the filter.
* @param itemKey The item key for the storage layer.
* @param itemToStore The item to store.
*/
public void write(Object itemKey, Object itemToStore) {
// create the Bloom filter for the key
BloomFilter itemBloomFilter = createBloomFilter(itemKey);
// find the storage to insert into.
int selected = 0;
int hammingValue = Integer.MAX_VALUE;
for (int i = 0; i < gatekeeper.length; i++) {
int hamming = SetOperations.hammingDistance(gatekeeper[i], itemBloomFilter);
if (hamming < hammingValue) {
selected = i;
hammingValue = hamming;
}
}
// insert the data.
storage[selected].write(itemKey, itemToStore);
gatekeeper[selected].merge(itemBloomFilter);
}
/**
* Reads the item from the storage.
* @param itemKey The key to look for.
* @return The stored object or null if not found.
*/
public Object read(Object itemKey) {
// create the Bloom filter to look for
BloomFilter itemBloomFilter = createBloomFilter(itemKey);
// assumes storage returns null if key not found.
for (int i = 0; i < gatekeeper.length; i++) {
if (gatekeeper[i].contains(itemBloomFilter)) {
Object itemThatWasStored = storage[i].read(itemKey);
if (itemThatWasStored != null) {
return itemThatWasStored;
}
}
}
return null;
}
}
```
However, the issue with this solution is that once the filters are saturated, the search begins to degrade. To solve this problem a new, empty filter can be added when one of the existing filters approaches saturation. When a filter reaches saturation, remove it from consideration for insert but let it respond to read requests. This solution achieves a balanced Bloom filter distribution and does not exceed the false positive threshold. An example of a test for saturation is:
```java
if (bloomfilter.getShape().estimateMaxN() <= bloomfilter.estimateN()) {
// handle saturated case.
}
```
The above calculation depends upon the `BloomFilter.cardinality()` method, so it is advisable to use Bloom filters that track their cardinality or can calculate it quickly.
## Counting Bloom filters
Standard Bloom filters do not have a mechanism to remove items. One of the solutions for this is to convert each bit to a counter<span><a class="footnote-ref" href="#fn1">1</a></span>. The counter and index together are commonly called a cell. As items are added to the filter, the values of the cells associated with the enabled bits are incremented. When an item is removed the values of the cells associated with the enabled bits are decremented. This solution supports removal of items at the expense of making the filter many times larger than a `BitMap` based one.
The counting Bloom filter also has a couple of operations not found in other Bloom filters:
* Counting Bloom filters can be added together so that the sum of their counts is achieved.
* Counting Bloom filters can be subtracted so that the difference of their counts is achieved.
* Counting Bloom filters can report the maximum number of times another Bloom filter might have been merged into it. This is an upper bound estimate and may include false positives.
There are several error conditions with counting Bloom filters that are not found in other cases.
* Incrementing the counter past the maximum value that can be stored in a cell.
* Decrementing the counter past 0.
* Removing a Bloom filter that was not added. This condition can decrement a cell that had a value of zero, leading to a decrement error. In other cases, when every enabled bit in the Bloom filter maps to a non-zero cell in the counting filter, the error is undetectable but will lead to unexpected results for subsequent operations.
### Apache Commons Collections Implementation
The Apache Commons Collections Bloom filter package contains a counting Bloom filter implementation. To support the `CountingBloomFilter` there are `CellExtractor` and `CellPredicate` interfaces.
The `CellExtractor` extends the `IndexExtractor` by producing the index for every cell with a count greater than 0. The interface defines several new methods:
* `processCells(CellPredicate consumer)` that will call the `CellPredicate` for each populated cell in the extractor.
* `from(IndexExtractor indexExtractor)` A method to produce a `CellExtractor` from an `IndexExtractor` where each index that appears in the `IndexExtractor` is a cell with a value of one (1). Since the `CellExtractor` guarantees index uniqueness, duplicate indices are summed together.
The `CellPredicate` is a functional consumer interface that accepts two integer values, the index and the cell value, for each populated cell.
The `CountingBloomFilter` interface defines several new methods:
* `add(CellExtractor other)` - adds the cells of the extractor to the cells of the counting Bloom filter.
* `subtract(CellExtractor other)` - subtracts the cells of the extractor from the cells of the counting Bloom filter.
* `remove(BitMapExtractor other)`, `remove(BloomFilter)`, `remove(Hasher)`, and `remove(IndexExtractor)` - these methods decrement the associated cells by 1. This is the inverse of the `merge()` methods.
* `isValid()` verifies that all cells are valid. If `false` the `CountingBloomFilter` is no longer accurate and functions may yield unpredictable results.
* `getMaxCell()` defines the maximum value of a cell before it becomes invalid.
The only provided implementation, `ArrayCountingBloomFilter`, uses a simple array of integers to track the counts.
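As a quick illustration of the interface, the hedged sketch below merges and then removes a hashed item using `ArrayCountingBloomFilter`; the `Shape` and hash values are arbitrary, and the exact constructor signature should be checked against the released API.
```java
import org.apache.commons.collections4.bloomfilter.ArrayCountingBloomFilter;
import org.apache.commons.collections4.bloomfilter.CountingBloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.collections4.bloomfilter.Hasher;
import org.apache.commons.collections4.bloomfilter.Shape;
public class CountingExample {
    public static void main(String[] args) {
        // arbitrary example shape and hash values
        Shape shape = Shape.fromNP(1000, 0.01);
        CountingBloomFilter counting = new ArrayCountingBloomFilter(shape);
        Hasher item = new EnhancedDoubleHasher(0x1234L, 0x5678L);
        counting.merge(item);                       // increments the cells for the item
        System.out.println(counting.contains(item)); // true
        counting.remove(item);                      // decrements the same cells
        counting.remove(item);                      // removing an item that is not present...
        if (!counting.isValid()) {
            // ...can decrement a cell below zero, invalidating the filter
        }
    }
}
```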
## Element Decay Bloom filters - for streaming data
Element decay Bloom filters are effectively counting Bloom filters that automatically decrement some of the counts. Technically speaking, counting Bloom filters can be used for streaming data, but to do so one would have to figure out how to remove filters when they were too old to be considered any longer. There are several approaches to this problem; here we discuss two: creating layers of filters based on some quantized time unit (a temporal layered Bloom filter), and the stable Bloom filter.
### Stable Bloom filter
The stable Bloom filter<span><a class="footnote-ref" href="#fn2">2</a></span> is a form of counting Bloom filter that automatically degrades the cells so that items are “forgotten” after a period of time. Each cell has a maximum value defined by the stable Bloom filter shape. When a bit is turned on, the cell is set to the maximum value. The process for an insert is:
1. Randomly select a number of cells and if the value is greater than zero decrement it.
2. For each bit to be enabled, set the cell to the maximum value.
After a period of time the number of enabled cells becomes stable, hence the name of the filter. This filter will detect duplicates for recently seen items. However, it also introduces a false negative rate, so unlike other Bloom filters this one does not guarantee that if the target is not in the filter it has not been seen.
The stable filter works well in environments where inserts occur at a fairly fixed rate; it does not handle bursty environments very well.
There is no implementation of a stable Bloom filter in Commons Collections.
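Since no implementation is provided, the insert procedure described above is easy to sketch. The class below is a hypothetical illustration, not library code; the cell count, maximum cell value, and number of decrements per insert are the tunable parameters of the filter.
```java
import java.util.Random;
/** An illustrative (hypothetical) stable Bloom filter cell array. */
class StableFilterSketch {
    private final int[] cells;
    private final int max;        // maximum cell value
    private final int decrements; // cells decremented on each insert
    private final Random random = new Random();
    StableFilterSketch(int numberOfCells, int max, int decrements) {
        this.cells = new int[numberOfCells];
        this.max = max;
        this.decrements = decrements;
    }
    /** Inserts an item given the cell indices selected by its hash functions. */
    void insert(int[] selectedCells) {
        // step 1: randomly select cells and decrement any non-zero values
        for (int i = 0; i < decrements; i++) {
            int cell = random.nextInt(cells.length);
            if (cells[cell] > 0) {
                cells[cell]--;
            }
        }
        // step 2: set each cell selected by the hash functions to the maximum value
        for (int cell : selectedCells) {
            cells[cell] = max;
        }
    }
    /** A cell reads as enabled while its value is greater than zero. */
    boolean contains(int[] cellsToCheck) {
        for (int cell : cellsToCheck) {
            if (cells[cell] == 0) {
                return false;
            }
        }
        return true;
    }
}
```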
### Layered Bloom filter
The layered Bloom filter<span><a class="footnote-ref" href="#fn3">3</a></span> creates a list of filters. Items are added to a target filter and, after a specified condition is met, a new filter is added to the list and it becomes the target. Filters are removed from the list based on other specified conditions.
For example, a layered Bloom filter could comprise a list of Bloom filters where a new target is created every 10 merges and old filters are removed one minute after their last merge. This provides a one minute window and guarantees that no layer will contain more than 10 merged filters. This type of filter handles bursty rates better than the stable filter.
The layered filter also has the capability to locate the layer in which a filter was added. This gives the user the ability to look back in time, if necessary.
The layered filter can also be used in situations where the actual number of items is unknown when the Bloom filter is defined. By using a layered filter that adds a target whenever the saturation of the current target reaches 1, the false positive rate for the entire filter does not rise above the value calculated for the shape.
### Apache Commons Collections Implementation
The Apache Commons Collections Bloom filter package contains a layered Bloom filter implementation. To support the `LayeredBloomFilter` there are a `LayerManager` class and a `BloomFilterExtractor` interface.
The `LayerManager` handles all of the manipulation of the collection of Bloom filters that comprise the `LayeredBloomFilter`. It is constructed by a `Builder` that requires:
* a `Supplier` of Bloom filters that will create new empty Bloom filters as required.
* a `Predicate` that takes a `LayerManager` and determines if a new layer should be added.
* a `Consumer` that takes a `Deque` of the types of Bloom filters provided by the `Supplier` noted above.
These three properties ensure that the `LayerManager` will produce filters when required, and remove them when necessary.
The `BloomFilterExtractor` is a functional interface that functions much like the `CellExtractor` in the `CountingBloomFilter`. The `BloomFilterExtractor` has methods:
* `processBloomFilters(Predicate<BloomFilter> consumer)` - that will call the `Predicate` for each Bloom filter layer.
* `processBloomFilterPairs(BloomFilterExtractor other, BiPredicate<BloomFilter, BloomFilter> func)` - that will apply the `BiPredicate` to every pair of Bloom filters.
* Methods to produce a `BloomFilterExtractor` from a collection of Bloom filters.
* `flatten()` - a method to merge all the filters in the list into a single filter.
The `LayeredBloomFilter` class defines several new methods:
* `contains(BloomFilterExtractor others)` - returns true if the layered filter contains all of the other filters.
* `find(BitMapExtractor)`, `find(BloomFilter)`, `find(Hasher)`, and `find(IndexExtractor)` - returns an array of ints identifying which layers match the pattern.
* `get(int layer)` - returns the Bloom filter for the layer.
* `getDepth()` - returns the number of layers in the filter.
* `next()` - forces the creation of a new layer.
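A hedged sketch of how the pieces fit together follows. The builder and helper names (`builder()`, `setSupplier()`, `setExtendCheck()`, `setCleanup()`, `ExtendCheck.advanceOnCount()`, `Cleanup.onMaxSize()`) are assumptions based on the `Builder` description above and should be verified against the released API.
```java
import org.apache.commons.collections4.bloomfilter.LayerManager;
import org.apache.commons.collections4.bloomfilter.LayeredBloomFilter;
import org.apache.commons.collections4.bloomfilter.Shape;
import org.apache.commons.collections4.bloomfilter.SimpleBloomFilter;
public class LayeredExample {
    public static void main(String[] args) {
        Shape shape = Shape.fromNP(1000, 0.01);
        LayerManager layerManager = LayerManager.builder()
                // Supplier: creates a new empty layer when required
                .setSupplier(() -> new SimpleBloomFilter(shape))
                // Predicate: start a new layer after every 10 merges
                .setExtendCheck(LayerManager.ExtendCheck.advanceOnCount(10))
                // Consumer: keep no more than 5 layers in the deque
                .setCleanup(LayerManager.Cleanup.onMaxSize(5))
                .build();
        LayeredBloomFilter layered = new LayeredBloomFilter(shape, layerManager);
    }
}
```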
## Review
In this post we covered some unusual uses for Bloom filters as well as a couple of interesting unusual Bloom filters. In the next post, we will introduce the reference Bloom filter and delve into multidimensional Bloom filters. We will also show how multidimensional Bloom filters can be used to search encrypted data without decrypting it.
## Footnotes
<span>
<ol class="footnotes">
<li><a id='fn1'></a>
Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" (PDF), IEEE/ACM Transactions on Networking, 8 (3): 281–293, CiteSeerX 10.1.1.41.1487, doi:10.1109/90.851975, S2CID 4779754, archived from the original (PDF) on 2017-09-22, retrieved 2018-07-30. A preliminary version appeared at SIGCOMM '98.
</li>
<li><a id='fn2'></a>
Deng, Fan; Rafiei, Davood (2006), "Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters", Proceedings of the ACM SIGMOD Conference (PDF), pp. 25–36
</li>
<li><a id='fn3'></a>
Zhiwang, Cen; Jungang, Xu; Jian, Sun (2010), "A multi-layer Bloom filter for duplicated URL detection", Proc. 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE 2010), vol. 1, pp. V1-586-V1-591, doi:10.1109/ICACTE.2010.5578947, ISBN 978-1-4244-6539-2, S2CID 3108985
</li>
</ol>
</span>

View File

@@ -0,0 +1,235 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Bloom Filters Part 2: Apache Commons Collections® Bloom Filter Implementation
Previously I wrote about Bloom filters, what they are, why we use them, and what statistics they can provide. In this post I am going to cover the Apache Commons Collections Bloom filter implementation in Java and explain the logic behind some of the design decisions.
There are several pain points in Bloom filter usage. First, the time it takes to create the number of hashes needed for each filter. Second, the trade-off between storing the bits as a bit vector (which for sparsely populated filters can be wasteful) and storing the indices of the enabled bits (which for densely populated filters can be very large). Adjacent to the internal storage question is how to serialize the Bloom filter. Consideration of these pain points drove design decisions in the development of the Apache Commons Collections Bloom filter implementation.
## Shape
Last time I spoke of the shape of the filter. In Commons Collections, `Shape` is a class that defines the number of bits in the filter (`m`) and the number of hashes (`k`) used for each object insertion.
Bloom filters with the same shape can be merged and compared. The code does not verify that the shape is the same, as the comparison would increase processing time. However, every Bloom filter has a reference to its `Shape` and the test can be performed if desired.
The `Shape` is constructed from one of five combinations of parameters specified in `fromXX` methods:
* `fromKM` - when the number of hash functions and the number of bits is known.
* `fromNM` - when the number of items and the number of bits is known.
* `fromNMK` - when the number of items, the number of bits and the number of hash functions is known. In this case the number of items is used to verify that the probability is within range.
* `fromNP` - when the number of items and the probability of false positives is known.
* `fromPMK` - when the probability of false positives, number of bits, and number of hash functions is known. In this case the probability is used to verify that the maximum number of elements will result in a probability that is valid.
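A minimal illustration of two of the factory methods:
```java
import org.apache.commons.collections4.bloomfilter.Shape;
public class ShapeExample {
    public static void main(String[] args) {
        // expect 10,000 items with a 1% false positive rate
        Shape byEstimate = Shape.fromNP(10000, 0.01);
        // 17 hash functions and 72 bits
        Shape byLayout = Shape.fromKM(17, 72);
    }
}
```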
## The Extractors
The question of how to efficiently represent the internal state of a Bloom filter externally led to the development of two “extractors”: one that logically represents an ordered collection of BitMap longs, and one that represents a collection of enabled-bit indices. In both cases the extractor feeds each value to a function - called a predicate - that does some operation with the value and returns true to continue processing, or false to stop processing.
The extractors allow different internal storage implementations, as long as the implementation can produce one or both of the standard extractors. Each extractor has a static method to convert from the other type, so the internal storage only needs to implement one.
### BitMapExtractor
The BitMap extractor is an interface that defines objects that can produce BitMap vectors. BitMap vectors are vectors of long values where the bits are enabled as per the `BitMap` class. All Bloom filters implement this interface. The `BitMapExtractor` has static methods to create an extractor from an array of long values, as well as from an `IndexExtractor`. The interface has default implementations that convert the extractor into an array of long values, and one that executes a `LongPredicate` for each BitMap.
The method `processBitMaps(LongPredicate)` is the standard access method for the extractor. Each long value in the BitMap is passed to the predicate in order. Processing continues until the last BitMap is processed or the `LongPredicate` returns false.
### IndexExtractor
The index extractor produces the index values for the enabled bits in a Bloom filter. All Bloom filters implement this interface. The `IndexExtractor` has static methods to create the extractor from an array of integers or a `BitMapExtractor`. The interface also has default implementations to convert the extractor to an array of integers and one that executes an `IntPredicate` on each index.
The method `processIndices(IntPredicate)` is the standard access method for the extractor. Each int value in the extractor is passed to the predicate. Processing continues until the last int is processed or the `IntPredicate` returns false. The order and uniqueness of the values is not guaranteed.
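A short example of both access methods; `IndexExtractor.fromBitMapExtractor` is the static conversion method described above (name assumed from that description).
```java
import org.apache.commons.collections4.bloomfilter.BitMapExtractor;
import org.apache.commons.collections4.bloomfilter.IndexExtractor;
public class ExtractorExample {
    public static void main(String[] args) {
        // one 64-bit BitMap with bits 0, 5 and 6 enabled
        BitMapExtractor bitMaps = BitMapExtractor.fromBitMapArray(0b1100001L);
        // print each BitMap; returning true continues processing
        bitMaps.processBitMaps(bitMap -> {
            System.out.println(Long.toBinaryString(bitMap));
            return true;
        });
        // view the same bits as indices: prints 0, 5 and 6
        IndexExtractor indices = IndexExtractor.fromBitMapExtractor(bitMaps);
        indices.processIndices(index -> {
            System.out.println(index);
            return true;
        });
    }
}
```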
## The Hashers
### Hasher
The `Hasher` interface defines a method that accepts a `Shape` and returns an `IndexExtractor`. Calling `hasher.indices(shape)` will yield an `IndexExtractor` that will produce the `shape.numberOfHashFunctions()` integers.
The `Hasher` provides a clean separation between new data being hashed and the Bloom filter. It also provides a mechanism to add the same item to multiple Bloom filters even if they have different shapes. This can be an important consideration when using Bloom filters in multiple systems with different requirements.
Any current hashing process used by an existing Bloom filter implementation can be duplicated by a hasher. The hasher only needs to create an `IndexExtractor` for an arbitrary shape. In fact, the testing code contains `Hasher` implementations that produce sequential values.
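In that spirit, here is a toy `Hasher` that produces sequential values; `IndexExtractor.fromIndexArray` and `Shape.getNumberOfHashFunctions` are assumed names based on the descriptions in this post.
```java
import org.apache.commons.collections4.bloomfilter.Hasher;
import org.apache.commons.collections4.bloomfilter.IndexExtractor;
import org.apache.commons.collections4.bloomfilter.Shape;
/** A toy Hasher that enables indices 0 .. k-1; useful only for testing. */
class SequentialHasher implements Hasher {
    @Override
    public IndexExtractor indices(Shape shape) {
        int[] result = new int[shape.getNumberOfHashFunctions()];
        for (int i = 0; i < result.length; i++) {
            result[i] = i; // deterministic, sequential "hash" values
        }
        return IndexExtractor.fromIndexArray(result);
    }
}
```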
### EnhancedDoubleHasher
In the previous post I described a hashing technique where the hash value is split into two values. One is used for the initial value, the other is used to increment the values as additional hashes are required. The Commons Collections implementation adds code to ensure that the increment is changed by a tetrahedral value on every iteration. This change addresses issues with the incrementer initially being zero or very small.
The constructor accepts either two longs or a byte array from a hashing function. Users are expected to select a hashing algorithm to create an initial hash for the item. The `EnhancedDoubleHasher` is then constructed from that hash value. In many cases the Murmur3 hash, available in the Apache Commons Codec® library, is sufficient and very fast. The result is a `Hasher` that represents the hashed item. This hasher can now be passed to any Bloom filter and the `Shape` of the filter will determine the number of hashes created.
## The Bloom Filters
### Bloom Filter
Now we come to the Bloom filter. As noted above, the `BloomFilter` interface extends both the `IndexExtractor` and the `BitMapExtractor`. The two extractors are the external representations of the internal representation of the Bloom filter. Bloom filter implementations are free to store bit values in any way deemed fit. The requirements for bit value storage are:
* Must be able to produce an IndexExtractor.
* Must be able to produce a BitMapExtractor.
* Must be able to clear the values, resetting the cardinality to zero.
* Must specify whether the filter prefers (is faster producing) the IndexExtractor (sparse characteristic) or the BitMapExtractor (non-sparse characteristic).
* Must be able to merge hashers, Bloom filters, index extractors, and BitMap extractors. When handling extractors it is often the case that an implementation will convert one type of extractor to the other for merging. The BloomFilter interface has default implementations for Bloom filter and hasher merging.
* Must be able to determine if the filter contains hashers, Bloom filters, index extractors, and BitMap extractors. The BloomFilter interface has default implementations for BitMap extractor, Bloom filter and hasher checking.
* Must be able to make a deep copy of itself.
* Must be able to produce its `Shape`.
Several implementations of Bloom filter are provided. We will start by focusing on the `SimpleBloomFilter` and the `SparseBloomFilter`.
### SimpleBloomFilter
The `SimpleBloomFilter` implements its storage as an on-heap array of longs. This implementation is suitable for many applications. It can also serve as a good model for how to implement `BitMap` storage in specific off-heap situations.
### SparseBloomFilter
The `SparseBloomFilter` implements its storage as a `TreeSet` of `Integer`s. Since each bit is randomly selected across the entire [0, m) range, the sparse Bloom filter only makes sense when \\( k \times n \lt \frac{2 \times m}{64} \\). The reasoning is that every BitMap long in the array is equivalent to two integers in the sparse filter, while the number of integers in the sparse filter is the number of hash functions times the number of items; this is an overestimate because of the hash collisions that occur across the number of bits.
### Other
The helper classes included in the package make it easy to implement new Bloom filters. For example, it should be fairly simple to implement a Bloom filter on a `ByteBuffer` or `LongBuffer`, or one that uses a compressed bit vector.
## Common Usage Patterns
### Populating a Bloom filter
In most cases a shape is determined first; in this example it is determined by the number of items expected and the acceptable false positive rate. Then the items are added to the filter.
```java
import java.io.Serializable;
import org.apache.commons.codec.digest.MurmurHash3;
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.collections4.bloomfilter.Shape;
import org.apache.commons.collections4.bloomfilter.SimpleBloomFilter;
import org.apache.commons.lang3.SerializationUtils;
public class PopulatingExample {
/**
* Populates a Bloom filter with the items. Uses a shape that
* expects 10K items max and a false positive rate of 0.01.
* @param items The items to insert.
* @return the Bloom filter populated with the items.
*/
public static BloomFilter populate(Object[] items) {
return populate(Shape.fromNP(10000, 0.01), items);
}
/**
* Populates a Bloom filter with the items.
* @param shape The shape of the Bloom filter.
* @param items The items to insert.
* @return the Bloom filter populated with the items.
*/
public static BloomFilter populate(Shape shape, Object[] items) {
BloomFilter collection = new SimpleBloomFilter(shape);
for (Object o : items) {
// this example serializes the entire object, actual implementation
// may want to serialize specific properties into the hash.
byte[] bytes = SerializationUtils.serialize((Serializable) o);
long[] hash = MurmurHash3.hash128(bytes);
collection.merge(new EnhancedDoubleHasher(hash[0], hash[1]));
}
// collection now contains all the items from the list of items
return collection;
}
}
```
### Searching a Bloom filter
When searching a single Bloom filter, it makes sense to use the `Hasher` to search in order to reduce the overhead of creating a Bloom filter for the match.
```java
import java.io.Serializable;
import org.apache.commons.codec.digest.MurmurHash3;
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.lang3.SerializationUtils;
public class SearchingExample {
private BloomFilter collection;
/**
* Creates the example from a populated collection.
* @param collection the collection to search.
*/
public SearchingExample(BloomFilter collection) {
this.collection = collection;
}
/**
* Perform the search.
* @param item the item to look for.
* @return true if the item is found, false otherwise.
*/
public boolean search(Object item) {
// create the hasher to look for.
byte[] bytes = SerializationUtils.serialize((Serializable) item);
long[] hash = MurmurHash3.hash128(bytes);
return collection.contains(new EnhancedDoubleHasher(hash[0], hash[1]));
}
}
```
However, if multiple filters are being checked, and they are all the same shape, then creating a Bloom filter from the hasher and using that is more efficient.
```java
import java.io.Serializable;
import org.apache.commons.codec.digest.MurmurHash3;
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.collections4.bloomfilter.Shape;
import org.apache.commons.collections4.bloomfilter.SimpleBloomFilter;
import org.apache.commons.lang3.SerializationUtils;
public abstract class SearchingExample2 {
private BloomFilter[] collections;
/**
* Create example from an array of populated Bloom filters that all have the same Shape.
* @param collections The Bloom filters to search.
*/
public SearchingExample2(BloomFilter[] collections) {
this.collections = collections;
}
/**
* Search the filters for matching items.
* @param item the item to search for.
*/
public void search(Object item) {
// create the Bloom filter to search for.
byte[] bytes = SerializationUtils.serialize((Serializable) item);
long[] hash = MurmurHash3.hash128(bytes);
BloomFilter target = new SimpleBloomFilter(collections[0].getShape());
target.merge(new EnhancedDoubleHasher(hash[0], hash[1]));
for (BloomFilter candidate : collections) {
if (candidate.contains(target)) {
doSomethingInteresting(candidate, item);
}
}
}
/**
* The interesting thing to do when a match occurs.
* @param collection the Bloom filter that matched.
* @param item The item that was being searched for.
*/
public abstract void doSomethingInteresting(BloomFilter collection, Object item);
}
```
If multiple filters of different shapes are being checked then use the `Hasher` to perform the check. It will be more efficient than building a Bloom filter for each `Shape`.
## Statistics
The statistics that I mentioned in the previous post are implemented in the `SetOperations` class. The methods in this class accept `BitMapExtractor` arguments, so they can be called with `BloomFilter` implementations, with `BitMapExtractor`s generated from arrays of longs or from `IndexExtractor`s, or with any other valid `BitMapExtractor` instance.
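For example, the hedged snippet below compares two filters of the same `Shape`; the method names follow the similarity and distance statistics discussed in the previous post.
```java
import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.SetOperations;
public class StatisticsExample {
    /** Prints similarity statistics for two filters that share a Shape. */
    public static void printStatistics(BloomFilter left, BloomFilter right) {
        System.out.println("hamming distance: " + SetOperations.hammingDistance(left, right));
        System.out.println("cosine similarity: " + SetOperations.cosineSimilarity(left, right));
        System.out.println("jaccard similarity: " + SetOperations.jaccardSimilarity(left, right));
    }
}
```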
## Review
In this post we investigated the Apache Commons Collections implementation of Bloom filters and how to use them. We touched on how they can be used to implement new designs of the standard components. In the next post we will cover some of the more unusual usages and introduce some unusual implementations.

View File

@@ -0,0 +1,197 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Bloom Filters Part 1: An Introduction
Bloom filters are the magical elixir often used to reduce search space and time. They have other interesting properties that make them applicable in many situations where knowledge of the approximate size of a set, union, or intersection is important, where searching vast datasets for small matching patterns is desired, or even where it is desirable to search for data without disclosing the data being searched for, or the actual data found, to third parties.
In this series of blog posts we'll do a deep dive into Bloom filters. In this, the first post, we will touch on the mathematics behind the filters, work through an example of their use, and explore some of their properties. In later posts we will explore the Apache Commons Collections® implementation that is due out in version 4.5 of that library, discuss using Bloom filters for data sharding, and explore some of the unusual Bloom filters, like counting Bloom filters, stable Bloom filters, and layered Bloom filters, before diving into multidimensional Bloom filters and encrypted data indexing.
Bloom filters are probably used on websites and applications you use every day. They are used to track articles you've read, speed up bitcoin clients, detect malicious web sites, and improve the performance of caches. We will get back to these later.
## What?
But let us start at the beginning. Bloom filters are a probabilistic data structure frequently used to represent sets of objects. They were invented by Burton Bloom<span><a class="footnote-ref" href="#fn1">1</a></span> in 1970. A Bloom filter is an array of bits (bit vector) into which a set of values has been hashed, so some of the bit values are on (value one), or "enabled", and others are off (value zero), or "disabled".
Multiple Bloom filters can be merged together by creating a union of the two bit vectors (V1 or V2). The resulting Bloom filter (B) is said to contain both items.
The equation for combining can be expressed as: \\( B = V1 \cup V2 \\)
When searching Bloom filters we generate a Bloom filter for the item we are looking for (the target, T), and then calculate the intersection of T with the bit vector of the Bloom filter that may contain the value (the candidate, C). If the result is equal to T, then a match has been made. This calculation can yield false positives but never false negatives.
The equation for matching can be expressed as: \\( T \cap C = T \\)
There are several properties that define a Bloom filter: the number of bits in the vector (`m`), the number of items that will be merged into the filter (`n`), the number of hash functions for each item (`k`), and the probability of false positives (`p`). All of these values are mathematically related. Mitzenmacher and Upfal<span><a class="footnote-ref" href="#fn2">2</a></span> have shown that the relationship between these properties is
\\[ p = \left( 1 - e^{-kn/m} \right) ^k \\]
However, it has been reported that the false positive rate in real deployments is higher than the value given by this equation.<span><a class="footnote-ref" href="#fn3">3</a></span><span><a class="footnote-ref" href="#fn4">4</a></span> It has since been proven that the equation offers a lower bound on the false positive rate, and a more accurate estimate has been derived.<span><a class="footnote-ref" href="#fn5">5</a></span>
The net result is that we can describe a Bloom filter with a “shape” and that shape can be derived from combinations of the properties, for example from `(p, m, k)`, `(n, p)`, `(k, m)`, `(n, m)`, or `(n, m, k)`.
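As a worked check of the formula, take the shape used in the example later in this post (`n = 3`, `m = 11`, `k = 3`):
\\[ p = \left( 1 - e^{-3 \times 3 / 11} \right) ^3 \approx \left( 1 - 0.441 \right) ^3 \approx 0.175 \\]
which is just under the 1-in-5 false positive rate that the example targets.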
To compare Bloom filters, they must have the same shape and use the same hashing functions.
## Why?
Bloom filters are often used to reduce the search space. For example, consider an application looking for a file which may occur on one of many systems. Without a Bloom filter, each system must be queried for the existence of the file. Generally this is a lengthy process. However, if a Bloom filter is created for each system, then the query could first check the filter. If the filter indicates the file might be on the system, then the expensive lookup check is performed. Because the Bloom filter never returns a false negative, this strategy reduces the search space to only those systems that may contain the file.
Examples of large Bloom filter collections can be found in bioinformatics<span><a class="footnote-ref" href="#fn6">6</a></span><span><a class="footnote-ref" href="#fn7">7</a></span><span><a class="footnote-ref" href="#fn8">8</a></span> where Bloom filters are used to represent gene sequences, and Bloom filter based databases where records are encoded into Bloom filters.<span><a class="footnote-ref" href="#fn9">9</a></span><span><a class="footnote-ref" href="#fn10">10</a></span>
Medium, the digital publishing company, uses Bloom filters to track what articles have been read.<span><a class="footnote-ref" href="#fn11">11</a></span> Bitcoin uses them to speed up clients.<span><a class="footnote-ref" href="#fn12">12</a></span> They have been used to improve caching performance<span><a class="footnote-ref" href="#fn13">13</a></span> and in detecting malicious websites.<span><a class="footnote-ref" href="#fn14">14</a></span>
## How?
So, let's work through an example. Let's assume we want to put 3 items in a filter `(n = 3)` with a 1/5 probability of collision `(p = 1/5 = 0.2)`. Solving \\( p = \left( 1 - e^{-kn/m} \right) ^k \\) yields `m=11` and `k=3`. Thomas Hurst has provided an online calculator<span><a class="footnote-ref" href="#fn15">15</a></span> where you can explore the interactions between the values.
Now that we know the shape of our Bloom filters, let's populate one. In this example we will be using a CRC hash; this is not recommended and is only used here for ease of example. Also, we will be using a naive combinatorial hashing technique that should not be used in real life.
We start by taking the CRC hash for "CAT", which is `FD2615C4`. The naive combinatorial hashing technique splits the calculated hash into two unsigned values. In this case the CRC value can be interpreted as two unsigned 16-bit values:
| Value | Use |
|--------------|----------|
| FD26 = 64806 | initial |
| 15C4 = 5572 | increment |
We need to generate 3 hash values (`k`) using the naive combinatorial hashing algorithm. The first hash value is the initial value. The second hash value is the initial value plus the increment. The third hash value is the 2nd hash value plus the increment. This proceeds until the proper number of hash values have been generated. After we generate the `k` values, take the modulus (remainder after division) of those numbers by the number of bits in the vector (`m`).
| Name | Calculation | Value | Value mod(11) |
|------|-------------|-------|---------------|
| k1 | 64806 | 64806 | 5 |
| k2 | 64806 + 5572 | 70378 | 0 |
| k3 | 64806 + 5572 + 5572 | 75950 | 6 |
This yields a Bloom filter of `00001100001` or \\(\\{0,5,6\\}\\). In the binary form the bits are written as in a big-endian unsigned integer, where bit n has the value \\(2^n\\), so bit 0 is the rightmost digit. Performing the same operations on "DOG" and "GUINEA PIG" yields:
| Name | CRC | Bit Set | Bloom filter |
|------|-----|----------------------|--------------|
| CAT | FD26 15C4 | \\(\\{0,5,6\\}\\) | 00001100001 |
| DOG | 3560 D2EF | \\(\\{2\\}\\) | 00000000100 |
| GUINEA PIG | E58C A739 | \\(\\{2,7,10\\}\\) | 10010000100 |
| Collection | | \\(\\{0,2,5,6,7,10\\}\\) | 10011100101 |
The collection is the union of the other three values. This represents the set of animals.
The interesting one in this collection is DOG. When we execute the naive combinatorial hashing algorithm on the DOG hash, every hash value is 2. We'll come back to this in a moment.
If we perform the same calculations on "HORSE", we get \\(\\{2,5,9\\}\\). Now to see if HORSE is in our collection we solve \\(\\{2,5,9\\} \cap \\{0,2,5,6,7,10\\} = \\{2,5\\} \ne \\{2,5,9\\}\\). So HORSE is not in the collection.
If we only put CAT and GUINEA PIG into the collection, we get the same result for the collection. But when testing for DOG we get the true statement \\(\\{2\\} \cap \\{0,2,5,6,7,10\\} = \\{2\\}\\). The filter says that DOG is in the collection. This is an example of a false positive result.
DOG also shows the weakness of the naive combinatorial hashing technique. A proper implementation can be found in the `EnhancedDoubleHasher` class in the Apache Commons Collections version 4.5 library. In this case, a tetrahedral number is added to the increment to reduce the probability of a single bit being selected over the course of the hash.
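For illustration only, here is a sketch of the naive technique used in the examples above; as noted, real code should use `EnhancedDoubleHasher` instead.
```java
public class NaiveHash {
    /**
     * The naive combinatorial hashing technique: k bit indices produced
     * from an initial value and a fixed increment, modulo m bits.
     */
    static int[] naiveHash(long initial, long increment, int k, int m) {
        int[] bits = new int[k];
        long accumulator = initial;
        for (int i = 0; i < k; i++) {
            bits[i] = (int) (accumulator % m); // bit index for hash i
            accumulator += increment;          // next hash adds the fixed increment
        }
        return bits;
    }

    public static void main(String[] args) {
        // the CAT example above: initial=64806, increment=5572, k=3, m=11
        int[] bits = naiveHash(64806, 5572, 3, 11); // {5, 0, 6}
        for (int bit : bits) {
            System.out.println(bit);
        }
    }
}
```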
## Statistics?
Bloom filters lend themselves to several statistics. The most common is the Hamming value or cardinality. This is simply the number of bits that are enabled in the vector. From this value a number of statistics can be calculated.
The Hamming distance is the number of bits that have to be flipped to convert one Bloom filter to another. For example, to convert CAT \\(\\{0,5,6\\}\\) to HORSE \\(\\{2,5,9\\}\\), we have to turn off bits \\(\\{0,6\\}\\) and turn on bits \\(\\{2,9\\}\\), so the Hamming distance is 4. Bloom filters with lower Hamming distances are in some sense similar.
Another measure of similarity is the cosine similarity, also known as the Orchini similarity, Tucker coefficient of congruence, or Ochiai similarity. To calculate it, the cardinality of the intersection (bitwise AND) of the two filters is divided by the square root of the cardinality of the first filter times the cardinality of the second filter. The result is a number in the range \\([0,1]\\).
The cosine distance is calculated as `1 - cosine similarity`.
The final measure of similarity that we will cover is the Jaccard similarity, also known as the Jaccard index, intersection over union, and Jaccard similarity coefficient. To calculate the Jaccard index, the cardinality of the intersection (bitwise AND) of the two Bloom filters is divided by the cardinality of the union (bitwise OR) of the two Bloom filters.
The Jaccard distance, like the cosine distance, is calculated as `1 - Jaccard similarity`.
The similarity and distance statistics can be used to group similar Bloom filters together, for example when distributing files across a system that uses Bloom filters to determine where a file might be located. In this case it might make sense to store each new Bloom filter in the collection to which it has the minimal distance.
In addition to basic similarity and difference, if the shape of the filter is known some information about the data behind the filters can be estimated. For example the number of items merged into a filter (n) can be estimated provided we have the cardinality (`c`), number of bits in the vector (`m`) and the number of hash functions (`k`) used when adding each element.
\\[ n = \frac{-m \ln(1 - c/m)}{k} \\]
Estimating the size of the union of two filters is simply a matter of calculating n for the union (bitwise OR) of the two filters.
Estimating the size of the intersection of two filters is the estimated n of the first, plus the estimated n of the second, minus the estimated n of the union of the two. There are some tricky edge conditions, such as when one or both of the estimates of n is infinite.
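As a worked check against the example above: the collection filter \\(\\{0,2,5,6,7,10\\}\\) has cardinality `c = 6` with `m = 11` and `k = 3`, so
\\[ n = \frac{-11 \ln(1 - 6/11)}{3} \approx \frac{8.67}{3} \approx 2.9 \\]
which is close to the three items actually merged.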
## Usage Errors
There are several places that errors can creep into Bloom filter usage.
### Saturation errors
Saturation errors arise from underestimating the number of items that will be placed into the filter. Let's define “saturation” as the number of items merged into a filter divided by the number of items specified in the Shape. Then once `n` items have been inserted the filter is at a saturation of 1. As the saturation increases the false positive rate increases. Using the calculation for the false positive rate noted above, we can calculate the expected false positive rate for the various saturations. For an interactive version see Thomas Hurst's online calculator.
For a Bloom filter defined with k=17 and n=3 the calculation yields m=72 and p=0.00001. As the saturation increases, the false positive rate increases as per the following table:
| Saturation | Probability of false positive |
|------------|-------------------------------|
| 1 | 0.000010 |
| 2 | 0.008898 |
| 3 | 0.115070 |
| 4 | 0.356832 |
| 5 | 0.606726 |
The table shows that the probability of false positives is two orders of magnitude larger when the saturation reaches two. After five times the estimated number of items have been added the false positive rate is so high as to make the filter useless.
### Hashing errors
A second focus of errors is the generation of the hashes.
If a combinatorial hashing algorithm is used and the number of bits is significantly higher than the range of the hash values, then the generated values will be weighted toward the lower bits. For example, if byte values were used for the initial and increment values but the number of bits was in excess of 255, then the higher valued bits could not be selected on the first hash, and in all cases bits far above 255 would rarely be selected.
## Review
So far we have covered what Bloom filters are, why we use them, how to construct and use them, explored a few statistics, and looked at potential problems arising from usage errors. In the next post we will see how the Apache Commons Collections code implements Bloom filters and look at how to implement the common usage patterns using the library.
## Footnotes
<span>
<ol class="footnotes">
<li><a id='fn1'></a>
Burton H. Bloom. 1970. “Space/Time Trade-offs in Hash Coding with Allowable Errors". Commun. ACM, 13, 7 (July 1970), 422–426
</li>
<li><a id='fn2'></a>
Michael Mitzenmacher and Eli Upfal. 2005. Probability and computing: Randomized algorithms and probabilistic analysis. Cambridge University Press, Cambridge, Cambridgeshire, UK. 109–111, 308 pages.
</li>
<li><a id='fn3'></a>
L. L. Gremillion, “Designing a Bloom filter for differential file access,” Communications of the ACM, vol. 25, no. 7, pp. 600-604, 1982.
</li>
<li><a id='fn4'></a>
J. K. Mullin, “A second look at Bloom filters,” Communications of the ACM, vol. 26, no. 8, 1983
</li>
<li><a id='fn5'></a>
P. Bose, H. Guo, E. Kranakis, “On the false-positive rate of Bloom filters,” Information Processing Letters, vol. 108, no. 4, pp. 210-213, 2008.
</li>
<li><a id='fn6'></a>
Henrik Stranneheim, Max Käller, Tobias Allander, Björn Andersson, Lars Arvestad, and Joakim Lundeberg. 2015. Classification of DNA sequences using Bloom filters. Bioinformatics 26, 13 (July 2015), 1595–1600. <a href="https://doi.org/10.1093/bioinformatics/btq230">https://doi.org/10.1093/bioinformatics/btq230</a>
</li>
<li><a id='fn7'></a>
Justin Chu, Sara Sadeghi, Anthony Raymond, Shaun D. Jackman, Ka Ming Nip, Richard Mar, Hamid Mohamadi, Yaron S. Butterfield, A. Gordon Robertson, and Inanç Birol. 2014. BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters. Bioinformatics 30, 23 (Dec. 2014), 302–304. <a href="https://doi.org/10.1093/bioinformatics/btu558">https://doi.org/10.1093/bioinformatics/btu558</a>
</li>
<li><a id='fn8'></a>
Brad Solomon and Carl Kingsford. 2016. Fast Search of Thousands of Short-Read Sequencing Experiments. Nature Biotechnology 34, 3 (March 2016), 300–302. <a href="https://doi.org/10.1038/nbt.3442">https://doi.org/10.1038/nbt.3442</a>
</li>
<li><a id='fn9'></a>
Steven M Bellovin and William R Cheswick. 2004. "Privacy-Enhanced Searches Using Encrypted Bloom Filters". <a href="https://mice.cs.columbia.edu/getTechreport.php?techreportID=483">https://mice.cs.columbia.edu/getTechreport.php?techreportID=483</a>
</li>
<li><a id='fn10'></a>
Arisa Tajima, Hiroki Sato, and Hayato Yamana. 2018. Privacy-Preserving Join Processing over outsourced private datasets with Fully Homomorphic Encryption and Bloom Filters. <a href="https://db-event.jpn.org/deim2018/data/papers/201.pdf">https://db-event.jpn.org/deim2018/data/papers/201.pdf</a>
</li>
<li><a id='fn11'></a>
Jamie Talbot, Jul 15, 2015, What are Bloom filters? : A tale of code, dinner, and a favour with unexpected consequences., The Medium Blog, <a href="https://blog.medium.com/what-are-bloom-filters-1ec2a50c68ff">https://blog.medium.com/what-are-bloom-filters-1ec2a50c68ff</a>
</li>
<li><a id='fn12'></a>
BitcoinDeveloper, Documentation, https://developer.bitcoin.org/search.html?q=bloom+filter
</li>
<li><a id='fn13'></a>
<a href="https://en.wikipedia.org/wiki/Bloom_filter#Cache_filtering">https://en.wikipedia.org/wiki/Bloom_filter#Cache_filtering</a>
</li>
<li><a id='fn14'></a>
K. Nandhini and R. Balasubramaniam, "Malicious Website Detection Using Probabilistic Data Structure Bloom Filter," 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 2019, pp. 311-316, doi: 10.1109/ICCMC.2019.8819818.
</li>
<li><a id='fn15'></a>
Thomas Hurst, Bloom Filter Calculator. <a href="https://hur.st/bloomfilter/">https://hur.st/bloomfilter/</a>
</li>
</ol>
</span>

View File

@@ -0,0 +1,110 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Bloom Filters Part 4: Bloom filters for indexing
In many cases Bloom filters are used as gatekeepers; that is, they are queried before attempting a longer operation to see if the longer operation should be executed. However, there is another type of Bloom filter: the reference type. The reference type contains the hashed values for the properties of a single object. For example a Person object might have the fields name, date of birth, address, and phone number. Each of those could be hashed and combined into a single Bloom filter. That Bloom filter could then be said to be a reference to the person. Reference Bloom filters tend to be fully or nearly fully saturated.
We can use reference Bloom filters to index data by storing the Bloom filter along with a record identifier that can be used to retrieve the data. The simplest solution is to create a list of reference Bloom filters and their associated record identifiers and then perform a linear search for matches. For every Bloom filter that matches the search, return the associated record identifier.
Searching can be performed by creating a target Bloom filter with partial data, for example name and date of birth from the person example, and then searching through the list as described above. The associated records either have the name and birthdate or are false positives and need to be filtered out during retrieval.
## Multidimensional Bloom filters
The description above is a multidimensional Bloom filter. A multidimensional Bloom filter is simply a collection of searchable filters, the simplest implementation being a list. In fact, for fewer than 10K filters the list is the fastest possible solution. There are two basic reasons for this:
* Bloom filter comparisons are extremely fast, taking approximately five machine instructions for the simple comparison.
* Bloom filters do not have an obvious natural order that can be used to reduce the search space without incurring significant overhead. The amount of overhead often overwhelms the advantage of the index.
There are, however, several multidimensional Bloom filter algorithms, among them: Bloofi, Flat Bloofi, BF-Trie, Hamming Skip List, Sharded List, and Natural Bloofi.
### Bloofi
Bloofi is a technique that uses a B+-tree structure where the inner nodes are merges of the Bloom filters below and the leaf nodes contain the actual Bloom filters.<span><a class="footnote-ref" href="#fn1">1</a></span> This technique works well for Bloom filters that are not densely populated (i.e. low saturation) and are designed with a very small false positive rate.
Bloofi is extremely fast when searching; however, inserting often requires updates to multiple inner nodes. Bloofi supports deletion, but deletion can also generate updates to multiple inner nodes.
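The following is a sketch of the search descent only, operating on raw long[] bit maps rather than the Commons API; construction and rebalancing of the B+-tree, which is where Bloofi's insert and delete costs arise, are omitted.

```java
import java.util.ArrayList;
import java.util.List;

// A node holds either an indexed filter (leaf) or the bitwise OR of its
// children's filters (inner node).
final class BloofiNode {
    final long[] filter;
    final List<BloofiNode> children = new ArrayList<>();

    BloofiNode(long[] filter) {
        this.filter = filter;
    }

    /** Collects every leaf filter that contains all of the target's bits. */
    void search(long[] target, List<long[]> results) {
        for (int i = 0; i < target.length; i++) {
            if ((filter[i] & target[i]) != target[i]) {
                return;   // merged filter misses a target bit: no leaf below can match
            }
        }
        if (children.isEmpty()) {
            results.add(filter);          // leaf: an actual indexed filter
        } else {
            for (BloofiNode child : children) {
                child.search(target, results);
            }
        }
    }
}
```

Because an inner node is the OR of everything below it, a failed containment test prunes the entire subtree, which is what makes the search fast on sparse filters.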
### Flat Bloofi
Flat Bloofi expands each bit position into a bit vector representing which Bloom filters in the index have that bit enabled. Conceptually this is a bit matrix, with the columns being the bit positions in the Bloom filter and the rows being the bit maps of the indexed Bloom filters. During insert, the Bloom filter being inserted is given an index number (row), and that index number's bit is enabled in the bit vector of every enabled bit in the filter (column). During a search, all of the bit vectors associated with the enabled bits in the target Bloom filter (columns) are ANDed together; the enabled bits in the resulting bit vector are the index numbers of the matching Bloom filters. Implementations typically utilize the internal bit structure of native data types to compactly represent the matrix.
This solution is consistently among the fastest presented here, often the fastest or second fastest. It supports deletion through the addition of a list of deleted rows and reuse of space in the vectors.
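A minimal sketch of the bit matrix follows. It assumes the caller supplies each filter as its array of enabled bit indices and a fixed capacity; a production version would also reuse deleted rows as described above.

```java
import java.util.Arrays;

// Flat Bloofi sketch: one long[] per Bloom filter bit position; bit r of
// column c is set when indexed filter r has bit c enabled.
public class FlatBloofi {
    private final long[][] columns;   // [bit position][packed filter rows]
    private int nextRow;

    public FlatBloofi(int numberOfBits, int capacity) {
        columns = new long[numberOfBits][(capacity + 63) / 64];
    }

    /** Inserts a filter expressed as its enabled bit indices; returns its row. */
    public int add(int[] enabledBits) {
        int row = nextRow++;
        for (int bit : enabledBits) {
            columns[bit][row / 64] |= 1L << (row % 64);
        }
        return row;
    }

    /** Returns a packed bit vector of the rows whose filters contain the target. */
    public long[] search(int[] targetBits) {
        long[] result = new long[columns[0].length];
        Arrays.fill(result, -1L);                 // start with "every row matches"
        for (int bit : targetBits) {
            for (int i = 0; i < result.length; i++) {
                result[i] &= columns[bit][i];     // keep rows that also have this bit
            }
        }
        for (int row = nextRow; row < result.length * 64; row++) {
            result[row / 64] &= ~(1L << (row % 64));   // clear unassigned rows
        }
        return result;
    }
}
```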
### BF-Trie
BF-Trie creates a trie based on the byte values in the filters. It has the same insert, delete, and update characteristics you would expect from a trie structure.
During searching an expansion factor has to be taken into account: every zero bit in the target filter doubles the number of byte patterns that can match, because that position may be either 0 or 1 in an indexed filter. For example, the byte 0xFA has 4 potential matches (see table below) that have to be included in the search. This is handled by exploring multiple paths through the trie while finding the solution.
| code | matching pattern |
| ---- | ---------------- |
| 0xFA | 1111 1010        |
| 0xFB | 1111 1011        |
| 0xFE | 1111 1110        |
| 0xFF | 1111 1111        |
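The expansion can be enumerated with the standard subset-iteration trick over the target byte's zero mask, as this sketch shows; applied to 0xFA it yields exactly the four patterns in the table.

```java
import java.util.ArrayList;
import java.util.List;

public class ByteExpansion {
    /**
     * Enumerates every byte value that still contains all the enabled bits of
     * the target byte; each zero bit in the target may be either 0 or 1.
     */
    static List<Integer> matchingBytes(int target) {
        int zeros = ~target & 0xFF;              // wildcard bit positions
        List<Integer> matches = new ArrayList<>();
        int subset = 0;
        do {
            matches.add(target | subset);
            subset = (subset - zeros) & zeros;   // next subset of the zero mask
        } while (subset != 0);
        return matches;
    }

    public static void main(String[] args) {
        // prints [fa, fb, fe, ff]: the four patterns from the table above
        List<String> hex = new ArrayList<>();
        for (int b : matchingBytes(0xFA)) {
            hex.add(Integer.toHexString(b));
        }
        System.out.println(hex);
    }
}
```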
### Hamming Skip List
Conceptually the Hamming skip list is an implementation of a two-segment key. The index arises from the observation that no target can match a filter with a lower Hamming value. In addition, if the Bloom filter bit vector is interpreted as a very large unsigned integer, no target can match a filter of a lower value. Since we have a binary representation of that very large unsigned integer, we can calculate \\( \log_2 \\) of the value. We then construct a skip list keyed on the Hamming value at the first level and the \\( \log_2 \\) value at the second. As an alternative, an index in a standard relational database can be used. During a search, every filter whose Hamming value is greater than or equal to the target's Hamming value and whose \\( \log_2 \\) value is greater than or equal to the target's \\( \log_2 \\) value is returned as a candidate.
This solution suffers from the clustering of Hamming values. Most Hamming values cluster around the number of merged values times the number of hash functions \\( (n \times k) \\), adjusted for the expected collision rate, so the Hamming value only provides a strong selector when the target's Hamming value is close to the saturation of the indexed filters. The \\( \log_2 \\) key is fairly evenly distributed in the upper range and only provides a strong selector when the Hamming value is low but the upper bits are enabled.
The Hamming skip list is a good, simple implementation for architectures built on relational databases or other environments where multi-segment numerical indexes are available.
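A minimal sketch of the two-segment key follows, computing the Hamming value and the integer part of \\( \log_2 \\) (the index of the highest enabled bit) directly from a filter's long[] bit map.

```java
// Two-segment key for the Hamming skip list, computed from a filter's
// long[] bit map (word 0 holds bits 0-63, the last word is most significant).
final class HammingKey {
    final int hamming;   // number of enabled bits
    final int log2;      // index of the highest enabled bit, -1 if empty

    HammingKey(long[] bitMap) {
        int count = 0;
        int highest = -1;
        for (int i = 0; i < bitMap.length; i++) {
            count += Long.bitCount(bitMap[i]);
            if (bitMap[i] != 0) {
                highest = i * 64 + 63 - Long.numberOfLeadingZeros(bitMap[i]);
            }
        }
        this.hamming = count;
        this.log2 = highest;
    }

    /** A stored filter can contain the target only if both segments dominate. */
    boolean mayContain(HammingKey target) {
        return hamming >= target.hamming && log2 >= target.log2;
    }
}
```

In a relational database the two fields would simply become a compound index, with mayContain expressed as the corresponding pair of >= predicates in the WHERE clause.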
### Sharded List
A sharded list is a collection of lists of Bloom filters and builds upon the sharding solution presented in part 3 of this series. In this instance, as a Bloom filter is added to the index, the filter is hashed and a Bloom filter is created from that hash. The Bloom filter's Bloom filter is then used to determine which list to add the filter to. When a list reaches capacity (as defined by the Shape of the Bloom filter's Bloom filter), it is removed from consideration for further inserts and a new empty list is created and added to the collection for insert consideration.
This solution retains the speed of the standard list solution while providing a mechanism to handle more than 10K filters at a time.
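The sketch below simplifies shard selection to a size check: a capacity constant stands in for the Shape of the Bloom filter's Bloom filter described above, and the routing hash is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Sharded list sketch. A shard that reaches capacity is frozen (removed from
// insert consideration) and a fresh shard is opened for subsequent inserts.
public class ShardedList {
    private final int shardCapacity;
    private final List<List<long[]>> closed = new ArrayList<>();
    private List<long[]> open = new ArrayList<>();

    public ShardedList(int shardCapacity) {
        this.shardCapacity = shardCapacity;
    }

    public void add(long[] filterBitMap) {
        if (open.size() >= shardCapacity) {
            closed.add(open);                // freeze the full shard
            open = new ArrayList<>();
        }
        open.add(filterBitMap);
    }

    // Search remains a linear scan, but each scan is bounded by shardCapacity
    // and the shards are independent, so they can be scanned in parallel.
    public List<long[]> search(long[] target) {
        List<long[]> result = new ArrayList<>();
        List<List<long[]>> all = new ArrayList<>(closed);
        all.add(open);
        for (List<long[]> shard : all) {
            for (long[] candidate : shard) {
                if (contains(candidate, target)) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    private static boolean contains(long[] candidate, long[] target) {
        for (int i = 0; i < target.length; i++) {
            if ((candidate[i] & target[i]) != target[i]) {
                return false;
            }
        }
        return true;
    }
}
```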
### Natural Bloofi
Natural Bloofi uses a tree structure, as Bloofi does, except that every node in the tree is a filter that was inserted into the index.<span><a class="footnote-ref" href="#fn2">2</a></span> Natural Bloofi operates like the sharded list except that if the Bloom filter for a node is contained by a node in the list, it is made a child of that node; if it contains a node in the list, it becomes the parent of that node. This yields a flat Bloofi tree where the more saturated filters are closer to the root.
## Encrypted indexing
The idea of using Bloom filters to index encrypted data is not new.<span><a class="footnote-ref" href="#fn3">3</a></span><span><a class="footnote-ref" href="#fn4">4</a></span><span><a class="footnote-ref" href="#fn5">5</a></span><span><a class="footnote-ref" href="#fn6">6</a></span> The salient point is that Bloom filters are a very effective one-way hash with matching capabilities. The simplest solution is to create a reference Bloom filter from the plain text of the columns that are to be indexed, encrypt the data, and send the encrypted data and the Bloom filter to the storage engine. The storage engine stores the encrypted data as a blob and indexes the Bloom filter with a reference to the stored blob.
When searching such an index, the desired plain text values are hashed into a Bloom filter and sent to the storage engine. The engine finds all the matching Bloom filters and returns the encrypted blobs associated with them. The client then decrypts the blobs and removes any false positives.
This technique ensures that the plain text data never leaves the client's system and guarantees that the service has no access to the plain text of the stored data.
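The round trip can be sketched end to end. This example assumes the 4.5 development API for the filters and uses the JDK cipher classes; the two parallel lists stand in for the storage engine, and ECB-mode AES is used only to keep the sketch short (a real system would use an authenticated mode).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

import org.apache.commons.collections4.bloomfilter.BloomFilter;
import org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher;
import org.apache.commons.collections4.bloomfilter.Shape;
import org.apache.commons.collections4.bloomfilter.SimpleBloomFilter;

public class EncryptedIndexDemo {
    static final Shape SHAPE = Shape.fromNP(2, 0.01);   // two indexed columns

    static BloomFilter referenceFilter(String... plaintextColumns) {
        BloomFilter f = new SimpleBloomFilter(SHAPE);
        for (String c : plaintextColumns) {
            f.merge(new EnhancedDoubleHasher(c.getBytes(StandardCharsets.UTF_8)));
        }
        return f;
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        Cipher enc = Cipher.getInstance("AES");
        enc.init(Cipher.ENCRYPT_MODE, key);

        // "storage engine" state: filters and opaque blobs only
        List<BloomFilter> index = new ArrayList<>();
        List<byte[]> blobs = new ArrayList<>();

        // client: hash the indexed columns, encrypt the record, ship both
        String record = "Jane Doe|1970-01-01|5 Main St";
        index.add(referenceFilter("Jane Doe", "1970-01-01"));
        blobs.add(enc.doFinal(record.getBytes(StandardCharsets.UTF_8)));

        // client search: only a Bloom filter crosses the wire
        BloomFilter probe = referenceFilter("Jane Doe", "1970-01-01");
        Cipher dec = Cipher.getInstance("AES");
        dec.init(Cipher.DECRYPT_MODE, key);
        for (int i = 0; i < index.size(); i++) {
            if (index.get(i).contains(probe)) {            // server-side match
                String plain = new String(dec.doFinal(blobs.get(i)),
                        StandardCharsets.UTF_8);           // client decrypts
                System.out.println(plain);                 // then drops false positives
            }
        }
    }
}
```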
## Review
In this section we discussed multidimensional Bloom filters and presented several implementations. We also explored the idea of an encrypted database where the data in transit and at rest is encrypted or at least strongly hashed.
I hope that over the course of this series you have developed a deeper understanding of Bloom filters, their construction and how they can be applied to various technical problems.
<span>
<ol class="footnotes">
<li><a id='fn1'></a>
Adina Crainiceanu. 2013. Bloofi: a hierarchical Bloom filter index with applications to distributed data provenance. In Proceedings of the 2nd International Workshop on Cloud Intelligence. ACM, New York, NY, USA. https://doi.org/10.1145/2501928.2501931
</li>
<li><a id='fn2'></a>
Adina Crainiceanu and Daniel Lemire. 2015. Bloofi: Multidimensional Bloom filters. Information Systems 54, C (Dec. 2015), 311–324. https://doi.org/10.1016/j.is.2015.01.002
</li>
<li><a id='fn3'></a>
Yan-Cheng Chang and Michael Mitzenmacher. Privacy Preserving Keyword Searches on Remote Encrypted Data. Accessed on 18-Dec-2019. 2004. url: https://eprint.iacr.org/2004/051.pdf.
</li>
<li><a id='fn4'></a>
Steven M. Bellovin and William R. Cheswick. Privacy-Enhanced Searches Using Encrypted Bloom Filters. Accessed on 18-Dec-2019. 2004. url: https://mice.cs.columbia.edu/getTechreport.php?techreportID=483.
</li>
<li><a id='fn5'></a>
Eu-Jin Goh. Secure Indexes. Accessed on 18-Dec-2019. 2004. url: https://crypto.stanford.edu/~eujin/papers/secureindex/secureindex.pdf.
</li>
<li><a id='fn6'></a>
Arisa Tajima, Hiroki Sato, and Hayato Yamana. Privacy-Preserving Join Processing over outsourced private datasets with Fully Homomorphic Encryption and Bloom Filters. Accessed on 18-Dec-2019. 2018. url: https://db-event.jpn.org/deim2018/data/papers/201.pdf.
</li>
</ol>
</span>

View File

@ -23,6 +23,26 @@
</bannerRight>
<body>
<head>&lt;script type="text/javascript" id="MathJax-script" async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-AMS-MML_HTMLorMML"&gt;&lt;/script&gt;
&lt;style&gt;
.footnotes ol li p {
display: inline;
}
.footnote-ref {
font-size: .83em ;
vertical-align: super ;
font-weight:bold;
}
.footnote-ref:after {
content: "\00a0";
}
.bodyTable {
width : max-content;
border: 1px solid black;
}
&lt;/style&gt;
</head>
<menu name="Commons Collections">
<item name="Overview" href="/index.html" />
<item name="Download" href="/download_collections.cgi" />

File diff suppressed because it is too large