diff --git a/src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java b/src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java index d251fc525..e2b71f3b7 100644 --- a/src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java +++ b/src/main/java/org/apache/commons/collections4/bloomfilter/package-info.java @@ -32,7 +32,7 @@ * list. There are lots of other uses, and in most cases the reason is to perform a fast check as a gateway for a longer * operation.

* - *

Some Bloom filters (e.g. {@link CountingBloomFilter}) use counters rather than bits. In this case each counter + *

Some Bloom filters (e.g. {@link org.apache.commons.collections4.bloomfilter.CountingBloomFilter}) use counters rather than bits. In this case each counter * is called a cell.

* *

BloomFilter

@@ -46,19 +46,25 @@ * * @@ -69,66 +75,72 @@ *

Implementation Notes

* *

The architecture is designed so that the implementation of the storage of bits is abstracted. Rather than specifying a - * specific state representation we require that all Bloom filters implement the {@link org.apache.commons.collections4.bloomfilter.BitMapExtractor} and {@link org.apache.commons.collections4.bloomfilter.IndexExtractor} interfaces, + * specific state representation, we require that all Bloom filters implement the {@link org.apache.commons.collections4.bloomfilter.BitMapExtractor} + * and {@link org.apache.commons.collections4.bloomfilter.IndexExtractor} interfaces. * Counting-based Bloom filters implement {@link org.apache.commons.collections4.bloomfilter.CellExtractor} as well. There are static * methods in the various Extractor interfaces to convert from one type to another.

* - *

Programs that utilize the Bloom filters may use the {@link org.apache.commons.collections4.bloomfilter.BitMapExtractor} or {@link org.apache.commons.collections4.bloomfilter.IndexExtractor} to retrieve + *

Programs that utilize the Bloom filters may use the {@link org.apache.commons.collections4.bloomfilter.BitMapExtractor} + * or {@link org.apache.commons.collections4.bloomfilter.IndexExtractor} to retrieve * or process a representation of the internal structure. - * Additional methods are available in the {@link org.apache.commons.collections4.bloomfilter.BitMaps} class to assist in manipulation of bit map representations.

+ * Additional methods are available in the {@link org.apache.commons.collections4.bloomfilter.BitMaps} class to assist in + * manipulation of bit map representations.

* *

The Bloom filter is an interface that requires implementation of 9 methods:

* * *

Other methods should be implemented where they can be done more efficiently than the default implementations.

* *

CountingBloomFilter

* - *

The {@link org.apache.commons.collections4.bloomfilter.CountingBloomFilter} extends the Bloom filter by counting the number of times a specific bit has been + *

The {@link org.apache.commons.collections4.bloomfilter.CountingBloomFilter} extends the Bloom filter by counting the number + * of times a specific bit has been * enabled or disabled. This allows the removal (opposite of merge) of Bloom filters at the expense of additional * overhead.

* *

LayeredBloomFilter

* - *

The {@link org.apache.commons.collections4.bloomfilter.LayeredBloomFilter} extends the Bloom filter by creating layers of Bloom filters that can be queried as a single + *

The {@link org.apache.commons.collections4.bloomfilter.LayeredBloomFilter} extends the Bloom filter by creating layers of Bloom + * filters that can be queried as a single * Filter or as a set of filters. This adds the ability to perform windowing on streams of data.

* *

Shape

* - *

The {@link org.apache.commons.collections4.bloomfilter.Shape} describes the Bloom filter using the number of bits and the number of hash functions. It can be specified + *

The {@link org.apache.commons.collections4.bloomfilter.Shape} describes the Bloom filter using the number of bits and the number + * of hash functions. It can be specified * by the number of expected items and desired false positive rate.

* *

Hasher

* - *

A {@link org.apache.commons.collections4.bloomfilter.Hasher} converts bytes into a series of integers based on a Shape. Each hasher represents one item being added + *

A {@link org.apache.commons.collections4.bloomfilter.Hasher} converts bytes into a series of integers based on a Shape. + * Each hasher represents one item being added * to the Bloom filter.

* - *

The {@link org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher} uses a combinatorial generation technique to create the integers. It is easily + *

The {@link org.apache.commons.collections4.bloomfilter.EnhancedDoubleHasher} uses a combinatorial generation technique to + * create the integers. It is easily * initialized by using a byte array returned by the standard {@link java.security.MessageDigest} or other hash function to * initialize the Hasher. Alternatively, a pair of a long values may also be used.

* - *

Other implementations of the {@link org.apache.commons.collections4.bloomfilter.Hasher} are easy to implement, and should make use of the {@code Hasher.Filter} - * and/or {@code Hasher.FileredIntConsumer} classes to filter out duplicate indices when implementing - * {@code Hasher.uniqueIndices(Shape)}.

+ *

Other implementations of the {@link org.apache.commons.collections4.bloomfilter.Hasher} are easy to implement.

* *

References

* diff --git a/src/site/markdown/bloomFilters/advanced.md b/src/site/markdown/bloomFilters/advanced.md index 6090eeb26..97a0bc7f1 100644 --- a/src/site/markdown/bloomFilters/advanced.md +++ b/src/site/markdown/bloomFilters/advanced.md @@ -220,9 +220,9 @@ However, the issue with this solution is that once the filters are saturated, th ``` The above calculation is dependent upon `BloomFilter.cardinality()` function, so it is advisable to use BloomFilters that track cardinality or can calculate cardinality quickly. -## Counting Bloom filters +## Counting Bloom Filters -Standard Bloom filters do not have a mechanism to remove items. One of the solutions for this is to convert each bit to a counter1. The counter and index together are commonly called a cell. As items are added to the filter, the values of the cells associated with the enabled bits are incremented. When an item is removed the values of the cells associated with the enabled bits are decremented. This solution supports removal of items at the expense of making the filter many times larger than a `BitMap` based one. +Standard Bloom filters do not have a mechanism to remove items. One of the solutions for this is to convert each bit to a counter1. The counter and index together are commonly called a cell. As items are added to the filter, the values of the cells associated with the enabled bits are incremented. When an item is removed the values of the cells associated with the enabled bits are decremented. This solution supports removal of items at the expense of making the filter many times larger than a bit map based one. The counting Bloom filter also has a couple of operations not found in other Bloom filters: * Counting Bloom filters can be added together so that the sum of their counts is achieved. @@ -269,7 +269,7 @@ The stable filter works well in environments where inserts occur at a fairly fix There is no implementation of a stable Bloom filter in commons collections. 
-### Layered Bloom filter +### Layered Bloom Filter The layered Bloom filter3 creates a list of filters. Items are added to a target filter and, after a specified condition is met, a new filter is added to the list and it becomes the target. Filters are removed from the list based on other specified conditions. @@ -279,7 +279,7 @@ The layered filter also has the capability to locate the layer in which a filter The layered filter can also be used in situations where the actual number of items is unknown when the Bloom filter is defined. By using a layered filter that adds a target whenever the saturation of the current target reaches 1, the false positive rate for the entire filter does not rise above the value calculated for the shape. -### Apache Commons Collections Implementation +#### Apache Commons Collections Implementation The Apache Commons Collections Bloom filter package contains a layered Bloom filter implementation. To support the `LayeredBloomFilter` there are a `LayerManager` class and a `BloomFilterExtractor` interface. @@ -301,9 +301,173 @@ The `LayeredBloomFilter` class defines several new methods: * `contains(BloomFilterExtractor others)` - returns true if the layered filter contains all of the other filters. * `find(BitmapExtractor)`, `find(BloomFilter)`, `find(Hasher)`, and `find(IndexExtractor)` - returns an array of ints identifying which layers match the pattern. * `get(int layer)` - returns the Bloom filter for the layer. -* `getDepth()` - returns the number of layers in the filter.. +* `getDepth()` - returns the number of layers in the filter. * `next()` - forces the the creation of a new layer. +#### Layered Bloom Filter Example + +In this example we are building a layered Bloom filter that will +* Retain information for a specified period of time: `layerExpiry`. +* Create a new layer when + * the current layer becomes saturates: `numberOfItems`; or + * a timer has expired `layerExpiry / layerCount`. 
+ +This construct has the interesting effect of growing and shrinking the number of layers as demand grows or shrinks. If there is a steady flow of items that does not exceed \\( \frac{numberOfItems}{layerExpiry / layerCount} \\) then there will be `layerCount` layers. If the rate of items is higher, then the number of layers will grow to handle the rate. When the rate decreases the number of layers will decrease as the items age out (`layerExpiry`). If the rate drops to zero there will be one empty Bloom filter layer. However, the number of layers will not shrink unless `isEmpty()` or one of the `merge()` or `contains()` methods is called. + +This implementation is not thread-safe and has the effect that the item is removed from the filter after `layerExpiry` has elapsed, even if it was seen again. If the desired effect is to retain the reference until after the last time the item was seen then the check needs to verify that the item was seen in or near the last filter. + +```java +package org.example; + +import org.apache.commons.collections4.bloomfilter.BloomFilter; +import org.apache.commons.collections4.bloomfilter.Hasher; +import org.apache.commons.collections4.bloomfilter.LayerManager; +import org.apache.commons.collections4.bloomfilter.LayeredBloomFilter; +import org.apache.commons.collections4.bloomfilter.Shape; +import org.apache.commons.collections4.bloomfilter.SimpleBloomFilter; +import org.apache.commons.collections4.bloomfilter.WrappedBloomFilter; + +import java.time.Duration; +import java.util.Deque; +import java.util.Iterator; +import java.util.function.Consumer; +import java.util.function.Predicate; + +public class LayeredExample { + // the layered Bloom filter. + private final LayeredBloomFilter<TimestampedBloomFilter> bloomFilter; + // the expiry time for a layer + private final Duration layerExpiry; + // the expected number of layers. + private final int layerCount; + + /** + * @param numberOfItems the expected number of items in a layer.
* @param falsePositiveRate the acceptable false positive rate for the filter. + * @param layerCount the number of expected layers. + * @param layerExpiry The length of time a layer should exist. + */ + public LayeredExample(int numberOfItems, double falsePositiveRate, int layerCount, Duration layerExpiry) { + this.layerCount = layerCount; + this.layerExpiry = layerExpiry; + Shape shape = Shape.fromNP(numberOfItems, falsePositiveRate); + + // we will create a new Bloom filter (advance) every time the active filter in the layered filter becomes full + Predicate<LayerManager<TimestampedBloomFilter>> advance = LayerManager.ExtendCheck.advanceOnCount(numberOfItems); + // or when the next window should be started. + advance = advance.or(new TimerPredicate()); + + // the cleanup for the LayerManager determines when to remove a layer. + // always remove the empty target. + Consumer<Deque<TimestampedBloomFilter>> cleanup = LayerManager.Cleanup.removeEmptyTarget(); + // remove any expired layers from the front of the list. + cleanup = cleanup.andThen( + lst -> { + if (!lst.isEmpty()) { + long now = System.currentTimeMillis(); + Iterator<TimestampedBloomFilter> iter = lst.iterator(); + while (iter.hasNext()) { + if (iter.next().expires < now) { + iter.remove(); + } else { + break; + } + } + } + }); + + LayerManager.Builder<TimestampedBloomFilter> builder = LayerManager.builder(); + // the layer manager for the Bloom filter, performs automatic cleanup and advance when necessary. + LayerManager<TimestampedBloomFilter> layerManager = builder.setExtendCheck(advance) + .setCleanup(cleanup) + // create a new TimestampedBloomFilter when needed. + .setSupplier(() -> new TimestampedBloomFilter(shape, layerExpiry.toMillis())) + .build(); + // create the layered bloom filter. + bloomFilter = new LayeredBloomFilter<>(shape, layerManager); + } + + // merge a hasher into the bloom filter. + public LayeredExample merge(Hasher hasher) { + bloomFilter.merge(hasher); + return this; + } + + /** + * Returns true if {@code bf} is found in the layered filter. + * + * @param bf the filter to look for. + * @return true if found.
*/ + public boolean contains(BloomFilter bf) { + return bloomFilter.contains(bf); + } + + /** + * Returns true if the item represented by {@code hasher} is found in the layered filter. + * + * @param hasher the hasher representation to search for. + * @return true if found. + */ + public boolean contains(Hasher hasher) { + return bloomFilter.contains(hasher); + } + + /** + * @return true if there are no items being tracked. + */ + public boolean isEmpty() { + bloomFilter.cleanup(); // forces clean + return bloomFilter.isEmpty(); + } + + /** + * An advance (ExtendCheck) predicate that triggers when a new layer should be created based on time + * and the desired number of layers. + */ + class TimerPredicate implements Predicate<LayerManager<TimestampedBloomFilter>> { + long expires; + + TimerPredicate() { + expires = System.currentTimeMillis() + layerExpiry.toMillis() / layerCount; + } + + @Override + public boolean test(LayerManager<TimestampedBloomFilter> o) { + long now = System.currentTimeMillis(); + if (expires > now) { + return false; + } + expires = now + layerExpiry.toMillis() / layerCount; + return true; + } + } + + /** + * A Bloom filter implementation that has a timestamp indicating when it expires. + * Used as the Bloom filter for the layer in the LayeredBloomFilter. + */ + static class TimestampedBloomFilter extends WrappedBloomFilter<TimestampedBloomFilter, SimpleBloomFilter> { + long expires; + + TimestampedBloomFilter(Shape shape, long ttl) { + super(new SimpleBloomFilter(shape)); + expires = System.currentTimeMillis() + ttl; + } + + private TimestampedBloomFilter(TimestampedBloomFilter other) { + super(other.getWrapped().copy()); + this.expires = other.expires; + } + + @Override + public TimestampedBloomFilter copy() { + return new TimestampedBloomFilter(this); + } + } +} +``` + ## Review In this post we covered some unusual uses for Bloom filters as well as a couple of interesting unusual Bloom filters. In the next post, we will introduce the reference Bloom filter, and delve into multidimensional Bloom filters.
We will also show how multidimensional Bloom filters can be used to search encrypted data without decrypting. diff --git a/src/site/markdown/bloomFilters/implementation.md index 3020c8a2b..0bd48d928 100644 --- a/src/site/markdown/bloomFilters/implementation.md +++ b/src/site/markdown/bloomFilters/implementation.md @@ -35,19 +35,19 @@ The `Shape` is constructed from one of five combinations of parameters specified ## The Extractors -The question of how to efficiently externally represent the internal representation of a Bloom filter led to the development of two “extractors”. One that logically represents an ordered collection of BitMap longs, and the other that represents a collection of enabled bits indices. In both cases the extractor feeds each value to a function - called a predicate - that does some operation with the value and returns true to continue processing, or false to stop processing. +The question of how to efficiently externally represent the internal representation of a Bloom filter led to the development of two “extractors”. One that logically represents an ordered collection of bit map longs, and the other that represents a collection of enabled bits indices. In both cases the extractor feeds each value to a function - called a predicate - that does some operation with the value and returns true to continue processing, or false to stop processing. The extractors allow implementation of different internal storage as long as the implementation can produce one or both of the standard extractors. Each extractor has a static method to convert from the other type so only one implementation is required of the internal storage. ### BitMapExtractor -The BitMap extractor is an interface that defines objects that can produce BitMap vectors. BitMap vectors are vectors of long values where the bits are enabled as per the `BitMap` class. All Bloom filters implement this interface.
The `BitMapExtractor` has static methods to create a producer from an array of long values, as well as from an `IndexExtractor`. The interface has default implementations that convert the extractor into an array of long values, and one that executes a LongPredicate for each BitMap. +The `BitMapExtractor` is an interface that defines objects that can produce bit map vectors. Bit map vectors are vectors of long values where the bits are enabled as per the `BitMap` class. All Bloom filters implement this interface. The `BitMapExtractor` has static methods to create an extractor from an array of long values, as well as from an `IndexExtractor`. The interface has default implementations that convert the extractor into an array of long values, and one that executes a `LongPredicate` for each bit map. -The method `processBitMaps(LongPredicate)` is the standard access method for the extractor. Each long value in the BitMap is passed to the predicate in order. Processing continues until the last BitMap is processed or the `LongPredicate` returns false. +The method `processBitMaps(LongPredicate)` is the standard access method for the extractor. Each long value in the bit map is passed to the predicate in order. Processing continues until the last bit map is processed or the `LongPredicate` returns false. ### IndexExtractor -The index extractor produces the index values for the enabled bits in a Bloom filter. All Bloom filters implement this interface. The `IndexExtractor` has static methods to create the extractor from an array of integers or a `BitMapExtractor`.
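The `processBitMaps(LongPredicate)` access pattern described above can be sketched with JDK types alone. The interface and factory names below are illustrative stand-ins, not the library's exact signatures:

```java
import java.util.function.LongPredicate;

// Sketch of the extractor access pattern: feed each 64-bit bit map word to a
// predicate, stopping early if the predicate returns false. Illustrative only.
interface SketchBitMapExtractor {
    boolean processBitMaps(LongPredicate predicate);

    // back the extractor with a plain array of longs, 64 bits per entry
    static SketchBitMapExtractor fromBitMapArray(long[] bitMaps) {
        return predicate -> {
            for (long bitMap : bitMaps) {
                if (!predicate.test(bitMap)) {
                    return false; // consumer asked to stop early
                }
            }
            return true;
        };
    }
}
```

A consumer can, for example, compute the cardinality by summing `Long.bitCount` over the words and always returning `true` so that every word is processed.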
The interface also has default implementations to convert the extractor to an array of integers and one that executes an `IntPredicate` on each index. The method `processIndices(IntPredicate)` is the standard access method for the extractor. Each int value in the extractor is passed to the predicate. Processing continues until the last int is processed or the `IntPredicate` returns false. The order and uniqueness of the values are not guaranteed. @@ -72,12 +72,11 @@ The constructor accepts either two longs or a byte array from a hashing function ### Bloom Filter Now we come to the Bloom filter. As noted above, the `BloomFilter` interface implements both the `IndexExtractor` and the `BitMapExtractor`. The two extractors are the external representations of the internal representation of the Bloom filter. Bloom filter implementations are free to store bit values in any way deemed fit. The requirements for bit value storage are: - * Must be able to produce an IndexExtractor. - * Must be able to produce a BitMapExtractor. + * Must be able to produce an `IndexExtractor` or `BitMapExtractor`. * Must be able to clear the values. Reset the cardinality to zero. - * Must specify if the filter prefers (is faster creating) the IndexExtractor (Sparse characteristic) or the BitMapExtractor (Not sparse characteristic). - * Must be able to merge hashers, Bloom filters, index extractors, and BitMap extractors. When handling extractors it is often the case that an implementation will convert one type of extractor to the other for merging. The BloomFilter interface has default implementations for Bloom filter and hasher merging. - * Must be able to determine if the filter contains hashers, Bloom filters, index extractors, and BitMap extractors. The BloomFilter interface has default implementations for BitMap extractor, Bloom filter and hasher checking.
+ * Must specify if the filter prefers (is faster creating) the `IndexExtractor` (Sparse characteristic) or the `BitMapExtractor` (Not sparse characteristic). + * Must be able to merge hashers, Bloom filters, index extractors, and bit map extractors. When handling extractors it is often the case that an implementation will convert one type of extractor to the other for merging. The BloomFilter interface has default implementations for Bloom filter and hasher merging. + * Must be able to determine if the filter contains hashers, Bloom filters, index extractors, and bit map extractors. The BloomFilter interface has default implementations for bit map extractor, Bloom filter and hasher checking. * Must be able to make a deep copy of itself. * Must be able to produce its `Shape`. @@ -85,10 +84,10 @@ Several implementations of Bloom filter are provided. We will start by focusing ### SimpleBloomFilter -The `SimpleBloomFilter` implements its storage as an on-heap array of longs. This implementation is suitable for many applications. It can also serve as a good model for how to implement `BitMap` storage in specific off-heap situations. +The `SimpleBloomFilter` implements its storage as an on-heap array of longs. This implementation is suitable for many applications. It can also serve as a good model for how to implement bit map storage in specific off-heap situations. ### SparseBloomFilter -The `SparseBloomFilter` implements its storage as a TreeSet of Integers. Since each bit is randomly selected across the entire [0,shape.numberOfBits) range the Sparse Bloom filter only makes sense when \\( \frac{2 \times m}{64} \lt k \times n \\). The reasoning being that every BitMap in the array is equivalent to 2 integers in the sparse filter. The number of integers is the number of hash functions times the number of items – this is an overestimation due to the number of hash collisions that will occur over the number of bits. 
+The `SparseBloomFilter` implements its storage as a TreeSet of Integers. Since each bit is randomly selected across the entire [0,shape.numberOfBits) range, the Sparse Bloom filter only makes sense when \\( \frac{2m}{64} \lt kn \\). The reasoning is that every bit map in the array is equivalent to 2 integers in the sparse filter. The number of integers is the number of hash functions times the number of items – this is an overestimation due to the number of hash collisions that will occur over the number of bits. ### Other diff --git a/src/site/markdown/bloomFilters/intro.md index ea14fd880..dec248e8f 100644 --- a/src/site/markdown/bloomFilters/intro.md +++ b/src/site/markdown/bloomFilters/intro.md @@ -35,7 +35,7 @@ The equation for matching can be expressed as: \\( T \cap C = T \\) There are several properties that define a Bloom filter: the number of bits in the vector (`m`), the number of items that will be merged into the filter (`n`), the number of hash functions for each item (`k`), and the probability of false positives (`p`). All of these values are mathematically related. Mitzenmacher and Upfal2 have shown that the relationship between these properties is -\\[ p = \left( 1 - e^{-kn/m} \rigth) ^k \\] +\\[ p = \left( 1 - e^{-kn/m} \right) ^k \\] However, it has been reported that the false positive rate in real deployments is higher than the value given by this equation.34 Theoretically it has been proven that the equation offered a lower bound of the false positive rate, and a more accurate false positive rate has been discovered.5 @@ -48,7 +48,7 @@ Bloom filters are often used to reduce the search space.
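The false positive equation above pins down the relationship between `m`, `k`, `n`, and `p`; solving it for the optimal shape gives the standard formulas sketched below. This is a plain-Java illustration of the arithmetic, assumed (but not guaranteed) to match the calculation `Shape.fromNP` performs:

```java
// Standard optimal-shape arithmetic for a Bloom filter, derived from
// p = (1 - e^(-kn/m))^k. Sketch for illustration only.
class ShapeMath {
    // m = ceil(-n * ln(p) / (ln 2)^2)
    static int optimalBits(int n, double p) {
        return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = round((m / n) * ln 2)
    static int optimalHashes(int n, int m) {
        return (int) Math.round((double) m / n * Math.log(2));
    }

    // the false positive equation from the text: p = (1 - e^(-kn/m))^k
    static double falsePositiveRate(int n, int m, int k) {
        return Math.pow(1 - Math.exp(-1.0 * k * n / m), k);
    }
}
```

For `n = 3` and `p = 0.2` this yields `m = 11` and `k = 3`, matching the worked example in the text, and the resulting false positive rate evaluates to roughly 0.17.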
For example, consider a Examples of large Bloom filter collections can be found in bioinformatics678 where Bloom filters are used to represent gene sequences, and Bloom filter based databases where records are encoded into Bloom filters.910 -Medium, the digital publishing company, uses Bloom filters to track what articles have been read.11 Bitcoin uses them to speed up clients.12 They have been used to improve caching performance13 and in detecting malicious websites.14 +Medium, the digital publishing company, uses Bloom filters to track what articles have been read.11 Bitcoin uses them to speed up clients.12 Bloom filters have been used to improve caching performance13 and in detecting malicious websites.14 ## How? So, let’s work through an example. Let's assume we want to put 3 items in a filter `(n = 3)` with a 1/5 probability of collision `(p = 1/5 = 0.2)`. Solving \\( p = \left( 1 - e^{-kn/m} \right) ^k \\) yields `m=11` and `k=3`. Thomas Hurst has provided an online calculator15 where you can explore the interactions between the values. @@ -107,22 +107,22 @@ The Jaccard distance, like the cosine distance, is calculated as `1 - Jaccard si The similarity and distance statistics can be used to group similar Bloom filters together; for example when distributing files across a system that uses Bloom filters to determine where the file might be located. In this case it might make sense to store Bloom filters in the collection that has minimal distance. -In addition to basic similarity and difference, if the shape of the filter is known some information about the data behind the filters can be estimated. For example the number of items merged into a filter (n) can be estimated provided we have the cardinality (`c`), number of bits in the vector (`m`) and the number of hash functions (`k`) used when adding each element. 
+In addition to basic similarity and difference, if the shape of the filter is known, some information about the data behind the filters can be estimated. For example, the number of items merged into a filter (`n`) can be estimated provided we have the cardinality (`c`), number of bits in the vector (`m`) and the number of hash functions (`k`) used when adding each element. \\[ n = \frac{-m \ln(1 - c/m)}{k} \\] -Estimating the size of the union of two filters is simply calculating n for the union (bitwise ‘or’) of the two filters. +Estimating the size of the union of two filters is simply calculating `n` for the union (bitwise ‘or’) of the two filters. -Estimating the size of the intersection of two filters is the estimated n of the first + the estimated n of the second - the estimated union of the two. There are some tricky edge conditions, such as when one or both of the estimates of n is infinite. +Estimating the size of the intersection of two filters is the estimated `n` of the first + the estimated `n` of the second - the estimated union of the two. There are some tricky edge conditions, such as when one or both of the estimates of `n` are infinite. ## Usage Errors There are several places that errors can creep into Bloom filter usage. ### Saturation errors -Saturation errors arise from under estimating the number of items that will be placed into the filter. Let’s define “saturation” as the number of items merged into a filter divided by the number of items specified in the Shape. Then once `n` items have been inserted the filter is at a saturation of 1. As the saturation increases the false positive rate increases. Using the calculation for the false positive rate noted above, we can calculate the expected false positive rate for the various saturations. For an interactive version see Thomas Hursts online calculator. +Saturation errors arise from underestimating the number of items that will be placed into the filter.
Let’s define “saturation” as the number of items merged into a filter divided by the number of items specified in the Shape. Then once `n` items have been inserted the filter is at a saturation of 1. As the saturation increases the false positive rate increases. Using the calculation for the false positive rate noted above, we can calculate the expected false positive rate for the various saturations. For an interactive version see Thomas Hurst's online calculator. -For Bloom filters defined as k=17 and n=3 the calculation yields m=72, and p=0.00001. +For Bloom filters defined as `k=17` and `n=3` the calculation yields `m=72` and `p=0.00001`. As the saturation increases the rates of false positives increase as per the following table: | Saturation | Probability of false positive | |------------|-------------------------------| diff --git a/src/site/markdown/bloomFilters/multidimensional.md index 10f1fdbcc..693a3b4d2 100644 --- a/src/site/markdown/bloomFilters/multidimensional.md +++ b/src/site/markdown/bloomFilters/multidimensional.md @@ -22,7 +22,7 @@ We can use reference Bloom filters to index data by storing the Bloom filter alo Searching can be performed by creating a target Bloom filter with partial data, for example name and date of birth from the person example, and then searching through the list as described above. The associated records either have the name and birthdate or are false positives and need to be filtered out during retrieval. -## Multidimensional Bloom filters +## Multidimensional Bloom Filters The description above is a multidimensional Bloom filter. A multidimensional Bloom filter is simply a collection of searchable filters, the simplest implementation being a list. In fact, for fewer than 10K filters the list is the fastest possible solution.
There are two basic reasons for this: * Bloom filter comparisons are extremely fast, taking approximately five (5) machine instructions for the simple comparison. @@ -72,7 +72,7 @@ This solution utilizes the rapidity of the standard list solution, while providi Natural Bloofi uses a Tree structure like Bloofi does, except that each node in the tree is a filter that was inserted into the index.2 Natural Bloofi operates like the sharded list except that if the Bloom filter for a node is contained by a node in the list then it is made a child of that node. If the Bloom filter node contains a node in the list, then it becomes the parent of that node. This yields a flat Bloofi tree where the more saturated filters are closer to the root. -## Encrypted indexing +## Encrypted Indexing The idea of using Bloom filters for indexing encrypted data is not a new idea.3456 The salient points are that Bloom filters are a very effective one-way hash with matching capabilities. The simplest solution is to create a reference Bloom filter comprising the plain text of the columns that are to be indexed. Encrypt the data. Send the encrypted data and the Bloom filter to the storage engine. The storage engine stores the encrypted data as a blob and indexes the Bloom filter with a reference to the stored blob.
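The simplest multidimensional implementation described above, a list searched linearly with the \\( T \cap C = T \\) containment test, can be sketched in a few lines. The class name and the raw `long[]` bit map representation are simplifications for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a list-based multidimensional Bloom filter index: a linear scan
// that applies the containment test (candidate AND target == target).
class LinearBloomIndex {
    private final List<long[]> filters = new ArrayList<>();

    void add(long[] filter) {
        filters.add(filter);
    }

    // positions of every stored filter that contains the target pattern
    List<Integer> search(long[] target) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < filters.size(); i++) {
            if (contains(filters.get(i), target)) {
                matches.add(i);
            }
        }
        return matches;
    }

    // true if every bit enabled in target is also enabled in candidate
    private static boolean contains(long[] candidate, long[] target) {
        for (int w = 0; w < target.length; w++) {
            if ((candidate[w] & target[w]) != target[w]) {
                return false;
            }
        }
        return true;
    }
}
```

A partial-data target (for example, only the name fields hashed) simply enables fewer bits, so it matches every stored filter that could contain the record, false positives included.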