[Docs] Improve Bloom filter topic (#17547)

* [Docs] Improve Bloom filter topic
* Apply suggestions from code review

  Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update spelling file

---------

Co-authored-by: Charles Smith <techdocsmith@gmail.com>
2024-12-10 13:43:56 -06:00 · 2024-12-10 13:43:56 -06:00 · a51061fa43
parent 61d986a179
commit a51061fa43
2 changed files with 56 additions and 58 deletions
--- a/docs/development/extensions-core/bloom-filter.md
+++ b/docs/development/extensions-core/bloom-filter.md
@@ -23,28 +23,25 @@ title: "Bloom Filter"
  -->
-To use this Apache Druid extension, [include](../../configuration/extensions.md#loading-extensions) `druid-bloom-filter` in the extensions load list.
+To use the Apache Druid&circledR; Bloom filter extension, include `druid-bloom-filter` in the extensions load list. See [Loading extensions](../../configuration/extensions.md#loading-extensions) for more information.
-This extension adds the ability to both construct bloom filters from query results, and filter query results by testing
+This extension adds the ability to construct Bloom filters from query results and to filter query results by testing
-against a bloom filter. A Bloom filter is a probabilistic data structure for performing a set membership check. A bloom
+against a Bloom filter. A Bloom filter is a probabilistic data structure to check for set membership. A Bloom
-filter is a good candidate to use with Druid for cases where an explicit filter is impossible, e.g. filtering a query
+filter is a good candidate to use when an explicit filter is impossible, such as filtering a query
 against a set of millions of values.
 Following are some characteristics of Bloom filters:
- Bloom filters are highly space efficient when compared to using a HashSet.
+- Bloom filters are significantly more space efficient than HashSets.
- Because of the probabilistic nature of bloom filters, false positive results are possible (element was not actually
+- Because they are probabilistic, false positive results are possible with Bloom filters. For example, the `test()` function might return `true` for an element that is not within the filter.
-inserted into a bloom filter during construction, but `test()` says true)
+- False negatives are not possible. If an element is present, `test()` always returns `true`.
- False negatives are not possible (if element is present then `test()` will never say false).
+- The false positive probability of this implementation is fixed at 5%. Increasing the number of entries that the filter can hold can decrease this false positive rate in exchange for overall size.
- The false positive probability of this implementation is currently fixed at 5%, but increasing the number of entries
+- Bloom filters are sensitive to the number of inserted elements. You must specify the expected number of entries at creation time. If the number of insertions exceeds the specified number of entries, the false positive probability increases accordingly.
 that the filter can hold can decrease this false positive rate in exchange for overall size.
 - Bloom filters are sensitive to number of elements that will be inserted in the bloom filter. During the creation of bloom filter expected number of entries must be specified. If the number of insertions exceed
 the specified initial number of entries then false positive probability will increase accordingly.
-This extension is currently based on `org.apache.hive.common.util.BloomKFilter` from `hive-storage-api`. Internally,
+This extension is based on `org.apache.hive.common.util.BloomKFilter` from `hive-storage-api`. Internally,
 this implementation uses Murmur3 as the hash algorithm.
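The characteristics listed above can be illustrated with a minimal sketch. It is not the Hive `BloomKFilter`: the class name `ToyBloomFilter`, the `mix` helper, and the hashing scheme are hypothetical stand-ins (the real implementation uses Murmur3), and the sizing uses the textbook formulas for a target false positive probability, which is assumed here rather than taken from the extension.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration only. NOT org.apache.hive.common.util.BloomKFilter.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashFunctions;

    ToyBloomFilter(long expectedEntries, double falsePositiveProbability) {
        // Textbook sizing (assumption): m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        double ln2 = Math.log(2);
        this.numBits = (int) Math.ceil(-expectedEntries * Math.log(falsePositiveProbability) / (ln2 * ln2));
        this.numHashFunctions = Math.max(1, (int) Math.round((double) numBits / expectedEntries * ln2));
        this.bits = new BitSet(numBits);
    }

    // Sets one bit per hash function for the given value.
    void add(String value) {
        int h1 = value.hashCode();
        int h2 = mix(h1);
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits));
        }
    }

    // Returns false only if the value was definitely never added.
    // A true result may be a false positive.
    boolean test(String value) {
        int h1 = value.hashCode();
        int h2 = mix(h1);
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    int numBits() {
        return numBits;
    }

    int numHashFunctions() {
        return numHashFunctions;
    }

    // Hypothetical integer scrambler used as a second hash; a stand-in for Murmur3.
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h | 1; // force an odd, non-zero step
    }
}
```

Every bit that `add` sets stays set, so `test` cannot return `false` for an inserted value. The false positive rate depends on how full the bit set is, which is why inserting more values than the expected number of entries degrades accuracy.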
-To construct a BloomKFilter externally with Java to use as a filter in a Druid query:
+The following Java example shows how to construct a BloomKFilter externally:
 ```java
 BloomKFilter bloomFilter = new BloomKFilter(1500);
@@ -56,11 +53,12 @@ BloomKFilter.serialize(byteArrayOutputStream, bloomFilter);
 String base64Serialized = Base64.encodeBase64String(byteArrayOutputStream.toByteArray());
 ```
-This string can then be used in the native or SQL Druid query.
+You can then use the Base64-encoded string in JSON-based or SQL-based queries in Druid.
-## Filtering queries with a Bloom Filter
+## Filter queries with a Bloom filter
 ### JSON specification
 ### JSON Specification of Bloom Filter
 ```json
 {
  "type" : "bloom",
@@ -70,50 +68,46 @@ This string can then be used in the native or SQL Druid query.
 }
 ```
-|Property                 |Description                   |required?                           |
+|Property|Description|Required|
-|-------------------------|------------------------------|----------------------------------|
+|--------|-----------|--------|
-|`type`                   |Filter Type. Should always be `bloom`|yes|
+|`type`|Filter type. Set to `bloom`.|Yes|
-|`dimension`              |The dimension to filter over. | yes |
+|`dimension`|Dimension to filter over.|Yes|
-|`bloomKFilter`           |Base64 encoded Binary representation of `org.apache.hive.common.util.BloomKFilter`| yes |
+|`bloomKFilter`|Base64-encoded binary representation of `org.apache.hive.common.util.BloomKFilter`.|Yes|
-|`extractionFn`|[Extraction function](../../querying/dimensionspecs.md#extraction-functions) to apply to the dimension values |no|
+|`extractionFn`|[Extraction function](../../querying/dimensionspecs.md#extraction-functions) to apply to the dimension values.|No|
 ### Serialized format for BloomKFilter
-### Serialized Format for BloomKFilter
+Serialized BloomKFilter format:
- Serialized BloomKFilter format:
+- 1 byte for the number of hash functions.
 - 1 big-endian integer for the number of longs in the bitset.
 - Big-endian longs in the BloomKFilter bitset.
- - 1 byte for the number of hash functions.
+`org.apache.hive.common.util.BloomKFilter` provides a method to serialize Bloom filters to `outputStream`.
 - 1 big endian int(That is how OutputStream works) for the number of longs in the bitset
 - big endian longs in the BloomKFilter bitset
-Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method which can be used to serialize bloom filters to outputStream.
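The byte layout described above can be sketched with `java.nio.ByteBuffer`, which writes in big-endian order by default. This is an illustration of the layout only, not the Hive serializer: the class `BloomLayoutSketch` and its methods are hypothetical names, and real filters should be serialized with `BloomKFilter.serialize`.

```java
import java.nio.ByteBuffer;

// Illustrative reader and writer for the documented layout.
// Hypothetical helper; not part of Druid or Hive.
final class BloomLayoutSketch {

    // Writes: 1 byte (number of hash functions), 1 big-endian int (number of longs),
    // then each long of the bitset in big-endian order.
    static byte[] write(int numHashFunctions, long[] bitSet) {
        ByteBuffer buffer = ByteBuffer.allocate(1 + Integer.BYTES + Long.BYTES * bitSet.length);
        buffer.put((byte) numHashFunctions);
        buffer.putInt(bitSet.length);
        for (long word : bitSet) {
            buffer.putLong(word);
        }
        return buffer.array();
    }

    // Reads the number of hash functions from the first byte.
    static int readNumHashFunctions(byte[] serialized) {
        return serialized[0];
    }

    // Reads the bitset longs that follow the 5-byte header.
    static long[] readBitSet(byte[] serialized) {
        ByteBuffer buffer = ByteBuffer.wrap(serialized);
        buffer.get(); // skip the hash function count
        long[] bitSet = new long[buffer.getInt()];
        for (int i = 0; i < bitSet.length; i++) {
            bitSet[i] = buffer.getLong();
        }
        return bitSet;
    }
}
```

Under this layout, a filter with `n` longs in its bitset occupies `1 + 4 + 8 * n` bytes before Base64 encoding.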
+### Filter SQL queries
-### Filtering SQL Queries
+You can use Bloom filters in SQL `WHERE` clauses with the `bloom_filter_test` operator:
 Bloom filters can be used in SQL `WHERE` clauses via the `bloom_filter_test` operator:
 ```sql
 SELECT COUNT(*) FROM druid.foo WHERE bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')
 ```
-### Expression and Virtual Column Support
+### Expression and virtual column support
-The bloom filter extension also adds a bloom filter [Druid expression](../../querying/math-expr.md) which shares syntax
+The Bloom filter extension also adds a Bloom filter [Druid expression](../../querying/math-expr.md) that shares syntax
 with the SQL operator.
 ```sql
 bloom_filter_test(<expr>, '<serialized_bytes_for_BloomKFilter>')
 ```
-## Bloom Filter Query Aggregator
+## Bloom filter query aggregator
-Input for a `bloomKFilter` can also be created from a druid query with the `bloom` aggregator. Note that it is very
+You can create an input for a `BloomKFilter` from a Druid query with the `bloom` aggregator. Make sure to set a reasonable value for the `maxNumEntries` parameter to specify the maximum number of distinct entries that the Bloom filter can represent without increasing the false positive rate. Try performing a query using
-important to set a reasonable value for the `maxNumEntries` parameter, which is the maximum number of distinct entries
+one of the unique count sketches to calculate the value for this parameter to build a Bloom filter appropriate for the query.
 that the bloom filter can represent without increasing the false positive rate. It may be worth performing a query using
 one of the unique count sketches to calculate the value for this parameter in order to build a bloom filter appropriate
 for the query.
-### JSON Specification of Bloom Filter Aggregator
+### JSON specification
 ```json
 {
@@ -124,15 +118,17 @@ for the query.
    }
 ```
-|Property                 |Description                   |required?                           |
+|Property|Description|Required|
-|-------------------------|------------------------------|----------------------------------|
+|--------|-----------|--------|
-|`type`                   |Aggregator Type. Should always be `bloom`|yes|
+|`type`|Aggregator type. Set to `bloom`.|Yes|
-|`name`                   |Output field name |yes|
+|`name`|Output field name.|Yes|
-|`field`                  |[DimensionSpec](../../querying/dimensionspecs.md) to add to `org.apache.hive.common.util.BloomKFilter` | yes |
+|`field`|[DimensionSpec](../../querying/dimensionspecs.md) to add to `org.apache.hive.common.util.BloomKFilter`.|Yes|
-|`maxNumEntries`          |Maximum number of distinct values supported by `org.apache.hive.common.util.BloomKFilter`, default `1500`| no |
+|`maxNumEntries`|Maximum number of distinct values supported by `org.apache.hive.common.util.BloomKFilter`. Defaults to `1500`.|No|
 ### Example
 The following example shows a timeseries query object with a `bloom` aggregator:
 ```json
 {
  "queryType": "timeseries",
@@ -154,25 +150,26 @@ for the query.
 }
 ```
-response
+Example response:
 ```json
-[{"timestamp":"2015-09-12T00:00:00.000Z","result":{"userBloom":"BAAAJhAAAA..."}}]
+[
  {
    "timestamp":"2015-09-12T00:00:00.000Z",
    "result":{"userBloom":"BAAAJhAAAA..."}
  }
 ]
 ```
-These values can then be set in the filter specification described above.
+We recommend ordering by an alternative aggregation method instead of ordering results by a Bloom filter aggregator.
 Ordering results by a Bloom filter aggregator can be resource-intensive because Druid performs an expensive linear scan of the filter to approximate the count of items added to the set by counting the number of set bits. 
-Ordering results by a bloom filter aggregator, for example in a TopN query, will perform a comparatively expensive
+### SQL Bloom filter aggregator
 linear scan _of the filter itself_ to count the number of set bits as a means of approximating how many items have been
 added to the set. As such, ordering by an alternate aggregation is recommended if possible.
-
+You can compute Bloom filters in SQL expressions with the `BLOOM_FILTER` aggregator. For example:
 ### SQL Bloom Filter Aggregator
 Bloom filters can be computed in SQL expressions with the `bloom_filter` aggregator:
 ```sql
 SELECT BLOOM_FILTER(<expression>, <max number of entries>) FROM druid.foo WHERE dim2 = 'abc'
 ```
-but requires the setting `druid.sql.planner.serializeComplexValues` to be set to `true`. Bloom filter results in a SQL
+Druid serializes Bloom filter results in a SQL response into a Base64 string. You can use the resulting string in subsequent queries as a filter.
 response are serialized into a base64 string, which can then be used in subsequent queries as a filter.
--- a/website/.spelling
+++ b/website/.spelling
@@ -117,6 +117,7 @@ Guice
 HDFS
 HLL
 HashSet
 HashSets
 Homebrew
 html
 HyperLogLog