Improve worst-case performance of LIKE filters by 20x (#16153)

* Expected-linear-time LIKE

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simple regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids backtracking.
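
To make the blow-up concrete, here is a toy illustration. It is not Druid code and not how `java.util.regex` works internally; it is a naive recursive matcher that supports only `%` and counts its own calls. `BacktrackingDemo` and its methods are hypothetical names made up for this sketch.

```java
import java.util.Arrays;

// Hypothetical demo class: a naive backtracking LIKE matcher ('%' only) with a call counter.
final class BacktrackingDemo
{
  static long steps;

  // Matches pattern[p..] against s[i..], trying every possible length for each '%'.
  static boolean naiveLike(String pattern, int p, String s, int i)
  {
    steps++;
    if (p == pattern.length()) {
      return i == s.length();
    }
    if (pattern.charAt(p) == '%') {
      // Longest candidate first, then back off one character at a time.
      for (int k = s.length(); k >= i; k--) {
        if (naiveLike(pattern, p + 1, s, k)) {
          return true;
        }
      }
      return false;
    }
    return i < s.length()
           && s.charAt(i) == pattern.charAt(p)
           && naiveLike(pattern, p + 1, s, i + 1);
  }

  // Resets the counter, runs one match, and returns the number of calls made.
  static long countSteps(String pattern, String s)
  {
    steps = 0;
    naiveLike(pattern, 0, s, 0);
    return steps;
  }

  // Builds a string of n 'a' characters (contains no 'x', so "...x" patterns never match).
  static String repeatA(int n)
  {
    char[] cs = new char[n];
    Arrays.fill(cs, 'a');
    return new String(cs);
  }

  public static void main(String[] args)
  {
    for (int n : new int[]{20, 40, 80}) {
      String s = repeatA(n);
      System.out.println(n + ": %x=" + countSteps("%x", s) + " %%%%x=" + countSteps("%%%%x", s));
    }
  }
}
```

For a string of `n` characters with no `x`, the call count for `%x` grows linearly in `n`, while for `%%%%x` it grows roughly with `n^4`, because each of the four wildcards retries every split point.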

Technically, the algorithm runs in `O(nm)`, where `n` is the length of the string to match and `m` is the length of the pattern. In practice, it should run in linear time: essentially as fast as `String.indexOf()` can search for the next match. Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.7x speedup for a simple "contains" query (`%50%`), and more than a 20x speedup for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings, so cases with longer strings should benefit more.
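
As a rough sketch of what a greedy, `indexOf`-driven matcher can look like (an illustration under simplifying assumptions, not the code from this change): the version below handles only the `%` wildcard and ignores `_` and escape characters. `GreedyLikeSketch` is a hypothetical name.

```java
// Hypothetical sketch: greedy LIKE matching for patterns that use only '%'.
final class GreedyLikeSketch
{
  static boolean like(String pattern, String s)
  {
    // Limit -1 keeps leading/trailing empty pieces, e.g. "%50%" -> ["", "50", ""].
    String[] parts = pattern.split("%", -1);
    if (parts.length == 1) {
      return s.equals(pattern); // no wildcard at all: exact match
    }
    String first = parts[0];
    String last = parts[parts.length - 1];
    // The first piece is anchored at the start, the last piece at the end.
    if (!s.startsWith(first) || !s.endsWith(last)) {
      return false;
    }
    int offset = first.length();
    int end = s.length() - last.length();
    if (end < offset) {
      return false; // prefix and suffix would have to overlap
    }
    // Middle pieces: take the leftmost occurrence and never revisit it.
    for (int k = 1; k < parts.length - 1; k++) {
      int at = s.indexOf(parts[k], offset);
      if (at < 0 || at + parts[k].length() > end) {
        return false;
      }
      offset = at + parts[k].length();
    }
    return true;
  }

  public static void main(String[] args)
  {
    System.out.println(like("%50%", "1500"));
    System.out.println(like("%%%%x", "aaaa"));
  }
}
```

Each middle piece is located once, left to right, so a pattern such as `%%%%x` against a string that does not end in `x` is rejected by the single `endsWith` check rather than by retrying every wildcard split.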

Note that the `REGEX` operator still suffers from the same potentially catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental (or intentional) DoSing.

```
Benchmark                                (cardinality)  Mode  Cnt  Before Score       Error  After Score       Error  Units  Before / After
LikeFilterBenchmark.matchBoundPrefix              1000  avgt   10         6.686 ±     0.026        6.765 ±     0.087  us/op           0.99x
LikeFilterBenchmark.matchBoundPrefix            100000  avgt   10       163.936 ±     1.589      140.014 ±     0.563  us/op           1.17x
LikeFilterBenchmark.matchBoundPrefix           1000000  avgt   10      1235.259 ±     7.318     1165.330 ±     9.300  us/op           1.06x
LikeFilterBenchmark.matchLikeContains             1000  avgt   10       255.074 ±     1.530      130.212 ±     3.314  us/op           1.96x
LikeFilterBenchmark.matchLikeContains           100000  avgt   10     34789.639 ±   210.219    18563.644 ±   100.030  us/op           1.87x
LikeFilterBenchmark.matchLikeContains          1000000  avgt   10    287265.302 ±  1790.957   164684.778 ±   317.698  us/op           1.74x
LikeFilterBenchmark.matchLikeEquals               1000  avgt   10         0.410 ±     0.003        0.399 ±     0.001  us/op           1.03x
LikeFilterBenchmark.matchLikeEquals             100000  avgt   10         0.793 ±     0.005        0.719 ±     0.003  us/op           1.10x
LikeFilterBenchmark.matchLikeEquals            1000000  avgt   10         0.864 ±     0.004        0.839 ±     0.005  us/op           1.03x
LikeFilterBenchmark.matchLikeKiller               1000  avgt   10      3077.629 ±     7.928      103.714 ±     2.417  us/op          29.67x
LikeFilterBenchmark.matchLikeKiller             100000  avgt   10    311048.049 ± 13466.911    14777.567 ±    70.242  us/op          21.05x
LikeFilterBenchmark.matchLikeKiller            1000000  avgt   10   3055855.099 ± 18387.839    92476.621 ±  1198.255  us/op          33.04x
LikeFilterBenchmark.matchLikePrefix               1000  avgt   10         6.711 ±     0.035        6.653 ±     0.046  us/op           1.01x
LikeFilterBenchmark.matchLikePrefix             100000  avgt   10       161.535 ±     0.574      163.740 ±     0.833  us/op           0.99x
LikeFilterBenchmark.matchLikePrefix            1000000  avgt   10      1255.696 ±     5.207     1201.378 ±     3.466  us/op           1.05x
LikeFilterBenchmark.matchRegexContains            1000  avgt   10       467.736 ±     2.546      481.431 ±     5.647  us/op           0.97x
LikeFilterBenchmark.matchRegexContains          100000  avgt   10     64871.766 ±   223.341    65483.992 ±   391.249  us/op           0.99x
LikeFilterBenchmark.matchRegexContains         1000000  avgt   10    482906.004 ±  2003.583   477195.835 ±  3094.605  us/op           1.01x
LikeFilterBenchmark.matchRegexKiller              1000  avgt   10      8071.881 ±    18.026     8052.322 ±    17.336  us/op           1.00x
LikeFilterBenchmark.matchRegexKiller            100000  avgt   10   1120094.520 ±  2428.172   808321.542 ±  2411.032  us/op           1.39x
LikeFilterBenchmark.matchRegexKiller           1000000  avgt   10   8096745.012 ± 40782.747  8114114.896 ± 43250.204  us/op           1.00x
LikeFilterBenchmark.matchRegexPrefix              1000  avgt   10       170.843 ±     1.095      175.924 ±     1.144  us/op           0.97x
LikeFilterBenchmark.matchRegexPrefix            100000  avgt   10     17785.280 ±   116.813    18708.888 ±    61.857  us/op           0.95x
LikeFilterBenchmark.matchRegexPrefix           1000000  avgt   10    174415.586 ±  1827.478   173190.799 ±   949.224  us/op           1.01x
LikeFilterBenchmark.matchSelectorEquals           1000  avgt   10         0.411 ±     0.003        0.416 ±     0.002  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals         100000  avgt   10         0.728 ±     0.003        0.739 ±     0.003  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals        1000000  avgt   10         0.842 ±     0.002        0.879 ±     0.007  us/op           0.96x
```

* Take into account whether druid.generic.useDefaultValueForNull is set in LikeDimFilterTest assertions.

* Attempt to placate CodeQL.

* Fix handling of multi-pattern suffixes.

* Expected-linear-time LIKE

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simple regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). This MR implements a simple greedy algorithm that avoids the catastrophic backtracking by converting the `LIKE` pattern into a list of `java.util.regex.Pattern` objects, splitting on the `%` wildcard. The resulting sub-patterns do no backtracking, and a simple greedy loop uses `Matcher.find()` to progress through the string.
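
The split-and-find idea can be sketched in a self-contained form as follows. This is a simplified illustration rather than the code in the diff: it has no escape-character support, it quotes literals with `Pattern.quote` instead of the filter's own escaping, and `SplitLikeSketch`, `compile`, and `matches` are names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: compile a LIKE pattern into wildcard-free sub-patterns, then match greedily.
final class SplitLikeSketch
{
  static List<Pattern> compile(String like)
  {
    List<Pattern> parts = new ArrayList<>();
    StringBuilder regex = new StringBuilder("^"); // the first piece is anchored at the start
    for (char c : like.toCharArray()) {
      if (c == '%') {
        // Flush the piece built so far, unless it is empty or only the start anchor.
        if (regex.length() > 0 && !regex.toString().equals("^")) {
          parts.add(Pattern.compile(regex.toString(), Pattern.DOTALL));
        }
        regex.setLength(0);
      } else if (c == '_') {
        regex.append('.');
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    if (like.isEmpty()) {
      parts.add(Pattern.compile("^$"));
    } else if (regex.length() > 0) {
      // No trailing '%': the last piece is anchored at the end.
      parts.add(Pattern.compile(regex.append('$').toString(), Pattern.DOTALL));
    }
    return parts;
  }

  static boolean matches(List<Pattern> parts, String s)
  {
    int offset = 0;
    for (Pattern part : parts) {
      Matcher m = part.matcher(s);
      if (!m.find(offset)) {
        return false;
      }
      offset = m.end(); // continue after this piece; earlier pieces are never revisited
    }
    return true;
  }

  public static void main(String[] args)
  {
    System.out.println(compile("a%b_d_"));
    System.out.println(matches(compile("a%b_d_"), "abcdexyzbcde"));
  }
}
```

Because the final piece carries the `$` anchor, `find` skips earlier occurrences that do not reach the end of the string, which is why a multi-part suffix such as `b_d_` still matches only at the end of the value.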

Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed a 1.13x to 1.29x speedup for a simple "contains" query (`%50%`), and more than a 20x speedup for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings, so cases with longer strings should benefit more.

Note that the `REGEX` operator still suffers from the same potentially catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental (or intentional) DoSing.

```
Benchmark                                      (cardinality)  Mode  Cnt  Before Score       Error      After Score     Error  Units  Before/After
LikeFilterBenchmark.matchBoundPrefix                    1000  avgt   10         5.410 ±     0.010          5.582 ±     0.004  us/op         0.97x
LikeFilterBenchmark.matchBoundPrefix                  100000  avgt   10       140.920 ±     0.306        141.082 ±     0.391  us/op         1.00x
LikeFilterBenchmark.matchBoundPrefix                 1000000  avgt   10      1082.762 ±     1.070       1171.407 ±     1.628  us/op         0.92x
LikeFilterBenchmark.matchLikeComplexContains            1000  avgt   10       221.572 ±     0.228        183.742 ±     0.210  us/op         1.21x
LikeFilterBenchmark.matchLikeComplexContains          100000  avgt   10     25461.362 ±    21.481      17373.828 ±    42.577  us/op         1.47x
LikeFilterBenchmark.matchLikeComplexContains         1000000  avgt   10    221075.917 ±   919.238     177454.683 ±   506.420  us/op         1.25x
LikeFilterBenchmark.matchLikeContains                   1000  avgt   10       283.015 ±     0.219        218.835 ±     3.126  us/op         1.29x
LikeFilterBenchmark.matchLikeContains                 100000  avgt   10     30202.910 ±    32.697      26713.488 ±    49.525  us/op         1.13x
LikeFilterBenchmark.matchLikeContains                1000000  avgt   10    284661.411 ±   130.324     243381.857 ±   540.143  us/op         1.17x
LikeFilterBenchmark.matchLikeEquals                     1000  avgt   10         0.386 ±     0.001          0.380 ±     0.001  us/op         1.02x
LikeFilterBenchmark.matchLikeEquals                   100000  avgt   10         0.670 ±     0.001          0.705 ±     0.002  us/op         0.95x
LikeFilterBenchmark.matchLikeEquals                  1000000  avgt   10         0.839 ±     0.001          0.796 ±     0.001  us/op         1.05x
LikeFilterBenchmark.matchLikeKiller                     1000  avgt   10      4882.099 ±     7.953        170.142 ±     0.494  us/op        28.69x
LikeFilterBenchmark.matchLikeKiller                   100000  avgt   10    524122.010 ±   390.170      19461.637 ±   117.090  us/op        26.93x
LikeFilterBenchmark.matchLikeKiller                  1000000  avgt   10   5121795.377 ±  4176.052     181162.978 ±   368.443  us/op        28.27x
LikeFilterBenchmark.matchLikePrefix                     1000  avgt   10         5.708 ±     0.005          5.677 ±     0.011  us/op         1.01x
LikeFilterBenchmark.matchLikePrefix                   100000  avgt   10       141.853 ±     0.554        108.313 ±     0.330  us/op         1.31x
LikeFilterBenchmark.matchLikePrefix                  1000000  avgt   10      1199.148 ±     1.298       1153.297 ±     1.575  us/op         1.04x
LikeFilterBenchmark.matchLikeSuffix                     1000  avgt   10       256.020 ±     0.283        196.339 ±     0.564  us/op         1.30x
LikeFilterBenchmark.matchLikeSuffix                   100000  avgt   10     29917.931 ±    28.218      21450.997 ±    20.341  us/op         1.39x
LikeFilterBenchmark.matchLikeSuffix                  1000000  avgt   10    241225.193 ±   465.824     194034.292 ±   362.312  us/op         1.24x
LikeFilterBenchmark.matchRegexComplexContains           1000  avgt   10       119.597 ±     0.635        135.550 ±     0.697  us/op         0.88x
LikeFilterBenchmark.matchRegexComplexContains         100000  avgt   10     13089.670 ±    13.738      13766.712 ±    12.802  us/op         0.95x
LikeFilterBenchmark.matchRegexComplexContains        1000000  avgt   10    130822.830 ±  1624.048     131076.029 ±  1636.811  us/op         1.00x
LikeFilterBenchmark.matchRegexContains                  1000  avgt   10       573.273 ±     0.421        615.399 ±     0.633  us/op         0.93x
LikeFilterBenchmark.matchRegexContains                100000  avgt   10     57259.313 ±   162.747      62900.380 ±    44.746  us/op         0.91x
LikeFilterBenchmark.matchRegexContains               1000000  avgt   10    571335.768 ±  2822.776     542536.982 ±   780.290  us/op         1.05x
LikeFilterBenchmark.matchRegexKiller                    1000  avgt   10     11525.499 ±     8.741      11061.791 ±    21.746  us/op         1.04x
LikeFilterBenchmark.matchRegexKiller                  100000  avgt   10   1170414.723 ±   766.160    1144437.291 ±   886.263  us/op         1.02x
LikeFilterBenchmark.matchRegexKiller                 1000000  avgt   10  11507668.302 ± 11318.176  110381620.014 ± 10707.974  us/op         1.11x
LikeFilterBenchmark.matchRegexPrefix                    1000  avgt   10       156.460 ±     0.097        155.217 ±     0.431  us/op         1.01x
LikeFilterBenchmark.matchRegexPrefix                  100000  avgt   10     15056.491 ±    23.906      15508.965 ±   763.976  us/op         0.97x
LikeFilterBenchmark.matchRegexPrefix                 1000000  avgt   10    154416.563 ±   473.108     153737.912 ±   273.347  us/op         1.00x
LikeFilterBenchmark.matchRegexSuffix                    1000  avgt   10       610.684 ±     0.462        590.352 ±     0.334  us/op         1.03x
LikeFilterBenchmark.matchRegexSuffix                  100000  avgt   10     53196.517 ±    78.155      59460.261 ±    56.934  us/op         0.89x
LikeFilterBenchmark.matchRegexSuffix                 1000000  avgt   10    536100.944 ±   440.353     550098.917 ±   740.464  us/op         0.97x
LikeFilterBenchmark.matchSelectorEquals                 1000  avgt   10         0.390 ±     0.001          0.366 ±     0.001  us/op         1.07x
LikeFilterBenchmark.matchSelectorEquals               100000  avgt   10         0.724 ±     0.001          0.714 ±     0.001  us/op         1.01x
LikeFilterBenchmark.matchSelectorEquals              1000000  avgt   10         0.826 ±     0.001          0.847 ±     0.001  us/op         0.98x
```
Tim Williamson 2024-04-23 22:45:23 -07:00 committed by GitHub
parent f1d24c868f
commit 4bdc1890f7
3 changed files with 368 additions and 29 deletions


@@ -49,7 +49,6 @@ import org.openjdk.jmh.annotations.Scope;
 import org.openjdk.jmh.annotations.Setup;
 import org.openjdk.jmh.annotations.State;
 import org.openjdk.jmh.annotations.Warmup;
-import org.openjdk.jmh.infra.Blackhole;
 import java.nio.ByteBuffer;
 import java.util.ArrayList;
@@ -106,6 +105,58 @@ public class LikeFilterBenchmark
       null
   ).toFilter();
+  private static final Filter REGEX_SUFFIX = new RegexDimFilter(
+      "foo",
+      ".*50$",
+      null
+  ).toFilter();
+  private static final Filter LIKE_SUFFIX = new LikeDimFilter(
+      "foo",
+      "%50",
+      null,
+      null
+  ).toFilter();
+  private static final Filter LIKE_CONTAINS = new LikeDimFilter(
+      "foo",
+      "%50%",
+      null,
+      null
+  ).toFilter();
+  private static final Filter REGEX_CONTAINS = new RegexDimFilter(
+      "foo",
+      ".*50.*",
+      null
+  ).toFilter();
+  private static final Filter LIKE_COMPLEX_CONTAINS = new LikeDimFilter(
+      "foo",
+      "%5_0%0_5%",
+      null,
+      null
+  ).toFilter();
+  private static final Filter REGEX_COMPLEX_CONTAINS = new RegexDimFilter(
+      "foo",
+      "%5_0%0_5",
+      null
+  ).toFilter();
+  private static final Filter LIKE_KILLER = new LikeDimFilter(
+      "foo",
+      "%%%%x",
+      null,
+      null
+  ).toFilter();
+  private static final Filter REGEX_KILLER = new RegexDimFilter(
+      "foo",
+      ".*.*.*.*x",
+      null
+  ).toFilter();
   // cardinality, the dictionary will contain evenly spaced integers
   @Param({"1000", "100000", "1000000"})
   int cardinality;
@@ -147,46 +198,105 @@ public class LikeFilterBenchmark
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchLikeEquals(Blackhole blackhole)
+  public ImmutableBitmap matchLikeEquals()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(LIKE_EQUALS, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(LIKE_EQUALS, selector);
   }
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchSelectorEquals(Blackhole blackhole)
+  public ImmutableBitmap matchSelectorEquals()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(SELECTOR_EQUALS, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(SELECTOR_EQUALS, selector);
   }
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchLikePrefix(Blackhole blackhole)
+  public ImmutableBitmap matchLikePrefix()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(LIKE_PREFIX, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(LIKE_PREFIX, selector);
   }
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchBoundPrefix(Blackhole blackhole)
+  public ImmutableBitmap matchBoundPrefix()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(BOUND_PREFIX, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(BOUND_PREFIX, selector);
   }
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchRegexPrefix(Blackhole blackhole)
+  public ImmutableBitmap matchRegexPrefix()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(REGEX_PREFIX, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(REGEX_PREFIX, selector);
+  }
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchLikeSuffix()
+  {
+    return Filters.computeDefaultBitmapResults(LIKE_SUFFIX, selector);
+  }
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchRegexSuffix()
+  {
+    return Filters.computeDefaultBitmapResults(REGEX_SUFFIX, selector);
+  }
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchLikeContains()
+  {
+    return Filters.computeDefaultBitmapResults(LIKE_CONTAINS, selector);
+  }
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchRegexContains()
+  {
+    return Filters.computeDefaultBitmapResults(REGEX_CONTAINS, selector);
+  }
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchLikeComplexContains()
+  {
+    return Filters.computeDefaultBitmapResults(LIKE_COMPLEX_CONTAINS, selector);
+  }
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchRegexComplexContains()
+  {
+    return Filters.computeDefaultBitmapResults(REGEX_COMPLEX_CONTAINS, selector);
+  }
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchLikeKiller()
+  {
+    return Filters.computeDefaultBitmapResults(LIKE_KILLER, selector);
+  }
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchRegexKiller()
+  {
+    return Filters.computeDefaultBitmapResults(REGEX_KILLER, selector);
   }
   private List<Integer> generateInts()


@@ -35,8 +35,11 @@ import org.apache.druid.segment.filter.LikeFilter;
 import javax.annotation.Nullable;
 import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
 import java.util.Objects;
 import java.util.Set;
+import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFilter
@@ -44,7 +47,6 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
   // Regex matching characters that are definitely okay to include unescaped in a regex.
   // Leads to excessively paranoid escaping, although shouldn't affect runtime beyond compiling the regex.
   private static final Pattern DEFINITELY_FINE = Pattern.compile("[\\w\\d\\s-]");
-  private static final String WILDCARD = ".*";
   private final String dimension;
   private final String pattern;
@@ -73,7 +75,7 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
     if (escape != null && escape.length() != 1) {
       throw new IllegalArgumentException("Escape must be null or a single character");
     } else {
-      this.escapeChar = (escape == null || escape.isEmpty()) ? null : escape.charAt(0);
+      this.escapeChar = escape == null ? null : escape.charAt(0);
     }
     this.likeMatcher = LikeMatcher.from(pattern, this.escapeChar);
@@ -214,8 +216,8 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
     // Prefix that matching strings are known to start with. May be empty.
     private final String prefix;
-    // Regex pattern that describes matching strings.
-    private final Pattern pattern;
+    // Regex patterns that describe matching strings.
+    private final List<Pattern> pattern;
     private final String likePattern;
@@ -223,7 +225,7 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
         final String likePattern,
         final SuffixMatch suffixMatch,
         final String prefix,
-        final Pattern pattern
+        final List<Pattern> pattern
     )
     {
       this.likePattern = likePattern;
@@ -238,7 +240,10 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
     )
     {
       final StringBuilder prefix = new StringBuilder();
-      final StringBuilder regex = new StringBuilder();
+      // Splits the input on % to leave only eagerly-matchable sub-patterns. This is to avoid catastrophic backtracking:
+      // https://www.rexegg.com/regex-explosive-quantifiers.html#remote
+      final List<Pattern> pattern = new ArrayList<>();
+      final StringBuilder regex = new StringBuilder("^");
       boolean escaping = false;
       boolean inPrefix = true;
       SuffixMatch suffixMatch = SuffixMatch.MATCH_EMPTY;
@@ -251,11 +256,16 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
           if (suffixMatch == SuffixMatch.MATCH_EMPTY) {
             suffixMatch = SuffixMatch.MATCH_ANY;
           }
-          regex.append(WILDCARD);
+          if (regex.length() > 0) {
+            if (regex.length() > 1 || regex.charAt(0) != '^') {
+              pattern.add(Pattern.compile(regex.toString(), Pattern.DOTALL));
+            }
+            regex.setLength(0);
+          }
         } else if (c == '_' && !escaping) {
           inPrefix = false;
           suffixMatch = SuffixMatch.MATCH_PATTERN;
-          regex.append(".");
+          regex.append('.');
         } else {
           if (inPrefix) {
             prefix.append(c);
@@ -267,7 +277,14 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
         }
       }
-      return new LikeMatcher(likePattern, suffixMatch, prefix.toString(), Pattern.compile(regex.toString(), Pattern.DOTALL));
+      if (likePattern.isEmpty()) {
+        pattern.add(Pattern.compile("^$"));
+      } else if (regex.length() > 0) {
+        regex.append('$');
+        pattern.add(Pattern.compile(regex.toString(), Pattern.DOTALL));
+      }
+      return new LikeMatcher(likePattern, suffixMatch, prefix.toString(), pattern);
     }
     private static void addPatternCharacter(final StringBuilder patternBuilder, final char c)
@@ -284,13 +301,31 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
       return matches(s, pattern);
     }
-    private static DruidPredicateMatch matches(@Nullable final String s, Pattern pattern)
+    private static DruidPredicateMatch matches(@Nullable final String s, List<Pattern> pattern)
     {
       String val = NullHandling.nullToEmptyIfNeeded(s);
       if (val == null) {
         return DruidPredicateMatch.UNKNOWN;
       }
-      return DruidPredicateMatch.of(pattern.matcher(val).matches());
+      if (pattern.size() == 1) {
+        // Most common case is a single pattern: a% => ^a, %z => z$, %m% => m
+        return DruidPredicateMatch.of(pattern.get(0).matcher(val).find());
+      }
+      int offset = 0;
+      for (Pattern part : pattern) {
+        Matcher matcher = part.matcher(val);
+        if (!matcher.find(offset)) {
+          return DruidPredicateMatch.FALSE;
+        }
+        offset = matcher.end();
+      }
+      return DruidPredicateMatch.TRUE;
     }
     /**
@@ -324,13 +359,19 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
       return suffixMatch;
     }
+    @VisibleForTesting
+    String describeCompilation()
+    {
+      return likePattern + " => " + prefix + ":" + pattern;
+    }
     @VisibleForTesting
     static class PatternDruidPredicateFactory implements DruidPredicateFactory
     {
       private final ExtractionFn extractionFn;
-      private final Pattern pattern;
+      private final List<Pattern> pattern;
-      PatternDruidPredicateFactory(ExtractionFn extractionFn, Pattern pattern)
+      PatternDruidPredicateFactory(ExtractionFn extractionFn, List<Pattern> pattern)
       {
         this.extractionFn = extractionFn;
         this.pattern = pattern;


@@ -22,6 +22,7 @@ package org.apache.druid.query.filter;
 import com.fasterxml.jackson.databind.ObjectMapper;
 import com.google.common.collect.Sets;
 import nl.jqno.equalsverifier.EqualsVerifier;
+import org.apache.druid.common.config.NullHandling;
 import org.apache.druid.jackson.DefaultObjectMapper;
 import org.apache.druid.query.extraction.SubstringDimExtractionFn;
 import org.apache.druid.segment.column.ColumnIndexSupplier;
@@ -146,4 +147,191 @@ public class LikeDimFilterTest extends InitializedNullHandlingTest
     final BitmapColumnIndex retVal = likeFilter.getBitmapColumnIndex(indexSelector);
     Assert.assertSame("likeFilter returns the intended bitmapColumnIndex", bitmapColumnIndex, retVal);
   }
+  @Test
+  public void testPatternCompilation()
+  {
+    assertCompilation("", ":[^$]");
+    assertCompilation("a", "a:[^a$]");
+    assertCompilation("abc", "abc:[^abc$]");
+    assertCompilation("a%", "a:[^a]");
+    assertCompilation("%a", ":[a$]");
+    assertCompilation("%a%", ":[a]");
+    assertCompilation("%_a", ":[.a$]");
+    assertCompilation("_%a", ":[^., a$]");
+    assertCompilation("_%_a", ":[^., .a$]");
+    assertCompilation("abc%", "abc:[^abc]");
+    assertCompilation("a%b", "a:[^a, b$]");
+    assertCompilation("abc%x", "abc:[^abc, x$]");
+    assertCompilation("abc%xyz", "abc:[^abc, xyz$]");
+    assertCompilation("____", ":[^....$]");
+    assertCompilation("%%%%", ":[]");
+    assertCompilation("%_%_%%__", ":[., ., ..$]");
+    assertCompilation("%_%a_%bc%_d_", ":[., a., bc, .d.$]");
+    assertCompilation("%1 _ 5%6", ":[1 . 5, 6$]");
+    assertCompilation("\\%_%a_\\%b\\\\c\\___%_%_d_w%x_y_z", "%:[^\\u0025., a.\\u0025b\\u005Cc_.., ., .d.w, x.y.z$]");
+  }
+  @Test
+  public void testPatternEmpty()
+  {
+    assertMatch("", null, NullHandling.replaceWithDefault() ? DruidPredicateMatch.TRUE : DruidPredicateMatch.UNKNOWN);
+    assertMatch("", "", DruidPredicateMatch.TRUE);
+    assertMatch("", "a", DruidPredicateMatch.FALSE);
+    assertMatch("", "This is a test!", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternExactMatch()
+  {
+    assertMatch("a\nb", "a\nb", DruidPredicateMatch.TRUE);
+    assertMatch("a\nb", "a\nc", DruidPredicateMatch.FALSE);
+    assertMatch("This is a test", "This is a test", DruidPredicateMatch.TRUE);
+    assertMatch("This is a test", "this is a test", DruidPredicateMatch.FALSE);
+    assertMatch("This is a test", "This is a tes", DruidPredicateMatch.FALSE);
+    assertMatch("This is a test", "his is a test", DruidPredicateMatch.FALSE);
+    assertMatch("This \\%is a\\_test", "This %is a_test", DruidPredicateMatch.TRUE);
+    assertMatch("This \\%is a\\_test", "This \\%is a_test", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternTrickySuffixes()
+  {
+    assertMatch("%xyz", "abcxyzxyz", DruidPredicateMatch.TRUE);
+    assertMatch("ab%bc", "abc", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternOnlySpecial()
+  {
+    assertMatch("%", null, NullHandling.replaceWithDefault() ? DruidPredicateMatch.TRUE : DruidPredicateMatch.UNKNOWN);
+    assertMatch("%", "", DruidPredicateMatch.TRUE);
+    assertMatch("%", "abcxyzxyz", DruidPredicateMatch.TRUE);
+    assertMatch("_", null, NullHandling.replaceWithDefault() ? DruidPredicateMatch.FALSE : DruidPredicateMatch.UNKNOWN);
+    assertMatch("_", "", DruidPredicateMatch.FALSE);
+    assertMatch("_", "a", DruidPredicateMatch.TRUE);
+    assertMatch("_", "ab", DruidPredicateMatch.FALSE);
+    assertMatch("____", "abc", DruidPredicateMatch.FALSE);
+    assertMatch("____", "abcd", DruidPredicateMatch.TRUE);
+    assertMatch("____", "abcde", DruidPredicateMatch.FALSE);
+    assertMatch("%____", "abcde", DruidPredicateMatch.TRUE);
+    assertMatch("%____", "abcd", DruidPredicateMatch.TRUE);
+    assertMatch("%____", "abc", DruidPredicateMatch.FALSE);
+    assertMatch("__%_%%_", "abc", DruidPredicateMatch.FALSE);
+    assertMatch("__%_%%_", "abcd", DruidPredicateMatch.TRUE);
+    assertMatch("__%_%%_", "abcdxyz", DruidPredicateMatch.TRUE);
+    assertMatch("%__%_%%_%", "abc", DruidPredicateMatch.FALSE);
+    assertMatch("%__%_%%_%", "abcd", DruidPredicateMatch.TRUE);
+    assertMatch("%__%_%%_%", "abcdxyz", DruidPredicateMatch.TRUE);
+  }
+  @Test
+  public void testPatternTrailingWildcard()
+  {
+    assertMatch("ab%", "abc", DruidPredicateMatch.TRUE);
+    assertMatch("ab%", "ab", DruidPredicateMatch.TRUE);
+    assertMatch("ab%", "a", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternLeadingWildcard()
+  {
+    assertMatch("%yz", "xyz", DruidPredicateMatch.TRUE);
+    assertMatch("%yz", "yz", DruidPredicateMatch.TRUE);
+    assertMatch("%yz", "z", DruidPredicateMatch.FALSE);
+    assertMatch("%yz", "wxyz", DruidPredicateMatch.TRUE);
+    assertMatch("%yz", "xyza", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternTrailingAny()
+  {
+    assertMatch("ab_", "abc", DruidPredicateMatch.TRUE);
+    assertMatch("ab_", "ab", DruidPredicateMatch.FALSE);
+    assertMatch("ab_", "abcd", DruidPredicateMatch.FALSE);
+    assertMatch("ab_", "xabc", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternLeadingAny()
+  {
+    assertMatch("_yz", "xyz", DruidPredicateMatch.TRUE);
+    assertMatch("_yz", "yz", DruidPredicateMatch.FALSE);
+    assertMatch("_yz", "wxyz", DruidPredicateMatch.FALSE);
+    assertMatch("_yz", "xyza", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternLeadingAndTrailing()
+  {
+    assertMatch("_jkl_", "jkl", DruidPredicateMatch.FALSE);
+    assertMatch("_jkl_", "ijklm", DruidPredicateMatch.TRUE);
+    assertMatch("_jkl_", "ijklmn", DruidPredicateMatch.FALSE);
+    assertMatch("_jkl_", "hijklm", DruidPredicateMatch.FALSE);
+    assertMatch("%jkl%", "jkl", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl%", "ijklm", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl%", "ijklmn", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl%", "hijklm", DruidPredicateMatch.TRUE);
+    assertMatch("_jkl%", "jkl", DruidPredicateMatch.FALSE);
+    assertMatch("_jkl%", "ijklm", DruidPredicateMatch.TRUE);
+    assertMatch("_jkl%", "ijklmn", DruidPredicateMatch.TRUE);
+    assertMatch("_jkl%", "hijklm", DruidPredicateMatch.FALSE);
+    assertMatch("_jkl%", "hijklmn", DruidPredicateMatch.FALSE);
+    assertMatch("%jkl_", "jkl", DruidPredicateMatch.FALSE);
+    assertMatch("%jkl_", "ijklm", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl_", "ijklmn", DruidPredicateMatch.FALSE);
+    assertMatch("%jkl_", "hijklm", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl_", "hijklmn", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternSuffixWithManyParts()
+  {
+    assertMatch("%ba_", "foo bar", DruidPredicateMatch.TRUE);
+    assertMatch("%ba_", "foo bar daz", DruidPredicateMatch.FALSE);
+    assertMatch("%ba_%", "foo bar baz", DruidPredicateMatch.TRUE);
+    assertMatch("a%b_d_", "abcde", DruidPredicateMatch.TRUE);
+    assertMatch("a%b_d_", "abcdexyzbcde", DruidPredicateMatch.TRUE);
+    assertMatch("%b_d_", "abcde", DruidPredicateMatch.TRUE);
+    assertMatch("%b_d_", "abcdexyzbcde", DruidPredicateMatch.TRUE);
+    assertMatch("%b_d_", "abcdexyzbcdef", DruidPredicateMatch.FALSE);
+    assertMatch("%b_d_", "abcdexyzbcd", DruidPredicateMatch.FALSE);
+    assertMatch("%z%_b_d_", "abcdexyzabcde", DruidPredicateMatch.TRUE);
+    assertMatch("%z%_b_d_", "abcdexyzbcde", DruidPredicateMatch.FALSE);
+    assertMatch("%z%_b_d_", "abcdexybcde", DruidPredicateMatch.FALSE);
+    assertMatch("%z%_b_d_", "abcdexbcde", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternNoWildcards()
+  {
+    assertMatch("a_c_e_", "abcdef", DruidPredicateMatch.TRUE);
+    assertMatch("a_c_e_", "abcde", DruidPredicateMatch.FALSE);
+    assertMatch("x_c_e_", "abcdef", DruidPredicateMatch.FALSE);
+    assertMatch("xa_c_e_", "abcdef", DruidPredicateMatch.FALSE);
+    assertMatch("a_c_e_x", "abcde", DruidPredicateMatch.FALSE);
+  }
+  @Test
+  public void testPatternFindsCorrectMiddleMatch()
+  {
+    assertMatch("%km%z", "akmz", DruidPredicateMatch.TRUE);
+    assertMatch("%km%z", "akkmz", DruidPredicateMatch.TRUE);
+    assertMatch("%xy%yz", "xyz", DruidPredicateMatch.FALSE);
+    assertMatch("%xy%yz", "xyyz", DruidPredicateMatch.TRUE);
+    assertMatch("%1 _ 5%6", "1 2 3 1 4 5 6", DruidPredicateMatch.TRUE);
+    assertMatch("1 _ 5%6", "1 2 3 1 4 5 6", DruidPredicateMatch.FALSE);
+  }
+  private void assertCompilation(String pattern, String expected)
+  {
+    LikeDimFilter.LikeMatcher matcher = LikeDimFilter.LikeMatcher.from(pattern, '\\');
+    Assert.assertEquals(pattern + " => " + expected, matcher.describeCompilation());
+  }
+  private void assertMatch(String pattern, String value, DruidPredicateMatch expected)
+  {
+    LikeDimFilter.LikeMatcher matcher = LikeDimFilter.LikeMatcher.from(pattern, '\\');
+    Assert.assertEquals(matcher + " matches " + value, expected, matcher.matches(value));
+  }
 }