mirror of https://github.com/apache/druid.git
Improve worst-case performance of LIKE filters by 20x (#16153)
* Expected-linear-time LIKE

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simple regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote).

This MR implements a simple greedy algorithm that avoids backtracking. Technically, the algorithm runs in `O(nm)`, where `n` is the length of the string to match and `m` is the length of the pattern. In practice, it should run in linear time: essentially as fast as `String.indexOf()` can search for the next match.

Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.7x speedup for a simple "contains" query (`%50%`), and more than a 20x speedup for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the `REGEX` operator still suffers from the same potentially catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental (or intentional) DoSing.
```
Benchmark                                (cardinality)  Mode  Cnt  Before Score       Error   After Score       Error  Units  Before / After
LikeFilterBenchmark.matchBoundPrefix              1000  avgt   10         6.686 ±     0.026         6.765 ±     0.087  us/op           0.99x
LikeFilterBenchmark.matchBoundPrefix            100000  avgt   10       163.936 ±     1.589       140.014 ±     0.563  us/op           1.17x
LikeFilterBenchmark.matchBoundPrefix           1000000  avgt   10      1235.259 ±     7.318      1165.330 ±     9.300  us/op           1.06x
LikeFilterBenchmark.matchLikeContains             1000  avgt   10       255.074 ±     1.530       130.212 ±     3.314  us/op           1.96x
LikeFilterBenchmark.matchLikeContains           100000  avgt   10     34789.639 ±   210.219     18563.644 ±   100.030  us/op           1.87x
LikeFilterBenchmark.matchLikeContains          1000000  avgt   10    287265.302 ±  1790.957    164684.778 ±   317.698  us/op           1.74x
LikeFilterBenchmark.matchLikeEquals               1000  avgt   10         0.410 ±     0.003         0.399 ±     0.001  us/op           1.03x
LikeFilterBenchmark.matchLikeEquals             100000  avgt   10         0.793 ±     0.005         0.719 ±     0.003  us/op           1.10x
LikeFilterBenchmark.matchLikeEquals            1000000  avgt   10         0.864 ±     0.004         0.839 ±     0.005  us/op           1.03x
LikeFilterBenchmark.matchLikeKiller               1000  avgt   10      3077.629 ±     7.928       103.714 ±     2.417  us/op          29.67x
LikeFilterBenchmark.matchLikeKiller             100000  avgt   10    311048.049 ± 13466.911     14777.567 ±    70.242  us/op          21.05x
LikeFilterBenchmark.matchLikeKiller            1000000  avgt   10   3055855.099 ± 18387.839     92476.621 ±  1198.255  us/op          33.04x
LikeFilterBenchmark.matchLikePrefix               1000  avgt   10         6.711 ±     0.035         6.653 ±     0.046  us/op           1.01x
LikeFilterBenchmark.matchLikePrefix             100000  avgt   10       161.535 ±     0.574       163.740 ±     0.833  us/op           0.99x
LikeFilterBenchmark.matchLikePrefix            1000000  avgt   10      1255.696 ±     5.207      1201.378 ±     3.466  us/op           1.05x
LikeFilterBenchmark.matchRegexContains            1000  avgt   10       467.736 ±     2.546       481.431 ±     5.647  us/op           0.97x
LikeFilterBenchmark.matchRegexContains          100000  avgt   10     64871.766 ±   223.341     65483.992 ±   391.249  us/op           0.99x
LikeFilterBenchmark.matchRegexContains         1000000  avgt   10    482906.004 ±  2003.583    477195.835 ±  3094.605  us/op           1.01x
LikeFilterBenchmark.matchRegexKiller              1000  avgt   10      8071.881 ±    18.026      8052.322 ±    17.336  us/op           1.00x
LikeFilterBenchmark.matchRegexKiller            100000  avgt   10   1120094.520 ±  2428.172    808321.542 ±  2411.032  us/op           1.39x
LikeFilterBenchmark.matchRegexKiller           1000000  avgt   10   8096745.012 ± 40782.747   8114114.896 ± 43250.204  us/op           1.00x
LikeFilterBenchmark.matchRegexPrefix              1000  avgt   10       170.843 ±     1.095       175.924 ±     1.144  us/op           0.97x
LikeFilterBenchmark.matchRegexPrefix            100000  avgt   10     17785.280 ±   116.813     18708.888 ±    61.857  us/op           0.95x
LikeFilterBenchmark.matchRegexPrefix           1000000  avgt   10    174415.586 ±  1827.478    173190.799 ±   949.224  us/op           1.01x
LikeFilterBenchmark.matchSelectorEquals           1000  avgt   10         0.411 ±     0.003         0.416 ±     0.002  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals         100000  avgt   10         0.728 ±     0.003         0.739 ±     0.003  us/op           0.99x
LikeFilterBenchmark.matchSelectorEquals        1000000  avgt   10         0.842 ±     0.002         0.879 ±     0.007  us/op           0.96x
```

* Take into account whether druid.generic.useDefaultValueForNull is set in LikeDimFilterTest assertions.
* Attempt to placate CodeQL.
* Fix handling of multi-pattern suffixes.
* Expected-linear-time LIKE

`LikeDimFilter` was compiling the `LIKE` clause down to a `java.util.regex.Pattern`. Unfortunately, even seemingly simple regexes can lead to [catastrophic backtracking](https://www.regular-expressions.info/catastrophic.html). In particular, something as simple as a few `%` wildcards can end up [exploding the time complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote).

This MR implements a simple greedy algorithm that avoids the catastrophic backtracking by converting the `LIKE` pattern into a list of `java.util.regex.Pattern`, splitting on the `%` wildcard. The resulting sub-patterns do no backtracking, and a simple greedy loop uses `Matcher.find()` to progress through the string.
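The split-and-scan idea described above can be sketched in isolation. This is a minimal illustration, not the code from this commit: the class `GreedyLikeSketch` and its methods are hypothetical names, escape characters are not handled, and literals are quoted with `Pattern.quote` rather than with Druid's own escaping helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Druid.
final class GreedyLikeSketch
{
  // Splits a LIKE pattern on '%' into regex parts that contain no quantifiers.
  // Assumption: there is no escape character, so '%' and '_' are always wildcards.
  static List<Pattern> compile(String like)
  {
    List<Pattern> parts = new ArrayList<>();
    StringBuilder regex = new StringBuilder("^"); // the first part is anchored at the start
    for (int i = 0; i < like.length(); i++) {
      char c = like.charAt(i);
      if (c == '%') {
        // Emit the part collected so far, unless it is empty or only the start anchor.
        if (regex.length() > 0 && !regex.toString().equals("^")) {
          parts.add(Pattern.compile(regex.toString(), Pattern.DOTALL));
        }
        regex.setLength(0);
      } else if (c == '_') {
        regex.append('.');
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    if (regex.length() > 0) {
      // No trailing '%': the last part must end exactly at the end of the input.
      // This sketch uses \z (end of input); the commit's code appends '$' here.
      regex.append("\\z");
      parts.add(Pattern.compile(regex.toString(), Pattern.DOTALL));
    }
    return parts;
  }

  // Greedy scan: each part must be found at or after the end of the previous match.
  static boolean matches(List<Pattern> parts, String value)
  {
    int offset = 0;
    for (Pattern part : parts) {
      Matcher matcher = part.matcher(value);
      if (!matcher.find(offset)) {
        return false;
      }
      offset = matcher.end();
    }
    return true;
  }

  public static void main(String[] args)
  {
    System.out.println(matches(compile("%xy%yz"), "xyyz"));           // true
    System.out.println(matches(compile("%%%%x"), "aaaaaaaaaaaaaaaa")); // false
  }
}
```

Taking the leftmost match for every part is safe because a later match can only leave less of the string for the remaining parts, so the loop never needs to revisit an earlier choice.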
Running an updated version of the `LikeFilterBenchmark` with Java 11 on a `t2.xlarge` instance showed at least a 1.15x speedup for a simple "contains" query (`%50%`), and more than a 20x speedup for a "killer" query with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: cases with longer strings should benefit more.

Note that the `REGEX` operator still suffers from the same potentially catastrophic runtimes. Using a better library than the built-in `java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would be a good idea to avoid accidental (or intentional) DoSing.

```
Benchmark                                      (cardinality)  Mode  Cnt  Before Score       Error   After Score       Error  Units  Before/After
LikeFilterBenchmark.matchBoundPrefix                    1000  avgt   10         5.410 ±     0.010         5.582 ±     0.004  us/op         0.97x
LikeFilterBenchmark.matchBoundPrefix                  100000  avgt   10       140.920 ±     0.306       141.082 ±     0.391  us/op         1.00x
LikeFilterBenchmark.matchBoundPrefix                 1000000  avgt   10      1082.762 ±     1.070      1171.407 ±     1.628  us/op         0.92x
LikeFilterBenchmark.matchLikeComplexContains            1000  avgt   10       221.572 ±     0.228       183.742 ±     0.210  us/op         1.21x
LikeFilterBenchmark.matchLikeComplexContains          100000  avgt   10     25461.362 ±    21.481     17373.828 ±    42.577  us/op         1.47x
LikeFilterBenchmark.matchLikeComplexContains         1000000  avgt   10    221075.917 ±   919.238    177454.683 ±   506.420  us/op         1.25x
LikeFilterBenchmark.matchLikeContains                   1000  avgt   10       283.015 ±     0.219       218.835 ±     3.126  us/op         1.29x
LikeFilterBenchmark.matchLikeContains                 100000  avgt   10     30202.910 ±    32.697     26713.488 ±    49.525  us/op         1.13x
LikeFilterBenchmark.matchLikeContains                1000000  avgt   10    284661.411 ±   130.324    243381.857 ±   540.143  us/op         1.17x
LikeFilterBenchmark.matchLikeEquals                     1000  avgt   10         0.386 ±     0.001         0.380 ±     0.001  us/op         1.02x
LikeFilterBenchmark.matchLikeEquals                   100000  avgt   10         0.670 ±     0.001         0.705 ±     0.002  us/op         0.95x
LikeFilterBenchmark.matchLikeEquals                  1000000  avgt   10         0.839 ±     0.001         0.796 ±     0.001  us/op         1.05x
LikeFilterBenchmark.matchLikeKiller                     1000  avgt   10      4882.099 ±     7.953       170.142 ±     0.494  us/op        28.69x
LikeFilterBenchmark.matchLikeKiller                   100000  avgt   10    524122.010 ±   390.170     19461.637 ±   117.090  us/op        26.93x
LikeFilterBenchmark.matchLikeKiller                  1000000  avgt   10   5121795.377 ±  4176.052    181162.978 ±   368.443  us/op        28.27x
LikeFilterBenchmark.matchLikePrefix                     1000  avgt   10         5.708 ±     0.005         5.677 ±     0.011  us/op         1.01x
LikeFilterBenchmark.matchLikePrefix                   100000  avgt   10       141.853 ±     0.554       108.313 ±     0.330  us/op         1.31x
LikeFilterBenchmark.matchLikePrefix                  1000000  avgt   10      1199.148 ±     1.298      1153.297 ±     1.575  us/op         1.04x
LikeFilterBenchmark.matchLikeSuffix                     1000  avgt   10       256.020 ±     0.283       196.339 ±     0.564  us/op         1.30x
LikeFilterBenchmark.matchLikeSuffix                   100000  avgt   10     29917.931 ±    28.218     21450.997 ±    20.341  us/op         1.39x
LikeFilterBenchmark.matchLikeSuffix                  1000000  avgt   10    241225.193 ±   465.824    194034.292 ±   362.312  us/op         1.24x
LikeFilterBenchmark.matchRegexComplexContains           1000  avgt   10       119.597 ±     0.635       135.550 ±     0.697  us/op         0.88x
LikeFilterBenchmark.matchRegexComplexContains         100000  avgt   10     13089.670 ±    13.738     13766.712 ±    12.802  us/op         0.95x
LikeFilterBenchmark.matchRegexComplexContains        1000000  avgt   10    130822.830 ±  1624.048    131076.029 ±  1636.811  us/op         1.00x
LikeFilterBenchmark.matchRegexContains                  1000  avgt   10       573.273 ±     0.421       615.399 ±     0.633  us/op         0.93x
LikeFilterBenchmark.matchRegexContains                100000  avgt   10     57259.313 ±   162.747     62900.380 ±    44.746  us/op         0.91x
LikeFilterBenchmark.matchRegexContains               1000000  avgt   10    571335.768 ±  2822.776    542536.982 ±   780.290  us/op         1.05x
LikeFilterBenchmark.matchRegexKiller                    1000  avgt   10     11525.499 ±     8.741     11061.791 ±    21.746  us/op         1.04x
LikeFilterBenchmark.matchRegexKiller                  100000  avgt   10   1170414.723 ±   766.160   1144437.291 ±   886.263  us/op         1.02x
LikeFilterBenchmark.matchRegexKiller                 1000000  avgt   10  11507668.302 ± 11318.176  10381620.014 ± 10707.974  us/op         1.11x
LikeFilterBenchmark.matchRegexPrefix                    1000  avgt   10       156.460 ±     0.097       155.217 ±     0.431  us/op         1.01x
LikeFilterBenchmark.matchRegexPrefix                  100000  avgt   10     15056.491 ±    23.906     15508.965 ±   763.976  us/op         0.97x
LikeFilterBenchmark.matchRegexPrefix                 1000000  avgt   10    154416.563 ±   473.108    153737.912 ±   273.347  us/op         1.00x
LikeFilterBenchmark.matchRegexSuffix                    1000  avgt   10       610.684 ±     0.462       590.352 ±     0.334  us/op         1.03x
LikeFilterBenchmark.matchRegexSuffix                  100000  avgt   10     53196.517 ±    78.155     59460.261 ±    56.934  us/op         0.89x
LikeFilterBenchmark.matchRegexSuffix                 1000000  avgt   10    536100.944 ±   440.353    550098.917 ±   740.464  us/op         0.97x
LikeFilterBenchmark.matchSelectorEquals                 1000  avgt   10         0.390 ±     0.001         0.366 ±     0.001  us/op         1.07x
LikeFilterBenchmark.matchSelectorEquals               100000  avgt   10         0.724 ±     0.001         0.714 ±     0.001  us/op         1.01x
LikeFilterBenchmark.matchSelectorEquals              1000000  avgt   10         0.826 ±     0.001         0.847 ±     0.001  us/op         0.98x
```
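The first description in the message above mentions a scan that moves about as fast as `String.indexOf()` can find the next match. A rough sketch of that idea, restricted to patterns that contain only literals and `%` (no `_`, no escape character), might look like the following. The class name `IndexOfLikeSketch` is hypothetical and this is not the code merged in this commit.

```java
// Hypothetical illustration class; not part of Druid.
final class IndexOfLikeSketch
{
  // Matches a LIKE pattern made only of literal text and '%' wildcards.
  // Assumption: '_' and escape characters are not supported in this sketch.
  static boolean matches(String like, String value)
  {
    // Split on '%', keeping empty leading and trailing segments.
    String[] segments = like.split("%", -1);
    if (segments.length == 1) {
      return value.equals(like); // no wildcard at all
    }

    String first = segments[0];
    String last = segments[segments.length - 1];

    // The first segment must be a prefix and the last one a suffix,
    // and the two must not overlap inside the value.
    int offset = first.length();
    int limit = value.length() - last.length();
    if (limit < offset || !value.startsWith(first) || !value.endsWith(last)) {
      return false;
    }

    // Middle segments: always take the leftmost occurrence after the previous one.
    for (int i = 1; i < segments.length - 1; i++) {
      int at = value.indexOf(segments[i], offset);
      if (at < 0 || at + segments[i].length() > limit) {
        return false;
      }
      offset = at + segments[i].length();
    }
    return true;
  }

  public static void main(String[] args)
  {
    System.out.println(matches("%50%", "1500"));     // true
    System.out.println(matches("%%%%x", "aaaaaaaa")); // false
  }
}
```

Each character of the value is passed over by at most one `indexOf` call per segment, which is where the `O(nm)` bound in the message comes from; for typical patterns the scan is close to a single pass.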
parent f1d24c868f
commit 4bdc1890f7
@@ -49,7 +49,6 @@ import org.openjdk.jmh.annotations.Scope;
 import org.openjdk.jmh.annotations.Setup;
 import org.openjdk.jmh.annotations.State;
 import org.openjdk.jmh.annotations.Warmup;
-import org.openjdk.jmh.infra.Blackhole;
 
 import java.nio.ByteBuffer;
 import java.util.ArrayList;
@@ -106,6 +105,58 @@ public class LikeFilterBenchmark
       null
   ).toFilter();
 
+  private static final Filter REGEX_SUFFIX = new RegexDimFilter(
+      "foo",
+      ".*50$",
+      null
+  ).toFilter();
+
+  private static final Filter LIKE_SUFFIX = new LikeDimFilter(
+      "foo",
+      "%50",
+      null,
+      null
+  ).toFilter();
+
+  private static final Filter LIKE_CONTAINS = new LikeDimFilter(
+      "foo",
+      "%50%",
+      null,
+      null
+  ).toFilter();
+
+  private static final Filter REGEX_CONTAINS = new RegexDimFilter(
+      "foo",
+      ".*50.*",
+      null
+  ).toFilter();
+
+  private static final Filter LIKE_COMPLEX_CONTAINS = new LikeDimFilter(
+      "foo",
+      "%5_0%0_5%",
+      null,
+      null
+  ).toFilter();
+
+  private static final Filter REGEX_COMPLEX_CONTAINS = new RegexDimFilter(
+      "foo",
+      "%5_0%0_5",
+      null
+  ).toFilter();
+
+  private static final Filter LIKE_KILLER = new LikeDimFilter(
+      "foo",
+      "%%%%x",
+      null,
+      null
+  ).toFilter();
+
+  private static final Filter REGEX_KILLER = new RegexDimFilter(
+      "foo",
+      ".*.*.*.*x",
+      null
+  ).toFilter();
+
   // cardinality, the dictionary will contain evenly spaced integers
   @Param({"1000", "100000", "1000000"})
   int cardinality;
@@ -147,46 +198,105 @@ public class LikeFilterBenchmark
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchLikeEquals(Blackhole blackhole)
+  public ImmutableBitmap matchLikeEquals()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(LIKE_EQUALS, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(LIKE_EQUALS, selector);
   }
 
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchSelectorEquals(Blackhole blackhole)
+  public ImmutableBitmap matchSelectorEquals()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(SELECTOR_EQUALS, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(SELECTOR_EQUALS, selector);
   }
 
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchLikePrefix(Blackhole blackhole)
+  public ImmutableBitmap matchLikePrefix()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(LIKE_PREFIX, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(LIKE_PREFIX, selector);
   }
 
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchBoundPrefix(Blackhole blackhole)
+  public ImmutableBitmap matchBoundPrefix()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(BOUND_PREFIX, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(BOUND_PREFIX, selector);
   }
 
   @Benchmark
   @BenchmarkMode(Mode.AverageTime)
   @OutputTimeUnit(TimeUnit.MICROSECONDS)
-  public void matchRegexPrefix(Blackhole blackhole)
+  public ImmutableBitmap matchRegexPrefix()
   {
-    final ImmutableBitmap bitmapIndex = Filters.computeDefaultBitmapResults(REGEX_PREFIX, selector);
-    blackhole.consume(bitmapIndex);
+    return Filters.computeDefaultBitmapResults(REGEX_PREFIX, selector);
   }
 
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchLikeSuffix()
+  {
+    return Filters.computeDefaultBitmapResults(LIKE_SUFFIX, selector);
+  }
+
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchRegexSuffix()
+  {
+    return Filters.computeDefaultBitmapResults(REGEX_SUFFIX, selector);
+  }
+
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchLikeContains()
+  {
+    return Filters.computeDefaultBitmapResults(LIKE_CONTAINS, selector);
+  }
+
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchRegexContains()
+  {
+    return Filters.computeDefaultBitmapResults(REGEX_CONTAINS, selector);
+  }
+
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchLikeComplexContains()
+  {
+    return Filters.computeDefaultBitmapResults(LIKE_COMPLEX_CONTAINS, selector);
+  }
+
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchRegexComplexContains()
+  {
+    return Filters.computeDefaultBitmapResults(REGEX_COMPLEX_CONTAINS, selector);
+  }
+
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchLikeKiller()
+  {
+    return Filters.computeDefaultBitmapResults(LIKE_KILLER, selector);
+  }
+
+  @Benchmark
+  @BenchmarkMode(Mode.AverageTime)
+  @OutputTimeUnit(TimeUnit.MICROSECONDS)
+  public ImmutableBitmap matchRegexKiller()
+  {
+    return Filters.computeDefaultBitmapResults(REGEX_KILLER, selector);
+  }
+
   private List<Integer> generateInts()
@@ -35,8 +35,11 @@ import org.apache.druid.segment.filter.LikeFilter;
 
 import javax.annotation.Nullable;
 import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
 import java.util.Objects;
 import java.util.Set;
+import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 
 public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFilter
@@ -44,7 +47,6 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
   // Regex matching characters that are definitely okay to include unescaped in a regex.
   // Leads to excessively paranoid escaping, although shouldn't affect runtime beyond compiling the regex.
   private static final Pattern DEFINITELY_FINE = Pattern.compile("[\\w\\d\\s-]");
-  private static final String WILDCARD = ".*";
 
   private final String dimension;
   private final String pattern;
@@ -73,7 +75,7 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
     if (escape != null && escape.length() != 1) {
       throw new IllegalArgumentException("Escape must be null or a single character");
     } else {
-      this.escapeChar = (escape == null || escape.isEmpty()) ? null : escape.charAt(0);
+      this.escapeChar = escape == null ? null : escape.charAt(0);
     }
 
     this.likeMatcher = LikeMatcher.from(pattern, this.escapeChar);
@@ -214,8 +216,8 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
     // Prefix that matching strings are known to start with. May be empty.
     private final String prefix;
 
-    // Regex pattern that describes matching strings.
-    private final Pattern pattern;
+    // Regex patterns that describes matching strings.
+    private final List<Pattern> pattern;
 
     private final String likePattern;
 
@@ -223,7 +225,7 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
         final String likePattern,
         final SuffixMatch suffixMatch,
         final String prefix,
-        final Pattern pattern
+        final List<Pattern> pattern
     )
     {
       this.likePattern = likePattern;
@@ -238,7 +240,10 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
     )
     {
       final StringBuilder prefix = new StringBuilder();
-      final StringBuilder regex = new StringBuilder();
+      // Splits the input on % to leave only eagerly-matchable sub-patterns. This is to avoid catastrophic backtracking:
+      // https://www.rexegg.com/regex-explosive-quantifiers.html#remote
+      final List<Pattern> pattern = new ArrayList<>();
+      final StringBuilder regex = new StringBuilder("^");
       boolean escaping = false;
       boolean inPrefix = true;
       SuffixMatch suffixMatch = SuffixMatch.MATCH_EMPTY;
@@ -251,11 +256,16 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
           if (suffixMatch == SuffixMatch.MATCH_EMPTY) {
             suffixMatch = SuffixMatch.MATCH_ANY;
           }
-          regex.append(WILDCARD);
+          if (regex.length() > 0) {
+            if (regex.length() > 1 || regex.charAt(0) != '^') {
+              pattern.add(Pattern.compile(regex.toString(), Pattern.DOTALL));
+            }
+            regex.setLength(0);
+          }
         } else if (c == '_' && !escaping) {
           inPrefix = false;
           suffixMatch = SuffixMatch.MATCH_PATTERN;
-          regex.append(".");
+          regex.append('.');
         } else {
           if (inPrefix) {
             prefix.append(c);
@@ -267,7 +277,14 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
         }
       }
 
-      return new LikeMatcher(likePattern, suffixMatch, prefix.toString(), Pattern.compile(regex.toString(), Pattern.DOTALL));
+      if (likePattern.isEmpty()) {
+        pattern.add(Pattern.compile("^$"));
+      } else if (regex.length() > 0) {
+        regex.append('$');
+        pattern.add(Pattern.compile(regex.toString(), Pattern.DOTALL));
+      }
+
+      return new LikeMatcher(likePattern, suffixMatch, prefix.toString(), pattern);
     }
 
     private static void addPatternCharacter(final StringBuilder patternBuilder, final char c)
@@ -284,13 +301,31 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
       return matches(s, pattern);
     }
 
-    private static DruidPredicateMatch matches(@Nullable final String s, Pattern pattern)
+    private static DruidPredicateMatch matches(@Nullable final String s, List<Pattern> pattern)
     {
       String val = NullHandling.nullToEmptyIfNeeded(s);
       if (val == null) {
         return DruidPredicateMatch.UNKNOWN;
       }
-      return DruidPredicateMatch.of(pattern.matcher(val).matches());
+
+      if (pattern.size() == 1) {
+        // Most common case is a single pattern: a% => ^a, %z => z$, %m% => m
+        return DruidPredicateMatch.of(pattern.get(0).matcher(val).find());
+      }
+
+      int offset = 0;
+
+      for (Pattern part : pattern) {
+        Matcher matcher = part.matcher(val);
+
+        if (!matcher.find(offset)) {
+          return DruidPredicateMatch.FALSE;
+        }
+
+        offset = matcher.end();
+      }
+
+      return DruidPredicateMatch.TRUE;
     }
 
     /**
@@ -324,13 +359,19 @@ public class LikeDimFilter extends AbstractOptimizableDimFilter implements DimFi
       return suffixMatch;
     }
 
+    @VisibleForTesting
+    String describeCompilation()
+    {
+      return likePattern + " => " + prefix + ":" + pattern;
+    }
+
     @VisibleForTesting
     static class PatternDruidPredicateFactory implements DruidPredicateFactory
     {
       private final ExtractionFn extractionFn;
-      private final Pattern pattern;
+      private final List<Pattern> pattern;
 
-      PatternDruidPredicateFactory(ExtractionFn extractionFn, Pattern pattern)
+      PatternDruidPredicateFactory(ExtractionFn extractionFn, List<Pattern> pattern)
       {
         this.extractionFn = extractionFn;
         this.pattern = pattern;
@@ -22,6 +22,7 @@ package org.apache.druid.query.filter;
 import com.fasterxml.jackson.databind.ObjectMapper;
 import com.google.common.collect.Sets;
 import nl.jqno.equalsverifier.EqualsVerifier;
+import org.apache.druid.common.config.NullHandling;
 import org.apache.druid.jackson.DefaultObjectMapper;
 import org.apache.druid.query.extraction.SubstringDimExtractionFn;
 import org.apache.druid.segment.column.ColumnIndexSupplier;
@@ -146,4 +147,191 @@ public class LikeDimFilterTest extends InitializedNullHandlingTest
     final BitmapColumnIndex retVal = likeFilter.getBitmapColumnIndex(indexSelector);
     Assert.assertSame("likeFilter returns the intended bitmapColumnIndex", bitmapColumnIndex, retVal);
   }
+
+  @Test
+  public void testPatternCompilation()
+  {
+    assertCompilation("", ":[^$]");
+    assertCompilation("a", "a:[^a$]");
+    assertCompilation("abc", "abc:[^abc$]");
+    assertCompilation("a%", "a:[^a]");
+    assertCompilation("%a", ":[a$]");
+    assertCompilation("%a%", ":[a]");
+    assertCompilation("%_a", ":[.a$]");
+    assertCompilation("_%a", ":[^., a$]");
+    assertCompilation("_%_a", ":[^., .a$]");
+    assertCompilation("abc%", "abc:[^abc]");
+    assertCompilation("a%b", "a:[^a, b$]");
+    assertCompilation("abc%x", "abc:[^abc, x$]");
+    assertCompilation("abc%xyz", "abc:[^abc, xyz$]");
+    assertCompilation("____", ":[^....$]");
+    assertCompilation("%%%%", ":[]");
+    assertCompilation("%_%_%%__", ":[., ., ..$]");
+    assertCompilation("%_%a_%bc%_d_", ":[., a., bc, .d.$]");
+    assertCompilation("%1 _ 5%6", ":[1 . 5, 6$]");
+    assertCompilation("\\%_%a_\\%b\\\\c\\___%_%_d_w%x_y_z", "%:[^\\u0025., a.\\u0025b\\u005Cc_.., ., .d.w, x.y.z$]");
+  }
+
+  @Test
+  public void testPatternEmpty()
+  {
+    assertMatch("", null, NullHandling.replaceWithDefault() ? DruidPredicateMatch.TRUE : DruidPredicateMatch.UNKNOWN);
+    assertMatch("", "", DruidPredicateMatch.TRUE);
+    assertMatch("", "a", DruidPredicateMatch.FALSE);
+    assertMatch("", "This is a test!", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternExactMatch()
+  {
+    assertMatch("a\nb", "a\nb", DruidPredicateMatch.TRUE);
+    assertMatch("a\nb", "a\nc", DruidPredicateMatch.FALSE);
+    assertMatch("This is a test", "This is a test", DruidPredicateMatch.TRUE);
+    assertMatch("This is a test", "this is a test", DruidPredicateMatch.FALSE);
+    assertMatch("This is a test", "This is a tes", DruidPredicateMatch.FALSE);
+    assertMatch("This is a test", "his is a test", DruidPredicateMatch.FALSE);
+    assertMatch("This \\%is a\\_test", "This %is a_test", DruidPredicateMatch.TRUE);
+    assertMatch("This \\%is a\\_test", "This \\%is a_test", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternTrickySuffixes()
+  {
+    assertMatch("%xyz", "abcxyzxyz", DruidPredicateMatch.TRUE);
+    assertMatch("ab%bc", "abc", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternOnlySpecial()
+  {
+    assertMatch("%", null, NullHandling.replaceWithDefault() ? DruidPredicateMatch.TRUE : DruidPredicateMatch.UNKNOWN);
+    assertMatch("%", "", DruidPredicateMatch.TRUE);
+    assertMatch("%", "abcxyzxyz", DruidPredicateMatch.TRUE);
+    assertMatch("_", null, NullHandling.replaceWithDefault() ? DruidPredicateMatch.FALSE : DruidPredicateMatch.UNKNOWN);
+    assertMatch("_", "", DruidPredicateMatch.FALSE);
+    assertMatch("_", "a", DruidPredicateMatch.TRUE);
+    assertMatch("_", "ab", DruidPredicateMatch.FALSE);
+    assertMatch("____", "abc", DruidPredicateMatch.FALSE);
+    assertMatch("____", "abcd", DruidPredicateMatch.TRUE);
+    assertMatch("____", "abcde", DruidPredicateMatch.FALSE);
+    assertMatch("%____", "abcde", DruidPredicateMatch.TRUE);
+    assertMatch("%____", "abcd", DruidPredicateMatch.TRUE);
+    assertMatch("%____", "abc", DruidPredicateMatch.FALSE);
+    assertMatch("__%_%%_", "abc", DruidPredicateMatch.FALSE);
+    assertMatch("__%_%%_", "abcd", DruidPredicateMatch.TRUE);
+    assertMatch("__%_%%_", "abcdxyz", DruidPredicateMatch.TRUE);
+    assertMatch("%__%_%%_%", "abc", DruidPredicateMatch.FALSE);
+    assertMatch("%__%_%%_%", "abcd", DruidPredicateMatch.TRUE);
+    assertMatch("%__%_%%_%", "abcdxyz", DruidPredicateMatch.TRUE);
+  }
+
+  @Test
+  public void testPatternTrailingWildcard()
+  {
+    assertMatch("ab%", "abc", DruidPredicateMatch.TRUE);
+    assertMatch("ab%", "ab", DruidPredicateMatch.TRUE);
+    assertMatch("ab%", "a", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternLeadingWildcard()
+  {
+    assertMatch("%yz", "xyz", DruidPredicateMatch.TRUE);
+    assertMatch("%yz", "yz", DruidPredicateMatch.TRUE);
+    assertMatch("%yz", "z", DruidPredicateMatch.FALSE);
+    assertMatch("%yz", "wxyz", DruidPredicateMatch.TRUE);
+    assertMatch("%yz", "xyza", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternTrailingAny()
+  {
+    assertMatch("ab_", "abc", DruidPredicateMatch.TRUE);
+    assertMatch("ab_", "ab", DruidPredicateMatch.FALSE);
+    assertMatch("ab_", "abcd", DruidPredicateMatch.FALSE);
+    assertMatch("ab_", "xabc", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternLeadingAny()
+  {
+    assertMatch("_yz", "xyz", DruidPredicateMatch.TRUE);
+    assertMatch("_yz", "yz", DruidPredicateMatch.FALSE);
+    assertMatch("_yz", "wxyz", DruidPredicateMatch.FALSE);
+    assertMatch("_yz", "xyza", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternLeadingAndTrailing()
+  {
+    assertMatch("_jkl_", "jkl", DruidPredicateMatch.FALSE);
+    assertMatch("_jkl_", "ijklm", DruidPredicateMatch.TRUE);
+    assertMatch("_jkl_", "ijklmn", DruidPredicateMatch.FALSE);
+    assertMatch("_jkl_", "hijklm", DruidPredicateMatch.FALSE);
+    assertMatch("%jkl%", "jkl", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl%", "ijklm", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl%", "ijklmn", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl%", "hijklm", DruidPredicateMatch.TRUE);
+    assertMatch("_jkl%", "jkl", DruidPredicateMatch.FALSE);
+    assertMatch("_jkl%", "ijklm", DruidPredicateMatch.TRUE);
+    assertMatch("_jkl%", "ijklmn", DruidPredicateMatch.TRUE);
+    assertMatch("_jkl%", "hijklm", DruidPredicateMatch.FALSE);
+    assertMatch("_jkl%", "hijklmn", DruidPredicateMatch.FALSE);
+    assertMatch("%jkl_", "jkl", DruidPredicateMatch.FALSE);
+    assertMatch("%jkl_", "ijklm", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl_", "ijklmn", DruidPredicateMatch.FALSE);
+    assertMatch("%jkl_", "hijklm", DruidPredicateMatch.TRUE);
+    assertMatch("%jkl_", "hijklmn", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternSuffixWithManyParts()
+  {
+    assertMatch("%ba_", "foo bar", DruidPredicateMatch.TRUE);
+    assertMatch("%ba_", "foo bar daz", DruidPredicateMatch.FALSE);
+    assertMatch("%ba_%", "foo bar baz", DruidPredicateMatch.TRUE);
+    assertMatch("a%b_d_", "abcde", DruidPredicateMatch.TRUE);
+    assertMatch("a%b_d_", "abcdexyzbcde", DruidPredicateMatch.TRUE);
+    assertMatch("%b_d_", "abcde", DruidPredicateMatch.TRUE);
+    assertMatch("%b_d_", "abcdexyzbcde", DruidPredicateMatch.TRUE);
+    assertMatch("%b_d_", "abcdexyzbcdef", DruidPredicateMatch.FALSE);
+    assertMatch("%b_d_", "abcdexyzbcd", DruidPredicateMatch.FALSE);
+    assertMatch("%z%_b_d_", "abcdexyzabcde", DruidPredicateMatch.TRUE);
+    assertMatch("%z%_b_d_", "abcdexyzbcde", DruidPredicateMatch.FALSE);
+    assertMatch("%z%_b_d_", "abcdexybcde", DruidPredicateMatch.FALSE);
+    assertMatch("%z%_b_d_", "abcdexbcde", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternNoWildcards()
+  {
+    assertMatch("a_c_e_", "abcdef", DruidPredicateMatch.TRUE);
+    assertMatch("a_c_e_", "abcde", DruidPredicateMatch.FALSE);
+    assertMatch("x_c_e_", "abcdef", DruidPredicateMatch.FALSE);
+    assertMatch("xa_c_e_", "abcdef", DruidPredicateMatch.FALSE);
+    assertMatch("a_c_e_x", "abcde", DruidPredicateMatch.FALSE);
+  }
+
+  @Test
+  public void testPatternFindsCorrectMiddleMatch()
+  {
+    assertMatch("%km%z", "akmz", DruidPredicateMatch.TRUE);
+    assertMatch("%km%z", "akkmz", DruidPredicateMatch.TRUE);
+    assertMatch("%xy%yz", "xyz", DruidPredicateMatch.FALSE);
+    assertMatch("%xy%yz", "xyyz", DruidPredicateMatch.TRUE);
+    assertMatch("%1 _ 5%6", "1 2 3 1 4 5 6", DruidPredicateMatch.TRUE);
+    assertMatch("1 _ 5%6", "1 2 3 1 4 5 6", DruidPredicateMatch.FALSE);
+  }
+
+  private void assertCompilation(String pattern, String expected)
+  {
+    LikeDimFilter.LikeMatcher matcher = LikeDimFilter.LikeMatcher.from(pattern, '\\');
+    Assert.assertEquals(pattern + " => " + expected, matcher.describeCompilation());
+  }
+
+  private void assertMatch(String pattern, String value, DruidPredicateMatch expected)
+  {
+    LikeDimFilter.LikeMatcher matcher = LikeDimFilter.LikeMatcher.from(pattern, '\\');
+    Assert.assertEquals(matcher + " matches " + value, expected, matcher.matches(value));
+  }
 }