. Using this wrapper its easy to add fuzzy/wildcard to e.g. a SpanNearQuery. (Robert Muir, Uwe Schindler) - + * LUCENE-2838: ConstantScoreQuery now directly supports wrapping a Query instance for stripping off scores. The use of a QueryWrapperFilter is no longer needed and discouraged for that use case. Directly wrapping Query improves performance, as out-of-order collection is now supported. (Uwe Schindler) -* LUCENE-2864: Add getMaxTermFrequency (maximum within-document TF) to +* LUCENE-2864: Add getMaxTermFrequency (maximum within-document TF) to FieldInvertState so that it can be used in Similarity.computeNorm. (Robert Muir) +* LUCENE-2720: Segments now record the code version which created them. + (Shai Erera, Mike McCandless, Uwe Schindler) + * LUCENE-2474: Added expert ReaderFinishedListener API to IndexReader, to allow apps that maintain external per-segment caches to evict entries when a segment is finished. (Shay Banon, Yonik @@ -903,8 +1019,8 @@ New features * LUCENE-2911: The new StandardTokenizer, UAX29URLEmailTokenizer, and the ICUTokenizer in contrib now all tag types with a consistent set of token types (defined in StandardTokenizer). Tokens in the major - CJK types are explicitly marked to allow for custom downstream handling: -, , , and . + CJK types are explicitly marked to allow for custom downstream handling: + , , , and . (Robert Muir, Steven Rowe) * LUCENE-2913: Add missing getters to Numeric* classes. (Uwe Schindler) @@ -929,7 +1045,7 @@ Optimizations * LUCENE-2137: Switch to AtomicInteger for some ref counting (Earwin Burrfoot via Mike McCandless) -* LUCENE-2123, LUCENE-2261: Move FuzzyQuery rewrite to separate RewriteMode +* LUCENE-2123, LUCENE-2261: Move FuzzyQuery rewrite to separate RewriteMode into MultiTermQuery. The number of fuzzy expansions can be specified with the maxExpansions parameter to FuzzyQuery. (Uwe Schindler, Robert Muir, Mike McCandless) @@ -963,12 +1079,12 @@ Optimizations TermAttributeImpl, move DEFAULT_TYPE constant to TypeInterface, improve null-handling for TypeAttribute. (Uwe Schindler) -* LUCENE-2329: Switch TermsHash* from using a PostingList object per unique +* LUCENE-2329: Switch TermsHash* from using a PostingList object per unique term to parallel arrays, indexed by termID. This reduces garbage collection overhead significantly, which results in great indexing performance wins when the available JVM heap space is low. This will become even more important when the DocumentsWriter RAM buffer is searchable in the future, - because then it will make sense to make the RAM buffers as large as + because then it will make sense to make the RAM buffers as large as possible. (Mike McCandless, Michael Busch) * LUCENE-2380: The terms field cache methods (getTerms, @@ -983,7 +1099,7 @@ Optimizations causing too many fallbacks to compare-by-value (instead of by-ord). (Mike McCandless) -* LUCENE-2574: IndexInput exposes copyBytes(IndexOutput, long) to allow for +* LUCENE-2574: IndexInput exposes copyBytes(IndexOutput, long) to allow for efficient copying by sub-classes. Optimized copy is implemented for RAM and FS streams. (Shai Erera) @@ -1006,15 +1122,15 @@ Optimizations * LUCENE-2010: Segments with 100% deleted documents are now removed on IndexReader or IndexWriter commit. (Uwe Schindler, Mike McCandless) - + * LUCENE-1472: Removed synchronization from static DateTools methods by using a ThreadLocal. Also converted DateTools.Resolution to a Java 5 enum (this should not break backwards). 
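  As a rough illustration of the LUCENE-2838 entry above (the field and term values are
  invented, and imports from org.apache.lucene.search / org.apache.lucene.index are assumed),
  not a prescribed usage:

      // Old pattern: strip scores by going through a filter.
      Query viaFilter = new ConstantScoreQuery(
          new QueryWrapperFilter(new TermQuery(new Term("body", "lucene"))));

      // New pattern: wrap the Query directly; this keeps support for
      // out-of-order collection and avoids the filter indirection.
      Query direct = new ConstantScoreQuery(new TermQuery(new Term("body", "lucene")));
      direct.setBoost(2.0f);   // every matching doc scores exactly the boost
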
(Uwe Schindler) Build -* LUCENE-2124: Moved the JDK-based collation support from contrib/collation - into core, and moved the ICU-based collation support into contrib/icu. +* LUCENE-2124: Moved the JDK-based collation support from contrib/collation + into core, and moved the ICU-based collation support into contrib/icu. (Robert Muir) * LUCENE-2326: Removed SVN checkouts for backwards tests. The backwards @@ -1026,14 +1142,14 @@ Build * LUCENE-1709: Tests are now parallelized by default (except for benchmark). You can force them to run sequentially by passing -Drunsequential=1 on the command - line. The number of threads that are spawned per CPU defaults to '1'. If you + line. The number of threads that are spawned per CPU defaults to '1'. If you wish to change that, you can run the tests with -DthreadsPerProcessor=[num]. (Robert Muir, Shai Erera, Peter Kofler) * LUCENE-2516: Backwards tests are now compiled against released lucene-core.jar from tarball of previous version. Backwards tests are now packaged together with src distribution. (Uwe Schindler) - + * LUCENE-2611: Added Ant target to install IntelliJ IDEA configuration: "ant idea". See http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ (Steven Rowe) @@ -1042,8 +1158,8 @@ Build generating Maven artifacts (Steven Rowe) * LUCENE-2609: Added jar-test-framework Ant target which packages Lucene's - tests' framework classes. (Drew Farris, Grant Ingersoll, Shai Erera, Steven - Rowe) + tests' framework classes. (Drew Farris, Grant Ingersoll, Shai Erera, + Steven Rowe) Test Cases @@ -1079,18 +1195,18 @@ Test Cases access to "real" files from the test folder itself, can use LuceneTestCase(J4).getDataFile(). (Uwe Schindler) -* LUCENE-2398, LUCENE-2611: Improve tests to work better from IDEs such +* LUCENE-2398, LUCENE-2611: Improve tests to work better from IDEs such as Eclipse and IntelliJ. (Paolo Castagna, Steven Rowe via Robert Muir) * LUCENE-2804: add newFSDirectory to LuceneTestCase to create a FSDirectory at random. (Shai Erera, Robert Muir) - + Documentation * LUCENE-2579: Fix oal.search's package.html description of abstract methods. (Santiago M. Mola via Mike McCandless) - + * LUCENE-2625: Add a note to IndexReader.termDocs() with additional verbiage that the TermEnum must be seeked since it is unpositioned. (Adriano Crestani via Robert Muir) diff --git a/lucene/MIGRATE.txt b/lucene/MIGRATE.txt index 79cb6da939a..779b6309a44 100644 --- a/lucene/MIGRATE.txt +++ b/lucene/MIGRATE.txt @@ -356,3 +356,9 @@ LUCENE-1458, LUCENE-2111: Flexible Indexing field as a parameter, this is removed due to the fact the entire Similarity (all methods) can now be configured per-field. Methods that apply to the entire query such as coord() and queryNorm() exist in SimilarityProvider. + +* LUCENE-1076: TieredMergePolicy is now the default merge policy. + It's able to merge non-contiguous segments; this may cause problems + for applications that rely on Lucene's internal document ID + assigment. If so, you should instead use LogByteSize/DocMergePolicy + during indexing. diff --git a/lucene/build.xml b/lucene/build.xml index 4d245d595f2..3a0a522249a 100644 --- a/lucene/build.xml +++ b/lucene/build.xml @@ -152,6 +152,7 @@ DEPRECATED - Doing Nothing. 
See http://wiki.apache.org/lucene-java/HowToUpdateTheWebsite +@@ -194,6 +195,17 @@ - + + + ++ + + ++ + + @@ -424,9 +436,12 @@ + + + +- diff --git a/lucene/common-build.xml b/lucene/common-build.xml index 38327fec496..f8db3369b21 100644 --- a/lucene/common-build.xml +++ b/lucene/common-build.xml @@ -78,6 +78,7 @@ + @@ -102,6 +103,7 @@ + @@ -306,7 +308,7 @@+ @@ -507,6 +509,8 @@ + + @@ -561,7 +565,7 @@ - + @@ -759,7 +763,8 @@ - + + diff --git a/lucene/contrib/CHANGES.txt b/lucene/contrib/CHANGES.txt index 376a1ecafe1..46a60c87712 100644 --- a/lucene/contrib/CHANGES.txt +++ b/lucene/contrib/CHANGES.txt @@ -4,107 +4,138 @@ Lucene contrib change Log Build - * LUCENE-2413: Moved the demo out of lucene core and into contrib/demo. - (Robert Muir) + * LUCENE-2845: Moved contrib/benchmark to modules. New Features - * LUCENE-2604: Added RegexpQuery support to contrib/queryparser. - (Simon Willnauer, Robert Muir) + * LUCENE-2604: Added RegexpQuery support to contrib/queryparser. + (Simon Willnauer, Robert Muir) - * LUCENE-2500: Added DirectIOLinuxDirectory, a Linux-specific - Directory impl that uses the O_DIRECT flag to bypass the buffer - cache. This is useful to prevent segment merging from evicting - pages from the buffer cache, since fadvise/madvise do not seem. - (Michael McCandless) + * LUCENE-2373: Added a Codec implementation that works with append-only + filesystems (such as e.g. Hadoop DFS). SegmentInfos writing/reading + code is refactored to support append-only FS, and to allow for future + customization of per-segment information. (Andrzej Bialecki) - * LUCENE-2373: Added a Codec implementation that works with append-only - filesystems (such as e.g. Hadoop DFS). SegmentInfos writing/reading - code is refactored to support append-only FS, and to allow for future - customization of per-segment information. (Andrzej Bialecki) + * LUCENE-2479: Added ability to provide a sort comparator for spelling suggestions along + with two implementations. The existing comparator (score, then frequency) is the default (Grant Ingersoll) - * LUCENE-2479: Added ability to provide a sort comparator for spelling suggestions along - with two implementations. The existing comparator (score, then frequency) is the default (Grant Ingersoll) - - * LUCENE-2608: Added the ability to specify the accuracy at method time in the SpellChecker. The per class - method is also still available. (Grant Ingersoll) + * LUCENE-2608: Added the ability to specify the accuracy at method time in the SpellChecker. The per class + method is also still available. (Grant Ingersoll) - * LUCENE-2507: Added DirectSpellChecker, which retrieves correction candidates directly - from the term dictionary using levenshtein automata. (Robert Muir) + * LUCENE-2507: Added DirectSpellChecker, which retrieves correction candidates directly + from the term dictionary using levenshtein automata. (Robert Muir) - * LUCENE-2791: Added WindowsDirectory, a Windows-specific Directory impl - that doesn't synchronize on the file handle. This can be useful to - avoid the performance problems of SimpleFSDirectory and NIOFSDirectory. - (Robert Muir, Simon Willnauer, Uwe Schindler, Michael McCandless) + * LUCENE-2836: Add FieldCacheRewriteMethod, which rewrites MultiTermQueries + using the FieldCache's TermsEnum. (Robert Muir) API Changes - * LUCENE-2606: Changed RegexCapabilities interface to fix thread - safety, serialization, and performance problems. If you have - written a custom RegexCapabilities it will need to be updated - to the new API. 
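  A quick sketch of how the FieldCacheRewriteMethod added in LUCENE-2836 (listed above) might
  be plugged into a MultiTermQuery; the field name and pattern are made up for illustration:

      // Rewrites against the FieldCache's TermsEnum instead of the term
      // dictionary; trades memory (one FieldCache entry per field) for speed
      // on repeated wildcard/prefix queries over low-cardinality fields.
      MultiTermQuery wq = new WildcardQuery(new Term("country", "ger*"));
      wq.setRewriteMethod(new FieldCacheRewriteMethod());
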
(Robert Muir, Uwe Schindler) + * LUCENE-2606: Changed RegexCapabilities interface to fix thread + safety, serialization, and performance problems. If you have + written a custom RegexCapabilities it will need to be updated + to the new API. (Robert Muir, Uwe Schindler) - * LUCENE-2638 MakeHighFreqTerms.TermStats public to make it more useful - for API use. (Andrzej Bialecki) + * LUCENE-2638 MakeHighFreqTerms.TermStats public to make it more useful + for API use. (Andrzej Bialecki) * LUCENE-2912: The field-specific hashmaps in SweetSpotSimilarity were removed. Instead, use SimilarityProvider to return different SweetSpotSimilaritys for different fields, this way all parameters (such as TF factors) can be customized on a per-field basis. (Robert Muir) + +Bug Fixes + + * LUCENE-3045: fixed QueryNodeImpl.containsTag(String key) that was + not lowercasing the key before checking for the tag (Adriano Crestani) ======================= Lucene 3.x (not yet released) ======================= -(No changes) +Bug Fixes -======================= Lucene 3.1 (not yet released) ======================= + * LUCENE-3045: fixed QueryNodeImpl.containsTag(String key) that was + not lowercasing the key before checking for the tag (Adriano Crestani) + + * LUCENE-3026: SmartChineseAnalyzer's WordTokenFilter threw NullPointerException + on sentences longer than 32,767 characters. (wangzhenghang via Robert Muir) + + * LUCENE-2939: Highlighter should try and use maxDocCharsToAnalyze in + WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as + when using CachingTokenStream. This can be a significant performance bug for + large documents. (Mark Miller) + + * LUCENE-3043: GermanStemmer threw IndexOutOfBoundsException if it encountered + a zero-length token. (Robert Muir) + + * LUCENE-3044: ThaiWordFilter didn't reset its cached state correctly, this only + caused a problem if you consumed a tokenstream, then reused it, added different + attributes to it, and consumed it again. (Robert Muir, Uwe Schindler) + +New Features + + * LUCENE-3016: Add analyzer for Latvian. (Robert Muir) + +======================= Lucene 3.1.0 ======================= Changes in backwards compatibility policy * LUCENE-2100: All Analyzers in Lucene-contrib have been marked as final. Analyzers should be only act as a composition of TokenStreams, users should compose their own analyzers instead of subclassing existing ones. - (Simon Willnauer) + (Simon Willnauer) * LUCENE-2194, LUCENE-2201: Snowball APIs were upgraded to snowball revision - 502 (with some local modifications for improved performance). - Index backwards compatibility and binary backwards compatibility is - preserved, but some protected/public member variables changed type. This - does NOT affect java code/class files produced by the snowball compiler, + 502 (with some local modifications for improved performance). + Index backwards compatibility and binary backwards compatibility is + preserved, but some protected/public member variables changed type. This + does NOT affect java code/class files produced by the snowball compiler, but technically is a backwards compatibility break. (Robert Muir) - + * LUCENE-2226: Moved contrib/snowball functionality into contrib/analyzers. Be sure to remove any old obselete lucene-snowball jar files from your classpath! (Robert Muir) - + * LUCENE-2323: Moved contrib/wikipedia functionality into contrib/analyzers. Additionally the package was changed from org.apache.lucene.wikipedia.analysis to org.apache.lucene.analysis.wikipedia. 
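  Since the contrib Analyzers are now final (LUCENE-2100 above), customization is done by
  composing token streams rather than subclassing. A rough sketch, with the filter chain
  chosen purely for illustration:

      Analyzer custom = new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
          // Tokenize, lowercase, then drop English stopwords.
          TokenStream ts = new StandardTokenizer(Version.LUCENE_31, reader);
          ts = new LowerCaseFilter(Version.LUCENE_31, ts);
          return new StopFilter(Version.LUCENE_31, ts,
                                StandardAnalyzer.STOP_WORDS_SET);
        }
      };
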
(Robert Muir) * LUCENE-2581: Added new methods to FragmentsBuilder interface. These methods are used to set pre/post tags and Encoder. (Koji Sekiguchi) - + + * LUCENE-2391: Improved spellchecker (re)build time/ram usage by omitting + frequencies/positions/norms for single-valued fields, modifying the default + ramBufferMBSize to match IndexWriterConfig (16MB), making index optimization + an optional boolean parameter, and modifying the incremental update logic + to work well with unoptimized spellcheck indexes. The indexDictionary() methods + were made final to ensure a hard backwards break in case you were subclassing + Spellchecker. In general, subclassing Spellchecker is not recommended. (Robert Muir) + Changes in runtime behavior * LUCENE-2117: SnowballAnalyzer uses TurkishLowerCaseFilter instead of LowercaseFilter to correctly handle the unique Turkish casing behavior if used with Version > 3.0 and the TurkishStemmer. - (Robert Muir via Simon Willnauer) + (Robert Muir via Simon Willnauer) - * LUCENE-2055: GermanAnalyzer now uses the Snowball German2 algorithm and + * LUCENE-2055: GermanAnalyzer now uses the Snowball German2 algorithm and stopwords list by default for Version > 3.0. (Robert Muir, Uwe Schindler, Simon Willnauer) Bug fixes + * LUCENE-2855: contrib queryparser was using CharSequence as key in some internal + Map instances, which was leading to incorrect behavior, since some CharSequence + implementors do not override hashcode and equals methods. Now the internal Maps + are using String instead. (Adriano Crestani) + * LUCENE-2068: Fixed ReverseStringFilter which was not aware of supplementary characters. During reverse the filter created unpaired surrogates, which will be replaced by U+FFFD by the indexer, but not at query time. The filter now reverses supplementary characters correctly if used with Version > 3.0. (Simon Willnauer, Robert Muir) - * LUCENE-2035: TokenSources.getTokenStream() does not assign positionIncrement. + * LUCENE-2035: TokenSources.getTokenStream() does not assign positionIncrement. (Christopher Morris via Mark Miller) - + * LUCENE-2055: Deprecated RussianTokenizer, RussianStemmer, RussianStemFilter, FrenchStemmer, FrenchStemFilter, DutchStemmer, and DutchStemFilter. For these Analyzers, SnowballFilter is used instead (for Version > 3.0), as @@ -113,48 +144,55 @@ Bug fixes default. (Robert Muir, Uwe Schindler, Simon Willnauer) * LUCENE-2184: Fixed bug with handling best fit value when the proper best fit value is - not an indexed field. Note, this change affects the APIs. (Grant Ingersoll) - + not an indexed field. Note, this change affects the APIs. (Grant Ingersoll) + * LUCENE-2359: Fix bug in CartesianPolyFilterBuilder related to handling of behavior around - the 180th meridian (Grant Ingersoll) + the 180th meridian (Grant Ingersoll) * LUCENE-2404: Fix bugs with position increment and empty tokens in ThaiWordFilter. For matchVersion >= 3.1 the filter also no longer lowercases. ThaiAnalyzer will use a separate LowerCaseFilter instead. (Uwe Schindler, Robert Muir) -* LUCENE-2615: Fix DirectIOLinuxDirectory to not assign bogus - permissions to newly created files, and to not silently hardwire - buffer size to 1 MB. (Mark Miller, Robert Muir, Mike McCandless) + * LUCENE-2615: Fix DirectIOLinuxDirectory to not assign bogus + permissions to newly created files, and to not silently hardwire + buffer size to 1 MB. (Mark Miller, Robert Muir, Mike McCandless) -* LUCENE-2629: Fix gennorm2 task for generating ICUFoldingFilter's .nrm file. 
This allows - you to customize its normalization/folding, by editing the source data files in src/data - and regenerating a new .nrm with 'ant gennorm2'. (David Bowen via Robert Muir) + * LUCENE-2629: Fix gennorm2 task for generating ICUFoldingFilter's .nrm file. This allows + you to customize its normalization/folding, by editing the source data files in src/data + and regenerating a new .nrm with 'ant gennorm2'. (David Bowen via Robert Muir) -* LUCENE-2653: ThaiWordFilter depends on the JRE having a Thai dictionary, which is not - always the case. If the dictionary is unavailable, the filter will now throw - UnsupportedOperationException in the constructor. (Robert Muir) + * LUCENE-2653: ThaiWordFilter depends on the JRE having a Thai dictionary, which is not + always the case. If the dictionary is unavailable, the filter will now throw + UnsupportedOperationException in the constructor. (Robert Muir) -* LUCENE-589: Fix contrib/demo for international documents. - (Curtis d'Entremont via Robert Muir) - -* LUCENE-2246: Fix contrib/demo for Turkish html documents. - (Selim Nadi via Robert Muir) - -* LUCENE-590: Demo HTML parser gives incorrect summaries when title is repeated as a heading - (Curtis d'Entremont via Robert Muir) + * LUCENE-589: Fix contrib/demo for international documents. + (Curtis d'Entremont via Robert Muir) -* LUCENE-591: The demo indexer now indexes meta keywords. - (Curtis d'Entremont via Robert Muir) + * LUCENE-2246: Fix contrib/demo for Turkish html documents. + (Selim Nadi via Robert Muir) - * LUCENE-2943: Fix thread-safety issues with ICUCollationKeyFilter. + * LUCENE-590: Demo HTML parser gives incorrect summaries when title is repeated as a heading + (Curtis d'Entremont via Robert Muir) + + * LUCENE-591: The demo indexer now indexes meta keywords. + (Curtis d'Entremont via Robert Muir) + + * LUCENE-2874: Highlighting overlapping tokens outputted doubled words. + (Pierre Gossé via Robert Muir) + + * LUCENE-2943: Fix thread-safety issues with ICUCollationKeyFilter. (Robert Muir) - + API Changes + * LUCENE-2867: Some contrib queryparser methods that receives CharSequence as + identifier, such as QueryNode#unsetTag(CharSequence), were deprecated and + will be removed on version 4. (Adriano Crestani) + * LUCENE-2147: Spatial GeoHashUtils now always decode GeoHash strings with full precision. GeoHash#decode_exactly(String) was merged into GeoHash#decode(String). (Chris Male, Simon Willnauer) - + * LUCENE-2204: Change some package private classes/members to publicly accessible to implement custom FragmentsBuilders. (Koji Sekiguchi) @@ -171,14 +209,14 @@ API Changes * LUCENE-2626: FastVectorHighlighter: enable FragListBuilder and FragmentsBuilder to be set per-field override. (Koji Sekiguchi) - * LUCENE-2712: FieldBoostMapAttribute in contrib/queryparser was changed from + * LUCENE-2712: FieldBoostMapAttribute in contrib/queryparser was changed from a Map @@ -769,6 +774,7 @@ + + + + + + ++ ++ ++ + + ++ ++ ++ + to a Map . Per the CharSequence javadoc, CharSequence is inappropriate as a map key. (Robert Muir) * LUCENE-1937: Add more methods to manipulate QueryNodeProcessorPipeline elements. QueryNodeProcessorPipeline now implements the List interface, this is useful if you want to extend or modify an existing pipeline. (Adriano Crestani via Robert Muir) - + * LUCENE-2754, LUCENE-2757: Deprecated SpanRegexQuery. Use new SpanMultiTermQueryWrapper (new RegexQuery()) instead. 
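  A short sketch of the replacement pattern (field and regex are made up); the same wrapper
  also lets fuzzy or wildcard queries take part in a SpanNearQuery, as noted at the top of
  this change log:

      // SpanRegexQuery replacement:
      SpanQuery regex = new SpanMultiTermQueryWrapper<RegexQuery>(
          new RegexQuery(new Term("body", "lucen[a-z]*")));

      // The wrapper result is an ordinary SpanQuery, so it can be nested:
      SpanQuery near = new SpanNearQuery(
          new SpanQuery[] { regex, new SpanTermQuery(new Term("body", "search")) },
          10, true);   // slop = 10, in order
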
(Robert Muir, Uwe Schindler) @@ -186,18 +224,27 @@ API Changes * LUCENE-2747: Deprecated ArabicLetterTokenizer. StandardTokenizer now tokenizes most languages correctly including Arabic. (Steven Rowe, Robert Muir) + * LUCENE-2830: Use StringBuilder instead of StringBuffer across Benchmark, and + remove the StringBuffer HtmlParser.parse() variant. (Shai Erera) + * LUCENE-2920: Deprecated ShingleMatrixFilter as it is unmaintained and does not work with custom Attributes or custom payload encoders. (Uwe Schindler) - + New features + * LUCENE-2500: Added DirectIOLinuxDirectory, a Linux-specific + Directory impl that uses the O_DIRECT flag to bypass the buffer + cache. This is useful to prevent segment merging from evicting + pages from the buffer cache, since fadvise/madvise do not seem. + (Michael McCandless) + * LUCENE-2306: Add NumericRangeFilter and NumericRangeQuery support to XMLQueryParser. (Jingkei Ly, via Mark Harwood) * LUCENE-2102: Add a Turkish LowerCase Filter. TurkishLowerCaseFilter handles Turkish and Azeri unique casing behavior correctly. (Ahmet Arslan, Robert Muir via Simon Willnauer) - + * LUCENE-2039: Add a extensible query parser to contrib/misc. ExtendableQueryParser enables arbitrary parser extensions based on a customizable field naming scheme. @@ -205,11 +252,11 @@ New features * LUCENE-2067: Add a Czech light stemmer. CzechAnalyzer will now stem words when Version is set to 3.1 or higher. (Robert Muir) - + * LUCENE-2062: Add a Bulgarian analyzer. (Robert Muir, Simon Willnauer) * LUCENE-2206: Add Snowball's stopword lists for Danish, Dutch, English, - Finnish, French, German, Hungarian, Italian, Norwegian, Russian, Spanish, + Finnish, French, German, Hungarian, Italian, Norwegian, Russian, Spanish, and Swedish. These can be loaded with WordListLoader.getSnowballWordSet. (Robert Muir, Simon Willnauer) @@ -217,7 +264,7 @@ New features (Koji Sekiguchi) * LUCENE-2218: ShingleFilter supports minimum shingle size, and the separator - character is now configurable. Its also up to 20% faster. + character is now configurable. Its also up to 20% faster. (Steven Rowe via Robert Muir) * LUCENE-2234: Add a Hindi analyzer. (Robert Muir) @@ -247,7 +294,7 @@ New features * LUCENE-2298: Add analyzers/stempel, an algorithmic stemmer with support for the Polish language. (Andrzej Bialecki via Robert Muir) - * LUCENE-2400: ShingleFilter was changed to don't output all-filler shingles and + * LUCENE-2400: ShingleFilter was changed to don't output all-filler shingles and unigrams, and uses a more performant algorithm to build grams using a linked list of AttributeSource.cloneAttributes() instances and the new copyTo() method. (Steven Rowe via Uwe Schindler) @@ -266,7 +313,7 @@ New features * LUCENE-2464: FastVectorHighlighter: add SingleFragListBuilder to return entire field contents. (Koji Sekiguchi) - * LUCENE-2503: Added lighter stemming alternatives for European languages. + * LUCENE-2503: Added lighter stemming alternatives for European languages. (Robert Muir) * LUCENE-2581: FastVectorHighlighter: add Encoder to FragmentsBuilder. @@ -274,12 +321,23 @@ New features * LUCENE-2624: Add Analyzers for Armenian, Basque, and Catalan, from snowball. (Robert Muir) - + * LUCENE-1938: PrecedenceQueryParser is now implemented with the flexible QP framework. This means that you can also add this functionality to your own QP pipeline by using BooleanModifiersQueryNodeProcessor, for example instead of GroupQueryNodeProcessor. 
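  Because QueryNodeProcessorPipeline now implements List (LUCENE-1937 above), editing a
  pipeline can look roughly like the sketch below; the accessor and processor class names
  reflect the contrib flexible query parser as I understand it and should be treated as an
  approximation, not a definitive recipe:

      StandardQueryParser qp = new StandardQueryParser();
      // The standard pipeline is a List<QueryNodeProcessor>, so it can be
      // edited in place, e.g. to get precedence-style boolean handling:
      QueryNodeProcessorPipeline pipeline =
          (QueryNodeProcessorPipeline) qp.getQueryNodeProcessor();
      pipeline.add(new BooleanModifiersQueryNodeProcessor());
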
(Adriano Crestani via Robert Muir) + * LUCENE-2791: Added WindowsDirectory, a Windows-specific Directory impl + that doesn't synchronize on the file handle. This can be useful to + avoid the performance problems of SimpleFSDirectory and NIOFSDirectory. + (Robert Muir, Simon Willnauer, Uwe Schindler, Michael McCandless) + + * LUCENE-2842: Add analyzer for Galician. Also adds the RSLP (Orengo) stemmer + for Portuguese. (Robert Muir) + + * SOLR-1057: Add PathHierarchyTokenizer that represents file path hierarchies as synonyms of + /something, /something/something, /something/something/else. (Ryan McKinley, Koji Sekiguchi) + Build * LUCENE-2124: Moved the JDK-based collation support from contrib/collation @@ -299,7 +357,12 @@ Build * LUCENE-2797: Upgrade contrib/icu's ICU jar file to ICU 4.6 (Robert Muir) - + + * LUCENE-2833: Upgrade contrib/ant's jtidy jar file to r938 (Robert Muir) + + * LUCENE-2413: Moved the demo out of lucene core and into contrib/demo. + (Robert Muir) + Optimizations * LUCENE-2157: DelimitedPayloadTokenFilter no longer copies the buffer diff --git a/lucene/contrib/ant/src/java/org/apache/lucene/ant/IndexTask.java b/lucene/contrib/ant/src/java/org/apache/lucene/ant/IndexTask.java index b22638c713a..9e1c7480df5 100644 --- a/lucene/contrib/ant/src/java/org/apache/lucene/ant/IndexTask.java +++ b/lucene/contrib/ant/src/java/org/apache/lucene/ant/IndexTask.java @@ -39,7 +39,7 @@ import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.index.LogMergePolicy; +import org.apache.lucene.index.TieredMergePolicy; import org.apache.lucene.index.Term; import org.apache.lucene.index.IndexWriterConfig.OpenMode; import org.apache.lucene.search.IndexSearcher; @@ -285,9 +285,9 @@ public class IndexTask extends Task { IndexWriterConfig conf = new IndexWriterConfig( Version.LUCENE_CURRENT, analyzer).setOpenMode( create ? OpenMode.CREATE : OpenMode.APPEND); - LogMergePolicy lmp = (LogMergePolicy) conf.getMergePolicy(); - lmp.setUseCompoundFile(useCompoundIndex); - lmp.setMergeFactor(mergeFactor); + TieredMergePolicy tmp = (TieredMergePolicy) conf.getMergePolicy(); + tmp.setUseCompoundFile(useCompoundIndex); + tmp.setMaxMergeAtOnce(mergeFactor); IndexWriter writer = new IndexWriter(dir, conf); int totalFiles = 0; int totalIndexed = 0; diff --git a/lucene/contrib/db/bdb-je/lib/je-3.3.93.jar b/lucene/contrib/db/bdb-je/lib/je-3.3.93.jar new file mode 100644 index 00000000000..4ceafc9209a --- /dev/null +++ b/lucene/contrib/db/bdb-je/lib/je-3.3.93.jar @@ -0,0 +1,2 @@ +AnyObjectId[9a9ff077cdd36a96e7e0506986edd4e52b90a22f] was removed in git history. +Apache SVN contains full history. \ No newline at end of file diff --git a/lucene/contrib/db/bdb-je/lib/je-LICENSE-FAKE.txt b/lucene/contrib/db/bdb-je/lib/je-LICENSE-FAKE.txt new file mode 100644 index 00000000000..a1defaa3da4 --- /dev/null +++ b/lucene/contrib/db/bdb-je/lib/je-LICENSE-FAKE.txt @@ -0,0 +1 @@ +No bdb jars are shipped with lucene. This is a fake license to work around the automated license checking. diff --git a/lucene/contrib/db/bdb-je/lib/je-NOTICE-FAKE.txt b/lucene/contrib/db/bdb-je/lib/je-NOTICE-FAKE.txt new file mode 100644 index 00000000000..a1defaa3da4 --- /dev/null +++ b/lucene/contrib/db/bdb-je/lib/je-NOTICE-FAKE.txt @@ -0,0 +1 @@ +No bdb jars are shipped with lucene. This is a fake license to work around the automated license checking. 
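  The IndexTask change above casts the default merge policy to TieredMergePolicy; a slightly
  more defensive sketch configures the policy explicitly (the tuning values are illustrative
  only, and the surrounding variables are the ones already present in IndexTask):

      TieredMergePolicy tmp = new TieredMergePolicy();
      tmp.setUseCompoundFile(useCompoundIndex);   // same knob the task exposes
      tmp.setMaxMergeAtOnce(mergeFactor);         // rough analogue of mergeFactor
      tmp.setSegmentsPerTier(mergeFactor);

      IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer)
          .setOpenMode(create ? OpenMode.CREATE : OpenMode.APPEND)
          .setMergePolicy(tmp);
      IndexWriter writer = new IndexWriter(dir, conf);
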
diff --git a/lucene/contrib/db/bdb/lib/db--NOTICE-FAKE.txt b/lucene/contrib/db/bdb/lib/db--NOTICE-FAKE.txt new file mode 100644 index 00000000000..a1defaa3da4 --- /dev/null +++ b/lucene/contrib/db/bdb/lib/db--NOTICE-FAKE.txt @@ -0,0 +1 @@ +No bdb jars are shipped with lucene. This is a fake license to work around the automated license checking. diff --git a/lucene/contrib/db/bdb/lib/db-4.7.25.jar b/lucene/contrib/db/bdb/lib/db-4.7.25.jar new file mode 100644 index 00000000000..fedd3e2adf2 --- /dev/null +++ b/lucene/contrib/db/bdb/lib/db-4.7.25.jar @@ -0,0 +1,2 @@ +AnyObjectId[99baf20bacd712cae91dd6e4e1f46224cafa1a37] was removed in git history. +Apache SVN contains full history. \ No newline at end of file diff --git a/lucene/contrib/db/bdb/lib/db-LICENSE-FAKE.txt b/lucene/contrib/db/bdb/lib/db-LICENSE-FAKE.txt new file mode 100644 index 00000000000..a1defaa3da4 --- /dev/null +++ b/lucene/contrib/db/bdb/lib/db-LICENSE-FAKE.txt @@ -0,0 +1 @@ +No bdb jars are shipped with lucene. This is a fake license to work around the automated license checking. diff --git a/lucene/contrib/demo/src/test/org/apache/lucene/demo/TestDemo.java b/lucene/contrib/demo/src/test/org/apache/lucene/demo/TestDemo.java index 4457ef7aae3..d2bd59d0963 100644 --- a/lucene/contrib/demo/src/test/org/apache/lucene/demo/TestDemo.java +++ b/lucene/contrib/demo/src/test/org/apache/lucene/demo/TestDemo.java @@ -22,16 +22,17 @@ import java.io.File; import java.io.PrintStream; import org.apache.lucene.util.LuceneTestCase; +import org.apache.lucene.util._TestUtil; public class TestDemo extends LuceneTestCase { - private void testOneSearch(String query, int expectedHitCount) throws Exception { + private void testOneSearch(File indexPath, String query, int expectedHitCount) throws Exception { PrintStream outSave = System.out; try { ByteArrayOutputStream bytes = new ByteArrayOutputStream(); PrintStream fakeSystemOut = new PrintStream(bytes); System.setOut(fakeSystemOut); - SearchFiles.main(new String[] {"-query", query}); + SearchFiles.main(new String[] {"-query", query, "-index", indexPath.getPath()}); fakeSystemOut.flush(); String output = bytes.toString(); // intentionally use default encoding assertTrue("output=" + output, output.contains(expectedHitCount + " total matching documents")); @@ -42,12 +43,13 @@ public class TestDemo extends LuceneTestCase { public void testIndexSearch() throws Exception { File dir = getDataFile("test-files/docs"); - IndexFiles.main(new String[] { "-create", "-docs", dir.getPath() }); - testOneSearch("apache", 3); - testOneSearch("patent", 8); - testOneSearch("lucene", 0); - testOneSearch("gnu", 6); - testOneSearch("derivative", 8); - testOneSearch("license", 13); + File indexDir = _TestUtil.getTempDir("ContribDemoTest"); + IndexFiles.main(new String[] { "-create", "-docs", dir.getPath(), "-index", indexDir.getPath()}); + testOneSearch(indexDir, "apache", 3); + testOneSearch(indexDir, "patent", 8); + testOneSearch(indexDir, "lucene", 0); + testOneSearch(indexDir, "gnu", 6); + testOneSearch(indexDir, "derivative", 8); + testOneSearch(indexDir, "license", 13); } } diff --git a/lucene/contrib/highlighter/.cvsignore b/lucene/contrib/highlighter/.cvsignore deleted file mode 100644 index 9d0b71a3c79..00000000000 --- a/lucene/contrib/highlighter/.cvsignore +++ /dev/null @@ -1,2 +0,0 @@ -build -dist diff --git a/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/Highlighter.java b/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/Highlighter.java index 
5deafd62faa..2c2104570e4 100644 --- a/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/Highlighter.java +++ b/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/Highlighter.java @@ -197,6 +197,11 @@ public class Highlighter tokenStream.reset(); TextFragment currentFrag = new TextFragment(newText,newText.length(), docFrags.size()); + + if (fragmentScorer instanceof QueryScorer) { + ((QueryScorer) fragmentScorer).setMaxDocCharsToAnalyze(maxDocCharsToAnalyze); + } + TokenStream newStream = fragmentScorer.init(tokenStream); if(newStream != null) { tokenStream = newStream; diff --git a/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/OffsetLimitTokenFilter.java b/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/OffsetLimitTokenFilter.java new file mode 100644 index 00000000000..2102c28d894 --- /dev/null +++ b/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/OffsetLimitTokenFilter.java @@ -0,0 +1,57 @@ +package org.apache.lucene.search.highlight; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import java.io.IOException; + +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.OffsetAttribute; + +/** + * This TokenFilter limits the number of tokens while indexing by adding up the + * current offset. 
+ */ +public final class OffsetLimitTokenFilter extends TokenFilter { + + private int offsetCount; + private OffsetAttribute offsetAttrib = getAttribute(OffsetAttribute.class); + private int offsetLimit; + + public OffsetLimitTokenFilter(TokenStream input, int offsetLimit) { + super(input); + this.offsetLimit = offsetLimit; + } + + @Override + public boolean incrementToken() throws IOException { + if (offsetCount < offsetLimit && input.incrementToken()) { + int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset(); + offsetCount += offsetLength; + return true; + } + return false; + } + + @Override + public void reset() throws IOException { + super.reset(); + offsetCount = 0; + } + +} diff --git a/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/QueryScorer.java b/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/QueryScorer.java index e0b76a4aebd..706fb89151e 100644 --- a/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/QueryScorer.java +++ b/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/QueryScorer.java @@ -54,6 +54,7 @@ public class QueryScorer implements Scorer { private IndexReader reader; private boolean skipInitExtractor; private boolean wrapToCaching = true; + private int maxCharsToAnalyze; /** * @param query Query to use for highlighting @@ -209,7 +210,7 @@ public class QueryScorer implements Scorer { private TokenStream initExtractor(TokenStream tokenStream) throws IOException { WeightedSpanTermExtractor qse = defaultField == null ? new WeightedSpanTermExtractor() : new WeightedSpanTermExtractor(defaultField); - + qse.setMaxDocCharsToAnalyze(maxCharsToAnalyze); qse.setExpandMultiTermQuery(expandMultiTermQuery); qse.setWrapIfNotCachingTokenFilter(wrapToCaching); if (reader == null) { @@ -265,4 +266,8 @@ public class QueryScorer implements Scorer { public void setWrapIfNotCachingTokenFilter(boolean wrap) { this.wrapToCaching = wrap; } + + public void setMaxDocCharsToAnalyze(int maxDocCharsToAnalyze) { + this.maxCharsToAnalyze = maxDocCharsToAnalyze; + } } diff --git a/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java b/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java index 471c29ee070..4d5990d6dab 100644 --- a/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java +++ b/lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java @@ -56,6 +56,7 @@ public class WeightedSpanTermExtractor { private boolean expandMultiTermQuery; private boolean cachedTokenStream; private boolean wrapToCaching = true; + private int maxDocCharsToAnalyze; public WeightedSpanTermExtractor() { } @@ -320,13 +321,13 @@ public class WeightedSpanTermExtractor { private AtomicReaderContext getLeafContextForField(String field) throws IOException { if(wrapToCaching && !cachedTokenStream && !(tokenStream instanceof CachingTokenFilter)) { - tokenStream = new CachingTokenFilter(tokenStream); + tokenStream = new CachingTokenFilter(new OffsetLimitTokenFilter(tokenStream, maxDocCharsToAnalyze)); cachedTokenStream = true; } AtomicReaderContext context = readers.get(field); if (context == null) { MemoryIndex indexer = new MemoryIndex(); - indexer.addField(field, tokenStream); + indexer.addField(field, new OffsetLimitTokenFilter(tokenStream, maxDocCharsToAnalyze)); tokenStream.reset(); IndexSearcher searcher = 
indexer.createSearcher(); // MEM index has only atomic ctx @@ -545,4 +546,8 @@ public class WeightedSpanTermExtractor { public void setWrapIfNotCachingTokenFilter(boolean wrap) { this.wrapToCaching = wrap; } + + protected final void setMaxDocCharsToAnalyze(int maxDocCharsToAnalyze) { + this.maxDocCharsToAnalyze = maxDocCharsToAnalyze; + } } diff --git a/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterPhraseTest.java b/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterPhraseTest.java index 755d9f5d4ec..6687727a4a6 100644 --- a/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterPhraseTest.java +++ b/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterPhraseTest.java @@ -58,7 +58,7 @@ public class HighlighterPhraseTest extends LuceneTestCase { final String TEXT = "the fox jumped"; final Directory directory = newDirectory(); final IndexWriter indexWriter = new IndexWriter(directory, - newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false))); + newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false))); try { final Document document = new Document(); document.add(new Field(FIELD, new TokenStreamConcurrent(), @@ -102,7 +102,7 @@ public class HighlighterPhraseTest extends LuceneTestCase { final String TEXT = "the fox jumped"; final Directory directory = newDirectory(); final IndexWriter indexWriter = new IndexWriter(directory, - newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false))); + newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false))); try { final Document document = new Document(); document.add(new Field(FIELD, new TokenStreamConcurrent(), @@ -172,7 +172,7 @@ public class HighlighterPhraseTest extends LuceneTestCase { final String TEXT = "the fox did not jump"; final Directory directory = newDirectory(); final IndexWriter indexWriter = new IndexWriter(directory, - newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false))); + newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false))); try { final Document document = new Document(); document.add(new Field(FIELD, new TokenStreamSparse(), @@ -215,7 +215,7 @@ public class HighlighterPhraseTest extends LuceneTestCase { final String TEXT = "the fox did not jump"; final Directory directory = newDirectory(); final IndexWriter indexWriter = new IndexWriter(directory, - newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false))); + newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false))); try { final Document document = new Document(); document.add(new Field(FIELD, TEXT, Store.YES, Index.ANALYZED, @@ -256,7 +256,7 @@ public class HighlighterPhraseTest extends LuceneTestCase { final String TEXT = "the fox did not jump"; final Directory directory = newDirectory(); final IndexWriter indexWriter = new IndexWriter(directory, - newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false))); + newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false))); try { final Document document = new Document(); document.add(new Field(FIELD, new TokenStreamSparse(), diff --git 
a/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java b/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java index 7c99c5d5a57..cea67428617 100644 --- a/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java +++ b/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java @@ -90,7 +90,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte Directory ramDir; public IndexSearcher searcher = null; int numHighlights = 0; - final Analyzer analyzer = new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true); + final Analyzer analyzer = new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true); TopDocs hits; String[] texts = { @@ -101,7 +101,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte "wordx wordy wordz wordx wordy wordx worda wordb wordy wordc", "y z x y z a b", "lets is a the lets is a the lets is a the lets" }; public void testQueryScorerHits() throws Exception { - Analyzer analyzer = new MockAnalyzer(MockTokenizer.SIMPLE, true); + Analyzer analyzer = new MockAnalyzer(random, MockTokenizer.SIMPLE, true); QueryParser qp = new QueryParser(TEST_VERSION_CURRENT, FIELD_NAME, analyzer); query = qp.parse("\"very long\""); searcher = new IndexSearcher(ramDir, true); @@ -133,7 +133,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte String s1 = "I call our world Flatland, not because we call it so,"; - QueryParser parser = new QueryParser(TEST_VERSION_CURRENT, FIELD_NAME, new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true)); + QueryParser parser = new QueryParser(TEST_VERSION_CURRENT, FIELD_NAME, new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true)); // Verify that a query against the default field results in text being // highlighted @@ -165,7 +165,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte */ private static String highlightField(Query query, String fieldName, String text) throws IOException, InvalidTokenOffsetsException { - TokenStream tokenStream = new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true).tokenStream(fieldName, new StringReader(text)); + TokenStream tokenStream = new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true).tokenStream(fieldName, new StringReader(text)); // Assuming "", "" used to highlight SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(); QueryScorer scorer = new QueryScorer(query, fieldName, FIELD_NAME); @@ -210,7 +210,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte String f2c = f2 + ":"; String q = "(" + f1c + ph1 + " OR " + f2c + ph1 + ") AND (" + f1c + ph2 + " OR " + f2c + ph2 + ")"; - Analyzer analyzer = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + Analyzer analyzer = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); QueryParser qp = new QueryParser(TEST_VERSION_CURRENT, f1, analyzer); Query query = qp.parse(q); @@ -1134,13 +1134,13 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte sb.append("stoppedtoken"); } SimpleHTMLFormatter fm = new SimpleHTMLFormatter(); - Highlighter hg = getHighlighter(query, "data", new MockAnalyzer(MockTokenizer.SIMPLE, true, stopWords, true).tokenStream( + 
Highlighter hg = getHighlighter(query, "data", new MockAnalyzer(random, MockTokenizer.SIMPLE, true, stopWords, true).tokenStream( "data", new StringReader(sb.toString())), fm);// new Highlighter(fm, // new // QueryTermScorer(query)); hg.setTextFragmenter(new NullFragmenter()); hg.setMaxDocCharsToAnalyze(100); - match = hg.getBestFragment(new MockAnalyzer(MockTokenizer.SIMPLE, true, stopWords, true), "data", sb.toString()); + match = hg.getBestFragment(new MockAnalyzer(random, MockTokenizer.SIMPLE, true, stopWords, true), "data", sb.toString()); assertTrue("Matched text should be no more than 100 chars in length ", match.length() < hg .getMaxDocCharsToAnalyze()); @@ -1151,7 +1151,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte // + whitespace) sb.append(" "); sb.append(goodWord); - match = hg.getBestFragment(new MockAnalyzer(MockTokenizer.SIMPLE, true, stopWords, true), "data", sb.toString()); + match = hg.getBestFragment(new MockAnalyzer(random, MockTokenizer.SIMPLE, true, stopWords, true), "data", sb.toString()); assertTrue("Matched text should be no more than 100 chars in length ", match.length() < hg .getMaxDocCharsToAnalyze()); } @@ -1170,10 +1170,10 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte String text = "this is a text with searchterm in it"; SimpleHTMLFormatter fm = new SimpleHTMLFormatter(); - Highlighter hg = getHighlighter(query, "text", new MockAnalyzer(MockTokenizer.SIMPLE, true, stopWords, true).tokenStream("text", new StringReader(text)), fm); + Highlighter hg = getHighlighter(query, "text", new MockAnalyzer(random, MockTokenizer.SIMPLE, true, stopWords, true).tokenStream("text", new StringReader(text)), fm); hg.setTextFragmenter(new NullFragmenter()); hg.setMaxDocCharsToAnalyze(36); - String match = hg.getBestFragment(new MockAnalyzer(MockTokenizer.SIMPLE, true, stopWords, true), "text", text); + String match = hg.getBestFragment(new MockAnalyzer(random, MockTokenizer.SIMPLE, true, stopWords, true), "text", text); assertTrue( "Matched text should contain remainder of text after highlighted query ", match.endsWith("in it")); @@ -1191,7 +1191,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte // test to show how rewritten query can still be used if (searcher != null) searcher.close(); searcher = new IndexSearcher(ramDir, true); - Analyzer analyzer = new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true); + Analyzer analyzer = new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true); QueryParser parser = new QueryParser(TEST_VERSION_CURRENT, FIELD_NAME, analyzer); Query query = parser.parse("JF? 
or Kenned*"); @@ -1446,64 +1446,64 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte Highlighter highlighter; String result; - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("foo"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("foo"); highlighter = getHighlighter(query, "text", getTS2(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("10"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("10"); highlighter = getHighlighter(query, "text", getTS2(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("hi"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("hi"); highlighter = getHighlighter(query, "text", getTS2(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("speed"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("speed"); highlighter = getHighlighter(query, "text", getTS2(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("hispeed"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("hispeed"); highlighter = getHighlighter(query, "text", getTS2(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("hi speed"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("hi speed"); highlighter = getHighlighter(query, "text", getTS2(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); // ///////////////// same tests, just put the bigger overlapping token // first - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("foo"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("foo"); highlighter = getHighlighter(query, "text", getTS2a(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2a(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("10"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("10"); highlighter = getHighlighter(query, "text", 
getTS2a(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2a(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("hi"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("hi"); highlighter = getHighlighter(query, "text", getTS2a(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2a(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("speed"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("speed"); highlighter = getHighlighter(query, "text", getTS2a(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2a(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("hispeed"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("hispeed"); highlighter = getHighlighter(query, "text", getTS2a(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2a(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); - query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("hi speed"); + query = new QueryParser(TEST_VERSION_CURRENT, "text", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("hi speed"); highlighter = getHighlighter(query, "text", getTS2a(), HighlighterTest.this); result = highlighter.getBestFragments(getTS2a(), s, 3, "..."); assertEquals("Hi-Speed10 foo", result); @@ -1514,7 +1514,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte } private Directory dir; - private Analyzer a = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + private Analyzer a = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); public void testWeightedTermsWithDeletes() throws IOException, ParseException, InvalidTokenOffsetsException { makeIndex(); @@ -1529,7 +1529,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte } private void makeIndex() throws IOException { - IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false))); + IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false))); writer.addDocument( doc( "t_text1", "random words for highlighting tests del" ) ); writer.addDocument( doc( "t_text1", "more random words for second field del" ) ); writer.addDocument( doc( "t_text1", "random words for highlighting tests del" ) ); @@ -1539,7 +1539,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte } private void deleteDocument() throws IOException { - IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false)).setOpenMode(OpenMode.APPEND)); + IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).setOpenMode(OpenMode.APPEND)); writer.deleteDocuments( new Term( "t_text1", "del" ) ); // To see 
negative idf, keep comment the following line //writer.optimize(); @@ -1644,7 +1644,7 @@ public class HighlighterTest extends BaseTokenStreamTestCase implements Formatte dir = newDirectory(); ramDir = newDirectory(); IndexWriter writer = new IndexWriter(ramDir, newIndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true))); + TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true))); for (String text : texts) { addDoc(writer, text); } diff --git a/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/OffsetLimitTokenFilterTest.java b/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/OffsetLimitTokenFilterTest.java new file mode 100644 index 00000000000..45aa3f51425 --- /dev/null +++ b/lucene/contrib/highlighter/src/test/org/apache/lucene/search/highlight/OffsetLimitTokenFilterTest.java @@ -0,0 +1,60 @@ +package org.apache.lucene.search.highlight; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +import java.io.Reader; +import java.io.StringReader; + +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.BaseTokenStreamTestCase; +import org.apache.lucene.analysis.MockTokenizer; +import org.apache.lucene.analysis.TokenStream; + +public class OffsetLimitTokenFilterTest extends BaseTokenStreamTestCase { + + public void testFilter() throws Exception { + TokenStream stream = new MockTokenizer(new StringReader( + "short toolong evenmuchlongertext a ab toolong foo"), + MockTokenizer.WHITESPACE, false); + OffsetLimitTokenFilter filter = new OffsetLimitTokenFilter(stream, 10); + assertTokenStreamContents(filter, new String[] {"short", "toolong"}); + + stream = new MockTokenizer(new StringReader( + "short toolong evenmuchlongertext a ab toolong foo"), + MockTokenizer.WHITESPACE, false); + filter = new OffsetLimitTokenFilter(stream, 12); + assertTokenStreamContents(filter, new String[] {"short", "toolong"}); + + stream = new MockTokenizer(new StringReader( + "short toolong evenmuchlongertext a ab toolong foo"), + MockTokenizer.WHITESPACE, false); + filter = new OffsetLimitTokenFilter(stream, 30); + assertTokenStreamContents(filter, new String[] {"short", "toolong", + "evenmuchlongertext"}); + + + checkOneTermReuse(new Analyzer() { + + @Override + public TokenStream tokenStream(String fieldName, Reader reader) { + return new OffsetLimitTokenFilter(new MockTokenizer(reader, + MockTokenizer.WHITESPACE, false), 10); + } + }, "llenges", "llenges"); + } +} \ No newline at end of file diff --git a/lucene/contrib/highlighter/src/test/org/apache/lucene/search/vectorhighlight/AbstractTestCase.java b/lucene/contrib/highlighter/src/test/org/apache/lucene/search/vectorhighlight/AbstractTestCase.java index 85f95097c95..0f19ebfd459 100644 --- a/lucene/contrib/highlighter/src/test/org/apache/lucene/search/vectorhighlight/AbstractTestCase.java +++ b/lucene/contrib/highlighter/src/test/org/apache/lucene/search/vectorhighlight/AbstractTestCase.java @@ -87,9 +87,9 @@ public abstract class AbstractTestCase extends LuceneTestCase { @Override public void setUp() throws Exception { super.setUp(); - analyzerW = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + analyzerW = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); analyzerB = new BigramAnalyzer(); - analyzerK = new MockAnalyzer(MockTokenizer.KEYWORD, false); + analyzerK = new MockAnalyzer(random, MockTokenizer.KEYWORD, false); paW = new QueryParser(TEST_VERSION_CURRENT, F, analyzerW ); paB = new QueryParser(TEST_VERSION_CURRENT, F, analyzerB ); dir = newDirectory(); diff --git a/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndexReader.java b/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndexReader.java index 742e101ce93..656a5a48ed8 100644 --- a/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndexReader.java +++ b/lucene/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndexReader.java @@ -32,8 +32,7 @@ import java.util.Comparator; import org.apache.lucene.document.Document; import org.apache.lucene.document.FieldSelector; import org.apache.lucene.index.*; -import org.apache.lucene.index.values.DocValues; -import org.apache.lucene.index.IndexReader.ReaderContext; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.store.Directory; import org.apache.lucene.util.BitVector; import org.apache.lucene.util.BytesRef; @@ -391,11 +390,6 @@ public 
class InstantiatedIndexReader extends IndexReader { public TermsEnum terms() { return new InstantiatedTermsEnum(orderedTerms, upto, currentField); } - - @Override - public DocValues docValues() throws IOException { - return null; - } }; } @@ -439,11 +433,6 @@ public class InstantiatedIndexReader extends IndexReader { } }; } - - @Override - public DocValues docValues(String field) throws IOException { - return null; - } }; } @@ -498,4 +487,9 @@ public class InstantiatedIndexReader extends IndexReader { } } } + + @Override + public PerDocValues perDocValues() throws IOException { + return null; + } } diff --git a/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestEmptyIndex.java b/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestEmptyIndex.java index f513a0bb423..40811908d2c 100644 --- a/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestEmptyIndex.java +++ b/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestEmptyIndex.java @@ -59,7 +59,7 @@ public class TestEmptyIndex extends LuceneTestCase { // make sure a Directory acts the same Directory d = newDirectory(); - new IndexWriter(d, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer())).close(); + new IndexWriter(d, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))).close(); r = IndexReader.open(d, false); testNorms(r); r.close(); @@ -84,7 +84,7 @@ public class TestEmptyIndex extends LuceneTestCase { // make sure a Directory acts the same Directory d = newDirectory(); - new IndexWriter(d, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer())).close(); + new IndexWriter(d, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))).close(); r = IndexReader.open(d, false); termsEnumTest(r); r.close(); diff --git a/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java b/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java index 7a5398c4ed0..4048d1c59a8 100644 --- a/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java +++ b/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java @@ -21,6 +21,7 @@ import java.util.Arrays; import java.util.Comparator; import java.util.Iterator; import java.util.List; +import java.util.Random; import org.apache.lucene.analysis.Token; import org.apache.lucene.analysis.TokenStream; @@ -65,7 +66,7 @@ public class TestIndicesEquals extends LuceneTestCase { // create dir data IndexWriter indexWriter = new IndexWriter(dir, newIndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer()).setMergePolicy(newInOrderLogMergePolicy())); + TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMergePolicy(newLogMergePolicy())); for (int i = 0; i < 20; i++) { Document document = new Document(); @@ -88,10 +89,13 @@ public class TestIndicesEquals extends LuceneTestCase { Directory dir = newDirectory(); InstantiatedIndex ii = new InstantiatedIndex(); - + + // we need to pass the "same" random to both, so they surely index the same payload data. + long seed = random.nextLong(); + // create dir data IndexWriter indexWriter = new IndexWriter(dir, newIndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer()).setMergePolicy(newInOrderLogMergePolicy())); + TEST_VERSION_CURRENT, new MockAnalyzer(new Random(seed))).setMergePolicy(newLogMergePolicy())); indexWriter.setInfoStream(VERBOSE ? 
System.out : null); if (VERBOSE) { System.out.println("TEST: make test index"); @@ -104,7 +108,7 @@ public class TestIndicesEquals extends LuceneTestCase { indexWriter.close(); // test ii writer - InstantiatedIndexWriter instantiatedIndexWriter = ii.indexWriterFactory(new MockAnalyzer(), true); + InstantiatedIndexWriter instantiatedIndexWriter = ii.indexWriterFactory(new MockAnalyzer(new Random(seed)), true); for (int i = 0; i < 500; i++) { Document document = new Document(); assembleDocument(document, i); diff --git a/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestRealTime.java b/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestRealTime.java index 413d7f56fae..43b11cc0100 100644 --- a/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestRealTime.java +++ b/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestRealTime.java @@ -36,7 +36,7 @@ public class TestRealTime extends LuceneTestCase { InstantiatedIndex index = new InstantiatedIndex(); InstantiatedIndexReader reader = new InstantiatedIndexReader(index); - IndexSearcher searcher = newSearcher(reader); + IndexSearcher searcher = newSearcher(reader, false); InstantiatedIndexWriter writer = new InstantiatedIndexWriter(index); Document doc; diff --git a/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestUnoptimizedReaderOnConstructor.java b/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestUnoptimizedReaderOnConstructor.java index d3a06998edc..ae52ace5b9e 100644 --- a/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestUnoptimizedReaderOnConstructor.java +++ b/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestUnoptimizedReaderOnConstructor.java @@ -34,17 +34,17 @@ public class TestUnoptimizedReaderOnConstructor extends LuceneTestCase { public void test() throws Exception { Directory dir = newDirectory(); - IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer())); + IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))); addDocument(iw, "Hello, world!"); addDocument(iw, "All work and no play makes jack a dull boy"); iw.close(); - iw = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer()).setOpenMode(OpenMode.APPEND)); + iw = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random)).setOpenMode(OpenMode.APPEND)); addDocument(iw, "Hello, tellus!"); addDocument(iw, "All work and no play makes danny a dull boy"); iw.close(); - iw = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer()).setOpenMode(OpenMode.APPEND)); + iw = new IndexWriter(dir, new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random)).setOpenMode(OpenMode.APPEND)); addDocument(iw, "Hello, earth!"); addDocument(iw, "All work and no play makes wendy a dull girl"); iw.close(); diff --git a/lucene/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java b/lucene/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java index 8103b01f30c..947d2eb658b 100644 --- a/lucene/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java +++ b/lucene/contrib/memory/src/java/org/apache/lucene/index/memory/MemoryIndex.java @@ -52,7 +52,7 @@ import org.apache.lucene.index.TermPositionVector; import 
org.apache.lucene.index.TermVectorMapper; import org.apache.lucene.index.FieldInvertState; import org.apache.lucene.index.IndexReader.ReaderContext; -import org.apache.lucene.index.values.DocValues; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.search.Collector; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.Query; @@ -807,12 +807,6 @@ public class MemoryIndex { public TermsEnum terms() { return new MemoryTermsEnum(sortedFields[upto].getValue()); } - - @Override - public DocValues docValues() throws IOException { - // TODO - throw new UnsupportedOperationException("not implemented"); - } }; } @@ -848,12 +842,6 @@ public class MemoryIndex { }; } } - - @Override - public DocValues docValues(String field) throws IOException { - // TODO - throw new UnsupportedOperationException("not implemented"); - } }; } @@ -1287,6 +1275,11 @@ public class MemoryIndex { return Collections.unmodifiableSet(fields.keySet()); } + + @Override + public PerDocValues perDocValues() throws IOException { + return null; + } } diff --git a/lucene/contrib/memory/src/test/org/apache/lucene/index/memory/MemoryIndexTest.java b/lucene/contrib/memory/src/test/org/apache/lucene/index/memory/MemoryIndexTest.java index 8be566a1c99..197721b2236 100644 --- a/lucene/contrib/memory/src/test/org/apache/lucene/index/memory/MemoryIndexTest.java +++ b/lucene/contrib/memory/src/test/org/apache/lucene/index/memory/MemoryIndexTest.java @@ -143,9 +143,9 @@ public class MemoryIndexTest extends BaseTokenStreamTestCase { */ private Analyzer randomAnalyzer() { switch(random.nextInt(3)) { - case 0: return new MockAnalyzer(MockTokenizer.SIMPLE, true); - case 1: return new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true); - default: return new MockAnalyzer(MockTokenizer.WHITESPACE, false); + case 0: return new MockAnalyzer(random, MockTokenizer.SIMPLE, true); + case 1: return new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true); + default: return new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); } } diff --git a/lucene/contrib/misc/src/test/org/apache/lucene/index/TestFieldNormModifier.java b/lucene/contrib/misc/src/test/org/apache/lucene/index/TestFieldNormModifier.java index 72d186090fc..a3ad664e05e 100644 --- a/lucene/contrib/misc/src/test/org/apache/lucene/index/TestFieldNormModifier.java +++ b/lucene/contrib/misc/src/test/org/apache/lucene/index/TestFieldNormModifier.java @@ -61,7 +61,7 @@ public class TestFieldNormModifier extends LuceneTestCase { super.setUp(); store = newDirectory(); IndexWriter writer = new IndexWriter(store, newIndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer()).setMergePolicy(newInOrderLogMergePolicy())); + TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMergePolicy(newLogMergePolicy())); for (int i = 0; i < NUM_DOCS; i++) { Document d = new Document(); diff --git a/lucene/contrib/misc/src/test/org/apache/lucene/index/TestIndexSplitter.java b/lucene/contrib/misc/src/test/org/apache/lucene/index/TestIndexSplitter.java index 9e4d20fb916..2b8b47dea07 100644 --- a/lucene/contrib/misc/src/test/org/apache/lucene/index/TestIndexSplitter.java +++ b/lucene/contrib/misc/src/test/org/apache/lucene/index/TestIndexSplitter.java @@ -39,7 +39,7 @@ public class TestIndexSplitter extends LuceneTestCase { mergePolicy.setNoCFSRatio(1); IndexWriter iw = new IndexWriter( fsDir, - new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer()). 
+ new IndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random)). setOpenMode(OpenMode.CREATE). setMergePolicy(mergePolicy) ); diff --git a/lucene/contrib/misc/src/test/org/apache/lucene/index/TestMultiPassIndexSplitter.java b/lucene/contrib/misc/src/test/org/apache/lucene/index/TestMultiPassIndexSplitter.java index 158b24ff58b..776d0c9960d 100644 --- a/lucene/contrib/misc/src/test/org/apache/lucene/index/TestMultiPassIndexSplitter.java +++ b/lucene/contrib/misc/src/test/org/apache/lucene/index/TestMultiPassIndexSplitter.java @@ -32,7 +32,7 @@ public class TestMultiPassIndexSplitter extends LuceneTestCase { public void setUp() throws Exception { super.setUp(); dir = newDirectory(); - IndexWriter w = new IndexWriter(dir, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer()).setMergePolicy(newInOrderLogMergePolicy())); + IndexWriter w = new IndexWriter(dir, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMergePolicy(newLogMergePolicy())); Document doc; for (int i = 0; i < NUM_DOCS; i++) { doc = new Document(); diff --git a/lucene/contrib/misc/src/test/org/apache/lucene/index/TestTermVectorAccessor.java b/lucene/contrib/misc/src/test/org/apache/lucene/index/TestTermVectorAccessor.java index 65e6bca1d66..dd79d727835 100644 --- a/lucene/contrib/misc/src/test/org/apache/lucene/index/TestTermVectorAccessor.java +++ b/lucene/contrib/misc/src/test/org/apache/lucene/index/TestTermVectorAccessor.java @@ -25,7 +25,7 @@ public class TestTermVectorAccessor extends LuceneTestCase { public void test() throws Exception { Directory dir = newDirectory(); - IndexWriter iw = new IndexWriter(dir, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer())); + IndexWriter iw = new IndexWriter(dir, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))); Document doc; diff --git a/lucene/contrib/misc/src/test/org/apache/lucene/index/codecs/appending/TestAppendingCodec.java b/lucene/contrib/misc/src/test/org/apache/lucene/index/codecs/appending/TestAppendingCodec.java index 593c895cf66..125cc1847a9 100644 --- a/lucene/contrib/misc/src/test/org/apache/lucene/index/codecs/appending/TestAppendingCodec.java +++ b/lucene/contrib/misc/src/test/org/apache/lucene/index/codecs/appending/TestAppendingCodec.java @@ -30,7 +30,7 @@ import org.apache.lucene.index.Fields; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.index.LogMergePolicy; +import org.apache.lucene.index.TieredMergePolicy; import org.apache.lucene.index.MultiFields; import org.apache.lucene.index.Terms; import org.apache.lucene.index.TermsEnum; @@ -134,10 +134,10 @@ public class TestAppendingCodec extends LuceneTestCase { public void testCodec() throws Exception { Directory dir = new AppendingRAMDirectory(random, new RAMDirectory()); - IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40, new MockAnalyzer()); + IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_40, new MockAnalyzer(random)); cfg.setCodecProvider(new AppendingCodecProvider()); - ((LogMergePolicy)cfg.getMergePolicy()).setUseCompoundFile(false); + ((TieredMergePolicy)cfg.getMergePolicy()).setUseCompoundFile(false); IndexWriter writer = new IndexWriter(dir, cfg); Document doc = new Document(); doc.add(newField("f", text, Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS)); diff --git a/lucene/contrib/misc/src/test/org/apache/lucene/misc/TestHighFreqTerms.java 
b/lucene/contrib/misc/src/test/org/apache/lucene/misc/TestHighFreqTerms.java index cb33cfa8be9..5d6eb8ad8a9 100644 --- a/lucene/contrib/misc/src/test/org/apache/lucene/misc/TestHighFreqTerms.java +++ b/lucene/contrib/misc/src/test/org/apache/lucene/misc/TestHighFreqTerms.java @@ -40,7 +40,7 @@ public class TestHighFreqTerms extends LuceneTestCase { public static void setUpClass() throws Exception { dir = newDirectory(); writer = new IndexWriter(dir, newIndexWriterConfig(random, - TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false)) + TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)) .setMaxBufferedDocs(2)); writer.setInfoStream(VERBOSE ? System.out : null); indexDocs(writer); diff --git a/lucene/contrib/misc/src/test/org/apache/lucene/misc/TestLengthNormModifier.java b/lucene/contrib/misc/src/test/org/apache/lucene/misc/TestLengthNormModifier.java index 5bd4ad530f5..ad290c7e490 100644 --- a/lucene/contrib/misc/src/test/org/apache/lucene/misc/TestLengthNormModifier.java +++ b/lucene/contrib/misc/src/test/org/apache/lucene/misc/TestLengthNormModifier.java @@ -66,7 +66,7 @@ public class TestLengthNormModifier extends LuceneTestCase { super.setUp(); store = newDirectory(); IndexWriter writer = new IndexWriter(store, newIndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer()).setMergePolicy(newInOrderLogMergePolicy())); + TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMergePolicy(newLogMergePolicy())); for (int i = 0; i < NUM_DOCS; i++) { Document d = new Document(); diff --git a/lucene/contrib/queries/src/test/org/apache/lucene/search/BooleanFilterTest.java b/lucene/contrib/queries/src/test/org/apache/lucene/search/BooleanFilterTest.java index 6fb01df5736..5a1bf66dab3 100644 --- a/lucene/contrib/queries/src/test/org/apache/lucene/search/BooleanFilterTest.java +++ b/lucene/contrib/queries/src/test/org/apache/lucene/search/BooleanFilterTest.java @@ -39,7 +39,7 @@ public class BooleanFilterTest extends LuceneTestCase { public void setUp() throws Exception { super.setUp(); directory = newDirectory(); - RandomIndexWriter writer = new RandomIndexWriter(random, directory, new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + RandomIndexWriter writer = new RandomIndexWriter(random, directory, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); //Add series of docs with filterable fields : acces rights, prices, dates and "in-stock" flags addDoc(writer, "admin guest", "010", "20040101","Y"); diff --git a/lucene/contrib/queries/src/test/org/apache/lucene/search/DuplicateFilterTest.java b/lucene/contrib/queries/src/test/org/apache/lucene/search/DuplicateFilterTest.java index 29c7f0f2e37..b4a6c8885bb 100644 --- a/lucene/contrib/queries/src/test/org/apache/lucene/search/DuplicateFilterTest.java +++ b/lucene/contrib/queries/src/test/org/apache/lucene/search/DuplicateFilterTest.java @@ -43,7 +43,7 @@ public class DuplicateFilterTest extends LuceneTestCase { public void setUp() throws Exception { super.setUp(); directory = newDirectory(); - RandomIndexWriter writer = new RandomIndexWriter(random, directory, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer()).setMergePolicy(newInOrderLogMergePolicy())); + RandomIndexWriter writer = new RandomIndexWriter(random, directory, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMergePolicy(newLogMergePolicy())); //Add series of docs with filterable fields : url, text and dates flags addDoc(writer, "http://lucene.apache.org", "lucene 1.4.3 available", "20040101"); diff 
--git a/lucene/contrib/queries/src/test/org/apache/lucene/search/FuzzyLikeThisQueryTest.java b/lucene/contrib/queries/src/test/org/apache/lucene/search/FuzzyLikeThisQueryTest.java index 0f9b6ca7712..5957bf05751 100644 --- a/lucene/contrib/queries/src/test/org/apache/lucene/search/FuzzyLikeThisQueryTest.java +++ b/lucene/contrib/queries/src/test/org/apache/lucene/search/FuzzyLikeThisQueryTest.java @@ -34,13 +34,13 @@ public class FuzzyLikeThisQueryTest extends LuceneTestCase { private Directory directory; private IndexSearcher searcher; private IndexReader reader; - private Analyzer analyzer=new MockAnalyzer(); + private Analyzer analyzer=new MockAnalyzer(random); @Override public void setUp() throws Exception { super.setUp(); directory = newDirectory(); - RandomIndexWriter writer = new RandomIndexWriter(random, directory, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer()).setMergePolicy(newInOrderLogMergePolicy())); + RandomIndexWriter writer = new RandomIndexWriter(random, directory, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random)).setMergePolicy(newLogMergePolicy())); //Add series of docs with misspelt names addDoc(writer, "jonathon smythe","1"); @@ -121,7 +121,7 @@ public class FuzzyLikeThisQueryTest extends LuceneTestCase { } public void testFuzzyLikeThisQueryEquals() { - Analyzer analyzer = new MockAnalyzer(); + Analyzer analyzer = new MockAnalyzer(random); FuzzyLikeThisQuery fltq1 = new FuzzyLikeThisQuery(10, analyzer); fltq1.addTerms("javi", "subject", 0.5f, 2); FuzzyLikeThisQuery fltq2 = new FuzzyLikeThisQuery(10, analyzer); diff --git a/lucene/contrib/queries/src/test/org/apache/lucene/search/TestFieldCacheRewriteMethod.java b/lucene/contrib/queries/src/test/org/apache/lucene/search/TestFieldCacheRewriteMethod.java index 73f666eee10..b261cdea031 100644 --- a/lucene/contrib/queries/src/test/org/apache/lucene/search/TestFieldCacheRewriteMethod.java +++ b/lucene/contrib/queries/src/test/org/apache/lucene/search/TestFieldCacheRewriteMethod.java @@ -36,8 +36,8 @@ public class TestFieldCacheRewriteMethod extends TestRegexpRandom2 { RegexpQuery filter = new RegexpQuery(new Term("field", regexp), RegExp.NONE); filter.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE); - TopDocs fieldCacheDocs = searcher.search(fieldCache, 25); - TopDocs filterDocs = searcher.search(filter, 25); + TopDocs fieldCacheDocs = searcher1.search(fieldCache, 25); + TopDocs filterDocs = searcher2.search(filter, 25); CheckHits.checkEqual(fieldCache, fieldCacheDocs.scoreDocs, filterDocs.scoreDocs); } diff --git a/lucene/contrib/queries/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java b/lucene/contrib/queries/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java index fd32f13abe6..ae7ad5f202c 100644 --- a/lucene/contrib/queries/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java +++ b/lucene/contrib/queries/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java @@ -56,7 +56,7 @@ public class TestSpanRegexQuery extends LuceneTestCase { public void testSpanRegex() throws Exception { Directory directory = newDirectory(); IndexWriter writer = new IndexWriter(directory, newIndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer())); + TEST_VERSION_CURRENT, new MockAnalyzer(random))); Document doc = new Document(); // doc.add(newField("field", "the quick brown fox jumps over the lazy dog", // Field.Store.NO, Field.Index.ANALYZED)); @@ -97,14 +97,14 @@ public class TestSpanRegexQuery extends LuceneTestCase { // creating first index 
writer IndexWriter writerA = new IndexWriter(indexStoreA, newIndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer()).setOpenMode(OpenMode.CREATE)); + TEST_VERSION_CURRENT, new MockAnalyzer(random)).setOpenMode(OpenMode.CREATE)); writerA.addDocument(lDoc); writerA.optimize(); writerA.close(); // creating second index writer IndexWriter writerB = new IndexWriter(indexStoreB, newIndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer()).setOpenMode(OpenMode.CREATE)); + TEST_VERSION_CURRENT, new MockAnalyzer(random)).setOpenMode(OpenMode.CREATE)); writerB.addDocument(lDoc2); writerB.optimize(); writerB.close(); diff --git a/lucene/contrib/queries/src/test/org/apache/lucene/search/similar/TestMoreLikeThis.java b/lucene/contrib/queries/src/test/org/apache/lucene/search/similar/TestMoreLikeThis.java index 6de5e91ddc5..26b6a191c4b 100644 --- a/lucene/contrib/queries/src/test/org/apache/lucene/search/similar/TestMoreLikeThis.java +++ b/lucene/contrib/queries/src/test/org/apache/lucene/search/similar/TestMoreLikeThis.java @@ -74,7 +74,7 @@ public class TestMoreLikeThis extends LuceneTestCase { Map originalValues = getOriginalValues(); MoreLikeThis mlt = new MoreLikeThis(reader); - mlt.setAnalyzer(new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + mlt.setAnalyzer(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); mlt.setMinDocFreq(1); mlt.setMinTermFreq(1); mlt.setMinWordLen(1); @@ -109,7 +109,7 @@ public class TestMoreLikeThis extends LuceneTestCase { private Map getOriginalValues() throws IOException { Map originalValues = new HashMap (); MoreLikeThis mlt = new MoreLikeThis(reader); - mlt.setAnalyzer(new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + mlt.setAnalyzer(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); mlt.setMinDocFreq(1); mlt.setMinTermFreq(1); mlt.setMinWordLen(1); diff --git a/lucene/contrib/queryparser/src/java/org/apache/lucene/queryParser/core/nodes/QueryNodeImpl.java b/lucene/contrib/queryparser/src/java/org/apache/lucene/queryParser/core/nodes/QueryNodeImpl.java index 745d8f1529c..dcc4811febc 100644 --- a/lucene/contrib/queryparser/src/java/org/apache/lucene/queryParser/core/nodes/QueryNodeImpl.java +++ b/lucene/contrib/queryparser/src/java/org/apache/lucene/queryParser/core/nodes/QueryNodeImpl.java @@ -160,7 +160,7 @@ public abstract class QueryNodeImpl implements QueryNode, Cloneable { /** verify if a node contains a tag */ public boolean containsTag(String tagName) { - return this.tags.containsKey(tagName); + return this.tags.containsKey(tagName.toLowerCase()); } public Object getTag(String tagName) { diff --git a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/complexPhrase/TestComplexPhraseQuery.java b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/complexPhrase/TestComplexPhraseQuery.java index f163a4cece5..b8aaae839c7 100644 --- a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/complexPhrase/TestComplexPhraseQuery.java +++ b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/complexPhrase/TestComplexPhraseQuery.java @@ -34,7 +34,7 @@ import org.apache.lucene.util.LuceneTestCase; public class TestComplexPhraseQuery extends LuceneTestCase { Directory rd; - Analyzer analyzer = new MockAnalyzer(); + Analyzer analyzer = new MockAnalyzer(random); DocData docsContent[] = { new DocData("john smith", "1"), new DocData("johathon smith", "2"), diff --git a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/core/nodes/TestQueryNode.java 
b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/core/nodes/TestQueryNode.java index 23d4fb4ef4c..b805a438ce1 100644 --- a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/core/nodes/TestQueryNode.java +++ b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/core/nodes/TestQueryNode.java @@ -32,4 +32,16 @@ public class TestQueryNode extends LuceneTestCase { bq.add(Arrays.asList(nodeB)); assertEquals(2, bq.getChildren().size()); } + + /* LUCENE-3045 bug in QueryNodeImpl.containsTag(String key)*/ + public void testTags() throws Exception { + QueryNode node = new FieldQueryNode("foo", "A", 0, 1); + + node.setTag("TaG", new Object()); + assertTrue(node.getTagMap().size() > 0); + assertTrue(node.containsTag("tAg")); + assertTrue(node.getTag("tAg") != null); + + } + } diff --git a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/ext/TestExtendableQueryParser.java b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/ext/TestExtendableQueryParser.java index c60a0eb6895..366168feed7 100644 --- a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/ext/TestExtendableQueryParser.java +++ b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/ext/TestExtendableQueryParser.java @@ -43,7 +43,7 @@ public class TestExtendableQueryParser extends TestQueryParser { public QueryParser getParser(Analyzer a, Extensions extensions) throws Exception { if (a == null) - a = new MockAnalyzer(MockTokenizer.SIMPLE, true); + a = new MockAnalyzer(random, MockTokenizer.SIMPLE, true); QueryParser qp = extensions == null ? new ExtendableQueryParser( TEST_VERSION_CURRENT, "field", a) : new ExtendableQueryParser( TEST_VERSION_CURRENT, "field", a, extensions); diff --git a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/precedence/TestPrecedenceQueryParser.java b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/precedence/TestPrecedenceQueryParser.java index 5cba05b3111..cd719791f35 100644 --- a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/precedence/TestPrecedenceQueryParser.java +++ b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/precedence/TestPrecedenceQueryParser.java @@ -125,7 +125,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { public PrecedenceQueryParser getParser(Analyzer a) throws Exception { if (a == null) - a = new MockAnalyzer(MockTokenizer.SIMPLE, true); + a = new MockAnalyzer(random, MockTokenizer.SIMPLE, true); PrecedenceQueryParser qp = new PrecedenceQueryParser(); qp.setAnalyzer(a); qp.setDefaultOperator(Operator.OR); @@ -171,7 +171,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { public Query getQueryDOA(String query, Analyzer a) throws Exception { if (a == null) - a = new MockAnalyzer(MockTokenizer.SIMPLE, true); + a = new MockAnalyzer(random, MockTokenizer.SIMPLE, true); PrecedenceQueryParser qp = new PrecedenceQueryParser(); qp.setAnalyzer(a); qp.setDefaultOperator(Operator.AND); @@ -232,7 +232,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { "+(title:dog title:cat) -author:\"bob dole\""); PrecedenceQueryParser qp = new PrecedenceQueryParser(); - qp.setAnalyzer(new MockAnalyzer()); + qp.setAnalyzer(new MockAnalyzer(random)); // make sure OR is the default: assertEquals(Operator.OR, qp.getDefaultOperator()); qp.setDefaultOperator(Operator.AND); @@ -246,7 +246,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { } public void testPunct() throws 
Exception { - Analyzer a = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + Analyzer a = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); assertQueryEquals("a&b", a, "a&b"); assertQueryEquals("a&&b", a, "a&&b"); assertQueryEquals(".NET", a, ".NET"); @@ -266,7 +266,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { assertQueryEquals("term 1.0 1 2", null, "term"); assertQueryEquals("term term1 term2", null, "term term term"); - Analyzer a = new MockAnalyzer(); + Analyzer a = new MockAnalyzer(random); assertQueryEquals("3", a, "3"); assertQueryEquals("term 1.0 1 2", a, "term 1.0 1 2"); assertQueryEquals("term term1 term2", a, "term term1 term2"); @@ -405,7 +405,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { final String defaultField = "default"; final String monthField = "month"; final String hourField = "hour"; - PrecedenceQueryParser qp = new PrecedenceQueryParser(new MockAnalyzer()); + PrecedenceQueryParser qp = new PrecedenceQueryParser(new MockAnalyzer(random)); Map fieldMap = new HashMap (); // set a field specific date resolution @@ -467,7 +467,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { } public void testEscaped() throws Exception { - Analyzer a = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + Analyzer a = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); assertQueryEquals("a\\-b:c", a, "a-b:c"); assertQueryEquals("a\\+b:c", a, "a+b:c"); @@ -533,7 +533,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { public void testBoost() throws Exception { CharacterRunAutomaton stopSet = new CharacterRunAutomaton(BasicAutomata.makeString("on")); - Analyzer oneStopAnalyzer = new MockAnalyzer(MockTokenizer.SIMPLE, true, stopSet, true); + Analyzer oneStopAnalyzer = new MockAnalyzer(random, MockTokenizer.SIMPLE, true, stopSet, true); PrecedenceQueryParser qp = new PrecedenceQueryParser(); qp.setAnalyzer(oneStopAnalyzer); @@ -548,7 +548,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { q = qp.parse("\"on\"^1.0", "field"); assertNotNull(q); - q = getParser(new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true)).parse("the^3", + q = getParser(new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true)).parse("the^3", "field"); assertNotNull(q); } @@ -564,7 +564,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { public void testBooleanQuery() throws Exception { BooleanQuery.setMaxClauseCount(2); try { - getParser(new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("one two three", "field"); + getParser(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("one two three", "field"); fail("ParseException expected due to too many boolean clauses"); } catch (QueryNodeException expected) { // too many boolean clauses, so ParseException is expected @@ -573,7 +573,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { // LUCENE-792 public void testNOT() throws Exception { - Analyzer a = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + Analyzer a = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); assertQueryEquals("NOT foo AND bar", a, "-foo +bar"); } @@ -582,7 +582,7 @@ public class TestPrecedenceQueryParser extends LuceneTestCase { * issue has been corrected. 
*/ public void testPrecedence() throws Exception { - PrecedenceQueryParser parser = getParser(new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + PrecedenceQueryParser parser = getParser(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); Query query1 = parser.parse("A AND B OR C AND D", "field"); Query query2 = parser.parse("(A AND B) OR (C AND D)", "field"); assertEquals(query1, query2); diff --git a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestMultiFieldQPHelper.java b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestMultiFieldQPHelper.java index 55e9e183c09..11027b74bdf 100644 --- a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestMultiFieldQPHelper.java +++ b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestMultiFieldQPHelper.java @@ -80,7 +80,7 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { String[] fields = { "b", "t" }; StandardQueryParser mfqp = new StandardQueryParser(); mfqp.setMultiFields(fields); - mfqp.setAnalyzer(new MockAnalyzer()); + mfqp.setAnalyzer(new MockAnalyzer(random)); Query q = mfqp.parse("one", null); assertEquals("b:one t:one", q.toString()); @@ -150,7 +150,7 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { StandardQueryParser mfqp = new StandardQueryParser(); mfqp.setMultiFields(fields); mfqp.setFieldsBoost(boosts); - mfqp.setAnalyzer(new MockAnalyzer()); + mfqp.setAnalyzer(new MockAnalyzer(random)); // Check for simple Query q = mfqp.parse("one", null); @@ -178,24 +178,24 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { public void testStaticMethod1() throws QueryNodeException { String[] fields = { "b", "t" }; String[] queries = { "one", "two" }; - Query q = QueryParserUtil.parse(queries, fields, new MockAnalyzer()); + Query q = QueryParserUtil.parse(queries, fields, new MockAnalyzer(random)); assertEquals("b:one t:two", q.toString()); String[] queries2 = { "+one", "+two" }; - q = QueryParserUtil.parse(queries2, fields, new MockAnalyzer()); + q = QueryParserUtil.parse(queries2, fields, new MockAnalyzer(random)); assertEquals("(+b:one) (+t:two)", q.toString()); String[] queries3 = { "one", "+two" }; - q = QueryParserUtil.parse(queries3, fields, new MockAnalyzer()); + q = QueryParserUtil.parse(queries3, fields, new MockAnalyzer(random)); assertEquals("b:one (+t:two)", q.toString()); String[] queries4 = { "one +more", "+two" }; - q = QueryParserUtil.parse(queries4, fields, new MockAnalyzer()); + q = QueryParserUtil.parse(queries4, fields, new MockAnalyzer(random)); assertEquals("(b:one +b:more) (+t:two)", q.toString()); String[] queries5 = { "blah" }; try { - q = QueryParserUtil.parse(queries5, fields, new MockAnalyzer()); + q = QueryParserUtil.parse(queries5, fields, new MockAnalyzer(random)); fail(); } catch (IllegalArgumentException e) { // expected exception, array length differs @@ -219,15 +219,15 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { BooleanClause.Occur[] flags = { BooleanClause.Occur.MUST, BooleanClause.Occur.MUST_NOT }; Query q = QueryParserUtil.parse("one", fields, flags, - new MockAnalyzer()); + new MockAnalyzer(random)); assertEquals("+b:one -t:one", q.toString()); - q = QueryParserUtil.parse("one two", fields, flags, new MockAnalyzer()); + q = QueryParserUtil.parse("one two", fields, flags, new MockAnalyzer(random)); assertEquals("+(b:one b:two) -(t:one t:two)", q.toString()); try { BooleanClause.Occur[] flags2 = { BooleanClause.Occur.MUST }; - q = 
QueryParserUtil.parse("blah", fields, flags2, new MockAnalyzer()); + q = QueryParserUtil.parse("blah", fields, flags2, new MockAnalyzer(random)); fail(); } catch (IllegalArgumentException e) { // expected exception, array length differs @@ -240,19 +240,19 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { BooleanClause.Occur.MUST_NOT }; StandardQueryParser parser = new StandardQueryParser(); parser.setMultiFields(fields); - parser.setAnalyzer(new MockAnalyzer()); + parser.setAnalyzer(new MockAnalyzer(random)); Query q = QueryParserUtil.parse("one", fields, flags, - new MockAnalyzer());// , fields, flags, new + new MockAnalyzer(random));// , fields, flags, new // MockAnalyzer()); assertEquals("+b:one -t:one", q.toString()); - q = QueryParserUtil.parse("one two", fields, flags, new MockAnalyzer()); + q = QueryParserUtil.parse("one two", fields, flags, new MockAnalyzer(random)); assertEquals("+(b:one b:two) -(t:one t:two)", q.toString()); try { BooleanClause.Occur[] flags2 = { BooleanClause.Occur.MUST }; - q = QueryParserUtil.parse("blah", fields, flags2, new MockAnalyzer()); + q = QueryParserUtil.parse("blah", fields, flags2, new MockAnalyzer(random)); fail(); } catch (IllegalArgumentException e) { // expected exception, array length differs @@ -265,13 +265,13 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { BooleanClause.Occur[] flags = { BooleanClause.Occur.MUST, BooleanClause.Occur.MUST_NOT, BooleanClause.Occur.SHOULD }; Query q = QueryParserUtil.parse(queries, fields, flags, - new MockAnalyzer()); + new MockAnalyzer(random)); assertEquals("+f1:one -f2:two f3:three", q.toString()); try { BooleanClause.Occur[] flags2 = { BooleanClause.Occur.MUST }; q = QueryParserUtil - .parse(queries, fields, flags2, new MockAnalyzer()); + .parse(queries, fields, flags2, new MockAnalyzer(random)); fail(); } catch (IllegalArgumentException e) { // expected exception, array length differs @@ -284,13 +284,13 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { BooleanClause.Occur[] flags = { BooleanClause.Occur.MUST, BooleanClause.Occur.MUST_NOT }; Query q = QueryParserUtil.parse(queries, fields, flags, - new MockAnalyzer()); + new MockAnalyzer(random)); assertEquals("+b:one -t:two", q.toString()); try { BooleanClause.Occur[] flags2 = { BooleanClause.Occur.MUST }; q = QueryParserUtil - .parse(queries, fields, flags2, new MockAnalyzer()); + .parse(queries, fields, flags2, new MockAnalyzer(random)); fail(); } catch (IllegalArgumentException e) { // expected exception, array length differs @@ -316,7 +316,7 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { } public void testStopWordSearching() throws Exception { - Analyzer analyzer = new MockAnalyzer(); + Analyzer analyzer = new MockAnalyzer(random); Directory ramDir = newDirectory(); IndexWriter iw = new IndexWriter(ramDir, newIndexWriterConfig(TEST_VERSION_CURRENT, analyzer)); Document doc = new Document(); @@ -342,7 +342,7 @@ public class TestMultiFieldQPHelper extends LuceneTestCase { * Return empty tokens for field "f1". 
*/ private static final class AnalyzerReturningNull extends Analyzer { - MockAnalyzer stdAnalyzer = new MockAnalyzer(); + MockAnalyzer stdAnalyzer = new MockAnalyzer(random); public AnalyzerReturningNull() { } diff --git a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestQPHelper.java b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestQPHelper.java index 563aaf2fd10..e3de2ee0aa3 100644 --- a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestQPHelper.java +++ b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/standard/TestQPHelper.java @@ -191,7 +191,7 @@ public class TestQPHelper extends LuceneTestCase { public StandardQueryParser getParser(Analyzer a) throws Exception { if (a == null) - a = new MockAnalyzer(MockTokenizer.SIMPLE, true); + a = new MockAnalyzer(random, MockTokenizer.SIMPLE, true); StandardQueryParser qp = new StandardQueryParser(); qp.setAnalyzer(a); @@ -281,7 +281,7 @@ public class TestQPHelper extends LuceneTestCase { public Query getQueryDOA(String query, Analyzer a) throws Exception { if (a == null) - a = new MockAnalyzer(MockTokenizer.SIMPLE, true); + a = new MockAnalyzer(random, MockTokenizer.SIMPLE, true); StandardQueryParser qp = new StandardQueryParser(); qp.setAnalyzer(a); qp.setDefaultOperator(Operator.AND); @@ -301,7 +301,7 @@ public class TestQPHelper extends LuceneTestCase { } public void testConstantScoreAutoRewrite() throws Exception { - StandardQueryParser qp = new StandardQueryParser(new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + StandardQueryParser qp = new StandardQueryParser(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); Query q = qp.parse("foo*bar", "field"); assertTrue(q instanceof WildcardQuery); assertEquals(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT, ((MultiTermQuery) q).getRewriteMethod()); @@ -410,9 +410,9 @@ public class TestQPHelper extends LuceneTestCase { public void testSimple() throws Exception { assertQueryEquals("\"term germ\"~2", null, "\"term germ\"~2"); assertQueryEquals("term term term", null, "term term term"); - assertQueryEquals("türm term term", new MockAnalyzer(MockTokenizer.WHITESPACE, false), + assertQueryEquals("türm term term", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false), "türm term term"); - assertQueryEquals("ümlaut", new MockAnalyzer(MockTokenizer.WHITESPACE, false), "ümlaut"); + assertQueryEquals("ümlaut", new MockAnalyzer(random, MockTokenizer.WHITESPACE, false), "ümlaut"); // FIXME: change MockAnalyzer to not extend CharTokenizer for this test //assertQueryEquals("\"\"", new KeywordAnalyzer(), ""); @@ -470,7 +470,7 @@ public class TestQPHelper extends LuceneTestCase { } public void testPunct() throws Exception { - Analyzer a = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + Analyzer a = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); assertQueryEquals("a&b", a, "a&b"); assertQueryEquals("a&&b", a, "a&&b"); assertQueryEquals(".NET", a, ".NET"); @@ -491,7 +491,7 @@ public class TestQPHelper extends LuceneTestCase { assertQueryEquals("term 1.0 1 2", null, "term"); assertQueryEquals("term term1 term2", null, "term term term"); - Analyzer a = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + Analyzer a = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); assertQueryEquals("3", a, "3"); assertQueryEquals("term 1.0 1 2", a, "term 1.0 1 2"); assertQueryEquals("term term1 term2", a, "term term1 term2"); @@ -726,7 7@@ public class TestQPHelper extends
LuceneTestCase { } public void testEscaped() throws Exception { - Analyzer a = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + Analyzer a = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); /* * assertQueryEquals("\\[brackets", a, "\\[brackets"); @@ -825,7 +825,7 @@ public class TestQPHelper extends LuceneTestCase { } public void testQueryStringEscaping() throws Exception { - Analyzer a = new MockAnalyzer(MockTokenizer.WHITESPACE, false); + Analyzer a = new MockAnalyzer(random, MockTokenizer.WHITESPACE, false); assertEscapedQueryEquals("a-b:c", a, "a\\-b\\:c"); assertEscapedQueryEquals("a+b:c", a, "a\\+b\\:c"); @@ -866,7 +866,7 @@ public class TestQPHelper extends LuceneTestCase { @Ignore("contrib queryparser shouldn't escape wildcard terms") public void testEscapedWildcard() throws Exception { StandardQueryParser qp = new StandardQueryParser(); - qp.setAnalyzer(new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + qp.setAnalyzer(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); WildcardQuery q = new WildcardQuery(new Term("field", "foo\\?ba?r")); assertEquals(q, qp.parse("foo\\?ba?r", "field")); @@ -904,7 +904,7 @@ public class TestQPHelper extends LuceneTestCase { public void testBoost() throws Exception { CharacterRunAutomaton stopSet = new CharacterRunAutomaton(BasicAutomata.makeString("on")); - Analyzer oneStopAnalyzer = new MockAnalyzer(MockTokenizer.SIMPLE, true, stopSet, true); + Analyzer oneStopAnalyzer = new MockAnalyzer(random, MockTokenizer.SIMPLE, true, stopSet, true); StandardQueryParser qp = new StandardQueryParser(); qp.setAnalyzer(oneStopAnalyzer); @@ -920,7 +920,7 @@ public class TestQPHelper extends LuceneTestCase { assertNotNull(q); StandardQueryParser qp2 = new StandardQueryParser(); - qp2.setAnalyzer(new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true)); + qp2.setAnalyzer(new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true)); q = qp2.parse("the^3", "field"); // "the" is a stop word so the result is an empty query: @@ -950,7 +950,7 @@ public class TestQPHelper extends LuceneTestCase { public void testCustomQueryParserWildcard() { try { - new QPTestParser(new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("a?t", "contents"); + new QPTestParser(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("a?t", "contents"); fail("Wildcard queries should not be allowed"); } catch (QueryNodeException expected) { // expected exception @@ -959,7 +959,7 @@ public class TestQPHelper extends LuceneTestCase { public void testCustomQueryParserFuzzy() throws Exception { try { - new QPTestParser(new MockAnalyzer(MockTokenizer.WHITESPACE, false)).parse("xunit~", "contents"); + new QPTestParser(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)).parse("xunit~", "contents"); fail("Fuzzy queries should not be allowed"); } catch (QueryNodeException expected) { // expected exception @@ -970,7 +970,7 @@ public class TestQPHelper extends LuceneTestCase { BooleanQuery.setMaxClauseCount(2); try { StandardQueryParser qp = new StandardQueryParser(); - qp.setAnalyzer(new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + qp.setAnalyzer(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); qp.parse("one two three", "field"); fail("ParseException expected due to too many boolean clauses"); @@ -984,7 +984,7 @@ public class TestQPHelper extends LuceneTestCase { */ public void testPrecedence() throws Exception { StandardQueryParser qp = new StandardQueryParser(); - qp.setAnalyzer(new 
MockAnalyzer(MockTokenizer.WHITESPACE, false)); + qp.setAnalyzer(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); Query query1 = qp.parse("A AND B OR C AND D", "field"); Query query2 = qp.parse("+A +B +C +D", "field"); @@ -995,7 +995,7 @@ public class TestQPHelper extends LuceneTestCase { // Todo: Convert from DateField to DateUtil // public void testLocalDateFormat() throws IOException, QueryNodeException { // Directory ramDir = newDirectory(); -// IndexWriter iw = new IndexWriter(ramDir, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false))); +// IndexWriter iw = new IndexWriter(ramDir, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false))); // addDateDoc("a", 2005, 12, 2, 10, 15, 33, iw); // addDateDoc("b", 2005, 12, 4, 22, 15, 00, iw); // iw.close(); @@ -1116,7 +1116,7 @@ public class TestQPHelper extends LuceneTestCase { public void testStopwords() throws Exception { StandardQueryParser qp = new StandardQueryParser(); CharacterRunAutomaton stopSet = new CharacterRunAutomaton(new RegExp("the|foo").toAutomaton()); - qp.setAnalyzer(new MockAnalyzer(MockTokenizer.SIMPLE, true, stopSet, true)); + qp.setAnalyzer(new MockAnalyzer(random, MockTokenizer.SIMPLE, true, stopSet, true)); Query result = qp.parse("a:the OR a:foo", "a"); assertNotNull("result is null and it shouldn't be", result); @@ -1140,7 +1140,7 @@ public class TestQPHelper extends LuceneTestCase { public void testPositionIncrement() throws Exception { StandardQueryParser qp = new StandardQueryParser(); qp.setAnalyzer( - new MockAnalyzer(MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true)); + new MockAnalyzer(random, MockTokenizer.SIMPLE, true, MockTokenFilter.ENGLISH_STOPSET, true)); qp.setEnablePositionIncrements(true); @@ -1161,7 +1161,7 @@ public class TestQPHelper extends LuceneTestCase { public void testMatchAllDocs() throws Exception { StandardQueryParser qp = new StandardQueryParser(); - qp.setAnalyzer(new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + qp.setAnalyzer(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); assertEquals(new MatchAllDocsQuery(), qp.parse("*:*", "field")); assertEquals(new MatchAllDocsQuery(), qp.parse("(*:*)", "field")); @@ -1173,7 +1173,7 @@ public class TestQPHelper extends LuceneTestCase { private void assertHits(int expected, String query, IndexSearcher is) throws IOException, QueryNodeException { StandardQueryParser qp = new StandardQueryParser(); - qp.setAnalyzer(new MockAnalyzer(MockTokenizer.WHITESPACE, false)); + qp.setAnalyzer(new MockAnalyzer(random, MockTokenizer.WHITESPACE, false)); qp.setLocale(Locale.ENGLISH); Query q = qp.parse(query, "date"); diff --git a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/surround/query/SingleFieldTestDb.java b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/surround/query/SingleFieldTestDb.java index f526d07d5c9..dffb925ed6b 100644 --- a/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/surround/query/SingleFieldTestDb.java +++ b/lucene/contrib/queryparser/src/test/org/apache/lucene/queryParser/surround/query/SingleFieldTestDb.java @@ -41,7 +41,7 @@ public class SingleFieldTestDb { fieldName = fName; IndexWriter writer = new IndexWriter(db, new IndexWriterConfig( Version.LUCENE_CURRENT, - new MockAnalyzer())); + new MockAnalyzer(random))); for (int j = 0; j < docs.length; j++) { Document d = new Document(); d.add(new Field(fieldName, docs[j], Field.Store.NO, 
Field.Index.ANALYZED)); diff --git a/lucene/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestCartesian.java b/lucene/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestCartesian.java index 3f417f40844..06bb23fa35e 100644 --- a/lucene/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestCartesian.java +++ b/lucene/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestCartesian.java @@ -71,7 +71,7 @@ public class TestCartesian extends LuceneTestCase { super.setUp(); directory = newDirectory(); - IndexWriter writer = new IndexWriter(directory, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer())); + IndexWriter writer = new IndexWriter(directory, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))); setUpPlotter( 2, 15); diff --git a/lucene/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestDistance.java b/lucene/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestDistance.java index 7aaa919a335..b23d3b382e6 100644 --- a/lucene/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestDistance.java +++ b/lucene/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestDistance.java @@ -47,7 +47,7 @@ public class TestDistance extends LuceneTestCase { public void setUp() throws Exception { super.setUp(); directory = newDirectory(); - writer = new IndexWriter(directory, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer())); + writer = new IndexWriter(directory, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random))); addData(writer); } diff --git a/lucene/contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java b/lucene/contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java index a4ed8407f2f..bfeae31581e 100755 --- a/lucene/contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java +++ b/lucene/contrib/spellchecker/src/java/org/apache/lucene/search/spell/SpellChecker.java @@ -29,7 +29,7 @@ import org.apache.lucene.document.Field; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; -import org.apache.lucene.index.LogMergePolicy; +import org.apache.lucene.index.TieredMergePolicy; import org.apache.lucene.index.Term; import org.apache.lucene.index.IndexWriterConfig.OpenMode; import org.apache.lucene.index.Terms; @@ -45,7 +45,6 @@ import org.apache.lucene.store.Directory; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.ReaderUtil; import org.apache.lucene.util.Version; -import org.apache.lucene.util.VirtualMethod; /** * @@ -508,7 +507,7 @@ public class SpellChecker implements java.io.Closeable { ensureOpen(); final Directory dir = this.spellIndex; final IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_CURRENT, new WhitespaceAnalyzer(Version.LUCENE_CURRENT)).setRAMBufferSizeMB(ramMB)); - ((LogMergePolicy) writer.getConfig().getMergePolicy()).setMergeFactor(mergeFactor); + ((TieredMergePolicy) writer.getConfig().getMergePolicy()).setMaxMergeAtOnce(mergeFactor); IndexSearcher indexSearcher = obtainSearcher(); final List
termsEnums = new ArrayList (); diff --git a/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestDirectSpellChecker.java b/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestDirectSpellChecker.java index ff9975d7381..3de4a91959a 100644 --- a/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestDirectSpellChecker.java +++ b/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestDirectSpellChecker.java @@ -35,7 +35,7 @@ public class TestDirectSpellChecker extends LuceneTestCase { spellChecker.setMinQueryLength(0); Directory dir = newDirectory(); RandomIndexWriter writer = new RandomIndexWriter(random, dir, - new MockAnalyzer(MockTokenizer.SIMPLE, true)); + new MockAnalyzer(random, MockTokenizer.SIMPLE, true)); for (int i = 0; i < 20; i++) { Document doc = new Document(); @@ -93,7 +93,7 @@ public class TestDirectSpellChecker extends LuceneTestCase { public void testOptions() throws Exception { Directory dir = newDirectory(); RandomIndexWriter writer = new RandomIndexWriter(random, dir, - new MockAnalyzer(MockTokenizer.SIMPLE, true)); + new MockAnalyzer(random, MockTokenizer.SIMPLE, true)); Document doc = new Document(); doc.add(newField("text", "foobar", Field.Store.NO, Field.Index.ANALYZED)); diff --git a/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLuceneDictionary.java b/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLuceneDictionary.java index 943b6d6daf0..e5cc7684f6d 100644 --- a/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLuceneDictionary.java +++ b/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestLuceneDictionary.java @@ -46,7 +46,7 @@ public class TestLuceneDictionary extends LuceneTestCase { public void setUp() throws Exception { super.setUp(); store = newDirectory(); - IndexWriter writer = new IndexWriter(store, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(MockTokenizer.WHITESPACE, false))); + IndexWriter writer = new IndexWriter(store, newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random, MockTokenizer.WHITESPACE, false))); Document doc; diff --git a/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java b/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java index ad753068edc..85313a0a811 100755 --- a/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java +++ b/lucene/contrib/spellchecker/src/test/org/apache/lucene/search/spell/TestSpellChecker.java @@ -54,7 +54,7 @@ public class TestSpellChecker extends LuceneTestCase { //create a user index userindex = newDirectory(); IndexWriter writer = new IndexWriter(userindex, new IndexWriterConfig( - TEST_VERSION_CURRENT, new MockAnalyzer())); + TEST_VERSION_CURRENT, new MockAnalyzer(random))); for (int i = 0; i < 1000; i++) { Document doc = new Document(); diff --git a/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Index.java b/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Index.java index 437b7e98973..abe77850bd5 100644 --- a/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Index.java +++ b/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Index.java @@ -36,7 +36,7 @@ import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; -import 
org.apache.lucene.index.LogMergePolicy; +import org.apache.lucene.index.TieredMergePolicy; import org.apache.lucene.index.IndexWriterConfig.OpenMode; import org.apache.lucene.store.FSDirectory; import org.apache.lucene.util.Version; @@ -250,7 +250,7 @@ public class Syns2Index // override the specific index if it already exists IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig( Version.LUCENE_CURRENT, ana).setOpenMode(OpenMode.CREATE)); - ((LogMergePolicy) writer.getConfig().getMergePolicy()).setUseCompoundFile(true); // why? + ((TieredMergePolicy) writer.getConfig().getMergePolicy()).setUseCompoundFile(true); // why? Iterator i1 = word2Nums.keySet().iterator(); while (i1.hasNext()) // for each word { diff --git a/lucene/contrib/wordnet/src/test/org/apache/lucene/wordnet/TestWordnet.java b/lucene/contrib/wordnet/src/test/org/apache/lucene/wordnet/TestWordnet.java index 52171479992..ccd855931a5 100644 --- a/lucene/contrib/wordnet/src/test/org/apache/lucene/wordnet/TestWordnet.java +++ b/lucene/contrib/wordnet/src/test/org/apache/lucene/wordnet/TestWordnet.java @@ -29,6 +29,7 @@ import org.apache.lucene.search.Query; import org.apache.lucene.search.TermQuery; import org.apache.lucene.store.Directory; import org.apache.lucene.util.LuceneTestCase; +import org.apache.lucene.util._TestUtil; public class TestWordnet extends LuceneTestCase { private IndexSearcher searcher; @@ -42,6 +43,7 @@ public class TestWordnet extends LuceneTestCase { // create a temporary synonym index File testFile = getDataFile("testSynonyms.txt"); String commandLineArgs[] = { testFile.getAbsolutePath(), storePathName }; + _TestUtil.rmDir(new File(storePathName)); try { Syns2Index.main(commandLineArgs); @@ -61,7 +63,7 @@ public class TestWordnet extends LuceneTestCase { private void assertExpandsTo(String term, String expected[]) throws IOException { Query expandedQuery = SynExpand.expand(term, searcher, new - MockAnalyzer(), "field", 1F); + MockAnalyzer(random), "field", 1F); BooleanQuery expectedQuery = new BooleanQuery(); for (String t : expected) expectedQuery.add(new TermQuery(new Term("field", t)), @@ -71,8 +73,12 @@ public class TestWordnet extends LuceneTestCase { @Override public void tearDown() throws Exception { - searcher.close(); - dir.close(); + if (searcher != null) { + searcher.close(); + } + if (dir != null) { + dir.close(); + } rmDir(storePathName); // delete our temporary synonym index super.tearDown(); } diff --git a/lucene/contrib/xml-query-parser/src/test/org/apache/lucene/xmlparser/TestParser.java b/lucene/contrib/xml-query-parser/src/test/org/apache/lucene/xmlparser/TestParser.java index ae5f02be532..ffe82630b65 100644 --- a/lucene/contrib/xml-query-parser/src/test/org/apache/lucene/xmlparser/TestParser.java +++ b/lucene/contrib/xml-query-parser/src/test/org/apache/lucene/xmlparser/TestParser.java @@ -49,7 +49,7 @@ public class TestParser extends LuceneTestCase { @BeforeClass public static void beforeClass() throws Exception { // TODO: rewrite test (this needs to set QueryParser.enablePositionIncrements, too, for work with CURRENT): - Analyzer analyzer=new MockAnalyzer(MockTokenizer.WHITESPACE, true, MockTokenFilter.ENGLISH_STOPSET, false); + Analyzer analyzer=new MockAnalyzer(random, MockTokenizer.WHITESPACE, true, MockTokenFilter.ENGLISH_STOPSET, false); //initialize the parser builder=new CorePlusExtensionsParser("contents",analyzer); @@ -187,7 +187,8 @@ public class TestParser extends LuceneTestCase { } public void testDuplicateFilterQueryXML() throws ParserException, IOException { 
- Assume.assumeTrue(searcher.getIndexReader().getSequentialSubReaders().length == 1); + Assume.assumeTrue(searcher.getIndexReader().getSequentialSubReaders() == null || + searcher.getIndexReader().getSequentialSubReaders().length == 1); Query q=parse("DuplicateFilterQuery.xml"); int h = searcher.search(q, null, 1000).totalHits; assertEquals("DuplicateFilterQuery should produce 1 result ", 1,h); diff --git a/lucene/contrib/xml-query-parser/src/test/org/apache/lucene/xmlparser/TestQueryTemplateManager.java b/lucene/contrib/xml-query-parser/src/test/org/apache/lucene/xmlparser/TestQueryTemplateManager.java index 9d87f8ae03f..f5cb65d9901 100644 --- a/lucene/contrib/xml-query-parser/src/test/org/apache/lucene/xmlparser/TestQueryTemplateManager.java +++ b/lucene/contrib/xml-query-parser/src/test/org/apache/lucene/xmlparser/TestQueryTemplateManager.java @@ -44,7 +44,7 @@ import org.xml.sax.SAXException; public class TestQueryTemplateManager extends LuceneTestCase { CoreParser builder; - Analyzer analyzer=new MockAnalyzer(); + Analyzer analyzer=new MockAnalyzer(random); private IndexSearcher searcher; private Directory dir; diff --git a/lucene/lib/ant-junit-LICENSE.txt b/lucene/lib/ant-junit-LICENSE-ASL.txt similarity index 100% rename from lucene/lib/ant-junit-LICENSE.txt rename to lucene/lib/ant-junit-LICENSE-ASL.txt diff --git a/lucene/src/java/org/apache/lucene/index/BufferedDeletes.java b/lucene/src/java/org/apache/lucene/index/BufferedDeletes.java index c72a1f6b0a3..ae544cbaf86 100644 --- a/lucene/src/java/org/apache/lucene/index/BufferedDeletes.java +++ b/lucene/src/java/org/apache/lucene/index/BufferedDeletes.java @@ -72,13 +72,18 @@ class BufferedDeletes { public static final Integer MAX_INT = Integer.valueOf(Integer.MAX_VALUE); - final AtomicLong bytesUsed = new AtomicLong(); + final AtomicLong bytesUsed; private final static boolean VERBOSE_DELETES = false; long gen; - public BufferedDeletes(boolean sortTerms) { + this(sortTerms, new AtomicLong()); + } + + BufferedDeletes(boolean sortTerms, AtomicLong bytesUsed) { + assert bytesUsed != null; + this.bytesUsed = bytesUsed; if (sortTerms) { terms = new TreeMap (); } else { diff --git a/lucene/src/java/org/apache/lucene/index/BufferedDeletesStream.java b/lucene/src/java/org/apache/lucene/index/BufferedDeletesStream.java index 692496ba406..11e55734046 100644 --- a/lucene/src/java/org/apache/lucene/index/BufferedDeletesStream.java +++ b/lucene/src/java/org/apache/lucene/index/BufferedDeletesStream.java @@ -33,8 +33,8 @@ import org.apache.lucene.search.Query; import org.apache.lucene.search.Scorer; import org.apache.lucene.search.Weight; -/* Tracks the stream of {@link BuffereDeletes}. - * When DocumensWriter flushes, its buffered +/* Tracks the stream of {@link BufferedDeletes}. + * When DocumentsWriterPerThread flushes, its buffered * deletes are appended to this stream. 
We later * apply these deletes (resolve them to the actual * docIDs, per segment) when a merge is started @@ -60,7 +60,7 @@ class BufferedDeletesStream { // used only by assert private Term lastDeleteTerm; - + private PrintStream infoStream; private final AtomicLong bytesUsed = new AtomicLong(); private final AtomicInteger numTerms = new AtomicInteger(); @@ -75,26 +75,36 @@ class BufferedDeletesStream { infoStream.println("BD " + messageID + " [" + new Date() + "; " + Thread.currentThread().getName() + "]: " + message); } } - + public synchronized void setInfoStream(PrintStream infoStream) { this.infoStream = infoStream; } // Appends a new packet of buffered deletes to the stream, // setting its generation: - public synchronized void push(FrozenBufferedDeletes packet) { + public synchronized long push(FrozenBufferedDeletes packet) { + /* + * The insert operation must be atomic. If we let threads increment the gen + * and push the packet afterwards we risk that packets are out of order. + * With DWPT this is possible if two or more flushes are racing for pushing + * updates. If the pushed packets get our of order would loose documents + * since deletes are applied to the wrong segments. + */ + packet.setDelGen(nextGen++); assert packet.any(); - assert checkDeleteStats(); - assert packet.gen < nextGen; + assert checkDeleteStats(); + assert packet.delGen() < nextGen; + assert deletes.isEmpty() || deletes.get(deletes.size()-1).delGen() < packet.delGen() : "Delete packets must be in order"; deletes.add(packet); numTerms.addAndGet(packet.numTermDeletes); bytesUsed.addAndGet(packet.bytesUsed); if (infoStream != null) { - message("push deletes " + packet + " delGen=" + packet.gen + " packetCount=" + deletes.size()); + message("push deletes " + packet + " delGen=" + packet.delGen() + " packetCount=" + deletes.size()); } - assert checkDeleteStats(); + assert checkDeleteStats(); + return packet.delGen(); } - + public synchronized void clear() { deletes.clear(); nextGen = 1; @@ -132,7 +142,7 @@ class BufferedDeletesStream { } // Sorts SegmentInfos from smallest to biggest bufferedDelGen: - private static final Comparator sortByDelGen = new Comparator () { + private static final Comparator sortSegInfoByDelGen = new Comparator () { // @Override -- not until Java 1.6 public int compare(SegmentInfo si1, SegmentInfo si2) { final long cmp = si1.getBufferedDeletesGen() - si2.getBufferedDeletesGen(); @@ -147,10 +157,10 @@ class BufferedDeletesStream { @Override public boolean equals(Object other) { - return sortByDelGen == other; + return sortSegInfoByDelGen == other; } }; - + /** Resolves the buffered deleted Term/Query/docIDs, into * actual deleted docIDs in the deletedDocs BitVector for * each SegmentReader. 
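The comment in push() above argues that the delete generation must be assigned and the packet appended under the same lock. A minimal standalone sketch of that ordering invariant (illustrative only; the Packet type is a stand-in for FrozenBufferedDeletes): if a thread took a generation outside the synchronized block and another thread pushed first, a higher-generation packet could land ahead of a lower-generation one, and deletes would later be resolved against the wrong segments.

import java.util.ArrayList;
import java.util.List;

final class OrderedPacketStream {
  // Stand-in for FrozenBufferedDeletes; only the generation matters for this sketch.
  static final class Packet {
    long gen = -1;
  }

  private final List<Packet> packets = new ArrayList<Packet>();
  private long nextGen = 1;

  // Assign the generation and append in one atomic step, as push() does above.
  synchronized long push(Packet packet) {
    packet.gen = nextGen++;
    assert packets.isEmpty() || packets.get(packets.size() - 1).gen < packet.gen
        : "packets must arrive in generation order";
    packets.add(packet);
    return packet.gen;
  }
}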
*/ @@ -174,7 +184,7 @@ class BufferedDeletesStream { SegmentInfos infos2 = new SegmentInfos(); infos2.addAll(infos); - Collections.sort(infos2, sortByDelGen); + Collections.sort(infos2, sortSegInfoByDelGen); BufferedDeletes coalescedDeletes = null; boolean anyNewDeletes = false; @@ -191,19 +201,30 @@ class BufferedDeletesStream { final SegmentInfo info = infos2.get(infosIDX); final long segGen = info.getBufferedDeletesGen(); - if (packet != null && segGen < packet.gen) { + if (packet != null && segGen < packet.delGen()) { //System.out.println(" coalesce"); if (coalescedDeletes == null) { coalescedDeletes = new BufferedDeletes(true); } - coalescedDeletes.update(packet); + if (!packet.isSegmentPrivate) { + /* + * Only coalesce if we are NOT on a segment private del packet: the segment private del packet + * must only applied to segments with the same delGen. Yet, if a segment is already deleted + * from the SI since it had no more documents remaining after some del packets younger than + * its segPrivate packet (higher delGen) have been applied, the segPrivate packet has not been + * removed. + */ + coalescedDeletes.update(packet); + } + delIDX--; - } else if (packet != null && segGen == packet.gen) { + } else if (packet != null && segGen == packet.delGen()) { + assert packet.isSegmentPrivate : "Packet and Segments deletegen can only match on a segment private del packet"; //System.out.println(" eq"); // Lock order: IW -> BD -> RP assert readerPool.infoIsLive(info); - SegmentReader reader = readerPool.get(info, false); + final SegmentReader reader = readerPool.get(info, false); int delCount = 0; final boolean segAllDeletes; try { @@ -213,7 +234,7 @@ class BufferedDeletesStream { delCount += applyQueryDeletes(coalescedDeletes.queriesIterable(), reader); } //System.out.println(" del exact"); - // Don't delete by Term here; DocumentsWriter + // Don't delete by Term here; DocumentsWriterPerThread // already did that on flush: delCount += applyQueryDeletes(packet.queriesIterable(), reader); segAllDeletes = reader.numDocs() == 0; @@ -236,7 +257,12 @@ class BufferedDeletesStream { if (coalescedDeletes == null) { coalescedDeletes = new BufferedDeletes(true); } - coalescedDeletes.update(packet); + + /* + * Since we are on a segment private del packet we must not + * update the coalescedDeletes here! We can simply advance to the + * next packet and seginfo. 
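The generation bookkeeping in applyDeletes can be distilled to a small comparison loop. A simplified sketch, assuming segments and packets are held as plain delGen arrays sorted ascending; the coalescing contents and the segment-private special cases are elided:

final class DelGenWalkSketch {
  // Walk segments and delete packets from newest to oldest, mirroring the loop above.
  static void walk(long[] segGens, long[] packetGens) {
    int segIDX = segGens.length - 1;
    int delIDX = packetGens.length - 1;
    while (segIDX >= 0) {
      final long segGen = segGens[segIDX];
      if (delIDX >= 0 && segGen < packetGens[delIDX]) {
        // Packet is newer than this segment: fold it into the coalesced deletes
        // (in the real code only if it is not segment-private) and move to an older packet.
        delIDX--;
      } else if (delIDX >= 0 && segGen == packetGens[delIDX]) {
        // This segment's own flush packet: apply it plus the coalesced deletes,
        // then advance both cursors.
        delIDX--;
        segIDX--;
      } else {
        // Only older packets remain: this segment gets just the coalesced deletes.
        segIDX--;
      }
    }
  }
}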
+ */ delIDX--; infosIDX--; info.setBufferedDeletesGen(nextGen); @@ -281,11 +307,11 @@ class BufferedDeletesStream { message("applyDeletes took " + (System.currentTimeMillis()-t0) + " msec"); } // assert infos != segmentInfos || !any() : "infos=" + infos + " segmentInfos=" + segmentInfos + " any=" + any; - + return new ApplyDeletesResult(anyNewDeletes, nextGen++, allDeleted); } - public synchronized long getNextGen() { + synchronized long getNextGen() { return nextGen++; } @@ -303,10 +329,9 @@ class BufferedDeletesStream { if (infoStream != null) { message("prune sis=" + segmentInfos + " minGen=" + minGen + " packetCount=" + deletes.size()); } - final int limit = deletes.size(); for(int delIDX=0;delIDX = minGen) { + if (deletes.get(delIDX).delGen() >= minGen) { prune(delIDX); assert checkDeleteStats(); return; @@ -345,10 +370,10 @@ class BufferedDeletesStream { } TermsEnum termsEnum = null; - + String currentField = null; DocsEnum docs = null; - + assert checkDeleteTerm(null); for (Term term : termsIter) { @@ -372,10 +397,10 @@ class BufferedDeletesStream { assert checkDeleteTerm(term); // System.out.println(" term=" + term); - + if (termsEnum.seek(term.bytes(), false) == TermsEnum.SeekStatus.FOUND) { DocsEnum docsEnum = termsEnum.docs(reader.getDeletedDocs(), docs); - + if (docsEnum != null) { while (true) { final int docID = docsEnum.nextDoc(); @@ -401,7 +426,7 @@ class BufferedDeletesStream { public final Query query; public final int limit; public QueryAndLimit(Query query, int limit) { - this.query = query; + this.query = query; this.limit = limit; } } @@ -449,7 +474,7 @@ class BufferedDeletesStream { lastDeleteTerm = term; return true; } - + // only for assert private boolean checkDeleteStats() { int numTerms2 = 0; diff --git a/lucene/src/java/org/apache/lucene/index/ByteSliceWriter.java b/lucene/src/java/org/apache/lucene/index/ByteSliceWriter.java index 5355dee4f83..5c8b921f087 100644 --- a/lucene/src/java/org/apache/lucene/index/ByteSliceWriter.java +++ b/lucene/src/java/org/apache/lucene/index/ByteSliceWriter.java @@ -81,6 +81,6 @@ final class ByteSliceWriter extends DataOutput { } public int getAddress() { - return upto + (offset0 & DocumentsWriter.BYTE_BLOCK_NOT_MASK); + return upto + (offset0 & DocumentsWriterPerThread.BYTE_BLOCK_NOT_MASK); } } \ No newline at end of file diff --git a/lucene/src/java/org/apache/lucene/index/CheckIndex.java b/lucene/src/java/org/apache/lucene/index/CheckIndex.java index ca8f357aba2..61b3fc07da0 100644 --- a/lucene/src/java/org/apache/lucene/index/CheckIndex.java +++ b/lucene/src/java/org/apache/lucene/index/CheckIndex.java @@ -661,10 +661,13 @@ public class CheckIndex { status.termCount++; final DocsEnum docs2; + final boolean hasPositions; if (postings != null) { docs2 = postings; + hasPositions = true; } else { docs2 = docs; + hasPositions = false; } int lastDoc = -1; @@ -733,6 +736,67 @@ public class CheckIndex { throw new RuntimeException("term " + term + " totalTermFreq=" + totalTermFreq2 + " != recomputed totalTermFreq=" + totalTermFreq); } } + + // Test skipping + if (docFreq >= 16) { + if (hasPositions) { + for(int idx=0;idx<7;idx++) { + final int skipDocID = (int) (((idx+1)*(long) maxDoc)/8); + postings = terms.docsAndPositions(delDocs, postings); + final int docID = postings.advance(skipDocID); + if (docID == DocsEnum.NO_MORE_DOCS) { + break; + } else { + if (docID < skipDocID) { + throw new RuntimeException("term " + term + ": advance(docID=" + skipDocID + ") returned docID=" + docID); + } + final int freq = postings.freq(); + if 
(freq <= 0) { + throw new RuntimeException("termFreq " + freq + " is out of bounds"); + } + int lastPosition = -1; + for(int posUpto=0;posUpto threads, final SegmentWriteState state) throws IOException; + abstract void processDocument(FieldInfos fieldInfos) throws IOException; + abstract void finishDocument() throws IOException; + abstract void flush(final SegmentWriteState state) throws IOException; abstract void abort(); abstract boolean freeRAM(); + abstract void doAfterFlush(); } diff --git a/lucene/src/java/org/apache/lucene/index/DocFieldConsumer.java b/lucene/src/java/org/apache/lucene/index/DocFieldConsumer.java index 2abc0bb5531..18555300003 100644 --- a/lucene/src/java/org/apache/lucene/index/DocFieldConsumer.java +++ b/lucene/src/java/org/apache/lucene/index/DocFieldConsumer.java @@ -18,22 +18,25 @@ package org.apache.lucene.index; */ import java.io.IOException; -import java.util.Collection; import java.util.Map; abstract class DocFieldConsumer { - /** Called when DocumentsWriter decides to create a new + /** Called when DocumentsWriterPerThread decides to create a new * segment */ - abstract void flush(Map > threadsAndFields, SegmentWriteState state) throws IOException; + abstract void flush(Map fieldsToFlush, SegmentWriteState state) throws IOException; /** Called when an aborting exception is hit */ abstract void abort(); - /** Add a new thread */ - abstract DocFieldConsumerPerThread addThread(DocFieldProcessorPerThread docFieldProcessorPerThread) throws IOException; - - /** Called when DocumentsWriter is using too much RAM. + /** Called when DocumentsWriterPerThread is using too much RAM. * The consumer should free RAM, if possible, returning * true if any RAM was in fact freed. */ abstract boolean freeRAM(); - } + + abstract void startDocument() throws IOException; + + abstract DocFieldConsumerPerField addField(FieldInfo fi); + + abstract void finishDocument() throws IOException; + +} diff --git a/lucene/src/java/org/apache/lucene/index/DocFieldConsumerPerField.java b/lucene/src/java/org/apache/lucene/index/DocFieldConsumerPerField.java index f70e815d8d5..960ea59eae8 100644 --- a/lucene/src/java/org/apache/lucene/index/DocFieldConsumerPerField.java +++ b/lucene/src/java/org/apache/lucene/index/DocFieldConsumerPerField.java @@ -24,4 +24,5 @@ abstract class DocFieldConsumerPerField { /** Processes all occurrences of a single field */ abstract void processFields(Fieldable[] fields, int count) throws IOException; abstract void abort(); + abstract FieldInfo getFieldInfo(); } diff --git a/lucene/src/java/org/apache/lucene/index/DocFieldConsumers.java b/lucene/src/java/org/apache/lucene/index/DocFieldConsumers.java new file mode 100644 index 00000000000..3d20248ff61 --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/DocFieldConsumers.java @@ -0,0 +1,90 @@ +package org.apache.lucene.index; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import java.io.IOException; +import java.util.HashMap; +import java.util.Map; + +/** This is just a "splitter" class: it lets you wrap two + * DocFieldConsumer instances as a single consumer. */ + +final class DocFieldConsumers extends DocFieldConsumer { + final DocFieldConsumer one; + final DocFieldConsumer two; + final DocumentsWriterPerThread.DocState docState; + + public DocFieldConsumers(DocFieldProcessor processor, DocFieldConsumer one, DocFieldConsumer two) { + this.one = one; + this.two = two; + this.docState = processor.docState; + } + + @Override + public void flush(Map fieldsToFlush, SegmentWriteState state) throws IOException { + + Map oneFieldsToFlush = new HashMap (); + Map twoFieldsToFlush = new HashMap (); + + for (Map.Entry fieldToFlush : fieldsToFlush.entrySet()) { + DocFieldConsumersPerField perField = (DocFieldConsumersPerField) fieldToFlush.getValue(); + oneFieldsToFlush.put(fieldToFlush.getKey(), perField.one); + twoFieldsToFlush.put(fieldToFlush.getKey(), perField.two); + } + + one.flush(oneFieldsToFlush, state); + two.flush(twoFieldsToFlush, state); + } + + @Override + public void abort() { + try { + one.abort(); + } finally { + two.abort(); + } + } + + @Override + public boolean freeRAM() { + boolean any = one.freeRAM(); + any |= two.freeRAM(); + return any; + } + + @Override + public void finishDocument() throws IOException { + try { + one.finishDocument(); + } finally { + two.finishDocument(); + } + } + + @Override + public void startDocument() throws IOException { + one.startDocument(); + two.startDocument(); + } + + @Override + public DocFieldConsumerPerField addField(FieldInfo fi) { + return new DocFieldConsumersPerField(this, fi, one.addField(fi), two.addField(fi)); + } + +} diff --git a/lucene/src/java/org/apache/lucene/index/FreqProxTermsWriterPerThread.java b/lucene/src/java/org/apache/lucene/index/DocFieldConsumersPerField.java similarity index 52% rename from lucene/src/java/org/apache/lucene/index/FreqProxTermsWriterPerThread.java rename to lucene/src/java/org/apache/lucene/index/DocFieldConsumersPerField.java index 87af8608174..5abf003d5a1 100644 --- a/lucene/src/java/org/apache/lucene/index/FreqProxTermsWriterPerThread.java +++ b/lucene/src/java/org/apache/lucene/index/DocFieldConsumersPerField.java @@ -17,29 +17,40 @@ package org.apache.lucene.index; * limitations under the License. 
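DocFieldConsumers above, and DocFieldConsumersPerField just below, both follow the same tee pattern: forward every call to two delegates, and chain cleanup calls with try/finally so the second delegate still runs if the first throws. A generic sketch with a hypothetical Stage interface (not a Lucene type):

import java.io.IOException;

interface Stage {
  void process(String field) throws IOException;
  void abort();
}

final class TeeStage implements Stage {
  private final Stage one;
  private final Stage two;

  TeeStage(Stage one, Stage two) {
    this.one = one;
    this.two = two;
  }

  public void process(String field) throws IOException {
    one.process(field);
    two.process(field);
  }

  public void abort() {
    try {
      one.abort();
    } finally {
      two.abort(); // still runs if one.abort() threw
    }
  }
}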
*/ -final class FreqProxTermsWriterPerThread extends TermsHashConsumerPerThread { - final TermsHashPerThread termsHashPerThread; - final DocumentsWriter.DocState docState; +import java.io.IOException; +import org.apache.lucene.document.Fieldable; - public FreqProxTermsWriterPerThread(TermsHashPerThread perThread) { - docState = perThread.docState; - termsHashPerThread = perThread; - } - - @Override - public TermsHashConsumerPerField addField(TermsHashPerField termsHashPerField, FieldInfo fieldInfo) { - return new FreqProxTermsWriterPerField(termsHashPerField, this, fieldInfo); +final class DocFieldConsumersPerField extends DocFieldConsumerPerField { + + final DocFieldConsumerPerField one; + final DocFieldConsumerPerField two; + final DocFieldConsumers parent; + final FieldInfo fieldInfo; + + public DocFieldConsumersPerField(DocFieldConsumers parent, FieldInfo fi, DocFieldConsumerPerField one, DocFieldConsumerPerField two) { + this.parent = parent; + this.one = one; + this.two = two; + this.fieldInfo = fi; } @Override - void startDocument() { + public void processFields(Fieldable[] fields, int count) throws IOException { + one.processFields(fields, count); + two.processFields(fields, count); } @Override - DocumentsWriter.DocWriter finishDocument() { - return null; + public void abort() { + try { + one.abort(); + } finally { + two.abort(); + } } @Override - public void abort() {} + FieldInfo getFieldInfo() { + return fieldInfo; + } } diff --git a/lucene/src/java/org/apache/lucene/index/DocFieldProcessor.java b/lucene/src/java/org/apache/lucene/index/DocFieldProcessor.java index a566e72f9bb..3f7faf62c20 100644 --- a/lucene/src/java/org/apache/lucene/index/DocFieldProcessor.java +++ b/lucene/src/java/org/apache/lucene/index/DocFieldProcessor.java @@ -20,13 +20,20 @@ package org.apache.lucene.index; import java.io.IOException; import java.util.Collection; import java.util.HashMap; +import java.util.HashSet; +import java.util.List; import java.util.Map; -import org.apache.lucene.index.codecs.FieldsConsumer; +import org.apache.lucene.document.Document; +import org.apache.lucene.document.Fieldable; +import org.apache.lucene.index.DocumentsWriterPerThread.DocState; +import org.apache.lucene.index.codecs.Codec; +import org.apache.lucene.index.codecs.PerDocConsumer; import org.apache.lucene.index.codecs.docvalues.DocValuesConsumer; import org.apache.lucene.index.values.PerDocFieldValues; import org.apache.lucene.store.Directory; + /** * This is a DocConsumer that gathers all fields under the * same name, and calls per-field consumers to process field @@ -37,66 +44,39 @@ import org.apache.lucene.store.Directory; final class DocFieldProcessor extends DocConsumer { - final DocumentsWriter docWriter; final DocFieldConsumer consumer; final StoredFieldsWriter fieldsWriter; - final private Map docValues = new HashMap (); - private FieldsConsumer fieldsConsumer; // TODO this should be encapsulated in DocumentsWriter - private SegmentWriteState docValuesConsumerState; // TODO this should be encapsulated in DocumentsWriter + // Holds all fields seen in current doc + DocFieldProcessorPerField[] fields = new DocFieldProcessorPerField[1]; + int fieldCount; - synchronized DocValuesConsumer docValuesConsumer(Directory dir, - String segment, String name, PerDocFieldValues values, FieldInfo fieldInfo) - throws IOException { - DocValuesConsumer valuesConsumer; - if ((valuesConsumer = docValues.get(name)) == null) { - fieldInfo.setDocValues(values.type()); + // Hash table for all fields ever seen + 
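The fieldHash declared below, together with the rehash() that follows, is a small open hash keyed by field name: a power-of-two table that doubles once it is half full. A minimal, hypothetical version of just that pattern; the collision handling shown here (a next pointer per entry) is an assumption, since the loop bodies are truncated in this hunk:

final class FieldHashSketch {
  static final class Entry {
    final String name;
    Entry next; // collision chain
    Entry(String name) { this.name = name; }
  }

  private Entry[] table = new Entry[2];
  private int mask = 1;
  private int count;

  Entry getOrAdd(String name) {
    final int pos = name.hashCode() & mask;
    Entry e = table[pos];
    while (e != null && !e.name.equals(name)) {
      e = e.next;
    }
    if (e == null) {
      e = new Entry(name);
      e.next = table[pos];
      table[pos] = e;
      if (++count >= table.length / 2) {
        rehash();
      }
    }
    return e;
  }

  private void rehash() {
    final Entry[] newTable = new Entry[table.length * 2];
    final int newMask = newTable.length - 1;
    for (Entry head : table) {
      Entry e = head;
      while (e != null) {
        final Entry next = e.next; // detach before re-linking
        final int pos = e.name.hashCode() & newMask;
        e.next = newTable[pos];
        newTable[pos] = e;
        e = next;
      }
    }
    table = newTable;
    mask = newMask;
  }
}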
DocFieldProcessorPerField[] fieldHash = new DocFieldProcessorPerField[2]; + int hashMask = 1; + int totalFieldCount; - if(fieldsConsumer == null) { - /* TODO (close to no commit) -- this is a hack and only works since DocValuesCodec supports initializing the FieldsConsumer twice. - * we need to find a way that allows us to obtain a FieldsConsumer per DocumentsWriter. Currently some codecs rely on - * the SegmentsWriteState passed in right at the moment when the segment is flushed (doccount etc) but we need the consumer earlier - * to support docvalues and later on stored fields too. - */ - docValuesConsumerState = docWriter.segWriteState(false); - fieldsConsumer = docValuesConsumerState.segmentCodecs.codec().fieldsConsumer(docValuesConsumerState); - } - valuesConsumer = fieldsConsumer.addValuesField(fieldInfo); - docValues.put(name, valuesConsumer); - } - return valuesConsumer; + float docBoost; + int fieldGen; + final DocumentsWriterPerThread.DocState docState; - } - - - public DocFieldProcessor(DocumentsWriter docWriter, DocFieldConsumer consumer) { - this.docWriter = docWriter; + public DocFieldProcessor(DocumentsWriterPerThread docWriter, DocFieldConsumer consumer) { + this.docState = docWriter.docState; this.consumer = consumer; fieldsWriter = new StoredFieldsWriter(docWriter); } @Override - public void flush(Collection threads, SegmentWriteState state) throws IOException { + public void flush(SegmentWriteState state) throws IOException { - Map > childThreadsAndFields = new HashMap >(); - for ( DocConsumerPerThread thread : threads) { - DocFieldProcessorPerThread perThread = (DocFieldProcessorPerThread) thread; - childThreadsAndFields.put(perThread.consumer, perThread.fields()); + Map childFields = new HashMap (); + Collection fields = fields(); + for (DocFieldConsumerPerField f : fields) { + childFields.put(f.getFieldInfo(), f); } + fieldsWriter.flush(state); - consumer.flush(childThreadsAndFields, state); - - for(DocValuesConsumer p : docValues.values()) { - if (p != null) { - p.finish(state.numDocs); - } - } - docValues.clear(); - if(fieldsConsumer != null) { - fieldsConsumer.close(); // TODO remove this once docvalues are fully supported by codecs - docValuesConsumerState = null; - fieldsConsumer = null; - } + consumer.flush(childFields, state); // Important to save after asking consumer to flush so // consumer can alter the FieldInfo* if necessary. EG, @@ -104,12 +84,35 @@ final class DocFieldProcessor extends DocConsumer { // FieldInfo.storePayload. final String fileName = IndexFileNames.segmentFileName(state.segmentName, "", IndexFileNames.FIELD_INFOS_EXTENSION); state.fieldInfos.write(state.directory, fileName); + for (DocValuesConsumer consumers : docValues.values()) { + consumers.finish(state.numDocs); + }; } @Override public void abort() { - fieldsWriter.abort(); - consumer.abort(); + for(int i=0;i fields() { + Collection fields = new HashSet (); + for(int i=0;i fieldHash.length; + + final DocFieldProcessorPerField newHashArray[] = new DocFieldProcessorPerField[newHashSize]; + + // Rehash + int newHashMask = newHashSize-1; + for(int j=0;j docFields = doc.getFields(); + final int numDocFields = docFields.size(); + + // Absorb any new fields first seen in this document. 
+ // Also absorb any changes to fields we had already + // seen before (eg suddenly turning on norms or + // vectors, etc.): + + for(int i=0;i = fieldHash.length/2) + rehash(); + } else { + fieldInfos.addOrUpdate(fp.fieldInfo.name, field.isIndexed(), field.isTermVectorStored(), + field.isStorePositionWithTermVector(), field.isStoreOffsetWithTermVector(), + field.getOmitNorms(), false, field.getOmitTermFreqAndPositions(), field.docValuesType()); + } + + if (thisFieldGen != fp.lastGen) { + + // First time we're seeing this field for this doc + fp.fieldCount = 0; + + if (fieldCount == fields.length) { + final int newSize = fields.length*2; + DocFieldProcessorPerField newArray[] = new DocFieldProcessorPerField[newSize]; + System.arraycopy(fields, 0, newArray, 0, fieldCount); + fields = newArray; + } + + fields[fieldCount++] = fp; + fp.lastGen = thisFieldGen; + } + + fp.addField(field); + + if (field.isStored()) { + fieldsWriter.addField(field, fp.fieldInfo); + } + if (field.hasDocValues()) { + final DocValuesConsumer docValuesConsumer = docValuesConsumer(docState, fp.fieldInfo, fieldInfos); + docValuesConsumer.add(docState.docID, field.getDocValues()); + } + } + + // If we are writing vectors then we must visit + // fields in sorted order so they are written in + // sorted order. TODO: we actually only need to + // sort the subset of fields that have vectors + // enabled; we could save [small amount of] CPU + // here. + quickSort(fields, 0, fieldCount-1); + + for(int i=0;i = hi) + return; + else if (hi == 1+lo) { + if (array[lo].fieldInfo.name.compareTo(array[hi].fieldInfo.name) > 0) { + final DocFieldProcessorPerField tmp = array[lo]; + array[lo] = array[hi]; + array[hi] = tmp; + } + return; + } + + int mid = (lo + hi) >>> 1; + + if (array[lo].fieldInfo.name.compareTo(array[mid].fieldInfo.name) > 0) { + DocFieldProcessorPerField tmp = array[lo]; + array[lo] = array[mid]; + array[mid] = tmp; + } + + if (array[mid].fieldInfo.name.compareTo(array[hi].fieldInfo.name) > 0) { + DocFieldProcessorPerField tmp = array[mid]; + array[mid] = array[hi]; + array[hi] = tmp; + + if (array[lo].fieldInfo.name.compareTo(array[mid].fieldInfo.name) > 0) { + DocFieldProcessorPerField tmp2 = array[lo]; + array[lo] = array[mid]; + array[mid] = tmp2; + } + } + + int left = lo + 1; + int right = hi - 1; + + if (left >= right) + return; + + DocFieldProcessorPerField partition = array[mid]; + + for (; ;) { + while (array[right].fieldInfo.name.compareTo(partition.fieldInfo.name) > 0) + --right; + + while (left < right && array[left].fieldInfo.name.compareTo(partition.fieldInfo.name) <= 0) + ++left; + + if (left < right) { + DocFieldProcessorPerField tmp = array[left]; + array[left] = array[right]; + array[right] = tmp; + --right; + } else { + break; + } + } + + quickSort(array, lo, left); + quickSort(array, left + 1, hi); + } + final private Map docValues = new HashMap (); + final private Map perDocConsumers = new HashMap (); + + DocValuesConsumer docValuesConsumer(DocState docState, FieldInfo fieldInfo, FieldInfos infos) + throws IOException { + DocValuesConsumer docValuesConsumer = docValues.get(fieldInfo.name); + if (docValuesConsumer != null) { + return docValuesConsumer; + } + PerDocConsumer perDocConsumer = perDocConsumers.get(fieldInfo.getCodecId()); + if (perDocConsumer == null) { + PerDocWriteState perDocWriteState = docState.docWriter.newPerDocWriteState(fieldInfo.getCodecId()); + SegmentCodecs codecs = perDocWriteState.segmentCodecs; + assert codecs.codecs.length > fieldInfo.getCodecId(); + + Codec codec = 
codecs.codecs[fieldInfo.getCodecId()]; + perDocConsumer = codec.docsConsumer(perDocWriteState); + perDocConsumers.put(Integer.valueOf(fieldInfo.getCodecId()), perDocConsumer); + } + docValuesConsumer = perDocConsumer.addValuesField(fieldInfo); + docValues.put(fieldInfo.name, docValuesConsumer); + return docValuesConsumer; + } + } diff --git a/lucene/src/java/org/apache/lucene/index/DocFieldProcessorPerField.java b/lucene/src/java/org/apache/lucene/index/DocFieldProcessorPerField.java index 8fb1da45280..36b1908f6d3 100644 --- a/lucene/src/java/org/apache/lucene/index/DocFieldProcessorPerField.java +++ b/lucene/src/java/org/apache/lucene/index/DocFieldProcessorPerField.java @@ -18,6 +18,8 @@ package org.apache.lucene.index; */ import org.apache.lucene.document.Fieldable; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.RamUsageEstimator; /** * Holds all per thread, per field state. @@ -34,11 +36,22 @@ final class DocFieldProcessorPerField { int fieldCount; Fieldable[] fields = new Fieldable[1]; - public DocFieldProcessorPerField(final DocFieldProcessorPerThread perThread, final FieldInfo fieldInfo) { - this.consumer = perThread.consumer.addField(fieldInfo); + public DocFieldProcessorPerField(final DocFieldProcessor docFieldProcessor, final FieldInfo fieldInfo) { + this.consumer = docFieldProcessor.consumer.addField(fieldInfo); this.fieldInfo = fieldInfo; } + public void addField(Fieldable field) { + if (fieldCount == fields.length) { + int newSize = ArrayUtil.oversize(fieldCount + 1, RamUsageEstimator.NUM_BYTES_OBJECT_REF); + Fieldable[] newArray = new Fieldable[newSize]; + System.arraycopy(fields, 0, newArray, 0, fieldCount); + fields = newArray; + } + + fields[fieldCount++] = field; + } + public void abort() { consumer.abort(); } diff --git a/lucene/src/java/org/apache/lucene/index/DocFieldProcessorPerThread.java b/lucene/src/java/org/apache/lucene/index/DocFieldProcessorPerThread.java deleted file mode 100644 index e69424be90d..00000000000 --- a/lucene/src/java/org/apache/lucene/index/DocFieldProcessorPerThread.java +++ /dev/null @@ -1,320 +0,0 @@ -package org.apache.lucene.index; - -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.util.Comparator; -import java.util.Collection; -import java.util.HashSet; -import java.util.List; -import java.io.IOException; - -import org.apache.lucene.document.Document; -import org.apache.lucene.document.Fieldable; -import org.apache.lucene.index.codecs.docvalues.DocValuesConsumer; -import org.apache.lucene.index.values.PerDocFieldValues; -import org.apache.lucene.util.ArrayUtil; -import org.apache.lucene.util.RamUsageEstimator; - -/** - * Gathers all Fieldables for a document under the same - * name, updates FieldInfos, and calls per-field consumers - * to process field by field. - * - * Currently, only a single thread visits the fields, - * sequentially, for processing. - */ - -final class DocFieldProcessorPerThread extends DocConsumerPerThread { - - float docBoost; - int fieldGen; - final DocFieldProcessor docFieldProcessor; - final DocFieldConsumerPerThread consumer; - - // Holds all fields seen in current doc - DocFieldProcessorPerField[] fields = new DocFieldProcessorPerField[1]; - int fieldCount; - - // Hash table for all fields seen in current segment - DocFieldProcessorPerField[] fieldHash = new DocFieldProcessorPerField[2]; - int hashMask = 1; - int totalFieldCount; - - final StoredFieldsWriterPerThread fieldsWriter; - - final DocumentsWriter.DocState docState; - - public DocFieldProcessorPerThread(DocumentsWriterThreadState threadState, DocFieldProcessor docFieldProcessor) throws IOException { - this.docState = threadState.docState; - this.docFieldProcessor = docFieldProcessor; - this.consumer = docFieldProcessor.consumer.addThread(this); - fieldsWriter = docFieldProcessor.fieldsWriter.addThread(docState); - } - - @Override - public void abort() { - for(int i=0;i fields() { - Collection fields = new HashSet (); - for(int i=0;i fieldHash.length; - - final DocFieldProcessorPerField newHashArray[] = new DocFieldProcessorPerField[newHashSize]; - - // Rehash - int newHashMask = newHashSize-1; - for(int j=0;j docFields = doc.getFields(); - final int numDocFields = docFields.size(); - - // Absorb any new fields first seen in this document. - // Also absorb any changes to fields we had already - // seen before (eg suddenly turning on norms or - // vectors, etc.): - - for(int i=0;i = fieldHash.length/2) - rehash(); - } else { - fieldInfos.addOrUpdate(fp.fieldInfo.name, field.isIndexed(), field.isTermVectorStored(), - field.isStorePositionWithTermVector(), field.isStoreOffsetWithTermVector(), - field.getOmitNorms(), false, field.getOmitTermFreqAndPositions(), field.docValuesType()); - } - if (thisFieldGen != fp.lastGen) { - - // First time we're seeing this field for this doc - fp.fieldCount = 0; - - if (fieldCount == fields.length) { - final int newSize = fields.length*2; - DocFieldProcessorPerField newArray[] = new DocFieldProcessorPerField[newSize]; - System.arraycopy(fields, 0, newArray, 0, fieldCount); - fields = newArray; - } - - fields[fieldCount++] = fp; - fp.lastGen = thisFieldGen; - } - - if (fp.fieldCount == fp.fields.length) { - Fieldable[] newArray = new Fieldable[fp.fields.length*2]; - System.arraycopy(fp.fields, 0, newArray, 0, fp.fieldCount); - fp.fields = newArray; - } - - fp.fields[fp.fieldCount++] = field; - if (field.isStored()) { - fieldsWriter.addField(field, fp.fieldInfo); - } - } - - // If we are writing vectors then we must visit - // fields in sorted order so they are written in - // sorted order. TODO: we actually only need to - // sort the subset of fields that have vectors - // enabled; we could save [small amount of] CPU - // here. 
- ArrayUtil.quickSort(fields, 0, fieldCount, fieldsComp); - - - for(int i=0;i fieldsComp = new Comparator () { - public int compare(DocFieldProcessorPerField o1, DocFieldProcessorPerField o2) { - return o1.fieldInfo.name.compareTo(o2.fieldInfo.name); - } - }; - - PerDoc[] docFreeList = new PerDoc[1]; - int freeCount; - int allocCount; - - synchronized PerDoc getPerDoc() { - if (freeCount == 0) { - allocCount++; - if (allocCount > docFreeList.length) { - // Grow our free list up front to make sure we have - // enough space to recycle all outstanding PerDoc - // instances - assert allocCount == 1+docFreeList.length; - docFreeList = new PerDoc[ArrayUtil.oversize(allocCount, RamUsageEstimator.NUM_BYTES_OBJECT_REF)]; - } - return new PerDoc(); - } else - return docFreeList[--freeCount]; - } - - synchronized void freePerDoc(PerDoc perDoc) { - assert freeCount < docFreeList.length; - docFreeList[freeCount++] = perDoc; - } - - class PerDoc extends DocumentsWriter.DocWriter { - - DocumentsWriter.DocWriter one; - DocumentsWriter.DocWriter two; - - @Override - public long sizeInBytes() { - return one.sizeInBytes() + two.sizeInBytes(); - } - - @Override - public void finish() throws IOException { - try { - try { - one.finish(); - } finally { - two.finish(); - } - } finally { - freePerDoc(this); - } - } - - @Override - public void abort() { - try { - try { - one.abort(); - } finally { - two.abort(); - } - } finally { - freePerDoc(this); - } - } - } -} diff --git a/lucene/src/java/org/apache/lucene/index/DocInverter.java b/lucene/src/java/org/apache/lucene/index/DocInverter.java index 48e8edfb2ba..95c09763fad 100644 --- a/lucene/src/java/org/apache/lucene/index/DocInverter.java +++ b/lucene/src/java/org/apache/lucene/index/DocInverter.java @@ -18,12 +18,13 @@ package org.apache.lucene.index; */ import java.io.IOException; -import java.util.Collection; import java.util.HashMap; -import java.util.HashSet; - import java.util.Map; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; +import org.apache.lucene.analysis.tokenattributes.OffsetAttribute; +import org.apache.lucene.util.AttributeSource; + /** This is a DocFieldConsumer that inverts each field, * separately, from a Document, and accepts a @@ -34,42 +35,72 @@ final class DocInverter extends DocFieldConsumer { final InvertedDocConsumer consumer; final InvertedDocEndConsumer endConsumer; - public DocInverter(InvertedDocConsumer consumer, InvertedDocEndConsumer endConsumer) { + final DocumentsWriterPerThread.DocState docState; + + final FieldInvertState fieldState = new FieldInvertState(); + + final SingleTokenAttributeSource singleToken = new SingleTokenAttributeSource(); + + static class SingleTokenAttributeSource extends AttributeSource { + final CharTermAttribute termAttribute; + final OffsetAttribute offsetAttribute; + + private SingleTokenAttributeSource() { + termAttribute = addAttribute(CharTermAttribute.class); + offsetAttribute = addAttribute(OffsetAttribute.class); + } + + public void reinit(String stringValue, int startOffset, int endOffset) { + termAttribute.setEmpty().append(stringValue); + offsetAttribute.setOffset(startOffset, endOffset); + } + } + + // Used to read a string value for a field + final ReusableStringReader stringReader = new ReusableStringReader(); + + public DocInverter(DocumentsWriterPerThread.DocState docState, InvertedDocConsumer consumer, InvertedDocEndConsumer endConsumer) { + this.docState = docState; this.consumer = consumer; this.endConsumer = endConsumer; } @Override - void flush(Map > 
threadsAndFields, SegmentWriteState state) throws IOException { + void flush(Map fieldsToFlush, SegmentWriteState state) throws IOException { - Map > childThreadsAndFields = new HashMap >(); - Map > endChildThreadsAndFields = new HashMap >(); + Map childFieldsToFlush = new HashMap (); + Map endChildFieldsToFlush = new HashMap (); - for (Map.Entry > entry : threadsAndFields.entrySet() ) { - - - DocInverterPerThread perThread = (DocInverterPerThread) entry.getKey(); - - Collection childFields = new HashSet (); - Collection endChildFields = new HashSet (); - for (final DocFieldConsumerPerField field: entry.getValue() ) { - DocInverterPerField perField = (DocInverterPerField) field; - childFields.add(perField.consumer); - endChildFields.add(perField.endConsumer); - } - - childThreadsAndFields.put(perThread.consumer, childFields); - endChildThreadsAndFields.put(perThread.endConsumer, endChildFields); + for (Map.Entry fieldToFlush : fieldsToFlush.entrySet()) { + DocInverterPerField perField = (DocInverterPerField) fieldToFlush.getValue(); + childFieldsToFlush.put(fieldToFlush.getKey(), perField.consumer); + endChildFieldsToFlush.put(fieldToFlush.getKey(), perField.endConsumer); } - - consumer.flush(childThreadsAndFields, state); - endConsumer.flush(endChildThreadsAndFields, state); + + consumer.flush(childFieldsToFlush, state); + endConsumer.flush(endChildFieldsToFlush, state); + } + + @Override + public void startDocument() throws IOException { + consumer.startDocument(); + endConsumer.startDocument(); + } + + public void finishDocument() throws IOException { + // TODO: allow endConsumer.finishDocument to also return + // a DocWriter + endConsumer.finishDocument(); + consumer.finishDocument(); } @Override void abort() { - consumer.abort(); - endConsumer.abort(); + try { + consumer.abort(); + } finally { + endConsumer.abort(); + } } @Override @@ -78,7 +109,8 @@ final class DocInverter extends DocFieldConsumer { } @Override - public DocFieldConsumerPerThread addThread(DocFieldProcessorPerThread docFieldProcessorPerThread) { - return new DocInverterPerThread(docFieldProcessorPerThread, this); + public DocFieldConsumerPerField addField(FieldInfo fi) { + return new DocInverterPerField(this, fi); } + } diff --git a/lucene/src/java/org/apache/lucene/index/DocInverterPerField.java b/lucene/src/java/org/apache/lucene/index/DocInverterPerField.java index d360fbfb230..2463326295c 100644 --- a/lucene/src/java/org/apache/lucene/index/DocInverterPerField.java +++ b/lucene/src/java/org/apache/lucene/index/DocInverterPerField.java @@ -35,20 +35,20 @@ import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; final class DocInverterPerField extends DocFieldConsumerPerField { - final private DocInverterPerThread perThread; - final private FieldInfo fieldInfo; + final private DocInverter parent; + final FieldInfo fieldInfo; final InvertedDocConsumerPerField consumer; final InvertedDocEndConsumerPerField endConsumer; - final DocumentsWriter.DocState docState; + final DocumentsWriterPerThread.DocState docState; final FieldInvertState fieldState; - public DocInverterPerField(DocInverterPerThread perThread, FieldInfo fieldInfo) { - this.perThread = perThread; + public DocInverterPerField(DocInverter parent, FieldInfo fieldInfo) { + this.parent = parent; this.fieldInfo = fieldInfo; - docState = perThread.docState; - fieldState = perThread.fieldState; - this.consumer = perThread.consumer.addField(this, fieldInfo); - this.endConsumer = perThread.endConsumer.addField(this, fieldInfo); + docState = 
parent.docState; + fieldState = parent.fieldState; + this.consumer = parent.consumer.addField(this, fieldInfo); + this.endConsumer = parent.endConsumer.addField(this, fieldInfo); } @Override @@ -80,8 +80,8 @@ final class DocInverterPerField extends DocFieldConsumerPerField { if (!field.isTokenized()) { // un-tokenized field String stringValue = field.stringValue(); final int valueLength = stringValue.length(); - perThread.singleToken.reinit(stringValue, 0, valueLength); - fieldState.attributeSource = perThread.singleToken; + parent.singleToken.reinit(stringValue, 0, valueLength); + fieldState.attributeSource = parent.singleToken; consumer.start(field); boolean success = false; @@ -89,8 +89,9 @@ final class DocInverterPerField extends DocFieldConsumerPerField { consumer.add(); success = true; } finally { - if (!success) + if (!success) { docState.docWriter.setAborting(); + } } fieldState.offset += valueLength; fieldState.length++; @@ -114,8 +115,8 @@ final class DocInverterPerField extends DocFieldConsumerPerField { if (stringValue == null) { throw new IllegalArgumentException("field must have either TokenStream, String or Reader value"); } - perThread.stringReader.init(stringValue); - reader = perThread.stringReader; + parent.stringReader.init(stringValue); + reader = parent.stringReader; } // Tokenize field and add to postingTable @@ -166,8 +167,9 @@ final class DocInverterPerField extends DocFieldConsumerPerField { consumer.add(); success = true; } finally { - if (!success) + if (!success) { docState.docWriter.setAborting(); + } } fieldState.length++; fieldState.position++; @@ -195,4 +197,9 @@ final class DocInverterPerField extends DocFieldConsumerPerField { consumer.finish(); endConsumer.finish(); } + + @Override + FieldInfo getFieldInfo() { + return fieldInfo; + } } diff --git a/lucene/src/java/org/apache/lucene/index/DocInverterPerThread.java b/lucene/src/java/org/apache/lucene/index/DocInverterPerThread.java deleted file mode 100644 index 2816519f9b2..00000000000 --- a/lucene/src/java/org/apache/lucene/index/DocInverterPerThread.java +++ /dev/null @@ -1,92 +0,0 @@ -package org.apache.lucene.index; - -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.IOException; - -import org.apache.lucene.util.AttributeSource; -import org.apache.lucene.analysis.tokenattributes.OffsetAttribute; -import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; - -/** This is a DocFieldConsumer that inverts each field, - * separately, from a Document, and accepts a - * InvertedTermsConsumer to process those terms. 
*/ - -final class DocInverterPerThread extends DocFieldConsumerPerThread { - final DocInverter docInverter; - final InvertedDocConsumerPerThread consumer; - final InvertedDocEndConsumerPerThread endConsumer; - final SingleTokenAttributeSource singleToken = new SingleTokenAttributeSource(); - - static class SingleTokenAttributeSource extends AttributeSource { - final CharTermAttribute termAttribute; - final OffsetAttribute offsetAttribute; - - private SingleTokenAttributeSource() { - termAttribute = addAttribute(CharTermAttribute.class); - offsetAttribute = addAttribute(OffsetAttribute.class); - } - - public void reinit(String stringValue, int startOffset, int endOffset) { - termAttribute.setEmpty().append(stringValue); - offsetAttribute.setOffset(startOffset, endOffset); - } - } - - final DocumentsWriter.DocState docState; - - final FieldInvertState fieldState = new FieldInvertState(); - - // Used to read a string value for a field - final ReusableStringReader stringReader = new ReusableStringReader(); - - public DocInverterPerThread(DocFieldProcessorPerThread docFieldProcessorPerThread, DocInverter docInverter) { - this.docInverter = docInverter; - docState = docFieldProcessorPerThread.docState; - consumer = docInverter.consumer.addThread(this); - endConsumer = docInverter.endConsumer.addThread(this); - } - - @Override - public void startDocument() throws IOException { - consumer.startDocument(); - endConsumer.startDocument(); - } - - @Override - public DocumentsWriter.DocWriter finishDocument() throws IOException { - // TODO: allow endConsumer.finishDocument to also return - // a DocWriter - endConsumer.finishDocument(); - return consumer.finishDocument(); - } - - @Override - void abort() { - try { - consumer.abort(); - } finally { - endConsumer.abort(); - } - } - - @Override - public DocFieldConsumerPerField addField(FieldInfo fi) { - return new DocInverterPerField(this, fi); - } -} diff --git a/lucene/src/java/org/apache/lucene/index/DocTermOrds.java b/lucene/src/java/org/apache/lucene/index/DocTermOrds.java new file mode 100644 index 00000000000..7bf10a8b06f --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/DocTermOrds.java @@ -0,0 +1,801 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.index; + +import org.apache.lucene.util.PagedBytes; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.Bits; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.Comparator; + +/** + * This class enables fast access to multiple term ords for + * a specified field across all docIDs. + * + * Like FieldCache, it uninverts the index and holds a + * packed data structure in RAM to enable fast access. 
+ * Unlike FieldCache, it can handle multi-valued fields, + * and, it does not hold the term bytes in RAM. Rather, you + * must obtain a TermsEnum from the {@link #getOrdTermsEnum} + * method, and then seek-by-ord to get the term's bytes. + * + * While normally term ords are type long, in this API they are + * int as the internal representation here cannot address + * more than MAX_INT unique terms. Also, typically this + * class is used on fields with relatively few unique terms + * vs the number of documents. In addition, there is an + * internal limit (16 MB) on how many bytes each chunk of + * documents may consume. If you trip this limit you'll hit + * an IllegalStateException. + * + * Deleted documents are skipped during uninversion, and if + * you look them up you'll get 0 ords. + * + * The returned per-document ords do not retain their + * original order in the document. Instead they are returned + * in sorted (by ord, ie term's BytesRef comparator) order. They + * are also de-dup'd (ie if doc has same term more than once + * in this field, you'll only get that ord back once). + * + * This class tests whether the provided reader is able to + * retrieve terms by ord (ie, it's single segment, and it + * uses an ord-capable terms index). If not, this class + * will create its own term index internally, allowing to + * create a wrapped TermsEnum that can handle ord. The + * {@link #getOrdTermsEnum} method then provides this + * wrapped enum, if necessary. + * + * The RAM consumption of this class can be high! + * + * @lucene.experimental + */ + +/* + * Final form of the un-inverted field: + * Each document points to a list of term numbers that are contained in that document. + * + * Term numbers are in sorted order, and are encoded as variable-length deltas from the + * previous term number. Real term numbers start at 2 since 0 and 1 are reserved. A + * term number of 0 signals the end of the termNumber list. + * + * There is a single int[maxDoc()] which either contains a pointer into a byte[] for + * the termNumber lists, or directly contains the termNumber list if it fits in the 4 + * bytes of an integer. If the first byte in the integer is 1, the next 3 bytes + * are a pointer into a byte[] where the termNumber list starts. + * + * There are actually 256 byte arrays, to compensate for the fact that the pointers + * into the byte arrays are only 3 bytes long. The correct byte array for a document + * is a function of it's id. + * + * To save space and speed up faceting, any term that matches enough documents will + * not be un-inverted... it will be skipped while building the un-inverted field structure, + * and will use a set intersection method during faceting. + * + * To further save memory, the terms (the actual string values) are not all stored in + * memory, but a TermIndex is used to convert term numbers to term values only + * for the terms needed after faceting has completed. Only every 128th term value + * is stored, along with it's corresponding term number, and this is used as an + * index to find the closest term and iterate until the desired number is hit (very + * much like Lucene's own internal term index). 
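A worked sketch of the byte[] form just described: term numbers are delta-coded, shifted by TNUM_OFFSET (2) so that 0 can terminate the list, and written as vInts with the same byte layout as writeInt() further down. The helper below is illustrative only; it omits the inline-in-the-int fast path and the ordBase adjustment. For example, term numbers {5, 9, 12} encode to the bytes {7, 6, 5, 0}.

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class TermNumberListSketch {
  static final int TNUM_OFFSET = 2;

  // Delta-encode ascending term numbers, terminated by a 0 vInt.
  static byte[] encode(int[] termNumbers) {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    int last = 0;
    for (int tnum : termNumbers) {
      writeVInt(tnum - last + TNUM_OFFSET, out);
      last = tnum;
    }
    writeVInt(0, out); // end of list
    return out.toByteArray();
  }

  // Same 7-bit groups, high groups first with the 0x80 continuation bit, as writeInt().
  static void writeVInt(int x, ByteArrayOutputStream out) {
    for (int shift = 28; shift > 0; shift -= 7) {
      if ((x >>> shift) != 0) {
        out.write(((x >>> shift) & 0x7f) | 0x80);
      }
    }
    out.write(x & 0x7f);
  }

  // Decode back to absolute term numbers, mirroring the byte[] branch of TermOrdsIterator.read().
  static List<Integer> decode(byte[] arr) {
    final List<Integer> termNumbers = new ArrayList<Integer>();
    int upto = 0;
    int tnum = 0;
    for (;;) {
      int delta = 0;
      byte b;
      do {
        b = arr[upto++];
        delta = (delta << 7) | (b & 0x7f);
      } while ((b & 0x80) != 0);
      if (delta == 0) {
        break; // terminator
      }
      tnum += delta - TNUM_OFFSET;
      termNumbers.add(tnum);
    }
    return termNumbers;
  }
}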
+ * + */ + +public class DocTermOrds { + + // Term ords are shifted by this, internally, to reseve + // values 0 (end term) and 1 (index is a pointer into byte array) + private final static int TNUM_OFFSET = 2; + + // Default: every 128th term is indexed + public final static int DEFAULT_INDEX_INTERVAL_BITS = 7; // decrease to a low number like 2 for testing + + private int indexIntervalBits; + private int indexIntervalMask; + private int indexInterval; + + protected final int maxTermDocFreq; + + protected final String field; + + protected int numTermsInField; + protected long termInstances; // total number of references to term numbers + private long memsz; + protected int total_time; // total time to uninvert the field + protected int phase1_time; // time for phase1 of the uninvert process + + protected int[] index; + protected byte[][] tnums = new byte[256][]; + protected long sizeOfIndexedStrings; + protected BytesRef[] indexedTermsArray; + protected BytesRef prefix; + protected int ordBase; + + protected DocsEnum docsEnum; //used while uninverting + + public long ramUsedInBytes() { + // can cache the mem size since it shouldn't change + if (memsz!=0) return memsz; + long sz = 8*8 + 32; // local fields + if (index != null) sz += index.length * 4; + if (tnums!=null) { + for (byte[] arr : tnums) + if (arr != null) sz += arr.length; + } + memsz = sz; + return sz; + } + + /** Inverts all terms */ + public DocTermOrds(IndexReader reader, String field) throws IOException { + this(reader, field, null, Integer.MAX_VALUE); + } + + /** Inverts only terms starting w/ prefix */ + public DocTermOrds(IndexReader reader, String field, BytesRef termPrefix) throws IOException { + this(reader, field, termPrefix, Integer.MAX_VALUE); + } + + /** Inverts only terms starting w/ prefix, and only terms + * whose docFreq (not taking deletions into account) is + * <= maxTermDocFreq */ + public DocTermOrds(IndexReader reader, String field, BytesRef termPrefix, int maxTermDocFreq) throws IOException { + this(reader, field, termPrefix, maxTermDocFreq, DEFAULT_INDEX_INTERVAL_BITS); + uninvert(reader, termPrefix); + } + + /** Inverts only terms starting w/ prefix, and only terms + * whose docFreq (not taking deletions into account) is + * <= maxTermDocFreq, with a custom indexing interval + * (default is every 128nd term). */ + public DocTermOrds(IndexReader reader, String field, BytesRef termPrefix, int maxTermDocFreq, int indexIntervalBits) throws IOException { + this(field, maxTermDocFreq, indexIntervalBits); + uninvert(reader, termPrefix); + } + + /** Subclass inits w/ this, but be sure you then call + * uninvert, only once */ + protected DocTermOrds(String field, int maxTermDocFreq, int indexIntervalBits) throws IOException { + //System.out.println("DTO init field=" + field + " maxTDFreq=" + maxTermDocFreq); + this.field = field; + this.maxTermDocFreq = maxTermDocFreq; + this.indexIntervalBits = indexIntervalBits; + indexIntervalMask = 0xffffffff >>> (32-indexIntervalBits); + indexInterval = 1 << indexIntervalBits; + } + + /** Returns a TermsEnum that implements ord. If the + * provided reader supports ord, we just return its + * TermsEnum; if it does not, we build a "private" terms + * index internally (WARNING: consumes RAM) and use that + * index to implement ord. This also enables ord on top + * of a composite reader. The returned TermsEnum is + * unpositioned. This returns null if there are no terms. 
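A hypothetical usage sketch for the wrapped enum described above, using only the API visible in this file; the reader, the field name, and any surrounding setup are assumed to exist elsewhere:

import java.io.IOException;
import org.apache.lucene.index.DocTermOrds;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class DocTermOrdsUsageSketch {
  // Prints every term of the given field together with its ord, via the ord-capable enum.
  static void listTermOrds(IndexReader reader, String field) throws IOException {
    final DocTermOrds dto = new DocTermOrds(reader, field); // uninverts the field
    final TermsEnum te = dto.getOrdTermsEnum(reader);       // null if there are no terms
    if (te == null) {
      return;
    }
    BytesRef term;
    while ((term = te.next()) != null) {
      System.out.println("ord=" + te.ord() + " term=" + term.utf8ToString());
    }
  }
}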
+ * + * NOTE: you must pass the same reader that was + * used when creating this class */ + public TermsEnum getOrdTermsEnum(IndexReader reader) throws IOException { + if (termInstances == 0) { + return null; + } + if (indexedTermsArray == null) { + //System.out.println("GET normal enum"); + final Terms terms = MultiFields.getTerms(reader, field); + if (terms != null) { + return terms.iterator(); + } else { + return null; + } + } else { + //System.out.println("GET wrapped enum ordBase=" + ordBase); + return new OrdWrappedTermsEnum(reader); + } + } + + /** Subclass can override this */ + protected void visitTerm(TermsEnum te, int termNum) throws IOException { + } + + protected void setActualDocFreq(int termNum, int df) throws IOException { + } + + // Call this only once (if you subclass!) + protected void uninvert(final IndexReader reader, final BytesRef termPrefix) throws IOException { + //System.out.println("DTO uninvert field=" + field + " prefix=" + termPrefix); + final long startTime = System.currentTimeMillis(); + prefix = termPrefix == null ? null : new BytesRef(termPrefix); + + final int maxDoc = reader.maxDoc(); + final int[] index = new int[maxDoc]; // immediate term numbers, or the index into the byte[] representing the last number + final int[] lastTerm = new int[maxDoc]; // last term we saw for this document + final byte[][] bytes = new byte[maxDoc][]; // list of term numbers for the doc (delta encoded vInts) + + final Terms terms = MultiFields.getTerms(reader, field); + if (terms == null) { + // No terms + return; + } + + final TermsEnum te = terms.iterator(); + final BytesRef seekStart = termPrefix != null ? termPrefix : new BytesRef(); + //System.out.println("seekStart=" + seekStart.utf8ToString()); + if (te.seek(seekStart) == TermsEnum.SeekStatus.END) { + // No terms match + return; + } + + // If we need our "term index wrapper", these will be + // init'd below: + List
indexedTerms = null; + PagedBytes indexedTermsBytes = null; + + boolean testedOrd = false; + + final Bits delDocs = MultiFields.getDeletedDocs(reader); + + // we need a minimum of 9 bytes, but round up to 12 since the space would + // be wasted with most allocators anyway. + byte[] tempArr = new byte[12]; + + // + // enumerate all terms, and build an intermediate form of the un-inverted field. + // + // During this intermediate form, every document has a (potential) byte[] + // and the int[maxDoc()] array either contains the termNumber list directly + // or the *end* offset of the termNumber list in it's byte array (for faster + // appending and faster creation of the final form). + // + // idea... if things are too large while building, we could do a range of docs + // at a time (but it would be a fair amount slower to build) + // could also do ranges in parallel to take advantage of multiple CPUs + + // OPTIONAL: remap the largest df terms to the lowest 128 (single byte) + // values. This requires going over the field first to find the most + // frequent terms ahead of time. + + int termNum = 0; + docsEnum = null; + + // Loop begins with te positioned to first term (we call + // seek above): + for (;;) { + final BytesRef t = te.term(); + if (t == null || (termPrefix != null && !t.startsWith(termPrefix))) { + break; + } + //System.out.println("visit term=" + t.utf8ToString() + " " + t + " termNum=" + termNum); + + if (!testedOrd) { + try { + ordBase = (int) te.ord(); + //System.out.println("got ordBase=" + ordBase); + } catch (UnsupportedOperationException uoe) { + // Reader cannot provide ord support, so we wrap + // our own support by creating our own terms index: + indexedTerms = new ArrayList (); + indexedTermsBytes = new PagedBytes(15); + //System.out.println("NO ORDS"); + } + testedOrd = true; + } + + visitTerm(te, termNum); + + if (indexedTerms != null && (termNum & indexIntervalMask) == 0) { + // Index this term + sizeOfIndexedStrings += t.length; + BytesRef indexedTerm = new BytesRef(); + indexedTermsBytes.copy(t, indexedTerm); + // TODO: really should 1) strip off useless suffix, + // and 2) use FST not array/PagedBytes + indexedTerms.add(indexedTerm); + } + + final int df = te.docFreq(); + if (df <= maxTermDocFreq) { + + docsEnum = te.docs(delDocs, docsEnum); + + final DocsEnum.BulkReadResult bulkResult = docsEnum.getBulkResult(); + + // dF, but takes deletions into account + int actualDF = 0; + + for (;;) { + int chunk = docsEnum.read(); + if (chunk <= 0) { + break; + } + //System.out.println(" chunk=" + chunk + " docs"); + + actualDF += chunk; + + for (int i=0; i >>=8; + } + // point at the end index in the byte[] + index[doc] = (endPos<<8) | 1; + bytes[doc] = tempArr; + tempArr = new byte[12]; + } + } + } + } + setActualDocFreq(termNum, actualDF); + } + + termNum++; + if (te.next() == null) { + break; + } + } + + numTermsInField = termNum; + + long midPoint = System.currentTimeMillis(); + + if (termInstances == 0) { + // we didn't invert anything + // lower memory consumption. + tnums = null; + } else { + + this.index = index; + + // + // transform intermediate form into the final form, building a single byte[] + // at a time, and releasing the intermediate byte[]s as we go to avoid + // increasing the memory footprint. 
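+ // Worked example of the intermediate per-doc encoding (illustrative, not
+ // from the patch): if a document saw term numbers 3, 5 and 9, the deltas
+ // stored for it are (3-0)+TNUM_OFFSET=5, (5-3)+TNUM_OFFSET=4 and
+ // (9-5)+TNUM_OFFSET=6, each written as a vInt. While the encoded bytes fit
+ // into the 4 bytes of index[doc] they stay inlined there; once they
+ // overflow, bytes[doc] holds the vInt list and index[doc] becomes
+ // (endPos<<8)|1, a tagged pointer to its end.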
+ // + + for (int pass = 0; pass<256; pass++) { + byte[] target = tnums[pass]; + int pos=0; // end in target; + if (target != null) { + pos = target.length; + } else { + target = new byte[4096]; + } + + // loop over documents, 0x00ppxxxx, 0x01ppxxxx, 0x02ppxxxx + // where pp is the pass (which array we are building), and xx is all values. + // each pass shares the same byte[] for termNumber lists. + for (int docbase = pass<<16; docbase maxDoc) + break; + } + + if (indexedTerms != null) { + indexedTermsArray = indexedTerms.toArray(new BytesRef[indexedTerms.size()]); + } + } + + long endTime = System.currentTimeMillis(); + + total_time = (int)(endTime-startTime); + phase1_time = (int)(midPoint-startTime); + } + + /** Number of bytes to represent an unsigned int as a vint. */ + private static int vIntSize(int x) { + if ((x & (0xffffffff << (7*1))) == 0 ) { + return 1; + } + if ((x & (0xffffffff << (7*2))) == 0 ) { + return 2; + } + if ((x & (0xffffffff << (7*3))) == 0 ) { + return 3; + } + if ((x & (0xffffffff << (7*4))) == 0 ) { + return 4; + } + return 5; + } + + // todo: if we know the size of the vInt already, we could do + // a single switch on the size + private static int writeInt(int x, byte[] arr, int pos) { + int a; + a = (x >>> (7*4)); + if (a != 0) { + arr[pos++] = (byte)(a | 0x80); + } + a = (x >>> (7*3)); + if (a != 0) { + arr[pos++] = (byte)(a | 0x80); + } + a = (x >>> (7*2)); + if (a != 0) { + arr[pos++] = (byte)(a | 0x80); + } + a = (x >>> (7*1)); + if (a != 0) { + arr[pos++] = (byte)(a | 0x80); + } + arr[pos++] = (byte)(x & 0x7f); + return pos; + } + + public class TermOrdsIterator { + private int tnum; + private int upto; + private byte[] arr; + + /** Buffer must be at least 5 ints long. Returns number + * of term ords placed into buffer; if this count is + * less than buffer.length then that is the end. */ + public int read(int[] buffer) { + int bufferUpto = 0; + if (arr == null) { + // code is inlined into upto + //System.out.println("inlined"); + int code = upto; + int delta = 0; + for (;;) { + delta = (delta << 7) | (code & 0x7f); + if ((code & 0x80)==0) { + if (delta==0) break; + tnum += delta - TNUM_OFFSET; + buffer[bufferUpto++] = ordBase+tnum; + //System.out.println(" tnum=" + tnum); + delta = 0; + } + code >>>= 8; + } + } else { + // code is a pointer + for(;;) { + int delta = 0; + for(;;) { + byte b = arr[upto++]; + delta = (delta << 7) | (b & 0x7f); + //System.out.println(" cycle: upto=" + upto + " delta=" + delta + " b=" + b); + if ((b & 0x80) == 0) break; + } + //System.out.println(" delta=" + delta); + if (delta == 0) break; + tnum += delta - TNUM_OFFSET; + //System.out.println(" tnum=" + tnum); + buffer[bufferUpto++] = ordBase+tnum; + if (bufferUpto == buffer.length) { + break; + } + } + } + + return bufferUpto; + } + + public TermOrdsIterator reset(int docID) { + //System.out.println(" reset docID=" + docID); + tnum = 0; + final int code = index[docID]; + if ((code & 0xff)==1) { + // a pointer + upto = code>>>8; + //System.out.println(" pointer! upto=" + upto); + int whichArray = (docID >>> 16) & 0xff; + arr = tnums[whichArray]; + } else { + //System.out.println(" inline!"); + arr = null; + upto = code; + } + return this; + } + } + + /** Returns an iterator to step through the term ords for + * this document. It's also possible to subclass this + * class and directly access members. 
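+ *
+ * A minimal read-loop sketch (illustrative only, not part of this patch;
+ * docID is assumed to be a valid document id):
+ * <pre>
+ *   TermOrdsIterator iter = lookup(docID, null);  // or pass a previous iterator for reuse
+ *   final int[] buffer = new int[5];              // must be at least 5 ints
+ *   int count;
+ *   do {
+ *     count = iter.read(buffer);                  // ords land in buffer[0..count)
+ *   } while (count == buffer.length);             // a short read marks the end
+ * </pre>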
*/ + public TermOrdsIterator lookup(int doc, TermOrdsIterator reuse) { + final TermOrdsIterator ret; + if (reuse != null) { + ret = reuse; + } else { + ret = new TermOrdsIterator(); + } + return ret.reset(doc); + } + + /* Only used if original IndexReader doesn't implement + * ord; in this case we "wrap" our own terms index + * around it. */ + private final class OrdWrappedTermsEnum extends TermsEnum { + private final IndexReader reader; + private final TermsEnum termsEnum; + private BytesRef term; + private long ord = -indexInterval-1; // force "real" seek + + public OrdWrappedTermsEnum(IndexReader reader) throws IOException { + this.reader = reader; + assert indexedTermsArray != null; + termsEnum = MultiFields.getTerms(reader, field).iterator(); + } + + @Override + public Comparator getComparator() throws IOException { + return termsEnum.getComparator(); + } + + @Override + public DocsEnum docs(Bits skipDocs, DocsEnum reuse) throws IOException { + return termsEnum.docs(skipDocs, reuse); + } + + @Override + public DocsAndPositionsEnum docsAndPositions(Bits skipDocs, DocsAndPositionsEnum reuse) throws IOException { + return termsEnum.docsAndPositions(skipDocs, reuse); + } + + @Override + public BytesRef term() { + return term; + } + + @Override + public BytesRef next() throws IOException { + ord++; + if (termsEnum.next() == null) { + term = null; + return null; + } + return setTerm(); // this is extra work if we know we are in bounds... + } + + @Override + public int docFreq() throws IOException { + return termsEnum.docFreq(); + } + + @Override + public long totalTermFreq() throws IOException { + return termsEnum.totalTermFreq(); + } + + @Override + public long ord() throws IOException { + return ordBase + ord; + } + + @Override + public SeekStatus seek(BytesRef target, boolean useCache) throws IOException { + + // already here + if (term != null && term.equals(target)) { + return SeekStatus.FOUND; + } + + int startIdx = Arrays.binarySearch(indexedTermsArray, target); + + if (startIdx >= 0) { + // we hit the term exactly... lucky us! + TermsEnum.SeekStatus seekStatus = termsEnum.seek(target); + assert seekStatus == TermsEnum.SeekStatus.FOUND; + ord = startIdx << indexIntervalBits; + setTerm(); + assert term != null; + return SeekStatus.FOUND; + } + + // we didn't hit the term exactly + startIdx = -startIdx-1; + + if (startIdx == 0) { + // our target occurs *before* the first term + TermsEnum.SeekStatus seekStatus = termsEnum.seek(target); + assert seekStatus == TermsEnum.SeekStatus.NOT_FOUND; + ord = 0; + setTerm(); + assert term != null; + return SeekStatus.NOT_FOUND; + } + + // back up to the start of the block + startIdx--; + + if ((ord >> indexIntervalBits) == startIdx && term != null && term.compareTo(target) <= 0) { + // we are already in the right block and the current term is before the term we want, + // so we don't need to seek. 
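+ // (Worked example, not from the patch: with the default interval of 128 a
+ // binarySearch insertion point of 5 places the target in block 4, so the
+ // else branch below seeks the enum to indexedTermsArray[4], i.e. ord
+ // 4<<7 = 512, and then scans forward term by term.)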
+ } else { + // seek to the right block + TermsEnum.SeekStatus seekStatus = termsEnum.seek(indexedTermsArray[startIdx]); + assert seekStatus == TermsEnum.SeekStatus.FOUND; + ord = startIdx << indexIntervalBits; + setTerm(); + assert term != null; // should be non-null since it's in the index + } + + while (term != null && term.compareTo(target) < 0) { + next(); + } + + if (term == null) { + return SeekStatus.END; + } else if (term.compareTo(target) == 0) { + return SeekStatus.FOUND; + } else { + return SeekStatus.NOT_FOUND; + } + } + + @Override + public SeekStatus seek(long targetOrd) throws IOException { + int delta = (int) (targetOrd - ordBase - ord); + //System.out.println(" seek(ord) targetOrd=" + targetOrd + " delta=" + delta + " ord=" + ord); + if (delta < 0 || delta > indexInterval) { + final int idx = (int) (targetOrd >>> indexIntervalBits); + final BytesRef base = indexedTermsArray[idx]; + //System.out.println(" do seek term=" + base.utf8ToString()); + ord = idx << indexIntervalBits; + delta = (int) (targetOrd - ord); + final TermsEnum.SeekStatus seekStatus = termsEnum.seek(base, true); + assert seekStatus == TermsEnum.SeekStatus.FOUND; + } else { + //System.out.println("seek w/in block"); + } + + while (--delta >= 0) { + BytesRef br = termsEnum.next(); + if (br == null) { + term = null; + return null; + } + ord++; + } + + setTerm(); + return term == null ? SeekStatus.END : SeekStatus.FOUND; + //System.out.println(" return term=" + term.utf8ToString()); + } + + private BytesRef setTerm() throws IOException { + term = termsEnum.term(); + //System.out.println(" setTerm() term=" + term.utf8ToString() + " vs prefix=" + (prefix == null ? "null" : prefix.utf8ToString())); + if (prefix != null && !term.startsWith(prefix)) { + term = null; + } + return term; + } + } + + public BytesRef lookupTerm(TermsEnum termsEnum, int ord) throws IOException { + TermsEnum.SeekStatus status = termsEnum.seek(ord); + assert status == TermsEnum.SeekStatus.FOUND; + return termsEnum.term(); + } +} diff --git a/lucene/src/java/org/apache/lucene/index/DocumentsWriter.java b/lucene/src/java/org/apache/lucene/index/DocumentsWriter.java index 4f81085a38e..5e316c21fee 100644 --- a/lucene/src/java/org/apache/lucene/index/DocumentsWriter.java +++ b/lucene/src/java/org/apache/lucene/index/DocumentsWriter.java @@ -19,36 +19,27 @@ package org.apache.lucene.index; import java.io.IOException; import java.io.PrintStream; -import java.text.NumberFormat; -import java.util.ArrayList; import java.util.Collection; -import java.util.HashMap; -import java.util.HashSet; +import java.util.Iterator; +import java.util.LinkedList; import java.util.List; -import java.util.concurrent.atomic.AtomicLong; +import java.util.Queue; +import java.util.concurrent.atomic.AtomicInteger; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.document.Document; +import org.apache.lucene.index.DocumentsWriterPerThread.FlushedSegment; +import org.apache.lucene.index.DocumentsWriterPerThread.IndexingChain; +import org.apache.lucene.index.DocumentsWriterPerThreadPool.ThreadState; +import org.apache.lucene.index.FieldInfos.FieldNumberBiMap; import org.apache.lucene.search.Query; import org.apache.lucene.search.SimilarityProvider; import org.apache.lucene.store.AlreadyClosedException; import org.apache.lucene.store.Directory; -import org.apache.lucene.store.RAMFile; -import org.apache.lucene.util.ArrayUtil; -import org.apache.lucene.util.BitVector; -import org.apache.lucene.util.RamUsageEstimator; -import 
org.apache.lucene.util.RecyclingByteBlockAllocator; -import org.apache.lucene.util.ThreadInterruptedException; - -import static org.apache.lucene.util.ByteBlockPool.BYTE_BLOCK_MASK; -import static org.apache.lucene.util.ByteBlockPool.BYTE_BLOCK_SIZE; /** * This class accepts multiple added documents and directly - * writes a single segment file. It does this more - * efficiently than creating a single segment per document - * (with DocumentWriter) and doing standard merges on those - * segments. + * writes segment files. * * Each added document is passed to the {@link DocConsumer}, * which in turn processes the document and interacts with @@ -111,266 +102,117 @@ import static org.apache.lucene.util.ByteBlockPool.BYTE_BLOCK_SIZE; */ final class DocumentsWriter { - final AtomicLong bytesUsed = new AtomicLong(0); - IndexWriter writer; Directory directory; - String segment; // Current segment we are working on - - private int nextDocID; // Next docID to be added - private int numDocs; // # of docs added, but not yet flushed - - // Max # ThreadState instances; if there are more threads - // than this they share ThreadStates - private DocumentsWriterThreadState[] threadStates = new DocumentsWriterThreadState[0]; - private final HashMap threadBindings = new HashMap (); - - boolean bufferIsFull; // True when it's time to write segment - private boolean aborting; // True if an abort is pending + private volatile boolean closed; PrintStream infoStream; SimilarityProvider similarityProvider; - // max # simultaneous threads; if there are more than - // this, they wait for others to finish first - private final int maxThreadStates; + List newFiles; - // TODO: cutover to BytesRefHash - // Deletes for our still-in-RAM (to be flushed next) segment - private BufferedDeletes pendingDeletes = new BufferedDeletes(false); - - static class DocState { - DocumentsWriter docWriter; - Analyzer analyzer; - PrintStream infoStream; - SimilarityProvider similarityProvider; - int docID; - Document doc; - String maxTermPrefix; + final IndexWriter indexWriter; - // Only called by asserts - public boolean testPoint(String name) { - return docWriter.writer.testPoint(name); - } + private AtomicInteger numDocsInRAM = new AtomicInteger(0); - public void clear() { - // don't hold onto doc nor analyzer, in case it is - // largish: - doc = null; - analyzer = null; - } - } + // TODO: cut over to BytesRefHash in BufferedDeletes + volatile DocumentsWriterDeleteQueue deleteQueue = new DocumentsWriterDeleteQueue(); + private final Queue ticketQueue = new LinkedList (); - /** Consumer returns this on each doc. This holds any - * state that must be flushed synchronized "in docID - * order". We gather these and flush them in order. */ - abstract static class DocWriter { - DocWriter next; - int docID; - abstract void finish() throws IOException; - abstract void abort(); - abstract long sizeInBytes(); + private Collection abortedFiles; // List of files that were written before last abort() - void setNext(DocWriter next) { - this.next = next; - } - } + final IndexingChain chain; - /** - * Create and return a new DocWriterBuffer. - */ - PerDocBuffer newPerDocBuffer() { - return new PerDocBuffer(); - } - - /** - * RAMFile buffer for DocWriters. - */ - class PerDocBuffer extends RAMFile { - - /** - * Allocate bytes used from shared pool. - */ - @Override - protected byte[] newBuffer(int size) { - assert size == PER_DOC_BLOCK_SIZE; - return perDocAllocator.getByteBlock(); - } - - /** - * Recycle the bytes used. 
- */ - synchronized void recycle() { - if (buffers.size() > 0) { - setLength(0); - - // Recycle the blocks - perDocAllocator.recycleByteBlocks(buffers); - buffers.clear(); - sizeInBytes = 0; - - assert numBuffers() == 0; - } - } - } - - /** - * The IndexingChain must define the {@link #getChain(DocumentsWriter)} method - * which returns the DocConsumer that the DocumentsWriter calls to process the - * documents. - */ - abstract static class IndexingChain { - abstract DocConsumer getChain(DocumentsWriter documentsWriter); - } - - static final IndexingChain defaultIndexingChain = new IndexingChain() { - - @Override - DocConsumer getChain(DocumentsWriter documentsWriter) { - /* - This is the current indexing chain: - - DocConsumer / DocConsumerPerThread - --> code: DocFieldProcessor / DocFieldProcessorPerThread - --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField - --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField - --> code: DocInverter / DocInverterPerThread / DocInverterPerField - --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField - --> code: TermsHash / TermsHashPerThread / TermsHashPerField - --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField - --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField - --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField - --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField - --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField - --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField - */ - - // Build up indexing chain: - - final TermsHashConsumer termVectorsWriter = new TermVectorsTermsWriter(documentsWriter); - final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter(); - /* - * nesting TermsHash instances here to allow the secondary (TermVectors) share the interned postings - * via a shared ByteBlockPool. See TermsHashPerField for details. - */ - final TermsHash termVectorsTermHash = new TermsHash(documentsWriter, false, termVectorsWriter, null); - final InvertedDocConsumer termsHash = new TermsHash(documentsWriter, true, freqProxWriter, termVectorsTermHash); - final NormsWriter normsWriter = new NormsWriter(); - final DocInverter docInverter = new DocInverter(termsHash, normsWriter); - return new DocFieldProcessor(documentsWriter, docInverter); - } - }; - - final DocConsumer consumer; - - // How much RAM we can use before flushing. This is 0 if - // we are flushing by doc count instead. 
- - private final IndexWriterConfig config; - - private boolean closed; - private FieldInfos fieldInfos; - - private final BufferedDeletesStream bufferedDeletesStream; - private final IndexWriter.FlushControl flushControl; - - DocumentsWriter(IndexWriterConfig config, Directory directory, IndexWriter writer, IndexingChain indexingChain, FieldInfos fieldInfos, + final DocumentsWriterPerThreadPool perThreadPool; + final FlushPolicy flushPolicy; + final DocumentsWriterFlushControl flushControl; + final Healthiness healthiness; + DocumentsWriter(IndexWriterConfig config, Directory directory, IndexWriter writer, FieldNumberBiMap globalFieldNumbers, BufferedDeletesStream bufferedDeletesStream) throws IOException { this.directory = directory; - this.writer = writer; + this.indexWriter = writer; this.similarityProvider = config.getSimilarityProvider(); - this.maxThreadStates = config.getMaxThreadStates(); - this.fieldInfos = fieldInfos; - this.bufferedDeletesStream = bufferedDeletesStream; - flushControl = writer.flushControl; - consumer = config.getIndexingChain().getChain(this); - this.config = config; + this.perThreadPool = config.getIndexerThreadPool(); + this.chain = config.getIndexingChain(); + this.perThreadPool.initialize(this, globalFieldNumbers, config); + final FlushPolicy configuredPolicy = config.getFlushPolicy(); + if (configuredPolicy == null) { + flushPolicy = new FlushByRamOrCountsPolicy(); + } else { + flushPolicy = configuredPolicy; + } + flushPolicy.init(this); + + healthiness = new Healthiness(); + final long maxRamPerDWPT = config.getRAMPerThreadHardLimitMB() * 1024 * 1024; + flushControl = new DocumentsWriterFlushControl(this, healthiness, maxRamPerDWPT); } - // Buffer a specific docID for deletion. Currently only - // used when we hit a exception when adding a document - synchronized void deleteDocID(int docIDUpto) { - pendingDeletes.addDocID(docIDUpto); - // NOTE: we do not trigger flush here. This is - // potentially a RAM leak, if you have an app that tries - // to add docs but every single doc always hits a - // non-aborting exception. Allowing a flush here gets - // very messy because we are only invoked when handling - // exceptions so to do this properly, while handling an - // exception we'd have to go off and flush new deletes - // which is risky (likely would hit some other - // confounding exception). - } - - boolean deleteQueries(Query... queries) { - final boolean doFlush = flushControl.waitUpdate(0, queries.length); - synchronized(this) { - for (Query query : queries) { - pendingDeletes.addQuery(query, numDocs); - } + synchronized void deleteQueries(final Query... queries) throws IOException { + deleteQueue.addDelete(queries); + flushControl.doOnDelete(); + if (flushControl.doApplyAllDeletes()) { + applyAllDeletes(deleteQueue); } - return doFlush; - } - - boolean deleteQuery(Query query) { - final boolean doFlush = flushControl.waitUpdate(0, 1); - synchronized(this) { - pendingDeletes.addQuery(query, numDocs); - } - return doFlush; - } - - boolean deleteTerms(Term... 
terms) { - final boolean doFlush = flushControl.waitUpdate(0, terms.length); - synchronized(this) { - for (Term term : terms) { - pendingDeletes.addTerm(term, numDocs); - } - } - return doFlush; } // TODO: we could check w/ FreqProxTermsWriter: if the // term doesn't exist, don't bother buffering into the // per-DWPT map (but still must go into the global map) - boolean deleteTerm(Term term, boolean skipWait) { - final boolean doFlush = flushControl.waitUpdate(0, 1, skipWait); - synchronized(this) { - pendingDeletes.addTerm(term, numDocs); + synchronized void deleteTerms(final Term... terms) throws IOException { + final DocumentsWriterDeleteQueue deleteQueue = this.deleteQueue; + deleteQueue.addDelete(terms); + flushControl.doOnDelete(); + if (flushControl.doApplyAllDeletes()) { + applyAllDeletes(deleteQueue); } - return doFlush; } - /** If non-null, various details of indexing are printed - * here. */ + DocumentsWriterDeleteQueue currentDeleteSession() { + return deleteQueue; + } + + private void applyAllDeletes(DocumentsWriterDeleteQueue deleteQueue) throws IOException { + if (deleteQueue != null) { + synchronized (ticketQueue) { + // Freeze and insert the delete flush ticket in the queue + ticketQueue.add(new FlushTicket(deleteQueue.freezeGlobalBuffer(null), false)); + applyFlushTickets(); + } + } + indexWriter.applyAllDeletes(); + indexWriter.flushCount.incrementAndGet(); + } + synchronized void setInfoStream(PrintStream infoStream) { this.infoStream = infoStream; - for(int i=0;i it = perThreadPool.getAllPerThreadsIterator(); + while (it.hasNext()) { + it.next().perThread.docState.infoStream = infoStream; } } - /** Get current segment name we are writing. */ - synchronized String getSegment() { - return segment; - } - /** Returns how many docs are currently buffered in RAM. */ - synchronized int getNumDocs() { - return numDocs; + int getNumDocs() { + return numDocsInRAM.get(); } - void message(String message) { - if (infoStream != null) { - writer.message("DW: " + message); - } + Collection abortedFiles() { + return abortedFiles; } - synchronized void setAborting() { + // returns boolean for asserts + boolean message(String message) { if (infoStream != null) { - message("setAborting"); + indexWriter.message("DW: " + message); + } + return true; + } + + private void ensureOpen() throws AlreadyClosedException { + if (closed) { + throw new AlreadyClosedException("this IndexWriter is closed"); } - aborting = true; } /** Called if we hit an exception at a bad time (when @@ -378,820 +220,335 @@ final class DocumentsWriter { * currently buffered docs. This resets our state, * discarding any docs added since last flush. 
*/ synchronized void abort() throws IOException { - if (infoStream != null) { - message("docWriter: abort"); - } - boolean success = false; + synchronized (this) { + deleteQueue.clear(); + } + try { - - // Forcefully remove waiting ThreadStates from line - waitQueue.abort(); - - // Wait for all other threads to finish with - // DocumentsWriter: - waitIdle(); - if (infoStream != null) { - message("docWriter: abort waitIdle done"); + message("docWriter: abort"); } - assert 0 == waitQueue.numWaiting: "waitQueue.numWaiting=" + waitQueue.numWaiting; + final Iterator threadsIterator = perThreadPool.getActivePerThreadsIterator(); - waitQueue.waitingBytes = 0; - - pendingDeletes.clear(); - - for (DocumentsWriterThreadState threadState : threadStates) + while (threadsIterator.hasNext()) { + ThreadState perThread = threadsIterator.next(); + perThread.lock(); try { - threadState.consumer.abort(); - } catch (Throwable t) { + if (perThread.isActive()) { // we might be closed + perThread.perThread.abort(); + perThread.perThread.checkAndResetHasAborted(); + } else { + assert closed; + } + } finally { + perThread.unlock(); } - - try { - consumer.abort(); - } catch (Throwable t) { } - // Reset all postings data - doAfterFlush(); success = true; } finally { - aborting = false; - notifyAll(); if (infoStream != null) { - message("docWriter: done abort; success=" + success); + message("docWriter: done abort; abortedFiles=" + abortedFiles + " success=" + success); } } } - /** Reset after a flush */ - private void doAfterFlush() throws IOException { - // All ThreadStates should be idle when we are called - assert allThreadsIdle(); - for (DocumentsWriterThreadState threadState : threadStates) { - threadState.consumer.doAfterFlush(); - } - - threadBindings.clear(); - waitQueue.reset(); - segment = null; - fieldInfos = new FieldInfos(fieldInfos); - numDocs = 0; - nextDocID = 0; - bufferIsFull = false; - for(int i=0;i BD - final long delGen = bufferedDeletesStream.getNextGen(); - if (pendingDeletes.any()) { - if (segmentInfos.size() > 0 || newSegment != null) { - final FrozenBufferedDeletes packet = new FrozenBufferedDeletes(pendingDeletes, delGen); - if (infoStream != null) { - message("flush: push buffered deletes startSize=" + pendingDeletes.bytesUsed.get() + " frozenSize=" + packet.bytesUsed); - } - bufferedDeletesStream.push(packet); - if (infoStream != null) { - message("flush: delGen=" + packet.gen); - } - if (newSegment != null) { - newSegment.setBufferedDeletesGen(packet.gen); - } - } else { - if (infoStream != null) { - message("flush: drop buffered deletes: no segments"); - } - // We can safely discard these deletes: since - // there are no segments, the deletions cannot - // affect anything. - } - pendingDeletes.clear(); - } else if (newSegment != null) { - newSegment.setBufferedDeletesGen(delGen); - } + //for testing + public int getNumBufferedDeleteTerms() { + return deleteQueue.numGlobalTermDeletes(); } public boolean anyDeletions() { - return pendingDeletes.any(); + return deleteQueue.anyChanges(); } - /** Flush all pending docs to a new segment */ - // Lock order: IW -> DW - synchronized SegmentInfo flush(IndexWriter writer, IndexFileDeleter deleter, MergePolicy mergePolicy, SegmentInfos segmentInfos) throws IOException { - - final long startTime = System.currentTimeMillis(); - - // We change writer's segmentInfos: - assert Thread.holdsLock(writer); - - waitIdle(); - - if (numDocs == 0) { - // nothing to do! 
- if (infoStream != null) { - message("flush: no docs; skipping"); - } - // Lock order: IW -> DW -> BD - pushDeletes(null, segmentInfos); - return null; - } - - if (aborting) { - if (infoStream != null) { - message("flush: skip because aborting is set"); - } - return null; - } - - boolean success = false; - - SegmentInfo newSegment; - - try { - assert nextDocID == numDocs; - assert waitQueue.numWaiting == 0; - assert waitQueue.waitingBytes == 0; - - if (infoStream != null) { - message("flush postings as segment " + segment + " numDocs=" + numDocs); - } - - final SegmentWriteState flushState = segWriteState(true); - // Apply delete-by-docID now (delete-byDocID only - // happens when an exception is hit processing that - // doc, eg if analyzer has some problem w/ the text): - if (pendingDeletes.docIDs.size() > 0) { - flushState.deletedDocs = new BitVector(numDocs); - for(int delDocID : pendingDeletes.docIDs) { - flushState.deletedDocs.set(delDocID); - } - pendingDeletes.bytesUsed.addAndGet(-pendingDeletes.docIDs.size() * BufferedDeletes.BYTES_PER_DEL_DOCID); - pendingDeletes.docIDs.clear(); - } - - newSegment = new SegmentInfo(segment, numDocs, directory, false, fieldInfos.hasProx(), flushState.segmentCodecs, false, fieldInfos); - - Collection threads = new HashSet (); - for (DocumentsWriterThreadState threadState : threadStates) { - threads.add(threadState.consumer); - } - - double startMBUsed = bytesUsed()/1024./1024.; - - consumer.flush(threads, flushState); - - newSegment.setHasVectors(flushState.hasVectors); - - if (infoStream != null) { - message("new segment has " + (flushState.hasVectors ? "vectors" : "no vectors")); - if (flushState.deletedDocs != null) { - message("new segment has " + flushState.deletedDocs.count() + " deleted docs"); - } - message("flushedFiles=" + newSegment.files()); - message("flushed codecs=" + newSegment.getSegmentCodecs()); - } - - if (mergePolicy.useCompoundFile(segmentInfos, newSegment)) { - final String cfsFileName = IndexFileNames.segmentFileName(segment, "", IndexFileNames.COMPOUND_FILE_EXTENSION); - - if (infoStream != null) { - message("flush: create compound file \"" + cfsFileName + "\""); - } - - CompoundFileWriter cfsWriter = new CompoundFileWriter(directory, cfsFileName); - for(String fileName : newSegment.files()) { - cfsWriter.addFile(fileName); - } - cfsWriter.close(); - deleter.deleteNewFiles(newSegment.files()); - newSegment.setUseCompoundFile(true); - } - - // Must write deleted docs after the CFS so we don't - // slurp the del file into CFS: - if (flushState.deletedDocs != null) { - final int delCount = flushState.deletedDocs.count(); - assert delCount > 0; - newSegment.setDelCount(delCount); - newSegment.advanceDelGen(); - final String delFileName = newSegment.getDelFileName(); - if (infoStream != null) { - message("flush: write " + delCount + " deletes to " + delFileName); - } - boolean success2 = false; - try { - // TODO: in the NRT case it'd be better to hand - // this del vector over to the - // shortly-to-be-opened SegmentReader and let it - // carry the changes; there's no reason to use - // filesystem as intermediary here. 
- flushState.deletedDocs.write(directory, delFileName); - success2 = true; - } finally { - if (!success2) { - try { - directory.deleteFile(delFileName); - } catch (Throwable t) { - // suppress this so we keep throwing the - // original exception - } - } - } - } - - if (infoStream != null) { - message("flush: segment=" + newSegment); - final double newSegmentSizeNoStore = newSegment.sizeInBytes(false)/1024./1024.; - final double newSegmentSize = newSegment.sizeInBytes(true)/1024./1024.; - message(" ramUsed=" + nf.format(startMBUsed) + " MB" + - " newFlushedSize=" + nf.format(newSegmentSize) + " MB" + - " (" + nf.format(newSegmentSizeNoStore) + " MB w/o doc stores)" + - " docs/MB=" + nf.format(numDocs / newSegmentSize) + - " new/old=" + nf.format(100.0 * newSegmentSizeNoStore / startMBUsed) + "%"); - } - - success = true; - } finally { - notifyAll(); - if (!success) { - if (segment != null) { - deleter.refresh(segment); - } - abort(); - } - } - - doAfterFlush(); - - // Lock order: IW -> DW -> BD - pushDeletes(newSegment, segmentInfos); - if (infoStream != null) { - message("flush time " + (System.currentTimeMillis()-startTime) + " msec"); - } - - return newSegment; - } - - SegmentWriteState segWriteState(boolean flush) { - return new SegmentWriteState(infoStream, directory, segment, fieldInfos, - numDocs, writer.getConfig().getTermIndexInterval(), - fieldInfos.buildSegmentCodecs(flush), - pendingDeletes, bytesUsed); - } - - synchronized void close() { + void close() { closed = true; - notifyAll(); + flushControl.setClosed(); } - /** Returns a free (idle) ThreadState that may be used for - * indexing this one document. This call also pauses if a - * flush is pending. If delTerm is non-null then we - * buffer this deleted term after the thread state has - * been acquired. */ - synchronized DocumentsWriterThreadState getThreadState(Document doc, Term delTerm) throws IOException { + boolean updateDocument(final Document doc, final Analyzer analyzer, + final Term delTerm) throws CorruptIndexException, IOException { + ensureOpen(); + boolean maybeMerge = false; + final boolean isUpdate = delTerm != null; + if (healthiness.anyStalledThreads()) { - final Thread currentThread = Thread.currentThread(); - assert !Thread.holdsLock(writer); + // Help out flushing any pending DWPTs so we can un-stall: + if (infoStream != null) { + message("WARNING DocumentsWriter has stalled threads; will hijack this thread to flush pending segment(s)"); + } - // First, find a thread state. If this thread already - // has affinity to a specific ThreadState, use that one - // again. - DocumentsWriterThreadState state = threadBindings.get(currentThread); - if (state == null) { - - // First time this thread has called us since last - // flush. 
Find the least loaded thread state: - DocumentsWriterThreadState minThreadState = null; - for(int i=0;i = maxThreadStates)) { - state = minThreadState; - state.numThreads++; - } else { - // Just create a new "private" thread state - DocumentsWriterThreadState[] newArray = new DocumentsWriterThreadState[1+threadStates.length]; - if (threadStates.length > 0) { - System.arraycopy(threadStates, 0, newArray, 0, threadStates.length); - } - state = newArray[threadStates.length] = new DocumentsWriterThreadState(this); - threadStates = newArray; - } - threadBindings.put(currentThread, state); - } - // Next, wait until my thread state is idle (in case - // it's shared with other threads), and no flush/abort - // pending - waitReady(state); - - // Allocate segment name if this is the first doc since - // last flush: - if (segment == null) { - segment = writer.newSegmentName(); - assert numDocs == 0; - } - - state.docState.docID = nextDocID++; - - if (delTerm != null) { - pendingDeletes.addTerm(delTerm, state.docState.docID); - } - - numDocs++; - state.isIdle = false; - return state; - } - - boolean addDocument(Document doc, Analyzer analyzer) throws CorruptIndexException, IOException { - return updateDocument(doc, analyzer, null); - } - - boolean updateDocument(Document doc, Analyzer analyzer, Term delTerm) - throws CorruptIndexException, IOException { - - // Possibly trigger a flush, or wait until any running flush completes: - boolean doFlush = flushControl.waitUpdate(1, delTerm != null ? 1 : 0); - - // This call is synchronized but fast - final DocumentsWriterThreadState state = getThreadState(doc, delTerm); - - final DocState docState = state.docState; - docState.doc = doc; - docState.analyzer = analyzer; - - boolean success = false; - try { - // This call is not synchronized and does all the - // work - final DocWriter perDoc; - try { - perDoc = state.consumer.processDocument(fieldInfos); - } finally { - docState.clear(); + if (infoStream != null && healthiness.anyStalledThreads()) { + message("WARNING DocumentsWriter still has stalled threads; waiting"); } - // This call is synchronized but fast - finishDocument(state, perDoc); + healthiness.waitIfStalled(); // block if stalled - success = true; - } finally { - if (!success) { - - // If this thread state had decided to flush, we - // must clear it so another thread can flush - if (doFlush) { - flushControl.clearFlushPending(); - } - - if (infoStream != null) { - message("exception in updateDocument aborting=" + aborting); - } - - synchronized(this) { - - state.isIdle = true; - notifyAll(); - - if (aborting) { - abort(); - } else { - skipDocWriter.docID = docState.docID; - boolean success2 = false; - try { - waitQueue.add(skipDocWriter); - success2 = true; - } finally { - if (!success2) { - abort(); - return false; - } - } - - // Immediately mark this document as deleted - // since likely it was partially added. 
This - // keeps indexing as "all or none" (atomic) when - // adding a document: - deleteDocID(state.docState.docID); - } - } + if (infoStream != null && healthiness.anyStalledThreads()) { + message("WARNING DocumentsWriter done waiting"); } } - doFlush |= flushControl.flushByRAMUsage("new document"); - - return doFlush; - } - - public synchronized void waitIdle() { - while (!allThreadsIdle()) { - try { - wait(); - } catch (InterruptedException ie) { - throw new ThreadInterruptedException(ie); - } - } - } - - synchronized void waitReady(DocumentsWriterThreadState state) { - while (!closed && (!state.isIdle || aborting)) { - try { - wait(); - } catch (InterruptedException ie) { - throw new ThreadInterruptedException(ie); - } - } - - if (closed) { - throw new AlreadyClosedException("this IndexWriter is closed"); - } - } - - /** Does the synchronized work to finish/flush the - * inverted document. */ - private void finishDocument(DocumentsWriterThreadState perThread, DocWriter docWriter) throws IOException { - - // Must call this w/o holding synchronized(this) else - // we'll hit deadlock: - balanceRAM(); - - synchronized(this) { - - assert docWriter == null || docWriter.docID == perThread.docState.docID; - - if (aborting) { - - // We are currently aborting, and another thread is - // waiting for me to become idle. We just forcefully - // idle this threadState; it will be fully reset by - // abort() - if (docWriter != null) { - try { - docWriter.abort(); - } catch (Throwable t) { - } - } - - perThread.isIdle = true; - - // wakes up any threads waiting on the wait queue - notifyAll(); - - return; - } - - final boolean doPause; - - if (docWriter != null) { - doPause = waitQueue.add(docWriter); - } else { - skipDocWriter.docID = perThread.docState.docID; - doPause = waitQueue.add(skipDocWriter); - } - - if (doPause) { - waitForWaitQueue(); - } - - perThread.isIdle = true; - - // wakes up any threads waiting on the wait queue - notifyAll(); - } - } - - synchronized void waitForWaitQueue() { - do { - try { - wait(); - } catch (InterruptedException ie) { - throw new ThreadInterruptedException(ie); - } - } while (!waitQueue.doResume()); - } - - private static class SkipDocWriter extends DocWriter { - @Override - void finish() { - } - @Override - void abort() { - } - @Override - long sizeInBytes() { - return 0; - } - } - final SkipDocWriter skipDocWriter = new SkipDocWriter(); - - NumberFormat nf = NumberFormat.getInstance(); - - /* Initial chunks size of the shared byte[] blocks used to - store postings data */ - final static int BYTE_BLOCK_NOT_MASK = ~BYTE_BLOCK_MASK; - - /* if you increase this, you must fix field cache impl for - * getTerms/getTermsIndex requires <= 32768. 
*/ - final static int MAX_TERM_LENGTH_UTF8 = BYTE_BLOCK_SIZE-2; - - /* Initial chunks size of the shared int[] blocks used to - store postings data */ - final static int INT_BLOCK_SHIFT = 13; - final static int INT_BLOCK_SIZE = 1 << INT_BLOCK_SHIFT; - final static int INT_BLOCK_MASK = INT_BLOCK_SIZE - 1; - - private List freeIntBlocks = new ArrayList (); - - /* Allocate another int[] from the shared pool */ - synchronized int[] getIntBlock() { - final int size = freeIntBlocks.size(); - final int[] b; - if (0 == size) { - b = new int[INT_BLOCK_SIZE]; - bytesUsed.addAndGet(INT_BLOCK_SIZE*RamUsageEstimator.NUM_BYTES_INT); - } else { - b = freeIntBlocks.remove(size-1); - } - return b; - } - - long bytesUsed() { - return bytesUsed.get() + pendingDeletes.bytesUsed.get(); - } - - /* Return int[]s to the pool */ - synchronized void recycleIntBlocks(int[][] blocks, int start, int end) { - for(int i=start;i = ramBufferSize; + try { + + if (!perThread.isActive()) { + ensureOpen(); + assert false: "perThread is not active but we are still open"; + } + + final DocumentsWriterPerThread dwpt = perThread.perThread; + try { + dwpt.updateDocument(doc, analyzer, delTerm); + numDocsInRAM.incrementAndGet(); + } finally { + if (dwpt.checkAndResetHasAborted()) { + flushControl.doOnAbort(perThread); + } + } + flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate); + } finally { + perThread.unlock(); } - - if (doBalance) { - - if (infoStream != null) { - message(" RAM: balance allocations: usedMB=" + toMB(bytesUsed()) + - " vs trigger=" + toMB(ramBufferSize) + - " deletesMB=" + toMB(deletesRAMUsed) + - " byteBlockFree=" + toMB(byteBlockAllocator.bytesUsed()) + - " perDocFree=" + toMB(perDocAllocator.bytesUsed())); + + if (flushingDWPT != null) { + maybeMerge |= doFlush(flushingDWPT); + } else { + final DocumentsWriterPerThread nextPendingFlush = flushControl.nextPendingFlush(); + if (nextPendingFlush != null) { + maybeMerge |= doFlush(nextPendingFlush); } + } + return maybeMerge; + } - final long startBytesUsed = bytesUsed() + deletesRAMUsed; - - int iter = 0; - - // We free equally from each pool in 32 KB - // chunks until we are below our threshold - // (freeLevel) - - boolean any = true; - - final long freeLevel = (long) (0.95 * ramBufferSize); - - while(bytesUsed()+deletesRAMUsed > freeLevel) { + private boolean doFlush(DocumentsWriterPerThread flushingDWPT) throws IOException { + boolean maybeMerge = false; + while (flushingDWPT != null) { + maybeMerge = true; + boolean success = false; + FlushTicket ticket = null; - synchronized(this) { - if (0 == perDocAllocator.numBufferedBlocks() && - 0 == byteBlockAllocator.numBufferedBlocks() && - 0 == freeIntBlocks.size() && !any) { - // Nothing else to free -- must flush now. - bufferIsFull = bytesUsed()+deletesRAMUsed > ramBufferSize; - if (infoStream != null) { - if (bytesUsed()+deletesRAMUsed > ramBufferSize) { - message(" nothing to free; set bufferIsFull"); - } else { - message(" nothing to free"); - } + try { + assert currentFullFlushDelQueue == null + || flushingDWPT.deleteQueue == currentFullFlushDelQueue : "expected: " + + currentFullFlushDelQueue + "but was: " + flushingDWPT.deleteQueue + + " " + flushControl.isFullFlush(); + /* + * Since with DWPT the flush process is concurrent and several DWPT + * could flush at the same time we must maintain the order of the + * flushes before we can apply the flushed segment and the frozen global + * deletes it is buffering. 
The reason for this is that the global + * deletes mark a certain point in time where we took a DWPT out of + * rotation and freeze the global deletes. + * + * Example: A flush 'A' starts and freezes the global deletes, then + * flush 'B' starts and freezes all deletes occurred since 'A' has + * started. if 'B' finishes before 'A' we need to wait until 'A' is done + * otherwise the deletes frozen by 'B' are not applied to 'A' and we + * might miss to deletes documents in 'A'. + */ + try { + synchronized (ticketQueue) { + // Each flush is assigned a ticket in the order they accquire the ticketQueue lock + ticket = new FlushTicket(flushingDWPT.prepareFlush(), true); + ticketQueue.add(ticket); + } + + // flush concurrently without locking + final FlushedSegment newSegment = flushingDWPT.flush(); + synchronized (ticketQueue) { + ticket.segment = newSegment; + } + // flush was successful once we reached this point - new seg. has been assigned to the ticket! + success = true; + } finally { + if (!success && ticket != null) { + synchronized (ticketQueue) { + // In the case of a failure make sure we are making progress and + // apply all the deletes since the segment flush failed since the flush + // ticket could hold global deletes see FlushTicket#canPublish() + ticket.isSegmentFlush = false; } - break; - } - - if ((0 == iter % 4) && byteBlockAllocator.numBufferedBlocks() > 0) { - byteBlockAllocator.freeBlocks(1); - } - if ((1 == iter % 4) && freeIntBlocks.size() > 0) { - freeIntBlocks.remove(freeIntBlocks.size()-1); - bytesUsed.addAndGet(-INT_BLOCK_SIZE * RamUsageEstimator.NUM_BYTES_INT); - } - if ((2 == iter % 4) && perDocAllocator.numBufferedBlocks() > 0) { - perDocAllocator.freeBlocks(32); // Remove upwards of 32 blocks (each block is 1K) } } - - if ((3 == iter % 4) && any) { - // Ask consumer to free any recycled state - any = consumer.freeRAM(); - } - - iter++; + /* + * Now we are done and try to flush the ticket queue if the head of the + * queue has already finished the flush. + */ + applyFlushTickets(); + } finally { + flushControl.doAfterFlush(flushingDWPT); + flushingDWPT.checkAndResetHasAborted(); + indexWriter.flushCount.incrementAndGet(); } + + flushingDWPT = flushControl.nextPendingFlush(); + } + return maybeMerge; + } - if (infoStream != null) { - message(" after free: freedMB=" + nf.format((startBytesUsed-bytesUsed()-deletesRAMUsed)/1024./1024.) 
+ " usedMB=" + nf.format((bytesUsed()+deletesRAMUsed)/1024./1024.)); + private void applyFlushTickets() throws IOException { + synchronized (ticketQueue) { + while (true) { + // Keep publishing eligible flushed segments: + final FlushTicket head = ticketQueue.peek(); + if (head != null && head.canPublish()) { + ticketQueue.poll(); + finishFlush(head.segment, head.frozenDeletes); + } else { + break; + } } } } - final WaitQueue waitQueue = new WaitQueue(); - - private class WaitQueue { - DocWriter[] waiting; - int nextWriteDocID; - int nextWriteLoc; - int numWaiting; - long waitingBytes; - - public WaitQueue() { - waiting = new DocWriter[10]; - } - - synchronized void reset() { - // NOTE: nextWriteLoc doesn't need to be reset - assert numWaiting == 0; - assert waitingBytes == 0; - nextWriteDocID = 0; - } - - synchronized boolean doResume() { - final double mb = config.getRAMBufferSizeMB(); - final long waitQueueResumeBytes; - if (mb == IndexWriterConfig.DISABLE_AUTO_FLUSH) { - waitQueueResumeBytes = 2*1024*1024; - } else { - waitQueueResumeBytes = (long) (mb*1024*1024*0.05); - } - return waitingBytes <= waitQueueResumeBytes; - } - - synchronized boolean doPause() { - final double mb = config.getRAMBufferSizeMB(); - final long waitQueuePauseBytes; - if (mb == IndexWriterConfig.DISABLE_AUTO_FLUSH) { - waitQueuePauseBytes = 4*1024*1024; - } else { - waitQueuePauseBytes = (long) (mb*1024*1024*0.1); - } - return waitingBytes > waitQueuePauseBytes; - } - - synchronized void abort() { - int count = 0; - for(int i=0;i BDS so that the {@link SegmentInfo}'s + * delete generation is always GlobalPacket_deleteGeneration + 1 + */ + private void publishFlushedSegment(FlushedSegment newSegment, FrozenBufferedDeletes globalPacket) + throws IOException { + assert newSegment != null; + final SegmentInfo segInfo = indexWriter.prepareFlushedSegment(newSegment); + final BufferedDeletes deletes = newSegment.segmentDeletes; + FrozenBufferedDeletes packet = null; + if (deletes != null && deletes.any()) { + // Segment private delete + packet = new FrozenBufferedDeletes(deletes, true); + if (infoStream != null) { + message("flush: push buffered seg private deletes: " + packet); } } - synchronized public boolean add(DocWriter doc) throws IOException { + // now publish! + indexWriter.publishFlushedSegment(segInfo, packet, globalPacket); + } + + // for asserts + private volatile DocumentsWriterDeleteQueue currentFullFlushDelQueue = null; + // for asserts + private synchronized boolean setFlushingDeleteQueue(DocumentsWriterDeleteQueue session) { + currentFullFlushDelQueue = session; + return true; + } + + /* + * FlushAllThreads is synced by IW fullFlushLock. Flushing all threads is a + * two stage operation; the caller must ensure (in try/finally) that finishFlush + * is called after this method, to release the flush lock in DWFlushControl + */ + final boolean flushAllThreads() + throws IOException { + final DocumentsWriterDeleteQueue flushingDeleteQueue; - assert doc.docID >= nextWriteDocID; - - if (doc.docID == nextWriteDocID) { - writeDocument(doc); - while(true) { - doc = waiting[nextWriteLoc]; - if (doc != null) { - numWaiting--; - waiting[nextWriteLoc] = null; - waitingBytes -= doc.sizeInBytes(); - writeDocument(doc); - } else { - break; - } - } - } else { - - // I finished before documents that were added - // before me. This can easily happen when I am a - // small doc and the docs before me were large, or, - // just due to luck in the thread scheduling. 
Just - // add myself to the queue and when that large doc - // finishes, it will flush me: - int gap = doc.docID - nextWriteDocID; - if (gap >= waiting.length) { - // Grow queue - DocWriter[] newArray = new DocWriter[ArrayUtil.oversize(gap, RamUsageEstimator.NUM_BYTES_OBJECT_REF)]; - assert nextWriteLoc >= 0; - System.arraycopy(waiting, nextWriteLoc, newArray, 0, waiting.length-nextWriteLoc); - System.arraycopy(waiting, 0, newArray, waiting.length-nextWriteLoc, nextWriteLoc); - nextWriteLoc = 0; - waiting = newArray; - gap = doc.docID - nextWriteDocID; - } - - int loc = nextWriteLoc + gap; - if (loc >= waiting.length) { - loc -= waiting.length; - } - - // We should only wrap one time - assert loc < waiting.length; - - // Nobody should be in my spot! - assert waiting[loc] == null; - waiting[loc] = doc; - numWaiting++; - waitingBytes += doc.sizeInBytes(); + synchronized (this) { + flushingDeleteQueue = deleteQueue; + /* Cutover to a new delete queue. This must be synced on the flush control + * otherwise a new DWPT could sneak into the loop with an already flushing + * delete queue */ + flushControl.markForFullFlush(); // swaps the delQueue synced on FlushControl + assert setFlushingDeleteQueue(flushingDeleteQueue); + } + assert currentFullFlushDelQueue != null; + assert currentFullFlushDelQueue != deleteQueue; + + boolean anythingFlushed = false; + try { + DocumentsWriterPerThread flushingDWPT; + // Help out with flushing: + while ((flushingDWPT = flushControl.nextPendingFlush()) != null) { + anythingFlushed |= doFlush(flushingDWPT); } - - return doPause(); + // If a concurrent flush is still in flight wait for it + while (flushControl.anyFlushing()) { + flushControl.waitForFlush(); + } + if (!anythingFlushed) { // apply deletes if we did not flush any document + synchronized (ticketQueue) { + ticketQueue.add(new FlushTicket(flushingDeleteQueue.freezeGlobalBuffer(null), false)); + } + applyFlushTickets(); + } + } finally { + assert flushingDeleteQueue == currentFullFlushDelQueue; + } + return anythingFlushed; + } + + final void finishFullFlush(boolean success) { + assert setFlushingDeleteQueue(null); + if (success) { + // Release the flush lock + flushControl.finishFullFlush(); + } else { + flushControl.abortFullFlushes(); + } + } + + static final class FlushTicket { + final FrozenBufferedDeletes frozenDeletes; + /* access to non-final members must be synchronized on DW#ticketQueue */ + FlushedSegment segment; + boolean isSegmentFlush; + + FlushTicket(FrozenBufferedDeletes frozenDeletes, boolean isSegmentFlush) { + this.frozenDeletes = frozenDeletes; + this.isSegmentFlush = isSegmentFlush; + } + + boolean canPublish() { + return (!isSegmentFlush || segment != null); } } } diff --git a/lucene/src/java/org/apache/lucene/index/DocumentsWriterDeleteQueue.java b/lucene/src/java/org/apache/lucene/index/DocumentsWriterDeleteQueue.java new file mode 100644 index 00000000000..486c12659f7 --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/DocumentsWriterDeleteQueue.java @@ -0,0 +1,396 @@ +package org.apache.lucene.index; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with this + * work for additional information regarding copyright ownership. The ASF + * licenses this file to You under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. + */ +import java.util.concurrent.atomic.AtomicReferenceFieldUpdater; +import java.util.concurrent.locks.ReentrantLock; + +import org.apache.lucene.search.Query; + +/** + * {@link DocumentsWriterDeleteQueue} is a non-blocking linked pending deletes + * queue. In contrast to other queue implementation we only maintain the + * tail of the queue. A delete queue is always used in a context of a set of + * DWPTs and a global delete pool. Each of the DWPT and the global pool need to + * maintain their 'own' head of the queue (as a DeleteSlice instance per DWPT). + * The difference between the DWPT and the global pool is that the DWPT starts + * maintaining a head once it has added its first document since for its segments + * private deletes only the deletes after that document are relevant. The global + * pool instead starts maintaining the head once this instance is created by + * taking the sentinel instance as its initial head. + * + * Since each {@link DeleteSlice} maintains its own head and the list is only + * single linked the garbage collector takes care of pruning the list for us. + * All nodes in the list that are still relevant should be either directly or + * indirectly referenced by one of the DWPT's private {@link DeleteSlice} or by + * the global {@link BufferedDeletes} slice. + *
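+ * A condensed sketch of the per-DWPT protocol described below (illustrative
+ * only, not part of this patch; the local variables are assumptions):
+ * <pre>
+ *   DeleteSlice slice = queue.newSlice();            // taken when the DWPT starts its first document
+ *   queue.add(delTerm, slice);                       // document update: delTerm becomes the slice tail
+ *   slice.apply(privateBufferedDeletes, docIDUpto);  // apply only deletes recorded after this slice's head
+ *   queue.addDelete(someTerm);                       // plain deletes go straight to the global pool
+ * </pre>
+ *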
+ * Each DWPT as well as the global delete pool maintain their private + * DeleteSlice instance. In the DWPT case updating a slice is equivalent to + * atomically finishing the document. The slice update guarantees a "happens + * before" relationship to all other updates in the same indexing session. When a + * DWPT updates a document it: + * + *
+ *
+ * + * The DWPT also doesn't apply its current documents delete term until it has + * updated its delete slice which ensures the consistency of the update. If the + * update fails before the DeleteSlice could have been updated the deleteTerm + * will also not be added to its private deletes neither to the global deletes. + * + */ +final class DocumentsWriterDeleteQueue { + + private volatile Node tail; + + private static final AtomicReferenceFieldUpdater- consumes a document and finishes its processing
+ *- updates its private {@link DeleteSlice} either by calling + * {@link #updateSlice(DeleteSlice)} or {@link #add(Term, DeleteSlice)} (if the + * document has a delTerm)
+ *- applies all deletes in the slice to its private {@link BufferedDeletes} + * and resets it
+ *- increments its internal document id
+ *tailUpdater = AtomicReferenceFieldUpdater + .newUpdater(DocumentsWriterDeleteQueue.class, Node.class, "tail"); + + private final DeleteSlice globalSlice; + private final BufferedDeletes globalBufferedDeletes; + /* only acquired to update the global deletes */ + private final ReentrantLock globalBufferLock = new ReentrantLock(); + + final long generation; + + DocumentsWriterDeleteQueue() { + this(0); + } + + DocumentsWriterDeleteQueue(long generation) { + this(new BufferedDeletes(false), generation); + } + + DocumentsWriterDeleteQueue(BufferedDeletes globalBufferedDeletes, long generation) { + this.globalBufferedDeletes = globalBufferedDeletes; + this.generation = generation; + /* + * we use a sentinel instance as our initial tail. No slice will ever try to + * apply this tail since the head is always omitted. + */ + tail = new Node(null); // sentinel + globalSlice = new DeleteSlice(tail); + } + + void addDelete(Query... queries) { + add(new QueryArrayNode(queries)); + tryApplyGlobalSlice(); + } + + void addDelete(Term... terms) { + add(new TermArrayNode(terms)); + tryApplyGlobalSlice(); + } + + /** + * invariant for document update + */ + void add(Term term, DeleteSlice slice) { + final TermNode termNode = new TermNode(term); + add(termNode); + /* + * this is an update request where the term is the updated documents + * delTerm. in that case we need to guarantee that this insert is atomic + * with regards to the given delete slice. This means if two threads try to + * update the same document with in turn the same delTerm one of them must + * win. By taking the node we have created for our del term as the new tail + * it is guaranteed that if another thread adds the same right after us we + * will apply this delete next time we update our slice and one of the two + * competing updates wins! + */ + slice.sliceTail = termNode; + assert slice.sliceHead != slice.sliceTail : "slice head and tail must differ after add"; + tryApplyGlobalSlice(); // TODO doing this each time is not necessary maybe + // we can do it just every n times or so? + } + + void add(Node item) { + /* + * this non-blocking / 'wait-free' linked list add was inspired by Apache + * Harmony's ConcurrentLinkedQueue Implementation. + */ + while (true) { + final Node currentTail = this.tail; + final Node tailNext = currentTail.next; + if (tail == currentTail) { + if (tailNext != null) { + /* + * we are in intermediate state here. the tails next pointer has been + * advanced but the tail itself might not be updated yet. help to + * advance the tail and try again updating it. + */ + tailUpdater.compareAndSet(this, currentTail, tailNext); // can fail + } else { + /* + * we are in quiescent state and can try to insert the item to the + * current tail if we fail to insert we just retry the operation since + * somebody else has already added its item + */ + if (currentTail.casNext(null, item)) { + /* + * now that we are done we need to advance the tail while another + * thread could have advanced it already so we can ignore the return + * type of this CAS call + */ + tailUpdater.compareAndSet(this, currentTail, item); + return; + } + } + } + } + } + + boolean anyChanges() { + globalBufferLock.lock(); + try { + return !globalSlice.isEmpty() || globalBufferedDeletes.any(); + } finally { + globalBufferLock.unlock(); + } + } + + void tryApplyGlobalSlice() { + if (globalBufferLock.tryLock()) { + /* + * The global buffer must be locked but we don't need to upate them if + * there is an update going on right now. 
It is sufficient to apply the + * deletes that have been added after the current in-flight global slices + * tail the next time we can get the lock! + */ + try { + if (updateSlice(globalSlice)) { + globalSlice.apply(globalBufferedDeletes, BufferedDeletes.MAX_INT); + } + } finally { + globalBufferLock.unlock(); + } + } + } + + FrozenBufferedDeletes freezeGlobalBuffer(DeleteSlice callerSlice) { + globalBufferLock.lock(); + /* + * Here we freeze the global buffer so we need to lock it, apply all + * deletes in the queue and reset the global slice to let the GC prune the + * queue. + */ + final Node currentTail = tail; // take the current tail make this local any + // Changes after this call are applied later + // and not relevant here + if (callerSlice != null) { + // Update the callers slices so we are on the same page + callerSlice.sliceTail = currentTail; + } + try { + if (globalSlice.sliceTail != currentTail) { + globalSlice.sliceTail = currentTail; + globalSlice.apply(globalBufferedDeletes, BufferedDeletes.MAX_INT); + } + + final FrozenBufferedDeletes packet = new FrozenBufferedDeletes( + globalBufferedDeletes, false); + globalBufferedDeletes.clear(); + return packet; + } finally { + globalBufferLock.unlock(); + } + } + + DeleteSlice newSlice() { + return new DeleteSlice(tail); + } + + boolean updateSlice(DeleteSlice slice) { + if (slice.sliceTail != tail) { // If we are the same just + slice.sliceTail = tail; + return true; + } + return false; + } + + static class DeleteSlice { + // No need to be volatile, slices are thread captive (only accessed by one thread)! + Node sliceHead; // we don't apply this one + Node sliceTail; + + DeleteSlice(Node currentTail) { + assert currentTail != null; + /* + * Initially this is a 0 length slice pointing to the 'current' tail of + * the queue. Once we update the slice we only need to assign the tail and + * have a new slice + */ + sliceHead = sliceTail = currentTail; + } + + void apply(BufferedDeletes del, int docIDUpto) { + if (sliceHead == sliceTail) { + // 0 length slice + return; + } + /* + * When we apply a slice we take the head and get its next as our first + * item to apply and continue until we applied the tail. If the head and + * tail in this slice are not equal then there will be at least one more + * non-null node in the slice! + */ + Node current = sliceHead; + do { + current = current.next; + assert current != null : "slice property violated between the head on the tail must not be a null node"; + current.apply(del, docIDUpto); + } while (current != sliceTail); + reset(); + } + + void reset() { + // Reset to a 0 length slice + sliceHead = sliceTail; + } + + /** + * Returns true
iff the given item is identical to the item + * held by the slice's tail, otherwise false
. + */ + boolean isTailItem(Object item) { + return sliceTail.item == item; + } + + boolean isEmpty() { + return sliceHead == sliceTail; + } + } + + public int numGlobalTermDeletes() { + return globalBufferedDeletes.numTermDeletes.get(); + } + + void clear() { + globalBufferLock.lock(); + try { + final Node currentTail = tail; + globalSlice.sliceHead = globalSlice.sliceTail = currentTail; + globalBufferedDeletes.clear(); + } finally { + globalBufferLock.unlock(); + } + } + + private static class Node { + volatile Node next; + final Object item; + + private Node(Object item) { + this.item = item; + } + + static final AtomicReferenceFieldUpdaternextUpdater = AtomicReferenceFieldUpdater + .newUpdater(Node.class, Node.class, "next"); + + void apply(BufferedDeletes bufferedDeletes, int docIDUpto) { + assert false : "sentinel item must never be applied"; + } + + boolean casNext(Node cmp, Node val) { + return nextUpdater.compareAndSet(this, cmp, val); + } + } + + private static final class TermNode extends Node { + + TermNode(Term term) { + super(term); + } + + @Override + void apply(BufferedDeletes bufferedDeletes, int docIDUpto) { + bufferedDeletes.addTerm((Term) item, docIDUpto); + } + } + + private static final class QueryArrayNode extends Node { + QueryArrayNode(Query[] query) { + super(query); + } + + @Override + void apply(BufferedDeletes bufferedDeletes, int docIDUpto) { + final Query[] queries = (Query[]) item; + for (Query query : queries) { + bufferedDeletes.addQuery(query, docIDUpto); + } + } + } + + private static final class TermArrayNode extends Node { + TermArrayNode(Term[] term) { + super(term); + } + + @Override + void apply(BufferedDeletes bufferedDeletes, int docIDUpto) { + final Term[] terms = (Term[]) item; + for (Term term : terms) { + bufferedDeletes.addTerm(term, docIDUpto); + } + } + } + + + private boolean forceApplyGlobalSlice() { + globalBufferLock.lock(); + final Node currentTail = tail; + try { + if (globalSlice.sliceTail != currentTail) { + globalSlice.sliceTail = currentTail; + globalSlice.apply(globalBufferedDeletes, BufferedDeletes.MAX_INT); + } + return globalBufferedDeletes.any(); + } finally { + globalBufferLock.unlock(); + } + } + + public int getBufferedDeleteTermsSize() { + globalBufferLock.lock(); + try { + forceApplyGlobalSlice(); + return globalBufferedDeletes.terms.size(); + } finally { + globalBufferLock.unlock(); + } + } + + public long bytesUsed() { + return globalBufferedDeletes.bytesUsed.get(); + } + + @Override + public String toString() { + return "DWDQ: [ generation: " + generation + " ]"; + } + + +} diff --git a/lucene/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java b/lucene/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java new file mode 100644 index 00000000000..443df5139ca --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/DocumentsWriterFlushControl.java @@ -0,0 +1,394 @@ +package org.apache.lucene.index; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
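The add(Node) loop in DocumentsWriterDeleteQueue above is a ConcurrentLinkedQueue-style tail append: first CAS the current tail's next pointer to the new node, then opportunistically CAS the shared tail forward, helping any thread whose tail update has not landed yet. A self-contained sketch of that technique follows; CasAppendSketch and its field names are illustrative and not part of this patch.

import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

public final class CasAppendSketch {

  static final class Node {
    volatile Node next;
    final Object item;
    Node(Object item) { this.item = item; }

    static final AtomicReferenceFieldUpdater<Node, Node> NEXT_UPDATER =
        AtomicReferenceFieldUpdater.newUpdater(Node.class, Node.class, "next");
  }

  private volatile Node tail = new Node(null); // sentinel, never applied

  private static final AtomicReferenceFieldUpdater<CasAppendSketch, Node> TAIL_UPDATER =
      AtomicReferenceFieldUpdater.newUpdater(CasAppendSketch.class, Node.class, "tail");

  void add(Object item) {
    final Node node = new Node(item);
    while (true) {
      final Node currentTail = tail;
      final Node tailNext = currentTail.next;
      if (tail == currentTail) {
        if (tailNext != null) {
          // Intermediate state: a node is linked but tail is stale. Help advance it.
          TAIL_UPDATER.compareAndSet(this, currentTail, tailNext);
        } else if (Node.NEXT_UPDATER.compareAndSet(currentTail, null, node)) {
          // Our node is linked; losing the race to advance tail is harmless,
          // the next caller will repair it.
          TAIL_UPDATER.compareAndSet(this, currentTail, node);
          return;
        }
      }
    }
  }

  public static void main(String[] args) {
    CasAppendSketch queue = new CasAppendSketch();
    queue.add("term:foo");
    queue.add("term:bar");
    System.out.println("tail item = " + queue.tail.item); // term:bar
  }
}

Losing either compare-and-set is harmless here: a stale tail is repaired by whichever thread observes it next, which is why the append never needs to block.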
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Iterator; +import java.util.LinkedList; +import java.util.Queue; +import java.util.concurrent.atomic.AtomicBoolean; + +import org.apache.lucene.index.DocumentsWriterPerThreadPool.ThreadState; +import org.apache.lucene.util.ThreadInterruptedException; + +/** + * This class controls {@link DocumentsWriterPerThread} flushing during + * indexing. It tracks the memory consumption per + * {@link DocumentsWriterPerThread} and uses a configured {@link FlushPolicy} to + * decide if a {@link DocumentsWriterPerThread} must flush. + * + * In addition to the {@link FlushPolicy} the flush control might set certain + * {@link DocumentsWriterPerThread} as flush pending iff a + * {@link DocumentsWriterPerThread} exceeds the + * {@link IndexWriterConfig#getRAMPerThreadHardLimitMB()} to prevent address + * space exhaustion. + */ +public final class DocumentsWriterFlushControl { + + private final long hardMaxBytesPerDWPT; + private long activeBytes = 0; + private long flushBytes = 0; + private volatile int numPending = 0; + private volatile int numFlushing = 0; + final AtomicBoolean flushDeletes = new AtomicBoolean(false); + private boolean fullFlush = false; + private Queue
flushQueue = new LinkedList (); + // only for safety reasons if a DWPT is close to the RAM limit + private Queue blockedFlushes = new LinkedList (); + + + long peakActiveBytes = 0;// only with assert + long peakFlushBytes = 0;// only with assert + long peakNetBytes = 0;// only with assert + private final Healthiness healthiness; + private final DocumentsWriterPerThreadPool perThreadPool; + private final FlushPolicy flushPolicy; + private boolean closed = false; + private final HashMap flushingWriters = new HashMap (); + private final DocumentsWriter documentsWriter; + + DocumentsWriterFlushControl(DocumentsWriter documentsWriter, + Healthiness healthiness, long hardMaxBytesPerDWPT) { + this.healthiness = healthiness; + this.perThreadPool = documentsWriter.perThreadPool; + this.flushPolicy = documentsWriter.flushPolicy; + this.hardMaxBytesPerDWPT = hardMaxBytesPerDWPT; + this.documentsWriter = documentsWriter; + } + + public synchronized long activeBytes() { + return activeBytes; + } + + public synchronized long flushBytes() { + return flushBytes; + } + + public synchronized long netBytes() { + return flushBytes + activeBytes; + } + + private void commitPerThreadBytes(ThreadState perThread) { + final long delta = perThread.perThread.bytesUsed() + - perThread.bytesUsed; + perThread.bytesUsed += delta; + /* + * We need to differentiate here if we are pending since setFlushPending + * moves the perThread memory to the flushBytes and we could be set to + * pending during a delete + */ + if (perThread.flushPending) { + flushBytes += delta; + } else { + activeBytes += delta; + } + assert updatePeaks(delta); + } + + // only for asserts + private boolean updatePeaks(long delta) { + peakActiveBytes = Math.max(peakActiveBytes, activeBytes); + peakFlushBytes = Math.max(peakFlushBytes, flushBytes); + peakNetBytes = Math.max(peakNetBytes, netBytes()); + return true; + } + + synchronized DocumentsWriterPerThread doAfterDocument(ThreadState perThread, + boolean isUpdate) { + commitPerThreadBytes(perThread); + if (!perThread.flushPending) { + if (isUpdate) { + flushPolicy.onUpdate(this, perThread); + } else { + flushPolicy.onInsert(this, perThread); + } + if (!perThread.flushPending && perThread.bytesUsed > hardMaxBytesPerDWPT) { + // Safety check to prevent a single DWPT exceeding its RAM limit. This + // is super important since we can not address more than 2048 MB per DWPT + setFlushPending(perThread); + if (fullFlush) { + DocumentsWriterPerThread toBlock = internalTryCheckOutForFlush(perThread, false); + assert toBlock != null; + blockedFlushes.add(toBlock); + } + } + } + final DocumentsWriterPerThread flushingDWPT = tryCheckoutForFlush(perThread, false); + healthiness.updateStalled(this); + return flushingDWPT; + } + + synchronized void doAfterFlush(DocumentsWriterPerThread dwpt) { + assert flushingWriters.containsKey(dwpt); + try { + numFlushing--; + Long bytes = flushingWriters.remove(dwpt); + flushBytes -= bytes.longValue(); + perThreadPool.recycle(dwpt); + healthiness.updateStalled(this); + } finally { + notifyAll(); + } + } + + public synchronized boolean anyFlushing() { + return numFlushing != 0; + } + + public synchronized void waitForFlush() { + if (numFlushing != 0) { + try { + this.wait(); + } catch (InterruptedException e) { + throw new ThreadInterruptedException(e); + } + } + } + + /** + * Sets flush pending state on the given {@link ThreadState}. The + * {@link ThreadState} must have indexed at least on Document and must not be + * already pending. 
+ */ + public synchronized void setFlushPending(ThreadState perThread) { + assert !perThread.flushPending; + if (perThread.perThread.getNumDocsInRAM() > 0) { + perThread.flushPending = true; // write access synced + final long bytes = perThread.bytesUsed; + flushBytes += bytes; + activeBytes -= bytes; + numPending++; // write access synced + } // don't assert on numDocs since we could hit an abort excp. while selecting that dwpt for flushing + + } + + synchronized void doOnAbort(ThreadState state) { + if (state.flushPending) { + flushBytes -= state.bytesUsed; + } else { + activeBytes -= state.bytesUsed; + } + // Take it out of the loop this DWPT is stale + perThreadPool.replaceForFlush(state, closed); + healthiness.updateStalled(this); + } + + synchronized DocumentsWriterPerThread tryCheckoutForFlush( + ThreadState perThread, boolean setPending) { + if (fullFlush) { + return null; + } + return internalTryCheckOutForFlush(perThread, setPending); + } + + private DocumentsWriterPerThread internalTryCheckOutForFlush( + ThreadState perThread, boolean setPending) { + if (setPending && !perThread.flushPending) { + setFlushPending(perThread); + } + if (perThread.flushPending) { + // We are pending so all memory is already moved to flushBytes + if (perThread.tryLock()) { + try { + if (perThread.isActive()) { + assert perThread.isHeldByCurrentThread(); + final DocumentsWriterPerThread dwpt; + final long bytes = perThread.bytesUsed; // do that before + // replace! + dwpt = perThreadPool.replaceForFlush(perThread, closed); + assert !flushingWriters.containsKey(dwpt) : "DWPT is already flushing"; + // Record the flushing DWPT to reduce flushBytes in doAfterFlush + flushingWriters.put(dwpt, Long.valueOf(bytes)); + numPending--; // write access synced + numFlushing++; + return dwpt; + } + } finally { + perThread.unlock(); + } + } + } + return null; + } + + @Override + public String toString() { + return "DocumentsWriterFlushControl [activeBytes=" + activeBytes + + ", flushBytes=" + flushBytes + "]"; + } + + DocumentsWriterPerThread nextPendingFlush() { + synchronized (this) { + DocumentsWriterPerThread poll = flushQueue.poll(); + if (poll != null) { + return poll; + } + } + if (numPending > 0) { + final Iterator allActiveThreads = perThreadPool + .getActivePerThreadsIterator(); + while (allActiveThreads.hasNext() && numPending > 0) { + ThreadState next = allActiveThreads.next(); + if (next.flushPending) { + final DocumentsWriterPerThread dwpt = tryCheckoutForFlush(next, false); + if (dwpt != null) { + return dwpt; + } + } + } + } + return null; + } + + synchronized void setClosed() { + // set by DW to signal that we should not release new DWPT after close + this.closed = true; + } + + /** + * Returns an iterator that provides access to all currently active {@link ThreadState}s + */ + public Iterator allActiveThreads() { + return perThreadPool.getActivePerThreadsIterator(); + } + + synchronized void doOnDelete() { + // pass null this is a global delete no update + flushPolicy.onDelete(this, null); + } + + /** + * Returns the number of delete terms in the global pool + */ + public int getNumGlobalTermDeletes() { + return documentsWriter.deleteQueue.numGlobalTermDeletes(); + } + + int numFlushingDWPT() { + return numFlushing; + } + + public boolean doApplyAllDeletes() { + return flushDeletes.getAndSet(false); + } + + public void setApplyAllDeletes() { + flushDeletes.set(true); + } + + int numActiveDWPT() { + return this.perThreadPool.getMaxThreadStates(); + } + + void markForFullFlush() { + final 
DocumentsWriterDeleteQueue flushingQueue; + synchronized (this) { + assert !fullFlush; + fullFlush = true; + flushingQueue = documentsWriter.deleteQueue; + // Set a new delete queue - all subsequent DWPT will use this queue until + // we do another full flush + DocumentsWriterDeleteQueue newQueue = new DocumentsWriterDeleteQueue(flushingQueue.generation+1); + documentsWriter.deleteQueue = newQueue; + } + final Iterator allActiveThreads = perThreadPool + .getActivePerThreadsIterator(); + final ArrayList toFlush = new ArrayList (); + while (allActiveThreads.hasNext()) { + final ThreadState next = allActiveThreads.next(); + next.lock(); + try { + if (!next.isActive()) { + continue; + } + assert next.perThread.deleteQueue == flushingQueue + || next.perThread.deleteQueue == documentsWriter.deleteQueue : " flushingQueue: " + + flushingQueue + + " currentqueue: " + + documentsWriter.deleteQueue + + " perThread queue: " + + next.perThread.deleteQueue + + " numDocsInRam: " + next.perThread.getNumDocsInRAM(); + if (next.perThread.deleteQueue != flushingQueue) { + // this one is already a new DWPT + continue; + } + if (next.perThread.getNumDocsInRAM() > 0 ) { + final DocumentsWriterPerThread dwpt = next.perThread; // just for assert + final DocumentsWriterPerThread flushingDWPT = internalTryCheckOutForFlush(next, true); + assert flushingDWPT != null : "DWPT must never be null here since we hold the lock and it holds documents"; + assert dwpt == flushingDWPT : "flushControl returned different DWPT"; + toFlush.add(flushingDWPT); + } else { + // get the new delete queue from DW + next.perThread.initialize(); + } + } finally { + next.unlock(); + } + } + synchronized (this) { + assert assertBlockedFlushes(flushingQueue); + flushQueue.addAll(blockedFlushes); + blockedFlushes.clear(); + flushQueue.addAll(toFlush); + } + } + + synchronized void finishFullFlush() { + assert fullFlush; + assert flushQueue.isEmpty(); + try { + if (!blockedFlushes.isEmpty()) { + assert assertBlockedFlushes(documentsWriter.deleteQueue); + flushQueue.addAll(blockedFlushes); + blockedFlushes.clear(); + } + } finally { + fullFlush = false; + } + } + + boolean assertBlockedFlushes(DocumentsWriterDeleteQueue flushingQueue) { + Queue flushes = this.blockedFlushes; + for (DocumentsWriterPerThread documentsWriterPerThread : flushes) { + assert documentsWriterPerThread.deleteQueue == flushingQueue; + } + return true; + } + + synchronized void abortFullFlushes() { + try { + for (DocumentsWriterPerThread dwpt : flushQueue) { + doAfterFlush(dwpt); + } + for (DocumentsWriterPerThread dwpt : blockedFlushes) { + doAfterFlush(dwpt); + } + + } finally { + fullFlush = false; + flushQueue.clear(); + blockedFlushes.clear(); + } + } + + synchronized boolean isFullFlush() { + return fullFlush; + } +} \ No newline at end of file diff --git a/lucene/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java b/lucene/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java new file mode 100644 index 00000000000..e943055bc37 --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java @@ -0,0 +1,501 @@ +package org.apache.lucene.index; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
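DocumentsWriterFlushControl, shown above, keeps two running totals: bytes owned by active writers and bytes owned by writers already marked flush pending. commitPerThreadBytes adds each delta to one of the two buckets, setFlushPending moves a writer's bytes from active to flushing, doAfterFlush drops them, and a hard per-writer ceiling guards against a single DWPT outgrowing its addressable buffer. The simplified single-class model below collapses that bookkeeping into one object; all names are illustrative and the ThreadState locking is omitted.

public class FlushAccountingSketch {

  static final class WriterState {
    long bytesUsed;
    boolean flushPending;
  }

  private long activeBytes;
  private long flushBytes;
  private final long hardMaxBytesPerWriter;

  FlushAccountingSketch(long hardMaxBytesPerWriter) {
    this.hardMaxBytesPerWriter = hardMaxBytesPerWriter;
  }

  synchronized void commitBytes(WriterState state, long delta) {
    state.bytesUsed += delta;
    if (state.flushPending) {
      flushBytes += delta;        // pending writers grow the flush backlog
    } else {
      activeBytes += delta;       // everyone else grows the active pool
    }
    if (!state.flushPending && state.bytesUsed > hardMaxBytesPerWriter) {
      setFlushPending(state);     // hard per-writer safety limit
    }
  }

  synchronized void setFlushPending(WriterState state) {
    state.flushPending = true;
    activeBytes -= state.bytesUsed; // move this writer's bytes to the flush bucket
    flushBytes += state.bytesUsed;
  }

  synchronized void afterFlush(WriterState state) {
    flushBytes -= state.bytesUsed;  // flushed bytes leave the accounting entirely
    state.bytesUsed = 0;
    state.flushPending = false;
  }

  public static void main(String[] args) {
    FlushAccountingSketch control = new FlushAccountingSketch(32);
    WriterState w = new WriterState();
    control.commitBytes(w, 20);     // stays active
    control.commitBytes(w, 20);     // crosses the hard limit, becomes flush pending
    System.out.println("active=" + control.activeBytes + " flushing=" + control.flushBytes);
    control.afterFlush(w);
    System.out.println("active=" + control.activeBytes + " flushing=" + control.flushBytes);
  }
}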
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import static org.apache.lucene.util.ByteBlockPool.BYTE_BLOCK_MASK; +import static org.apache.lucene.util.ByteBlockPool.BYTE_BLOCK_SIZE; + +import java.io.IOException; +import java.io.PrintStream; +import java.text.NumberFormat; +import java.util.concurrent.atomic.AtomicLong; + +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.document.Document; +import org.apache.lucene.index.DocumentsWriterDeleteQueue.DeleteSlice; +import org.apache.lucene.search.SimilarityProvider; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.BitVector; +import org.apache.lucene.util.ByteBlockPool.Allocator; +import org.apache.lucene.util.RamUsageEstimator; + +public class DocumentsWriterPerThread { + + /** + * The IndexingChain must define the {@link #getChain(DocumentsWriter)} method + * which returns the DocConsumer that the DocumentsWriter calls to process the + * documents. + */ + abstract static class IndexingChain { + abstract DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread); + } + + + static final IndexingChain defaultIndexingChain = new IndexingChain() { + + @Override + DocConsumer getChain(DocumentsWriterPerThread documentsWriterPerThread) { + /* + This is the current indexing chain: + + DocConsumer / DocConsumerPerThread + --> code: DocFieldProcessor / DocFieldProcessorPerThread + --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField + --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField + --> code: DocInverter / DocInverterPerThread / DocInverterPerField + --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField + --> code: TermsHash / TermsHashPerThread / TermsHashPerField + --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField + --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField + --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField + --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField + --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField + --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField + */ + + // Build up indexing chain: + + final TermsHashConsumer termVectorsWriter = new TermVectorsTermsWriter(documentsWriterPerThread); + final TermsHashConsumer freqProxWriter = new FreqProxTermsWriter(); + + final InvertedDocConsumer termsHash = new TermsHash(documentsWriterPerThread, freqProxWriter, true, + new TermsHash(documentsWriterPerThread, termVectorsWriter, false, null)); + final NormsWriter normsWriter = new NormsWriter(); + final DocInverter docInverter = new DocInverter(documentsWriterPerThread.docState, termsHash, normsWriter); + return new DocFieldProcessor(documentsWriterPerThread, docInverter); + } + }; + + static class DocState { + final DocumentsWriterPerThread docWriter; + Analyzer analyzer; + PrintStream infoStream; + SimilarityProvider similarityProvider; + int docID; + Document doc; + String maxTermPrefix; + + 
DocState(DocumentsWriterPerThread docWriter) { + this.docWriter = docWriter; + } + + // Only called by asserts + public boolean testPoint(String name) { + return docWriter.writer.testPoint(name); + } + + public void clear() { + // don't hold onto doc nor analyzer, in case it is + // largish: + doc = null; + analyzer = null; + } + } + + static class FlushedSegment { + final SegmentInfo segmentInfo; + final BufferedDeletes segmentDeletes; + final BitVector deletedDocuments; + + private FlushedSegment(SegmentInfo segmentInfo, + BufferedDeletes segmentDeletes, BitVector deletedDocuments) { + this.segmentInfo = segmentInfo; + this.segmentDeletes = segmentDeletes; + this.deletedDocuments = deletedDocuments; + } + } + + /** Called if we hit an exception at a bad time (when + * updating the index files) and must discard all + * currently buffered docs. This resets our state, + * discarding any docs added since last flush. */ + void abort() throws IOException { + hasAborted = aborting = true; + try { + if (infoStream != null) { + message("docWriter: now abort"); + } + try { + consumer.abort(); + } catch (Throwable t) { + } + + pendingDeletes.clear(); + deleteSlice = deleteQueue.newSlice(); + // Reset all postings data + doAfterFlush(); + + } finally { + aborting = false; + if (infoStream != null) { + message("docWriter: done abort"); + } + } + } + + final DocumentsWriter parent; + final IndexWriter writer; + final Directory directory; + final DocState docState; + final DocConsumer consumer; + final AtomicLong bytesUsed; + + SegmentWriteState flushState; + //Deletes for our still-in-RAM (to be flushed next) segment + BufferedDeletes pendingDeletes; + String segment; // Current segment we are working on + boolean aborting = false; // True if an abort is pending + boolean hasAborted = false; // True if the last exception throws by #updateDocument was aborting + + private FieldInfos fieldInfos; + private final PrintStream infoStream; + private int numDocsInRAM; + private int flushedDocCount; + DocumentsWriterDeleteQueue deleteQueue; + DeleteSlice deleteSlice; + private final NumberFormat nf = NumberFormat.getInstance(); + + + public DocumentsWriterPerThread(Directory directory, DocumentsWriter parent, + FieldInfos fieldInfos, IndexingChain indexingChain) { + this.directory = directory; + this.parent = parent; + this.fieldInfos = fieldInfos; + this.writer = parent.indexWriter; + this.infoStream = parent.indexWriter.getInfoStream(); + this.docState = new DocState(this); + this.docState.similarityProvider = parent.indexWriter.getConfig() + .getSimilarityProvider(); + + consumer = indexingChain.getChain(this); + bytesUsed = new AtomicLong(0); + pendingDeletes = new BufferedDeletes(false); + initialize(); + } + + public DocumentsWriterPerThread(DocumentsWriterPerThread other, FieldInfos fieldInfos) { + this(other.directory, other.parent, fieldInfos, other.parent.chain); + } + + void initialize() { + deleteQueue = parent.deleteQueue; + assert numDocsInRAM == 0 : "num docs " + numDocsInRAM; + pendingDeletes.clear(); + deleteSlice = null; + } + + void setAborting() { + aborting = true; + } + + boolean checkAndResetHasAborted() { + final boolean retval = hasAborted; + hasAborted = false; + return retval; + } + + public void updateDocument(Document doc, Analyzer analyzer, Term delTerm) throws IOException { + assert writer.testPoint("DocumentsWriterPerThread addDocument start"); + assert deleteQueue != null; + docState.doc = doc; + docState.analyzer = analyzer; + docState.docID = numDocsInRAM; + if (segment == 
null) { + // this call is synchronized on IndexWriter.segmentInfos + segment = writer.newSegmentName(); + assert numDocsInRAM == 0; + } + + boolean success = false; + try { + try { + consumer.processDocument(fieldInfos); + } finally { + docState.clear(); + } + success = true; + } finally { + if (!success) { + if (!aborting) { + // mark document as deleted + deleteDocID(docState.docID); + numDocsInRAM++; + } else { + abort(); + } + } + } + success = false; + try { + consumer.finishDocument(); + success = true; + } finally { + if (!success) { + abort(); + } + } + finishDocument(delTerm); + } + + private void finishDocument(Term delTerm) throws IOException { + /* + * here we actually finish the document in two steps 1. push the delete into + * the queue and update our slice. 2. increment the DWPT private document + * id. + * + * the updated slice we get from 1. holds all the deletes that have occurred + * since we updated the slice the last time. + */ + if (deleteSlice == null) { + deleteSlice = deleteQueue.newSlice(); + if (delTerm != null) { + deleteQueue.add(delTerm, deleteSlice); + deleteSlice.reset(); + } + + } else { + if (delTerm != null) { + deleteQueue.add(delTerm, deleteSlice); + assert deleteSlice.isTailItem(delTerm) : "expected the delete term as the tail item"; + deleteSlice.apply(pendingDeletes, numDocsInRAM); + } else if (deleteQueue.updateSlice(deleteSlice)) { + deleteSlice.apply(pendingDeletes, numDocsInRAM); + } + } + ++numDocsInRAM; + } + + // Buffer a specific docID for deletion. Currently only + // used when we hit a exception when adding a document + void deleteDocID(int docIDUpto) { + pendingDeletes.addDocID(docIDUpto); + // NOTE: we do not trigger flush here. This is + // potentially a RAM leak, if you have an app that tries + // to add docs but every single doc always hits a + // non-aborting exception. Allowing a flush here gets + // very messy because we are only invoked when handling + // exceptions so to do this properly, while handling an + // exception we'd have to go off and flush new deletes + // which is risky (likely would hit some other + // confounding exception). + } + + /** + * Returns the number of delete terms in this {@link DocumentsWriterPerThread} + */ + public int numDeleteTerms() { + // public for FlushPolicy + return pendingDeletes.numTermDeletes.get(); + } + + /** + * Returns the number of RAM resident documents in this {@link DocumentsWriterPerThread} + */ + public int getNumDocsInRAM() { + // public for FlushPolicy + return numDocsInRAM; + } + + SegmentCodecs getCodec() { + return flushState.segmentCodecs; + } + + /** Reset after a flush */ + private void doAfterFlush() throws IOException { + segment = null; + consumer.doAfterFlush(); + fieldInfos = new FieldInfos(fieldInfos); + parent.subtractFlushedNumDocs(numDocsInRAM); + numDocsInRAM = 0; + } + + /** + * Prepares this DWPT for flushing. This method will freeze and return the + * {@link DocumentsWriterDeleteQueue}s global buffer and apply all pending + * deletes to this DWPT. + */ + FrozenBufferedDeletes prepareFlush() { + assert numDocsInRAM > 0; + final FrozenBufferedDeletes globalDeletes = deleteQueue.freezeGlobalBuffer(deleteSlice); + /* deleteSlice can possibly be null if we have hit non-aborting exceptions during indexing and never succeeded + adding a document. 
*/ + if (deleteSlice != null) { + // apply all deletes before we flush and release the delete slice + deleteSlice.apply(pendingDeletes, numDocsInRAM); + assert deleteSlice.isEmpty(); + deleteSlice = null; + } + return globalDeletes; + } + + /** Flush all pending docs to a new segment */ + FlushedSegment flush() throws IOException { + assert numDocsInRAM > 0; + assert deleteSlice == null : "all deletes must be applied in prepareFlush"; + flushState = new SegmentWriteState(infoStream, directory, segment, fieldInfos, + numDocsInRAM, writer.getConfig().getTermIndexInterval(), + fieldInfos.buildSegmentCodecs(true), pendingDeletes); + final double startMBUsed = parent.flushControl.netBytes() / 1024. / 1024.; + // Apply delete-by-docID now (delete-byDocID only + // happens when an exception is hit processing that + // doc, eg if analyzer has some problem w/ the text): + if (pendingDeletes.docIDs.size() > 0) { + flushState.deletedDocs = new BitVector(numDocsInRAM); + for(int delDocID : pendingDeletes.docIDs) { + flushState.deletedDocs.set(delDocID); + } + pendingDeletes.bytesUsed.addAndGet(-pendingDeletes.docIDs.size() * BufferedDeletes.BYTES_PER_DEL_DOCID); + pendingDeletes.docIDs.clear(); + } + + if (infoStream != null) { + message("flush postings as segment " + flushState.segmentName + " numDocs=" + numDocsInRAM); + } + + if (aborting) { + if (infoStream != null) { + message("flush: skip because aborting is set"); + } + return null; + } + + boolean success = false; + + try { + + SegmentInfo newSegment = new SegmentInfo(segment, flushState.numDocs, directory, false, fieldInfos.hasProx(), flushState.segmentCodecs, false, fieldInfos); + consumer.flush(flushState); + pendingDeletes.terms.clear(); + newSegment.setHasVectors(flushState.hasVectors); + + if (infoStream != null) { + message("new segment has " + (flushState.deletedDocs == null ? 0 : flushState.deletedDocs.count()) + " deleted docs"); + message("new segment has " + (flushState.hasVectors ? "vectors" : "no vectors")); + message("flushedFiles=" + newSegment.files()); + message("flushed codecs=" + newSegment.getSegmentCodecs()); + } + flushedDocCount += flushState.numDocs; + + final BufferedDeletes segmentDeletes; + if (pendingDeletes.queries.isEmpty()) { + pendingDeletes.clear(); + segmentDeletes = null; + } else { + segmentDeletes = pendingDeletes; + pendingDeletes = new BufferedDeletes(false); + } + + if (infoStream != null) { + final double newSegmentSizeNoStore = newSegment.sizeInBytes(false)/1024./1024.; + final double newSegmentSize = newSegment.sizeInBytes(true)/1024./1024.; + message("flushed: segment=" + newSegment + + " ramUsed=" + nf.format(startMBUsed) + " MB" + + " newFlushedSize=" + nf.format(newSegmentSize) + " MB" + + " (" + nf.format(newSegmentSizeNoStore) + " MB w/o doc stores)" + + " docs/MB=" + nf.format(flushedDocCount / newSegmentSize) + + " new/old=" + nf.format(100.0 * newSegmentSizeNoStore / startMBUsed) + "%"); + } + doAfterFlush(); + success = true; + + return new FlushedSegment(newSegment, segmentDeletes, flushState.deletedDocs); + } finally { + if (!success) { + if (segment != null) { + synchronized(parent.indexWriter) { + parent.indexWriter.deleter.refresh(segment); + } + } + abort(); + } + } + } + + /** Get current segment name we are writing. 
*/ + String getSegment() { + return segment; + } + + long bytesUsed() { + return bytesUsed.get() + pendingDeletes.bytesUsed.get(); + } + + FieldInfos getFieldInfos() { + return fieldInfos; + } + + void message(String message) { + writer.message("DWPT: " + message); + } + + /* Initial chunks size of the shared byte[] blocks used to + store postings data */ + final static int BYTE_BLOCK_NOT_MASK = ~BYTE_BLOCK_MASK; + + /* if you increase this, you must fix field cache impl for + * getTerms/getTermsIndex requires <= 32768 */ + final static int MAX_TERM_LENGTH_UTF8 = BYTE_BLOCK_SIZE-2; + + /* Initial chunks size of the shared int[] blocks used to + store postings data */ + final static int INT_BLOCK_SHIFT = 13; + final static int INT_BLOCK_SIZE = 1 << INT_BLOCK_SHIFT; + final static int INT_BLOCK_MASK = INT_BLOCK_SIZE - 1; + + /* Allocate another int[] from the shared pool */ + int[] getIntBlock() { + int[] b = new int[INT_BLOCK_SIZE]; + bytesUsed.addAndGet(INT_BLOCK_SIZE*RamUsageEstimator.NUM_BYTES_INT); + return b; + } + + void recycleIntBlocks(int[][] blocks, int offset, int length) { + bytesUsed.addAndGet(-(length *(INT_BLOCK_SIZE*RamUsageEstimator.NUM_BYTES_INT))); + } + + final Allocator byteBlockAllocator = new DirectTrackingAllocator(); + + + private class DirectTrackingAllocator extends Allocator { + public DirectTrackingAllocator() { + this(BYTE_BLOCK_SIZE); + } + + public DirectTrackingAllocator(int blockSize) { + super(blockSize); + } + + public byte[] getByteBlock() { + bytesUsed.addAndGet(blockSize); + return new byte[blockSize]; + } + @Override + public void recycleByteBlocks(byte[][] blocks, int start, int end) { + bytesUsed.addAndGet(-((end-start)* blockSize)); + for (int i = start; i < end; i++) { + blocks[i] = null; + } + } + + }; + + PerDocWriteState newPerDocWriteState(int codecId) { + assert segment != null; + return new PerDocWriteState(infoStream, directory, segment, fieldInfos, bytesUsed, codecId); + } +} diff --git a/lucene/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java b/lucene/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java new file mode 100644 index 00000000000..0a03ea39248 --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java @@ -0,0 +1,268 @@ +package org.apache.lucene.index; +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
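The block-allocation methods at the end of DocumentsWriterPerThread above (getIntBlock, recycleIntBlocks and the DirectTrackingAllocator) charge every block they hand out, and credit every block they take back, against the DWPT's shared bytesUsed counter, which is the number the flush policy later reads. A stripped-down sketch of that tracking-allocator idea follows; the class name and block-size constant are illustrative, and like the allocator above it simply drops recycled blocks instead of pooling them.

import java.util.concurrent.atomic.AtomicLong;

public class TrackingAllocatorSketch {

  static final int BLOCK_SIZE = 1 << 15; // 32 KB blocks, in the spirit of BYTE_BLOCK_SIZE

  private final AtomicLong bytesUsed = new AtomicLong();

  byte[] getByteBlock() {
    bytesUsed.addAndGet(BLOCK_SIZE);   // charge the block before handing it out
    return new byte[BLOCK_SIZE];
  }

  void recycleByteBlocks(byte[][] blocks, int start, int end) {
    bytesUsed.addAndGet(-((long) (end - start)) * BLOCK_SIZE); // credit what comes back
    for (int i = start; i < end; i++) {
      blocks[i] = null; // no pooling in this sketch, just release the references
    }
  }

  long bytesUsed() {
    return bytesUsed.get();
  }

  public static void main(String[] args) {
    TrackingAllocatorSketch alloc = new TrackingAllocatorSketch();
    byte[][] blocks = { alloc.getByteBlock(), alloc.getByteBlock() };
    System.out.println("after alloc: " + alloc.bytesUsed());   // 65536
    alloc.recycleByteBlocks(blocks, 0, blocks.length);
    System.out.println("after recycle: " + alloc.bytesUsed()); // 0
  }
}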
+ */ + +import java.util.Iterator; +import java.util.concurrent.locks.ReentrantLock; + +import org.apache.lucene.document.Document; +import org.apache.lucene.index.FieldInfos.FieldNumberBiMap; +import org.apache.lucene.index.SegmentCodecs.SegmentCodecsBuilder; +import org.apache.lucene.index.codecs.CodecProvider; +import org.apache.lucene.util.SetOnce; + +/** + * {@link DocumentsWriterPerThreadPool} controls {@link ThreadState} instances + * and their thread assignments during indexing. Each {@link ThreadState} holds + * a reference to a {@link DocumentsWriterPerThread} that is once a + * {@link ThreadState} is obtained from the pool exclusively used for indexing a + * single document by the obtaining thread. Each indexing thread must obtain + * such a {@link ThreadState} to make progress. Depending on the + * {@link DocumentsWriterPerThreadPool} implementation {@link ThreadState} + * assignments might differ from document to document. + * + * Once a {@link DocumentsWriterPerThread} is selected for flush the thread pool + * is reusing the flushing {@link DocumentsWriterPerThread}s ThreadState with a + * new {@link DocumentsWriterPerThread} instance. + *
+ */ +public abstract class DocumentsWriterPerThreadPool { + /** The maximum number of simultaneous threads that may be + * indexing documents at once in IndexWriter; if more + * than this many threads arrive they will wait for + * others to finish. */ + public final static int DEFAULT_MAX_THREAD_STATES = 8; + + /** + * {@link ThreadState} references and guards a + * {@link DocumentsWriterPerThread} instance that is used during indexing to + * build a in-memory index segment. {@link ThreadState} also holds all flush + * related per-thread data controlled by {@link DocumentsWriterFlushControl}. + *+ * A {@link ThreadState}, its methods and members should only accessed by one + * thread a time. Users must acquire the lock via {@link ThreadState#lock()} + * and release the lock in a finally block via {@link ThreadState#unlock()} + * before accessing the state. + */ + @SuppressWarnings("serial") + public final static class ThreadState extends ReentrantLock { + // package private for FlushPolicy + DocumentsWriterPerThread perThread; + // write access guarded by DocumentsWriterFlushControl + volatile boolean flushPending = false; + // write access guarded by DocumentsWriterFlushControl + long bytesUsed = 0; + // guarded by Reentrant lock + private boolean isActive = true; + + ThreadState(DocumentsWriterPerThread perThread) { + this.perThread = perThread; + } + + /** + * Resets the internal {@link DocumentsWriterPerThread} with the given one. + * if the given DWPT is
null
this ThreadState is marked as inactive and should not be used + * for indexing anymore. + * @see #isActive() + */ + void resetWriter(DocumentsWriterPerThread perThread) { + assert this.isHeldByCurrentThread(); + if (perThread == null) { + isActive = false; + } + this.perThread = perThread; + this.bytesUsed = 0; + this.flushPending = false; + } + + /** + * Returnstrue
if this ThreadState is still open. This will + * only returnfalse
iff the DW has been closed and this + * ThreadState is already checked out for flush. + */ + boolean isActive() { + assert this.isHeldByCurrentThread(); + return isActive; + } + + /** + * Returns the number of currently active bytes in this ThreadState's + * {@link DocumentsWriterPerThread} + */ + public long getBytesUsedPerThread() { + assert this.isHeldByCurrentThread(); + // public for FlushPolicy + return bytesUsed; + } + + /** + * Returns this {@link ThreadState}s {@link DocumentsWriterPerThread} + */ + public DocumentsWriterPerThread getDocumentsWriterPerThread() { + assert this.isHeldByCurrentThread(); + // public for FlushPolicy + return perThread; + } + + /** + * Returnstrue
iff this {@link ThreadState} is marked as flush + * pending otherwisefalse
+ */ + public boolean isFlushPending() { + return flushPending; + } + } + + private final ThreadState[] perThreads; + private volatile int numThreadStatesActive; + private CodecProvider codecProvider; + private FieldNumberBiMap globalFieldMap; + private final SetOncedocumentsWriter = new SetOnce (); + + /** + * Creates a new {@link DocumentsWriterPerThreadPool} with max. + * {@link #DEFAULT_MAX_THREAD_STATES} thread states. + */ + public DocumentsWriterPerThreadPool() { + this(DEFAULT_MAX_THREAD_STATES); + } + + public DocumentsWriterPerThreadPool(int maxNumPerThreads) { + maxNumPerThreads = (maxNumPerThreads < 1) ? DEFAULT_MAX_THREAD_STATES : maxNumPerThreads; + perThreads = new ThreadState[maxNumPerThreads]; + numThreadStatesActive = 0; + } + + public void initialize(DocumentsWriter documentsWriter, FieldNumberBiMap globalFieldMap, IndexWriterConfig config) { + this.documentsWriter.set(documentsWriter); // thread pool is bound to DW + final CodecProvider codecs = config.getCodecProvider(); + this.codecProvider = codecs; + this.globalFieldMap = globalFieldMap; + for (int i = 0; i < perThreads.length; i++) { + final FieldInfos infos = globalFieldMap.newFieldInfos(SegmentCodecsBuilder.create(codecs)); + perThreads[i] = new ThreadState(new DocumentsWriterPerThread(documentsWriter.directory, documentsWriter, infos, documentsWriter.chain)); + } + } + + /** + * Returns the max number of {@link ThreadState} instances available in this + * {@link DocumentsWriterPerThreadPool} + */ + public int getMaxThreadStates() { + return perThreads.length; + } + + /** + * Returns a new {@link ThreadState} iff any new state is available otherwise + * null
. + *+ * NOTE: the returned {@link ThreadState} is already locked iff non- + *
null
. + * + * @return a new {@link ThreadState} iff any new state is available otherwise + *null
+ */ + public synchronized ThreadState newThreadState() { + if (numThreadStatesActive < perThreads.length) { + final ThreadState threadState = perThreads[numThreadStatesActive]; + threadState.lock(); // lock so nobody else will get this ThreadState + numThreadStatesActive++; // increment will publish the ThreadState + threadState.perThread.initialize(); + return threadState; + } + return null; + } + + protected DocumentsWriterPerThread replaceForFlush(ThreadState threadState, boolean closed) { + assert threadState.isHeldByCurrentThread(); + final DocumentsWriterPerThread dwpt = threadState.perThread; + if (!closed) { + final FieldInfos infos = globalFieldMap.newFieldInfos(SegmentCodecsBuilder.create(codecProvider)); + final DocumentsWriterPerThread newDwpt = new DocumentsWriterPerThread(dwpt, infos); + newDwpt.initialize(); + threadState.resetWriter(newDwpt); + } else { + threadState.resetWriter(null); + } + return dwpt; + } + + public void recycle(DocumentsWriterPerThread dwpt) { + // don't recycle DWPT by default + } + + public abstract ThreadState getAndLock(Thread requestingThread, DocumentsWriter documentsWriter, Document doc); + + /** + * Returns an iterator providing access to all {@link ThreadState} + * instances. + */ + // TODO: new Iterator per indexed doc is overkill...? + public IteratorgetAllPerThreadsIterator() { + return getPerThreadsIterator(this.perThreads.length); + } + + /** + * Returns an iterator providing access to all active {@link ThreadState} + * instances. + * + * Note: The returned iterator will only iterator + * {@link ThreadState}s that are active at the point in time when this method + * has been called. + * + */ + // TODO: new Iterator per indexed doc is overkill...? + public Iterator
getActivePerThreadsIterator() { + return getPerThreadsIterator(numThreadStatesActive); + } + + private Iterator getPerThreadsIterator(final int upto) { + return new Iterator () { + int i = 0; + + public boolean hasNext() { + return i < upto; + } + + public ThreadState next() { + return perThreads[i++]; + } + + public void remove() { + throw new UnsupportedOperationException("remove() not supported."); + } + }; + } + + /** + * Returns the ThreadState with the minimum estimated number of threads + * waiting to acquire its lock or null
if no {@link ThreadState} + * is yet visible to the calling thread. + */ + protected ThreadState minContendedThreadState() { + ThreadState minThreadState = null; + // TODO: new Iterator per indexed doc is overkill...? + final Iteratorit = getActivePerThreadsIterator(); + while (it.hasNext()) { + final ThreadState state = it.next(); + if (minThreadState == null || state.getQueueLength() < minThreadState.getQueueLength()) { + minThreadState = state; + } + } + return minThreadState; + } +} diff --git a/lucene/src/java/org/apache/lucene/index/DocumentsWriterThreadState.java b/lucene/src/java/org/apache/lucene/index/DocumentsWriterThreadState.java deleted file mode 100644 index 611098a64bc..00000000000 --- a/lucene/src/java/org/apache/lucene/index/DocumentsWriterThreadState.java +++ /dev/null @@ -1,47 +0,0 @@ -package org.apache.lucene.index; - -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.IOException; - -/** Used by DocumentsWriter to maintain per-thread state. - * We keep a separate Posting hash and other state for each - * thread and then merge postings hashes from all threads - * when writing the segment. 
*/ -final class DocumentsWriterThreadState { - - boolean isIdle = true; // false if this is currently in use by a thread - int numThreads = 1; // Number of threads that share this instance - final DocConsumerPerThread consumer; - final DocumentsWriter.DocState docState; - - final DocumentsWriter docWriter; - - public DocumentsWriterThreadState(DocumentsWriter docWriter) throws IOException { - this.docWriter = docWriter; - docState = new DocumentsWriter.DocState(); - docState.infoStream = docWriter.infoStream; - docState.similarityProvider = docWriter.similarityProvider; - docState.docWriter = docWriter; - consumer = docWriter.consumer.addThread(this); - } - - void doAfterFlush() { - numThreads = 0; - } -} diff --git a/lucene/src/java/org/apache/lucene/index/FieldInfo.java b/lucene/src/java/org/apache/lucene/index/FieldInfo.java index 144e0e1e3cb..3aba2850b42 100644 --- a/lucene/src/java/org/apache/lucene/index/FieldInfo.java +++ b/lucene/src/java/org/apache/lucene/index/FieldInfo.java @@ -22,9 +22,11 @@ import org.apache.lucene.index.values.Type; /** @lucene.experimental */ public final class FieldInfo { public static final int UNASSIGNED_CODEC_ID = -1; - public String name; + + public final String name; + public final int number; + public boolean isIndexed; - public int number; Type docValues; @@ -61,6 +63,7 @@ public final class FieldInfo { this.omitNorms = false; this.omitTermFreqAndPositions = false; } + assert !omitTermFreqAndPositions || !storePayloads; } void setCodecId(int codecId) { @@ -83,6 +86,7 @@ public final class FieldInfo { // should only be called by FieldInfos#addOrUpdate void update(boolean isIndexed, boolean storeTermVector, boolean storePositionWithTermVector, boolean storeOffsetWithTermVector, boolean omitNorms, boolean storePayloads, boolean omitTermFreqAndPositions) { + if (this.isIndexed != isIndexed) { this.isIndexed = true; // once indexed, always index } @@ -104,8 +108,10 @@ public final class FieldInfo { } if (this.omitTermFreqAndPositions != omitTermFreqAndPositions) { this.omitTermFreqAndPositions = true; // if one require omitTermFreqAndPositions at least once, it remains off for life + this.storePayloads = false; } } + assert !this.omitTermFreqAndPositions || !this.storePayloads; } void setDocValues(Type v) { diff --git a/lucene/src/java/org/apache/lucene/index/FieldInfos.java b/lucene/src/java/org/apache/lucene/index/FieldInfos.java index 33124c772b5..c62649a6bf1 100644 --- a/lucene/src/java/org/apache/lucene/index/FieldInfos.java +++ b/lucene/src/java/org/apache/lucene/index/FieldInfos.java @@ -28,6 +28,7 @@ import java.util.SortedMap; import java.util.TreeMap; import java.util.Map.Entry; +import org.apache.lucene.index.SegmentCodecs; // Required for Java 1.5 javadocs import org.apache.lucene.index.SegmentCodecs.SegmentCodecsBuilder; import org.apache.lucene.index.codecs.CodecProvider; import org.apache.lucene.index.values.Type; @@ -187,7 +188,7 @@ public final class FieldInfos implements Iterable { } // used by assert - boolean containsConsistent(Integer number, String name) { + synchronized boolean containsConsistent(Integer number, String name) { return name.equals(numberToName.get(number)) && number.equals(nameToNumber.get(name)); } @@ -222,12 +223,13 @@ public final class FieldInfos implements Iterable { /** * Creates a new {@link FieldInfos} instance with a private - * {@link FieldNumberBiMap} and a default {@link SegmentCodecsBuilder} + * {@link org.apache.lucene.index.FieldInfos.FieldNumberBiMap} and a default {@link SegmentCodecsBuilder} * 
initialized with {@link CodecProvider#getDefault()}. * * Note: this ctor should not be used during indexing use * {@link FieldInfos#FieldInfos(FieldInfos)} or - * {@link FieldInfos#FieldInfos(FieldNumberBiMap)} instead. + * {@link FieldInfos#FieldInfos(FieldNumberBiMap,org.apache.lucene.index.SegmentCodecs.SegmentCodecsBuilder)} + * instead. */ public FieldInfos() { this(new FieldNumberBiMap(), SegmentCodecsBuilder.create(CodecProvider.getDefault())); @@ -556,9 +558,10 @@ public final class FieldInfos implements Iterable
{ /** * Returns true
iff this instance is not backed by a - * {@link FieldNumberBiMap}. Instances read from a directory via + * {@link org.apache.lucene.index.FieldInfos.FieldNumberBiMap}. Instances read from a directory via * {@link FieldInfos#FieldInfos(Directory, String)} will always be read-only - * since no {@link FieldNumberBiMap} is supplied, otherwisefalse
. + * since no {@link org.apache.lucene.index.FieldInfos.FieldNumberBiMap} is supplied, otherwise + *false
. */ public final boolean isReadOnly() { return globalFieldNumbers == null; @@ -568,6 +571,7 @@ public final class FieldInfos implements Iterable{ output.writeVInt(FORMAT_CURRENT); output.writeVInt(size()); for (FieldInfo fi : this) { + assert !fi.omitTermFreqAndPositions || !fi.storePayloads; byte bits = 0x0; if (fi.isIndexed) bits |= IS_INDEXED; if (fi.storeTermVector) bits |= STORE_TERMVECTOR; @@ -647,6 +651,14 @@ public final class FieldInfos implements Iterable { boolean omitNorms = (bits & OMIT_NORMS) != 0; boolean storePayloads = (bits & STORE_PAYLOADS) != 0; boolean omitTermFreqAndPositions = (bits & OMIT_TERM_FREQ_AND_POSITIONS) != 0; + + // LUCENE-3027: past indices were able to write + // storePayloads=true when omitTFAP is also true, + // which is invalid. We correct that, here: + if (omitTermFreqAndPositions) { + storePayloads = false; + } + Type docValuesType = null; if (format <= FORMAT_INDEX_VALUES) { final byte b = input.readByte(); diff --git a/lucene/src/java/org/apache/lucene/index/Fields.java b/lucene/src/java/org/apache/lucene/index/Fields.java index 20e7176f4ec..01b7f0d50ca 100644 --- a/lucene/src/java/org/apache/lucene/index/Fields.java +++ b/lucene/src/java/org/apache/lucene/index/Fields.java @@ -19,8 +19,6 @@ package org.apache.lucene.index; import java.io.IOException; -import org.apache.lucene.index.values.DocValues; - /** Flex API for access to fields and terms * @lucene.experimental */ @@ -34,15 +32,5 @@ public abstract class Fields { * null if the field does not exist. */ public abstract Terms terms(String field) throws IOException; - /** - * Returns {@link DocValues} for the current field. - * - * @param field the field name - * @return the {@link DocValues} for this field or null
if not - * applicable. - * @throws IOException - */ - public abstract DocValues docValues(String field) throws IOException; - public final static Fields[] EMPTY_ARRAY = new Fields[0]; } diff --git a/lucene/src/java/org/apache/lucene/index/FieldsEnum.java b/lucene/src/java/org/apache/lucene/index/FieldsEnum.java index 290cd107cfb..51ffa5f04b9 100644 --- a/lucene/src/java/org/apache/lucene/index/FieldsEnum.java +++ b/lucene/src/java/org/apache/lucene/index/FieldsEnum.java @@ -58,16 +58,6 @@ public abstract class FieldsEnum { * will not return null. */ public abstract TermsEnum terms() throws IOException; - /** - * Returns {@link DocValues} for the current field. - * - * @return the {@link DocValues} for this field ornull
if not - * applicable. - * @throws IOException - */ - public abstract DocValues docValues() throws IOException; - - public final static FieldsEnum[] EMPTY_ARRAY = new FieldsEnum[0]; /** Provides zero fields */ @@ -82,10 +72,5 @@ public abstract class FieldsEnum { public TermsEnum terms() { throw new IllegalStateException("this method should never be called"); } - - @Override - public DocValues docValues() throws IOException { - throw new IllegalStateException("this method should never be called"); - } }; } diff --git a/lucene/src/java/org/apache/lucene/index/FieldsWriter.java b/lucene/src/java/org/apache/lucene/index/FieldsWriter.java index f694bb4342c..303aa912bc3 100644 --- a/lucene/src/java/org/apache/lucene/index/FieldsWriter.java +++ b/lucene/src/java/org/apache/lucene/index/FieldsWriter.java @@ -2,13 +2,13 @@ package org.apache.lucene.index; /** * Copyright 2004 The Apache Software Foundation - * + * * Licensed under the Apache License, Version 2.0 (the "License"); you may not * use this file except in compliance with the License. You may obtain a copy of * the License at - * + * * http://www.apache.org/licenses/LICENSE-2.0 - * + * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the @@ -22,15 +22,14 @@ import java.util.List; import org.apache.lucene.document.Document; import org.apache.lucene.document.Fieldable; import org.apache.lucene.store.Directory; -import org.apache.lucene.store.RAMOutputStream; -import org.apache.lucene.store.IndexOutput; import org.apache.lucene.store.IndexInput; +import org.apache.lucene.store.IndexOutput; import org.apache.lucene.util.IOUtils; final class FieldsWriter { static final byte FIELD_IS_TOKENIZED = 0x1; static final byte FIELD_IS_BINARY = 0x2; - + // Lucene 3.0: Removal of compressed fields static final int FORMAT_LUCENE_3_0_NO_COMPRESSED_FIELDS = 2; @@ -38,7 +37,7 @@ final class FieldsWriter { // than the current one, and always change this if you // switch to a new format! static final int FORMAT_CURRENT = FORMAT_LUCENE_3_0_NO_COMPRESSED_FIELDS; - + // when removing support for old versions, leave the last supported version here static final int FORMAT_MINIMUM = FORMAT_LUCENE_3_0_NO_COMPRESSED_FIELDS; @@ -83,10 +82,9 @@ final class FieldsWriter { // and adds a new entry for this document into the index // stream. This assumes the buffer was already written // in the correct fields format. 
- void flushDocument(int numStoredFields, RAMOutputStream buffer) throws IOException { + void startDocument(int numStoredFields) throws IOException { indexStream.writeLong(fieldsStream.getFilePointer()); fieldsStream.writeVInt(numStoredFields); - buffer.writeTo(fieldsStream); } void skipDocument() throws IOException { @@ -121,8 +119,8 @@ final class FieldsWriter { } } - final void writeField(FieldInfo fi, Fieldable field) throws IOException { - fieldsStream.writeVInt(fi.number); + final void writeField(int fieldNumber, Fieldable field) throws IOException { + fieldsStream.writeVInt(fieldNumber); byte bits = 0; if (field.isTokenized()) bits |= FieldsWriter.FIELD_IS_TOKENIZED; @@ -175,10 +173,9 @@ final class FieldsWriter { fieldsStream.writeVInt(storedCount); - for (Fieldable field : fields) { if (field.isStored()) - writeField(fieldInfos.fieldInfo(field.name()), field); + writeField(fieldInfos.fieldNumber(field.name()), field); } } } diff --git a/lucene/src/java/org/apache/lucene/index/FilterIndexReader.java b/lucene/src/java/org/apache/lucene/index/FilterIndexReader.java index 4dc7cfee89e..8d17f534db4 100644 --- a/lucene/src/java/org/apache/lucene/index/FilterIndexReader.java +++ b/lucene/src/java/org/apache/lucene/index/FilterIndexReader.java @@ -19,9 +19,7 @@ package org.apache.lucene.index; import org.apache.lucene.document.Document; import org.apache.lucene.document.FieldSelector; -import org.apache.lucene.index.IndexReader.ReaderContext; -import org.apache.lucene.index.values.DocValues; -import org.apache.lucene.index.values.DocValuesEnum; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.store.Directory; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; @@ -62,11 +60,6 @@ public class FilterIndexReader extends IndexReader { public Terms terms(String field) throws IOException { return in.terms(field); } - - @Override - public DocValues docValues(String field) throws IOException { - return in.docValues(field); - } } /** Base class for filtering {@link Terms} @@ -130,11 +123,6 @@ public class FilterIndexReader extends IndexReader { public TermsEnum terms() throws IOException { return in.terms(); } - - @Override - public DocValues docValues() throws IOException { - return in.docValues(); - } } /** Base class for filtering {@link TermsEnum} implementations. */ @@ -475,4 +463,9 @@ public class FilterIndexReader extends IndexReader { super.removeReaderFinishedListener(listener); in.removeReaderFinishedListener(listener); } + + @Override + public PerDocValues perDocValues() throws IOException { + return in.perDocValues(); + } } diff --git a/lucene/src/java/org/apache/lucene/index/FlushByRamOrCountsPolicy.java b/lucene/src/java/org/apache/lucene/index/FlushByRamOrCountsPolicy.java new file mode 100644 index 00000000000..81e3676246b --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/FlushByRamOrCountsPolicy.java @@ -0,0 +1,128 @@ +package org.apache.lucene.index; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import org.apache.lucene.index.DocumentsWriterPerThreadPool.ThreadState; + +/** + * Default {@link FlushPolicy} implementation that flushes based on RAM used, + * document count and number of buffered deletes depending on the IndexWriter's + * {@link IndexWriterConfig}. + * + *+ *
+ * All {@link IndexWriterConfig} settings are used to mark + * {@link DocumentsWriterPerThread} as flush pending during indexing with + * respect to their live updates. + *- {@link #onDelete(DocumentsWriterFlushControl, DocumentsWriterPerThreadPool.ThreadState)} - flushes + * based on the global number of buffered delete terms iff + * {@link IndexWriterConfig#getMaxBufferedDeleteTerms()} is enabled
+ *- {@link #onInsert(DocumentsWriterFlushControl, DocumentsWriterPerThreadPool.ThreadState)} - flushes + * either on the number of documents per {@link DocumentsWriterPerThread} ( + * {@link DocumentsWriterPerThread#getNumDocsInRAM()}) or on the global active + * memory consumption in the current indexing session iff + * {@link IndexWriterConfig#getMaxBufferedDocs()} or + * {@link IndexWriterConfig#getRAMBufferSizeMB()} is enabled respectively
+ *- {@link #onUpdate(DocumentsWriterFlushControl, DocumentsWriterPerThreadPool.ThreadState)} - calls + * {@link #onInsert(DocumentsWriterFlushControl, DocumentsWriterPerThreadPool.ThreadState)} and + * {@link #onDelete(DocumentsWriterFlushControl, DocumentsWriterPerThreadPool.ThreadState)} in order
+ *+ * If {@link IndexWriterConfig#setRAMBufferSizeMB(double)} is enabled, the + * largest ram consuming {@link DocumentsWriterPerThread} will be marked as + * pending iff the global active RAM consumption is >= the configured max RAM + * buffer. + */ +public class FlushByRamOrCountsPolicy extends FlushPolicy { + + @Override + public void onDelete(DocumentsWriterFlushControl control, ThreadState state) { + if (flushOnDeleteTerms()) { + // Flush this state by num del terms + final int maxBufferedDeleteTerms = indexWriterConfig + .getMaxBufferedDeleteTerms(); + if (control.getNumGlobalTermDeletes() >= maxBufferedDeleteTerms) { + control.setApplyAllDeletes(); + } + } + final DocumentsWriter writer = this.writer.get(); + // If deletes alone are consuming > 1/2 our RAM + // buffer, force them all to apply now. This is to + // prevent too-frequent flushing of a long tail of + // tiny segments: + if ((flushOnRAM() && + writer.deleteQueue.bytesUsed() > (1024*1024*indexWriterConfig.getRAMBufferSizeMB()/2))) { + control.setApplyAllDeletes(); + if (writer.infoStream != null) { + writer.message("force apply deletes bytesUsed=" + writer.deleteQueue.bytesUsed() + " vs ramBuffer=" + (1024*1024*indexWriterConfig.getRAMBufferSizeMB())); + } + } + } + + @Override + public void onInsert(DocumentsWriterFlushControl control, ThreadState state) { + if (flushOnDocCount() + && state.perThread.getNumDocsInRAM() >= indexWriterConfig + .getMaxBufferedDocs()) { + // Flush this state by num docs + control.setFlushPending(state); + } else if (flushOnRAM()) {// flush by RAM + final long limit = (long) (indexWriterConfig.getRAMBufferSizeMB() * 1024.d * 1024.d); + final long totalRam = control.activeBytes(); + if (totalRam >= limit) { + markLargestWriterPending(control, state, totalRam); + } + } + } + + /** + * Marks the most ram consuming active {@link DocumentsWriterPerThread} flush + * pending + */ + protected void markLargestWriterPending(DocumentsWriterFlushControl control, + ThreadState perThreadState, final long currentBytesPerThread) { + control + .setFlushPending(findLargestNonPendingWriter(control, perThreadState)); + } + + /** + * Returns
              <code>true</code> if this {@link FlushPolicy} flushes on
              +   * {@link IndexWriterConfig#getMaxBufferedDocs()}, otherwise
              +   * <code>false</code>
              
. + */ + protected boolean flushOnDocCount() { + return indexWriterConfig.getMaxBufferedDocs() != IndexWriterConfig.DISABLE_AUTO_FLUSH; + } + + /** + * Returnstrue
if this {@link FlushPolicy} flushes on + * {@link IndexWriterConfig#getMaxBufferedDeleteTerms()}, otherwise + *false
. + */ + protected boolean flushOnDeleteTerms() { + return indexWriterConfig.getMaxBufferedDeleteTerms() != IndexWriterConfig.DISABLE_AUTO_FLUSH; + } + + /** + * Returnstrue
if this {@link FlushPolicy} flushes on + * {@link IndexWriterConfig#getRAMBufferSizeMB()}, otherwise + *false
. + */ + protected boolean flushOnRAM() { + return indexWriterConfig.getRAMBufferSizeMB() != IndexWriterConfig.DISABLE_AUTO_FLUSH; + } +} diff --git a/lucene/src/java/org/apache/lucene/index/FlushPolicy.java b/lucene/src/java/org/apache/lucene/index/FlushPolicy.java new file mode 100644 index 00000000000..13f8a45e847 --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/FlushPolicy.java @@ -0,0 +1,131 @@ +package org.apache.lucene.index; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +import java.util.Iterator; + +import org.apache.lucene.index.DocumentsWriterPerThreadPool.ThreadState; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.SetOnce; + +/** + * {@link FlushPolicy} controls when segments are flushed from a RAM resident + * internal data-structure to the {@link IndexWriter}s {@link Directory}. + *+ * Segments are traditionally flushed by: + *
+ *
+ * + * The {@link IndexWriter} consults a provided {@link FlushPolicy} to control the + * flushing process. The policy is informed for each added or + * updated document as well as for each delete term. Based on the + * {@link FlushPolicy}, the information provided via {@link ThreadState} and + * {@link DocumentsWriterFlushControl}, the {@link FlushPolicy} decides if a + * {@link DocumentsWriterPerThread} needs flushing and mark it as + * flush-pending via + * {@link DocumentsWriterFlushControl#setFlushPending(DocumentsWriterPerThreadPool.ThreadState)}. + * + * @see ThreadState + * @see DocumentsWriterFlushControl + * @see DocumentsWriterPerThread + * @see IndexWriterConfig#setFlushPolicy(FlushPolicy) + */ +public abstract class FlushPolicy { + protected final SetOnce- RAM consumption - configured via + * {@link IndexWriterConfig#setRAMBufferSizeMB(double)}
+ *- Number of RAM resident documents - configured via + * {@link IndexWriterConfig#setMaxBufferedDocs(int)}
+ *- Number of buffered delete terms/queries - configured via + * {@link IndexWriterConfig#setMaxBufferedDeleteTerms(int)}
+ *writer = new SetOnce (); + protected IndexWriterConfig indexWriterConfig; + + /** + * Called for each delete term. If this is a delete triggered due to an update + * the given {@link ThreadState} is non-null. + * + * Note: This method is called synchronized on the given + * {@link DocumentsWriterFlushControl} and it is guaranteed that the calling + * thread holds the lock on the given {@link ThreadState} + */ + public abstract void onDelete(DocumentsWriterFlushControl control, + ThreadState state); + + /** + * Called for each document update on the given {@link ThreadState}'s + * {@link DocumentsWriterPerThread}. + *
+ * Note: This method is called synchronized on the given + * {@link DocumentsWriterFlushControl} and it is guaranteed that the calling + * thread holds the lock on the given {@link ThreadState} + */ + public void onUpdate(DocumentsWriterFlushControl control, ThreadState state) { + onInsert(control, state); + if (!state.flushPending) { + onDelete(control, state); + } + } + + /** + * Called for each document addition on the given {@link ThreadState}s + * {@link DocumentsWriterPerThread}. + *
+ * Note: This method is synchronized by the given + * {@link DocumentsWriterFlushControl} and it is guaranteed that the calling + * thread holds the lock on the given {@link ThreadState} + */ + public abstract void onInsert(DocumentsWriterFlushControl control, + ThreadState state); + + /** + * Called by DocumentsWriter to initialize the FlushPolicy + */ + protected synchronized void init(DocumentsWriter docsWriter) { + writer.set(docsWriter); + indexWriterConfig = docsWriter.indexWriter.getConfig(); + } + + /** + * Returns the current most RAM consuming non-pending {@link ThreadState} with + * at least one indexed document. + *
+ * This method will never return
              <code>null</code>
              
+ */ + protected ThreadState findLargestNonPendingWriter( + DocumentsWriterFlushControl control, ThreadState perThreadState) { + assert perThreadState.perThread.getNumDocsInRAM() > 0; + long maxRamSoFar = perThreadState.bytesUsed; + // the dwpt which needs to be flushed eventually + ThreadState maxRamUsingThreadState = perThreadState; + assert !perThreadState.flushPending : "DWPT should have flushed"; + IteratoractivePerThreadsIterator = control.allActiveThreads(); + while (activePerThreadsIterator.hasNext()) { + ThreadState next = activePerThreadsIterator.next(); + if (!next.flushPending) { + final long nextRam = next.bytesUsed; + if (nextRam > maxRamSoFar && next.perThread.getNumDocsInRAM() > 0) { + maxRamSoFar = nextRam; + maxRamUsingThreadState = next; + } + } + } + assert writer.get().message( + "set largest ram consuming thread pending on lower watermark"); + return maxRamUsingThreadState; + } + +} diff --git a/lucene/src/java/org/apache/lucene/index/FreqProxFieldMergeState.java b/lucene/src/java/org/apache/lucene/index/FreqProxFieldMergeState.java deleted file mode 100644 index de2a8cce677..00000000000 --- a/lucene/src/java/org/apache/lucene/index/FreqProxFieldMergeState.java +++ /dev/null @@ -1,115 +0,0 @@ -package org.apache.lucene.index; - -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.IOException; -import java.util.Comparator; - -import org.apache.lucene.util.ByteBlockPool; -import org.apache.lucene.util.BytesRef; - -import org.apache.lucene.index.FreqProxTermsWriterPerField.FreqProxPostingsArray; - -// TODO FI: some of this is "generic" to TermsHash* so we -// should factor it out so other consumers don't have to -// duplicate this code - -/** Used by DocumentsWriter to merge the postings from - * multiple ThreadStates when creating a segment */ -final class FreqProxFieldMergeState { - - final FreqProxTermsWriterPerField field; - final int numPostings; - private final ByteBlockPool bytePool; - final int[] termIDs; - final FreqProxPostingsArray postings; - int currentTermID; - - final BytesRef text = new BytesRef(); - - private int postingUpto = -1; - - final ByteSliceReader freq = new ByteSliceReader(); - final ByteSliceReader prox = new ByteSliceReader(); - - int docID; - int termFreq; - - public FreqProxFieldMergeState(FreqProxTermsWriterPerField field, Comparator termComp) { - this.field = field; - this.numPostings = field.termsHashPerField.bytesHash.size(); - this.bytePool = field.perThread.termsHashPerThread.bytePool; - this.termIDs = field.termsHashPerField.sortPostings(termComp); - this.postings = (FreqProxPostingsArray) field.termsHashPerField.postingsArray; - } - - boolean nextTerm() throws IOException { - postingUpto++; - if (postingUpto == numPostings) { - return false; - } - - currentTermID = termIDs[postingUpto]; - docID = 0; - - // Get BytesRef - final int textStart = postings.textStarts[currentTermID]; - bytePool.setBytesRef(text, textStart); - - field.termsHashPerField.initReader(freq, currentTermID, 0); - if (!field.fieldInfo.omitTermFreqAndPositions) { - field.termsHashPerField.initReader(prox, currentTermID, 1); - } - - // Should always be true - boolean result = nextDoc(); - assert result; - - return true; - } - - public boolean nextDoc() throws IOException { - if (freq.eof()) { - if (postings.lastDocCodes[currentTermID] != -1) { - // Return last doc - docID = postings.lastDocIDs[currentTermID]; - if (!field.omitTermFreqAndPositions) - termFreq = postings.docFreqs[currentTermID]; - postings.lastDocCodes[currentTermID] = -1; - return true; - } else - // EOF - return false; - } - - final int code = freq.readVInt(); - if (field.omitTermFreqAndPositions) - docID += code; - else { - docID += code >>> 1; - if ((code & 1) != 0) - termFreq = 1; - else - termFreq = freq.readVInt(); - } - - assert docID != postings.lastDocIDs[currentTermID]; - - return true; - } -} diff --git a/lucene/src/java/org/apache/lucene/index/FreqProxTermsWriter.java b/lucene/src/java/org/apache/lucene/index/FreqProxTermsWriter.java index d342cb47249..0622fc672f8 100644 --- a/lucene/src/java/org/apache/lucene/index/FreqProxTermsWriter.java +++ b/lucene/src/java/org/apache/lucene/index/FreqProxTermsWriter.java @@ -19,55 +19,35 @@ package org.apache.lucene.index; import java.io.IOException; import java.util.ArrayList; -import java.util.Collection; -import java.util.Comparator; import java.util.List; import java.util.Map; import org.apache.lucene.index.codecs.FieldsConsumer; -import org.apache.lucene.index.codecs.PostingsConsumer; -import org.apache.lucene.index.codecs.TermStats; -import org.apache.lucene.index.codecs.TermsConsumer; -import org.apache.lucene.util.BitVector; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.CollectionUtil; final class FreqProxTermsWriter extends TermsHashConsumer { - @Override - public TermsHashConsumerPerThread 
addThread(TermsHashPerThread perThread) { - return new FreqProxTermsWriterPerThread(perThread); - } - @Override void abort() {} - private int flushedDocCount; - // TODO: would be nice to factor out more of this, eg the // FreqProxFieldMergeState, and code to visit all Fields // under the same FieldInfo together, up into TermsHash*. // Other writers would presumably share alot of this... @Override - public void flush(Map > threadsAndFields, final SegmentWriteState state) throws IOException { + public void flush(Map fieldsToFlush, final SegmentWriteState state) throws IOException { // Gather all FieldData's that have postings, across all // ThreadStates List allFields = new ArrayList (); - - flushedDocCount = state.numDocs; - for (Map.Entry > entry : threadsAndFields.entrySet()) { - - Collection fields = entry.getValue(); - - - for (final TermsHashConsumerPerField i : fields) { - final FreqProxTermsWriterPerField perField = (FreqProxTermsWriterPerField) i; - if (perField.termsHashPerField.bytesHash.size() > 0) + for (TermsHashConsumerPerField f : fieldsToFlush.values()) { + final FreqProxTermsWriterPerField perField = (FreqProxTermsWriterPerField) f; + if (perField.termsHashPerField.bytesHash.size() > 0) { allFields.add(perField); - } + } } final int numAllFields = allFields.size(); @@ -77,6 +57,8 @@ final class FreqProxTermsWriter extends TermsHashConsumer { final FieldsConsumer consumer = state.segmentCodecs.codec().fieldsConsumer(state); + TermsHash termsHash = null; + /* Current writer chain: FieldsConsumer @@ -89,255 +71,48 @@ final class FreqProxTermsWriter extends TermsHashConsumer { -> IMPL: FormatPostingsPositionsWriter */ - int start = 0; - while(start < numAllFields) { - final FieldInfo fieldInfo = allFields.get(start).fieldInfo; - final String fieldName = fieldInfo.name; + for (int fieldNumber = 0; fieldNumber < numAllFields; fieldNumber++) { + final FieldInfo fieldInfo = allFields.get(fieldNumber).fieldInfo; - int end = start+1; - while(end < numAllFields && allFields.get(end).fieldInfo.name.equals(fieldName)) - end++; - - FreqProxTermsWriterPerField[] fields = new FreqProxTermsWriterPerField[end-start]; - for(int i=start;i > entry : threadsAndFields.entrySet()) { - FreqProxTermsWriterPerThread perThread = (FreqProxTermsWriterPerThread) entry.getKey(); - perThread.termsHashPerThread.reset(true); + if (termsHash != null) { + termsHash.reset(); } consumer.close(); } BytesRef payload; - /* Walk through all unique text tokens (Posting - * instances) found in this field and serialize them - * into a single RAM segment. 
*/ - void appendPostings(String fieldName, SegmentWriteState state, - FreqProxTermsWriterPerField[] fields, - FieldsConsumer consumer) - throws CorruptIndexException, IOException { + @Override + public TermsHashConsumerPerField addField(TermsHashPerField termsHashPerField, FieldInfo fieldInfo) { + return new FreqProxTermsWriterPerField(termsHashPerField, this, fieldInfo); + } - int numFields = fields.length; + @Override + void finishDocument(TermsHash termsHash) throws IOException { + } - final BytesRef text = new BytesRef(); - - final FreqProxFieldMergeState[] mergeStates = new FreqProxFieldMergeState[numFields]; - - final TermsConsumer termsConsumer = consumer.addField(fields[0].fieldInfo); - final Comparator termComp = termsConsumer.getComparator(); - - for(int i=0;i 0; if (omitTermFreqAndPositions) { @@ -169,7 +177,7 @@ final class FreqProxTermsWriterPerField extends TermsHashConsumerPerField implem } } } - + @Override ParallelPostingsArray createPostingsArray(int size) { return new FreqProxPostingsArray(size); @@ -212,7 +220,180 @@ final class FreqProxTermsWriterPerField extends TermsHashConsumerPerField implem return ParallelPostingsArray.BYTES_PER_POSTING + 4 * RamUsageEstimator.NUM_BYTES_INT; } } - + public void abort() {} + + BytesRef payload; + + /* Walk through all unique text tokens (Posting + * instances) found in this field and serialize them + * into a single RAM segment. */ + void flush(String fieldName, FieldsConsumer consumer, final SegmentWriteState state) + throws CorruptIndexException, IOException { + + final TermsConsumer termsConsumer = consumer.addField(fieldInfo); + final Comparator termComp = termsConsumer.getComparator(); + + final Term protoTerm = new Term(fieldName); + + final boolean currentFieldOmitTermFreqAndPositions = fieldInfo.omitTermFreqAndPositions; + + final Map segDeletes; + if (state.segDeletes != null && state.segDeletes.terms.size() > 0) { + segDeletes = state.segDeletes.terms; + } else { + segDeletes = null; + } + + final int[] termIDs = termsHashPerField.sortPostings(termComp); + final int numTerms = termsHashPerField.bytesHash.size(); + final BytesRef text = new BytesRef(); + final FreqProxPostingsArray postings = (FreqProxPostingsArray) termsHashPerField.postingsArray; + final ByteSliceReader freq = new ByteSliceReader(); + final ByteSliceReader prox = new ByteSliceReader(); + + long sumTotalTermFreq = 0; + for (int i = 0; i < numTerms; i++) { + final int termID = termIDs[i]; + // Get BytesRef + final int textStart = postings.textStarts[termID]; + termsHashPerField.bytePool.setBytesRef(text, textStart); + + termsHashPerField.initReader(freq, termID, 0); + if (!fieldInfo.omitTermFreqAndPositions) { + termsHashPerField.initReader(prox, termID, 1); + } + + // TODO: really TermsHashPerField should take over most + // of this loop, including merge sort of terms from + // multiple threads and interacting with the + // TermsConsumer, only calling out to us (passing us the + // DocsConsumer) to handle delivery of docs/positions + + final PostingsConsumer postingsConsumer = termsConsumer.startTerm(text); + + final int delDocLimit; + if (segDeletes != null) { + final Integer docIDUpto = segDeletes.get(protoTerm.createTerm(text)); + if (docIDUpto != null) { + delDocLimit = docIDUpto; + } else { + delDocLimit = 0; + } + } else { + delDocLimit = 0; + } + + // Now termStates has numToMerge FieldMergeStates + // which all share the same term. Now we must + // interleave the docID streams. 
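              // [Annotation, not part of this patch] The loop below decodes the in-RAM freq
              // byte slices: each entry starts with a VInt code whose upper bits hold the
              // doc-ID delta (code >>> 1) and whose low bit, when set, means the term
              // frequency is exactly 1; otherwise the frequency follows as its own VInt.
              // E.g. code 7 decodes to delta 3 / freq 1, while code 6 decodes to delta 3
              // with a freq VInt following. When omitTermFreqAndPositions is set, the code
              // is the raw doc-ID delta.
              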
+ int numDocs = 0; + long totTF = 0; + int docID = 0; + int termFreq = 0; + + while(true) { + if (freq.eof()) { + if (postings.lastDocCodes[termID] != -1) { + // Return last doc + docID = postings.lastDocIDs[termID]; + if (!omitTermFreqAndPositions) { + termFreq = postings.docFreqs[termID]; + } + postings.lastDocCodes[termID] = -1; + } else { + // EOF + break; + } + } else { + final int code = freq.readVInt(); + if (omitTermFreqAndPositions) { + docID += code; + } else { + docID += code >>> 1; + if ((code & 1) != 0) { + termFreq = 1; + } else { + termFreq = freq.readVInt(); + } + } + + assert docID != postings.lastDocIDs[termID]; + } + + numDocs++; + assert docID < state.numDocs: "doc=" + docID + " maxDoc=" + state.numDocs; + final int termDocFreq = termFreq; + + // NOTE: we could check here if the docID was + // deleted, and skip it. However, this is somewhat + // dangerous because it can yield non-deterministic + // behavior since we may see the docID before we see + // the term that caused it to be deleted. This + // would mean some (but not all) of its postings may + // make it into the index, which'd alter the docFreq + // for those terms. We could fix this by doing two + // passes, ie first sweep marks all del docs, and + // 2nd sweep does the real flush, but I suspect + // that'd add too much time to flush. + postingsConsumer.startDoc(docID, termDocFreq); + if (docID < delDocLimit) { + // Mark it deleted. TODO: we could also skip + // writing its postings; this would be + // deterministic (just for this Term's docs). + if (state.deletedDocs == null) { + state.deletedDocs = new BitVector(state.numDocs); + } + state.deletedDocs.set(docID); + } + + // Carefully copy over the prox + payload info, + // changing the format to match Lucene's segment + // format. + if (!currentFieldOmitTermFreqAndPositions) { + // omitTermFreqAndPositions == false so we do write positions & + // payload + int position = 0; + totTF += termDocFreq; + for(int j=0;j > 1; + + final int payloadLength; + final BytesRef thisPayload; + + if ((code & 1) != 0) { + // This position has a payload + payloadLength = prox.readVInt(); + + if (payload == null) { + payload = new BytesRef(); + payload.bytes = new byte[payloadLength]; + } else if (payload.bytes.length < payloadLength) { + payload.grow(payloadLength); + } + + prox.readBytes(payload.bytes, 0, payloadLength); + payload.length = payloadLength; + thisPayload = payload; + + } else { + payloadLength = 0; + thisPayload = null; + } + + postingsConsumer.addPosition(position, thisPayload); + } + + postingsConsumer.finishDoc(); + } + } + termsConsumer.finishTerm(text, new TermStats(numDocs, totTF)); + sumTotalTermFreq += totTF; + } + + termsConsumer.finish(sumTotalTermFreq); + } + } diff --git a/lucene/src/java/org/apache/lucene/index/FrozenBufferedDeletes.java b/lucene/src/java/org/apache/lucene/index/FrozenBufferedDeletes.java index b54213966ac..8ff3142e6ef 100644 --- a/lucene/src/java/org/apache/lucene/index/FrozenBufferedDeletes.java +++ b/lucene/src/java/org/apache/lucene/index/FrozenBufferedDeletes.java @@ -52,9 +52,15 @@ class FrozenBufferedDeletes { final int[] queryLimits; final int bytesUsed; final int numTermDeletes; - final long gen; + private long gen = -1; // assigned by BufferedDeletesStream once pushed + final boolean isSegmentPrivate; // set to true iff this frozen packet represents + // a segment private deletes. 
in that case is should + // only have Queries - public FrozenBufferedDeletes(BufferedDeletes deletes, long gen) { + + public FrozenBufferedDeletes(BufferedDeletes deletes, boolean isSegmentPrivate) { + this.isSegmentPrivate = isSegmentPrivate; + assert !isSegmentPrivate || deletes.terms.size() == 0 : "segment private package should only have del queries"; terms = deletes.terms.keySet().toArray(new Term[deletes.terms.size()]); queries = new Query[deletes.queries.size()]; queryLimits = new int[deletes.queries.size()]; @@ -66,8 +72,17 @@ class FrozenBufferedDeletes { } bytesUsed = terms.length * BYTES_PER_DEL_TERM + queries.length * BYTES_PER_DEL_QUERY; numTermDeletes = deletes.numTermDeletes.get(); + } + + public void setDelGen(long gen) { + assert this.gen == -1; this.gen = gen; } + + public long delGen() { + assert gen != -1; + return gen; + } public Iterable termsIterable() { return new Iterable () { diff --git a/lucene/src/java/org/apache/lucene/index/Healthiness.java b/lucene/src/java/org/apache/lucene/index/Healthiness.java new file mode 100644 index 00000000000..dcb9868ab0d --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/Healthiness.java @@ -0,0 +1,121 @@ +package org.apache.lucene.index; + +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +import java.util.concurrent.locks.AbstractQueuedSynchronizer; + +import org.apache.lucene.index.DocumentsWriterPerThreadPool.ThreadState; + +/** + * Controls the health status of a {@link DocumentsWriter} sessions. This class + * used to block incoming indexing threads if flushing significantly slower than + * indexing to ensure the {@link DocumentsWriter}s healthiness. If flushing is + * significantly slower than indexing the net memory used within an + * {@link IndexWriter} session can increase very quickly and easily exceed the + * JVM's available memory. + * + * To prevent OOM Errors and ensure IndexWriter's stability this class blocks + * incoming threads from indexing once 2 x number of available + * {@link ThreadState}s in {@link DocumentsWriterPerThreadPool} is exceeded. + * Once flushing catches up and the number of flushing DWPT is equal or lower + * than the number of active {@link ThreadState}s threads are released and can + * continue indexing. + */ +//TODO: rename this to DocumentsWriterStallControl (or something like that)? 
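              // [Annotation, not part of this patch] A minimal sketch of the intended call
              // pattern, assuming the flush control notifies this class whenever its
              // active/flushing DWPT counts change (the actual call sites are outside this
              // hunk):
              //
              //   healthiness.updateStalled(flushControl);  // re-evaluate after flush state changes
              //   healthiness.waitIfStalled();              // indexing threads block here while stalled
              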
+final class Healthiness { + + @SuppressWarnings("serial") + private static final class Sync extends AbstractQueuedSynchronizer { + volatile boolean hasBlockedThreads = false; // only with assert + + Sync() { + setState(0); + } + + boolean isHealthy() { + return getState() == 0; + } + + boolean trySetStalled() { + int state = getState(); + return compareAndSetState(state, state + 1); + } + + boolean tryReset() { + final int oldState = getState(); + if (oldState == 0) + return true; + if (compareAndSetState(oldState, 0)) { + releaseShared(0); + return true; + } + return false; + } + + @Override + public int tryAcquireShared(int acquires) { + assert maybeSetHasBlocked(getState()); + return getState() == 0 ? 1 : -1; + } + + // only used for testing + private boolean maybeSetHasBlocked(int state) { + hasBlockedThreads |= getState() != 0; + return true; + } + + @Override + public boolean tryReleaseShared(int newState) { + return (getState() == 0); + } + } + + private final Sync sync = new Sync(); + volatile boolean wasStalled = false; // only with asserts + + boolean anyStalledThreads() { + return !sync.isHealthy(); + } + + /** + * Update the stalled flag status. This method will set the stalled flag to + *
              <code>true</code>
              
iff the number of flushing + * {@link DocumentsWriterPerThread} is greater than the number of active + * {@link DocumentsWriterPerThread}. Otherwise it will reset the + * {@link Healthiness} to healthy and release all threads waiting on + * {@link #waitIfStalled()} + */ + void updateStalled(DocumentsWriterFlushControl flushControl) { + do { + // if we have more flushing DWPT than numActiveDWPT we stall! + while (flushControl.numActiveDWPT() < flushControl.numFlushingDWPT()) { + if (sync.trySetStalled()) { + assert wasStalled = true; + return; + } + } + } while (!sync.tryReset()); + } + + void waitIfStalled() { + sync.acquireShared(0); + } + + boolean hasBlocked() { + return sync.hasBlockedThreads; + } +} \ No newline at end of file diff --git a/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java b/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java index 5d2f959bc65..ecf41bacabc 100644 --- a/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java +++ b/lucene/src/java/org/apache/lucene/index/IndexFileDeleter.java @@ -21,7 +21,13 @@ import java.io.FileNotFoundException; import java.io.FilenameFilter; import java.io.IOException; import java.io.PrintStream; -import java.util.*; +import java.util.ArrayList; +import java.util.Collection; +import java.util.Collections; +import java.util.Date; +import java.util.HashMap; +import java.util.List; +import java.util.Map; import org.apache.lucene.index.codecs.CodecProvider; import org.apache.lucene.store.Directory; @@ -49,12 +55,12 @@ import org.apache.lucene.util.CollectionUtil; * (IndexDeletionPolicy) is consulted on creation (onInit) * and once per commit (onCommit), to decide when a commit * should be removed. - * + * * It is the business of the IndexDeletionPolicy to choose * when to delete commit points. The actual mechanics of * file deletion, retrying, etc, derived from the deletion * of commit points is the business of the IndexFileDeleter. - * + * * The current default deletion policy is {@link * KeepOnlyLastCommitDeletionPolicy}, which removes all * prior commits when a new commit has completed. This @@ -72,7 +78,7 @@ final class IndexFileDeleter { * so we will retry them again later: */ private Listdeletable; - /* Reference count for all files in the index. + /* Reference count for all files in the index. * Counts how many existing commits reference a file. **/ private Map refCounts = new HashMap (); @@ -88,7 +94,7 @@ final class IndexFileDeleter { * non-commit checkpoint: */ private List > lastFiles = new ArrayList >(); - /* Commits that the IndexDeletionPolicy have decided to delete: */ + /* Commits that the IndexDeletionPolicy have decided to delete: */ private List commitsToDelete = new ArrayList (); private PrintStream infoStream; @@ -108,7 +114,7 @@ final class IndexFileDeleter { message("setInfoStream deletionPolicy=" + policy); } } - + private void message(String message) { infoStream.println("IFD [" + new Date() + "; " + Thread.currentThread().getName() + "]: " + message); } @@ -139,12 +145,12 @@ final class IndexFileDeleter { // counts: long currentGen = segmentInfos.getGeneration(); indexFilenameFilter = new IndexFileNameFilter(codecs); - + CommitPoint currentCommitPoint = null; String[] files = null; try { files = directory.listAll(); - } catch (NoSuchDirectoryException e) { + } catch (NoSuchDirectoryException e) { // it means the directory is empty, so ignore it. 
files = new String[0]; } @@ -152,7 +158,7 @@ final class IndexFileDeleter { for (String fileName : files) { if ((indexFilenameFilter.accept(null, fileName)) && !fileName.endsWith("write.lock") && !fileName.equals(IndexFileNames.SEGMENTS_GEN)) { - + // Add this file to refCounts with initial count 0: getRefCount(fileName); @@ -233,7 +239,7 @@ final class IndexFileDeleter { // Now delete anything with ref count at 0. These are // presumably abandoned files eg due to crash of // IndexWriter. - for(Map.Entry entry : refCounts.entrySet() ) { + for(Map.Entry entry : refCounts.entrySet() ) { RefCount rc = entry.getValue(); final String fileName = entry.getKey(); if (0 == rc.count) { @@ -253,7 +259,7 @@ final class IndexFileDeleter { // Always protect the incoming segmentInfos since // sometime it may not be the most recent commit checkpoint(segmentInfos, false); - + startingCommitDeleted = currentCommitPoint == null ? false : currentCommitPoint.isDeleted(); deleteCommits(); @@ -327,7 +333,7 @@ final class IndexFileDeleter { segmentPrefix1 = null; segmentPrefix2 = null; } - + for(int i=0;i oldDeletable = deletable; @@ -397,7 +403,7 @@ final class IndexFileDeleter { /** * For definition of "check point" see IndexWriter comments: * "Clarification: Check Points (and commits)". - * + * * Writer calls this when it has made a "consistent * change" to the index, meaning new files are written to * the index and the in-memory SegmentInfos have been @@ -417,7 +423,7 @@ final class IndexFileDeleter { public void checkpoint(SegmentInfos segmentInfos, boolean isCommit) throws IOException { if (infoStream != null) { - message("now checkpoint \"" + segmentInfos.getCurrentSegmentFileName() + "\" [" + segmentInfos.size() + " segments " + "; isCommit = " + isCommit + "]"); + message("now checkpoint \"" + segmentInfos + "\" [" + segmentInfos.size() + " segments " + "; isCommit = " + isCommit + "]"); } // Try again now to delete any previously un-deletable diff --git a/lucene/src/java/org/apache/lucene/index/IndexReader.java b/lucene/src/java/org/apache/lucene/index/IndexReader.java index ae49b504868..984f77b7117 100644 --- a/lucene/src/java/org/apache/lucene/index/IndexReader.java +++ b/lucene/src/java/org/apache/lucene/index/IndexReader.java @@ -23,6 +23,7 @@ import org.apache.lucene.search.FieldCache; // javadocs import org.apache.lucene.search.Similarity; import org.apache.lucene.index.codecs.Codec; import org.apache.lucene.index.codecs.CodecProvider; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.index.values.DocValues; import org.apache.lucene.store.*; import org.apache.lucene.util.ArrayUtil; @@ -923,6 +924,22 @@ public abstract class IndexReader implements Cloneable,Closeable { } } + /** + * Returns true
if an index exists at the specified directory. + * @param directory the directory to check for an index + * @param codecProvider provides a CodecProvider in case the index uses non-core codecs + * @returntrue
if an index exists;false
otherwise + * @throws IOException if there is a problem with accessing the index + */ + public static boolean indexExists(Directory directory, CodecProvider codecProvider) throws IOException { + try { + new SegmentInfos().read(directory, codecProvider); + return true; + } catch (IOException ioe) { + return false; + } + } + /** Returns the number of documents in this index. */ public abstract int numDocs(); @@ -1051,6 +1068,9 @@ public abstract class IndexReader implements Cloneable,Closeable { * using {@link ReaderUtil#gatherSubReaders} and iterate * through them yourself. */ public abstract Fields fields() throws IOException; + + // nocommit javadoc + public abstract PerDocValues perDocValues() throws IOException; public int docFreq(Term term) throws IOException { return docFreq(term.field(), term.bytes()); @@ -1554,11 +1574,11 @@ public abstract class IndexReader implements Cloneable,Closeable { } public DocValues docValues(String field) throws IOException { - final Fields fields = fields(); - if (fields == null) { + final PerDocValues perDoc = perDocValues(); + if (perDoc == null) { return null; } - return fields.docValues(field); + return perDoc.docValues(field); } private volatile Fields fields; @@ -1572,6 +1592,19 @@ public abstract class IndexReader implements Cloneable,Closeable { Fields retrieveFields() { return fields; } + + private volatile PerDocValues perDocValues; + + /** @lucene.internal */ + void storePerDoc(PerDocValues perDocValues) { + this.perDocValues = perDocValues; + } + + /** @lucene.internal */ + PerDocValues retrievePerDoc() { + return perDocValues; + } + /** * A struct like class that represents a hierarchical relationship between diff --git a/lucene/src/java/org/apache/lucene/index/IndexWriter.java b/lucene/src/java/org/apache/lucene/index/IndexWriter.java index cedd1990905..166a6d594dd 100644 --- a/lucene/src/java/org/apache/lucene/index/IndexWriter.java +++ b/lucene/src/java/org/apache/lucene/index/IndexWriter.java @@ -35,6 +35,7 @@ import java.util.concurrent.ConcurrentHashMap; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.document.Document; +import org.apache.lucene.index.DocumentsWriterPerThread.FlushedSegment; import org.apache.lucene.index.FieldInfos.FieldNumberBiMap; import org.apache.lucene.index.IndexWriterConfig.OpenMode; import org.apache.lucene.index.PayloadProcessorProvider.DirPayloadProcessor; @@ -46,6 +47,7 @@ import org.apache.lucene.store.BufferedIndexInput; import org.apache.lucene.store.Directory; import org.apache.lucene.store.Lock; import org.apache.lucene.store.LockObtainFailedException; +import org.apache.lucene.util.BitVector; import org.apache.lucene.util.Bits; import org.apache.lucene.util.Constants; import org.apache.lucene.util.ThreadInterruptedException; @@ -54,17 +56,16 @@ import org.apache.lucene.util.MapBackedSet; /** AnIndexWriter
creates and maintains an index. -The
create
argument to the {@link - #IndexWriter(Directory, IndexWriterConfig) constructor} determines +The {@link OpenMode} option on + {@link IndexWriterConfig#setOpenMode(OpenMode)} determines whether a new index is created, or whether an existing index is - opened. Note that you can open an index with
+ and won't see the newly created index until they re-open. If + {@link OpenMode#CREATE_OR_APPEND} is used IndexWriter will create a + new index if there is not already an index at the provided path + and otherwise open the existing index.create=true
- even while readers are using the index. The old readers will + opened. Note that you can open an index with {@link OpenMode#CREATE} + even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, - and won't see the newly created index until they re-open. There are - also {@link #IndexWriter(Directory, IndexWriterConfig) constructors} - with nocreate
argument which will create a new index - if there is not already an index at the provided path and otherwise - open the existing index.In either case, documents are added with {@link #addDocument(Document) addDocument} and removed with {@link #deleteDocuments(Term)} or {@link @@ -76,15 +77,19 @@ import org.apache.lucene.util.MapBackedSet;
These changes are buffered in memory and periodically flushed to the {@link Directory} (during the above method - calls). A flush is triggered when there are enough - buffered deletes (see {@link IndexWriterConfig#setMaxBufferedDeleteTerms}) - or enough added documents since the last flush, whichever - is sooner. For the added documents, flushing is triggered - either by RAM usage of the documents (see {@link - IndexWriterConfig#setRAMBufferSizeMB}) or the number of added documents. - The default is to flush when RAM usage hits 16 MB. For + calls). A flush is triggered when there are enough added documents + since the last flush. Flushing is triggered either by RAM usage of the + documents (see {@link IndexWriterConfig#setRAMBufferSizeMB}) or the + number of added documents (see {@link IndexWriterConfig#setMaxBufferedDocs(int)}). + The default is to flush when RAM usage hits + {@link IndexWriterConfig#DEFAULT_RAM_BUFFER_SIZE_MB} MB. For best indexing speed you should flush by RAM usage with a - large RAM buffer. Note that flushing just moves the + large RAM buffer. Additionally, if IndexWriter reaches the configured number of + buffered deletes (see {@link IndexWriterConfig#setMaxBufferedDeleteTerms}) + the deleted terms and queries are flushed and applied to existing segments. + In contrast to the other flush options {@link IndexWriterConfig#setRAMBufferSizeMB} and + {@link IndexWriterConfig#setMaxBufferedDocs(int)}, deleted terms + won't trigger a segment flush. Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either {@link #commit()} or {@link #close} is called. A flush may @@ -165,21 +170,21 @@ import org.apache.lucene.util.MapBackedSet; /* * Clarification: Check Points (and commits) * IndexWriter writes new index files to the directory without writing a new segments_N - * file which references these new files. It also means that the state of + * file which references these new files. It also means that the state of * the in memory SegmentInfos object is different than the most recent * segments_N file written to the directory. - * - * Each time the SegmentInfos is changed, and matches the (possibly - * modified) directory files, we have a new "check point". - * If the modified/new SegmentInfos is written to disk - as a new - * (generation of) segments_N file - this check point is also an + * + * Each time the SegmentInfos is changed, and matches the (possibly + * modified) directory files, we have a new "check point". + * If the modified/new SegmentInfos is written to disk - as a new + * (generation of) segments_N file - this check point is also an * IndexCommit. - * - * A new checkpoint always replaces the previous checkpoint and - * becomes the new "front" of the index. This allows the IndexFileDeleter + * + * A new checkpoint always replaces the previous checkpoint and + * becomes the new "front" of the index. This allows the IndexFileDeleter * to delete files that are referenced only by stale checkpoints. * (files that were created since the last commit, but are no longer - * referenced by the "front" of the index). For this, IndexFileDeleter + * referenced by the "front" of the index). For this, IndexFileDeleter * keeps track of the last non commit checkpoint. */ public class IndexWriter implements Closeable { @@ -195,7 +200,7 @@ public class IndexWriter implements Closeable { * printed to infoStream, if set (see {@link * #setInfoStream}). 
*/ - public final static int MAX_TERM_LENGTH = DocumentsWriter.MAX_TERM_LENGTH_UTF8; + public final static int MAX_TERM_LENGTH = DocumentsWriterPerThread.MAX_TERM_LENGTH_UTF8; // The normal read buffer size defaults to 1024, but // increasing this during merging seems to yield @@ -225,7 +230,7 @@ public class IndexWriter implements Closeable { final FieldNumberBiMap globalFieldNumberMap; private DocumentsWriter docWriter; - private IndexFileDeleter deleter; + final IndexFileDeleter deleter; private Set
- * + * *segmentsToOptimize = new HashSet (); // used by optimize to note those needing optimization private int optimizeMaxNumSegments; @@ -247,12 +252,12 @@ public class IndexWriter implements Closeable { private long mergeGen; private boolean stopMerges; - private final AtomicInteger flushCount = new AtomicInteger(); - private final AtomicInteger flushDeletesCount = new AtomicInteger(); + final AtomicInteger flushCount = new AtomicInteger(); + final AtomicInteger flushDeletesCount = new AtomicInteger(); final ReaderPool readerPool = new ReaderPool(); final BufferedDeletesStream bufferedDeletesStream; - + // This is a "write once" variable (like the organic dye // on a DVD-R that may or may not be heated by a laser and // then cooled to permanently record the event): it's @@ -339,31 +344,58 @@ public class IndexWriter implements Closeable { */ IndexReader getReader(boolean applyAllDeletes) throws IOException { ensureOpen(); - + final long tStart = System.currentTimeMillis(); if (infoStream != null) { message("flush at getReader"); } - // Do this up front before flushing so that the readers // obtained during this flush are pooled, the first time // this method is called: poolReaders = true; - - // Prevent segmentInfos from changing while opening the - // reader; in theory we could do similar retry logic, - // just like we do when loading segments_N - IndexReader r; - synchronized(this) { - flush(false, applyAllDeletes); - r = new DirectoryReader(this, segmentInfos, config.getReaderTermsIndexDivisor(), codecs, applyAllDeletes); - if (infoStream != null) { - message("return reader version=" + r.getVersion() + " reader=" + r); + final IndexReader r; + doBeforeFlush(); + final boolean anySegmentFlushed; + /* + * for releasing a NRT reader we must ensure that + * DW doesn't add any segments or deletes until we are + * done with creating the NRT DirectoryReader. + * We release the two stage full flush after we are done opening the + * directory reader! + */ + synchronized (fullFlushLock) { + boolean success = false; + try { + anySegmentFlushed = docWriter.flushAllThreads(); + if (!anySegmentFlushed) { + // prevent double increment since docWriter#doFlush increments the flushcount + // if we flushed anything. + flushCount.incrementAndGet(); + } + success = true; + // Prevent segmentInfos from changing while opening the + // reader; in theory we could do similar retry logic, + // just like we do when loading segments_N + synchronized(this) { + maybeApplyDeletes(applyAllDeletes); + r = new DirectoryReader(this, segmentInfos, config.getReaderTermsIndexDivisor(), codecs, applyAllDeletes); + if (infoStream != null) { + message("return reader version=" + r.getVersion() + " reader=" + r); + } + } + } finally { + if (!success && infoStream != null) { + message("hit exception during while NRT reader"); + } + // Done: finish the full flush! + docWriter.finishFullFlush(success); + doAfterFlush(); } } - maybeMerge(); - + if (anySegmentFlushed) { + maybeMerge(); + } if (infoStream != null) { message("getReader took " + (System.currentTimeMillis() - tStart) + " msec"); } @@ -400,10 +432,10 @@ public class IndexWriter implements Closeable { if (r != null) { r.hasChanges = false; } - } + } } } - + // used only by asserts public synchronized boolean infoIsLive(SegmentInfo info) { int idx = segmentInfos.indexOf(info); @@ -419,7 +451,7 @@ public class IndexWriter implements Closeable { } return info; } - + /** * Release the segment reader (i.e. decRef it and close if there * are no more references. 
@@ -432,7 +464,7 @@ public class IndexWriter implements Closeable { public synchronized boolean release(SegmentReader sr) throws IOException { return release(sr, false); } - + /** * Release the segment reader (i.e. decRef it and close if there * are no more references. @@ -493,7 +525,7 @@ public class IndexWriter implements Closeable { sr.close(); } } - + /** Remove all our references to readers, and commits * any pending changes. */ synchronized void close() throws IOException { @@ -503,7 +535,7 @@ public class IndexWriter implements Closeable { Iterator > iter = readerMap.entrySet().iterator(); while (iter.hasNext()) { - + Map.Entry ent = iter.next(); SegmentReader sr = ent.getValue(); @@ -526,7 +558,7 @@ public class IndexWriter implements Closeable { sr.decRef(); } } - + /** * Commit all segment reader in the pool. * @throws IOException @@ -550,7 +582,7 @@ public class IndexWriter implements Closeable { } } } - + /** * Returns a ref to a clone. NOTE: this clone is not * enrolled in the pool, so you should simply close() @@ -564,7 +596,7 @@ public class IndexWriter implements Closeable { sr.decRef(); } } - + /** * Obtain a SegmentReader from the readerPool. The reader * must be returned by calling {@link #release(SegmentReader)} @@ -580,7 +612,7 @@ public class IndexWriter implements Closeable { /** * Obtain a SegmentReader from the readerPool. The reader * must be returned by calling {@link #release(SegmentReader)} - * + * * @see #release(SegmentReader) * @param info * @param doOpenStores @@ -638,7 +670,7 @@ public class IndexWriter implements Closeable { return sr; } } - + /** * Obtain the number of deleted docs for a pooled reader. * If the reader isn't being pooled, the segmentInfo's @@ -658,7 +690,7 @@ public class IndexWriter implements Closeable { } } } - + /** * Used internally to throw an {@link * AlreadyClosedException} if this IndexWriter has been @@ -721,7 +753,7 @@ public class IndexWriter implements Closeable { mergePolicy.setIndexWriter(this); mergeScheduler = conf.getMergeScheduler(); codecs = conf.getCodecProvider(); - + bufferedDeletesStream = new BufferedDeletesStream(messageID); bufferedDeletesStream.setInfoStream(infoStream); poolReaders = conf.getReaderPooling(); @@ -790,8 +822,7 @@ public class IndexWriter implements Closeable { // start with previous field numbers, but new FieldInfos globalFieldNumberMap = segmentInfos.getOrLoadGlobalFieldNumberMap(directory); - docWriter = new DocumentsWriter(config, directory, this, conf.getIndexingChain(), - globalFieldNumberMap.newFieldInfos(SegmentCodecsBuilder.create(codecs)), bufferedDeletesStream); + docWriter = new DocumentsWriter(config, directory, this, globalFieldNumberMap, bufferedDeletesStream); docWriter.setInfoStream(infoStream); // Default deleter (for backwards compatibility) is @@ -849,7 +880,7 @@ public class IndexWriter implements Closeable { public IndexWriterConfig getConfig() { return config; } - + /** If non-null, this will be the default infoStream used * by a newly instantiated IndexWriter. * @see #setInfoStream @@ -871,7 +902,7 @@ public class IndexWriter implements Closeable { * message when maxFieldLength is reached will be printed * to this. 
*/ - public void setInfoStream(PrintStream infoStream) { + public void setInfoStream(PrintStream infoStream) throws IOException { ensureOpen(); this.infoStream = infoStream; docWriter.setInfoStream(infoStream); @@ -881,7 +912,7 @@ public class IndexWriter implements Closeable { messageState(); } - private void messageState() { + private void messageState() throws IOException { message("\ndir=" + directory + "\n" + "index=" + segString() + "\n" + "version=" + Constants.LUCENE_VERSION + "\n" + @@ -901,7 +932,7 @@ public class IndexWriter implements Closeable { public boolean verbose() { return infoStream != null; } - + /** * Commits all changes to an index and closes all * associated files. Note that this may be a costly @@ -916,7 +947,7 @@ public class IndexWriter implements Closeable { * even though part of it (flushing buffered documents) * may have succeeded, so the write lock will still be * held. If you can correct the underlying cause (eg free up * some disk space) then you can call close() again. * Failing that, if you want to force the write lock to be @@ -1036,7 +1067,7 @@ public class IndexWriter implements Closeable { if (infoStream != null) message("now call final commit()"); - + if (!hitOOM) { commitInternal(null); } @@ -1049,7 +1080,7 @@ public class IndexWriter implements Closeable { docWriter = null; deleter.close(); } - + if (writeLock != null) { writeLock.release(); // release write lock writeLock = null; @@ -1072,7 +1103,7 @@ public class IndexWriter implements Closeable { } /** Returns the Directory used by this index. */ - public Directory getDirectory() { + public Directory getDirectory() { // Pass false because the flush during closing calls getDirectory ensureOpen(false); return directory; @@ -1196,22 +1227,7 @@ public class IndexWriter implements Closeable { * @throws IOException if there is a low-level IO error */ public void addDocument(Document doc, Analyzer analyzer) throws CorruptIndexException, IOException { - ensureOpen(); - boolean doFlush = false; - boolean success = false; - try { - try { - doFlush = docWriter.updateDocument(doc, analyzer, null); - success = true; - } finally { - if (!success && infoStream != null) - message("hit exception adding document"); - } - if (doFlush) - flush(true, false); - } catch (OutOfMemoryError oom) { - handleOOM(oom, "addDocument"); - } + updateDocument(null, doc, analyzer); } /** @@ -1228,9 +1244,7 @@ public class IndexWriter implements Closeable { public void deleteDocuments(Term term) throws CorruptIndexException, IOException { ensureOpen(); try { - if (docWriter.deleteTerm(term, false)) { - flush(true, false); - } + docWriter.deleteTerms(term); } catch (OutOfMemoryError oom) { handleOOM(oom, "deleteDocuments(Term)"); } @@ -1238,7 +1252,8 @@ public class IndexWriter implements Closeable { /** * Deletes the document(s) containing any of the - * terms. All deletes are flushed at the same time. + * terms. All given deletes are applied and flushed atomically + * at the same time. * *
NOTE: if this method hits an OutOfMemoryError * you should immediately close the writer. See NOTE: if this method hits an OutOfMemoryError * you should immediately close the writer. See (segmentInfos); optimizeMaxNumSegments = maxNumSegments; - + // Now mark all pending & running merges as optimize // merge: for(final MergePolicy.OneMerge merge : pendingMerges) { @@ -1612,12 +1622,12 @@ public class IndexWriter implements Closeable { if (merge.optimize) return true; } - + for (final MergePolicy.OneMerge merge : runningMerges) { if (merge.optimize) return true; } - + return false; } @@ -1640,6 +1650,8 @@ public class IndexWriter implements Closeable { throws CorruptIndexException, IOException { ensureOpen(); + flush(true, true); + if (infoStream != null) message("expungeDeletes: index now " + segString()); @@ -1712,6 +1724,10 @@ public class IndexWriter implements Closeable { * documents, so you must do so yourself if necessary. * See also {@link #expungeDeletes(boolean)} * + *
NOTE: this method first flushes a new + * segment (if there are indexed documents), and applies + * all buffered deletes. + * *
NOTE: if this method hits an OutOfMemoryError * you should immediately close the writer. See above for details.
@@ -1908,7 +1924,7 @@ public class IndexWriter implements Closeable { /** * Delete all documents in the index. * - *This method will drop all buffered documents and will + *
This method will drop all buffered documents and will * remove all segments from the index. This change will not be * visible until a {@link #commit()} has been called. This method * can be rolled back using {@link #rollback()}.
@@ -1938,7 +1954,7 @@ public class IndexWriter implements Closeable { deleter.refresh(); // Don't bother saving any changes in our segmentInfos - readerPool.clear(null); + readerPool.clear(null); // Mark that the index has changed ++changeCount; @@ -1965,7 +1981,7 @@ public class IndexWriter implements Closeable { mergeFinish(merge); } pendingMerges.clear(); - + for (final MergePolicy.OneMerge merge : runningMerges) { if (infoStream != null) message("now abort running merge " + merge.segString(directory)); @@ -1992,7 +2008,7 @@ public class IndexWriter implements Closeable { message("all running merges have aborted"); } else { - // waitForMerges() will ensure any running addIndexes finishes. + // waitForMerges() will ensure any running addIndexes finishes. // It's fine if a new one attempts to start because from our // caller above the call will see that we are in the // process of closing, and will throw an @@ -2004,7 +2020,7 @@ public class IndexWriter implements Closeable { /** * Wait for any currently outstanding merges to finish. * - *It is guaranteed that any merges started prior to calling this method + *
It is guaranteed that any merges started prior to calling this method * will have completed once this method completes.
*/ public synchronized void waitForMerges() { @@ -2034,6 +2050,125 @@ public class IndexWriter implements Closeable { deleter.checkpoint(segmentInfos, false); } + /** + * Prepares the {@link SegmentInfo} for the new flushed segment and persists + * the deleted documents {@link BitVector}. Use + * {@link #publishFlushedSegment(SegmentInfo, FrozenBufferedDeletes)} to + * publish the returned {@link SegmentInfo} together with its segment private + * delete packet. + * + * @see #publishFlushedSegment(SegmentInfo, FrozenBufferedDeletes) + */ + SegmentInfo prepareFlushedSegment(FlushedSegment flushedSegment) throws IOException { + assert flushedSegment != null; + + SegmentInfo newSegment = flushedSegment.segmentInfo; + + setDiagnostics(newSegment, "flush"); + + boolean success = false; + try { + if (useCompoundFile(newSegment)) { + String compoundFileName = IndexFileNames.segmentFileName(newSegment.name, "", IndexFileNames.COMPOUND_FILE_EXTENSION); + message("creating compound file " + compoundFileName); + // Now build compound file + CompoundFileWriter cfsWriter = new CompoundFileWriter(directory, compoundFileName); + for(String fileName : newSegment.files()) { + cfsWriter.addFile(fileName); + } + + // Perform the merge + cfsWriter.close(); + synchronized(this) { + deleter.deleteNewFiles(newSegment.files()); + } + + newSegment.setUseCompoundFile(true); + } + + // Must write deleted docs after the CFS so we don't + // slurp the del file into CFS: + if (flushedSegment.deletedDocuments != null) { + final int delCount = flushedSegment.deletedDocuments.count(); + assert delCount > 0; + newSegment.setDelCount(delCount); + newSegment.advanceDelGen(); + final String delFileName = newSegment.getDelFileName(); + if (infoStream != null) { + message("flush: write " + delCount + " deletes to " + delFileName); + } + boolean success2 = false; + try { + // TODO: in the NRT case it'd be better to hand + // this del vector over to the + // shortly-to-be-opened SegmentReader and let it + // carry the changes; there's no reason to use + // filesystem as intermediary here. + flushedSegment.deletedDocuments.write(directory, delFileName); + success2 = true; + } finally { + if (!success2) { + try { + directory.deleteFile(delFileName); + } catch (Throwable t) { + // suppress this so we keep throwing the + // original exception + } + } + } + } + + success = true; + } finally { + if (!success) { + if (infoStream != null) { + message("hit exception " + + "reating compound file for newly flushed segment " + newSegment.name); + } + + synchronized(this) { + deleter.refresh(newSegment.name); + } + } + } + return newSegment; + } + + /** + * Atomically adds the segment private delete packet and publishes the flushed + * segments SegmentInfo to the index writer. NOTE: use + * {@link #prepareFlushedSegment(FlushedSegment)} to obtain the + * {@link SegmentInfo} for the flushed segment. + * + * @see #prepareFlushedSegment(FlushedSegment) + */ + synchronized void publishFlushedSegment(SegmentInfo newSegment, + FrozenBufferedDeletes packet, FrozenBufferedDeletes globalPacket) throws IOException { + // Lock order IW -> BDS + synchronized (bufferedDeletesStream) { + if (globalPacket != null && globalPacket.any()) { + bufferedDeletesStream.push(globalPacket); + } + // Publishing the segment must be synched on IW -> BDS to make the sure + // that no merge prunes away the seg. 
private delete packet + final long nextGen; + if (packet != null && packet.any()) { + nextGen = bufferedDeletesStream.push(packet); + } else { + // Since we don't have a delete packet to apply we can get a new + // generation right away + nextGen = bufferedDeletesStream.getNextGen(); + } + newSegment.setBufferedDeletesGen(nextGen); + segmentInfos.add(newSegment); + checkpoint(); + } + } + + synchronized boolean useCompoundFile(SegmentInfo segmentInfo) throws IOException { + return mergePolicy.useCompoundFile(segmentInfos, segmentInfo); + } + private synchronized void resetMergeExceptions() { mergeExceptions = new ArrayList(); mergeGen++; @@ -2082,11 +2217,11 @@ public class IndexWriter implements Closeable { * * NOTE: this method only copies the segments of the incoming indexes * and does not merge them. Therefore deleted documents are not removed and - * the new segments are not merged with the existing ones. Also, the segments - * are copied as-is, meaning they are not converted to CFS if they aren't, - * and vice-versa. If you wish to do that, you can call {@link #maybeMerge} + * the new segments are not merged with the existing ones. Also, the segments + * are copied as-is, meaning they are not converted to CFS if they aren't, + * and vice-versa. If you wish to do that, you can call {@link #maybeMerge} * or {@link #optimize} afterwards. - * + * *
This requires this index not be among those to be added. * *
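A minimal sketch of the copy-only addIndexes(Directory...) path described above (the open writer and the directory paths are assumed for illustration only):

    // Segments are copied as-is: no merging, no removal of deleted docs, no CFS conversion.
    writer.addIndexes(FSDirectory.open(new File("/path/to/indexA")),
                      FSDirectory.open(new File("/path/to/indexB")));
    // Optionally let the merge policy fold the copied segments in afterwards.
    writer.maybeMerge();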
@@ -2123,7 +2258,7 @@ public class IndexWriter implements Closeable { docCount += info.docCount; String newSegName = newSegmentName(); String dsName = info.getDocStoreSegment(); - + if (infoStream != null) { message("addIndexes: process segment origName=" + info.name + " newName=" + newSegName + " dsName=" + dsName + " info=" + info); } @@ -2170,7 +2305,7 @@ public class IndexWriter implements Closeable { infos.add(info); } - } + } synchronized (this) { ensureOpen(); @@ -2211,15 +2346,20 @@ public class IndexWriter implements Closeable { ensureOpen(); try { + if (infoStream != null) + message("flush at addIndexes(IndexReader...)"); + flush(false, true); + String mergedName = newSegmentName(); SegmentMerger merger = new SegmentMerger(directory, config.getTermIndexInterval(), mergedName, null, codecs, payloadProcessorProvider, globalFieldNumberMap.newFieldInfos(SegmentCodecsBuilder.create(codecs))); - + for (IndexReader reader : readers) // add new indexes merger.add(reader); - + int docCount = merger.merge(); // merge 'em + final FieldInfos fieldInfos = merger.fieldInfos(); SegmentInfo info = new SegmentInfo(mergedName, docCount, directory, false, fieldInfos.hasProx(), merger.getSegmentCodecs(), @@ -2231,11 +2371,11 @@ public class IndexWriter implements Closeable { synchronized(this) { // Guard segmentInfos useCompoundFile = mergePolicy.useCompoundFile(segmentInfos, info); } - + // Now create the compound file if needed if (useCompoundFile) { merger.createCompoundFile(mergedName + ".cfs", info); - + // delete new non cfs files directly: they were never // registered with IFD deleter.deleteNewFiles(info.files()); @@ -2287,7 +2427,7 @@ public class IndexWriter implements Closeable { * #commit()} to finish the commit, or {@link * #rollback()} to revert the commit and undo all changes * done since the writer was opened.
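A hedged sketch of the two-phase commit flow this javadoc describes (error handling trimmed; writer is assumed to be an open IndexWriter):

    try {
      writer.prepareCommit();  // phase 1: flush and sync, but nothing becomes visible yet
      // ... do the other work that must succeed together with this commit ...
      writer.commit();         // phase 2: publish the prepared changes
    } catch (IOException e) {
      writer.rollback();       // discard all uncommitted changes; this also closes the writer
    }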
- * + * * You can also just call {@link #commit(Map)} directly * without prepareCommit first in which case that method * will internally call prepareCommit. @@ -2431,6 +2571,10 @@ public class IndexWriter implements Closeable { } } + // Ensures only one flush() is actually flushing segments + // at a time: + private final Object fullFlushLock = new Object(); + /** * Flush all in-memory buffered updates (adds and deletes) * to the Directory. @@ -2454,116 +2598,104 @@ public class IndexWriter implements Closeable { } } - // TODO: this method should not have to be entirely - // synchronized, ie, merges should be allowed to commit - // even while a flush is happening - private synchronized boolean doFlush(boolean applyAllDeletes) throws CorruptIndexException, IOException { - + private boolean doFlush(boolean applyAllDeletes) throws CorruptIndexException, IOException { if (hitOOM) { throw new IllegalStateException("this writer hit an OutOfMemoryError; cannot flush"); } doBeforeFlush(); - assert testPoint("startDoFlush"); - - // We may be flushing because it was triggered by doc - // count, del count, ram usage (in which case flush - // pending is already set), or we may be flushing - // due to external event eg getReader or commit is - // called (in which case we now set it, and this will - // pause all threads): - flushControl.setFlushPendingNoWait("explicit flush"); - boolean success = false; - try { if (infoStream != null) { message(" start flush: applyAllDeletes=" + applyAllDeletes); message(" index before flush " + segString()); } - - final SegmentInfo newSegment = docWriter.flush(this, deleter, mergePolicy, segmentInfos); - if (newSegment != null) { - setDiagnostics(newSegment, "flush"); - segmentInfos.add(newSegment); - checkpoint(); - } - - if (!applyAllDeletes) { - // If deletes alone are consuming > 1/2 our RAM - // buffer, force them all to apply now. 
This is to - // prevent too-frequent flushing of a long tail of - // tiny segments: - if (flushControl.getFlushDeletes() || - (config.getRAMBufferSizeMB() != IndexWriterConfig.DISABLE_AUTO_FLUSH && - bufferedDeletesStream.bytesUsed() > (1024*1024*config.getRAMBufferSizeMB()/2))) { - applyAllDeletes = true; - if (infoStream != null) { - message("force apply deletes bytesUsed=" + bufferedDeletesStream.bytesUsed() + " vs ramBuffer=" + (1024*1024*config.getRAMBufferSizeMB())); - } + final boolean anySegmentFlushed; + + synchronized (fullFlushLock) { + try { + anySegmentFlushed = docWriter.flushAllThreads(); + success = true; + } finally { + docWriter.finishFullFlush(success); } } - - if (applyAllDeletes) { - if (infoStream != null) { - message("apply all deletes during flush"); + success = false; + synchronized(this) { + maybeApplyDeletes(applyAllDeletes); + doAfterFlush(); + if (!anySegmentFlushed) { + // flushCount is incremented in flushAllThreads + flushCount.incrementAndGet(); } - flushDeletesCount.incrementAndGet(); - final BufferedDeletesStream.ApplyDeletesResult result = bufferedDeletesStream.applyDeletes(readerPool, segmentInfos); - if (result.anyDeletes) { - checkpoint(); - } - if (!keepFullyDeletedSegments && result.allDeleted != null) { - if (infoStream != null) { - message("drop 100% deleted segments: " + result.allDeleted); - } - for(SegmentInfo info : result.allDeleted) { - // If a merge has already registered for this - // segment, we leave it in the readerPool; the - // merge will skip merging it and will then drop - // it once it's done: - if (!mergingSegments.contains(info)) { - segmentInfos.remove(info); - if (readerPool != null) { - readerPool.drop(info); - } - } - } - checkpoint(); - } - bufferedDeletesStream.prune(segmentInfos); - assert !bufferedDeletesStream.any(); - flushControl.clearDeletes(); - } else if (infoStream != null) { - message("don't apply deletes now delTermCount=" + bufferedDeletesStream.numTerms() + " bytesUsed=" + bufferedDeletesStream.bytesUsed()); + success = true; + return anySegmentFlushed; } - - doAfterFlush(); - flushCount.incrementAndGet(); - - success = true; - - return newSegment != null; - } catch (OutOfMemoryError oom) { handleOOM(oom, "doFlush"); // never hit return false; } finally { - flushControl.clearFlushPending(); if (!success && infoStream != null) message("hit exception during flush"); } } + + final synchronized void maybeApplyDeletes(boolean applyAllDeletes) throws IOException { + if (applyAllDeletes) { + if (infoStream != null) { + message("apply all deletes during flush"); + } + applyAllDeletes(); + } else if (infoStream != null) { + message("don't apply deletes now delTermCount=" + bufferedDeletesStream.numTerms() + " bytesUsed=" + bufferedDeletesStream.bytesUsed()); + } + + } + + final synchronized void applyAllDeletes() throws IOException { + flushDeletesCount.incrementAndGet(); + final BufferedDeletesStream.ApplyDeletesResult result = bufferedDeletesStream + .applyDeletes(readerPool, segmentInfos); + if (result.anyDeletes) { + checkpoint(); + } + if (!keepFullyDeletedSegments && result.allDeleted != null) { + if (infoStream != null) { + message("drop 100% deleted segments: " + result.allDeleted); + } + for (SegmentInfo info : result.allDeleted) { + // If a merge has already registered for this + // segment, we leave it in the readerPool; the + // merge will skip merging it and will then drop + // it once it's done: + if (!mergingSegments.contains(info)) { + segmentInfos.remove(info); + if (readerPool != null) { + 
readerPool.drop(info); + } + } + } + checkpoint(); + } + bufferedDeletesStream.prune(segmentInfos); + } /** Expert: Return the total size of all index files currently cached in memory. * Useful for size management with flushRamDocs() */ public final long ramSizeInBytes() { ensureOpen(); - return docWriter.bytesUsed() + bufferedDeletesStream.bytesUsed(); + return docWriter.flushControl.netBytes() + bufferedDeletesStream.bytesUsed(); + } + + // for testing only + DocumentsWriter getDocsWriter() { + boolean test = false; + assert test = true; + return test?docWriter: null; } /** Expert: Return the number of documents currently @@ -2573,7 +2705,7 @@ public class IndexWriter implements Closeable { return docWriter.getNumDocs(); } - private void ensureValidMerge(MergePolicy.OneMerge merge) { + private void ensureValidMerge(MergePolicy.OneMerge merge) throws IOException { for(SegmentInfo info : merge.segments) { if (segmentInfos.indexOf(info) == -1) { throw new MergePolicy.MergeException("MergePolicy selected a segment (" + info.name + ") that is not in the current index " + segString(), directory); @@ -2699,7 +2831,7 @@ public class IndexWriter implements Closeable { } commitMergedDeletes(merge, mergedReader); - + // If the doc store we are using has been closed and // is in now compound format (but wasn't when we // started), then we will switch to the compound @@ -2713,7 +2845,7 @@ public class IndexWriter implements Closeable { message("merged segment " + merge.info + " is 100% deleted" + (keepFullyDeletedSegments ? "" : "; skipping insert")); } - final Set mergedAway = new HashSet(merge.segments); + final Set mergedAway = new HashSet (merge.segments); int segIdx = 0; int newSegIdx = 0; boolean inserted = false; @@ -2760,15 +2892,15 @@ public class IndexWriter implements Closeable { // them so that they don't bother writing them to // disk, updating SegmentInfo, etc.: readerPool.clear(merge.segments); - + if (merge.optimize) { // cascade the optimize: segmentsToOptimize.add(merge.info); } - + return true; } - + final private void handleMergeException(Throwable t, MergePolicy.OneMerge merge) throws IOException { if (infoStream != null) { @@ -2857,14 +2989,14 @@ public class IndexWriter implements Closeable { /** Hook that's called when the specified merge is complete. */ void mergeSuccess(MergePolicy.OneMerge merge) { } - + /** Checks whether this merge involves any segments * already participating in a merge. If not, this merge * is "registered", meaning we record that its segments * are now participating in a merge, and true is * returned. Else (the merge conflicts) false is * returned. 
*/ - final synchronized boolean registerMerge(MergePolicy.OneMerge merge) throws MergePolicy.MergeAbortedException { + final synchronized boolean registerMerge(MergePolicy.OneMerge merge) throws MergePolicy.MergeAbortedException, IOException { if (merge.registerDone) return true; @@ -2874,10 +3006,8 @@ public class IndexWriter implements Closeable { throw new MergePolicy.MergeAbortedException("merge is aborted: " + merge.segString(directory)); } - final int count = merge.segments.size(); boolean isExternal = false; - for(int i=0;i (); merge.readerClones = new ArrayList (); + merge.estimatedMergeBytes = 0; + // This is try/finally to make sure merger's readers are // closed: boolean success = false; @@ -3134,6 +3273,13 @@ public class IndexWriter implements Closeable { -config.getReaderTermsIndexDivisor()); merge.readers.add(reader); + final int readerMaxDoc = reader.maxDoc(); + if (readerMaxDoc > 0) { + final int delCount = reader.numDeletedDocs(); + final double delRatio = ((double) delCount)/readerMaxDoc; + merge.estimatedMergeBytes += info.sizeInBytes(true) * (1.0 - delRatio); + } + // We clone the segment readers because other // deletes may come in while we're merging so we // need readers that will not change @@ -3166,7 +3312,7 @@ public class IndexWriter implements Closeable { message("merge store matchedCount=" + merger.getMatchedSubReaderCount() + " vs " + merge.readers.size()); } anyNonBulkMerges |= merger.getAnyNonBulkMerges(); - + assert mergedDocCount == totDocCount: "mergedDocCount=" + mergedDocCount + " vs " + totDocCount; // Very important to do this before opening the reader @@ -3235,8 +3381,11 @@ public class IndexWriter implements Closeable { merge.info.setUseCompoundFile(true); } - final IndexReaderWarmer mergedSegmentWarmer = config.getMergedSegmentWarmer(); + if (infoStream != null) { + message(String.format("merged segment size=%.3f MB vs estimate=%.3f MB", merge.info.sizeInBytes(true)/1024./1024., merge.estimatedMergeBytes/1024/1024.)); + } + final IndexReaderWarmer mergedSegmentWarmer = config.getMergedSegmentWarmer(); final int termsIndexDivisor; final boolean loadDocStores; @@ -3297,12 +3446,12 @@ public class IndexWriter implements Closeable { // For test purposes. final int getBufferedDeleteTermsSize() { - return docWriter.getPendingDeletes().terms.size(); + return docWriter.getBufferedDeleteTermsSize(); } // For test purposes. final int getNumBufferedDeleteTerms() { - return docWriter.getPendingDeletes().numTermDeletes.get(); + return docWriter.getNumBufferedDeleteTerms(); } // utility routines for tests @@ -3310,21 +3459,41 @@ public class IndexWriter implements Closeable { return segmentInfos.size() > 0 ? 
segmentInfos.info(segmentInfos.size()-1) : null; } - public synchronized String segString() { + /** @lucene.internal */ + public synchronized String segString() throws IOException { return segString(segmentInfos); } - private synchronized String segString(SegmentInfos infos) { + /** @lucene.internal */ + public synchronized String segString(SegmentInfos infos) throws IOException { StringBuilder buffer = new StringBuilder(); final int count = infos.size(); for(int i = 0; i < count; i++) { if (i > 0) { buffer.append(' '); } - final SegmentInfo info = infos.info(i); - buffer.append(info.toString(directory, 0)); - if (info.dir != directory) - buffer.append("**"); + buffer.append(segString(infos.info(i))); + } + + return buffer.toString(); + } + + public synchronized String segString(SegmentInfo info) throws IOException { + StringBuilder buffer = new StringBuilder(); + SegmentReader reader = readerPool.getIfExists(info); + try { + if (reader != null) { + buffer.append(reader.toString()); + } else { + buffer.append(info.toString(directory, 0)); + if (info.dir != directory) { + buffer.append("**"); + } + } + } finally { + if (reader != null) { + readerPool.release(reader); + } } return buffer.toString(); } @@ -3397,17 +3566,17 @@ public class IndexWriter implements Closeable { assert lastCommitChangeCount <= changeCount; myChangeCount = changeCount; - + if (changeCount == lastCommitChangeCount) { if (infoStream != null) message(" skip startCommit(): no changes pending"); return; } - + // First, we clone & incref the segmentInfos we intend // to sync, then, without locking, we sync() all files // referenced by toSync, in the background. - + if (infoStream != null) message("startCommit index=" + segString(segmentInfos) + " changeCount=" + changeCount); @@ -3415,10 +3584,10 @@ public class IndexWriter implements Closeable { toSync = (SegmentInfos) segmentInfos.clone(); assert filesExist(toSync); - + if (commitUserData != null) toSync.setUserData(commitUserData); - + // This protects the segmentInfos we are now going // to commit. This is important in case, eg, while // we are trying to sync all referenced files, a @@ -3550,7 +3719,7 @@ public class IndexWriter implements Closeable { /** Expert: remove any index files that are no longer * used. - * + * * IndexWriter normally deletes unused files itself, * during indexing. However, on Windows, which disallows * deletion of open files, if there is a reader open on @@ -3599,7 +3768,7 @@ public class IndexWriter implements Closeable { public void setPayloadProcessorProvider(PayloadProcessorProvider pcp) { payloadProcessorProvider = pcp; } - + /** * Returns the {@link PayloadProcessorProvider} that is used during segment * merges to process payloads. 
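For context, a sketch of wiring a payload processor into merges; MyPayloadProcessorProvider is a hypothetical subclass, not something defined in this patch:

    writer.setPayloadProcessorProvider(new MyPayloadProcessorProvider());
    writer.optimize();                         // force merges so the processors actually run
    writer.setPayloadProcessorProvider(null);  // later merges leave payloads untouched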
@@ -3607,124 +3776,4 @@ public class IndexWriter implements Closeable { public PayloadProcessorProvider getPayloadProcessorProvider() { return payloadProcessorProvider; } - - // decides when flushes happen - final class FlushControl { - - private boolean flushPending; - private boolean flushDeletes; - private int delCount; - private int docCount; - private boolean flushing; - - private synchronized boolean setFlushPending(String reason, boolean doWait) { - if (flushPending || flushing) { - if (doWait) { - while(flushPending || flushing) { - try { - wait(); - } catch (InterruptedException ie) { - throw new ThreadInterruptedException(ie); - } - } - } - return false; - } else { - if (infoStream != null) { - message("now trigger flush reason=" + reason); - } - flushPending = true; - return flushPending; - } - } - - public synchronized void setFlushPendingNoWait(String reason) { - setFlushPending(reason, false); - } - - public synchronized boolean getFlushPending() { - return flushPending; - } - - public synchronized boolean getFlushDeletes() { - return flushDeletes; - } - - public synchronized void clearFlushPending() { - if (infoStream != null) { - message("clearFlushPending"); - } - flushPending = false; - flushDeletes = false; - docCount = 0; - notifyAll(); - } - - public synchronized void clearDeletes() { - delCount = 0; - } - - public synchronized boolean waitUpdate(int docInc, int delInc) { - return waitUpdate(docInc, delInc, false); - } - - public synchronized boolean waitUpdate(int docInc, int delInc, boolean skipWait) { - while(flushPending) { - try { - wait(); - } catch (InterruptedException ie) { - throw new ThreadInterruptedException(ie); - } - } - - // skipWait is only used when a thread is BOTH adding - // a doc and buffering a del term, and, the adding of - // the doc already triggered a flush - if (skipWait) { - docCount += docInc; - delCount += delInc; - return false; - } - - final int maxBufferedDocs = config.getMaxBufferedDocs(); - if (maxBufferedDocs != IndexWriterConfig.DISABLE_AUTO_FLUSH && - (docCount+docInc) >= maxBufferedDocs) { - return setFlushPending("maxBufferedDocs", true); - } - docCount += docInc; - - final int maxBufferedDeleteTerms = config.getMaxBufferedDeleteTerms(); - if (maxBufferedDeleteTerms != IndexWriterConfig.DISABLE_AUTO_FLUSH && - (delCount+delInc) >= maxBufferedDeleteTerms) { - flushDeletes = true; - return setFlushPending("maxBufferedDeleteTerms", true); - } - delCount += delInc; - - return flushByRAMUsage("add delete/doc"); - } - - public synchronized boolean flushByRAMUsage(String reason) { - final double ramBufferSizeMB = config.getRAMBufferSizeMB(); - if (ramBufferSizeMB != IndexWriterConfig.DISABLE_AUTO_FLUSH) { - final long limit = (long) (ramBufferSizeMB*1024*1024); - long used = bufferedDeletesStream.bytesUsed() + docWriter.bytesUsed(); - if (used >= limit) { - - // DocumentsWriter may be able to free up some - // RAM: - // Lock order: FC -> DW - docWriter.balanceRAM(); - - used = bufferedDeletesStream.bytesUsed() + docWriter.bytesUsed(); - if (used >= limit) { - return setFlushPending("ram full: " + reason, false); - } - } - } - return false; - } - } - - final FlushControl flushControl = new FlushControl(); } diff --git a/lucene/src/java/org/apache/lucene/index/IndexWriterConfig.java b/lucene/src/java/org/apache/lucene/index/IndexWriterConfig.java index 1674068491d..742043dd5cb 100644 --- a/lucene/src/java/org/apache/lucene/index/IndexWriterConfig.java +++ b/lucene/src/java/org/apache/lucene/index/IndexWriterConfig.java @@ -18,7 +18,7 @@ 
package org.apache.lucene.index; */ import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.index.DocumentsWriter.IndexingChain; +import org.apache.lucene.index.DocumentsWriterPerThread.IndexingChain; import org.apache.lucene.index.IndexWriter.IndexReaderWarmer; import org.apache.lucene.index.codecs.CodecProvider; import org.apache.lucene.search.IndexSearcher; @@ -41,7 +41,7 @@ import org.apache.lucene.util.Version; * IndexWriterConfig conf = new IndexWriterConfig(analyzer); * conf.setter1().setter2(); * - * + * * @since 3.1 */ public final class IndexWriterConfig implements Cloneable { @@ -56,7 +56,7 @@ public final class IndexWriterConfig implements Cloneable { * */ public static enum OpenMode { CREATE, APPEND, CREATE_OR_APPEND } - + /** Default value is 32. Change using {@link #setTermIndexInterval(int)}. */ public static final int DEFAULT_TERM_INDEX_INTERVAL = 32; // TODO: this should be private to the codec, not settable here @@ -77,23 +77,19 @@ public final class IndexWriterConfig implements Cloneable { /** * Default value for the write lock timeout (1,000 ms). - * + * * @see #setDefaultWriteLockTimeout(long) */ public static long WRITE_LOCK_TIMEOUT = 1000; - /** The maximum number of simultaneous threads that may be - * indexing documents at once in IndexWriter; if more - * than this many threads arrive they will wait for - * others to finish. */ - public final static int DEFAULT_MAX_THREAD_STATES = 8; - /** Default setting for {@link #setReaderPooling}. */ public final static boolean DEFAULT_READER_POOLING = false; /** Default value is 1. Change using {@link #setReaderTermsIndexDivisor(int)}. */ public static final int DEFAULT_READER_TERMS_INDEX_DIVISOR = IndexReader.DEFAULT_TERMS_INDEX_DIVISOR; + /** Default value is 1945. Change using {@link #setRAMPerThreadHardLimitMB(int)} */ + public static final int DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB = 1945; /** * Sets the default (for any instance) maximum time to wait for a write lock * (in milliseconds). @@ -105,7 +101,7 @@ public final class IndexWriterConfig implements Cloneable { /** * Returns the default write lock timeout for newly instantiated * IndexWriterConfigs. 
- * + * * @see #setDefaultWriteLockTimeout(long) */ public static long getDefaultWriteLockTimeout() { @@ -127,10 +123,12 @@ public final class IndexWriterConfig implements Cloneable { private volatile IndexReaderWarmer mergedSegmentWarmer; private volatile CodecProvider codecProvider; private volatile MergePolicy mergePolicy; - private volatile int maxThreadStates; + private volatile DocumentsWriterPerThreadPool indexerThreadPool; private volatile boolean readerPooling; private volatile int readerTermsIndexDivisor; - + private volatile FlushPolicy flushPolicy; + private volatile int perThreadHardLimitMB; + private Version matchVersion; /** @@ -153,15 +151,16 @@ public final class IndexWriterConfig implements Cloneable { maxBufferedDeleteTerms = DEFAULT_MAX_BUFFERED_DELETE_TERMS; ramBufferSizeMB = DEFAULT_RAM_BUFFER_SIZE_MB; maxBufferedDocs = DEFAULT_MAX_BUFFERED_DOCS; - indexingChain = DocumentsWriter.defaultIndexingChain; + indexingChain = DocumentsWriterPerThread.defaultIndexingChain; mergedSegmentWarmer = null; codecProvider = CodecProvider.getDefault(); - mergePolicy = new LogByteSizeMergePolicy(); - maxThreadStates = DEFAULT_MAX_THREAD_STATES; + mergePolicy = new TieredMergePolicy(); readerPooling = DEFAULT_READER_POOLING; + indexerThreadPool = new ThreadAffinityDocumentsWriterThreadPool(); readerTermsIndexDivisor = DEFAULT_READER_TERMS_INDEX_DIVISOR; + perThreadHardLimitMB = DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB; } - + @Override public Object clone() { // Shallow clone is the only thing that's possible, since parameters like @@ -186,7 +185,7 @@ public final class IndexWriterConfig implements Cloneable { this.openMode = openMode; return this; } - + /** Returns the {@link OpenMode} set by {@link #setOpenMode(OpenMode)}. */ public OpenMode getOpenMode() { return openMode; @@ -261,7 +260,7 @@ public final class IndexWriterConfig implements Cloneable { public SimilarityProvider getSimilarityProvider() { return similarityProvider; } - + /** * Expert: set the interval between indexed terms. Large values cause less * memory to be used by IndexReader, but slow random-access to terms. Small @@ -281,7 +280,7 @@ public final class IndexWriterConfig implements Cloneable { * In particular,
numUniqueTerms/interval
terms are read into * memory by an IndexReader, and, on average, interval/2
terms * must be scanned for each random term access. - * + * * @see #DEFAULT_TERM_INDEX_INTERVAL * *Takes effect immediately, but only applies to newly @@ -293,7 +292,7 @@ public final class IndexWriterConfig implements Cloneable { /** * Returns the interval between indexed terms. - * + * * @see #setTermIndexInterval(int) */ public int getTermIndexInterval() { // TODO: this should be private to the codec, not settable here @@ -331,10 +330,10 @@ public final class IndexWriterConfig implements Cloneable { this.writeLockTimeout = writeLockTimeout; return this; } - + /** * Returns allowed timeout when acquiring the write lock. - * + * * @see #setWriteLockTimeout(long) */ public long getWriteLockTimeout() { @@ -343,15 +342,16 @@ public final class IndexWriterConfig implements Cloneable { /** * Determines the minimal number of delete terms required before the buffered - * in-memory delete terms are applied and flushed. If there are documents - * buffered in memory at the time, they are merged and a new segment is - * created. - - *
Disabled by default (writer flushes by RAM usage). + * in-memory delete terms and queries are applied and flushed. + *
Disabled by default (writer flushes by RAM usage).
+ *+ * NOTE: This setting won't trigger a segment flush. + *
* * @throws IllegalArgumentException if maxBufferedDeleteTerms * is enabled but smaller than 1 * @see #setRAMBufferSizeMB + * @see #setFlushPolicy(FlushPolicy) * *Takes effect immediately, but only the next time a * document is added, updated or deleted. @@ -366,9 +366,9 @@ public final class IndexWriterConfig implements Cloneable { } /** - * Returns the number of buffered deleted terms that will trigger a flush if - * enabled. - * + * Returns the number of buffered deleted terms that will trigger a flush of all + * buffered deletes if enabled. + * * @see #setMaxBufferedDeleteTerms(int) */ public int getMaxBufferedDeleteTerms() { @@ -380,45 +380,50 @@ public final class IndexWriterConfig implements Cloneable { * and deletions before they are flushed to the Directory. Generally for * faster indexing performance it's best to flush by RAM usage instead of * document count and use as large a RAM buffer as you can. - * *
* When this is set, the writer will flush whenever buffered documents and * deletions use this much RAM. Pass in {@link #DISABLE_AUTO_FLUSH} to prevent * triggering a flush due to RAM usage. Note that if flushing by document * count is also enabled, then the flush will be triggered by whichever comes * first. - * + *
+ * The maximum RAM limit is inherently determined by the JVM's available memory. + * Yet, an {@link IndexWriter} session can consume a significantly larger amount + * of memory than the given RAM limit since this limit is just an indicator when + * to flush memory resident documents to the Directory. Flushes are likely to happen + * concurrently while other threads are adding documents to the writer. For application + * stability, the available memory in the JVM should be significantly larger than + * the RAM buffer used for indexing. *
* NOTE: the account of RAM usage for pending deletions is only * approximate. Specifically, if you delete by Query, Lucene currently has no * way to measure the RAM usage of individual Queries so the accounting will * under-estimate and you should compensate by either calling commit() * periodically yourself, or by using {@link #setMaxBufferedDeleteTerms(int)} - * to flush by count instead of RAM usage (each buffered delete Query counts - * as one). - * + * to flush and apply buffered deletes by count instead of RAM usage + * (for each buffered delete Query a constant number of bytes is used to estimate + * RAM usage). Note that enabling {@link #setMaxBufferedDeleteTerms(int)} will + * not trigger any segment flushes. *
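A small configuration sketch tying the flush triggers discussed above together; conf is assumed to be an existing IndexWriterConfig and the values are arbitrary examples, not recommendations:

    conf.setRAMBufferSizeMB(64.0);                                  // flush segments once ~64 MB of documents are buffered
    conf.setMaxBufferedDeleteTerms(10000);                          // apply buffered deletes every 10k terms (no segment flush)
    conf.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);  // never flush by document count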
- * NOTE: because IndexWriter uses
int
s when managing its - * internal storage, the absolute maximum value for this setting is somewhat - * less than 2048 MB. The precise limit depends on various factors, such as - * how large your documents are, how many fields have norms, etc., so it's - * best to set this value comfortably under 2048. - * + * NOTE: It's not guaranteed that all memory resident documents are flushed + * once this limit is exceeded. Depending on the configured {@link FlushPolicy} only a + * subset of the buffered documents are flushed and therefore only parts of the RAM + * buffer is released. *+ * * The default value is {@link #DEFAULT_RAM_BUFFER_SIZE_MB}. - * + * @see #setFlushPolicy(FlushPolicy) + * @see #setRAMPerThreadHardLimitMB(int) + * *
Takes effect immediately, but only the next time a * document is added, updated or deleted. * * @throws IllegalArgumentException * if ramBufferSize is enabled but non-positive, or it disables * ramBufferSize when maxBufferedDocs is already disabled + * */ public IndexWriterConfig setRAMBufferSizeMB(double ramBufferSizeMB) { - if (ramBufferSizeMB > 2048.0) { - throw new IllegalArgumentException("ramBufferSize " + ramBufferSizeMB - + " is too large; should be comfortably less than 2048"); - } if (ramBufferSizeMB != DISABLE_AUTO_FLUSH && ramBufferSizeMB <= 0.0) throw new IllegalArgumentException( "ramBufferSize should be > 0.0 MB when enabled"); @@ -438,22 +443,22 @@ public final class IndexWriterConfig implements Cloneable { * Determines the minimal number of documents required before the buffered * in-memory documents are flushed as a new Segment. Large values generally * give faster indexing. - * + * *
* When this is set, the writer will flush every maxBufferedDocs added * documents. Pass in {@link #DISABLE_AUTO_FLUSH} to prevent triggering a * flush due to number of buffered documents. Note that if flushing by RAM * usage is also enabled, then the flush will be triggered by whichever comes * first. - * + * *
* Disabled by default (writer flushes by RAM usage). - * + * *
Takes effect immediately, but only the next time a * document is added, updated or deleted. * * @see #setRAMBufferSizeMB(double) - * + * @see #setFlushPolicy(FlushPolicy) * @throws IllegalArgumentException * if maxBufferedDocs is enabled but smaller than 2, or it disables * maxBufferedDocs when ramBufferSize is already disabled @@ -473,7 +478,7 @@ public final class IndexWriterConfig implements Cloneable { /** * Returns the number of buffered added documents that will trigger a flush if * enabled. - * + * * @see #setMaxBufferedDocs(int) */ public int getMaxBufferedDocs() { @@ -519,32 +524,43 @@ public final class IndexWriterConfig implements Cloneable { return codecProvider; } - + /** * Returns the current MergePolicy in use by this writer. - * + * * @see #setMergePolicy(MergePolicy) */ public MergePolicy getMergePolicy() { return mergePolicy; } - /** - * Sets the max number of simultaneous threads that may be indexing documents - * at once in IndexWriter. Values < 1 are invalid and if passed - *
maxThreadStates
will be set to - * {@link #DEFAULT_MAX_THREAD_STATES}. - * - *Only takes effect when IndexWriter is first created. */ - public IndexWriterConfig setMaxThreadStates(int maxThreadStates) { - this.maxThreadStates = maxThreadStates < 1 ? DEFAULT_MAX_THREAD_STATES : maxThreadStates; + /** Expert: Sets the {@link DocumentsWriterPerThreadPool} instance used by the + * IndexWriter to assign thread-states to incoming indexing threads. If no + * {@link DocumentsWriterPerThreadPool} is set, {@link IndexWriter} will use + * {@link ThreadAffinityDocumentsWriterThreadPool} with the max number of + * thread-states set to {@link DocumentsWriterPerThreadPool#DEFAULT_MAX_THREAD_STATES}. + *
+ *+ * NOTE: The given {@link DocumentsWriterPerThreadPool} instance must not be used with + * other {@link IndexWriter} instances once it has been initialized / associated with an + * {@link IndexWriter}. + *
+ *+ * NOTE: This only takes effect when IndexWriter is first created.
*/ + public IndexWriterConfig setIndexerThreadPool(DocumentsWriterPerThreadPool threadPool) { + if(threadPool == null) { + throw new IllegalArgumentException("DocumentsWriterPerThreadPool must not be nul"); + } + this.indexerThreadPool = threadPool; return this; } - /** Returns the max number of simultaneous threads that - * may be indexing documents at once in IndexWriter. */ - public int getMaxThreadStates() { - return maxThreadStates; + /** Returns the configured {@link DocumentsWriterPerThreadPool} instance. + * @see #setIndexerThreadPool(DocumentsWriterPerThreadPool) + * @return the configured {@link DocumentsWriterPerThreadPool} instance.*/ + public DocumentsWriterPerThreadPool getIndexerThreadPool() { + return this.indexerThreadPool; } /** By default, IndexWriter does not pool the @@ -572,10 +588,10 @@ public final class IndexWriterConfig implements Cloneable { * *Only takes effect when IndexWriter is first created. */ IndexWriterConfig setIndexingChain(IndexingChain indexingChain) { - this.indexingChain = indexingChain == null ? DocumentsWriter.defaultIndexingChain : indexingChain; + this.indexingChain = indexingChain == null ? DocumentsWriterPerThread.defaultIndexingChain : indexingChain; return this; } - + /** Returns the indexing chain set on {@link #setIndexingChain(IndexingChain)}. */ IndexingChain getIndexingChain() { return indexingChain; @@ -604,6 +620,53 @@ public final class IndexWriterConfig implements Cloneable { return readerTermsIndexDivisor; } + /** + * Expert: Controls when segments are flushed to disk during indexing. + * The {@link FlushPolicy} initialized during {@link IndexWriter} instantiation and once initialized + * the given instance is bound to this {@link IndexWriter} and should not be used with another writer. + * @see #setMaxBufferedDeleteTerms(int) + * @see #setMaxBufferedDocs(int) + * @see #setRAMBufferSizeMB(double) + */ + public IndexWriterConfig setFlushPolicy(FlushPolicy flushPolicy) { + this.flushPolicy = flushPolicy; + return this; + } + + /** + * Expert: Sets the maximum memory consumption per thread triggering a forced + * flush if exceeded. A {@link DocumentsWriterPerThread} is forcefully flushed + * once it exceeds this limit even if the {@link #getRAMBufferSizeMB()} has + * not been exceeded. This is a safety limit to prevent a + * {@link DocumentsWriterPerThread} from address space exhaustion due to its + * internal 32 bit signed integer based memory addressing. + * The given value must be less that 2GB (2048MB) + * + * @see #DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB + */ + public IndexWriterConfig setRAMPerThreadHardLimitMB(int perThreadHardLimitMB) { + if (perThreadHardLimitMB <= 0 || perThreadHardLimitMB >= 2048) { + throw new IllegalArgumentException("PerThreadHardLimit must be greater than 0 and less than 2048MB"); + } + this.perThreadHardLimitMB = perThreadHardLimitMB; + return this; + } + + /** + * Returns the max amount of memory each {@link DocumentsWriterPerThread} can + * consume until forcefully flushed. 
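A hedged sketch of the new expert knobs added here, again assuming an existing conf; the 512 MB value is only an example and MyFlushPolicy is a hypothetical implementation:

    conf.setRAMPerThreadHardLimitMB(512);  // force-flush any single DocumentsWriterPerThread above ~512 MB
    conf.setIndexerThreadPool(new ThreadAffinityDocumentsWriterThreadPool());  // the default pool, shown explicitly
    // conf.setFlushPolicy(new MyFlushPolicy());  // expert: plug in a custom FlushPolicy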
+ * @see #setRAMPerThreadHardLimitMB(int) + */ + public int getRAMPerThreadHardLimitMB() { + return perThreadHardLimitMB; + } + /** + * @see #setFlushPolicy(FlushPolicy) + */ + public FlushPolicy getFlushPolicy() { + return flushPolicy; + } + @Override public String toString() { StringBuilder sb = new StringBuilder(); @@ -623,9 +686,13 @@ public final class IndexWriterConfig implements Cloneable { sb.append("mergedSegmentWarmer=").append(mergedSegmentWarmer).append("\n"); sb.append("codecProvider=").append(codecProvider).append("\n"); sb.append("mergePolicy=").append(mergePolicy).append("\n"); - sb.append("maxThreadStates=").append(maxThreadStates).append("\n"); + sb.append("indexerThreadPool=").append(indexerThreadPool).append("\n"); sb.append("readerPooling=").append(readerPooling).append("\n"); sb.append("readerTermsIndexDivisor=").append(readerTermsIndexDivisor).append("\n"); + sb.append("flushPolicy=").append(flushPolicy).append("\n"); + sb.append("perThreadHardLimitMB=").append(perThreadHardLimitMB).append("\n"); + return sb.toString(); } + } diff --git a/lucene/src/java/org/apache/lucene/index/IntBlockPool.java b/lucene/src/java/org/apache/lucene/index/IntBlockPool.java index 013c7b3248f..16093a5c34e 100644 --- a/lucene/src/java/org/apache/lucene/index/IntBlockPool.java +++ b/lucene/src/java/org/apache/lucene/index/IntBlockPool.java @@ -1,5 +1,7 @@ package org.apache.lucene.index; +import java.util.Arrays; + /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with @@ -22,24 +24,24 @@ final class IntBlockPool { public int[][] buffers = new int[10][]; int bufferUpto = -1; // Which buffer we are upto - public int intUpto = DocumentsWriter.INT_BLOCK_SIZE; // Where we are in head buffer + public int intUpto = DocumentsWriterPerThread.INT_BLOCK_SIZE; // Where we are in head buffer public int[] buffer; // Current head buffer - public int intOffset = -DocumentsWriter.INT_BLOCK_SIZE; // Current head offset + public int intOffset = -DocumentsWriterPerThread.INT_BLOCK_SIZE; // Current head offset - final private DocumentsWriter docWriter; + final private DocumentsWriterPerThread docWriter; - public IntBlockPool(DocumentsWriter docWriter) { + public IntBlockPool(DocumentsWriterPerThread docWriter) { this.docWriter = docWriter; } public void reset() { if (bufferUpto != -1) { - if (bufferUpto > 0) - // Recycle all but the first buffer - docWriter.recycleIntBlocks(buffers, 1, 1+bufferUpto); - // Reuse first buffer + if (bufferUpto > 0) { + docWriter.recycleIntBlocks(buffers, 1, bufferUpto-1); + Arrays.fill(buffers, 1, bufferUpto, null); + } bufferUpto = 0; intUpto = 0; intOffset = 0; @@ -57,7 +59,7 @@ final class IntBlockPool { bufferUpto++; intUpto = 0; - intOffset += DocumentsWriter.INT_BLOCK_SIZE; + intOffset += DocumentsWriterPerThread.INT_BLOCK_SIZE; } } diff --git a/lucene/src/java/org/apache/lucene/index/InvertedDocConsumer.java b/lucene/src/java/org/apache/lucene/index/InvertedDocConsumer.java index 76ca1d7fddf..5f4a84072a8 100644 --- a/lucene/src/java/org/apache/lucene/index/InvertedDocConsumer.java +++ b/lucene/src/java/org/apache/lucene/index/InvertedDocConsumer.java @@ -17,20 +17,22 @@ package org.apache.lucene.index; * limitations under the License. 
*/ -import java.util.Collection; -import java.util.Map; import java.io.IOException; +import java.util.Map; abstract class InvertedDocConsumer { - /** Add a new thread */ - abstract InvertedDocConsumerPerThread addThread(DocInverterPerThread docInverterPerThread); - /** Abort (called after hitting AbortException) */ abstract void abort(); /** Flush a new segment */ - abstract void flush(Map
> threadsAndFields, SegmentWriteState state) throws IOException; + abstract void flush(Map fieldsToFlush, SegmentWriteState state) throws IOException; + + abstract InvertedDocConsumerPerField addField(DocInverterPerField docInverterPerField, FieldInfo fieldInfo); + + abstract void startDocument() throws IOException; + + abstract void finishDocument() throws IOException; /** Attempt to free RAM, returning true if any RAM was * freed */ diff --git a/lucene/src/java/org/apache/lucene/index/InvertedDocEndConsumer.java b/lucene/src/java/org/apache/lucene/index/InvertedDocEndConsumer.java index 351529f381b..2477cef5f6f 100644 --- a/lucene/src/java/org/apache/lucene/index/InvertedDocEndConsumer.java +++ b/lucene/src/java/org/apache/lucene/index/InvertedDocEndConsumer.java @@ -17,12 +17,13 @@ package org.apache.lucene.index; * limitations under the License. */ -import java.util.Collection; -import java.util.Map; import java.io.IOException; +import java.util.Map; abstract class InvertedDocEndConsumer { - abstract InvertedDocEndConsumerPerThread addThread(DocInverterPerThread docInverterPerThread); - abstract void flush(Map > threadsAndFields, SegmentWriteState state) throws IOException; + abstract void flush(Map fieldsToFlush, SegmentWriteState state) throws IOException; abstract void abort(); + abstract InvertedDocEndConsumerPerField addField(DocInverterPerField docInverterPerField, FieldInfo fieldInfo); + abstract void startDocument() throws IOException; + abstract void finishDocument() throws IOException; } diff --git a/lucene/src/java/org/apache/lucene/index/InvertedDocEndConsumerPerThread.java b/lucene/src/java/org/apache/lucene/index/InvertedDocEndConsumerPerThread.java deleted file mode 100644 index 4b3119f30e1..00000000000 --- a/lucene/src/java/org/apache/lucene/index/InvertedDocEndConsumerPerThread.java +++ /dev/null @@ -1,25 +0,0 @@ -package org.apache.lucene.index; - -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -abstract class InvertedDocEndConsumerPerThread { - abstract void startDocument(); - abstract InvertedDocEndConsumerPerField addField(DocInverterPerField docInverterPerField, FieldInfo fieldInfo); - abstract void finishDocument(); - abstract void abort(); -} diff --git a/lucene/src/java/org/apache/lucene/index/LogMergePolicy.java b/lucene/src/java/org/apache/lucene/index/LogMergePolicy.java index 669d3b0d901..1be4f26b77f 100644 --- a/lucene/src/java/org/apache/lucene/index/LogMergePolicy.java +++ b/lucene/src/java/org/apache/lucene/index/LogMergePolicy.java @@ -20,7 +20,6 @@ package org.apache.lucene.index; import java.io.IOException; import java.util.ArrayList; import java.util.Collection; -import java.util.Collections; import java.util.Comparator; import java.util.List; import java.util.Set; @@ -72,7 +71,6 @@ public abstract class LogMergePolicy extends MergePolicy { // out there wrote his own LMP ... protected long maxMergeSizeForOptimize = Long.MAX_VALUE; protected int maxMergeDocs = DEFAULT_MAX_MERGE_DOCS; - protected boolean requireContiguousMerge = false; protected double noCFSRatio = DEFAULT_NO_CFS_RATIO; @@ -111,21 +109,6 @@ public abstract class LogMergePolicy extends MergePolicy { writer.get().message("LMP: " + message); } - /** If true, merges must be in-order slice of the - * segments. If false, then the merge policy is free to - * pick any segments. The default is false, which is - * in general more efficient than true since it gives the - * merge policy more freedom to pick closely sized - * segments. */ - public void setRequireContiguousMerge(boolean v) { - requireContiguousMerge = v; - } - - /** See {@link #setRequireContiguousMerge}. */ - public boolean getRequireContiguousMerge() { - return requireContiguousMerge; - } - /** Returns the number of segments that are merged at * once and also controls the total number of segments * allowed to accumulate in the index.
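Since TieredMergePolicy is now the default (see the IndexWriterConfig constructor change earlier in this patch), a sketch of explicitly selecting a log merge policy instead; conf is assumed and the values are arbitrary:

    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    mp.setMergeFactor(10);       // merge 10 segments per level
    mp.setMaxMergeMB(2048.0);    // segments larger than this are left out of normal merges
    conf.setMergePolicy(mp);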
*/ @@ -378,8 +361,6 @@ public abstract class LogMergePolicy extends MergePolicy { return null; } - // TODO: handle non-contiguous merge case differently? - // Find the newest (rightmost) segment that needs to // be optimized (other segments may have been flushed // since optimize started): @@ -499,14 +480,6 @@ public abstract class LogMergePolicy extends MergePolicy { } } - private static class SortByIndex implements Comparator{ - public int compare(SegmentInfoAndLevel o1, SegmentInfoAndLevel o2) { - return o1.index - o2.index; - } - } - - private static final SortByIndex sortByIndex = new SortByIndex(); - /** Checks if any merges are now necessary and returns a * {@link MergePolicy.MergeSpecification} if so. A merge * is necessary when there are more than {@link @@ -532,29 +505,22 @@ public abstract class LogMergePolicy extends MergePolicy { final SegmentInfo info = infos.info(i); long size = size(info); - // When we require contiguous merge, we still add the - // segment to levels to avoid merging "across" a set - // of segment being merged: - if (!requireContiguousMerge && mergingSegments.contains(info)) { - if (verbose()) { - message("seg " + info.name + " already being merged; skip"); - } - continue; - } - // Floor tiny segments if (size < 1) { size = 1; } + final SegmentInfoAndLevel infoLevel = new SegmentInfoAndLevel(info, (float) Math.log(size)/norm, i); levels.add(infoLevel); - if (verbose()) { - message("seg " + info.name + " level=" + infoLevel.level + " size=" + size); - } - } - if (!requireContiguousMerge) { - Collections.sort(levels); + if (verbose()) { + final long segBytes = sizeBytes(info); + String extra = mergingSegments.contains(info) ? " [merging]" : ""; + if (size >= maxMergeSize) { + extra += " [skip: too large]"; + } + message("seg=" + writer.get().segString(info) + " level=" + infoLevel.level + " size=" + String.format("%.3f MB", segBytes/1024/1024.) 
+ extra); + } } final float levelFloor; @@ -614,23 +580,29 @@ public abstract class LogMergePolicy extends MergePolicy { int end = start + mergeFactor; while(end <= 1+upto) { boolean anyTooLarge = false; + boolean anyMerging = false; for(int i=start;i = maxMergeSize || sizeDocs(info) >= maxMergeDocs); + if (mergingSegments.contains(info)) { + anyMerging = true; + break; + } } - if (!anyTooLarge) { + if (anyMerging) { + // skip + } else if (!anyTooLarge) { if (spec == null) spec = new MergeSpecification(); - if (verbose()) { - message(" " + start + " to " + end + ": add this merge"); - } - Collections.sort(levels.subList(start, end), sortByIndex); final SegmentInfos mergeInfos = new SegmentInfos(); for(int i=start;i readers; // used by IndexWriter List readerClones; // used by IndexWriter public final SegmentInfos segments; diff --git a/lucene/src/java/org/apache/lucene/index/MultiFields.java b/lucene/src/java/org/apache/lucene/index/MultiFields.java index 0943aacbdaa..841349a4a33 100644 --- a/lucene/src/java/org/apache/lucene/index/MultiFields.java +++ b/lucene/src/java/org/apache/lucene/index/MultiFields.java @@ -51,7 +51,6 @@ public final class MultiFields extends Fields { private final Fields[] subs; private final ReaderUtil.Slice[] subSlices; private final Map terms = new ConcurrentHashMap (); - private final Map docValues = new ConcurrentHashMap (); /** Returns a single {@link Fields} instance for this * reader, merging fields/terms/docs/positions on the @@ -193,12 +192,6 @@ public final class MultiFields extends Fields { } } - /** This method may return null if the field does not exist.*/ - public static DocValues getDocValues(IndexReader r, String field) throws IOException { - final Fields fields = getFields(r); - return fields == null? null: fields.docValues(field); - } - /** Returns {@link DocsEnum} for the specified field & * term. This may return null if the term does not * exist. */ @@ -283,41 +276,5 @@ public final class MultiFields extends Fields { return result; } - @Override - public DocValues docValues(String field) throws IOException { - DocValues result = docValues.get(field); - if (result == null) { - // Lazy init: first time this field is requested, we - // create & add to docValues: - final List docValuesIndex = new ArrayList (); - int docsUpto = 0; - Type type = null; - // Gather all sub-readers that share this field - for(int i=0;i docValuesIndex = new ArrayList (); - int docsUpto = 0; - Type type = null; - final int numEnums = enumWithSlices.length; - for (int i = 0; i < numEnums; i++) { - FieldsEnumWithSlice withSlice = enumWithSlices[i]; - Slice slice = withSlice.slice; - final DocValues values = withSlice.fields.docValues(); - final int start = slice.start; - final int length = slice.length; - if (values != null && currentField.equals(withSlice.current)) { - if (docsUpto != start) { - type = values.type(); - docValuesIndex.add(new MultiDocValues.DocValuesIndex( - new MultiDocValues.DummyDocValues(start, type), docsUpto, start - - docsUpto)); - } - docValuesIndex.add(new MultiDocValues.DocValuesIndex(values, start, - length)); - docsUpto = start + length; - - } else if (i + 1 == numEnums && !docValuesIndex.isEmpty()) { - docValuesIndex.add(new MultiDocValues.DocValuesIndex( - new MultiDocValues.DummyDocValues(start, type), docsUpto, start - - docsUpto)); - } - } - return docValuesIndex.isEmpty() ? 
null : docValues.reset(docValuesIndex - .toArray(MultiDocValues.DocValuesIndex.EMPTY_ARRAY)); - } +// @Override +// public DocValues docValues() throws IOException { +// final List docValuesIndex = new ArrayList (); +// int docsUpto = 0; +// Type type = null; +// final int numEnums = enumWithSlices.length; +// for (int i = 0; i < numEnums; i++) { +// FieldsEnumWithSlice withSlice = enumWithSlices[i]; +// Slice slice = withSlice.slice; +// final DocValues values = withSlice.fields.docValues(); +// final int start = slice.start; +// final int length = slice.length; +// if (values != null && currentField.equals(withSlice.current)) { +// if (docsUpto != start) { +// type = values.type(); +// docValuesIndex.add(new MultiDocValues.DocValuesIndex( +// new MultiDocValues.DummyDocValues(start, type), docsUpto, start +// - docsUpto)); +// } +// docValuesIndex.add(new MultiDocValues.DocValuesIndex(values, start, +// length)); +// docsUpto = start + length; +// +// } else if (i + 1 == numEnums && !docValuesIndex.isEmpty()) { +// docValuesIndex.add(new MultiDocValues.DocValuesIndex( +// new MultiDocValues.DummyDocValues(start, type), docsUpto, start +// - docsUpto)); +// } +// } +// return docValuesIndex.isEmpty() ? null : docValues.reset(docValuesIndex +// .toArray(MultiDocValues.DocValuesIndex.EMPTY_ARRAY)); +// } } diff --git a/lucene/src/java/org/apache/lucene/index/MultiPerDocValues.java b/lucene/src/java/org/apache/lucene/index/MultiPerDocValues.java new file mode 100644 index 00000000000..bf10a43f6a0 --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/MultiPerDocValues.java @@ -0,0 +1,148 @@ +package org.apache.lucene.index; +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; +import java.util.Map; +import java.util.TreeSet; +import java.util.concurrent.ConcurrentHashMap; + +import org.apache.lucene.index.codecs.PerDocValues; +import org.apache.lucene.index.values.DocValues; +import org.apache.lucene.index.values.MultiDocValues; +import org.apache.lucene.index.values.Type; +import org.apache.lucene.index.values.MultiDocValues.DocValuesIndex; +import org.apache.lucene.util.ReaderUtil; + +/** + * + * nocommit - javadoc + * @experimental + */ +public class MultiPerDocValues extends PerDocValues { + private final PerDocValues[] subs; + private final ReaderUtil.Slice[] subSlices; + private final Map docValues = new ConcurrentHashMap (); + private final TreeSet fields; + + public MultiPerDocValues(PerDocValues[] subs, ReaderUtil.Slice[] subSlices) { + this.subs = subs; + this.subSlices = subSlices; + fields = new TreeSet (); + for (PerDocValues sub : subs) { + fields.addAll(sub.fields()); + } + } + + public static PerDocValues getPerDocs(IndexReader r) throws IOException { + final IndexReader[] subs = r.getSequentialSubReaders(); + if (subs == null) { + // already an atomic reader + return r.perDocValues(); + } else if (subs.length == 0) { + // no fields + return null; + } else if (subs.length == 1) { + return getPerDocs(subs[0]); + } + PerDocValues perDocValues = r.retrievePerDoc(); + if (perDocValues == null) { + + final List producer = new ArrayList (); + final List slices = new ArrayList (); + + new ReaderUtil.Gather(r) { + @Override + protected void add(int base, IndexReader r) throws IOException { + final PerDocValues f = r.perDocValues(); + if (f != null) { + producer.add(f); + slices + .add(new ReaderUtil.Slice(base, r.maxDoc(), producer.size() - 1)); + } + } + }.run(); + + if (producer.size() == 0) { + return null; + } else if (producer.size() == 1) { + perDocValues = producer.get(0); + } else { + perDocValues = new MultiPerDocValues( + producer.toArray(PerDocValues.EMPTY_ARRAY), + slices.toArray(ReaderUtil.Slice.EMPTY_ARRAY)); + } + r.storePerDoc(perDocValues); + } + return perDocValues; + } + + public DocValues docValues(String field) throws IOException { + DocValues result = docValues.get(field); + if (result == null) { + // Lazy init: first time this field is requested, we + // create & add to docValues: + final List docValuesIndex = new ArrayList (); + int docsUpto = 0; + Type type = null; + // Gather all sub-readers that share this field + for (int i = 0; i < subs.length; i++) { + DocValues values = subs[i].docValues(field); + final int start = subSlices[i].start; + final int length = subSlices[i].length; + if (values != null) { + if (docsUpto != start) { + type = values.type(); + docValuesIndex.add(new MultiDocValues.DocValuesIndex( + new MultiDocValues.DummyDocValues(start, type), docsUpto, start + - docsUpto)); + } + docValuesIndex.add(new MultiDocValues.DocValuesIndex(values, start, + length)); + docsUpto = start + length; + + } else if (i + 1 == subs.length && !docValuesIndex.isEmpty()) { + docValuesIndex.add(new MultiDocValues.DocValuesIndex( + new MultiDocValues.DummyDocValues(start, type), docsUpto, start + - docsUpto)); + } + } + if (docValuesIndex.isEmpty()) { + return null; + } + result = new MultiDocValues( + docValuesIndex.toArray(DocValuesIndex.EMPTY_ARRAY)); + docValues.put(field, result); + } + return result; + } + + @Override + public void close() throws IOException { + PerDocValues[] perDocValues = this.subs; + for 
(PerDocValues values : perDocValues) { + values.close(); + } + } + + @Override + public Collection fields() { + return fields; + } +} diff --git a/lucene/src/java/org/apache/lucene/index/MultiReader.java b/lucene/src/java/org/apache/lucene/index/MultiReader.java index c2682e40231..7a943fadcd0 100644 --- a/lucene/src/java/org/apache/lucene/index/MultiReader.java +++ b/lucene/src/java/org/apache/lucene/index/MultiReader.java @@ -24,6 +24,7 @@ import java.util.concurrent.ConcurrentHashMap; import org.apache.lucene.document.Document; import org.apache.lucene.document.FieldSelector; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.ReaderUtil; @@ -403,4 +404,9 @@ public class MultiReader extends IndexReader implements Cloneable { sub.removeReaderFinishedListener(listener); } } + + @Override + public PerDocValues perDocValues() throws IOException { + throw new UnsupportedOperationException("please use MultiPerDoc#getPerDocs, or wrap your IndexReader with SlowMultiReaderWrapper, if you really need a top level Fields"); + } } diff --git a/lucene/src/java/org/apache/lucene/index/NormsWriter.java b/lucene/src/java/org/apache/lucene/index/NormsWriter.java index e0cff83de02..5064a47f3bc 100644 --- a/lucene/src/java/org/apache/lucene/index/NormsWriter.java +++ b/lucene/src/java/org/apache/lucene/index/NormsWriter.java @@ -19,11 +19,7 @@ package org.apache.lucene.index; import java.io.IOException; import java.util.Collection; -import java.util.Iterator; -import java.util.HashMap; import java.util.Map; -import java.util.List; -import java.util.ArrayList; import org.apache.lucene.store.IndexOutput; @@ -36,10 +32,6 @@ import org.apache.lucene.store.IndexOutput; final class NormsWriter extends InvertedDocEndConsumer { - @Override - public InvertedDocEndConsumerPerThread addThread(DocInverterPerThread docInverterPerThread) { - return new NormsWriterPerThread(docInverterPerThread, this); - } @Override public void abort() {} @@ -50,40 +42,11 @@ final class NormsWriter extends InvertedDocEndConsumer { /** Produce _X.nrm if any document had a field with norms * not disabled */ @Override - public void flush(Map > threadsAndFields, SegmentWriteState state) throws IOException { - - final Map > byField = new HashMap >(); - + public void flush(Map fieldsToFlush, SegmentWriteState state) throws IOException { if (!state.fieldInfos.hasNorms()) { return; } - // Typically, each thread will have encountered the same - // field. 
So first we collate by field, ie, all - // per-thread field instances that correspond to the - // same FieldInfo - for (final Map.Entry > entry : threadsAndFields.entrySet()) { - final Collection fields = entry.getValue(); - final Iterator fieldsIt = fields.iterator(); - - while (fieldsIt.hasNext()) { - final NormsWriterPerField perField = (NormsWriterPerField) fieldsIt.next(); - - if (perField.upto > 0) { - // It has some norms - List l = byField.get(perField.fieldInfo); - if (l == null) { - l = new ArrayList (); - byField.put(perField.fieldInfo, l); - } - l.add(perField); - } else - // Remove this field since we haven't seen it - // since the previous flush - fieldsIt.remove(); - } - } - final String normsFileName = IndexFileNames.segmentFileName(state.segmentName, "", IndexFileNames.NORMS_EXTENSION); IndexOutput normsOut = state.directory.createOutput(normsFileName); @@ -93,60 +56,25 @@ final class NormsWriter extends InvertedDocEndConsumer { int normCount = 0; for (FieldInfo fi : state.fieldInfos) { - final List toMerge = byField.get(fi); + final NormsWriterPerField toWrite = (NormsWriterPerField) fieldsToFlush.get(fi); int upto = 0; - if (toMerge != null) { - - final int numFields = toMerge.size(); - + if (toWrite != null && toWrite.upto > 0) { normCount++; - final NormsWriterPerField[] fields = new NormsWriterPerField[numFields]; - int[] uptos = new int[numFields]; - - for(int j=0;j 0) { - - assert uptos[0] < fields[0].docIDs.length : " uptos[0]=" + uptos[0] + " len=" + (fields[0].docIDs.length); - - int minLoc = 0; - int minDocID = fields[0].docIDs[uptos[0]]; - - for(int j=1;j { - final NormsWriterPerThread perThread; final FieldInfo fieldInfo; - final DocumentsWriter.DocState docState; + final DocumentsWriterPerThread.DocState docState; final Similarity similarity; // Holds all docID/norm pairs we've seen @@ -46,10 +45,9 @@ final class NormsWriterPerField extends InvertedDocEndConsumerPerField implement upto = 0; } - public NormsWriterPerField(final DocInverterPerField docInverterPerField, final NormsWriterPerThread perThread, final FieldInfo fieldInfo) { - this.perThread = perThread; + public NormsWriterPerField(final DocInverterPerField docInverterPerField, final FieldInfo fieldInfo) { this.fieldInfo = fieldInfo; - docState = perThread.docState; + docState = docInverterPerField.docState; fieldState = docInverterPerField.fieldState; similarity = docState.similarityProvider.get(fieldInfo.name); } diff --git a/lucene/src/java/org/apache/lucene/index/ParallelReader.java b/lucene/src/java/org/apache/lucene/index/ParallelReader.java index 57476e2cd94..4b5d78d5682 100644 --- a/lucene/src/java/org/apache/lucene/index/ParallelReader.java +++ b/lucene/src/java/org/apache/lucene/index/ParallelReader.java @@ -21,9 +21,8 @@ import org.apache.lucene.document.Document; import org.apache.lucene.document.FieldSelector; import org.apache.lucene.document.FieldSelectorResult; import org.apache.lucene.document.Fieldable; -import org.apache.lucene.index.values.DocValues; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.util.Bits; -import org.apache.lucene.util.Pair; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.MapBackedSet; @@ -183,21 +182,15 @@ public class ParallelReader extends IndexReader { } } - @Override - public DocValues docValues() throws IOException { - assert currentReader != null; - return MultiFields.getDocValues(currentReader, currentField); - } } // Single instance of this, per ParallelReader instance private class ParallelFields 
extends Fields { - final HashMap > fields = new HashMap >(); + final HashMap fields = new HashMap (); public void addField(String field, IndexReader r) throws IOException { Fields multiFields = MultiFields.getFields(r); - fields.put(field, new Pair ( multiFields.terms(field), - multiFields.docValues(field))); + fields.put(field, multiFields.terms(field)); } @Override @@ -206,12 +199,7 @@ public class ParallelReader extends IndexReader { } @Override public Terms terms(String field) throws IOException { - return fields.get(field).cur; - } - - @Override - public DocValues docValues(String field) throws IOException { - return fields.get(field).cud; + return fields.get(field); } } @@ -578,6 +566,12 @@ public class ParallelReader extends IndexReader { reader.removeReaderFinishedListener(listener); } } + + @Override + public PerDocValues perDocValues() throws IOException { + // TODO Auto-generated method stub + return null; + } } diff --git a/lucene/src/java/org/apache/lucene/index/PerDocWriteState.java b/lucene/src/java/org/apache/lucene/index/PerDocWriteState.java new file mode 100644 index 00000000000..652f1b6d5a5 --- /dev/null +++ b/lucene/src/java/org/apache/lucene/index/PerDocWriteState.java @@ -0,0 +1,77 @@ +package org.apache.lucene.index; +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +import java.io.PrintStream; +import java.util.concurrent.atomic.AtomicLong; + +import org.apache.lucene.store.Directory; + +/** + * nocommit - javadoc + * @lucene.experimental + */ +public class PerDocWriteState { + public final PrintStream infoStream; + public final Directory directory; + public final String segmentName; + public final FieldInfos fieldInfos; + public final AtomicLong bytesUsed; + public final SegmentCodecs segmentCodecs; + public final int codecId; + + /** Expert: The fraction of terms in the "dictionary" which should be stored + * in RAM. Smaller values use more memory, but make searching slightly + * faster, while larger values use less memory and make searching slightly + * slower. 
Searching is typically not dominated by dictionary lookup, so + * tweaking this is rarely useful.*/ + public int termIndexInterval; // TODO: this should be private to the codec, not settable here or in IWC + + public PerDocWriteState(PrintStream infoStream, Directory directory, String segmentName, FieldInfos fieldInfos, AtomicLong bytesUsed, int codecId) { + this.infoStream = infoStream; + this.directory = directory; + this.segmentName = segmentName; + this.fieldInfos = fieldInfos; + this.segmentCodecs = fieldInfos.buildSegmentCodecs(false); + this.codecId = codecId; + this.bytesUsed = bytesUsed; + } + + public PerDocWriteState(SegmentWriteState state) { + infoStream = state.infoStream; + directory = state.directory; + segmentCodecs = state.segmentCodecs; + segmentName = state.segmentName; + fieldInfos = state.fieldInfos; + codecId = state.codecId; + bytesUsed = new AtomicLong(0); + } + + public PerDocWriteState(PerDocWriteState state, int codecId) { + this.infoStream = state.infoStream; + this.directory = state.directory; + this.segmentName = state.segmentName; + this.fieldInfos = state.fieldInfos; + this.segmentCodecs = state.segmentCodecs; + this.codecId = codecId; + this.bytesUsed = state.bytesUsed; + } + + + public String codecIdAsString() { + return "" + codecId; + } +} diff --git a/lucene/src/java/org/apache/lucene/index/PerFieldCodecWrapper.java b/lucene/src/java/org/apache/lucene/index/PerFieldCodecWrapper.java index d441aa4e5c3..d1acaf46a5e 100644 --- a/lucene/src/java/org/apache/lucene/index/PerFieldCodecWrapper.java +++ b/lucene/src/java/org/apache/lucene/index/PerFieldCodecWrapper.java @@ -19,6 +19,7 @@ package org.apache.lucene.index; import java.io.IOException; import java.util.ArrayList; +import java.util.Collection; import java.util.HashMap; import java.util.Iterator; import java.util.Map; @@ -28,6 +29,8 @@ import java.util.TreeSet; import org.apache.lucene.index.codecs.Codec; import org.apache.lucene.index.codecs.FieldsConsumer; import org.apache.lucene.index.codecs.FieldsProducer; +import org.apache.lucene.index.codecs.PerDocConsumer; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.index.codecs.TermsConsumer; import org.apache.lucene.index.codecs.docvalues.DocValuesConsumer; import org.apache.lucene.index.values.DocValues; @@ -74,12 +77,6 @@ final class PerFieldCodecWrapper extends Codec { return fields.addField(field); } - @Override - public DocValuesConsumer addValuesField(FieldInfo field) throws IOException { - final FieldsConsumer fields = consumers.get(field.getCodecId()); - return fields.addValuesField(field); - } - @Override public void close() throws IOException { Iterator it = consumers.iterator(); @@ -113,7 +110,7 @@ final class PerFieldCodecWrapper extends Codec { boolean success = false; try { for (FieldInfo fi : fieldInfos) { - if (fi.isIndexed || fi.hasDocValues()) { // TODO this does not work for non-indexed fields + if (fi.isIndexed) { fields.add(fi.name); assert fi.getCodecId() != FieldInfo.UNASSIGNED_CODEC_ID; Codec codec = segmentCodecs.codecs[fi.getCodecId()]; @@ -171,11 +168,6 @@ final class PerFieldCodecWrapper extends Codec { return TermsEnum.EMPTY; } } - - @Override - public DocValues docValues() throws IOException { - return codecs.get(current).docValues(current); - } } @Override @@ -189,12 +181,6 @@ final class PerFieldCodecWrapper extends Codec { return fields == null ? 
null : fields.terms(field); } - @Override - public DocValues docValues(String field) throws IOException { - FieldsProducer fieldsProducer = codecs.get(field); - return fieldsProducer == null? null: fieldsProducer.docValues(field); - } - @Override public void close() throws IOException { Iterator it = codecs.values().iterator(); @@ -244,4 +230,133 @@ final class PerFieldCodecWrapper extends Codec { codec.getExtensions(extensions); } } + + @Override + public PerDocConsumer docsConsumer(PerDocWriteState state) throws IOException { + return new PerDocConsumers(state); + } + + @Override + public PerDocValues docsProducer(SegmentReadState state) throws IOException { + return new PerDocProducers(state.dir, state.fieldInfos, state.segmentInfo, + state.readBufferSize, state.termsIndexDivisor); + } + + private final class PerDocProducers extends PerDocValues { + private final Set fields = new TreeSet (); + private final Map codecs = new HashMap (); + + public PerDocProducers(Directory dir, FieldInfos fieldInfos, SegmentInfo si, + int readBufferSize, int indexDivisor) throws IOException { + final Map producers = new HashMap (); + boolean success = false; + try { + for (FieldInfo fi : fieldInfos) { + if (fi.hasDocValues()) { + fields.add(fi.name); + assert fi.getCodecId() != FieldInfo.UNASSIGNED_CODEC_ID; + Codec codec = segmentCodecs.codecs[fi.getCodecId()]; + if (!producers.containsKey(codec)) { + producers.put(codec, codec.docsProducer(new SegmentReadState(dir, + si, fieldInfos, readBufferSize, indexDivisor, fi.getCodecId()))); + } + codecs.put(fi.name, producers.get(codec)); + } + } + success = true; + } finally { + if (!success) { + // If we hit exception (eg, IOE because writer was + // committing, or, for any other reason) we must + // go back and close all FieldsProducers we opened: + for(PerDocValues producer : producers.values()) { + try { + producer.close(); + } catch (Throwable t) { + // Suppress all exceptions here so we continue + // to throw the original one + } + } + } + } + } + @Override + public Collection fields() { + return fields; + } + @Override + public DocValues docValues(String field) throws IOException { + final PerDocValues perDocProducer = codecs.get(field); + if (perDocProducer == null) { + return null; + } + return perDocProducer.docValues(field); + } + + @Override + public void close() throws IOException { + final Iterator it = codecs.values().iterator(); + IOException err = null; + while (it.hasNext()) { + try { + it.next().close(); + } catch (IOException ioe) { + // keep first IOException we hit but keep + // closing the rest + if (err == null) { + err = ioe; + } + } + } + if (err != null) { + throw err; + } + } + } + + private final class PerDocConsumers extends PerDocConsumer { + private final ArrayList consumers = new ArrayList (); + + public PerDocConsumers(PerDocWriteState state) throws IOException { + assert segmentCodecs == state.segmentCodecs; + final Codec[] codecs = segmentCodecs.codecs; + for (int i = 0; i < codecs.length; i++) { + consumers.add(codecs[i].docsConsumer(new PerDocWriteState(state, i))); + } + } + + @Override + public void close() throws IOException { + Iterator it = consumers.iterator(); + IOException err = null; + while (it.hasNext()) { + try { + PerDocConsumer next = it.next(); + if (next != null) { + next.close(); + } + } catch (IOException ioe) { + // keep first IOException we hit but keep + // closing the rest + if (err == null) { + err = ioe; + } + } + } + if (err != null) { + throw err; + } + } + + @Override + public DocValuesConsumer 
addValuesField(FieldInfo field) throws IOException { + assert field.getCodecId() != FieldInfo.UNASSIGNED_CODEC_ID; + final PerDocConsumer perDoc = consumers.get(field.getCodecId()); + if (perDoc == null) { + return null; + } + return perDoc.addValuesField(field); + } + + } } diff --git a/lucene/src/java/org/apache/lucene/index/SegmentInfo.java b/lucene/src/java/org/apache/lucene/index/SegmentInfo.java index 679367e4bf4..f7999da4219 100644 --- a/lucene/src/java/org/apache/lucene/index/SegmentInfo.java +++ b/lucene/src/java/org/apache/lucene/index/SegmentInfo.java @@ -39,14 +39,14 @@ import org.apache.lucene.util.Constants; /** * Information about a segment such as it's name, directory, and files related * to the segment. - * + * * @lucene.experimental */ public final class SegmentInfo { static final int NO = -1; // e.g. no norms; no deletes; static final int YES = 1; // e.g. have norms; have deletes; - static final int WITHOUT_GEN = 0; // a file name that has no GEN in it. + static final int WITHOUT_GEN = 0; // a file name that has no GEN in it. public String name; // unique name in dir public int docCount; // number of docs in seg @@ -58,7 +58,7 @@ public final class SegmentInfo { * - YES or higher if there are deletes at generation N */ private long delGen; - + /* * Current generation of each field's norm file. If this array is null, * means no separate norms. If this array is not null, its values mean: @@ -67,7 +67,7 @@ public final class SegmentInfo { */ private Map normGen; - private boolean isCompoundFile; + private boolean isCompoundFile; private volatile List files; // cached list of files that this segment uses // in the Directory @@ -75,10 +75,13 @@ public final class SegmentInfo { private volatile long sizeInBytesNoStore = -1; // total byte size of all but the store files (computed on demand) private volatile long sizeInBytesWithStore = -1; // total byte size of all of our files (computed on demand) + //TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) private int docStoreOffset; // if this segment shares stored fields & vectors, this // offset is where in that file this segment's docs begin + //TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) private String docStoreSegment; // name used to derive fields/vectors file we share with // other segments + //TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) private boolean docStoreIsCompoundFile; // whether doc store files are stored in compound file (*.cfx) private int delCount; // How many deleted docs in this segment @@ -93,9 +96,9 @@ public final class SegmentInfo { private Map diagnostics; - // Tracks the Lucene version this segment was created with, since 3.1. Null + // Tracks the Lucene version this segment was created with, since 3.1. Null // indicates an older than 3.0 index, and it's used to detect a too old index. - // The format expected is "x.y" - "2.x" for pre-3.0 indexes (or null), and + // The format expected is "x.y" - "2.x" for pre-3.0 indexes (or null), and // specific versions afterwards ("3.0", "3.1" etc.). // see Constants.LUCENE_MAIN_VERSION. 
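The comment above explains how a segment records the Lucene version that created it ("x.y", with null or a "2.x" value marking pre-3.0 segments). As a rough illustrative sketch only -- not part of this patch, and with an invented class and method name -- the kind of too-old check that version string enables when opening an index could look like:

    // Sketch, not from the patch: reject segments too old to read, based on
    // the recorded per-segment version string described in the comment above.
    class SegmentVersionSketch {
      static void assertReadable(String segmentVersion) {
        // null or a "2.x" version marks a segment from before the 3.0 format
        if (segmentVersion == null || segmentVersion.startsWith("2.")) {
          throw new IllegalStateException(
              "segment was written by a pre-3.0 release and is too old to open");
        }
      }
    }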
private String version; @@ -103,7 +106,7 @@ public final class SegmentInfo { // NOTE: only used in-RAM by IW to track buffered deletes; // this is never written to/read from the Directory private long bufferedDeletesGen; - + public SegmentInfo(String name, int docCount, Directory dir, boolean isCompoundFile, boolean hasProx, SegmentCodecs segmentCodecs, boolean hasVectors, FieldInfos fieldInfos) { this.name = name; @@ -184,11 +187,13 @@ public final class SegmentInfo { docStoreSegment = name; docStoreIsCompoundFile = false; } + if (format > DefaultSegmentInfosWriter.FORMAT_4_0) { // pre-4.0 indexes write a byte if there is a single norms file byte b = input.readByte(); assert 1 == b; } + int numNormGen = input.readInt(); if (numNormGen == NO) { normGen = null; @@ -209,7 +214,7 @@ public final class SegmentInfo { assert delCount <= docCount; hasProx = input.readByte() == YES; - + // System.out.println(Thread.currentThread().getName() + ": si.read hasProx=" + hasProx + " seg=" + name); if (format <= DefaultSegmentInfosWriter.FORMAT_4_0) { segmentCodecs = new SegmentCodecs(codecs, input); @@ -219,7 +224,7 @@ public final class SegmentInfo { segmentCodecs = new SegmentCodecs(codecs, new Codec[] { codecs.lookup("PreFlex")}); } diagnostics = input.readStringStringMap(); - + if (format <= DefaultSegmentInfosWriter.FORMAT_HAS_VECTORS) { hasVectors = input.readByte() == 1; } else { @@ -368,7 +373,7 @@ public final class SegmentInfo { // against this segment return null; } else { - return IndexFileNames.fileNameFromGeneration(name, IndexFileNames.DELETES_EXTENSION, delGen); + return IndexFileNames.fileNameFromGeneration(name, IndexFileNames.DELETES_EXTENSION, delGen); } } @@ -434,7 +439,7 @@ public final class SegmentInfo { if (hasSeparateNorms(number)) { return IndexFileNames.fileNameFromGeneration(name, "s" + number, normGen.get(number)); } else { - // single file for all norms + // single file for all norms return IndexFileNames.fileNameFromGeneration(name, IndexFileNames.NORMS_EXTENSION, WITHOUT_GEN); } } @@ -467,39 +472,74 @@ public final class SegmentInfo { assert delCount <= docCount; } + /** + * @deprecated shared doc stores are not supported in >= 4.0 + */ + @Deprecated public int getDocStoreOffset() { + // TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) return docStoreOffset; } - + + /** + * @deprecated shared doc stores are not supported in >= 4.0 + */ + @Deprecated public boolean getDocStoreIsCompoundFile() { + // TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) return docStoreIsCompoundFile; } - - void setDocStoreIsCompoundFile(boolean v) { - docStoreIsCompoundFile = v; - clearFilesCache(); - } - - public String getDocStoreSegment() { - return docStoreSegment; - } - - public void setDocStoreSegment(String segment) { - docStoreSegment = segment; - } - - void setDocStoreOffset(int offset) { - docStoreOffset = offset; + + /** + * @deprecated shared doc stores are not supported in >= 4.0 + */ + @Deprecated + public void setDocStoreIsCompoundFile(boolean docStoreIsCompoundFile) { + // TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) + this.docStoreIsCompoundFile = docStoreIsCompoundFile; clearFilesCache(); } - void setDocStore(int offset, String segment, boolean isCompoundFile) { + /** + * @deprecated shared doc stores are not supported in >= 4.0 + */ + @Deprecated + void setDocStore(int offset, String segment, boolean isCompoundFile) { + // TODO: LUCENE-2555: remove once we don't need 
to support shared doc stores (pre 4.0) docStoreOffset = offset; docStoreSegment = segment; docStoreIsCompoundFile = isCompoundFile; clearFilesCache(); } - + + /** + * @deprecated shared doc stores are not supported in >= 4.0 + */ + @Deprecated + public String getDocStoreSegment() { + // TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) + return docStoreSegment; + } + + /** + * @deprecated shared doc stores are not supported in >= 4.0 + */ + @Deprecated + void setDocStoreOffset(int offset) { + // TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) + docStoreOffset = offset; + clearFilesCache(); + } + + /** + * @deprecated shared doc stores are not supported in 4.0 + */ + @Deprecated + public void setDocStoreSegment(String docStoreSegment) { + // TODO: LUCENE-2555: remove once we don't need to support shared doc stores (pre 4.0) + this.docStoreSegment = docStoreSegment; + } + /** Save this segment's info. */ public void write(IndexOutput output) throws IOException { @@ -509,12 +549,14 @@ public final class SegmentInfo { output.writeString(name); output.writeInt(docCount); output.writeLong(delGen); + output.writeInt(docStoreOffset); if (docStoreOffset != -1) { output.writeString(docStoreSegment); output.writeByte((byte) (docStoreIsCompoundFile ? 1:0)); } + if (normGen == null) { output.writeInt(NO); } else { @@ -524,7 +566,7 @@ public final class SegmentInfo { output.writeLong(entry.getValue()); } } - + output.writeByte((byte) (isCompoundFile ? YES : NO)); output.writeInt(delCount); output.writeByte((byte) (hasProx ? 1:0)); @@ -572,9 +614,9 @@ public final class SegmentInfo { // Already cached: return files; } - + Set fileSet = new HashSet (); - + boolean useCompoundFile = getUseCompoundFile(); if (useCompoundFile) { @@ -608,7 +650,7 @@ public final class SegmentInfo { fileSet.add(IndexFileNames.segmentFileName(name, "", IndexFileNames.VECTORS_INDEX_EXTENSION)); fileSet.add(IndexFileNames.segmentFileName(name, "", IndexFileNames.VECTORS_DOCUMENTS_EXTENSION)); fileSet.add(IndexFileNames.segmentFileName(name, "", IndexFileNames.VECTORS_FIELDS_EXTENSION)); - } + } } String delFileName = IndexFileNames.fileNameFromGeneration(name, IndexFileNames.DELETES_EXTENSION, delGen); @@ -646,7 +688,7 @@ public final class SegmentInfo { } /** Used for debugging. Format may suddenly change. - * + * * Current format looks like *
_a(3.1):c45/4->_1, which means the segment's * name is _a
; it was created with Lucene 3.1 (or @@ -661,7 +703,6 @@ public final class SegmentInfo { StringBuilder s = new StringBuilder(); s.append(name).append('(').append(version == null ? "?" : version).append(')').append(':'); - char cfs = getUseCompoundFile() ? 'c' : 'C'; s.append(cfs); @@ -677,7 +718,7 @@ public final class SegmentInfo { if (delCount != 0) { s.append('/').append(delCount); } - + if (docStoreOffset != -1) { s.append("->").append(docStoreSegment); if (docStoreIsCompoundFile) { @@ -717,13 +758,13 @@ public final class SegmentInfo { * NOTE: this method is used for internal purposes only - you should * not modify the version of a SegmentInfo, or it may result in unexpected * exceptions thrown when you attempt to open the index. - * + * * @lucene.internal */ public void setVersion(String version) { this.version = version; } - + /** Returns the version of the code which wrote the segment. */ public String getVersion() { return version; diff --git a/lucene/src/java/org/apache/lucene/index/SegmentMerger.java b/lucene/src/java/org/apache/lucene/index/SegmentMerger.java index 2303207149c..46c050e3588 100644 --- a/lucene/src/java/org/apache/lucene/index/SegmentMerger.java +++ b/lucene/src/java/org/apache/lucene/index/SegmentMerger.java @@ -22,7 +22,6 @@ import java.util.ArrayList; import java.util.Arrays; import java.util.Collection; import java.util.List; -import java.util.concurrent.atomic.AtomicLong; import org.apache.lucene.document.Document; import org.apache.lucene.index.IndexReader.FieldOption; @@ -31,6 +30,8 @@ import org.apache.lucene.index.codecs.Codec; import org.apache.lucene.index.codecs.CodecProvider; import org.apache.lucene.index.codecs.FieldsConsumer; import org.apache.lucene.index.codecs.MergeState; +import org.apache.lucene.index.codecs.PerDocConsumer; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.store.Directory; import org.apache.lucene.store.IndexInput; import org.apache.lucene.store.IndexOutput; @@ -40,24 +41,24 @@ import org.apache.lucene.util.ReaderUtil; /** * The SegmentMerger class combines two or more Segments, represented by an IndexReader ({@link #add}, - * into a single Segment. After adding the appropriate readers, call the merge method to combine the + * into a single Segment. After adding the appropriate readers, call the merge method to combine the * segments. 
- * + * * @see #merge * @see #add */ final class SegmentMerger { - + /** norms header placeholder */ - static final byte[] NORMS_HEADER = new byte[]{'N','R','M',-1}; - + static final byte[] NORMS_HEADER = new byte[]{'N','R','M',-1}; + private Directory directory; private String segment; private int termIndexInterval = IndexWriterConfig.DEFAULT_TERM_INDEX_INTERVAL; private Listreaders = new ArrayList (); private final FieldInfos fieldInfos; - + private int mergedDocs; private final MergeState.CheckAbort checkAbort; @@ -65,12 +66,12 @@ final class SegmentMerger { /** Maximum number of contiguous documents to bulk-copy when merging stored fields */ private final static int MAX_RAW_MERGE_DOCS = 4192; - + private Codec codec; private SegmentWriteState segmentWriteState; private PayloadProcessorProvider payloadProcessorProvider; - + SegmentMerger(Directory dir, int termIndexInterval, String name, MergePolicy.OneMerge merge, CodecProvider codecs, PayloadProcessorProvider payloadProcessorProvider, FieldInfos fieldInfos) { this.payloadProcessorProvider = payloadProcessorProvider; directory = dir; @@ -133,10 +134,10 @@ final class SegmentMerger { for (String file : files) { cfsWriter.addFile(file); } - + // Perform the merge cfsWriter.close(); - + return files; } @@ -194,13 +195,12 @@ final class SegmentMerger { } /** - * + * * @return The number of documents in all of the readers * @throws CorruptIndexException if the index is corrupt * @throws IOException if there is a low-level IO error */ private int mergeFields() throws CorruptIndexException, IOException { - for (IndexReader reader : readers) { if (reader instanceof SegmentReader) { SegmentReader segmentReader = (SegmentReader) reader; @@ -263,8 +263,8 @@ final class SegmentMerger { // details. throw new RuntimeException("mergeFields produced an invalid result: docCount is " + docCount + " but fdx file size is " + fdxFileLength + " file=" + fileName + " file exists?=" + directory.fileExists(fileName) + "; now aborting this merge to prevent index corruption"); - segmentWriteState = new SegmentWriteState(null, directory, segment, fieldInfos, docCount, termIndexInterval, codecInfo, null, new AtomicLong(0)); - + segmentWriteState = new SegmentWriteState(null, directory, segment, fieldInfos, docCount, termIndexInterval, codecInfo, null); + return docCount; } @@ -282,7 +282,7 @@ final class SegmentMerger { ++j; continue; } - // We can optimize this case (doing a bulk byte copy) since the field + // We can optimize this case (doing a bulk byte copy) since the field // numbers are identical int start = j, numDocs = 0; do { @@ -294,7 +294,7 @@ final class SegmentMerger { break; } } while(numDocs < MAX_RAW_MERGE_DOCS); - + IndexInput stream = matchingFieldsReader.rawDocs(rawDocLengths, start, numDocs); fieldsWriter.addRawDocuments(stream, rawDocLengths, numDocs); docCount += numDocs; @@ -348,7 +348,7 @@ final class SegmentMerger { * @throws IOException */ private final void mergeVectors() throws IOException { - TermVectorsWriter termVectorsWriter = + TermVectorsWriter termVectorsWriter = new TermVectorsWriter(directory, segment, fieldInfos); try { @@ -368,7 +368,7 @@ final class SegmentMerger { copyVectorsWithDeletions(termVectorsWriter, matchingVectorsReader, reader); } else { copyVectorsNoDeletions(termVectorsWriter, matchingVectorsReader, reader); - + } } } finally { @@ -401,7 +401,7 @@ final class SegmentMerger { ++docNum; continue; } - // We can optimize this case (doing a bulk byte copy) since the field + // We can optimize this case (doing a bulk 
byte copy) since the field // numbers are identical int start = docNum, numDocs = 0; do { @@ -413,7 +413,7 @@ final class SegmentMerger { break; } } while(numDocs < MAX_RAW_MERGE_DOCS); - + matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, start, numDocs); termVectorsWriter.addRawDocuments(matchingVectorsReader, rawDocLengths, rawDocLengths2, numDocs); checkAbort.work(300 * numDocs); @@ -424,7 +424,7 @@ final class SegmentMerger { // skip deleted docs continue; } - + // NOTE: it's very important to first assign to vectors then pass it to // termVectorsWriter.addAllDocVectors; see LUCENE-1282 TermFreqVector[] vectors = reader.getTermFreqVectors(docNum); @@ -433,7 +433,7 @@ final class SegmentMerger { } } } - + private void copyVectorsNoDeletions(final TermVectorsWriter termVectorsWriter, final TermVectorsReader matchingVectorsReader, final IndexReader reader) @@ -469,13 +469,20 @@ final class SegmentMerger { // Let CodecProvider decide which codec will be used to write // the new segment: - + int docBase = 0; final List fields = new ArrayList (); + final List slices = new ArrayList (); final List bits = new ArrayList (); final List bitsStarts = new ArrayList (); + + // TODO: move this into its own method - this merges currently only docvalues + final List perDocProducers = new ArrayList (); + final List perDocSlices = new ArrayList (); + final List perDocBits = new ArrayList (); + final List perDocBitsStarts = new ArrayList (); for(IndexReader r : readers) { final Fields f = r.fields(); @@ -486,10 +493,18 @@ final class SegmentMerger { bits.add(r.getDeletedDocs()); bitsStarts.add(docBase); } + final PerDocValues producer = r.perDocValues(); + if (producer != null) { + perDocSlices.add(new ReaderUtil.Slice(docBase, maxDoc, fields.size())); + perDocProducers.add(producer); + perDocBits.add(r.getDeletedDocs()); + perDocBitsStarts.add(docBase); + } docBase += maxDoc; } bitsStarts.add(docBase); + perDocBitsStarts.add(docBase); // we may gather more readers than mergeState.readerCount mergeState = new MergeState(); @@ -497,7 +512,7 @@ final class SegmentMerger { mergeState.readerCount = readers.size(); mergeState.fieldInfos = fieldInfos; mergeState.mergedDocCount = mergedDocs; - + // Remap docIDs mergeState.delCounts = new int[mergeState.readerCount]; mergeState.docMaps = new int[mergeState.readerCount][]; @@ -535,7 +550,7 @@ final class SegmentMerger { } assert delCount == mergeState.delCounts[i]: "reader delCount=" + mergeState.delCounts[i] + " vs recomputed delCount=" + delCount; } - + if (payloadProcessorProvider != null) { mergeState.dirPayloadProcessor[i] = payloadProcessorProvider.getDirProcessor(reader.directory()); } @@ -548,7 +563,7 @@ final class SegmentMerger { // apart when we step through the docs enums in // MultiDocsEnum. 
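The loop above accumulates a running docBase across the sub-readers, so each reader's local docIDs are shifted by the sum of the earlier readers' maxDoc() values; those offsets are what the bitsStarts/perDocBitsStarts lists feed into the MultiBits construction that follows. A minimal standalone sketch of that bookkeeping (not code from this patch; the class name is invented):

    // Sketch, not from the patch: reader i's local docID d maps to
    // bases[i] + d in the merged segment's docID space.
    class DocBaseSketch {
      static int[] docBases(int[] maxDocs) {
        int[] bases = new int[maxDocs.length];
        int docBase = 0;
        for (int i = 0; i < maxDocs.length; i++) {
          bases[i] = docBase;      // where reader i starts in the merged space
          docBase += maxDocs[i];   // advance past reader i's documents
        }
        return bases;
      }
    }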
mergeState.multiDeletedDocs = new MultiBits(bits, bitsStarts); - + try { consumer.merge(mergeState, new MultiFields(fields.toArray(Fields.EMPTY_ARRAY), @@ -556,6 +571,21 @@ final class SegmentMerger { } finally { consumer.close(); } + if (!perDocSlices.isEmpty()) { + mergeState.multiDeletedDocs = new MultiBits(perDocBits, perDocBitsStarts); + final PerDocConsumer docsConsumer = codec + .docsConsumer(new PerDocWriteState(segmentWriteState)); + try { + docsConsumer.merge( + mergeState, + new MultiPerDocValues(perDocProducers + .toArray(PerDocValues.EMPTY_ARRAY), perDocSlices + .toArray(ReaderUtil.Slice.EMPTY_ARRAY))); + } finally { + docsConsumer.close(); + } + } + } private MergeState mergeState; @@ -567,7 +597,7 @@ final class SegmentMerger { int[] getDelCounts() { return mergeState.delCounts; } - + public boolean getAnyNonBulkMerges() { assert matchedCount <= readers.size(); return matchedCount != readers.size(); @@ -578,7 +608,7 @@ final class SegmentMerger { try { for (FieldInfo fi : fieldInfos) { if (fi.isIndexed && !fi.omitNorms) { - if (output == null) { + if (output == null) { output = directory.createOutput(IndexFileNames.segmentFileName(segment, "", IndexFileNames.NORMS_EXTENSION)); output.writeBytes(NORMS_HEADER,NORMS_HEADER.length); } @@ -609,7 +639,7 @@ final class SegmentMerger { } } } finally { - if (output != null) { + if (output != null) { output.close(); } } diff --git a/lucene/src/java/org/apache/lucene/index/SegmentReader.java b/lucene/src/java/org/apache/lucene/index/SegmentReader.java index 90607888af6..0aa94487adf 100644 --- a/lucene/src/java/org/apache/lucene/index/SegmentReader.java +++ b/lucene/src/java/org/apache/lucene/index/SegmentReader.java @@ -28,20 +28,16 @@ import java.util.Set; import java.util.concurrent.atomic.AtomicInteger; import org.apache.lucene.document.Document; -import org.apache.lucene.document.Field; import org.apache.lucene.document.FieldSelector; import org.apache.lucene.index.codecs.FieldsProducer; +import org.apache.lucene.index.codecs.PerDocValues; import org.apache.lucene.store.BufferedIndexInput; import org.apache.lucene.store.Directory; import org.apache.lucene.store.IndexInput; import org.apache.lucene.store.IndexOutput; import org.apache.lucene.util.BitVector; import org.apache.lucene.util.Bits; -import org.apache.lucene.index.values.Bytes; -import org.apache.lucene.index.values.Ints; import org.apache.lucene.index.values.DocValues; -import org.apache.lucene.index.values.Floats; -import org.apache.lucene.index.values.Type; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.CloseableThreadLocal; @@ -61,6 +57,9 @@ public class SegmentReader extends IndexReader implements Cloneable { AtomicInteger deletedDocsRef = null; private boolean deletedDocsDirty = false; private boolean normsDirty = false; + + // TODO: we should move this tracking into SegmentInfo; + // this way SegmentInfo.toString shows pending deletes private int pendingDeleteCount; private boolean rollbackHasChanges = false; @@ -91,6 +90,7 @@ public class SegmentReader extends IndexReader implements Cloneable { final FieldInfos fieldInfos; final FieldsProducer fields; + final PerDocValues perDocProducer; final Directory dir; final Directory cfsDir; @@ -130,8 +130,10 @@ public class SegmentReader extends IndexReader implements Cloneable { this.termsIndexDivisor = termsIndexDivisor; // Ask codec for its Fields - fields = segmentCodecs.codec().fieldsProducer(new SegmentReadState(cfsDir, si, fieldInfos, readBufferSize, termsIndexDivisor)); + final 
SegmentReadState segmentReadState = new SegmentReadState(cfsDir, si, fieldInfos, readBufferSize, termsIndexDivisor); + fields = segmentCodecs.codec().fieldsProducer(segmentReadState); assert fields != null; + perDocProducer = segmentCodecs.codec().docsProducer(segmentReadState); success = true; } finally { if (!success) { @@ -169,6 +171,10 @@ public class SegmentReader extends IndexReader implements Cloneable { if (fields != null) { fields.close(); } + + if (perDocProducer != null) { + perDocProducer.close(); + } if (termVectorsReaderOrig != null) { termVectorsReaderOrig.close(); @@ -808,8 +814,9 @@ public class SegmentReader extends IndexReader implements Cloneable { oldRef.decrementAndGet(); } deletedDocsDirty = true; - if (!deletedDocs.getAndSet(docNum)) + if (!deletedDocs.getAndSet(docNum)) { pendingDeleteCount++; + } } @Override @@ -1211,6 +1218,11 @@ public class SegmentReader extends IndexReader implements Cloneable { @Override public DocValues docValues(String field) throws IOException { - return core.fields.docValues(field); + return core.perDocProducer.docValues(field); + } + + @Override + public PerDocValues perDocValues() throws IOException { + return core.perDocProducer; } } diff --git a/lucene/src/java/org/apache/lucene/index/SegmentWriteState.java b/lucene/src/java/org/apache/lucene/index/SegmentWriteState.java index 1b273f56cad..c29add9bd93 100644 --- a/lucene/src/java/org/apache/lucene/index/SegmentWriteState.java +++ b/lucene/src/java/org/apache/lucene/index/SegmentWriteState.java @@ -33,7 +33,6 @@ public class SegmentWriteState { public final FieldInfos fieldInfos; public final int numDocs; public boolean hasVectors; - public final AtomicLong bytesUsed; // Deletes to apply while we are flushing the segment. A // Term is enrolled in here if it was deleted at one @@ -56,7 +55,7 @@ public class SegmentWriteState { public int termIndexInterval; // TODO: this should be private to the codec, not settable here or in IWC public SegmentWriteState(PrintStream infoStream, Directory directory, String segmentName, FieldInfos fieldInfos, - int numDocs, int termIndexInterval, SegmentCodecs segmentCodecs, BufferedDeletes segDeletes, AtomicLong bytesUsed) { + int numDocs, int termIndexInterval, SegmentCodecs segmentCodecs, BufferedDeletes segDeletes) { this.infoStream = infoStream; this.segDeletes = segDeletes; this.directory = directory; @@ -66,7 +65,6 @@ public class SegmentWriteState { this.termIndexInterval = termIndexInterval; this.segmentCodecs = segmentCodecs; codecId = -1; - this.bytesUsed = bytesUsed; } /** @@ -82,7 +80,6 @@ public class SegmentWriteState { segmentCodecs = state.segmentCodecs; this.codecId = codecId; segDeletes = state.segDeletes; - bytesUsed = state.bytesUsed; } public String codecIdAsString() { diff --git a/lucene/src/java/org/apache/lucene/index/SlowMultiReaderWrapper.java b/lucene/src/java/org/apache/lucene/index/SlowMultiReaderWrapper.java index 78c834f8008..82e976098ff 100644 --- a/lucene/src/java/org/apache/lucene/index/SlowMultiReaderWrapper.java +++ b/lucene/src/java/org/apache/lucene/index/SlowMultiReaderWrapper.java @@ -26,6 +26,7 @@ import org.apache.lucene.util.ReaderUtil; // javadoc import org.apache.lucene.index.DirectoryReader; // javadoc import org.apache.lucene.index.MultiReader; // javadoc +import org.apache.lucene.index.codecs.PerDocValues; /** * This class forces a composite reader (eg a {@link @@ -64,6 +65,11 @@ public final class SlowMultiReaderWrapper extends FilterIndexReader { return MultiFields.getFields(in); } + @Override + public 
PerDocValues perDocValues() throws IOException { + return MultiPerDocValues.getPerDocs(in); + } + @Override public Bits getDeletedDocs() { return MultiFields.getDeletedDocs(in); diff --git a/lucene/src/java/org/apache/lucene/index/StoredFieldsWriter.java b/lucene/src/java/org/apache/lucene/index/StoredFieldsWriter.java index 9f04dcb9786..c3aa5c86b60 100644 --- a/lucene/src/java/org/apache/lucene/index/StoredFieldsWriter.java +++ b/lucene/src/java/org/apache/lucene/index/StoredFieldsWriter.java @@ -18,7 +18,8 @@ package org.apache.lucene.index; */ import java.io.IOException; -import org.apache.lucene.store.RAMOutputStream; + +import org.apache.lucene.document.Fieldable; import org.apache.lucene.util.ArrayUtil; import org.apache.lucene.util.RamUsageEstimator; @@ -26,22 +27,38 @@ import org.apache.lucene.util.RamUsageEstimator; final class StoredFieldsWriter { FieldsWriter fieldsWriter; - final DocumentsWriter docWriter; + final DocumentsWriterPerThread docWriter; int lastDocID; - PerDoc[] docFreeList = new PerDoc[1]; int freeCount; - public StoredFieldsWriter(DocumentsWriter docWriter) { + final DocumentsWriterPerThread.DocState docState; + + public StoredFieldsWriter(DocumentsWriterPerThread docWriter) { this.docWriter = docWriter; + this.docState = docWriter.docState; } - public StoredFieldsWriterPerThread addThread(DocumentsWriter.DocState docState) throws IOException { - return new StoredFieldsWriterPerThread(docState, this); + private int numStoredFields; + private Fieldable[] storedFields; + private int[] fieldNumbers; + + public void reset() { + numStoredFields = 0; + storedFields = new Fieldable[1]; + fieldNumbers = new int[1]; } - synchronized public void flush(SegmentWriteState state) throws IOException { - if (state.numDocs > lastDocID) { + public void startDocument() { + reset(); + } + + public void flush(SegmentWriteState state) throws IOException { + + if (state.numDocs > 0) { + // It's possible that all documents seen in this segment + // hit non-aborting exceptions, in which case we will + // not have yet init'd the FieldsWriter: initFieldsWriter(); fill(state.numDocs); } @@ -67,23 +84,9 @@ final class StoredFieldsWriter { int allocCount; - synchronized PerDoc getPerDoc() { - if (freeCount == 0) { - allocCount++; - if (allocCount > docFreeList.length) { - // Grow our free list up front to make sure we have - // enough space to recycle all outstanding PerDoc - // instances - assert allocCount == 1+docFreeList.length; - docFreeList = new PerDoc[ArrayUtil.oversize(allocCount, RamUsageEstimator.NUM_BYTES_OBJECT_REF)]; - } - return new PerDoc(); - } else { - return docFreeList[--freeCount]; - } - } + void abort() { + reset(); - synchronized void abort() { if (fieldsWriter != null) { fieldsWriter.abort(); fieldsWriter = null; @@ -101,53 +104,40 @@ final class StoredFieldsWriter { } } - synchronized void finishDocument(PerDoc perDoc) throws IOException { + void finishDocument() throws IOException { assert docWriter.writer.testPoint("StoredFieldsWriter.finishDocument start"); + initFieldsWriter(); + fill(docState.docID); - fill(perDoc.docID); + if (fieldsWriter != null && numStoredFields > 0) { + fieldsWriter.startDocument(numStoredFields); + for (int i = 0; i < numStoredFields; i++) { + fieldsWriter.writeField(fieldNumbers[i], storedFields[i]); + } + lastDocID++; + } - // Append stored fields to the real FieldsWriter: - fieldsWriter.flushDocument(perDoc.numStoredFields, perDoc.fdt); - lastDocID++; - perDoc.reset(); - free(perDoc); + reset(); assert 
docWriter.writer.testPoint("StoredFieldsWriter.finishDocument end"); } - synchronized void free(PerDoc perDoc) { - assert freeCount < docFreeList.length; - assert 0 == perDoc.numStoredFields; - assert 0 == perDoc.fdt.length(); - assert 0 == perDoc.fdt.getFilePointer(); - docFreeList[freeCount++] = perDoc; - } - - class PerDoc extends DocumentsWriter.DocWriter { - final DocumentsWriter.PerDocBuffer buffer = docWriter.newPerDocBuffer(); - RAMOutputStream fdt = new RAMOutputStream(buffer); - int numStoredFields; - - void reset() { - fdt.reset(); - buffer.recycle(); - numStoredFields = 0; + public void addField(Fieldable field, FieldInfo fieldInfo) throws IOException { + if (numStoredFields == storedFields.length) { + int newSize = ArrayUtil.oversize(numStoredFields + 1, RamUsageEstimator.NUM_BYTES_OBJECT_REF); + Fieldable[] newArray = new Fieldable[newSize]; + System.arraycopy(storedFields, 0, newArray, 0, numStoredFields); + storedFields = newArray; } - @Override - void abort() { - reset(); - free(this); + if (numStoredFields == fieldNumbers.length) { + fieldNumbers = ArrayUtil.grow(fieldNumbers); } - @Override - public long sizeInBytes() { - return buffer.getSizeInBytes(); - } + storedFields[numStoredFields] = field; + fieldNumbers[numStoredFields] = fieldInfo.number; + numStoredFields++; - @Override - public void finish() throws IOException { - finishDocument(this); - } + assert docState.testPoint("StoredFieldsWriterPerThread.processFields.writeField"); } } diff --git a/lucene/src/java/org/apache/lucene/index/StoredFieldsWriterPerThread.java b/lucene/src/java/org/apache/lucene/index/StoredFieldsWriterPerThread.java deleted file mode 100644 index 85c6b57583b..00000000000 --- a/lucene/src/java/org/apache/lucene/index/StoredFieldsWriterPerThread.java +++ /dev/null @@ -1,79 +0,0 @@ -package org.apache.lucene.index; - -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.IOException; -import org.apache.lucene.store.IndexOutput; -import org.apache.lucene.document.Fieldable; - -final class StoredFieldsWriterPerThread { - - final FieldsWriter localFieldsWriter; - final StoredFieldsWriter storedFieldsWriter; - final DocumentsWriter.DocState docState; - - StoredFieldsWriter.PerDoc doc; - - public StoredFieldsWriterPerThread(DocumentsWriter.DocState docState, StoredFieldsWriter storedFieldsWriter) throws IOException { - this.storedFieldsWriter = storedFieldsWriter; - this.docState = docState; - localFieldsWriter = new FieldsWriter((IndexOutput) null, (IndexOutput) null); - } - - public void startDocument() { - if (doc != null) { - // Only happens if previous document hit non-aborting - // exception while writing stored fields into - // localFieldsWriter: - doc.reset(); - doc.docID = docState.docID; - } - } - - public void addField(Fieldable field, FieldInfo fieldInfo) throws IOException { - if (doc == null) { - doc = storedFieldsWriter.getPerDoc(); - doc.docID = docState.docID; - localFieldsWriter.setFieldsStream(doc.fdt); - assert doc.numStoredFields == 0: "doc.numStoredFields=" + doc.numStoredFields; - assert 0 == doc.fdt.length(); - assert 0 == doc.fdt.getFilePointer(); - } - - localFieldsWriter.writeField(fieldInfo, field); - assert docState.testPoint("StoredFieldsWriterPerThread.processFields.writeField"); - doc.numStoredFields++; - } - - public DocumentsWriter.DocWriter finishDocument() { - // If there were any stored fields in this doc, doc will - // be non-null; else it's null. - try { - return doc; - } finally { - doc = null; - } - } - - public void abort() { - if (doc != null) { - doc.abort(); - doc = null; - } - } -} diff --git a/lucene/src/java/org/apache/lucene/index/TermVectorsTermsWriter.java b/lucene/src/java/org/apache/lucene/index/TermVectorsTermsWriter.java index a5d631efc53..da43f3ad311 100644 --- a/lucene/src/java/org/apache/lucene/index/TermVectorsTermsWriter.java +++ b/lucene/src/java/org/apache/lucene/index/TermVectorsTermsWriter.java @@ -17,49 +17,48 @@ package org.apache.lucene.index; * limitations under the License. 
*/ +import java.io.IOException; +import java.util.Map; + import org.apache.lucene.store.IndexOutput; -import org.apache.lucene.store.RAMOutputStream; import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.IOUtils; import org.apache.lucene.util.RamUsageEstimator; -import java.io.IOException; -import java.util.Collection; - -import java.util.Map; - final class TermVectorsTermsWriter extends TermsHashConsumer { - final DocumentsWriter docWriter; - PerDoc[] docFreeList = new PerDoc[1]; + final DocumentsWriterPerThread docWriter; int freeCount; IndexOutput tvx; IndexOutput tvd; IndexOutput tvf; int lastDocID; + + final DocumentsWriterPerThread.DocState docState; + final BytesRef flushTerm = new BytesRef(); + + // Used by perField when serializing the term vectors + final ByteSliceReader vectorSliceReader = new ByteSliceReader(); boolean hasVectors; - public TermVectorsTermsWriter(DocumentsWriter docWriter) { + public TermVectorsTermsWriter(DocumentsWriterPerThread docWriter) { this.docWriter = docWriter; + docState = docWriter.docState; } @Override - public TermsHashConsumerPerThread addThread(TermsHashPerThread termsHashPerThread) { - return new TermVectorsTermsWriterPerThread(termsHashPerThread, this); - } - - @Override - synchronized void flush(Map > threadsAndFields, final SegmentWriteState state) throws IOException { + void flush(Map fieldsToFlush, final SegmentWriteState state) throws IOException { if (tvx != null) { // At least one doc in this run had term vectors enabled fill(state.numDocs); + assert state.segmentName != null; + String idxName = IndexFileNames.segmentFileName(state.segmentName, "", IndexFileNames.VECTORS_INDEX_EXTENSION); tvx.close(); tvf.close(); tvd.close(); tvx = tvd = tvf = null; - assert state.segmentName != null; - String idxName = IndexFileNames.segmentFileName(state.segmentName, "", IndexFileNames.VECTORS_INDEX_EXTENSION); - if (4 + ((long) state.numDocs) * 16 != state.directory.fileLength(idxName)) { + if (4+((long) state.numDocs)*16 != state.directory.fileLength(idxName)) { throw new RuntimeException("after flush: tvx size mismatch: " + state.numDocs + " docs vs " + state.directory.fileLength(idxName) + " length in bytes of " + idxName + " file exists?=" + state.directory.fileExists(idxName)); } @@ -68,33 +67,10 @@ final class TermVectorsTermsWriter extends TermsHashConsumer { hasVectors = false; } - for (Map.Entry