LUCENE-9344: Convert .txt files to properly formatted .md files (#1449)

Tomoko Uchida 2020-04-24 14:28:12 +09:00 committed by GitHub
parent a11b78e06a
commit c7697b088c
7 changed files with 77 additions and 63 deletions

@ -1,17 +1,17 @@
Lucene Build Instructions
# Lucene Build Instructions
Basic steps:
0) Install OpenJDK 11 (or greater), Ant 1.8.2+, Ivy 2.2.0
1) Download Lucene from Apache and unpack it
2) Connect to the top-level of your Lucene installation
3) Install JavaCC (optional)
4) Run ant
## Basic steps:
Step 0) Set up your development environment (OpenJDK 11 or greater,
Ant 1.8.2+, Ivy 2.2.0)
0. Install OpenJDK 11 (or greater), Ant 1.8.2+, Ivy 2.2.0
1. Download Lucene from Apache and unpack it
2. Connect to the top-level of your Lucene installation
3. Install JavaCC (optional)
4. Run ant
## Step 0) Set up your development environment (OpenJDK 11 or greater, Ant 1.8.2+, Ivy 2.2.0)
We'll assume that you know how to get and set up the JDK - if you
don't, then we suggest starting at http://www.oracle.com/java/ and learning
don't, then we suggest starting at https://www.oracle.com/java/ and learning
more about Java, before returning to this README. Lucene runs with
Java 11 and later.
@ -22,31 +22,31 @@ Ant is "kind of like make without make's wrinkles". Ant is
implemented in java and uses XML-based configuration files. You can
get it at:
http://ant.apache.org
https://ant.apache.org
You'll need to download the Ant binary distribution. Install it
according to the instructions at:
http://ant.apache.org/manual
https://ant.apache.org/manual
Finally, you'll need to install ivy into your ant lib folder
(~/.ant/lib). You can get it from http://ant.apache.org/ivy/.
If you skip this step, the Lucene build system will offer to do it
for you.
Step 1) Download Lucene from Apache
## Step 1) Download Lucene from Apache
We'll assume you already did this, or you wouldn't be reading this
file. However, you might have received this file by some alternate
route, or you might have an incomplete copy of Lucene, so: Lucene
releases are available for download at:
http://www.apache.org/dyn/closer.cgi/lucene/java/
https://www.apache.org/dyn/closer.cgi/lucene/java/
Download either a zip or a tarred/gzipped version of the archive, and
uncompress it into a directory of your choice.
Step 2) From the command line, change (cd) into the top-level directory of your Lucene installation
## Step 2) From the command line, change (cd) into the top-level directory of your Lucene installation
Lucene's top-level directory contains the build.xml file. By default,
you do not need to change any of the settings in this file, but you do
@ -66,7 +66,7 @@ system.
NOTE: the ~ character represents your user account home directory.
Step 3) Run ant
## Step 4) Run ant
Assuming you have ant in your PATH and have set ANT_HOME to the
location of your ant installation, typing "ant" at the shell prompt
@ -76,10 +76,12 @@ and command prompt should run ant. Ant will by default look for the
If you want to build the documentation, type "ant documentation".
For further information on Lucene, go to:
http://lucene.apache.org/
https://lucene.apache.org/
Please join the Lucene-User mailing list by visiting this site:
http://lucene.apache.org/core/discussion.html
https://lucene.apache.org/core/discussion.html
Please post suggestions, questions, corrections or additions to this
document to the lucene-user mailing list.
@ -87,4 +89,4 @@ document to the lucene-user mailing list.
This file was originally written by Steven J. Owens <puff@darksleep.com>.
This file was modified by Jon S. Stevens <jon@latchkey.com>.
Copyright (c) 2001-2005 The Apache Software Foundation. All rights reserved.
Copyright (c) 2001-2020 The Apache Software Foundation. All rights reserved.

@ -115,6 +115,8 @@ Other
* LUCENE-8656: Deprecations in FuzzyQuery and get compiler warnings out of
queryparser code (Alan Woodward, Erick Erickson)
* LUCENE-9344: Convert .txt files to properly formatted .md files. (Tomoko Uchida, Uwe Schindler)
======================= Lucene 8.6.0 =======================
API Changes

@ -19,16 +19,16 @@ For reference, JRE major versions with their corresponding Unicode versions:
* Java 8, Unicode 6.2
* Java 9, Unicode 8.0
In general, whether or not you need to re-index largely depends upon the data that
In general, whether you need to re-index largely depends upon the data that
you are searching, and what was changed in any given Unicode version. For example,
if you are completely sure that your content is limited to the "Basic Latin" range
if you are completely sure your content is limited to the "Basic Latin" range
of Unicode, you can safely ignore this.
## Special Notes: LUCENE 2.9 TO 3.0, JAVA 1.4 TO JAVA 5 TRANSITION
* `StandardAnalyzer` will return the same results under Java 5 as it did under
Java 1.4. This is because it is largely independent of the runtime JRE for
Unicode support, (with the exception of lowercasing). However, no changes to
Unicode support, (except for lowercasing). However, no changes to
casing have occurred in Unicode 4.0 that affect StandardAnalyzer, so if you are
using this Analyzer you are NOT affected.

@ -1,33 +1,35 @@
# Apache Lucene Migration Guide
## NGramFilterFactory "keepShortTerm" option was fixed to "preserveOriginal" (LUCENE-9259) ##
## NGramFilterFactory "keepShortTerm" option was fixed to "preserveOriginal" (LUCENE-9259)
The factory option name to output the original term was corrected in accordance with its Javadoc.
## o.a.l.misc.IndexMergeTool defaults changes (LUCENE-9206) ##
## o.a.l.misc.IndexMergeTool defaults changes (LUCENE-9206)
This command-line tool no longer forceMerges to a single segment. Instead, by
default it just follows (configurable) merge policy. If you really want to merge
to a single segment, you can pass -max-segments 1.
## o.a.l.util.fst.Builder is renamed FSTCompiler with fluent-style Builder (LUCENE-9089) ##
## o.a.l.util.fst.Builder is renamed FSTCompiler with fluent-style Builder (LUCENE-9089)
Simply use FSTCompiler instead of the previous Builder. Use either the simple constructor with default settings, or
the FSTCompiler.Builder to tune and tweak any parameter.
## Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933) ##
## Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933)
The user dictionary now strictly validates whether the (concatenated) segment is the same as the surface form. This change avoids
unexpected runtime exceptions or behaviours.
For example, these entries are not allowed at all, and an exception is thrown when loading the dictionary file.
```
# concatenated "日本経済新聞" does not match the surface form "日経新聞"
日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
# concatenated "日経新聞" does not match the surface form "日本経済新聞"
日本経済新聞,日経 新聞,ニッケイ シンブン,カスタム名詞
```
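The rule above can be sketched in plain Java. This is only an illustration of the check as described here, not Lucene's actual dictionary loader: the class and method names are hypothetical, and it assumes the first CSV column is the surface form and the second is the space-separated segmentation.

```java
import java.util.List;

/** Hypothetical checker for illustration; not Lucene's user dictionary code. */
final class UserDictionaryEntryCheck {

  /**
   * Returns true when the space-separated segments in the second CSV column,
   * concatenated in order, equal the surface form in the first column.
   */
  static boolean isValidEntry(String csvLine) {
    String[] columns = csvLine.split(",");
    if (columns.length < 2) {
      return false; // need at least a surface form and a segmentation
    }
    String surface = columns[0];
    String concatenated = String.join("", columns[1].split(" "));
    return surface.equals(concatenated);
  }

  public static void main(String[] args) {
    List<String> entries = List.of(
        "日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞",
        "日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞");
    for (String entry : entries) {
      System.out.println(isValidEntry(entry) + " <- " + entry);
    }
  }
}
```

In this sketch an invalid entry only yields `false`; as noted above, Lucene itself throws an exception while loading the dictionary file.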
## JapaneseTokenizer no longer emits original (compound) tokens by default when the mode is not NORMAL (LUCENE-9123) ##
## JapaneseTokenizer no longer emits original (compound) tokens by default when the mode is not NORMAL (LUCENE-9123)
JapaneseTokenizer and JapaneseAnalyzer no longer emit original tokens when the discardCompoundToken option is not specified.
The constructor option was introduced in Lucene 8.5.0, and its default value has been changed to true.
@ -37,13 +39,15 @@ longer outputs the original token "株式会社" by default. To output original
explicitly set to false. Be aware that if this option is set to false SynonymFilter or SynonymGraphFilter does not work
correctly (see LUCENE-9173).
## Analysis factories now have customizable symbolic names (LUCENE-8778) and need additional no-arg constructor (LUCENE-9281) ##
## Analysis factories now have customizable symbolic names (LUCENE-8778) and need additional no-arg constructor (LUCENE-9281)
The SPI names for concrete subclasses of TokenizerFactory, TokenFilterFactory, and CharFilterFactory are no longer
derived from their class name. Instead, each factory must have a static "NAME" field like this:
```
/** o.a.l.a.standard.StandardTokenizerFactory's SPI name */
public static final String NAME = "standard";
```
A factory can be resolved/instantiated with its NAME by using methods such as TokenizerFactory#lookupClass(String)
or TokenizerFactory#forName(String, Map<String,String>).
@ -60,35 +64,37 @@ In the future, extensions to Lucene developed on the Java Module System may expo
This constructor is never called by Lucene, so by default it throws an UnsupportedOperationException. User-defined
factory classes should implement it in the following way:
```
/** Default ctor for compatibility with SPI */
public StandardTokenizerFactory() {
throw defaultCtorException();
}
```
(`defaultCtorException()` is a protected static helper method)
## TermsEnum is now fully abstract (LUCENE-8292) ##
## TermsEnum is now fully abstract (LUCENE-8292)
TermsEnum has been changed to be fully abstract, so non-abstract subclasses must implement all of its methods.
Non-performance-critical TermsEnums can use BaseTermsEnum as a base class instead. The change was motivated
by several performance issues with FilterTermsEnum that caused significant slowdowns and massive memory consumption due
to not delegating all methods from TermsEnum. See LUCENE-8292 and LUCENE-8662.
## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed ##
## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed
RAM-based directory implementations have been removed. (LUCENE-8474).
ByteBuffersDirectory can be used as a RAM-resident replacement, although it
is discouraged in favor of the default memory-mapped directory.
## Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014) ##
## Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014)
SpanQuery and PhraseQuery now always calculate their slops as (1.0 / (1.0 +
distance)). Payload factor calculation is performed by PayloadDecoder in the
queries module
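
As a quick illustration of the fixed slop formula quoted above, here is a minimal stand-alone sketch. The class and method names are hypothetical and this is not Lucene's scoring code; it only evaluates 1.0 / (1.0 + distance) for a few distances.

```java
/** Hypothetical helper for illustration; not part of Lucene. */
final class SlopFactorSketch {

  /** Evaluates the fixed slop factor 1.0 / (1.0 + distance). */
  static float slopFactor(int distance) {
    return (float) (1.0 / (1.0 + distance));
  }

  public static void main(String[] args) {
    // An exact match (distance 0) keeps its full weight; larger distances decay.
    for (int distance : new int[] {0, 1, 3}) {
      System.out.println("distance=" + distance + " factor=" + slopFactor(distance));
    }
  }
}
```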
## Scorer must produce positive scores (LUCENE-7996) ##
## Scorer must produce positive scores (LUCENE-7996)
Scorers are no longer allowed to produce negative scores. If you have custom
query implementations, you should make sure their score formula may never produce
@ -98,21 +104,23 @@ As a side-effect of this change, negative boosts are now rejected and
FunctionScoreQuery maps negative values to 0.
## CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099) ##
## CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099)
Instead use FunctionScoreQuery and a DoubleValuesSource implementation. BoostedQuery
and BoostingQuery may be replaced by calls to FunctionScoreQuery.boostByValue() and
FunctionScoreQuery.boostByQuery(). To replace more complex calculations in
CustomScoreQuery, use the lucene-expressions module:
```
SimpleBindings bindings = new SimpleBindings();
bindings.add("score", DoubleValuesSource.SCORES);
bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));
```
## Index options can no longer be changed dynamically (LUCENE-8134) ##
## Index options can no longer be changed dynamically (LUCENE-8134)
Changing index options on the fly now results in an
IllegalArgumentException. If a field is indexed
@ -120,62 +128,64 @@ IllegalArgumentException. If a field is indexed
the same index options for that field.
## IndexSearcher.createNormalizedWeight() removed (LUCENE-8242) ##
## IndexSearcher.createNormalizedWeight() removed (LUCENE-8242)
Instead use IndexSearcher.createWeight(), rewriting the query first, and using
a boost of 1f.
## Memory codecs removed (LUCENE-8267) ##
## Memory codecs removed (LUCENE-8267)
Memory codecs have been removed from the codebase (MemoryPostings, MemoryDocValues).
## Direct doc-value format removed (LUCENE-8917) ##
## Direct doc-value format removed (LUCENE-8917)
The "Direct" doc-value format has been removed from the codebase.
## QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144) ##
## QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144)
Caching everything is discouraged as it disables the ability to skip non-interesting documents.
ALWAYS_CACHE can be replaced by a UsageTrackingQueryCachingPolicy with an appropriate config.
## English stopwords are no longer removed by default in StandardAnalyzer (LUCENE_7444) ##
## English stopwords are no longer removed by default in StandardAnalyzer (LUCENE_7444)
To retain the old behaviour, pass EnglishAnalyzer.ENGLISH_STOP_WORDS_SET as an argument
to the constructor
## StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved ##
## StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved
English stop words are now defined in EnglishAnalyzer#ENGLISH_STOP_WORDS_SET in the
analysis-common module
## TopDocs.maxScore removed ##
## TopDocs.maxScore removed
TopDocs.maxScore is removed. IndexSearcher and TopFieldCollector no longer have
an option to compute the maximum score when sorting by field. If you need to
know the maximum score for a query, the recommended approach is to run a
separate query:
```
TopDocs topHits = searcher.search(query, 1);
float maxScore = topHits.scoreDocs.length == 0 ? Float.NaN : topHits.scoreDocs[0].score;
```
Thanks to other optimizations that were added to Lucene 8, this query will be
able to efficiently select the top-scoring document without having to visit
all matches.
## TopFieldCollector always assumes fillFields=true ##
## TopFieldCollector always assumes fillFields=true
Because filling sort values doesn't have a significant overhead, the fillFields
option has been removed from TopFieldCollector factory methods. Everything
behaves as if it was set to true.
## TopFieldCollector no longer takes a trackDocScores option ##
## TopFieldCollector no longer takes a trackDocScores option
Computing scores at collection time is less efficient than running a second
request in order to only compute scores for documents that made it to the top
hits. As a consequence, the trackDocScores option has been removed and can be
replaced with the new TopFieldCollector#populateScores helper method.
## IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long ##
## IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long
Lucene 8 received optimizations for collection of top-k matches by not visiting
all matches. However these optimizations won't help if all matches still need
@ -185,37 +195,36 @@ accurately up to 1,000, and Topdocs.totalHits was changed from a long to an
object that says whether the hit count is accurate or a lower bound of the
actual hit count.
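
The idea of a hit count that carries its own accuracy can be sketched as follows. The types below are hypothetical stand-ins written for this example, not Lucene's classes; consult the Javadoc for the actual type of TopDocs.totalHits.

```java
/** Hypothetical stand-in types for illustration; not Lucene's API. */
final class HitCountSketch {

  /** Whether the stored value is exact or only a lower bound. */
  enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

  /** A hit count together with the information on how to interpret it. */
  static final class HitCount {
    final long value;
    final Relation relation;

    HitCount(long value, Relation relation) {
      this.value = value;
      this.relation = relation;
    }
  }

  /** Renders an exact count as "42" and a lower bound as "1000+". */
  static String describe(HitCount hits) {
    return hits.relation == Relation.EQUAL_TO
        ? Long.toString(hits.value)
        : hits.value + "+";
  }

  public static void main(String[] args) {
    System.out.println(describe(new HitCount(42, Relation.EQUAL_TO)));
    System.out.println(describe(new HitCount(1000, Relation.GREATER_THAN_OR_EQUAL_TO)));
  }
}
```

Code that previously treated the hit count as a plain long should decide explicitly what to do when only a lower bound is available, for example by displaying it with a trailing "+".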
## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated ##
## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated
This RAM-based directory implementation is an old piece of code that uses inefficient
thread synchronization primitives and can be mistaken for being "faster" than the NIO-based
MMapDirectory. It is deprecated and scheduled for removal in future versions of
Lucene. (LUCENE-8467, LUCENE-8438)
## LeafCollector.setScorer() now takes a Scorable rather than a Scorer ##
## LeafCollector.setScorer() now takes a Scorable rather than a Scorer
Scorer has a number of methods that should never be called from Collectors, for example
those that advance the underlying iterators. To hide these, LeafCollector.setScorer()
now takes a Scorable, an abstract class that Scorers can extend, with methods
docId() and score() (LUCENE-6228)
## Scorers must have non-null Weights ##
## Scorers must have non-null Weights
If a custom Scorer implementation does not have an associated Weight, it can probably
be replaced with a Scorable instead.
## Suggesters now return Long instead of long for weight() during indexing, and double
instead of long at suggest time ##
## Suggesters now return Long instead of long for weight() during indexing, and double instead of long at suggest time
Most code should just require recompilation, though possibly requiring some added casts.
## TokenStreamComponents is now final ##
## TokenStreamComponents is now final
Instead of overriding TokenStreamComponents#setReader() to customise analyzer
initialisation, you should now pass a Consumer&lt;Reader> instance to the
TokenStreamComponents constructor.
## LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed ##
## LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed
LowerCaseTokenizer combined tokenization and filtering in a way that broke token
normalization, so they have been removed. Instead, use a LetterTokenizer followed by
@ -231,12 +240,12 @@ use a TokenFilter chain as you would with any other Tokenizer.
Both Highlighter and FastVectorHighlighter need a custom WeightedSpanTermExtractor or FieldQuery respectively
in order to support ToParent/ToChildBlockJoinQuery.
## MultiTermAwareComponent replaced by CharFilterFactory#normalize() and TokenFilterFactory#normalize() ##
## MultiTermAwareComponent replaced by CharFilterFactory#normalize() and TokenFilterFactory#normalize()
Normalization is now type-safe, with CharFilterFactory#normalize() returning a Reader and
TokenFilterFactory#normalize() returning a TokenFilter.
## k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563) ##
## k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563)
Scores computed by the BM25 similarity are lower than previously as the k1+1
constant factor was removed from the numerator of the scoring formula.
@ -244,17 +253,18 @@ Ordering of results is preserved unless scores are computed from multiple
fields using different similarities. The previous behaviour is now exposed
by the LegacyBM25Similarity class which can be found in the lucene-misc jar.
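
To see why scores drop while ordering within a single similarity is preserved, the following stand-alone sketch compares the term-frequency part of the textbook BM25 formula with and without the k1+1 factor. It is an illustration only, not Lucene's implementation: the class and method names are hypothetical, the IDF part is omitted, and k1 = 1.2 and b = 0.75 are assumed parameter values.

```java
/** Hypothetical illustration of the BM25 term-frequency part; not Lucene's code. */
final class Bm25FactorSketch {
  static final double K1 = 1.2;  // assumed k1
  static final double B = 0.75;  // assumed b

  /** Length normalization term k1 * (1 - b + b * docLen / avgDocLen). */
  static double lengthNorm(double docLen, double avgDocLen) {
    return K1 * (1 - B + B * docLen / avgDocLen);
  }

  /** Term-frequency part with the (k1 + 1) factor in the numerator. */
  static double legacyTf(double freq, double docLen, double avgDocLen) {
    return freq * (K1 + 1) / (freq + lengthNorm(docLen, avgDocLen));
  }

  /** Term-frequency part with the (k1 + 1) factor removed. */
  static double currentTf(double freq, double docLen, double avgDocLen) {
    return freq / (freq + lengthNorm(docLen, avgDocLen));
  }

  public static void main(String[] args) {
    // The two variants differ by the constant factor k1 + 1 for every document.
    for (double freq : new double[] {1, 3, 10}) {
      System.out.println("freq=" + freq
          + " legacy=" + legacyTf(freq, 100, 100)
          + " current=" + currentTf(freq, 100, 100));
    }
  }
}
```

Because every score is divided by the same constant, results ranked by one BM25 similarity keep their relative order; only combinations of scores from different similarities can change.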
## IndexWriter#maxDoc()/#numDocs() removed in favor of IndexWriter#getDocStats() ##
## IndexWriter#maxDoc()/#numDocs() removed in favor of IndexWriter#getDocStats()
IndexWriter#getDocStats() should be used instead of #maxDoc() / #numDocs(), as it offers a consistent
view of document stats. Previously, calling two methods in order to get point-in-time stats was subject
to concurrent changes.
## maxClausesCount moved from BooleanQuery To IndexSearcher (LUCENE-8811) ##
## maxClausesCount moved from BooleanQuery To IndexSearcher (LUCENE-8811)
IndexSearcher now performs max clause count checks on all types of queries (including BooleanQueries).
This led to a logical move of the clauses count from BooleanQuery to IndexSearcher.
## TopDocs.merge shall no longer allow setting of shard indices ##
## TopDocs.merge shall no longer allow setting of shard indices
TopDocs.merge's API has been changed to stop allowing passing in a parameter to indicate if it should
set shard indices for hits as they are seen during the merge process. This is done to simplify the API
@ -262,7 +272,7 @@ to be more dynamic in terms of passing in custom tie breakers.
If shard indices are to be used for tie breaking docs with equal scores during TopDocs.merge, then it is
mandatory that the input ScoreDocs have their shard indices set to valid values prior to calling TopDocs.merge
## TopDocsCollector Shall Throw IllegalArgumentException For Malformed Arguments ##
## TopDocsCollector Shall Throw IllegalArgumentException For Malformed Arguments
TopDocsCollector shall no longer return an empty TopDocs for malformed arguments.
Rather, an IllegalArgumentException shall be thrown. This is introduced for better

@ -14,5 +14,5 @@ implementing Lucene (document size, number of documents, and number of
hits retrieved to name a few). The benchmarks page has some information
related to performance on particular platforms.
*To build Apache Lucene from source, refer to the `BUILD.txt` file in
*To build Apache Lucene from the source, refer to the `BUILD.txt` file in
the distribution directory.*

@ -32,9 +32,9 @@
excludes="poms/**,**/*-src.jar,**/*-javadoc.jar"
/>
<patternset id="binary.root.dist.patterns"
includes="LICENSE.txt,NOTICE.txt,README.txt,
MIGRATE.txt,JRE_VERSION_MIGRATION.txt,
SYSTEM_REQUIREMENTS.txt,
includes="LICENSE.txt,NOTICE.txt,README.md,
MIGRATE.md,JRE_VERSION_MIGRATION.md,
SYSTEM_REQUIREMENTS.md,
CHANGES.txt,
**/lib/*.jar,
licenses/**,
@ -229,8 +229,8 @@
</xslt>
<markdown todir="${javadoc.dir}">
<fileset dir="." includes="MIGRATE.txt,JRE_VERSION_MIGRATION.txt,SYSTEM_REQUIREMENTS.txt"/>
<globmapper from="*.txt" to="*.html"/>
<fileset dir="." includes="MIGRATE.md,JRE_VERSION_MIGRATION.md,SYSTEM_REQUIREMENTS.md"/>
<globmapper from="*.md" to="*.html"/>
</markdown>
<copy todir="${javadoc.dir}">