LUCENE-9344: Convert .txt files to properly formatted .md files (#1449)

2020-04-24 14:28:12 +09:00 · 2020-04-24 14:28:12 +09:00 · c7697b088c
parent a11b78e06a
commit c7697b088c
7 changed files with 77 additions and 63 deletions
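
Most of the MIGRATE.txt hunks in this diff drop the closing `##` from ATX-style headings. The following is a minimal Python sketch of that normalization, for illustration only: the commit does not say how the edits were produced, and `strip_closing_hashes` is a hypothetical helper name, not part of the Lucene build.

```python
import re

# Matches an ATX heading that ends with a closing run of '#' characters,
# e.g. "## Title ##". Group 1 keeps the opening hashes and the title text.
# A '#' inside the title (such as "Foo#bar()") is left alone because the
# closing run must be preceded by whitespace and reach the end of the line.
_CLOSED_HEADING = re.compile(r"^(#{1,6} .*?)\s+#+\s*$")


def strip_closing_hashes(line: str) -> str:
    """Return the heading without its closing '#' run; other lines pass through."""
    match = _CLOSED_HEADING.match(line)
    return match.group(1) if match else line


def convert(text: str) -> str:
    """Apply strip_closing_hashes to every line of a document."""
    return "\n".join(strip_closing_hashes(line) for line in text.splitlines())
```

Note that this sketch is not fence-aware: it would also rewrite a matching line inside a fenced code block, so a real conversion would need to skip those regions.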
--- a/lucene/BUILD.txt
+++ b/lucene/BUILD.txt
@@ -1,17 +1,17 @@
-Lucene Build Instructions
+# Lucene Build Instructions

-Basic steps:
-  0) Install OpenJDK 11 (or greater), Ant 1.8.2+, Ivy 2.2.0
-  1) Download Lucene from Apache and unpack it
-  2) Connect to the top-level of your Lucene installation
-  3) Install JavaCC (optional)
-  4) Run ant
+## Basic steps:
+  
+  0. Install OpenJDK 11 (or greater), Ant 1.8.2+, Ivy 2.2.0
+  1. Download Lucene from Apache and unpack it
+  2. Connect to the top-level of your Lucene installation
+  3. Install JavaCC (optional)
+  4. Run ant

-Step 0) Set up your development environment (OpenJDK 11 or greater,
-Ant 1.8.2+, Ivy 2.2.0)
+## Step 0) Set up your development environment (OpenJDK 11 or greater, Ant 1.8.2+, Ivy 2.2.0)

 We'll assume that you know how to get and set up the JDK - if you
-don't, then we suggest starting at http://www.oracle.com/java/ and learning
+don't, then we suggest starting at https://www.oracle.com/java/ and learning
 more about Java, before returning to this README. Lucene runs with
 Java 11 and later.

@@ -22,31 +22,31 @@ Ant is "kind of like make without make's wrinkles".  Ant is
 implemented in java and uses XML-based configuration files.  You can
 get it at:

-  http://ant.apache.org
+  https://ant.apache.org

 You'll need to download the Ant binary distribution.  Install it
 according to the instructions at:

-  http://ant.apache.org/manual
+  https://ant.apache.org/manual

 Finally, you'll need to install ivy into your ant lib folder
 (~/.ant/lib). You can get it from http://ant.apache.org/ivy/.
 If you skip this step, the Lucene build system will offer to do it 
 for you.

-Step 1) Download Lucene from Apache
+## Step 1) Download Lucene from Apache

 We'll assume you already did this, or you wouldn't be reading this
 file.  However, you might have received this file by some alternate
 route, or you might have an incomplete copy of the Lucene, so: Lucene
 releases are available for download at:

-  http://www.apache.org/dyn/closer.cgi/lucene/java/
+  https://www.apache.org/dyn/closer.cgi/lucene/java/

 Download either a zip or a tarred/gzipped version of the archive, and
 uncompress it into a directory of your choice.

-Step 2) From the command line, change (cd) into the top-level directory of your Lucene installation
+## Step 2) From the command line, change (cd) into the top-level directory of your Lucene installation

 Lucene's top-level directory contains the build.xml file. By default,
 you do not need to change any of the settings in this file, but you do
@@ -66,7 +66,7 @@ system.

 NOTE: the ~ character represents your user account home directory.

-Step 3) Run ant
+## Step 4) Run ant

 Assuming you have ant in your PATH and have set ANT_HOME to the
 location of your ant installation, typing "ant" at the shell prompt
@@ -76,10 +76,12 @@ and command prompt should run ant.  Ant will by default look for the
 If you want to build the documentation, type "ant documentation".

 For further information on Lucene, go to:
-  http://lucene.apache.org/
+
+  https://lucene.apache.org/

 Please join the Lucene-User mailing list by visiting this site:
-  http://lucene.apache.org/core/discussion.html
+
+  https://lucene.apache.org/core/discussion.html

 Please post suggestions, questions, corrections or additions to this
 document to the lucene-user mailing list.
@@ -87,4 +89,4 @@ document to the lucene-user mailing list.
 This file was originally written by Steven J. Owens <puff@darksleep.com>.
 This file was modified by Jon S. Stevens <jon@latchkey.com>.

-Copyright (c) 2001-2005 The Apache Software Foundation.  All rights reserved.
+Copyright (c) 2001-2020 The Apache Software Foundation.  All rights reserved.
--- a/lucene/CHANGES.txt
+++ b/lucene/CHANGES.txt
@@ -115,6 +115,8 @@ Other
 * LUCENE-8656: Deprecations in FuzzyQuery and get compiler warnings out of 
  queryparser code (Alan Woodward, Erick Erickson)

+* LUCENE-9344: Convert .txt files to properly formatted .md files. (Tomoko Uchida, Uwe Schindler)
+
 ======================= Lucene 8.6.0 =======================

 API Changes
--- a/lucene/JRE_VERSION_MIGRATION.txt
+++ b/lucene/JRE_VERSION_MIGRATION.txt
@@ -19,16 +19,16 @@ For reference, JRE major versions with their corresponding Unicode versions:
 * Java 8, Unicode 6.2
 * Java 9, Unicode 8.0

-In general, whether or not you need to re-index largely depends upon the data that
+In general, whether you need to re-index largely depends upon the data that
 you are searching, and what was changed in any given Unicode version. For example, 
-if you are completely sure that your content is limited to the "Basic Latin" range 
+if you are completely sure your content is limited to the "Basic Latin" range
 of Unicode, you can safely ignore this. 

 ## Special Notes: LUCENE 2.9 TO 3.0, JAVA 1.4 TO JAVA 5 TRANSITION

 * `StandardAnalyzer` will return the same results under Java 5 as it did under 
 Java 1.4. This is because it is largely independent of the runtime JRE for
-Unicode support, (with the exception of lowercasing).  However, no changes to 
+Unicode support, (except for lowercasing).  However, no changes to
 casing have occurred in Unicode 4.0 that affect StandardAnalyzer, so if you are 
 using this Analyzer you are NOT affected.

--- a/lucene/MIGRATE.txt
+++ b/lucene/MIGRATE.txt
@@ -1,33 +1,35 @@
 # Apache Lucene Migration Guide

-## NGramFilterFactory "keepShortTerm" option was fixed to "preserveOriginal" (LUCENE-9259) ##
+## NGramFilterFactory "keepShortTerm" option was fixed to "preserveOriginal" (LUCENE-9259)

 The factory option name to output the original term was corrected in accordance with its Javadoc.

-## o.a.l.misc.IndexMergeTool defaults changes (LUCENE-9206) ##
+## o.a.l.misc.IndexMergeTool defaults changes (LUCENE-9206)

 This command-line tool no longer forceMerges to a single segment. Instead, by
 default it just follows (configurable) merge policy. If you really want to merge
 to a single segment, you can pass -max-segments 1.

-## o.a.l.util.fst.Builder is renamed FSTCompiler with fluent-style Builder (LUCENE-9089) ##
+## o.a.l.util.fst.Builder is renamed FSTCompiler with fluent-style Builder (LUCENE-9089)

 Simply use FSTCompiler instead of the previous Builder. Use either the simple constructor with default settings, or
 the FSTCompiler.Builder to tune and tweak any parameter.

-## Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933) ##
+## Kuromoji user dictionary now forbids illegal segmentation (LUCENE-8933)

 User dictionary now strictly validates if the (concatenated) segment is the same as the surface form. This change avoids
 unexpected runtime exceptions or behaviours.
 For example, these entries are not allowed at all and an exception is thrown when loading the dictionary file.

+```
 # concatenated "日本経済新聞" does not match the surface form "日経新聞"
 日経新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞

 # concatenated "日経新聞" does not match the surface form "日本経済新聞"
 日本経済新聞,日経 新聞,ニッケイ シンブン,カスタム名詞
+```

-## JapaneseTokenizer no longer emits original (compound) tokens by default when the mode is not NORMAL (LUCENE-9123) ##
+## JapaneseTokenizer no longer emits original (compound) tokens by default when the mode is not NORMAL (LUCENE-9123)

 JapaneseTokenizer and JapaneseAnalyzer no longer emit original tokens when the discardCompoundToken option is not specified.
 The constructor option was introduced in Lucene 8.5.0, and its default value has been changed to true.
@@ -37,13 +39,15 @@ longer outputs the original token "株式会社" by default. To output original
 explicitly set to false. Be aware that if this option is set to false SynonymFilter or SynonymGraphFilter does not work
 correctly (see LUCENE-9173).

-## Analysis factories now have customizable symbolic names (LUCENE-8778) and need additional no-arg constructor (LUCENE-9281) ##
+## Analysis factories now have customizable symbolic names (LUCENE-8778) and need additional no-arg constructor (LUCENE-9281)

 The SPI names for concrete subclasses of TokenizerFactory, TokenFilterFactory, and CharfilterFactory are no longer
 derived from their class name. Instead, each factory must have a static "NAME" field like this:

+```
    /** o.a.l.a.standard.StandardTokenizerFactory's SPI name */
    public static final String NAME = "standard";
+```

 A factory can be resolved/instantiated with its NAME by using methods such as TokenizerFactory#lookupClass(String)
 or TokenizerFactory#forName(String, Map<String,String>).
@@ -60,35 +64,37 @@ In the future, extensions to Lucene developed on the Java Module System may expo
 This constructor is never called by Lucene, so by default it throws a UnsupportedOperationException. User-defined
 factory classes should implement it in the following way:

+```
    /** Default ctor for compatibility with SPI */
    public StandardTokenizerFactory() {
      throw defaultCtorException();
    }
+```

 (`defaultCtorException()` is a protected static helper method)

-## TermsEnum is now fully abstract (LUCENE-8292) ##
+## TermsEnum is now fully abstract (LUCENE-8292)

 TermsEnum has been changed to be fully abstract, so non-abstract subclasses must implement all its methods.
 Non-performance-critical TermsEnums can use BaseTermsEnum as a base class instead. The change was motivated
 by several performance issues with FilterTermsEnum that caused significant slowdowns and massive memory consumption due
 to not delegating all methods from TermsEnum. See LUCENE-8292 and LUCENE-8662.

-## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed ##
+## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream removed

 RAM-based directory implementations have been removed (LUCENE-8474).
 ByteBuffersDirectory can be used as a RAM-resident replacement, although it
 is discouraged in favor of the default memory-mapped directory.


-## Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014) ##
+## Similarity.SimScorer.computeXXXFactor methods removed (LUCENE-8014)

 SpanQuery and PhraseQuery now always calculate their slops as (1.0 / (1.0 +
 distance)).  Payload factor calculation is performed by PayloadDecoder in the
 queries module


-## Scorer must produce positive scores (LUCENE-7996) ##
+## Scorer must produce positive scores (LUCENE-7996)

 Scorers are no longer allowed to produce negative scores. If you have custom
 query implementations, you should make sure their score formula may never produce
@@ -98,21 +104,23 @@ As a side-effect of this change, negative boosts are now rejected and
 FunctionScoreQuery maps negative values to 0.


-## CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099) ##
+## CustomScoreQuery, BoostedQuery and BoostingQuery removed (LUCENE-8099)

 Instead use FunctionScoreQuery and a DoubleValuesSource implementation.  BoostedQuery
 and BoostingQuery may be replaced by calls to FunctionScoreQuery.boostByValue() and
 FunctionScoreQuery.boostByQuery().  To replace more complex calculations in
 CustomScoreQuery, use the lucene-expressions module:

+```
 SimpleBindings bindings = new SimpleBindings();
 bindings.add("score", DoubleValuesSource.SCORES);
 bindings.add("boost1", DoubleValuesSource.fromIntField("myboostfield"));
 bindings.add("boost2", DoubleValuesSource.fromIntField("myotherboostfield"));
 Expression expr = JavascriptCompiler.compile("score * (boost1 + ln(boost2))");
 FunctionScoreQuery q = new FunctionScoreQuery(inputQuery, expr.getDoubleValuesSource(bindings));
+```

-## Index options can no longer be changed dynamically (LUCENE-8134) ##
+## Index options can no longer be changed dynamically (LUCENE-8134)

 Changing index options on the fly now results in an
 IllegalArgumentException. If a field is indexed
@@ -120,62 +128,64 @@ IllegalArgumentException. If a field is indexed
 the same index options for that field.


-## IndexSearcher.createNormalizedWeight() removed (LUCENE-8242) ##
+## IndexSearcher.createNormalizedWeight() removed (LUCENE-8242)

 Instead use IndexSearcher.createWeight(), rewriting the query first, and using
 a boost of 1f.

-## Memory codecs removed (LUCENE-8267) ##
+## Memory codecs removed (LUCENE-8267)

 Memory codecs have been removed from the codebase (MemoryPostings, MemoryDocValues).

-## Direct doc-value format removed (LUCENE-8917) ##
+## Direct doc-value format removed (LUCENE-8917)

 The "Direct" doc-value format has been removed from the codebase.

-## QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144) ##
+## QueryCachingPolicy.ALWAYS_CACHE removed (LUCENE-8144)

 Caching everything is discouraged as it disables the ability to skip non-interesting documents.
 ALWAYS_CACHE can be replaced by a UsageTrackingQueryCachingPolicy with an appropriate config.

-## English stopwords are no longer removed by default in StandardAnalyzer (LUCENE_7444) ##
+## English stopwords are no longer removed by default in StandardAnalyzer (LUCENE_7444)

 To retain the old behaviour, pass EnglishAnalyzer.ENGLISH_STOP_WORDS_SET as an argument
 to the constructor

-## StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved ##
+## StandardAnalyzer.ENGLISH_STOP_WORDS_SET has been moved

 English stop words are now defined in EnglishAnalyzer#ENGLISH_STOP_WORDS_SET in the
 analysis-common module

-## TopDocs.maxScore removed ##
+## TopDocs.maxScore removed

 TopDocs.maxScore is removed. IndexSearcher and TopFieldCollector no longer have
 an option to compute the maximum score when sorting by field. If you need to
 know the maximum score for a query, the recommended approach is to run a
 separate query:

+```
  TopDocs topHits = searcher.search(query, 1);
  float maxScore = topHits.scoreDocs.length == 0 ? Float.NaN : topHits.scoreDocs[0].score;
+```

 Thanks to other optimizations that were added to Lucene 8, this query will be
 able to efficiently select the top-scoring document without having to visit
 all matches.

-## TopFieldCollector always assumes fillFields=true ##
+## TopFieldCollector always assumes fillFields=true

 Because filling sort values doesn't have a significant overhead, the fillFields
 option has been removed from TopFieldCollector factory methods. Everything
 behaves as if it was set to true.

-## TopFieldCollector no longer takes a trackDocScores option ##
+## TopFieldCollector no longer takes a trackDocScores option

 Computing scores at collection time is less efficient than running a second
 request in order to only compute scores for documents that made it to the top
 hits. As a consequence, the trackDocScores option has been removed and can be
 replaced with the new TopFieldCollector#populateScores helper method.

-## IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long ##
+## IndexSearcher.search(After) may return lower bounds of the hit count and TopDocs.totalHits is no longer a long

 Lucene 8 received optimizations for collection of top-k matches by not visiting
 all matches. However these optimizations won't help if all matches still need
@@ -185,37 +195,36 @@ accurately up to 1,000, and Topdocs.totalHits was changed from a long to an
 object that says whether the hit count is accurate or a lower bound of the
 actual hit count.

-## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated ##
+## RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated

 This RAM-based directory implementation is an old piece of code that uses inefficient
 thread synchronization primitives and can be confused as "faster" than the NIO-based
 MMapDirectory. It is deprecated and scheduled for removal in future versions of 
 Lucene. (LUCENE-8467, LUCENE-8438)

-## LeafCollector.setScorer() now takes a Scorable rather than a Scorer ##
+## LeafCollector.setScorer() now takes a Scorable rather than a Scorer

 Scorer has a number of methods that should never be called from Collectors, for example
 those that advance the underlying iterators.  To hide these, LeafCollector.setScorer()
 now takes a Scorable, an abstract class that Scorers can extend, with methods
 docId() and score() (LUCENE-6228)

-## Scorers must have non-null Weights ##
+## Scorers must have non-null Weights

 If a custom Scorer implementation does not have an associated Weight, it can probably
 be replaced with a Scorable instead.

-## Suggesters now return Long instead of long for weight() during indexing, and double
-instead of long at suggest time ##
+## Suggesters now return Long instead of long for weight() during indexing, and double instead of long at suggest time 

 Most code should just require recompilation, though possibly requiring some added casts.

-## TokenStreamComponents is now final ##
+## TokenStreamComponents is now final

 Instead of overriding TokenStreamComponents#setReader() to customise analyzer
 initialisation, you should now pass a Consumer&lt;Reader> instance to the
 TokenStreamComponents constructor.

-## LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed ##
+## LowerCaseTokenizer and LowerCaseTokenizerFactory have been removed

 LowerCaseTokenizer combined tokenization and filtering in a way that broke token
 normalization, so they have been removed. Instead, use a LetterTokenizer followed by
@@ -231,12 +240,12 @@ use a TokenFilter chain as you would with any other Tokenizer.
 Both Highlighter and FastVectorHighlighter need a custom WeightedSpanTermExtractor or FieldQuery respectively
 in order to support ToParent/ToChildBlockJoinQuery.

-## MultiTermAwareComponent replaced by CharFilterFactory#normalize() and TokenFilterFactory#normalize() ##
+## MultiTermAwareComponent replaced by CharFilterFactory#normalize() and TokenFilterFactory#normalize()

 Normalization is now type-safe, with CharFilterFactory#normalize() returning a Reader and
 TokenFilterFactory#normalize() returning a TokenFilter.

-## k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563) ##
+## k1+1 constant factor removed from BM25 similarity numerator (LUCENE-8563)

 Scores computed by the BM25 similarity are lower than previously as the k1+1
 constant factor was removed from the numerator of the scoring formula.
@@ -244,17 +253,18 @@ Ordering of results is preserved unless scores are computed from multiple
 fields using different similarities. The previous behaviour is now exposed
 by the LegacyBM25Similarity class which can be found in the lucene-misc jar.

-## IndexWriter#maxDoc()/#numDocs() removed in favor of IndexWriter#getDocStats() ##
+## IndexWriter#maxDoc()/#numDocs() removed in favor of IndexWriter#getDocStats()

 IndexWriter#getDocStats() should be used instead of #maxDoc() / #numDocs(), as it offers a consistent
 view of document stats. Previously, calling two methods in order to get point-in-time stats was subject
 to concurrent changes.

-## maxClausesCount moved from BooleanQuery To IndexSearcher (LUCENE-8811) ##
+## maxClausesCount moved from BooleanQuery To IndexSearcher (LUCENE-8811)
+
 IndexSearcher now performs max clause count checks on all types of queries (including BooleanQueries).
 This led to a logical move of the clauses count from BooleanQuery to IndexSearcher.

-## TopDocs.merge shall no longer allow setting of shard indices ##
+## TopDocs.merge shall no longer allow setting of shard indices

 TopDocs.merge's API has been changed to stop allowing passing in a parameter to indicate if it should
 set shard indices for hits as they are seen during the merge process. This is done to simplify the API
@@ -262,7 +272,7 @@ to be more dynamic in terms of passing in custom tie breakers.
 If shard indices are to be used for tie breaking docs with equal scores during TopDocs.merge, then it is
 mandatory that the input ScoreDocs have their shard indices set to valid values prior to calling TopDocs.merge

-## TopDocsCollector Shall Throw IllegalArgumentException For Malformed Arguments ##
+## TopDocsCollector Shall Throw IllegalArgumentException For Malformed Arguments

 TopDocsCollector shall no longer return an empty TopDocs for malformed arguments.
 Rather, an IllegalArgumentException shall be thrown. This is introduced for better
--- a/lucene/README.txt
+++ b/lucene/README.txt
--- a/lucene/SYSTEM_REQUIREMENTS.txt
+++ b/lucene/SYSTEM_REQUIREMENTS.txt
@@ -14,5 +14,5 @@ implementing Lucene (document size, number of documents, and number of
 hits retrieved to name a few). The benchmarks page has some information 
 related to performance on particular platforms. 

-*To build Apache Lucene from source, refer to the `BUILD.txt` file in 
+*To build Apache Lucene from the source, refer to the `BUILD.txt` file in
 the distribution directory.*
--- a/lucene/build.xml
+++ b/lucene/build.xml
@@ -32,9 +32,9 @@
              excludes="poms/**,**/*-src.jar,**/*-javadoc.jar"
  />
  <patternset id="binary.root.dist.patterns"
-              includes="LICENSE.txt,NOTICE.txt,README.txt,
-                        MIGRATE.txt,JRE_VERSION_MIGRATION.txt,
-                        SYSTEM_REQUIREMENTS.txt,
+              includes="LICENSE.txt,NOTICE.txt,README.md,
+                        MIGRATE.md,JRE_VERSION_MIGRATION.md,
+                        SYSTEM_REQUIREMENTS.md,
                        CHANGES.txt,
                        **/lib/*.jar,
                        licenses/**,
@@ -229,8 +229,8 @@
    </xslt>
    
    <markdown todir="${javadoc.dir}">
-      <fileset dir="." includes="MIGRATE.txt,JRE_VERSION_MIGRATION.txt,SYSTEM_REQUIREMENTS.txt"/>
-      <globmapper from="*.txt" to="*.html"/>
+      <fileset dir="." includes="MIGRATE.md,JRE_VERSION_MIGRATION.md,SYSTEM_REQUIREMENTS.md"/>
+      <globmapper from="*.md" to="*.html"/>
    </markdown>

    <copy todir="${javadoc.dir}">