lucene/help/regeneration.txt

Regeneration
============
Lucene has a number of machine-generated resources - some of these are
resource (binary) files, others are Java source files that are stored
(and compiled) with the rest of Lucene source code.
If you're reading this, chances are that:

1) you've hit a precommit check error that said you've modified a generated
   resource and some checksums are out of sync.
2) you need to regenerate one (or more) of these resources.

In many cases hitting (1) means you'll have to do (2) so let's discuss
these in order.
SPECIAL NOTE
------------
Regeneration tasks currently don't play well with the current gradle version.
To work around the issue, compile everything first:

  gradlew compileJava

Then run the commands as described in this document, excluding compileJava.
For example, to run "regenerate":

  gradlew regenerate -x compileJava

More information: https://github.com/apache/lucene/issues/13240
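The two steps above can be wrapped in a small shell function. This is only a
sketch: the function name and the GRADLEW variable are made up for illustration
and are not part of the Lucene build. It assumes a POSIX shell and that the
wrapper script is invoked as ./gradlew from the checkout root.

```shell
# Hypothetical helper, not part of Lucene. GRADLEW is an assumed override
# that defaults to the wrapper script in the current directory.
GRADLEW="${GRADLEW:-./gradlew}"

regen_with_workaround() {
  # Step 1: compile everything first; stop if compilation fails.
  "$GRADLEW" compileJava || return 1
  # Step 2: run the requested task(s), excluding compileJava.
  "$GRADLEW" "$@" -x compileJava
}
```

Calling "regen_with_workaround regenerate" would then run the same two commands
shown above, one after the other.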
Checksum validation errors
--------------------------
LUCENE-9868 introduced a system of storing (and validating) checksums of
generated files so that they are not accidentally modified. This checksum
system will fail the build with a message similar to this one:

Execution failed for task ':lucene:core:generateStandardTokenizerChecksumCheck'.
> Checksums mismatch for derived resources; you might have modified a generated resource (regenerate task: :lucene:core:generateStandardTokenizerIfChanged):
  Actual:
    lucene/core/[...]/StandardTokenizerImpl.java=3298326986432483248962398462938649869326

  Expected:
    lucene/core/[...]/StandardTokenizerImpl.java=8e33c2698446c1c7a9479796a41316d1932ceda8

The message shows you which resources have mismatched checksums (in this case
StandardTokenizerImpl.java), the *module* where the generated resource exists,
and the *task name* that should be used to regenerate this resource:

  :lucene:core:generateStandardTokenizerIfChanged
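If the failure is buried in a long build log, the task hint can be pulled out
mechanically. The sketch below assumes the message keeps the
"(regenerate task: ...)" wording shown above; the function name is made up for
illustration.

```shell
# Hypothetical helper: print every "regenerate task" hint found in build
# output read from standard input, one per line.
extract_regenerate_task() {
  sed -n 's/.*(regenerate task: \([^)]*\)).*/\1/p'
}
```

It could be used by piping captured build output into it, for example
"gradlew check 2>&1 | extract_regenerate_task".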
To resolve the problem, try to:

1) "git diff" the changes that caused the build failure (to see why the checksums
   changed) and then decide whether to update the generated resource's template (or
   whatever it is using to emit the generated resource);

2) regenerate the derived resources, possibly saving new checksums. If you decide to
   regenerate, just run the task hinted at in the error message, for example:

   gradlew :lucene:core:generateStandardTokenizerIfChanged

   This regenerates all resources the task "generateStandardTokenizer" produces
   and updates the corresponding checksums.
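When looking at a mismatch by hand it can help to compute a file's digest
yourself and compare it with the "Actual" and "Expected" values. The 40 hex
digits of the expected value above are consistent with SHA-1, but this document
does not name the algorithm, so treat that as an assumption. The helper name is
made up; sha1sum and shasum are standard tools and the sketch uses whichever is
installed.

```shell
# Hypothetical helper: print the SHA-1 hex digest of the file given as $1.
# ASSUMPTION: the stored checksums are SHA-1 digests of the file contents.
sha1_of() {
  if command -v sha1sum >/dev/null 2>&1; then
    sha1sum "$1" | cut -d ' ' -f 1      # GNU coreutils
  else
    shasum -a 1 "$1" | cut -d ' ' -f 1  # Perl-based tool shipped with macOS
  fi
}
```

If the digest matches neither value from the error message, the assumption
about the algorithm is probably wrong.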
Resource regeneration
---------------------
The "convention" task for regenerating all derived resources in a given
module is called "regenerate" and you can apply it to all Lucene modules
by running:

  gradlew regenerate

It is typically much wiser to limit the scope of regeneration to only
the module you're working with, though:

  gradlew -p lucene/analysis/common regenerate

If you're interested in what specific generation tasks are available, see
the task list for the generation group:

  gradlew tasks --group generation

or limit the output to a particular module:

  gradlew -p lucene/analysis/common tasks --group generation

which displays (at the moment of writing):

  generateClassicTokenizer - Regenerate ClassicTokenizerImpl.java (if sources changed)
  generateHTMLStripCharFilter - Regenerate HTMLStripCharFilter.java (if sources changed)
  generateTlds - Regenerate top-level domain jflex macros and tests (if sources changed)
  generateUAX29URLEmailTokenizer - Regenerate UAX29URLEmailTokenizerImpl.java (if sources changed)
  generateWikipediaTokenizer - Regenerate WikipediaTokenizerImpl.java (if sources changed)
  regenerate - Rerun any code or static data generation tasks.
  snowball - Regenerates snowball stemmers.
You may wonder why none of these tasks actually exist in gradle source files
(only identically named tasks with an "Internal" suffix do). The next section
explains how the two relate.
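To get just the task names out of such a listing (for scripting, say), the
" - " separator between name and description can be used. A sketch, assuming
the listing format shown above; the function name is made up.

```shell
# Hypothetical helper: read "gradlew tasks" output on stdin and print only
# the task names, i.e. the text before the first " - " on lines that have one.
generation_task_names() {
  awk -F ' - ' 'NF > 1 { print $1 }'
}
```

Lines without a " - " separator, such as group headings, are dropped.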
Resource checksums, incremental generation and advanced topics
--------------------------------------------------------------
Many resource generation tasks require specific tools (perl, python, bash shell)
and resources that may not be available on all platforms. In LUCENE-9868 we tried
to make resource generation tasks "incremental" so that they only run if their
sources (or outputs) have changed. So if you run the generic "regenerate" task, many of the
actual regeneration sub-tasks will be "skipped" - you can see this if you run gradle with
plain console, for example:

  gradlew -p lucene/analysis/common regenerate --console=plain
  ...
  > Task :lucene:analysis:common:generateUnicodeProps
  Checksums consistent with sources, skipping task: :lucene:analysis:common:generateUnicodePropsInternal
  ...

This shouldn't worry you at all - the internal tasks are skipped by wrappers
if the inputs and outputs of the internal task have not changed. If they have changed,
the task is re-run and followed up by other tasks, such as code-formatting (tidy).
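To see at a glance which internal tasks were skipped, the plain console output
can be filtered for the message shown above. A sketch that assumes the
"Checksums consistent with sources, skipping task:" wording stays as it is; the
function name is made up.

```shell
# Hypothetical helper: read plain console output on stdin and print the
# name of each internal task that was reported as skipped.
skipped_generation_tasks() {
  sed -n 's/^ *Checksums consistent with sources, skipping task: *//p'
}
```

For example: "gradlew regenerate --console=plain | skipped_generation_tasks".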
Of course, sometimes you may want to *force* the regeneration task to run, even if the
checksums indicate nothing has changed. There are a couple of reasons why you might:

- the generation task has outputs but no inputs, or the inputs are volatile. In this case
  only the outputs have checksums and the task will be skipped if the outputs haven't changed.

- you may want to run the regeneration task just to see that it actually runs and produces
  the same checksums (git diff should be clean). This would be a wise periodic sanity check
  to ensure everything works as expected.
If you want to force-run the regeneration, use gradle's "--rerun-tasks" option:

  gradlew regenerate --rerun-tasks

Scoping the call to a particular module will also work:

  gradlew -p lucene/analysis/common regenerate --rerun-tasks

Scoping the call to a particular task will also work:

  gradlew -p lucene/analysis/common generateUnicodeProps --rerun-tasks
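The periodic sanity check mentioned above (force a regeneration, then confirm
that git diff is clean) could be scripted along these lines. This is a sketch
under assumptions: the function name and the GRADLEW and GIT variables are made
up, and "git diff --quiet" only notices changes to tracked files, so newly
created untracked files would go undetected.

```shell
# Hypothetical helper, not part of Lucene. GRADLEW and GIT are assumed
# overrides so that either tool can be substituted.
GRADLEW="${GRADLEW:-./gradlew}"
GIT="${GIT:-git}"

regen_sanity_check() {
  # Force the given regeneration task(s) to run regardless of checksums.
  "$GRADLEW" "$@" --rerun-tasks || return 1
  # "git diff --quiet" exits non-zero when tracked files differ from the index.
  if "$GIT" diff --quiet; then
    echo "regeneration left the working tree unchanged"
  else
    echo "regeneration changed tracked files; inspect with 'git diff'" >&2
    return 1
  fi
}
```

For example, "regen_sanity_check -p lucene/analysis/common regenerate" would
limit the check to one module.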
You *should not* call the underlying generation task directly; this is possible
but discouraged:

  gradlew -p lucene/analysis/common generateUnicodePropsInternal --rerun-tasks

The reason is that some of these generation tasks require follow-up (for example
source code tidying) and, more importantly, the checksums for these
regenerated resources won't be saved (so the next time you run 'check' it'll fail
with checksum mismatches).
Finally, if you do feel like force-regenerating everything, remember to exclude this
monster...

  gradlew regenerate -x generateUAX29URLEmailTokenizerInternal --rerun-tasks

and on Windows, exclude snowball regeneration (requires bash):

  gradlew regenerate -x generateUAX29URLEmailTokenizerInternal -x snowball --rerun-tasks
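The choice between the two command lines above can be left to a script. The
sketch below keys the decision on whether bash is on the PATH rather than on
the operating system name; that is an assumption based on the "requires bash"
remark, and both function names are made up.

```shell
# Hypothetical helpers, not part of Lucene.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

force_regen_args() {
  # Always exclude the "monster" task called out above.
  set -- regenerate -x generateUAX29URLEmailTokenizerInternal
  # ASSUMPTION: snowball regeneration only needs bash to be on the PATH.
  have_tool bash || set -- "$@" -x snowball
  # Print the arguments as one space-separated line.
  printf '%s\n' "$* --rerun-tasks"
}
```

The printed arguments can then be handed to the wrapper with an unquoted
command substitution, for example: gradlew $(force_regen_args)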