Regeneration
============

Lucene has a number of machine-generated resources - some of these are
resource (binary) files, others are Java source files that are stored
(and compiled) with the rest of Lucene source code.

If you're reading this, chances are that:

1) you've hit a precommit check error that said you've modified a generated
   resource and some checksums are out of sync.

2) you need to regenerate one (or more) of these resources.

In many cases hitting (1) means you'll have to do (2) so let's discuss
these in order.


Checksum validation errors
--------------------------

LUCENE-9868 introduced a system of storing (and validating) checksums of
generated files so that they are not accidentally modified. This checkums
system will fail the build with a message similar to this one:

Execution failed for task ':lucene:core:generateStandardTokenizerChecksumCheck'.
> Checksums mismatch for derived resources; you might have modified a generated resource (regenerate task: :lucene:core:generateStandardTokenizerIfChanged):
  Actual:
    lucene/core/[...]/StandardTokenizerImpl.java=3298326986432483248962398462938649869326

  Expected:
    lucene/core/[...]/StandardTokenizerImpl.java=8e33c2698446c1c7a9479796a41316d1932ceda8

The message shows you which resources have mismatches on checksums (in this case 
StandardTokenizerImpl.java) but also the *module* where the generated
resource exists and the *task name* that should be used to regenerate this resource:

:lucene:core:generateStandardTokenizerIfChanged

To resolve the problem, try to:

1) "git diff" the changes that caused the build failure (to see why the checksums
changed) and then decide whether to update the generated resource's template (or whatever
it is using to emit the generated resource);

2) regenerate the derived resources, possibly saving new checksums. If you decide to 
regenerate, just run the task hinted at in the error message, for example:

gradlew :lucene:core:generateStandardTokenizerIfChanged

This regenerates all resources the task "generateStandardTokenizer" produces 
and updates the corresponding checksums.


Resource regeneration
---------------------

The "convention" task for regenerating all derived resources in a given
module is called "regenerate" and you can apply it to all Lucene modules
by running:

gradlew regenerate

It is typically much wiser to limit the scope of regeneration to only 
the module you're working with though:

gradlew -p lucene/analysis/common regenerate

If you're interested in what specific generation tasks are available, see
the task list for the generation group:

gradlew tasks --group generation

or limit the output to a particular module:

gradlew -p lucene/analysis/common tasks --group generation

which displays (at the moment of writing):

generateClassicTokenizer - Regenerate ClassicTokenizerImpl.java (if sources changed)
generateHTMLStripCharFilter - Regenerate HTMLStripCharFilter.java (if sources changed)
generateTlds - Regenerate top-level domain jflex macros and tests (if sources changed)
generateUAX29URLEmailTokenizer - Regenerate UAX29URLEmailTokenizerImpl.java (if sources changed)
generateWikipediaTokenizer - Regenerate WikipediaTokenizerImpl.java (if sources changed)
regenerate - Rerun any code or static data generation tasks.
snowball - Regenerates snowball stemmers.

You may wonder why none of these tasks actually exist in gradle source files (identically
named tasks with a suffix "Internal" exist).


Resource checksums, incremental generation and advanced topics
--------------------------------------------------------------

Many resource generation tasks require specific tools (perl, python, bash shell)
and resources that may not be available on all platforms. In LUCENE-9868 we tried 
to make resource generation tasks "incremental" so that they only run if their 
sources (or outputs) have changed. So if you run the generic "regenerate" task, many of the
actual regeneration sub-tasks will be "skipped" - you can see this if you run gradle with
plain console, for example:

gradlew -p lucene/analysis/common regenerate --console=plain

...
> Task :lucene:analysis:common:generateUnicodeProps
Checksums consistent with sources, skipping task: :lucene:analysis:common:generateUnicodePropsInternal
...

This shouldn't worry you at all - the internal tasks are skipped by wrappers
if the inputs and outputs of the internal task have not changed. If they have changed,
the task is re-run and followed up by other tasks, such as code-formatting (tidy).

Of course, sometimes you may want to *force* the regeneration task to run, even if the
checksums indicate nothing has changed. This may happen because of several reasons:

- the generation task has outputs but no inputs or the inputs are volatile. In this case
only the outputs have checksums and the task will be skipped if the outputs haven't changed.

- you may want to run the regeneration task just to see that it actually runs and produces
the same checksums (git diff should be clean). This would be a wise periodic sanity check
to ensure everything works as expected.

If you want to force-run the regeneration, use gradle's "--rerun-tasks" option:

gradlew regenerate --rerun-tasks

Scoping the call to a particular module will also work:

gradlew -p lucene/analysis/common regenerate --rerun-tasks

Scoping the call to a particular task will also work:

gradlew -p lucene/analysis/common generateUnicodeProps --rerun-tasks

You *should not* call the underlying generation task directly; this is possible
but discouraged:

gradlew -p lucene/analysis/common generateUnicodePropsInternal --rerun-tasks

The reason is that some of these generation tasks require follow-up (for example
source code tidying) and, more importantly, the checksums for these 
regenerated resources won't be saved (so the next time you run 'check' it'll fail
with checksum mismatches).

Finally, if you do feel like force-regenerating everything, remember to exclude this 
monster...

gradlew regenerate -x generateUAX29URLEmailTokenizerInternal --rerun-tasks

and on Windows, exclude snowball regeneration (requires bash):

gradlew regenerate -x generateUAX29URLEmailTokenizerInternal -x snowball --rerun-tasks