SOLR-14954: Heavily edit reindexing.adoc

Erick Erickson 2020-10-26 13:22:00 -04:00
parent 38f02869b4
commit c29b0083d7
2 changed files with 50 additions and 117 deletions

@ -46,7 +46,8 @@ Several parameters can be used to trigger faceting based on the indexed terms in
When using these parameters, it is important to remember that "term" is a very specific concept in Lucene: it relates to the literal field/value pairs that are indexed after any analysis occurs. For text fields that include stemming, lowercasing, or word splitting, the resulting terms may not be what you expect. When using these parameters, it is important to remember that "term" is a very specific concept in Lucene: it relates to the literal field/value pairs that are indexed after any analysis occurs. For text fields that include stemming, lowercasing, or word splitting, the resulting terms may not be what you expect.
If you want Solr to perform both analysis (for searching) and faceting on the full literal strings, use the `copyField` directive in your Schema to create two versions of the field: one Text and one String. Make sure both are `indexed="true"`. (For more information about the `copyField` directive, see <<documents-fields-and-schema-design.adoc#documents-fields-and-schema-design,Documents, Fields, and Schema Design>>.) If you want Solr to perform both analysis (for searching) and faceting on the full literal strings, use the `copyField` directive in your Schema to create two versions of the field: one Text and one String. The Text field should have `indexed="true" docValues="false"` if used for searching but not faceting, and the String field should have `indexed="false" docValues="true"` if used for faceting but not searching.
(For more information about the `copyField` directive, see <<documents-fields-and-schema-design.adoc#documents-fields-and-schema-design,Documents, Fields, and Schema Design>>.)
Unless otherwise specified, all of the parameters below can be specified on a per-field basis with the syntax of `f.<fieldname>.facet.<parameter>` Unless otherwise specified, all of the parameters below can be specified on a per-field basis with the syntax of `f.<fieldname>.facet.<parameter>`

@ -16,18 +16,18 @@
// specific language governing permissions and limitations // specific language governing permissions and limitations
// under the License. // under the License.
There are several types of changes to Solr configuration that require you to reindex your data. There are several types of changes to Solr configuration that require you to reindex your data, particularly changes to your schema.
These changes include editing properties of fields or field types; adding fields, field types, or copy field rules; These changes include editing properties of fields or field types; adding fields or copy field rules; upgrading Solr; and changing certain system configuration properties.
upgrading Solr; and some system configuration properties.
It's important to be aware that many changes require reindexing, because there are times when not reindexing It's important to be aware that failing to reindex can have both obvious and subtle consequences for Solr or for users finding what they are looking for.
can have negative consequences for Solr as a system, or for the ability of your users to find what they are looking for.
There is no process in Solr for programmatically reindexing data. When we say "reindex", we mean, literally, "Reindex" in this context means _first delete the existing index and repeat the process you used to ingest the entire corpus from the system-of-record_. It is strongly recommended that Solr users have a consistent, repeatable process for indexing so that the indexes can be recreated as the need arises.
"index it again". However you got the data into the index the first time, you will run that process again.
It is strongly recommended that Solr users index their data in a repeatable, consistent way, so that the process can be [CAUTION]
easily repeated when the need for reindexing arises. ====
Re-ingesting all the documents in your corpus without first ensuring that all documents and Lucene segments have been deleted is *not* sufficient; see the section <<reindexing.adoc#reindexing-strategies,Reindexing Strategies>>.
====
Reindexing is recommended during major upgrades, so in addition to covering what types of configuration changes should trigger a reindex, this section will also cover strategies for reindexing. Reindexing is recommended during major upgrades, so in addition to covering what types of configuration changes should trigger a reindex, this section will also cover strategies for reindexing.
@ -35,126 +35,55 @@ Reindexing is recommended during major upgrades, so in addition to covering what
=== Schema Changes === Schema Changes
All changes to a collection's schema require reindexing. This is because many of the available options are only With very few exceptions, changes to a collection's schema require reindexing. This is because many of the available options are only applied during the indexing process. Solr has no way to implement the desired change without reindexing the data.
applied during the indexing process. Solr simply has no way to implement the desired change without reindexing
the data.
To understand the general reason why reindexing is ever required, it's helpful to understand the relationship between To understand the general reason why reindexing is ever required, it's helpful to understand the relationship between Solr's schema and the underlying Lucene index. Lucene does not use a schema; schemas are a Solr-only concept. When you change Solr's schema, the Lucene index is not modified in any way.
Solr's schema and the underlying Lucene index. Lucene does not use a schema, it is a Solr-only concept. When you delete
a field from Solr's schema, it does not modify Lucene's index in any way. When you add a field to Solr's schema, the
field does not exist in Lucene's index until a document that contains the field is indexed.
This means that there are many types of schema changes that cannot be reflected in the index simply by modifying This means that there are many types of schema changes that cannot be reflected in the index simply by modifying Solr's schema. This is different from most database models where schemas are used. When indexing, Solr's schema acts like a rulebook for indexing documents by telling Lucene how to interpret the data being sent. Once the documents are in Lucene, Solr's schema has no control over the underlying data structure.
Solr's schema. This is different from most database models where schemas are used. With regard to indexing, Solr's
schema acts like a rulebook for indexing documents by telling Lucene how to interpret the data being sent. Once the
documents are in Lucene, Solr's schema has no control over the underlying data structure.
In addition to the types of schema changes described in the following sections, changing the schema `version` property In addition, changing the schema `version` property is equivalent to changing field type properties. This type of change is usually only made during or because of a major upgrade.
is equivalent to changing field type properties. This type of change is usually only made during or because of a major upgrade.
==== Adding or Deleting Fields
If you add or delete a field from Solr's schema, it's strongly recommended to reindex.
When you add a field, you generally do so with the intent to use the field in some way.
Since documents were indexed before the field was added, the index will not hold any references to the field for earlier documents.
If you want to use the new field for faceting, for example, the new field facet will not include any documents that were not indexed with the new field.
There is a slightly different situation when deleting a field.
In this case, since simply removing the field from the schema doesn't change anything about the index, the field will still be in the index until the documents are reindexed.
In fact, Lucene may keep a reference to a deleted field _forever_ (see also https://issues.apache.org/jira/browse/LUCENE-1761[LUCENE-1761]).
This may only be an issue for your environment if you try to add a field that has the same name as a deleted field,
but it can also be an issue for dynamic field rules that are later removed.
==== Changing Field and Field Type Properties ==== Changing Field and Field Type Properties
Solr has two ways of defining field properties. When you change your schema by adding fields, removing fields, or changing the field or field type definitions you generally do so with the intent that those changes alter how documents are searched. The full effects of those changes are not reflected in the corpus as a whole until all documents are reindexed.
The first is to define properties on a field type. These properties are then applied to all fields of that type unless they are explicitly overridden. Changes to *any* field/field type property described in <<field-type-definitions-and-properties.adoc#field-type-properties,Field Type Properties>> require reindexing in order for the change to be reflected in _all_ documents.
The second is an override to a property inherited from the field type defined on the field itself.
If a property has been defined for a field type but the property is not overridden by defining a different value for the
property for a field, then changing the property on the field type is equivalent to changing it on the field itself.
Changes to *any* field/field type property described in <<field-type-definitions-and-properties.adoc#field-type-properties,Field Type Properties>> must be reindexed in order for the change to be reflected in all documents.
The list of changes that require reindexing includes (but is not limited to):
* Changing a field from stored to not stored, and vice versa.
* Changing a field from indexed to not indexed, and vice versa.
* Changing a field from multi-valued to single-valued, and vice versa.
* <<Changing Field Analysis>>.
* Changing the `type` of field, or the `class` for a field type.
* Enabling or disabling <<docvalues.adoc#docvalues,docValues>>.
Be sure to reference the Field Type Properties section linked above for the complete list of properties that would require a reindex.
[CAUTION] [CAUTION]
==== ====
In some cases, it can be possible to change a field/field type property value and it will only apply to documents Changing field properties that affect indexing without reindexing is not recommended. This should only be attempted with a thorough understanding of the consequences. Negative impacts on the user may not be immediately apparent.
indexed _after_ the change.
For example, you could change a field from being indexed (`indexed="true"`) to no longer indexed (`indexed="false"`)
and over time, as documents are updated, the index will be purged of the fields that shouldn't be indexed anymore.
You could also change a field from not being stored (`stored="false"`) to being stored (`stored="true"`).
In this case, if you want to use the field immediately, only documents indexed after the change will contain data in the field.
However, you would need to ensure that your client is able to handle fields missing from documents that have
not yet been reindexed.
It's important to note this is not possible for all field/field type properties.
If you change whether or not docValues are enabled, for example, you absolutely must reindex.
This is due to the way docValues have been implemented in Lucene, and how Lucene handles docValues segments.
Changing any field properties without reindexing is _never_ recommended to ensure consistent behavior, and should only
be attempted when you have tested thoroughly and feel confident that you understand the ramifications on your
documents and front-end clients.
==== ====
==== Changing Field Analysis ==== Changing Field Analysis
Beyond specific field-level properties, <<analyzers.adoc#analyzers,analysis chains>> are also configured on field types, and are applied at index and/or query time. Beyond specific field-level properties, <<analyzers.adoc#analyzers,analysis chains>> are also configured on field types, and are applied at index and query time.
It's possible to define separate analysis chains for indexing and query events, or you can define a single chain If separate analysis chains are defined for query and indexing events for a field and you change _only_ the query-time analysis chain, reindexing is not necessary.
that is applied to both event types.
If you change the analysis chain that applies to indexing events, it is strongly recommended that you reindex. Any change to the index-time analysis chain requires reindexing in almost all cases.
This is because all of the changes that occur due to the chain configuration are applied to documents as they are
being indexed, and only reindexing will allow your changes to take effect on documents.
While reindexing after analyzer changes is not required, be aware that not reindexing can cause unexpected
query results in many cases.
For example, if you indexed a number of documents and then decide you'd like to use the `LowerCaseTokenizerFactory`
to ensure all text is converted to lower case, you will have a mix of entries in the field: some in their original
case ("iPhone"), and newer documents in all lower-case ("iphone"). If you do not reindex the original set of documents,
a query such as "iphone" will not match documents with "iPhone", because the schema rules enforce lower case on the
query, but that's not what is in the index.
The only time you do not have to reindex when changing a field type's analysis chain is when the changes impact
queries *only* (and you know that you do not need to make corresponding changes to the index analysis).
=== Solrconfig Changes === Solrconfig Changes
Identifying changes to solrconfig.xml that alter how data is ingested and thus require reindexing is less straightforward. The general rule is "anything that changes what gets stored in the index requires reindexing". Here are several known examples.
Only one parameter change to Solr's `solrconfig.xml` requires reindexing. That parameter is the `luceneMatchVersion`, The parameter `luceneMatchVersion` in solrconfig.xml controls the compatibility of Solr with Lucene. Since this parameter can change the rules for analysis behind the scenes, it's always recommended to reindex when changing it. Usually this is only changed in conjunction with a major upgrade.
which controls the compatibility of Solr with Lucene changes. Since this parameter can change the rules for analysis behind the scenes, it's always recommended to reindex when changing this value. Usually, however, this is only changed in conjunction with a major upgrade.
However, if you make a change to Solr's <<update-request-processors.adoc#update-request-processors,Update Request Processors>>, it's generally because you want to change something about how _update requests_ (documents) are _processed_ (indexed). In this case, you can decide based on the change if you want to reindex your documents to implement the changes you've made. If you make a change to Solr's <<update-request-processors.adoc#update-request-processors,Update Request Processors>>, it's generally because you want to change something about how _update requests_ (documents) are _processed_ (indexed). In this case, we recommend that you reindex your documents to implement the changes you've made just as if you had changed the schema.
Similarly, if you change the `codecFactory` parameter in `solrconfig.xml`, it is again strongly recommended that you Similarly, if you change the `codecFactory` parameter in `solrconfig.xml`, it is again strongly recommended that you
plan to reindex your documents to avoid unintended behavior. plan to reindex your documents to avoid unintended behavior.
== Upgrades == Upgrades
When upgrading between major versions (for example, from a 7.x release to 8.0 or 8.x), a best practice When upgrading between major versions (for example, from a 7.x release to 8.x), a best practice is to always reindex your data. The reason for this is that subtle changes may occur in default field type definitions or the underlying code.
is to always reindex your data.
The reason for this is that subtle changes may occur in default field type definitions or the underlying code. Lucene works hard to ensure back-compatibility across one major version, so Solr 8.x functions with indexes created with Solr 7.x. However, given that this guarantee does _not_ apply to Solr X-2 (Solr 6.x in this example), we still recommend completely reindexing when moving from Solr X-1 to Solr X.
[NOTE] [NOTE]
If you have *not* changed your schema as part of an upgrade from one minor release to another (such as, from 7.x If you have *not* changed your schema as part of an upgrade from one minor release to another (such as, from 7.x to a later 7.x release), you can often skip reindexing your documents. However, when upgrading to a major release, you should plan to reindex your documents.
to a later 7.x release), you can often skip reindexing your documents.
However, when upgrading to a major release, you should plan to reindex your documents because of the likelihood of [NOTE]
changes that break back-compatibility. You must always re-index your corpus when upgrading an index produced with a Solr version older than X-1. For instance, if you're upgrading to Solr 8.x, an index ever used by Solr 6.x must be deleted and re-ingested as outlined below. A marker is written identifying the version of Lucene used to ingest the first document. That marker is preserved in the index forever unless the index is entirely deleted. If Lucene finds a marker from a major version older than X-1, it will refuse to open the index.
== Reindexing Strategies == Reindexing Strategies
@ -163,6 +92,15 @@ There are a few approaches available to perform the reindex.
The strategies described below ensure that the Lucene index is completely dropped so you can recreate it to accommodate your changes. The strategies described below ensure that the Lucene index is completely dropped so you can recreate it to accommodate your changes.
They allow you to recreate the Lucene index without having Lucene segments lingering with stale data. They allow you to recreate the Lucene index without having Lucene segments lingering with stale data.
[CAUTION]
====
A Lucene index is a _lossy abstraction designed for fast search_. Once a document is added to the index, the original data cannot be assumed to be available. Therefore it is not possible for Lucene to "fix up" existing documents to reflect changes to the schema; they must be indexed again.
There are a number of technical reasons that make re-ingesting all documents correctly without deleting the entire corpus first difficult and error-prone to code and maintain.
Therefore, since all documents have to be re-ingested to ensure the abstraction faithfully reflects the new schema for all documents, we recommend either deleting all documents and ensuring that no old Lucene segments remain, or reindexing to a new collection.
====
=== Delete All Documents === Delete All Documents
The best approach is to first delete everything from the index, and then index your data again. The best approach is to first delete everything from the index, and then index your data again.
@ -174,9 +112,7 @@ curl -X POST -H 'Content-Type: application/json' --data-binary '{"delete":{"quer
It's important to verify that *all* documents have been deleted, as that ensures the Lucene index segments have been It's important to verify that *all* documents have been deleted, as that ensures the Lucene index segments have been
deleted as well. deleted as well.
To verify that there are no segments in your index, look in the data directory and confirm it is empty. To verify that there are no segments in your index, look in the data/index directory and confirm it has no segments files. Since the data directory can be customized, see the section <<datadir-and-directoryfactory-in-solrconfig.adoc#specifying-a-location-for-index-data-with-the-datadir-parameter,Specifying a Location for Index Data with the dataDir Parameter>> for the location of your index files.
Since the data directory can be customized, see the section <<datadir-and-directoryfactory-in-solrconfig.adoc#specifying-a-location-for-index-data-with-the-datadir-parameter,Specifying a Location for Index Data with the dataDir Parameter>>
for where to look to find the index files.
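The directory check described above can be scripted. The following is a minimal sketch, assuming Lucene commit files are named `segments_<generation>` (for example, `segments_5`); the function name `has_segments_files` and the default path are hypothetical placeholders, not part of Solr.

```shell
#!/bin/sh
# Sketch: report whether an index directory still holds Lucene segments files.
# Assumption: commit files are named "segments_<generation>", e.g. segments_5.

has_segments_files() {
  # $1 is the index directory to inspect (supplied by the caller).
  for f in "$1"/segments_*; do
    # If the glob matched nothing, $f is the literal pattern and -e fails.
    [ -e "$f" ] && return 0
  done
  return 1
}

# INDEX_DIR is a placeholder; point it at your core's data/index directory.
INDEX_DIR="${INDEX_DIR:-/var/solr/data/mycore/data/index}"
if has_segments_files "$INDEX_DIR"; then
  echo "segments files still present in $INDEX_DIR"
else
  echo "no segments files found in $INDEX_DIR"
fi
```

The sketch does not distinguish an empty index directory from a missing one, and it inspects a single directory only, so it would need to be run against each replica's index directory on each node.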
Note you will need to verify the indexes have been removed in every shard and every replica on every node of a cluster. Note you will need to verify the indexes have been removed in every shard and every replica on every node of a cluster.
It is not sufficient to only query for the number of documents because you may have no documents but still have index It is not sufficient to only query for the number of documents because you may have no documents but still have index
@ -184,25 +120,21 @@ segments.
Once the indexes have been cleared, you can start reindexing by re-running the original index process. Once the indexes have been cleared, you can start reindexing by re-running the original index process.
[NOTE]
If you can afford to have your collection offline for the duration of the reindexing process, a variation on this approach is to delete and recreate your collection using the updated schema, then reindex.
=== Index to Another Collection === Index to Another Collection
In cases where you cannot take a production collection offline to delete all the documents, one option is to use Solr's <<collection-aliasing.adoc#createalias,collection alias>> feature. Another approach is to index to a new collection and use Solr's <<collection-aliasing.adoc#createalias,collection alias>> feature to seamlessly point the application to the new collection without downtime.
This option is only available for Solr installations running in SolrCloud mode. This option is only available for Solr installations running in SolrCloud mode.
With this approach, you will index your documents into a newly created collection and once everything is completed, With this approach, you will index your documents into a new collection that uses your changes and, once indexing and testing are complete, create an alias that points your front-end at the new collection. From that point, new queries and updates will be routed to the new collection seamlessly.
create an alias for the collection and point your front-end at the collection alias. Queries will be routed
to the new collection seamlessly.
Here is an example of creating an alias that points to a single collection: Once the alias is in place and you are satisfied you no longer need the old data, you can delete the old collection with the Collections API <<collection-management.adoc#delete,DELETE command>>.
[source,bash] [NOTE]
http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=myData&collections=newCollection One advantage of this option is that you can switch back to the old collection if you discover problems your testing did not uncover. Of course, this option can require more resources until the old collection can be deleted.
Once the alias is in place and you are satisfied you no longer need the old data, you can delete the old collection with the <<collection-management.adoc#delete,DELETE command>> of the Collections API:
[source,bash]
http://localhost:8983/solr/admin/collections?action=DELETE&name=oldCollection
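The alias swap and the later cleanup can be sketched as two Collections API requests. This is an illustrative sketch only: the URLs follow the `CREATEALIAS` and `DELETE` examples shown in this diff, while `SOLR_URL`, the helper function names, and the alias and collection names are placeholders.

```shell
#!/bin/sh
# Sketch: build the Collections API URLs for the alias swap and old-collection cleanup.
# SOLR_URL, the alias name, and the collection names are placeholders.

SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

create_alias_url() {
  # $1 = alias name, $2 = collection the alias should point at
  printf '%s/admin/collections?action=CREATEALIAS&name=%s&collections=%s\n' "$SOLR_URL" "$1" "$2"
}

delete_collection_url() {
  # $1 = collection to delete once it is no longer needed
  printf '%s/admin/collections?action=DELETE&name=%s\n' "$SOLR_URL" "$1"
}

# Print the requests rather than sending them, so they can be reviewed first.
# To send one: curl "$(create_alias_url myData newCollection)"
create_alias_url myData newCollection
delete_collection_url oldCollection
```

Printing the URLs instead of calling them keeps the destructive `DELETE` step a deliberate, separate action, which matters if you want the option of switching back to the old collection.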
== Changes that Do Not Require Reindex == Changes that Do Not Require Reindex