= IndexConfig in SolrConfig
:page-shortname: indexconfig-in-solrconfig
:page-permalink: indexconfig-in-solrconfig.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
The `<indexConfig>` section of `solrconfig.xml` defines low-level behavior of the Lucene index writers.
By default, the settings are commented out in the sample `solrconfig.xml` included with Solr, which means the defaults are used. In most cases, the defaults are fine.
[source,xml]
----
<indexConfig>
...
</indexConfig>
----
[[IndexConfiginSolrConfig-WritingNewSegments]]
== Writing New Segments
[[IndexConfiginSolrConfig-ramBufferSizeMB]]
=== ramBufferSizeMB
Once accumulated document updates exceed this much memory (specified in megabytes), the pending updates are flushed. A flush can also create new segments or trigger a merge. Using this setting is generally preferable to `maxBufferedDocs`. If both `maxBufferedDocs` and `ramBufferSizeMB` are set in `solrconfig.xml`, a flush occurs when either limit is reached. The default is 100 MB.
[source,xml]
----
<ramBufferSizeMB>100</ramBufferSizeMB>
----
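As a sketch of the interaction described above, the following fragment sets both limits, so that a flush happens at whichever limit is reached first. The values are illustrative assumptions, not recommendations.

[source,xml]
----
<!-- Illustrative values only: flush once buffered updates reach 256 MB
     or 50000 buffered documents, whichever comes first. -->
<ramBufferSizeMB>256</ramBufferSizeMB>
<maxBufferedDocs>50000</maxBufferedDocs>
----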
[[IndexConfiginSolrConfig-maxBufferedDocs]]
=== maxBufferedDocs
Sets the number of document updates to buffer in memory before they are flushed as a new segment. This may also trigger a merge. By default, Solr flushes by RAM usage (`ramBufferSizeMB`) rather than by document count.
[source,xml]
----
<maxBufferedDocs>1000</maxBufferedDocs>
----
[[IndexConfiginSolrConfig-useCompoundFile]]
=== useCompoundFile
Controls whether newly written (and not yet merged) index segments should use the <<IndexConfiginSolrConfig-CompoundFileSegments,Compound File Segment>> format. The default is false.
[source,xml]
----
<useCompoundFile>false</useCompoundFile>
----
[[IndexConfiginSolrConfig-MergingIndexSegments]]
== Merging Index Segments
[[IndexConfiginSolrConfig-mergePolicyFactory]]
=== mergePolicyFactory
Defines how merging segments is done.
The default in Solr is to use a `TieredMergePolicy`, which merges segments of approximately equal size, subject to an allowed number of segments per tier.
Other policies available are the `LogByteSizeMergePolicy` and `LogDocMergePolicy`. For more information on these policies, please see {lucene-javadocs}/core/org/apache/lucene/index/MergePolicy.html[the MergePolicy javadocs].
[source,xml]
----
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>
----
[[merge-factors]]
=== Controlling Segment Sizes: Merge Factors
The most common adjustment users make to the configuration of TieredMergePolicy (or LogByteSizeMergePolicy) is to the "merge factors", which change how many segments are merged at one time.
For TieredMergePolicy, this is controlled by setting the `<int name="maxMergeAtOnce">` and `<int name="segmentsPerTier">` options, while LogByteSizeMergePolicy has a single `<int name="mergeFactor">` option (all of which default to `10`).
To understand why these options are important, consider what happens when an update is made to an index using LogByteSizeMergePolicy: Documents are always added to the most recently opened segment. When a segment fills up, a new segment is created and subsequent updates are placed there.
If creating a new segment would cause the number of lowest-level segments to exceed the `mergeFactor` value, then all those segments are merged together to form a single large segment. Thus, if the merge factor is 10, each merge results in the creation of a single segment that is roughly ten times larger than each of its ten constituents. When there are 10 of these larger segments, then they in turn are merged into an even larger single segment. This process can continue indefinitely.
When using TieredMergePolicy, the process is the same, but instead of a single `mergeFactor` value, the `segmentsPerTier` setting is used as the threshold to decide if a merge should happen, and the `maxMergeAtOnce` setting determines how many segments should be included in the merge.
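For comparison, a minimal sketch of selecting LogByteSizeMergePolicy with its single `mergeFactor` option might look like the following. It assumes the `org.apache.solr.index.LogByteSizeMergePolicyFactory` class is available in your Solr version; the value shown is simply the default mentioned above.

[source,xml]
----
<!-- Assumed factory class name; verify it against the Javadocs for your Solr version. -->
<mergePolicyFactory class="org.apache.solr.index.LogByteSizeMergePolicyFactory">
  <int name="mergeFactor">10</int>
</mergePolicyFactory>
----

With this value, ten lowest-level segments are merged into one segment roughly ten times their size, and ten of those are in turn merged into one segment roughly one hundred times the original size.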
Choosing the best merge factors is generally a trade-off of indexing speed vs. searching speed. Having fewer segments in the index generally accelerates searches, because there are fewer places to look. It can also result in fewer physical files on disk. But to keep the number of segments low, merges will occur more often, which can add load to the system and slow down updates to the index.
Conversely, keeping more segments can accelerate indexing, because merges happen less often and an update is less likely to trigger one. But searches become more computationally expensive and will likely be slower, because search terms must be looked up in more index segments. Faster index updates also mean shorter commit turnaround times, which means more timely search results.
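To make the trade-off concrete, the two fragments below sketch how the same `TieredMergePolicyFactory` options might be tilted in either direction. The specific numbers are assumptions chosen for illustration and should be validated against your own workload. Only one `mergePolicyFactory` would be used in a given configuration.

[source,xml]
----
<!-- Illustrative: lower merge factors keep fewer segments, favoring search speed
     at the cost of more frequent merges. -->
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">5</int>
  <int name="segmentsPerTier">5</int>
</mergePolicyFactory>
----

[source,xml]
----
<!-- Illustrative: higher merge factors allow more segments, favoring indexing speed
     at the cost of slower searches. -->
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">20</int>
  <int name="segmentsPerTier">20</int>
</mergePolicyFactory>
----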
[[IndexConfiginSolrConfig-CustomizingMergePolicies]]
=== Customizing Merge Policies
If the configuration options for the built-in merge policies do not fully suit your use case, you can customize them: either by creating a custom merge policy factory that you specify in your configuration, or by configuring a {solr-javadocs}/solr-core/org/apache/solr/index/WrapperMergePolicyFactory.html[merge policy wrapper] which uses a `wrapped.prefix` configuration option to control how the factory it wraps will be configured:
[source,xml]
----
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp desc</str>
  <str name="wrapped.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  <int name="inner.maxMergeAtOnce">10</int>
  <int name="inner.segmentsPerTier">10</int>
</mergePolicyFactory>
----
The example above shows Solr's {solr-javadocs}/solr-core/org/apache/solr/index/SortingMergePolicyFactory.html[`SortingMergePolicyFactory`] being configured to sort documents in merged segments by `"timestamp desc"`, and wrapped around a `TieredMergePolicyFactory` configured to use the values `maxMergeAtOnce=10` and `segmentsPerTier=10` via the `inner` prefix defined by the `wrapped.prefix` option of `SortingMergePolicyFactory`. For more information on using `SortingMergePolicyFactory`, see <<common-query-parameters.adoc#CommonQueryParameters-ThesegmentTerminateEarlyParameter,the segmentTerminateEarly parameter>>.
[[IndexConfiginSolrConfig-mergeScheduler]]
=== mergeScheduler
The merge scheduler controls how merges are performed. The default `ConcurrentMergeScheduler` performs merges in the background using separate threads. The alternative, `SerialMergeScheduler`, does not use separate threads: it performs merges sequentially on the thread that triggered them.
[source,xml]
----
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
----
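Switching to the alternative scheduler named above is a matter of changing the class name. A sketch, assuming the Lucene class `org.apache.lucene.index.SerialMergeScheduler` (in the same package as the default):

[source,xml]
----
<!-- Merges run one at a time, without separate background threads. -->
<mergeScheduler class="org.apache.lucene.index.SerialMergeScheduler"/>
----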
[[IndexConfiginSolrConfig-mergedSegmentWarmer]]
=== mergedSegmentWarmer
When using Solr for <<near-real-time-searching.adoc#near-real-time-searching,Near Real Time Searching>>, a merged segment warmer can be configured to warm the reader on the newly merged segment before the merge commits. This is not required for near real-time search, but it reduces search latency when a new near real-time reader is opened after a merge completes.
[source,xml]
----
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
----
[[IndexConfiginSolrConfig-CompoundFileSegments]]
== Compound File Segments
Each Lucene segment is typically composed of a dozen or so files. Lucene can be configured to bundle all of the files for a segment into a single compound file with the file extension `.cfs`, an abbreviation for Compound File Segment.
CFS segments may incur a minor performance hit for various reasons, depending on the runtime environment. For example, filesystem buffers are typically associated with open file descriptors, which may limit the total cache space available to each index.
On systems where the number of open files allowed per process is limited, CFS may avoid hitting that limit. The open files limit might also be tunable for your OS with the Linux/Unix `ulimit` command, or something similar for other operating systems.
.CFS: New Segments vs Merged Segments
[NOTE]
====
To configure whether _newly written segments_ should use CFS, see the <<IndexConfiginSolrConfig-useCompoundFile,`useCompoundFile`>> setting described above. To configure whether _merged segments_ use CFS, review the Javadocs for your <<IndexConfiginSolrConfig-mergePolicyFactory,`mergePolicyFactory`>>.
Many <<IndexConfiginSolrConfig-MergingIndexSegments,Merge Policy>> implementations support `noCFSRatio` and `maxCFSSegmentSizeMB` settings with default values that prevent compound files from being used for large segments, but do use compound files for small segments.
====
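As a hedged sketch, the following shows how these settings might be passed to `TieredMergePolicyFactory`. It assumes the factory forwards `noCFSRatio` and `maxCFSSegmentSizeMB` to the underlying merge policy as `double` values; confirm the supported names and types in the Javadocs for your merge policy, and treat the numbers as placeholders.

[source,xml]
----
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <!-- Assumed: use CFS only for merged segments smaller than 10% of the total index size. -->
  <double name="noCFSRatio">0.1</double>
  <!-- Assumed: never use CFS for merged segments larger than 1024 MB. -->
  <double name="maxCFSSegmentSizeMB">1024</double>
</mergePolicyFactory>
----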
[[IndexConfiginSolrConfig-IndexLocks]]
== Index Locks
[[IndexConfiginSolrConfig-lockType]]
=== lockType
The LockFactory options specify the locking implementation to use.
The set of valid lock type options depends on the <<datadir-and-directoryfactory-in-solrconfig.adoc#datadir-and-directoryfactory-in-solrconfig,DirectoryFactory>> you have configured. The values listed below are supported by `StandardDirectoryFactory` (the default):
* `native` (default) uses NativeFSLockFactory to specify native OS file locking. If a second Solr process attempts to access the directory, it will fail. Do not use when multiple Solr web applications are attempting to share a single index.
* `simple` uses SimpleFSLockFactory to specify a plain file for locking.
* `single` (expert) uses SingleInstanceLockFactory. Use for special situations of a read-only index directory, or when there is no possibility of more than one process trying to modify the index (even sequentially). This type will protect against multiple cores within the _same_ JVM attempting to access the same index. WARNING! If multiple Solr instances in different JVMs modify an index, this type will _not_ protect against index corruption.
* `hdfs` uses HdfsLockFactory to support reading and writing index and transaction log files to an HDFS filesystem. See the section <<running-solr-on-hdfs.adoc#running-solr-on-hdfs,Running Solr on HDFS>> for more details on using this feature.
For more information on the nuances of each LockFactory, see http://wiki.apache.org/lucene-java/AvailableLockFactories.
[source,xml]
----
<lockType>native</lockType>
----
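For instance, a core that only serves a read-only index directory might select the `single` type. This is a sketch of the expert case described above, not a general recommendation:

[source,xml]
----
<!-- Expert setting: only safe when no other process can modify the index. -->
<lockType>single</lockType>
----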
[[IndexConfiginSolrConfig-writeLockTimeout]]
=== writeLockTimeout
The maximum time to wait for a write lock on an IndexWriter. The default is 1000, expressed in milliseconds.
[source,xml]
----
<writeLockTimeout>1000</writeLockTimeout>
----
[[IndexConfiginSolrConfig-OtherIndexingSettings]]
== Other Indexing Settings
There are a few other parameters that may be important to configure for your implementation. These settings affect how or when updates are made to an index.
`reopenReaders`:: Controls whether IndexReaders are re-opened rather than closed and then opened again, which is often less efficient. The default is true.
`deletionPolicy`:: Controls how commits are retained in case of rollback. The default is `SolrDeletionPolicy`, which has sub-parameters for the maximum number of commits to keep (`maxCommitsToKeep`), the maximum number of optimized commits to keep (`maxOptimizedCommitsToKeep`), and the maximum age of any commit to keep (`maxCommitAge`), which supports `DateMathParser` syntax.
`infoStream`:: The InfoStream setting instructs the underlying Lucene classes to write detailed debug information from the indexing process as Solr log messages.
[source,xml]
----
<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
  <str name="maxCommitAge">1DAY</str>
</deletionPolicy>
<infoStream>false</infoStream>
----
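As an illustration of how these settings might be varied, the following sketch retains several recent commits for a limited time and turns on debug logging from the indexing process. The values are assumptions for demonstration only; `30MINUTES` is written in the same style as the `1DAY` value in the example above.

[source,xml]
----
<!-- Illustrative values: retain up to 5 commits and 1 optimized commit,
     subject to a maximum commit age of 30 minutes. -->
<deletionPolicy class="solr.SolrDeletionPolicy">
  <str name="maxCommitsToKeep">5</str>
  <str name="maxOptimizedCommitsToKeep">1</str>
  <str name="maxCommitAge">30MINUTES</str>
</deletionPolicy>
<!-- Write detailed Lucene indexing debug information to the Solr log. -->
<infoStream>true</infoStream>
----

Keeping more commits uses more disk space, and enabling `infoStream` can produce a large volume of log output, so both are best enabled deliberately.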