mirror of https://github.com/apache/lucene.git
= De-Duplication
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

If duplicate or near-duplicate documents are a concern in your index, de-duplication may be worth implementing.

A low-collision or fuzzy hash algorithm is an efficient way to prevent duplicate or near-duplicate documents from entering an index, or to tag documents with a signature/fingerprint for duplicate field collapsing. Solr natively supports de-duplication techniques of this type via the `Signature` class and makes it easy to add new hash/signature implementations. The following `Signature` implementations are available:

* MD5Signature: 128-bit hash used for exact duplicate detection.
* Lookup3Signature: 64-bit hash used for exact duplicate detection. It is much faster than MD5 and smaller to index.
* http://wiki.apache.org/solr/TextProfileSignature[TextProfileSignature]: Fuzzy hashing implementation from Apache Nutch for near-duplicate detection. It is tunable but works best on longer text.

Other, more sophisticated algorithms for fuzzy/near hashing can be added later.

[IMPORTANT]
====
Adding the de-duplication process changes the `allowDups` setting so that it applies to an update term (with `signatureField` in this case) rather than to the unique field term.

Of course, the `signatureField` could be the unique field, but generally you want the unique field to be unique. When a document is added, a signature is automatically generated and attached to the document in the specified `signatureField`.
====

== Configuration Options

There are two places in Solr to configure de-duplication: in `solrconfig.xml` and in `schema.xml`.

=== In solrconfig.xml

The `SignatureUpdateProcessorFactory` has to be registered in `solrconfig.xml` as part of an <<update-request-processors.adoc#update-request-processors,Update Request Processor Chain>>, as in this example:

[source,xml]
----
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
----

The `SignatureUpdateProcessorFactory` takes several properties:

signatureClass::
A Signature implementation for generating a signature hash. The default is `org.apache.solr.update.processor.Lookup3Signature`.
+
The fully qualified class name of the implementation must be specified. The available options are described above; the associated class names are:

* `org.apache.solr.update.processor.Lookup3Signature`
* `org.apache.solr.update.processor.MD5Signature`
* `org.apache.solr.update.processor.TextProfileSignature`

fields::
The fields to use to generate the signature hash, as a comma-separated list. By default, all fields on the document are used.

signatureField::
The name of the field used to hold the fingerprint/signature. The field should be defined in `schema.xml`. The default is `signatureField`.

enabled::
Set to *false* to disable de-duplication processing. The default is *true*.

overwriteDupes::
If *true* (the default), an existing document that matches this signature is overwritten.

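The properties above can be combined in other ways. As a minimal sketch, assuming a dedicated signature field named `signatureField` (the default name) and a chain name `dedupe-fuzzy` chosen here only for illustration, the following chain computes a fuzzy signature over the `name`, `features`, and `cat` fields and overwrites an existing document that has the same signature:

[source,xml]
----
<!-- Sketch only: the chain name "dedupe-fuzzy" is a hypothetical name used for illustration. -->
<updateRequestProcessorChain name="dedupe-fuzzy">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Store the signature in its own field instead of the unique field. -->
    <str name="signatureField">signatureField</str>
    <!-- Overwrite an existing document that has the same signature. -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <!-- Fuzzy hash intended for near-duplicate detection. -->
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
----

In this layout the unique field is left untouched, and the separate signature field has to be defined in the schema.
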
=== In schema.xml

If you store the signature in a separate field, that field must be indexed:

[source,xml]
----
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
----

Be sure to change your update handlers in `solrconfig.xml` to use the defined chain, as below:

[source,xml]
----
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
...
</requestHandler>
----

This example assumes you have other sections of your request handler defined.

[TIP]
====
The update processor chain can also be specified per request with a parameter of `update.chain=dedupe`.
====

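To illustrate the per-request form, here is a sketch of an XML update message that could be sent as the body of a request to `/update?update.chain=dedupe`. The field names match the `fields` setting of the example `dedupe` chain; the field values are invented for illustration, and the request path assumes an `/update` handler like the one configured earlier.

[source,xml]
----
<!-- Sketch only: the field values are made up for illustration. -->
<!-- Intended as the body of a request to /update?update.chain=dedupe -->
<add>
  <doc>
    <field name="name">Example product</field>
    <field name="features">Example feature text</field>
    <field name="cat">electronics</field>
  </doc>
</add>
----

With the example `dedupe` chain, where `signatureField` is `id`, the generated signature is written to the `id` field, so two documents with identical `name`, `features`, and `cat` values should end up with the same `id`.
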