SOLR-9526: Data driven schema now indexes text field "foo" as both "foo" (text_general) and as "foo_str" (string) to facilitate both search and faceting

Jan Høydahl 2017-07-06 15:56:51 +02:00
parent 80b1430a3e
commit a60ec1b432
13 changed files with 425 additions and 68 deletions

@ -183,11 +183,15 @@ Upgrading from Solr 6.x
* The unused 'valType' option has been removed from ExternalFileField. If you have this in your schema you
can safely remove it. See SOLR-10929 for more details.
* SOLR-10574: basic_configs and data_driven_schema_configs have now been merged into _default. It has data driven nature
* Config sets basic_configs and data_driven_schema_configs have now been merged into _default. It has data driven nature
enabled by default, and can be turned off (after creating a collection) with:
curl http://host:8983/solr/mycollection/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'
Please see SOLR-10574 for details.
* The data driven config (now _default) for auto-creating fields earlier defaulted to "string" for text input.
Now the default is to use "text_general" and to add a copyField to the schema, copying to a "*_str" dynamic field,
with a cutoff at 256 characters. This enables full text search as well as faceting. See SOLR-9526 for more.
* SOLR-10123: The Analytics Component has been upgraded to support distributed collections, expressions over multivalued
fields, a new JSON request language, and more. DocValues are now required for any field used in the analytics expression
whereas previously docValues was not required. Please see SOLR-10123 for details.
@ -254,10 +258,16 @@ New Features
* SOLR-10406: v2 API error messages list the URL request path as /solr/____v2/... when the original path was /v2/... (Cao Manh Dat, noble)
* SOLR-10574: New _default config set replacing basic_configs and data_driven_schema_configs.
(Ishan Chattopadhyaya, noble, shalin, hossman, David Smiley, Jan Hoydahl, Alexandre Rafalovich)
(Ishan Chattopadhyaya, noble, shalin, hossman, David Smiley, janhoy, Alexandre Rafalovich)
* SOLR-10272: Use _default config set if no collection.configName is specified with CREATE (Ishan Chattopadhyaya)
* SOLR-9526: Data driven schema now indexes text field "foo" as both "foo" (text_general) and as "foo_str" (string)
to facilitate both search and faceting. AddSchemaFieldsUpdateProcessor now has the ability to add a "copyField" to
the type mappings, with an optional maxChars limitation. You can also define one typeMapping as default.
This also solves issues SOLR-8495, SOLR-6966, and SOLR-7058.
(janhoy, Steve Rowe, hossman, Alexandre Rafalovich, Shawn Heisey, Cao Manh Dat)
* SOLR-10123: Upgraded the Analytics Component to version 2.0 which now supports distributed collections, expressions over
multivalued fields, a new JSON request language, and more. DocValues are now required for any field used in the analytics
expression whereas previously docValues was not required. Please see SOLR-10123 for details. (Houston Putman)
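
The SOLR-9526 entry above describes text input landing in both "foo" and "foo_str", with a 256 character cutoff on the copy. The following is a minimal sketch of that behaviour only, not Solr code: the class and method names are hypothetical, and in Solr the truncation happens inside copyField handling at index time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Solr code.
class DualIndexSketch {
    // Cutoff taken from the upgrade note above.
    static final int MAX_CHARS = 256;

    // Returns the values that would land in "<field>" and "<field>_str".
    static Map<String, String> index(String field, String value) {
        Map<String, String> out = new LinkedHashMap<>();
        // Full text; in Solr this field would be analyzed as text_general.
        out.put(field, value);
        // Verbatim copy for faceting, truncated to MAX_CHARS characters.
        String copy = value.length() > MAX_CHARS ? value.substring(0, MAX_CHARS) : value;
        out.put(field + "_str", copy);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(index("foo", "hello world"));
    }
}
```

Short values are copied unchanged; only values longer than the cutoff differ between the two fields.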

@ -109,8 +109,11 @@ public final class ManagedIndexSchema extends IndexSchema {
}
/** Persist the schema to local storage or to ZooKeeper */
boolean persistManagedSchema(boolean createOnly) {
/**
* Persist the schema to local storage or to ZooKeeper
* @param createOnly set to false to allow update of existing schema
*/
public boolean persistManagedSchema(boolean createOnly) {
if (loader instanceof ZkSolrResourceLoader) {
return persistManagedSchemaToZooKeeper(createOnly);
}

@ -26,6 +26,7 @@ import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;
@ -128,7 +129,11 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
private static final String VALUE_CLASS_PARAM = "valueClass";
private static final String FIELD_TYPE_PARAM = "fieldType";
private static final String DEFAULT_FIELD_TYPE_PARAM = "defaultFieldType";
private static final String COPY_FIELD_PARAM = "copyField";
private static final String DEST_PARAM = "dest";
private static final String MAX_CHARS_PARAM = "maxChars";
private static final String IS_DEFAULT_PARAM = "default";
private List<TypeMapping> typeMappings = Collections.emptyList();
private SelectorParams inclusions = new SelectorParams();
private Collection<SelectorParams> exclusions = new ArrayList<>();
@ -152,16 +157,18 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
validateSelectorParams(exclusion);
}
Object defaultFieldTypeParam = args.remove(DEFAULT_FIELD_TYPE_PARAM);
if (null == defaultFieldTypeParam) {
throw new SolrException(SERVER_ERROR, "Missing required init param '" + DEFAULT_FIELD_TYPE_PARAM + "'");
} else {
if (null != defaultFieldTypeParam) {
if ( ! (defaultFieldTypeParam instanceof CharSequence)) {
throw new SolrException(SERVER_ERROR, "Init param '" + DEFAULT_FIELD_TYPE_PARAM + "' must be a <str>");
}
defaultFieldType = defaultFieldTypeParam.toString();
}
defaultFieldType = defaultFieldTypeParam.toString();
typeMappings = parseTypeMappings(args);
if (null == defaultFieldType && typeMappings.stream().noneMatch(TypeMapping::isDefault)) {
throw new SolrException(SERVER_ERROR, "Must specify either '" + DEFAULT_FIELD_TYPE_PARAM +
"' or declare one typeMapping as default.");
}
super.init(args);
}
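
The relaxed init check in the hunk above accepts a configuration when either defaultFieldType is set or one typeMapping is flagged as default. A minimal sketch of that rule, assuming a hypothetical stand-in Mapping type and a plain IllegalArgumentException in place of SolrException:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in types; only the rule itself comes from the diff.
class DefaultMappingCheck {
    static final class Mapping {
        final String fieldType;
        final boolean isDefault;

        Mapping(String fieldType, boolean isDefault) {
            this.fieldType = fieldType;
            this.isDefault = isDefault;
        }
    }

    // Throws unless a fallback type or a default mapping is configured.
    static void validate(String defaultFieldType, List<Mapping> mappings) {
        boolean hasDefaultMapping = mappings.stream().anyMatch(m -> m.isDefault);
        if (defaultFieldType == null && !hasDefaultMapping) {
            throw new IllegalArgumentException(
                "Must specify either 'defaultFieldType' or declare one typeMapping as default.");
        }
    }

    public static void main(String[] args) {
        validate("text", Collections.emptyList());
        validate(null, Arrays.asList(new Mapping("text_general", true)));
        System.out.println("both configurations accepted");
    }
}
```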
@ -207,8 +214,59 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
throw new SolrException(SERVER_ERROR,
"Each '" + TYPE_MAPPING_PARAM + "' <lst/> must contain at least one '" + VALUE_CLASS_PARAM + "' <str>");
}
typeMappings.add(new TypeMapping(fieldType, valueClasses));
// isDefault (optional)
Boolean isDefault = false;
Object isDefaultObj = typeMappingNamedList.remove(IS_DEFAULT_PARAM);
if (null != isDefaultObj) {
if ( ! (isDefaultObj instanceof Boolean)) {
throw new SolrException(SERVER_ERROR, "'" + IS_DEFAULT_PARAM + "' init param must be a <bool>");
}
if (null != typeMappingNamedList.get(IS_DEFAULT_PARAM)) {
throw new SolrException(SERVER_ERROR,
"Each '" + TYPE_MAPPING_PARAM + "' <lst/> may contain only one '" + IS_DEFAULT_PARAM + "' <bool>");
}
isDefault = Boolean.parseBoolean(isDefaultObj.toString());
}
Collection<CopyFieldDef> copyFieldDefs = new ArrayList<>();
while (typeMappingNamedList.get(COPY_FIELD_PARAM) != null) {
Object copyFieldObj = typeMappingNamedList.remove(COPY_FIELD_PARAM);
if ( ! (copyFieldObj instanceof NamedList)) {
throw new SolrException(SERVER_ERROR, "'" + COPY_FIELD_PARAM + "' init param must be a <lst>");
}
NamedList copyFieldNamedList = (NamedList)copyFieldObj;
// dest
Object destObj = copyFieldNamedList.remove(DEST_PARAM);
if (null == destObj) {
throw new SolrException(SERVER_ERROR,
"Each '" + COPY_FIELD_PARAM + "' <lst/> must contain a '" + DEST_PARAM + "' <str>");
}
if ( ! (destObj instanceof CharSequence)) {
throw new SolrException(SERVER_ERROR, "'" + DEST_PARAM + "' init param must be a <str>");
}
if (null != copyFieldNamedList.get(DEST_PARAM)) {
throw new SolrException(SERVER_ERROR,
"Each '" + COPY_FIELD_PARAM + "' <lst/> may contain only one '" + DEST_PARAM + "' <str>");
}
String dest = destObj.toString();
// maxChars (optional)
Integer maxChars = 0;
Object maxCharsObj = copyFieldNamedList.remove(MAX_CHARS_PARAM);
if (null != maxCharsObj) {
if ( ! (maxCharsObj instanceof Integer)) {
throw new SolrException(SERVER_ERROR, "'" + MAX_CHARS_PARAM + "' init param must be an <int>");
}
if (null != copyFieldNamedList.get(MAX_CHARS_PARAM)) {
throw new SolrException(SERVER_ERROR,
"Each '" + COPY_FIELD_PARAM + "' <lst/> may contain only one '" + MAX_CHARS_PARAM + "' <int>");
}
maxChars = Integer.parseInt(maxCharsObj.toString());
}
copyFieldDefs.add(new CopyFieldDef(dest, maxChars));
}
typeMappings.add(new TypeMapping(fieldType, valueClasses, isDefault, copyFieldDefs));
if (0 != typeMappingNamedList.size()) {
throw new SolrException(SERVER_ERROR,
"Unexpected '" + TYPE_MAPPING_PARAM + "' init sub-param(s): '" + typeMappingNamedList.toString() + "'");
@ -233,11 +291,16 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
private static class TypeMapping {
public String fieldTypeName;
public Collection<String> valueClassNames;
public Collection<CopyFieldDef> copyFieldDefs;
public Set<Class<?>> valueClasses;
public Boolean isDefault;
public TypeMapping(String fieldTypeName, Collection<String> valueClassNames) {
public TypeMapping(String fieldTypeName, Collection<String> valueClassNames, boolean isDefault,
Collection<CopyFieldDef> copyFieldDefs) {
this.fieldTypeName = fieldTypeName;
this.valueClassNames = valueClassNames;
this.isDefault = isDefault;
this.copyFieldDefs = copyFieldDefs;
// this.valueClasses population is delayed until the schema is available
}
@ -257,6 +320,38 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
}
}
}
public boolean isDefault() {
return isDefault;
}
}
private static class CopyFieldDef {
private final String destGlob;
private final Integer maxChars;
public CopyFieldDef(String destGlob, Integer maxChars) {
this.destGlob = destGlob;
this.maxChars = maxChars;
if (destGlob.contains("*") && (!destGlob.startsWith("*") && !destGlob.endsWith("*"))) {
throw new SolrException(SERVER_ERROR, "dest '" + destGlob +
"' is invalid. Must either be a plain field name or start or end with '*'");
}
}
public Integer getMaxChars() {
return maxChars;
}
public String getDest(String srcFieldName) {
if (!destGlob.contains("*")) {
return destGlob;
} else if (destGlob.startsWith("*")) {
return srcFieldName + destGlob.substring(1);
} else {
return destGlob.substring(0,destGlob.length()-1) + srcFieldName;
}
}
}
private class AddSchemaFieldsUpdateProcessor extends UpdateRequestProcessor {
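
The dest handling in CopyFieldDef above can be exercised in isolation. The sketch below mirrors the same three cases (plain name, leading '*', trailing '*') and the same rejection rule, under the assumption of a hypothetical standalone class that throws IllegalArgumentException rather than SolrException:

```java
// Hypothetical standalone copy of the glob rules shown in CopyFieldDef.
class DestGlobSketch {
    // Rejects globs where '*' is neither the first nor the last character.
    static void check(String destGlob) {
        if (destGlob.contains("*") && !destGlob.startsWith("*") && !destGlob.endsWith("*")) {
            throw new IllegalArgumentException("dest '" + destGlob
                + "' is invalid. Must either be a plain field name or start or end with '*'");
        }
    }

    // Resolves the destination field name for a given source field.
    static String resolve(String destGlob, String srcFieldName) {
        check(destGlob);
        if (!destGlob.contains("*")) {
            return destGlob;                          // plain field name
        } else if (destGlob.startsWith("*")) {
            return srcFieldName + destGlob.substring(1); // "*_str" style suffix
        } else {
            // trailing '*': keep the prefix and append the source name
            return destGlob.substring(0, destGlob.length() - 1) + srcFieldName;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("*_str", "title"));
        System.out.println(resolve("all_*", "title"));
        System.out.println(resolve("catchall", "title"));
    }
}
```

The prefix form "all_*" and the plain name "catchall" are made-up examples; only "*_str", "*_t" and "*2_t" appear in the configs of this commit.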
@ -278,6 +373,8 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
IndexSchema oldSchema = cmd.getReq().getSchema();
for (;;) {
List<SchemaField> newFields = new ArrayList<>();
// Group copyField defs per field and then per maxChar, to adapt to IndexSchema API
Map<String,Map<Integer,List<CopyFieldDef>>> newCopyFields = new HashMap<>();
// build a selector each time through the loop b/c the schema we are
// processing may have changed
FieldNameSelector selector = buildSelector(oldSchema);
@ -285,12 +382,20 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
getUnknownFields(selector, doc, unknownFields);
for (final Map.Entry<String,List<SolrInputField>> entry : unknownFields.entrySet()) {
String fieldName = entry.getKey();
String fieldTypeName = mapValueClassesToFieldType(entry.getValue());
String fieldTypeName = defaultFieldType;
TypeMapping typeMapping = mapValueClassesToFieldType(entry.getValue());
if (typeMapping != null) {
fieldTypeName = typeMapping.fieldTypeName;
if (!typeMapping.copyFieldDefs.isEmpty()) {
newCopyFields.put(fieldName,
typeMapping.copyFieldDefs.stream().collect(Collectors.groupingBy(CopyFieldDef::getMaxChars)));
}
}
newFields.add(oldSchema.newField(fieldName, fieldTypeName, Collections.<String,Object>emptyMap()));
}
if (newFields.isEmpty()) {
if (newFields.isEmpty() && newCopyFields.isEmpty()) {
// nothing to do - no fields will be added - exit from the retry loop
log.debug("No fields to add to the schema.");
log.debug("No fields or copyFields to add to the schema.");
break;
} else if ( isImmutableConfigSet(core) ) {
final String message = "This ConfigSet is immutable.";
@ -298,7 +403,7 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
}
if (log.isDebugEnabled()) {
StringBuilder builder = new StringBuilder();
builder.append("Fields to be added to the schema: [");
builder.append("\nFields to be added to the schema: [");
boolean isFirst = true;
for (SchemaField field : newFields) {
builder.append(isFirst ? "" : ",");
@ -307,20 +412,44 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
builder.append("{type=").append(field.getType().getTypeName()).append("}");
}
builder.append("]");
builder.append("\nCopyFields to be added to the schema: [");
isFirst = true;
for (String fieldName : newCopyFields.keySet()) {
builder.append(isFirst ? "" : ",");
isFirst = false;
builder.append("source=").append(fieldName).append("{");
for (List<CopyFieldDef> copyFieldDefList : newCopyFields.get(fieldName).values()) {
for (CopyFieldDef copyFieldDef : copyFieldDefList) {
builder.append("{dest=").append(copyFieldDef.getDest(fieldName));
builder.append(", maxChars=").append(copyFieldDef.getMaxChars()).append("}");
}
}
builder.append("}");
}
builder.append("]");
log.debug(builder.toString());
}
// Need to hold the lock during the entire attempt to ensure that
// the schema on the request is the latest
synchronized (oldSchema.getSchemaUpdateLock()) {
try {
IndexSchema newSchema = oldSchema.addFields(newFields);
IndexSchema newSchema = oldSchema.addFields(newFields, Collections.emptyMap(), false);
// Add copyFields
for (String srcField : newCopyFields.keySet()) {
for (Integer maxChars : newCopyFields.get(srcField).keySet()) {
newSchema = newSchema.addCopyFields(srcField,
newCopyFields.get(srcField).get(maxChars).stream().map(f -> f.getDest(srcField)).collect(Collectors.toList()),
maxChars);
}
}
if (null != newSchema) {
((ManagedIndexSchema)newSchema).persistManagedSchema(false);
core.setLatestSchema(newSchema);
cmd.getReq().updateSchemaToLatest();
log.debug("Successfully added field(s) to the schema.");
log.debug("Successfully added field(s) and copyField(s) to the schema.");
break; // success - exit from the retry loop
} else {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Failed to add fields.");
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Failed to add fields and/or copyFields.");
}
} catch (ManagedIndexSchema.FieldExistsException e) {
log.error("At least one field to be added already exists in the schema - retrying.");
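
The hunk above groups the copyField definitions of each source field by maxChars, so that each group maps onto one addCopyFields(source, destinations, maxChars) call. A minimal sketch of that grouping, assuming a hypothetical simplified Def class that handles only the "*suffix" form of dest, fed with the three definitions from the add-fields-maxchars test chain:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical simplified model of the grouping step; not Solr code.
class CopyFieldGroupingSketch {
    static final class Def {
        final String destGlob;
        final Integer maxChars;

        Def(String destGlob, Integer maxChars) {
            this.destGlob = destGlob;
            this.maxChars = maxChars;
        }

        Integer getMaxChars() {
            return maxChars;
        }

        // Simplification: assumes the glob starts with '*'.
        String getDest(String src) {
            return src + destGlob.substring(1);
        }
    }

    // Groups destinations by maxChars; one entry corresponds to one schema call.
    static Map<Integer, List<String>> destsByMaxChars(String src, List<Def> defs) {
        Map<Integer, List<Def>> grouped =
            defs.stream().collect(Collectors.groupingBy(Def::getMaxChars));
        Map<Integer, List<String>> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Def>> e : grouped.entrySet()) {
            List<String> dests = new ArrayList<>();
            for (Def d : e.getValue()) {
                dests.add(d.getDest(src));
            }
            result.put(e.getKey(), dests);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Def> defs = Arrays.asList(
            new Def("*_str", 10), new Def("*_t", 20), new Def("*2_t", 20));
        System.out.println(destsByMaxChars("stringField", defs));
    }
}
```

Three definitions collapse into two groups here, which is why the processor issues fewer schema calls than there are copyField entries.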
@ -360,11 +489,11 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
}
/**
* Maps all given field values' classes to a field type using the configured type mapping rules.
* Maps all given field values' classes to a typeMapping object
*
* @param fields one or more (same-named) field values from one or more documents
*/
private String mapValueClassesToFieldType(List<SolrInputField> fields) {
private TypeMapping mapValueClassesToFieldType(List<SolrInputField> fields) {
NEXT_TYPE_MAPPING: for (TypeMapping typeMapping : typeMappings) {
for (SolrInputField field : fields) {
NEXT_FIELD_VALUE: for (Object fieldValue : field.getValues()) {
@ -379,10 +508,18 @@ public class AddSchemaFieldsUpdateProcessorFactory extends UpdateRequestProcesso
}
}
// Success! Each of this field's values is an instance of a mapped valueClass
return typeMapping.fieldTypeName;
return typeMapping;
}
// At least one of this field's values is not an instance of any of the mapped valueClass-s
return defaultFieldType;
// Return the typeMapping marked as default, if we have one, else return null to use fallback type
List<TypeMapping> defaultMappings = typeMappings.stream().filter(TypeMapping::isDefault).collect(Collectors.toList());
if (defaultMappings.size() > 1) {
throw new SolrException(SERVER_ERROR, "Only one typeMapping can be default");
} else if (defaultMappings.size() == 1) {
return defaultMappings.get(0);
} else {
return null;
}
}
private FieldNameSelector buildSelector(IndexSchema schema) {
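
The revised mapValueClassesToFieldType above picks the first typeMapping whose value classes cover every value of the field, then falls back to the mapping flagged as default, and only then to defaultFieldType. A minimal sketch of that selection order, using hypothetical stand-in types and omitting the "only one default" check for brevity:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the selection order; not Solr code.
class MappingSelectionSketch {
    static final class Mapping {
        final String fieldType;
        final boolean isDefault;
        final List<Class<?>> valueClasses;

        Mapping(String fieldType, boolean isDefault, Class<?>... valueClasses) {
            this.fieldType = fieldType;
            this.isDefault = isDefault;
            this.valueClasses = Arrays.asList(valueClasses);
        }

        // True if the value is an instance of any mapped class.
        boolean accepts(Object value) {
            for (Class<?> c : valueClasses) {
                if (c.isInstance(value)) {
                    return true;
                }
            }
            return false;
        }
    }

    static String chooseFieldType(List<Object> values, List<Mapping> mappings, String fallbackType) {
        // 1. First mapping that accepts every value wins.
        for (Mapping m : mappings) {
            boolean allMatch = true;
            for (Object v : values) {
                if (!m.accepts(v)) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return m.fieldType;
            }
        }
        // 2. Otherwise use the mapping marked as default, if any.
        for (Mapping m : mappings) {
            if (m.isDefault) {
                return m.fieldType;
            }
        }
        // 3. Otherwise use the configured fallback (defaultFieldType).
        return fallbackType;
    }

    public static void main(String[] args) {
        List<Mapping> mappings = Arrays.asList(
            new Mapping("text", true, String.class),
            new Mapping("boolean", false, Boolean.class));
        List<Object> mixed = Arrays.<Object>asList(-13258.0f, 8.4828800808E10, "blah blah");
        System.out.println(chooseFieldType(mixed, mappings, null));
    }
}
```

The mixed Float/Double/String input is the same shape as in testSingleFieldDefaultTypeMappingRoundTrip, where no single mapping covers all values and the default mapping decides the type.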

@ -48,6 +48,7 @@
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="_root_" type="string" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="*_str" type="string" stored="false" multiValued="true" docValues="true" useDocValuesAsStored="false"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>

@ -68,6 +68,80 @@
<updateRequestProcessorChain name="add-fields">
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">text</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.String</str>
<str name="fieldType">text</str>
<lst name="copyField">
<str name="dest">*_str</str>
</lst>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">boolean</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Integer</str>
<str name="fieldType">pints</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Float</str>
<str name="fieldType">pfloats</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.util.Date</str>
<str name="fieldType">pdates</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Long</str>
<str name="valueClass">java.lang.Integer</str>
<str name="fieldType">plongs</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Number</str>
<str name="fieldType">pdoubles</str>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<updateRequestProcessorChain name="add-fields-maxchars">
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">text</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.String</str>
<str name="fieldType">text</str>
<lst name="copyField">
<str name="dest">*_str</str>
<int name="maxChars">10</int>
</lst>
<lst name="copyField">
<str name="dest">*_t</str>
<int name="maxChars">20</int>
</lst>
<lst name="copyField">
<str name="dest">*2_t</str>
<int name="maxChars">20</int>
</lst>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.DistributedUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<!-- This chain has one of the typeMappings set as default=true, instead of falling back to the defaultFieldType -->
<updateRequestProcessorChain name="add-fields-default-mapping">
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<lst name="typeMapping">
<str name="valueClass">java.lang.String</str>
<str name="fieldType">text</str>
<lst name="copyField">
<str name="dest">*_str</str>
<int name="maxChars">10</int>
</lst>
<!-- Use as default mapping instead of defaultFieldType -->
<bool name="default">true</bool>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">boolean</str>

@ -137,6 +137,9 @@
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_ds" type="doubles" indexed="true" stored="true"/>
<!-- Type used for data-driven schema, to add a string copy for each text field -->
<dynamicField name="*_str" type="strings" stored="false" docValues="true" indexed="false" />
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<dynamicField name="*_dts" type="date" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true"/>

@ -1138,7 +1138,8 @@
Field type guessing update processors that will
attempt to parse string-typed field values as Booleans, Longs,
Doubles, or Dates, and then add schema fields with the guessed
field types.
field types. Text content will be indexed as "text_general" and
also copied to a plain string version in *_str.
These require that the schema is both managed and mutable, by
declaring schemaFactory as ManagedIndexSchemaFactory, with
@ -1177,7 +1178,16 @@
</arr>
</updateProcessor>
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
<str name="defaultFieldType">strings</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.String</str>
<str name="fieldType">text_general</str>
<lst name="copyField">
<str name="dest">*_str</str>
<int name="maxChars">256</int>
</lst>
<!-- Use as default mapping instead of defaultFieldType -->
<bool name="default">true</bool>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>

@ -17,10 +17,12 @@
package org.apache.solr.update.processor;
import java.io.File;
import java.util.Collections;
import java.util.Date;
import org.apache.commons.io.FileUtils;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.schema.IndexSchema;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
@ -115,6 +117,29 @@ public class AddSchemaFieldsUpdateProcessorFactoryTest extends UpdateProcessorTe
schema = h.getCore().getLatestSchema();
assertNotNull(schema.getFieldOrNull(fieldName));
assertEquals("text", schema.getFieldType(fieldName).getTypeName());
assertEquals(0, schema.getCopyFieldProperties(true, Collections.singleton(fieldName), null).size());
assertU(commit());
assertQ(req("id:4")
,"//arr[@name='" + fieldName + "']/str[.='" + fieldValue1.toString() + "']"
,"//arr[@name='" + fieldName + "']/str[.='" + fieldValue2.toString() + "']"
,"//arr[@name='" + fieldName + "']/str[.='" + fieldValue3.toString() + "']"
);
}
public void testSingleFieldDefaultTypeMappingRoundTrip() throws Exception {
IndexSchema schema = h.getCore().getLatestSchema();
final String fieldName = "newfield4";
assertNull(schema.getFieldOrNull(fieldName));
Float fieldValue1 = -13258.0f;
Double fieldValue2 = 8.4828800808E10;
String fieldValue3 = "blah blah";
SolrInputDocument d = processAdd
("add-fields-default-mapping", doc(f("id", "4"), f(fieldName, fieldValue1, fieldValue2, fieldValue3)));
assertNotNull(d);
schema = h.getCore().getLatestSchema();
assertNotNull(schema.getFieldOrNull(fieldName));
assertEquals("text", schema.getFieldType(fieldName).getTypeName());
assertEquals(1, schema.getCopyFieldProperties(true, Collections.singleton(fieldName), null).size());
assertU(commit());
assertQ(req("id:4")
,"//arr[@name='" + fieldName + "']/str[.='" + fieldValue1.toString() + "']"
@ -209,6 +234,60 @@ public class AddSchemaFieldsUpdateProcessorFactoryTest extends UpdateProcessorTe
,"//arr[@name='" + fieldName3 + "']/str[.='" + field3String2 + "']"
,"//arr[@name='" + fieldName4 + "']/date[.='" + field4Value1String + "']");
}
public void testStringWithCopyField() throws Exception {
IndexSchema schema = h.getCore().getLatestSchema();
final String fieldName = "stringField";
final String strFieldName = fieldName+"_str";
assertNull(schema.getFieldOrNull(fieldName));
String content = "This is a text that should be copied to a string field but not be cutoff";
SolrInputDocument d = processAdd("add-fields", doc(f("id", "1"), f(fieldName, content)));
assertNotNull(d);
schema = h.getCore().getLatestSchema();
assertNotNull(schema.getFieldOrNull(fieldName));
assertNotNull(schema.getFieldOrNull(strFieldName));
assertEquals("text", schema.getFieldType(fieldName).getTypeName());
assertEquals(1, schema.getCopyFieldProperties(true, Collections.singleton(fieldName), Collections.singleton(strFieldName)).size());
}
public void testStringWithCopyFieldAndMaxChars() throws Exception {
IndexSchema schema = h.getCore().getLatestSchema();
final String fieldName = "stringField";
final String strFieldName = fieldName+"_str";
assertNull(schema.getFieldOrNull(fieldName));
String content = "This is a text that should be copied to a string field and cutoff at 10 characters";
SolrInputDocument d = processAdd("add-fields-maxchars", doc(f("id", "1"), f(fieldName, content)));
assertNotNull(d);
System.out.println("Document is "+d);
schema = h.getCore().getLatestSchema();
assertNotNull(schema.getFieldOrNull(fieldName));
assertNotNull(schema.getFieldOrNull(strFieldName));
assertEquals("text", schema.getFieldType(fieldName).getTypeName());
// We have three copyFields, one with maxChars 10 and two with maxChars 20
assertEquals(3, schema.getCopyFieldProperties(true, Collections.singleton(fieldName), null).size());
assertEquals("The configured maxChars cutoff does not exist on the copyField", 10,
schema.getCopyFieldProperties(true, Collections.singleton(fieldName), Collections.singleton(strFieldName))
.get(0).get("maxChars"));
assertEquals("The configured maxChars cutoff does not exist on the copyField", 20,
schema.getCopyFieldProperties(true, Collections.singleton(fieldName), Collections.singleton(fieldName+"_t"))
.get(0).get("maxChars"));
assertEquals("The configured maxChars cutoff does not exist on the copyField", 20,
schema.getCopyFieldProperties(true, Collections.singleton(fieldName), Collections.singleton(fieldName+"2_t"))
.get(0).get("maxChars"));
}
public void testCopyFieldByIndexing() throws Exception {
String content = "This is a text that should be copied to a string field and cutoff at 10 characters";
SolrInputDocument d = processAdd("add-fields-default-mapping", doc(f("id", "1"), f("mynewfield", content)));
assertU(commit());
ModifiableSolrParams params = new ModifiableSolrParams();
params.add("q", "*:*").add("facet", "true").add("facet.field", "mynewfield_str");
assertQ(req(params)
, "*[count(//doc)=1]"
,"//lst[@name='mynewfield_str']/int[@name='This is a '][.='1']"
);
}
@After
private void deleteCoreAndTempSolrHomeDirectory() throws Exception {

@ -137,6 +137,9 @@
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_ds" type="doubles" indexed="true" stored="true"/>
<!-- Type used for data-driven schema, to add a string copy for each text field -->
<dynamicField name="*_str" type="strings" stored="false" docValues="true" indexed="false" />
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<dynamicField name="*_dts" type="date" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true"/>

@ -1138,7 +1138,8 @@
Field type guessing update processors that will
attempt to parse string-typed field values as Booleans, Longs,
Doubles, or Dates, and then add schema fields with the guessed
field types.
field types. Text content will be indexed as "text_general" and
also copied to a plain string version in *_str.
These require that the schema is both managed and mutable, by
declaring schemaFactory as ManagedIndexSchemaFactory, with
@ -1177,7 +1178,16 @@
</arr>
</updateProcessor>
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
<str name="defaultFieldType">strings</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.String</str>
<str name="fieldType">text_general</str>
<lst name="copyField">
<str name="dest">*_str</str>
<int name="maxChars">256</int>
</lst>
<!-- Use as default mapping instead of defaultFieldType -->
<bool name="default">true</bool>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>

@ -39,5 +39,5 @@ This section helps you get Solr up and running quickly, and introduces you to th
[TIP]
====
Solr includes a Quick Start tutorial which will be helpful if you are just starting out with Solr. You can find it online at http://lucene.apache.org/solr/quickstart.html, or in your Solr installation at `$SOLR_INSTALL_DIR/docs/quickstart.html`.
Solr includes a Quick Start tutorial which will be helpful if you are just starting out with Solr. You can find it online at http://lucene.apache.org/solr/quickstart.html.
====

@ -70,7 +70,7 @@ You can use the `/schema/fields` <<schema-api.adoc#schema-api,Schema API>> to co
[[SchemalessMode-ConfiguringSchemalessMode]]
== Configuring Schemaless Mode
As described above, there are three configuration elements that need to be in place to use Solr in schemaless mode. In the default (`_default`) config set included with Solr these are already configured. If, however, you would like to implement schemaless on your own, you should make the following changes.
As described above, there are three configuration elements that need to be in place to use Solr in schemaless mode. In the `_default` config set included with Solr these are already configured. If, however, you would like to implement schemaless on your own, you should make the following changes.
[[SchemalessMode-EnableManagedSchema]]
=== Enable Managed Schema
@ -94,18 +94,16 @@ The UpdateRequestProcessorChain allows Solr to guess field types, and you can de
[source,xml]
----
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
<!-- UUIDUpdateProcessorFactory will generate an id if none is present in the incoming document -->
<processor class="solr.UUIDUpdateProcessorFactory" />
<processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
<processor class="solr.FieldNameMutatingUpdateProcessorFactory">
<updateProcessor class="solr.UUIDUpdateProcessorFactory" name="uuid"/>
<updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory" name="remove-blank"/>
<updateProcessor class="solr.FieldNameMutatingUpdateProcessorFactory" name="field-name-mutating">
<str name="pattern">[^\w-\.]</str>
<str name="replacement">_</str>
</processor>
<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
</updateProcessor>
<updateProcessor class="solr.ParseBooleanFieldUpdateProcessorFactory" name="parse-boolean"/>
<updateProcessor class="solr.ParseLongFieldUpdateProcessorFactory" name="parse-long"/>
<updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
<updateProcessor class="solr.ParseDateFieldUpdateProcessorFactory" name="parse-date">
<arr name="format">
<str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
<str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
@ -125,9 +123,18 @@ The UpdateRequestProcessorChain allows Solr to guess field types, and you can de
<str>yyyy-MM-dd HH:mm</str>
<str>yyyy-MM-dd</str>
</arr>
</processor>
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">strings</str>
</updateProcessor>
<updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
<lst name="typeMapping">
<str name="valueClass">java.lang.String</str>
<str name="fieldType">text_general</str>
<lst name="copyField">
<str name="dest">*_str</str>
<int name="maxChars">256</int>
</lst>
<!-- Use as default mapping instead of defaultFieldType -->
<bool name="default">true</bool>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>
@ -145,11 +152,15 @@ The UpdateRequestProcessorChain allows Solr to guess field types, and you can de
<str name="valueClass">java.lang.Number</str>
<str name="fieldType">pdoubles</str>
</lst>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
</updateProcessor>
<!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
----
Javadocs for update processor factories mentioned above:
@ -166,7 +177,7 @@ Javadocs for update processor factories mentioned above:
[[SchemalessMode-MaketheUpdateRequestProcessorChaintheDefaultfortheUpdateRequestHandler]]
=== Make the UpdateRequestProcessorChain the Default for the UpdateRequestHandler
Once the UpdateRequestProcessorChain has been defined, you must instruct your UpdateRequestHandlers to use it when working with index updates (i.e., adding, removing, replacing documents). Here is an example using <<initparams-in-solrconfig.adoc#initparams-in-solrconfig,InitParams>> to set the defaults on all `/update` request handlers:
Once the UpdateRequestProcessorChain has been defined, you must instruct your UpdateRequestHandlers to use it when working with index updates (i.e., adding, removing, replacing documents). There are two ways to do this. The update chain shown above sets its `default` attribute (to `true` unless `update.autoCreateFields` is overridden), which makes it the chain used by any update handler. An alternative, more explicit way is to use <<initparams-in-solrconfig.adoc#initparams-in-solrconfig,InitParams>> to set the defaults on all `/update` request handlers:
[source,xml]
----
@ -185,9 +196,9 @@ After each of these changes have been made, Solr should be restarted (or, you ca
[[SchemalessMode-ExamplesofIndexedDocuments]]
== Examples of Indexed Documents
Once the schemaless mode has been enabled (whether you configured it manually or are using `_default` ), documents that include fields that are not defined in your schema should be added to the index, and the new fields added to the schema.
Once schemaless mode has been enabled (whether you configured it manually or are using `_default`), documents that include fields not defined in your schema will be indexed using guessed field types, and the new fields are automatically added to the schema.
For example, adding a CSV document will cause its fields that are not in the schema to be added, with fieldTypes based on values:
For example, adding a CSV document will cause unknown fields to be added, with fieldTypes based on values:
[source,bash]
----
@ -212,37 +223,51 @@ The fields now in the schema (output from `curl \http://localhost:8983/solr/gett
{
"responseHeader":{
"status":0,
"QTime":1},
"QTime":2},
"fields":[{
"name":"Album",
"type":"strings"}, // Field value guessed as String -> strings fieldType
"type":"text_general"},
{
"name":"Artist",
"type":"strings"}, // Field value guessed as String -> strings fieldType
"type":"text_general"},
{
"name":"FromDistributor",
"type":"tlongs"}, // Field value guessed as Long -> tlongs fieldType
"type":"plongs"},
{
"name":"Rating",
"type":"tdoubles"}, // Field value guessed as Double -> tdoubles fieldType
"type":"pdoubles"},
{
"name":"Released",
"type":"tdates"}, // Field value guessed as Date -> tdates fieldType
"type":"pdates"},
{
"name":"Sold",
"type":"tlongs"}, // Field value guessed as Long -> tlongs fieldType
"type":"plongs"},
{
"name":"_text_",
...
},
"name":"_root_" ...}
{
"name":"_version_",
...
},
"name":"_text_" ...}
{
"name":"id",
...
}]}
"name":"_version_" ...}
{
"name":"id" ...}
----
In addition, string versions of the text fields are indexed, using copyFields to a `*_str` dynamic field (output from `curl \http://localhost:8983/solr/gettingstarted/schema/copyfields`):
[source,json]
----
{
"responseHeader":{
"status":0,
"QTime":0},
"copyFields":[{
"source":"Artist",
"dest":"Artist_str",
"maxChars":256},
{
"source":"Album",
"dest":"Album_str",
"maxChars":256}]}
----
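To make the effect of `maxChars` concrete, the following Python sketch mimics what the copyField rule does to a document's text fields: each one gains a `_str` counterpart cut off at 256 characters. This is only a conceptual illustration, not Solr's implementation; the function name `apply_str_copy_fields` and the sample values are made up for the example.

```python
MAX_CHARS = 256  # same cutoff as the maxChars value in the copyField definitions


def apply_str_copy_fields(doc, text_fields, max_chars=MAX_CHARS):
    """Return a copy of doc with a truncated <field>_str entry for each text field.

    Illustrative only: Solr performs this copy internally at index time.
    """
    result = dict(doc)
    for name in text_fields:
        if name in doc:
            result[f"{name}_str"] = str(doc[name])[:max_chars]
    return result


# Hypothetical document: Album is longer than the cutoff, Sold is numeric.
doc = {"Artist": "Some Artist", "Album": "x" * 300, "Sold": 150000}
indexed = apply_str_copy_fields(doc, ["Artist", "Album"])

print(sorted(indexed))
print(len(indexed["Album_str"]))
```

Only the fields mapped to `text_general` get a string copy; numeric fields such as `Sold` are left alone, which matches the copyFields output above listing just `Artist` and `Album`.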
.You Can Still Be Explicit
@ -251,9 +276,11 @@ The fields now in the schema (output from `curl \http://localhost:8983/solr/gett
Even if you want to use schemaless mode for most fields, you can still use the <<schema-api.adoc#schema-api,Schema API>> to pre-emptively create some fields, with explicit types, before you index documents that use them.
Internally, the Schema API and the Schemaless Update Processors both use the same <<schema-factory-definition-in-solrconfig.adoc#schema-factory-definition-in-solrconfig,Managed Schema>> functionality.
Also, if you do not need the `*_str` version of a text field, you can simply remove the `copyField` definition from the auto-generated schema and it will not be re-added since the original field is now defined.
====
Once a field has been added to the schema, its field type is fixed. As a consequence, adding documents with field value(s) that conflict with the previously guessed field type will fail. For example, after adding the above document, the "```Sold```" field has the fieldType `tlongs`, but the document below has a non-integral decimal value in this field:
Once a field has been added to the schema, its field type is fixed. As a consequence, adding documents with field value(s) that conflict with the previously guessed field type will fail. For example, after adding the above document, the "```Sold```" field has the fieldType `plongs`, but the document below has a non-integral decimal value in this field:
[source,bash]
----

View File

@ -153,7 +153,7 @@ In this example, we simply named the field paths (such as `/exams/test`). Solr w
[TIP]
====
If you are working in <<schemaless-mode.adoc#schemaless-mode,Schemaless Mode>>, fields that don't exist will be created on the fly with Solr's best guess for the field type. Documents WILL get rejected if the fields do not exist in the schema before indexing. So, if you are NOT using schemaless mode, pre-create those fields.
Documents WILL get rejected if the fields do not exist in the schema before indexing. So, if you are NOT using schemaless mode, pre-create those fields. If you are working in <<schemaless-mode.adoc#schemaless-mode,Schemaless Mode>>, fields that don't exist will be created on the fly with Solr's best guess for the field type.
====
@ -336,7 +336,7 @@ With this example, the documents indexed would be, as follows:
== Tips for Custom JSON Indexing
1. Schemaless mode: This handles field creation automatically. The field guessing may not be exactly as you expect, but it works. The best thing to do is to set up a local server in schemaless mode, index a few sample docs, and create those fields in your real setup with proper field types before indexing.
2. Pre-created Schema : Post your docs to the `/update/`json`/docs` endpoint with `echo=true`. This gives you the list of field names you need to create. Create the fields before you actually index
2. Pre-created Schema: Post your docs to the `/update/json/docs` endpoint with `echo=true`. This gives you the list of field names you need to create. Create the fields before you actually index.
3. No schema, only full-text search: If all you need is full-text search on your JSON, set the configuration as given in the Setting JSON Defaults section.
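
For the pre-created schema tip, the request to `/update/json/docs` with `echo=true` could be assembled roughly as follows. This is a sketch under stated assumptions: the helper name `build_echo_request`, the base URL, the `mycollection` name, and the sample document are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_echo_request(base_url, collection, docs):
    """Build (but do not send) a POST to /update/json/docs with echo=true."""
    query = urllib.parse.urlencode({"echo": "true"})
    return urllib.request.Request(
        url=f"{base_url}/{collection}/update/json/docs?{query}",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical sample document used only to show the request shape.
sample_docs = [{"first": "John", "last": "Doe"}]
echo_request = build_echo_request(
    "http://localhost:8983/solr", "mycollection", sample_docs
)
print(echo_request.full_url)
# Against a running Solr, send it and inspect the echoed documents:
# with urllib.request.urlopen(echo_request) as response:
#     print(response.read().decode("utf-8"))
```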
[[TransformingandIndexingCustomJSON-SettingJSONDefaults]]