Ref Guide: fix inconsistent "eg" and "ie" usages; capitalize ZooKeeper properly

Cassandra Targett 2018-06-08 11:08:41 -05:00
parent eb7bb2d906
commit a06c24cf92
10 changed files with 32 additions and 36 deletions

View File

@@ -247,7 +247,7 @@ The split is performed by dividing the original shard's hash range into two equa
The newly created shards will have as many replicas as the parent shard.
You must ensure that the node running the leader of the parent shard has enough free disk space i.e. more than twice the index size, for the split to succeed. The API uses the Autoscaling framework to find nodes that can satisfy the disk requirements for the new replicas but only when an Autoscaling policy is configured. Refer to <<solrcloud-autoscaling-policy-preferences.adoc#solrcloud-autoscaling-policy-preferences,Autoscaling Policy and Preferences>> section for more details.
You must ensure that the node running the leader of the parent shard has enough free disk space, i.e., more than twice the index size, for the split to succeed. The API uses the Autoscaling framework to find nodes that can satisfy the disk requirements for the new replicas, but only when an Autoscaling policy is configured. Refer to the <<solrcloud-autoscaling-policy-preferences.adoc#solrcloud-autoscaling-policy-preferences,Autoscaling Policy and Preferences>> section for more details.
Shard splitting can be a long-running process. In order to avoid timeouts, you should run this as an <<Asynchronous Calls,asynchronous call>>.
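For example, a sketch of splitting `shard1` asynchronously (the collection name and request ID are illustrative):

[source,bash]
----
# Start the split asynchronously; the request ID is arbitrary
curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1&async=split-1000'

# Poll for completion with the same request ID
curl 'http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=split-1000'
----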

View File

@@ -148,7 +148,7 @@ Lastly, it follows that the value of this feature diminishes as the number of sh
Solr allows you to pass an optional string parameter named `shards.preference` to indicate that a distributed query should sort the available replicas in the given order of precedence within each shard. The syntax is: `shards.preference=property:value`. The order of the properties and values is significant: the first one is the primary sort, the second one is secondary, etc.
IMPORTANT: `shards.preference` only works for distributed queries, i.e. queries targeting multiple shards. Not implemented yet for single shard scenarios
IMPORTANT: `shards.preference` only works for distributed queries, i.e., queries targeting multiple shards. It is not yet implemented for single-shard scenarios.
The properties that can be specified are as follows:
@@ -156,7 +156,7 @@ The properties that can be specified are as follows:
One or more replica types that are preferred. Any combination of PULL, TLOG and NRT is allowed.
`replica.location`::
One or more replica locations that are preferred. A location starts with `http://hostname:port`. Matching is done for the given string as a prefix, so it's possible to e.g. leave out the port. `local` may be used as special value to denote any local replica running on the same Solr instance as the one handling the query. This is useful when a query requests many fields or large fields to be returned per document because it avoids moving large amounts of data over the network when it is available locally. In addition, this feature can be useful for minimizing the impact of a problematic replica with degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.
One or more replica locations that are preferred. A location starts with `http://hostname:port`. Matching is done for the given string as a prefix, so it's possible, e.g., to leave out the port. `local` may be used as a special value to denote any local replica running on the same Solr instance as the one handling the query. This is useful when a query requests many fields or large fields to be returned per document because it avoids moving large amounts of data over the network when it is available locally. In addition, this feature can be useful for minimizing the impact of a problematic replica with degraded performance, as it reduces the likelihood that the degraded replica will be hit by other healthy replicas.
The value of `replica.location:local` diminishes as the number of shards (that have no locally-available replicas) in a collection increases because the query controller will have to direct the query to non-local replicas for most of the shards. In other words, this feature is mostly useful for optimizing queries directed towards collections with a small number of shards and many replicas. Also, this option should only be used if you are load balancing requests across all nodes that host replicas for the collection you are querying, as Solr's CloudSolrClient will do. If not load-balancing, this feature can introduce a hotspot in the cluster since queries won't be evenly distributed across the cluster.
@@ -166,7 +166,7 @@ Examples:
`shards.preference=replica.type:PULL`
* Prefer PULL replicas, or TLOG replicas if PULL replicas not available:
`shards.preference=replica.type:PULL,replica.type:TLOG`
* Prefer any local replicas:
`shards.preference=replica.location:local`
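Putting these together, a query that prefers local PULL replicas might look like this sketch (the collection name is illustrative):

[source,bash]
----
curl 'http://localhost:8983/solr/mycollection/select?q=*:*&shards.preference=replica.location:local,replica.type:PULL'
----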

View File

@@ -343,8 +343,8 @@ Aggregation functions, also called *facet functions, analytic functions,* or **m
|avg |`avg(popularity)` |average of numeric values
|min |`min(salary)` |minimum value
|max |`max(mul(price,popularity))` |maximum value
|unique |`unique(author)` |number of unique values of the given field. Beyond 100 values it yields an estimate that is not exact.
|uniqueBlock |`uniqueBlock(\_root_)` |same as `unique`, but with a smaller footprint; strictly requires a <<uploading-data-with-index-handlers.adoc#nested-child-documents, block index>>. The given field is expected to be unique across blocks; currently only single-valued string fields are supported, and docValues are recommended.
|hll |`hll(author)` |distributed cardinality estimate via hyper-log-log algorithm
|percentile |`percentile(salary,50,75,99,99.9)` |Percentile estimates via t-digest algorithm. When sorting by this metric, the first percentile listed is used as the sort value.
|sumsq |`sumsq(rent)` |sum of squares of field or function
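As a sketch, several of these functions can be requested at once as top-level statistics; the field names follow the examples in the table above, and the collection name is illustrative:

[source,bash]
----
curl http://localhost:8983/solr/mycollection/query -d 'q=*:*&rows=0&json.facet={
  "avg_popularity" : "avg(popularity)",
  "max_price"      : "max(price)",
  "salary_pcts"    : "percentile(salary,50,75,99)"
}'
----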
@@ -447,7 +447,6 @@ And the response will look something like:
}
},
[...]
----
By default "top authors" is defined by simple document count descending, but we could use our aggregation functions to sort by more interesting metrics.
@@ -474,7 +473,6 @@ Suppose we have products with multiple SKUs, and we want to count products for e
{ "id": "14", "type": "SKU", "color": "Blue", "size": "S" }
]
}
----
For *SKU domain* we can request
@@ -489,8 +487,6 @@ For *SKU domain* we can request
productsCount: "uniqueBlock(_root_)"
}
}
----
and get
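a facet response of roughly this shape (the counts shown are illustrative, and the enclosing `colors` facet name is an assumption since part of the request is elided above):

[source,javascript]
----
"facets": {
  "count": 10,
  "colors": {
    "buckets": [
      { "val": "Blue", "count": 4, "productsCount": 2 },
      { "val": "Red",  "count": 3, "productsCount": 2 }
    ]
  }
}
----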
@@ -585,7 +581,7 @@ curl -sS -X POST http://localhost:8983/solr/gettingstarted/query -d 'rows=0&q=*:
<1> Use the entire collection as our "Background Set"
<2> Use a query for "age >= 35" to define our (initial) "Foreground Set"
<3> For both the top level `hobbies` facet & the sub-facet on `state` we will be sorting on the `relatedness(...)` values
<4> In both calls to the `relatedness(...)` function, we use <<local-parameters-in-queries.adoc#parameter-dereferencing,Parameter Variables>> to refer to the previously defined `fore` and `back` queries.
.The Facet Response
[source,javascript,subs="verbatim,callouts"]
@@ -626,11 +622,11 @@ curl -sS -X POST http://localhost:8983/solr/gettingstarted/query -d 'rows=0&q=*:
"buckets":[{
...
----
<1> Even though `hobbies:golf` has a lower total facet `count` than `hobbies:painting`, it has a higher `relatedness` score, indicating that relative to the Background Set (the entire collection) Golf has a stronger correlation to our Foreground Set (people age 35+) than Painting.
<2> The number of documents matching `age:[35 TO *]` _and_ `hobbies:golf` is 31.25% of the total number of documents in the Background Set
<3> 37.5% of the documents in the Background Set match `hobbies:golf`
<4> The state of Arizona (AZ) has a _positive_ relatedness correlation with the _nested_ Foreground Set (people ages 35+ who play Golf) compared to the Background Set -- ie: "People in Arizona are statistically more likely to be '35+ year old Golfers' then the country as a whole."
<5> The state of Colorado (CO) has a _negative_ correlation with the nested Foreground Set -- ie: "People in Colorado are statistically less likely to be '35+ year old Golfers' then the country as a whole."
<4> The state of Arizona (AZ) has a _positive_ relatedness correlation with the _nested_ Foreground Set (people ages 35+ who play Golf) compared to the Background Set -- i.e., "People in Arizona are statistically more likely to be '35+ year old Golfers' than the country as a whole."
<5> The state of Colorado (CO) has a _negative_ correlation with the nested Foreground Set -- i.e., "People in Colorado are statistically less likely to be '35+ year old Golfers' than the country as a whole."
<6> The number of documents matching `age:[35 TO *]` _and_ `hobbies:golf` _and_ `state:AZ` is 18.75% of the total number of documents in the Background Set
<7> 50% of the documents in the Background Set match `state:AZ`

View File

@@ -55,7 +55,7 @@ Since a Solr cluster requires internode communication, each node must also be ab
When setting up a kerberized SolrCloud cluster, it is recommended to enable Kerberos security for ZooKeeper as well.
In such a setup, the client principal used to authenticate requests with ZooKeeper can be shared for internode communication as well. This has the benefit of not needing to renew the ticket granting tickets (TGTs) separately, since the Zookeeper client used by Solr takes care of this. To achieve this, a single JAAS configuration (with the app name as Client) can be used for the Kerberos plugin as well as for the Zookeeper client.
In such a setup, the client principal used to authenticate requests with ZooKeeper can be shared for internode communication as well. This has the benefit of not needing to renew the ticket granting tickets (TGTs) separately, since the ZooKeeper client used by Solr takes care of this. To achieve this, a single JAAS configuration (with the app name as Client) can be used for the Kerberos plugin as well as for the ZooKeeper client.
See the <<ZooKeeper Configuration>> section below for an example of starting ZooKeeper in Kerberos mode.
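A minimal shared JAAS file for such a setup might look like the following sketch; the keytab path and principal are illustrative:

[source,plain]
----
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/solr.keytab"
  storeKey=true
  useTicketCache=false
  principal="solr/host1.example.com@EXAMPLE.COM";
};
----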

View File

@@ -20,7 +20,7 @@
=== Round-robin databases
Solr collects long-term history of certain key metrics both in SolrCloud and in standalone mode.
This information can be used for very simple monitoring and troubleshooting, but also some
Solr Cloud components (eg. autoscaling) can use this data for making informed decisions based on
SolrCloud components (e.g., autoscaling) can use this data for making informed decisions based on
long-term trends of selected metrics.
[IMPORTANT]
@@ -62,7 +62,7 @@ update operations than storing each data point in a separate Solr document. Metr
detailed data from each database, including retrieval of all individual datapoints.
Databases are identified primarily by their corresponding metric registry name, so for databases that
keep track of aggregated metrics this will be eg. `solr.jvm`, `solr.node`, `solr.collection.gettingstarted`.
keep track of aggregated metrics this will be, e.g., `solr.jvm`, `solr.node`, `solr.collection.gettingstarted`.
For databases with non-aggregated metrics the name consists of the registry name, optionally with a node name
to identify databases with the same name coming from different nodes. For example, per-node databases are
named like this: `solr.jvm.localhost:8983_solr`, `solr.node.localhost:7574_solr`, but per-replica names are
@@ -138,7 +138,7 @@ is collected for each collection.
`enableNodes`:: boolean, default is false. When this is true then non-aggregated history will be
collected separately for each node (for node and JVM metrics), with database names consisting of
base registry name with appended node name, eg. `solr.jvm.localhost:8983_solr`. When this is false
base registry name with appended node name, e.g., `solr.jvm.localhost:8983_solr`. When this is false
then only aggregated history will be collected, in single cluster-wide `solr.jvm` and `solr.node` databases.
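As a sketch, these databases can then be listed and read back via the metrics history API (the parameter names here are assumptions based on the handler described in this section):

[source,bash]
----
# List the available history databases
curl 'http://localhost:8983/solr/admin/metrics/history?action=list'

# Retrieve the collected datapoints for the aggregated JVM metrics database
curl 'http://localhost:8983/solr/admin/metrics/history?action=get&name=solr.jvm'
----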

View File

@@ -520,7 +520,7 @@ A few query parameters are available to limit your request to only certain metri
`prefix`:: The first characters of a metric name, which will filter the metrics returned to those starting with the provided string. It can be combined with `group` and/or `type` parameters. More than one prefix can be specified in a request; multiple prefixes should be separated by a comma. Prefix matching is also case-sensitive.
`regex`:: A regular expression matching metric names. Note: dot separators in metric names must be escaped, eg.
`regex`:: A regular expression matching metric names. Note: dot separators in metric names must be escaped, e.g.,
`QUERY\./select\..*` is a valid regex that matches all metrics with the `QUERY./select.` prefix.
`property`:: Allows requesting only this metric from any compound metric. Multiple `property` parameters can be combined to act as an OR request. For example, to only get the 99th and 999th percentile values from all metric types and groups, you can add `&property=p99_ms&property=p999_ms` to your request. This can be combined with `group`, `type`, and `prefix` as necessary.
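For example, combining these parameters in one request (the prefix and property values are illustrative):

[source,bash]
----
# Core-level metrics starting with "QUERY./select", returning only the p99 values
curl 'http://localhost:8983/solr/admin/metrics?group=core&prefix=QUERY./select&property=p99_ms'
----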

View File

@@ -351,31 +351,31 @@ Now you will not have to enter the connection string when starting Solr.
== Increasing ZooKeeper's 1MB File Size Limit
ZooKeeper is designed to hold small files, on the order of kilobytes. By default, ZooKeeper's file size limit is 1MB. Attempting to write or read files larger than this will cause errors.
Some Solr features, e.g. text analysis synonyms, LTR, and OpenNLP named entity recognition, require configuration resources that can be larger than the default limit. ZooKeeper can be configured, via Java system property https://zookeeper.apache.org/doc/r{ivy-zookeeper-version}/zookeeperAdmin.html#Unsafe+Options[`jute.maxbuffer`], to increase this limit. Note that this configuration, which is required both for ZooKeeper server(s) and for all clients that connect to the server(s), must be the same everywhere it is specified.
Some Solr features, e.g., text analysis synonyms, LTR, and OpenNLP named entity recognition, require configuration resources that can be larger than the default limit. ZooKeeper can be configured, via Java system property https://zookeeper.apache.org/doc/r{ivy-zookeeper-version}/zookeeperAdmin.html#Unsafe+Options[`jute.maxbuffer`], to increase this limit. Note that this configuration, which is required both for ZooKeeper server(s) and for all clients that connect to the server(s), must be the same everywhere it is specified.
=== Configuring jute.maxbuffer on ZooKeeper nodes
`jute.maxbuffer` must be configured on each external ZooKeeper node. This can be achieved in any of the following ways; note though that only the first option works on Windows:
. In `<ZOOKEEPER_HOME>/conf/zoo.cfg`, e.g. to increase the file size limit to one byte less than 10MB, add this line:
. In `<ZOOKEEPER_HOME>/conf/zoo.cfg`, e.g., to increase the file size limit to one byte less than 10MB, add this line:
+
[source,properties]
jute.maxbuffer=0x9fffff
. In `<ZOOKEEPER_HOME>/conf/zookeeper-env.sh`, e.g. to increase the file size limit to 50MiB, add this line:
. In `<ZOOKEEPER_HOME>/conf/zookeeper-env.sh`, e.g., to increase the file size limit to 50MiB, add this line:
+
[source,properties]
JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=50000000"
. In `<ZOOKEEPER_HOME>/bin/zkServer.sh`, add a `JVMFLAGS` environment variable assignment near the top of the script, e.g. to increase the file size limit to 5MiB:
. In `<ZOOKEEPER_HOME>/bin/zkServer.sh`, add a `JVMFLAGS` environment variable assignment near the top of the script, e.g., to increase the file size limit to 5MiB:
+
[source,properties]
JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=5000000"
=== Configuring jute.maxbuffer for ZooKeeper clients
The `bin/solr` script invokes Java programs that act as ZooKeeper clients. (When you use Solr's bundled ZooKeeper server instead of setting up an external ZooKeeper ensemble, the configuration described below will also configure the ZooKeeper server.)
The `bin/solr` script invokes Java programs that act as ZooKeeper clients. When you use Solr's bundled ZooKeeper server instead of setting up an external ZooKeeper ensemble, the configuration described below will also configure the ZooKeeper server.
Add the setting to the `SOLR_OPTS` environment variable in Solr's include file (`bin/solr.in.sh` or `solr.in.cmd`):
[.dynamic-tabs]
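For example, in `bin/solr.in.sh` this might look like the following sketch, reusing the value from the `zoo.cfg` example above:

[source,bash]
----
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x9fffff"
----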

View File

@@ -54,7 +54,7 @@ The following properties are common to all event types:
`nodeAdded` event this will be the time when the node was added and not when the event was actually
generated, which may significantly differ due to the rate limits set by `waitFor`.
`properties`:: (map, optional) Any additional properties. Currently includes eg. `nodeNames` property that
`properties`:: (map, optional) Any additional properties. Currently includes, e.g., a `nodeNames` property that
indicates the nodes that were lost or added.
== Auto Add Replicas Trigger
@@ -216,7 +216,7 @@ metric value, and it may use the colon syntax for selecting one property of a co
requested in a single autoscaling event. The default value is 3 and it helps to smooth out
the changes to the number of replicas during periods of large search rate fluctuations.
`minReplicas`:: (integer, optional) minimum acceptable number of searchable replicas (ie. replicas other
`minReplicas`:: (integer, optional) minimum acceptable number of searchable replicas (i.e., replicas other
than `PULL` type). The trigger will not generate any DELETEREPLICA requests when the number of
searchable replicas in a shard reaches this threshold. When this value is not set (the default)
the `replicationFactor` property of the collection is used, and if that property is not set then
@@ -247,7 +247,7 @@ to effectively disable the action but still report it to the listeners.
`belowNodeOp`:: action to request when the lower threshold for a node is exceeded.
Default action is null (not set) and the condition is ignored, because in many cases the
trigger will monitor only some selected resources (replicas from selected
collections / shards) so setting this by default to eg. `DELETENODE` could interfere with
collections / shards), so setting this by default to, e.g., `DELETENODE` could interfere with
these non-monitored resources. The trigger will request 1 operation per cold node per event.
If both `belowOp` and `belowNodeOp` operations are requested then `belowOp` operations are
always requested first.
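Putting these properties together, a sketch of registering such a trigger via the autoscaling API; the values are illustrative, and only the properties named above plus standard trigger fields are used:

[source,bash]
----
curl -X POST http://localhost:8983/solr/admin/autoscaling \
  -H 'Content-Type: application/json' -d '{
  "set-trigger": {
    "name":        "search_rate_trigger",
    "event":       "searchRate",
    "waitFor":     "1m",
    "maxOps":      3,
    "minReplicas": 2,
    "belowNodeOp": "DELETENODE"
  }
}'
----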

View File

@@ -40,7 +40,7 @@ Split on whitespace. If set to `true`, text analysis is invoked separately for e
`mm`::
Minimum should match. See the <<the-dismax-query-parser.adoc#mm-minimum-should-match-parameter,DisMax mm parameter>> for a description of `mm`. The default eDisMax `mm` value differs from that of DisMax:
+
* The default `mm` value is 0%:
** if the query contains an explicit operator other than "AND" ("-", "+", "OR", "NOT"); or
** if `q.op` is "OR" or is not specified.
@@ -87,7 +87,7 @@ The default is to allow all fields and no embedded Solr queries, equivalent to `
* To allow title and all fields ending with '_s', use `uf=title *_s`.
* To allow all fields except title, use `uf=* -title`.
* To disallow all fielded searches, use `uf=-*`.
* To allow embedded Solr queries (e.g. `\_query_:"..."` or `\_val_:"..."` or `{!lucene ...}`),
* To allow embedded Solr queries (e.g., `\_query_:"..."` or `\_val_:"..."` or `{!lucene ...}`),
you _must_ expressly enable this by referring to the magic field `\_query_` in `uf`.
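As a sketch, a request allowing `title`, `*_s` fields, and embedded queries (the collection and field names are illustrative):

[source,bash]
----
curl http://localhost:8983/solr/mycollection/select \
  --data-urlencode 'defType=edismax' \
  --data-urlencode 'uf=title *_s _query_' \
  --data-urlencode 'q=title:solr _query_:"{!lucene}cat_s:book"'
----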
=== Field Aliasing using Per-Field qf Overrides

View File

@@ -20,7 +20,7 @@ The name of each collection is comprised of the TRA name and the start timestamp
Ideally, as a user of this feature, you needn't concern yourself with the particulars of the collection naming pattern
since both queries and updates may be done via the alias.
When adding data, you should usually direct documents to the alias (e.g. reference the alias name instead of any collection).
When adding data, you should usually direct documents to the alias (e.g., reference the alias name instead of any collection).
The Solr server and CloudSolrClient will direct an update request to the first collection that an alias points to.
The collections list for a TRA is always reverse sorted, and thus the connection path of the request will route to the
lead collection. Using CloudSolrClient is preferable as it can reduce the number of underlying physical HTTP requests by one.
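For example, an update sent to the alias rather than to any concrete collection (the alias and field names are illustrative):

[source,bash]
----
curl -X POST 'http://localhost:8983/solr/timedata/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1", "timestamp_dt": "2018-06-01T12:00:00Z"}]'
----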
@@ -39,7 +39,7 @@ TRUP first reads TRA configuration from the alias properties when it is initiali
* If TRUP needs to send it to a time segment represented by a collection other than the one that
the client chose to communicate with, then it will do so using mechanisms shared with DUP.
Once the document is forwarded to the correct collection (i.e. the correct TRA time segment), it skips directly to
Once the document is forwarded to the correct collection (i.e., the correct TRA time segment), it skips directly to
DUP on the target collection and continues normally, potentially being routed again to the correct shard & replica
within the target collection.
@@ -81,7 +81,7 @@ Some _potential_ areas for improvement that _are not implemented yet_ are:
== Limitations & Assumptions
* Only *time* routed aliases are supported. If you instead have some other sequential number, you could fake it
as a time (e.g. convert to a timestamp assuming some epoch and increment).
as a time (e.g., convert to a timestamp assuming some epoch and increment).
The smallest possible interval is one second.
No other routing scheme is supported, although this feature was developed with considerations that it could be
extended/improved to other schemes.
@@ -92,4 +92,4 @@ Some _potential_ areas for improvement that _are not implemented yet_ are:
the next collection, since it is otherwise not stored in any way.
* Avoid sending updates to the oldest collection if you have also configured that old collections should be
automatically deleted. It could lead to exceptions bubbling back to the indexing client.