Merge branch 'apache-https-master' into jira/solr-8593

Kevin Risden 2017-01-06 16:03:20 -06:00
commit 4b17b82a91
4 changed files with 109 additions and 26 deletions


@@ -76,6 +76,9 @@ public class AutomatonTermsEnum extends FilteredTermsEnum {
   */
  public AutomatonTermsEnum(TermsEnum tenum, CompiledAutomaton compiled) {
    super(tenum);
    if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
      throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
    }
    this.finite = compiled.finite;
    this.runAutomaton = compiled.runAutomaton;
    assert this.runAutomaton != null;


@@ -1016,4 +1016,12 @@ public class TestTermsEnum extends LuceneTestCase {
    w.close();
    d.close();
  }

  // LUCENE-7576
  public void testInvalidAutomatonTermsEnum() throws Exception {
    expectThrows(IllegalArgumentException.class,
        () -> {
          new AutomatonTermsEnum(TermsEnum.EMPTY, new CompiledAutomaton(Automata.makeString("foo")));
        });
  }
}


@@ -28,33 +28,105 @@ Please refer to the Solr Reference Guide's section on [Result Reranking](https:/
4. Search and rerank the results using the trained model
```
http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=exampleModel%20reRankDocs=25%20efi.user_query=%27test%27}&fl=price,score,name
```
# Assemble training data
In order to train a learning to rank model you need training data. Training data is
what "teaches" the model what the appropriate weight for each feature is. In general
what *teaches* the model what the appropriate weight for each feature is. In general
training data is a collection of queries with associated documents and what their ranking/score
should be. As an example:
```
hard drive|SP2514N |0.6|CLICK_LOGS
hard drive|6H500F0 |0.3|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
hard drive|IW-02 |0.0|CLICK_LOGS
ipod |MA147LL/A |1.0|HUMAN_JUDGEMENT
ipod |F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
ipod |IW-02 |0.5|HUMAN_JUDGEMENT
ipod |6H500F0 |0.0|HUMAN_JUDGEMENT
```
The columns in the example represent (see the parsing sketch below):
1. the user query;
2. a unique id for a document in the response;
3. a score representing the relevance of that document (not necessarily between zero and one);
4. the source, i.e., whether the training record was produced from interaction data (`CLICK_LOGS`) or from human judgements (`HUMAN_JUDGEMENT`).
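A minimal Python sketch (illustrative only, not part of the Solr distribution) of reading such records:
```
# Illustrative only: parse pipe-delimited training records like the example above.
from collections import namedtuple

TrainingRecord = namedtuple("TrainingRecord", ["query", "doc_id", "score", "source"])

def parse_line(line):
    query, doc_id, score, source = (field.strip() for field in line.split("|"))
    return TrainingRecord(query, doc_id, float(score), source)

print(parse_line("hard drive|SP2514N |0.6|CLICK_LOGS"))
# TrainingRecord(query='hard drive', doc_id='SP2514N', score=0.6, source='CLICK_LOGS')
```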
## How to produce training data
You can collect data for use with your machine learning algorithm in two main ways:
* **User Interactions**: given a specific query, log the users' interactions (e.g., clicks, shares on social networks, sends by email, etc.) and use them as proxies for relevance;
* **Human Judgements**: produce a training dataset by explicitly asking judges to evaluate the relevance of a document given the query.
### How to prepare training data from interaction data
There are many ways of preparing interaction data for training a model, and a complete review of the techniques is outside the scope of this readme. In the following we illustrate a simple way of obtaining training data from raw interaction data.
In the simplest case, interaction data is a log file generated by your application after it
has talked to Solr. The log will contain two different types of records:
* **query**: when a user performs a query we have a record with `user-id, query, responses`,
where `responses` is a list of unique document ids returned for a query.
**Example:**
```
diego, hard drive, [SP2514N,6H500F0,F8V7067-APL-KIT,IW-02]
```
* **click**: when a user performs a click we have a record with `user-id, query, document-id, click`
**Example:**
```
christine, hard drive, SP2514N
diego , hard drive, SP2514N
michael , hard drive, SP2514N
joshua , hard drive, IW-02
```
Given a log composed of records like these, a simple way to produce a training dataset is to group on the query field
and then assign each document a relevance score equal to its number of clicks:
```
hard drive|SP2514N |3|CLICK_LOGS
hard drive|IW-02 |1|CLICK_LOGS
hard drive|6H500F0 |0|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0|CLICK_LOGS
```
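A minimal sketch of that aggregation in Python (illustrative only; the tuple and log formats are assumptions, not something shipped with Solr):
```
# Illustrative sketch of the click-count aggregation described above.
# The log/tuple formats are assumptions; adapt them to your own logging.
from collections import Counter, defaultdict

def click_counts(click_records):
    """click_records: iterable of (user, query, doc_id) tuples parsed from the click log."""
    counts = defaultdict(Counter)
    for _user, query, doc_id in click_records:
        counts[query][doc_id] += 1
    return counts

def to_training_lines(counts, returned_docs, source="CLICK_LOGS"):
    """returned_docs maps each query to the doc ids it returned, so that documents
    shown but never clicked still receive a score of 0."""
    for query, docs in returned_docs.items():
        for doc_id in docs:
            yield "%s|%s|%d|%s" % (query, doc_id, counts[query][doc_id], source)

clicks = [("christine", "hard drive", "SP2514N"),
          ("diego", "hard drive", "SP2514N"),
          ("michael", "hard drive", "SP2514N"),
          ("joshua", "hard drive", "IW-02")]
returned = {"hard drive": ["SP2514N", "6H500F0", "F8V7067-APL-KIT", "IW-02"]}
for line in to_training_lines(click_counts(clicks), returned):
    print(line)
```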
Counting clicks like this is a really trivial way to generate a training dataset, and in many settings
it might not produce great results. Indeed, it is a well known fact that
clicks are *biased*: users tend to click on the first
result proposed for a query, even if it is not relevant. A click on a document in position
five could be considered more important than a click on a document in position one, because
the user took the effort to browse the result list down to position five.
Some approaches also take into account the time spent on the clicked document (if the user
spent only two seconds on the document and then moved on to other documents in the list,
she probably did not find that document relevant).
There are many papers proposing techniques for removing the bias or for taking the click positions into account;
a good survey is [Click Models for Web Search](http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf),
by Chuklin, Markov and de Rijke.
### How to prepare training data from human judgements
Another way to get training data is to ask human judges to label query-document pairs.
Producing human judgements is in general more expensive, but the quality of the
resulting dataset can be better than one produced from interaction data.
It is worth noting that human judgements can also be collected through a
crowdsourcing platform, which lets you show human workers documents associated with a
query and get back relevance labels.
Usually a human worker is shown a query together with a list of results, and the task
consists of assigning a relevance label to each document (e.g., Perfect, Excellent, Good, Fair, Not relevant).
Training data can then be obtained by translating the labels into numeric scores
(e.g., Perfect = 4, Excellent = 3, Good = 2, Fair = 1, Not relevant = 0).
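As a tiny sketch (the label scale is the one above; everything else is illustrative), judged pairs can be turned into training records like so:
```
# Illustrative only: map judgement labels to the numeric scores listed above and
# emit records in the same pipe-delimited training format.
LABEL_SCORES = {"Perfect": 4, "Excellent": 3, "Good": 2, "Fair": 1, "Not relevant": 0}

def judgement_to_line(query, doc_id, label, source="HUMAN_JUDGEMENT"):
    return "%s|%s|%d|%s" % (query, doc_id, LABEL_SCORES[label], source)

print(judgement_to_line("ipod", "MA147LL/A", "Perfect"))  # ipod|MA147LL/A|4|HUMAN_JUDGEMENT
```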
At this point you'll need to collect feature vectors for each query-document pair. You can use the information
from the Extract features section above to do this. An example script has been included in example/train_and_upload_demo_model.py.
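As a rough sketch of that step (not the bundled script): Solr's LTR `[features]` document transformer can return the feature vector for each document matching a training query. The feature store name and the `efi.*` parameter below are placeholders for whatever you configured when uploading your features.
```
# Sketch only: fetch feature vectors for the documents of one training query.
# "yourFeatureStore" and efi.user_query are placeholders; adjust to your setup.
import json, urllib.parse, urllib.request

query = "hard drive"
params = urllib.parse.urlencode({
    "q": query,
    "fl": "id,score,[features store=yourFeatureStore efi.user_query='%s']" % query,
    "wt": "json",
})
with urllib.request.urlopen("http://localhost:8983/solr/techproducts/query?" + params) as resp:
    docs = json.load(resp)["response"]["docs"]
for doc in docs:
    # the transformer typically returns a comma-separated "name=value" string per document
    print(doc["id"], doc.get("[features]"))
```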


@@ -1,8 +1,8 @@
hard drive|SP2514N|0.6|CLICK_LOGS
hard drive|6H500F0|0.3|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
hard drive|IW-02|0.0|CLICK_LOGS
ipod|MA147LL/A|1.0|HUMAN_JUDGEMENT
ipod|F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
ipod|IW-02|0.5|HUMAN_JUDGEMENT
ipod|6H500F0|0.0|HUMAN_JUDGEMENT