mirror of https://github.com/apache/lucene.git
Merge branch 'apache-https-master' into jira/solr-8593
commit 4b17b82a91
@@ -76,6 +76,9 @@ public class AutomatonTermsEnum extends FilteredTermsEnum {
   */
  public AutomatonTermsEnum(TermsEnum tenum, CompiledAutomaton compiled) {
    super(tenum);
    if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
      throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
    }
    this.finite = compiled.finite;
    this.runAutomaton = compiled.runAutomaton;
    assert this.runAutomaton != null;
@@ -1016,4 +1016,12 @@ public class TestTermsEnum extends LuceneTestCase {
    w.close();
    d.close();
  }

  // LUCENE-7576
  public void testInvalidAutomatonTermsEnum() throws Exception {
    expectThrows(IllegalArgumentException.class,
        () -> {
          new AutomatonTermsEnum(TermsEnum.EMPTY, new CompiledAutomaton(Automata.makeString("foo")));
        });
  }
}
@@ -28,33 +28,105 @@ Please refer to the Solr Reference Guide's section on [Result Reranking](https:/

4. Search and rerank the results using the trained model

```
http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=exampleModel%20reRankDocs=25%20efi.user_query=%27test%27}&fl=price,score,name
```

# Assemble training data

In order to train a learning to rank model you need training data. Training data is
what *teaches* the model what the appropriate weight for each feature is. In general,
training data is a collection of queries with associated documents and what their ranking/score
should be. As an example:
```
hard drive|SP2514N |0.6|CLICK_LOGS
hard drive|6H500F0 |0.3|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
hard drive|IW-02 |0.0|CLICK_LOGS

ipod |MA147LL/A |1.0|HUMAN_JUDGEMENT
ipod |F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
ipod |IW-02 |0.5|HUMAN_JUDGEMENT
ipod |6H500F0 |0.0|HUMAN_JUDGEMENT
```

The columns in the example represent:

1. the user query;

2. a unique id for a document in the response;

3. a score representing the relevance of that document (not necessarily between zero and one);

4. the source, i.e., whether the training record was produced from interaction data (`CLICK_LOGS`) or from human judgements (`HUMAN_JUDGEMENT`).
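
As a rough illustration of how such a file can be consumed, here is a minimal parsing sketch. The file name `training.txt` and the `TrainingRecord` helper are assumptions made for this example; they are not part of the plugin or its demo scripts.

```
from collections import namedtuple

# One row of the pipe-delimited training data described above.
TrainingRecord = namedtuple("TrainingRecord", ["query", "doc_id", "relevance", "source"])

def read_training_data(path):
    """Parse query|doc-id|relevance|source lines into TrainingRecord objects."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank separator lines between query groups
            query, doc_id, relevance, source = (field.strip() for field in line.split("|"))
            records.append(TrainingRecord(query, doc_id, float(relevance), source))
    return records

# e.g. read_training_data("training.txt")[0]
# -> TrainingRecord(query='hard drive', doc_id='SP2514N', relevance=0.6, source='CLICK_LOGS')
```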

## How to produce training data

You might collect data for use with your machine learning algorithm relying on:

* **User Interactions**: given a specific query, it is possible to log all the user interactions (e.g., clicks, shares on social networks, sending by email, etc.), and then use them as proxies for relevance;
* **Human Judgements**: a training dataset is produced by explicitly asking some judges to evaluate the relevance of a document given the query.

### How to prepare training data from interaction data?

There are many ways of preparing interaction data for training a model, and it is outside the scope of this readme to provide a complete review of all the techniques. In the following we illustrate a simple way of obtaining training data from simple interaction data.

Simple interaction data will be a log file generated by your application after it
has talked to Solr. The log will contain two different types of record:

* **query**: when a user performs a query, we have a record with `user-id, query, responses`,
where `responses` is a list of unique document ids returned for that query.

**Example:**

```
diego, hard drive, [SP2514N,6H500F0,F8V7067-APL-KIT,IW-02]
```

* **click**: when a user performs a click, we have a record with `user-id, query, document-id, click`

**Example:**

```
christine, hard drive, SP2514N
diego , hard drive, SP2514N
michael , hard drive, SP2514N
joshua , hard drive, IW-02
```

Given a log composed of records like these, a simple way to produce a training dataset is to group the records by query
and then assign to each document a relevance score equal to the number of clicks it received for that query:

```
hard drive|SP2514N |3|CLICK_LOGS
hard drive|IW-02 |1|CLICK_LOGS
hard drive|6H500F0 |0|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0|CLICK_LOGS
```
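
A minimal sketch of this grouping step follows, assuming the query and click records shown above live in two plain-text log files; the file paths and function names are made up for this example.

```
from collections import defaultdict

def load_logs(query_log_path, click_log_path):
    """Read the two record types shown above and count clicks per (query, document)."""
    shown = defaultdict(set)      # query -> documents returned for it
    clicks = defaultdict(int)     # (query, doc_id) -> number of clicks

    # query records: user-id, query, [doc-id,doc-id,...]
    with open(query_log_path) as f:
        for line in f:
            if not line.strip():
                continue
            _user, query, responses = (part.strip() for part in line.split(",", 2))
            shown[query].update(doc.strip() for doc in responses.strip("[]").split(","))

    # click records: user-id, query, doc-id
    with open(click_log_path) as f:
        for line in f:
            if not line.strip():
                continue
            _user, query, doc_id = (part.strip() for part in line.split(","))
            clicks[(query, doc_id)] += 1

    return shown, clicks

def write_training_data(shown, clicks, out_path):
    """Emit one query|doc-id|clicks|CLICK_LOGS row per document returned for each query."""
    with open(out_path, "w") as out:
        for query, docs in shown.items():
            for doc_id in sorted(docs, key=lambda d: clicks[(query, d)], reverse=True):
                out.write(f"{query}|{doc_id}|{clicks[(query, doc_id)]}|CLICK_LOGS\n")
```

Running these two helpers over the records above reproduces the `CLICK_LOGS` rows shown in the previous block (up to column padding).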

This is a really trivial way to generate a training dataset, and in many settings
it might not produce great results. Indeed, it is a well-known fact that
clicks are *biased*: users tend to click on the first
result proposed for a query, even if it is not relevant. A click on a document in position
five could be considered more important than a click on a document in position one, because
the user made the effort to scan the result list down to position five.

Some approaches take into account the time spent on the clicked document (if the user
spent only two seconds on the document and then clicked on other documents in the list,
that click is probably not a reliable relevance signal).

There are many papers proposing techniques for removing the bias or for taking the click positions into account;
a good survey is [Click Models for Web Search](http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf),
by Chuklin, Markov and de Rijke.
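
To make the position argument concrete, here is a rough sketch that weights each click by the rank at which the result was shown. It assumes the click log also records that position, and this particular weighting is only an illustrative assumption, not one of the models from the survey.

```
from collections import defaultdict

def position_weighted_scores(click_records):
    """Aggregate clicks into (query, doc_id) scores, counting a click further down
    the result list more heavily than a click on the first result."""
    scores = defaultdict(float)
    for query, doc_id, position in click_records:   # position = 1-based rank shown to the user
        scores[(query, doc_id)] += float(position)  # a click at rank 5 counts five times a click at rank 1
    return scores

# e.g. position_weighted_scores([("hard drive", "SP2514N", 1), ("hard drive", "IW-02", 5)])
# -> {("hard drive", "SP2514N"): 1.0, ("hard drive", "IW-02"): 5.0}
```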

### Prepare training data from human judgements

Another way to get training data is to ask human judges to label the documents returned for a set of queries.
Producing human judgements is in general more expensive, but the quality of the
resulting dataset can be better than one produced from interaction data.
It is worth noting that human judgements can also be produced through a
crowdsourcing platform, which allows you to show human workers documents associated with a
query and to get back relevance labels.
Usually a human worker is shown a query together with a list of results, and the task
consists of assigning a relevance label to each document (e.g., Perfect, Excellent, Good, Fair, Not relevant).
Training data can then be obtained by translating the labels into numeric scores
(e.g., Perfect = 4, Excellent = 3, Good = 2, Fair = 1, Not relevant = 0).
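
That translation step is straightforward; below is a minimal sketch, assuming the judgements are available as `(query, doc-id, label)` tuples (the function and variable names are illustrative).

```
# Numeric scores for the judgement labels mentioned above.
LABEL_SCORES = {"Perfect": 4, "Excellent": 3, "Good": 2, "Fair": 1, "Not relevant": 0}

def judgements_to_training_rows(judgements):
    """Turn (query, doc_id, label) judgements into query|doc-id|score|HUMAN_JUDGEMENT rows."""
    return [f"{query}|{doc_id}|{LABEL_SCORES[label]}|HUMAN_JUDGEMENT"
            for query, doc_id, label in judgements]

# e.g. judgements_to_training_rows([("ipod", "MA147LL/A", "Perfect")])
# -> ['ipod|MA147LL/A|4|HUMAN_JUDGEMENT']
```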

At this point you'll need to collect feature vectors for each query-document pair. You can use the information
from the "Extract features" section above to do this. An example script has been included in example/train_and_upload_demo_model.py.
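
For reference, feature vectors can be fetched over HTTP with the LTR `[features]` document transformer. The sketch below is only an outline of that request; the feature store name and the exact response handling are assumptions for this example, and train_and_upload_demo_model.py remains the authoritative demo script.

```
import json
import urllib.request
from urllib.parse import urlencode

SOLR_QUERY_URL = "http://localhost:8983/solr/techproducts/query"

def fetch_feature_vector(query, doc_id, store="exampleFeatureStore"):
    """Ask Solr for the LTR feature vector of one query-document pair via the
    [features] document transformer."""
    params = urlencode({
        "q": f'id:"{doc_id}"',
        "fl": f"id,score,[features store={store} efi.user_query='{query}']",
        "wt": "json",
    })
    with urllib.request.urlopen(f"{SOLR_QUERY_URL}?{params}") as response:
        docs = json.load(response)["response"]["docs"]
    # The transformer returns the features as "name1=value1,name2=value2,..."
    return docs[0].get("[features]") if docs else None

# e.g. fetch_feature_vector("ipod", "MA147LL/A")
```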
@@ -1,8 +1,8 @@
hard drive|SP2514N|0.6|CLICK_LOGS
hard drive|6H500F0|0.3|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
hard drive|IW-02|0.0|CLICK_LOGS
ipod|MA147LL/A|1.0|HUMAN_JUDGEMENT
ipod|F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
ipod|IW-02|0.5|HUMAN_JUDGEMENT
ipod|6H500F0|0.0|HUMAN_JUDGEMENT