Merge branch 'apache-https-master' into jira/solr-8593

Kevin Risden 2017-01-06 16:03:20 -06:00
commit 4b17b82a91
4 changed files with 109 additions and 26 deletions


@@ -76,6 +76,9 @@ public class AutomatonTermsEnum extends FilteredTermsEnum {
   */
  public AutomatonTermsEnum(TermsEnum tenum, CompiledAutomaton compiled) {
    super(tenum);
    if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
      throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
    }
    this.finite = compiled.finite;
    this.runAutomaton = compiled.runAutomaton;
    assert this.runAutomaton != null;


@@ -1016,4 +1016,12 @@ public class TestTermsEnum extends LuceneTestCase {
    w.close();
    d.close();
  }

  // LUCENE-7576
  public void testInvalidAutomatonTermsEnum() throws Exception {
    expectThrows(IllegalArgumentException.class,
        () -> {
          new AutomatonTermsEnum(TermsEnum.EMPTY, new CompiledAutomaton(Automata.makeString("foo")));
        });
  }
}


@@ -28,33 +28,105 @@ Please refer to the Solr Reference Guide's section on [Result Reranking](https:/
4. Search and rerank the results using the trained model
```
http://localhost:8983/solr/techproducts/query?indent=on&q=test&wt=json&rq={!ltr%20model=exampleModel%20reRankDocs=25%20efi.user_query=%27test%27}&fl=price,score,name
```
# Assemble training data
In order to train a learning to rank model you need training data. Training data is
what "teaches" the model what the appropriate weight for each feature is. In general
what *teaches* the model what the appropriate weight for each feature is. In general
training data is a collection of queries with associated documents and what their ranking/score
should be. As an example:
```
hard drive|SP2514N |0.6|CLICK_LOGS
hard drive|6H500F0 |0.3|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
hard drive|IW-02 |0.0|CLICK_LOGS
ipod |MA147LL/A |1.0|HUMAN_JUDGEMENT
ipod |F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
ipod |IW-02 |0.5|HUMAN_JUDGEMENT
ipod |6H500F0 |0.0|HUMAN_JUDGEMENT
```
The columns in the example represent (see the parsing sketch below):
1. the user query;
2. a unique id for a document in the response;
3. a score representing the relevance of that document (not necessarily between zero and one);
4. the source, i.e., whether the training record was produced from interaction data (`CLICK_LOGS`) or from human judgements (`HUMAN_JUDGEMENT`).
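A minimal Python sketch (illustrative only, not part of the Solr distribution) of reading such records:
```
# Illustrative only: parse pipe-delimited training records like the example above.
from collections import namedtuple

TrainingRecord = namedtuple("TrainingRecord", ["query", "doc_id", "score", "source"])

def parse_line(line):
    query, doc_id, score, source = (field.strip() for field in line.split("|"))
    return TrainingRecord(query, doc_id, float(score), source)

print(parse_line("hard drive|SP2514N |0.6|CLICK_LOGS"))
# TrainingRecord(query='hard drive', doc_id='SP2514N', score=0.6, source='CLICK_LOGS')
```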
## How to produce training data
You can collect data for use with your machine learning algorithm in two main ways:
* **User Interactions**: given a specific query, log the users' interactions (e.g., clicks, shares on social networks, sends by email, etc.) and use them as proxies for relevance;
* **Human Judgements**: produce a training dataset by explicitly asking judges to evaluate the relevance of a document given the query.
### How to prepare training data from interaction data
There are many ways of preparing interaction data for training a model, and a complete review of the techniques is outside the scope of this readme. In the following we illustrate a simple way of obtaining training data from raw interaction data.
In the simplest case, interaction data is a log file generated by your application after it
has talked to Solr. The log will contain two different types of records:
* **query**: when a user performs a query we have a record with `user-id, query, responses`,
where `responses` is a list of unique document ids returned for a query.
**Example:**
```
diego, hard drive, [SP2514N,6H500F0,F8V7067-APL-KIT,IW-02]
```
* **click**: when a user performs a click we have a record with `user-id, query, document-id, click`
**Example:**
```
christine, hard drive, SP2514N
diego , hard drive, SP2514N
michael , hard drive, SP2514N
joshua , hard drive, IW-02
```
Given a log composed of records like these, a simple way to produce a training dataset is to group on the query field
and then assign each document a relevance score equal to its number of clicks:
```
hard drive|SP2514N |3|CLICK_LOGS
hard drive|IW-02 |1|CLICK_LOGS
hard drive|6H500F0 |0|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0|CLICK_LOGS
```
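A minimal sketch of that aggregation in Python (illustrative only; the tuple and log formats are assumptions, not something shipped with Solr):
```
# Illustrative sketch of the click-count aggregation described above.
# The log/tuple formats are assumptions; adapt them to your own logging.
from collections import Counter, defaultdict

def click_counts(click_records):
    """click_records: iterable of (user, query, doc_id) tuples parsed from the click log."""
    counts = defaultdict(Counter)
    for _user, query, doc_id in click_records:
        counts[query][doc_id] += 1
    return counts

def to_training_lines(counts, returned_docs, source="CLICK_LOGS"):
    """returned_docs maps each query to the doc ids it returned, so that documents
    shown but never clicked still receive a score of 0."""
    for query, docs in returned_docs.items():
        for doc_id in docs:
            yield "%s|%s|%d|%s" % (query, doc_id, counts[query][doc_id], source)

clicks = [("christine", "hard drive", "SP2514N"),
          ("diego", "hard drive", "SP2514N"),
          ("michael", "hard drive", "SP2514N"),
          ("joshua", "hard drive", "IW-02")]
returned = {"hard drive": ["SP2514N", "6H500F0", "F8V7067-APL-KIT", "IW-02"]}
for line in to_training_lines(click_counts(clicks), returned):
    print(line)
```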
Counting clicks like this is a really trivial way to generate a training dataset, and in many settings
it might not produce great results. Indeed, it is a well known fact that
clicks are *biased*: users tend to click on the first
result proposed for a query, even if it is not relevant. A click on a document in position
five could be considered more important than a click on a document in position one, because
the user took the effort to browse the result list down to position five.
Some approaches also take into account the time spent on the clicked document (if the user
spent only two seconds on the document and then moved on to other documents in the list,
she probably did not find that document relevant).
There are many papers proposing techniques for removing the bias or for taking the click positions into account;
a good survey is [Click Models for Web Search](http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf),
by Chuklin, Markov and de Rijke.
### How to prepare training data from human judgements
Another way to get training data is to ask human judges to label query-document pairs.
Producing human judgements is in general more expensive, but the quality of the
resulting dataset can be better than one produced from interaction data.
It is worth noting that human judgements can also be collected through a
crowdsourcing platform, which lets you show human workers documents associated with a
query and get back relevance labels.
Usually a human worker is shown a query together with a list of results, and the task
consists of assigning a relevance label to each document (e.g., Perfect, Excellent, Good, Fair, Not relevant).
Training data can then be obtained by translating the labels into numeric scores
(e.g., Perfect = 4, Excellent = 3, Good = 2, Fair = 1, Not relevant = 0).
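As a tiny sketch (the label scale is the one above; everything else is illustrative), judged pairs can be turned into training records like so:
```
# Illustrative only: map judgement labels to the numeric scores listed above and
# emit records in the same pipe-delimited training format.
LABEL_SCORES = {"Perfect": 4, "Excellent": 3, "Good": 2, "Fair": 1, "Not relevant": 0}

def judgement_to_line(query, doc_id, label, source="HUMAN_JUDGEMENT"):
    return "%s|%s|%d|%s" % (query, doc_id, LABEL_SCORES[label], source)

print(judgement_to_line("ipod", "MA147LL/A", "Perfect"))  # ipod|MA147LL/A|4|HUMAN_JUDGEMENT
```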
At this point you'll need to collect feature vectors for each query-document pair. You can use the information
from the Extract features section above to do this. An example script has been included in example/train_and_upload_demo_model.py.
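As a rough sketch of that step (not the bundled script): Solr's LTR `[features]` document transformer can return the feature vector for each document matching a training query. The feature store name and the `efi.*` parameter below are placeholders for whatever you configured when uploading your features.
```
# Sketch only: fetch feature vectors for the documents of one training query.
# "yourFeatureStore" and efi.user_query are placeholders; adjust to your setup.
import json, urllib.parse, urllib.request

query = "hard drive"
params = urllib.parse.urlencode({
    "q": query,
    "fl": "id,score,[features store=yourFeatureStore efi.user_query='%s']" % query,
    "wt": "json",
})
with urllib.request.urlopen("http://localhost:8983/solr/techproducts/query?" + params) as resp:
    docs = json.load(resp)["response"]["docs"]
for doc in docs:
    # the transformer typically returns a comma-separated "name=value" string per document
    print(doc["id"], doc.get("[features]"))
```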


@@ -1,8 +1,8 @@
hard drive|SP2514N|0.6|CLICK_LOGS
hard drive|6H500F0|0.3|CLICK_LOGS
hard drive|F8V7067-APL-KIT|0.0|CLICK_LOGS
hard drive|IW-02|0.0|CLICK_LOGS
ipod|MA147LL/A|1.0|HUMAN_JUDGEMENT
ipod|F8V7067-APL-KIT|0.5|HUMAN_JUDGEMENT
ipod|IW-02|0.5|HUMAN_JUDGEMENT
ipod|6H500F0|0.0|HUMAN_JUDGEMENT