We have a movie data set in JSON, Solr XML, and CSV formats.
All 3 formats contain the same data. You can use any one format to index documents to Solr.

The data is fetched from Freebase and the data license is present in the films-LICENSE.txt file.

This data consists of the following fields:
* "id" - unique identifier for the movie
* "name" - Name of the movie
* "directed_by" - The person(s) who directed the making of the film
* "initial_release_date" - The earliest official initial film screening date in any country
* "genre" - The genre(s) that the movie belongs to

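
To illustrate the record shape, here is a minimal Python sketch that checks one film record against the field list above. The sample record and the helper name `check_film` are made up for illustration, and treating "directed_by" and "genre" as lists is an assumption based on the "(s)" in their descriptions.

```python
# Field names come from the list above; everything else here is illustrative.
REQUIRED_FIELDS = ("id", "name", "directed_by", "initial_release_date", "genre")
MULTI_VALUED = ("directed_by", "genre")  # assumption: these hold lists of values


def check_film(doc):
    """Return a list of problems found in one film record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    for name in MULTI_VALUED:
        if name in doc and not isinstance(doc[name], list):
            problems.append(f"{name} should be a list")
    return problems


# Hypothetical record, not taken from the data set.
sample = {
    "id": "/example/film_1",
    "name": "Example Film",
    "directed_by": ["Jane Doe"],
    "initial_release_date": "2001-01-01",
    "genre": ["Drama"],
}

print(check_film(sample))
```

A check like this can be run over the parsed contents of films.json before indexing, to catch records that would not fit the fields described above.
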
Steps:
* Start Solr:
  ```
  bin/solr start
  ```

* Create a "films" core:

  ```
  bin/solr create -c films
  ```

* Set the schema for a couple of fields whose types Solr would otherwise guess differently than we'd like:

  ```
  curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
      "add-field" : {
          "name":"name",
          "type":"text_general",
          "multiValued":false,
          "stored":true
      },
      "add-field" : {
          "name":"initial_release_date",
          "type":"pdate",
          "stored":true
      }
  }'
  ```

* Now let's index the data, using one of these three commands:

  - JSON: `bin/post -c films example/films/films.json`
  - XML: `bin/post -c films example/films/films.xml`
  - CSV:
    ```
    bin/post \
        -c films \
        example/films/films.csv \
        -params "f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|"
    ```

* Let's get searching!

  - Search for 'Batman':

    http://localhost:8983/solr/films/query?q=name:batman

    * If you get an error about the name field not existing, you haven't yet indexed the data.
    * If you don't get an error but get zero results, chances are that the _name_ field schema type override wasn't set
      before indexing the data the first time (the field ended up as a "string" type, which requires an exact, case-sensitive match).
      It's easiest to simply reset the environment and try again, ensuring that each step executes successfully.

  - Show me all 'Superhero' movies:

    http://localhost:8983/solr/films/query?q=*:*&fq=genre:%22Superhero%20movie%22

  - Let's see the distribution of genres across all the movies. See the facet section of the response for the counts:

    http://localhost:8983/solr/films/query?q=*:*&facet=true&facet.field=genre

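
The CSV split parameters and the search URLs in the steps above can be sketched in Python as follows. This is only an illustration: the helper names are made up, `split_multivalued` is a simplified stand-in for what `f.genre.split` and `f.genre.separator` ask Solr to do, and `pair_facet_counts` assumes the facet counts arrive as a flat list alternating value and count (believed to be the default JSON layout). The counts in the example are invented.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the search URLs above.
QUERY_URL = "http://localhost:8983/solr/films/query"


def split_multivalued(cell, separator="|"):
    """Split one CSV cell into values, as f.<field>.split=true requests (simplified)."""
    return cell.split(separator) if cell else []


def build_query_url(params):
    """Encode query parameters, leaving '*' and ':' readable as in the URLs above."""
    return QUERY_URL + "?" + urlencode(params, safe="*:", quote_via=quote)


def pair_facet_counts(flat):
    """Turn [value, count, value, count, ...] into a {value: count} dict."""
    return dict(zip(flat[::2], flat[1::2]))


print(split_multivalued("Action|Superhero movie"))
print(build_query_url({"q": "*:*", "fq": 'genre:"Superhero movie"'}))
print(build_query_url({"q": "*:*", "facet": "true", "facet.field": "genre"}))
print(pair_facet_counts(["Drama", 3, "Comedy", 2]))  # invented counts
```

Building the URL from a dictionary avoids hand-encoding the quotes and spaces in the filter query, which is the part of the superhero example that is easiest to get wrong.
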
Exploring the data further:

* Increase the MAX_ITERATIONS value, put in your Freebase API_KEY, and run the film_data_generator.py script using Python 3.
  Now re-index Solr with the new data.

FAQ:

Why override the schema of the _name_ and _initial_release_date_ fields?

Without overriding those field types, the _name_ field would have been guessed as a multi-valued string field type
and _initial_release_date_ would have been guessed as a multi-valued pdate type. It makes more sense with this
particular data set domain to have the movie name be a single-valued general full-text searchable field,
and for the release date also to be single-valued.

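
To make the override concrete, here is a hedged Python sketch of the same two field definitions sent by the curl step. A Python dict cannot repeat the "add-field" key the way the curl body does, so the sketch puts both definitions in a list under a single "add-field" key; that list form is assumed to be accepted by the Schema API. The function names are made up for illustration, and `post_schema` is defined but not called here because it needs a running Solr with a "films" core.

```python
import json
import urllib.request

# Same endpoint as the curl step.
SCHEMA_URL = "http://localhost:8983/solr/films/schema"


def build_schema_payload():
    """Return the two field overrides as one JSON-serializable dict."""
    return {
        "add-field": [
            # Single-valued, full-text searchable movie name.
            {"name": "name", "type": "text_general", "multiValued": False, "stored": True},
            # Release date as a date field.
            {"name": "initial_release_date", "type": "pdate", "stored": True},
        ]
    }


def post_schema(payload, url=SCHEMA_URL):
    """POST the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


print(json.dumps(build_schema_payload(), indent=2))
```

As with the curl step, this has to run before the first indexing pass; otherwise the guessed field types are already in place.
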
How do I clear and reset my environment?

See the script below.

Is there an easy to copy/paste script to do all of the above?

```
# Here ya go: stop Solr, wipe the "films" core, and replay the steps above.

bin/solr stop
rm -f server/logs/*.log
rm -Rf server/solr/films/
bin/solr start
bin/solr create -c films
curl http://localhost:8983/solr/films/schema -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field" : {
        "name":"name",
        "type":"text_general",
        "multiValued":false,
        "stored":true
    },
    "add-field" : {
        "name":"initial_release_date",
        "type":"pdate",
        "stored":true
    }
}'
bin/post -c films example/films/films.json
```