lucene/solr/solr-ref-guide/src/introduction-to-solr-indexi...

57 lines
4.5 KiB
Plaintext

= Introduction to Solr Indexing
:page-shortname: introduction-to-solr-indexing
:page-permalink: introduction-to-solr-indexing.html
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
This section describes the process of indexing: adding content to a Solr index and, if necessary, modifying that content or deleting it.
By adding content to an index, we make it searchable by Solr.
A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.
Here are the three most common ways of loading data into a Solr index:
* Using the <<uploading-data-with-solr-cell-using-apache-tika.adoc#uploading-data-with-solr-cell-using-apache-tika,Solr Cell>> framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
* Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated.
* Writing a custom Java application to ingest data through Solr's Java Client API (which is described in more detail in <<client-apis.adoc#client-apis,Client APIs>>). Using the Java API may be the best choice if you're working with an application, such as a Content Management System (CMS), that offers a Java API.
Regardless of the method used to ingest data, there is a common basic data structure for data being fed into a Solr index: a _document_ containing multiple _fields,_ each with a _name_ and containing _content,_ which may be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a database), although the use of a unique ID field is not strictly required by Solr.
If the field name is defined in the Schema that is associated with the index, then the analysis steps associated with that field will be applied to its content when the content is tokenized. Fields that are not explicitly defined in the Schema will either be ignored or mapped to a dynamic field definition (see <<documents-fields-and-schema-design.adoc#documents-fields-and-schema-design,Documents, Fields, and Schema Design>>), if one matching the field name exists.
For more information on indexing in Solr, see the https://wiki.apache.org/solr/FrontPage[Solr Wiki].
[[IntroductiontoSolrIndexing-TheSolrExampleDirectory]]
== The Solr Example Directory
When starting Solr with the "-e" option, the `example/` directory will be used as base directory for the example Solr instances that are created. This directory also includes an `example/exampledocs/` subdirectory containing sample documents in a variety of formats that you can use to experiment with indexing into the various examples.
[[IntroductiontoSolrIndexing-ThecurlUtilityforTransferringFiles]]
== The curl Utility for Transferring Files
Many of the instructions and examples in this section make use of the `curl` utility for transferring content through a URL. `curl` posts and retrieves data over HTTP, FTP, and many other protocols. Most Linux distributions include a copy of `curl`. You'll find curl downloads for Linux, Windows, and many other operating systems at http://curl.haxx.se/download.html. Documentation for `curl` is available here: http://curl.haxx.se/docs/manpage.html.
[IMPORTANT]
====
Using `curl` or other command line tools for posting data is just fine for examples or tests, but it's not the recommended method for achieving the best performance for updates in production environments. You will achieve better performance with Solr Cell or the other methods described in this section.
Instead of `curl`, you can use utilities such as GNU `wget` (http://www.gnu.org/software/wget/) or manage GETs and POSTS with Perl, although the command line options will differ.
====