From 48ac9747a8c9a3206aaff165add1c12e75c79604 Mon Sep 17 00:00:00 2001
From: Luca Cavanna
Date: Thu, 8 Aug 2013 17:10:42 +0200
Subject: [PATCH] Added third highlighter type based on lucene postings highlighter

Requires field index_options set to "offsets" in order to store positions and offsets in the postings list.
Considerably faster than the plain highlighter since it doesn't need to reanalyze the text to be highlighted: the larger the documents, the better the performance gain should be.
Requires less disk space than term_vectors, which are needed for the fast_vector_highlighter.
Breaks the text into sentences and highlights them. Uses a BreakIterator to find sentences in the text. Plays really well with natural text, not quite as well if the text contains html markup for instance.
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm.

Uses a forked version of the lucene postings highlighter to support:
- per value discrete highlighting for fields that have multiple values, needed when number_of_fragments=0 since we want to return a snippet per value
- manually passing in query terms to avoid calling extract terms multiple times, since we use a different highlighter instance per doc/field, but the query is always the same

The lucene postings highlighter api is quite different from the existing highlighters api, the main difference being that it allows highlighting multiple fields in multiple docs with a single call, ensuring sequential IO.
The way it is introduced in elasticsearch in this first round is a compromise that tries not to change the current highlight api, which works per document, per field. The main disadvantage is that we lose the sequential IO, but we can always refactor the highlight api to work with multiple documents.

Supports pre_tags, post_tags, number_of_fragments (0 highlights the whole field), require_field_match, no_match_size, order by score and html encoding.
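For illustration only, a minimal sketch of a search request that forces the new highlighter. It assumes a field named `content` (a placeholder name) mapped with index_options set to "offsets"; the match query and the tags are arbitrary example values, not part of this change:

```json
{
  "query" : { "match" : { "content" : "postings highlighter" } },
  "highlight" : {
    "pre_tags" : ["<em>"],
    "post_tags" : ["</em>"],
    "fields" : {
      "content" : { "type" : "postings", "number_of_fragments" : 0 }
    }
  }
}
```

With number_of_fragments set to 0 the whole field is highlighted, one snippet per value when the field has multiple values.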
Closes #3704 --- .../search/request/highlighting.asciidoc | 125 +- .../CustomPassageFormatter.java | 78 + .../CustomPostingsHighlighter.java | 187 ++ .../search/postingshighlight/Snippet.java | 50 + .../XDefaultPassageFormatter.java | 147 ++ .../postingshighlight/XPassageFormatter.java | 46 + .../XPostingsHighlighter.java | 777 ++++++++ .../highlight/FastVectorHighlighter.java | 9 +- .../search/highlight/HighlightBuilder.java | 4 +- .../search/highlight/HighlightModule.java | 1 + .../search/highlight/HighlightPhase.java | 11 +- .../search/highlight/HighlightUtils.java | 68 + .../search/highlight/HighlighterContext.java | 12 +- .../search/highlight/PlainHighlighter.java | 38 +- .../search/highlight/PostingsHighlighter.java | 237 +++ .../CustomPassageFormatterTests.java | 107 ++ .../CustomPostingsHighlighterTests.java | 487 +++++ .../XPostingsHighlighterTests.java | 1693 +++++++++++++++++ .../highlight/HighlighterSearchTests.java | 797 +++++++- .../hamcrest/ElasticsearchAssertions.java | 10 +- .../search/postingshighlight/CambridgeMA.utf8 | 1 + 21 files changed, 4770 insertions(+), 115 deletions(-) create mode 100644 src/main/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatter.java create mode 100644 src/main/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighter.java create mode 100644 src/main/java/org/apache/lucene/search/postingshighlight/Snippet.java create mode 100644 src/main/java/org/apache/lucene/search/postingshighlight/XDefaultPassageFormatter.java create mode 100644 src/main/java/org/apache/lucene/search/postingshighlight/XPassageFormatter.java create mode 100644 src/main/java/org/apache/lucene/search/postingshighlight/XPostingsHighlighter.java create mode 100644 src/main/java/org/elasticsearch/search/highlight/HighlightUtils.java create mode 100644 src/main/java/org/elasticsearch/search/highlight/PostingsHighlighter.java create mode 100644 
src/test/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatterTests.java create mode 100644 src/test/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighterTests.java create mode 100644 src/test/java/org/apache/lucene/search/postingshighlight/XPostingsHighlighterTests.java create mode 100644 src/test/resources/org/apache/lucene/search/postingshighlight/CambridgeMA.utf8 diff --git a/docs/reference/search/request/highlighting.asciidoc b/docs/reference/search/request/highlighting.asciidoc index 74a84daee1f..28566636849 100644 --- a/docs/reference/search/request/highlighting.asciidoc +++ b/docs/reference/search/request/highlighting.asciidoc @@ -2,8 +2,9 @@ === Highlighting Allows to highlight search results on one or more fields. The -implementation uses either the lucene `fast-vector-highlighter` or -`highlighter`. The search request body: +implementation uses either the lucene `highlighter`, `fast-vector-highlighter` +or `postings-highlighter`. The following is an example of the search request +body: [source,js] -------------------------------------------------- @@ -24,16 +25,54 @@ fragments). In order to perform highlighting, the actual content of the field is required. If the field in question is stored (has `store` set to `yes` -in the mapping), it will be used, otherwise, the actual `_source` will +in the mapping) it will be used, otherwise, the actual `_source` will be loaded and the relevant field will be extracted from it. -If `term_vector` information is provided by setting `term_vector` to -`with_positions_offsets` in the mapping then the fast vector -highlighter will be used instead of the plain highlighter. The fast vector highlighter: +The field name supports wildcard notation. For example, using `comment_*` +will cause all fields that match the expression to be highlighted. 
+ +==== Postings highlighter + +If `index_options` is set to `offsets` in the mapping the postings highlighter +will be used instead of the plain highlighter. The postings highlighter: + +* Is faster since it doesn't require to reanalyze the text to be highlighted: +the larger the documents the better the performance gain should be +* Requires less disk space than term_vectors, needed for the fast vector +highlighter +* Breaks the text into sentences and highlights them. Plays really well with +natural languages, not as well with fields containing for instance html markup +* Treats the document as the whole corpus, and scores individual sentences as +if they were documents in this corpus, using the BM25 algorithm + +Here is an example of setting the `content` field to allow for +highlighting using the postings highlighter on it: + +[source,js] +-------------------------------------------------- +{ + "type_name" : { + "content" : {"index_options" : "offsets"} + } +} +-------------------------------------------------- + +Note that the postings highlighter is meant to perform simple query terms +highlighting, regardless of their positions. That means that when used for +instance in combination with a phrase query, it will highlight all the terms +that the query is composed of, regardless of whether they are actually part of +a query match, effectively ignoring their positions. + + +==== Fast vector highlighter + +If `term_vector` information is provided by setting `term_vector` to +`with_positions_offsets` in the mapping then the fast vector highlighter +will be used instead of the plain highlighter. 
The fast vector highlighter: * Is faster especially for large fields (> `1MB`) * Can be customized with `boundary_chars`, `boundary_max_scan`, and - `fragment_offset` (see below) + `fragment_offset` (see <<boundary-characters>>) * Requires setting `term_vector` to `with_positions_offsets` which increases the size of the index @@ -50,9 +89,25 @@ the index to be bigger): } -------------------------------------------------- -The field name supports wildcard notation, for example, -using `comment_*` which will cause all fields that match the expression -to be highlighted. +==== Force highlighter type + +The `type` field allows to force a specific highlighter type. This is useful +for instance when needing to use the plain highlighter on a field that has +`term_vectors` enabled. The allowed values are: `plain`, `postings` and `fvh`. +The following is an example that forces the use of the plain highlighter: + +[source,js] +-------------------------------------------------- +{ + "query" : {...}, + "highlight" : { + "fields" : { + "content" : { "type" : "plain"} + } + } +} +-------------------------------------------------- + [[tags]] ==== Highlighting Tags @@ -61,6 +116,23 @@ By default, the highlighting will wrap highlighted text in `<em>` and `</em>`. This can be controlled by setting `pre_tags` and `post_tags`, for example: +[source,js] +-------------------------------------------------- +{ + "query" : {...}, + "highlight" : { + "pre_tags" : ["<tag1>"], + "post_tags" : ["</tag1>"], + "fields" : { + "_all" : {} + } + } +} +-------------------------------------------------- + +Using the fast vector highlighter there can be more tags, and the "importance" +is ordered. + [source,js] -------------------------------------------------- { @@ -75,9 +147,8 @@ for example: } -------------------------------------------------- -There can be a single tag or more, and the "importance" is ordered. 
There are also built in "tag" schemas, with currently a single schema -called `styled` with `pre_tags` of: +called `styled` with the following `pre_tags`: [source,js] -------------------------------------------------- @@ -87,7 +158,7 @@ called `styled` with `pre_tags` of: -------------------------------------------------- -And post tag of `</em>`. If you think of more nice to have built in tag +and `</em>` as `post_tags`. If you think of more nice to have built in tag schemas, just send an email to the mailing list or open an issue. Here is an example of switching tag schemas: @@ -104,6 +175,9 @@ is an example of switching tag schemas: } -------------------------------------------------- + +==== Encoder + An `encoder` parameter can be used to define how highlighted text will be encoded. It can be either `default` (no encoding) or `html` (will escape html, if you use html highlighting tags). @@ -112,7 +186,8 @@ escape html, if you use html highlighting tags). Each field highlighted can control the size of the highlighted fragment in characters (defaults to `100`), and the maximum number of fragments -to return (defaults to `5`). For example: +to return (defaults to `5`). +For example: [source,js] -------------------------------------------------- @@ -126,8 +201,11 @@ to return (defaults to `5`). For example: } -------------------------------------------------- -On top of this it is possible to specify that highlighted fragments are -order by score: +The `fragment_size` is ignored when using the postings highlighter, as it +outputs sentences regardless of their length. + +On top of this it is possible to specify that highlighted fragments need +to be sorted by score: [source,js] -------------------------------------------------- @@ -168,7 +246,10 @@ In the case where there is no matching fragment to highlight, the default is to not return anything. 
Instead, we can return a snippet of text from the beginning of the field by setting `no_match_size` (default `0`) to the length of the text that you want returned. The actual length may be shorter than -specified as it tries to break on a word boundary. +specified as it tries to break on a word boundary. When using the postings +highlighter it is not possible to control the actual size of the snippet, +therefore the first sentence gets returned whenever `no_match_size` is +greater than `0`. [source,js] -------------------------------------------------- @@ -256,9 +337,11 @@ query and the rescore query in `highlight_query`. } -------------------------------------------------- -Note the score of text fragment in this case is calculated by Lucene -highlighting framework. For implementation details you can check -`ScoreOrderFragmentsBuilder.java` class. +Note that the score of text fragment in this case is calculated by the Lucene +highlighting framework. For implementation details you can check the +`ScoreOrderFragmentsBuilder.java` class. On the other hand when using the +postings highlighter the fragments are scored using, as mentioned above, +the BM25 algorithm. [[highlighting-settings]] ==== Global Settings @@ -295,7 +378,7 @@ matches specifically on them. [[boundary-characters]] ==== Boundary Characters -When highlighting a field that is mapped with term vectors, +When highlighting a field using the fast vector highlighter, `boundary_chars` can be configured to define what constitutes a boundary for highlighting. It's a single string with each boundary character defined in it. It defaults to `.,!? \t\n`. 
diff --git a/src/main/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatter.java b/src/main/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatter.java new file mode 100644 index 00000000000..6c390ef684d --- /dev/null +++ b/src/main/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatter.java @@ -0,0 +1,78 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. 
+ */ + +package org.apache.lucene.search.postingshighlight; + +import org.apache.lucene.search.highlight.Encoder; +import org.elasticsearch.search.highlight.HighlightUtils; + +/** +Custom passage formatter that allows us to: +1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet}) +2) use the {@link Encoder} implementations that are already used with the other highlighters + */ +public class CustomPassageFormatter extends XPassageFormatter { + + private final String preTag; + private final String postTag; + private final Encoder encoder; + + public CustomPassageFormatter(String preTag, String postTag, Encoder encoder) { + this.preTag = preTag; + this.postTag = postTag; + this.encoder = encoder; + } + + @Override + public Snippet[] format(Passage[] passages, String content) { + Snippet[] snippets = new Snippet[passages.length]; + int pos; + for (int j = 0; j < passages.length; j++) { + Passage passage = passages[j]; + StringBuilder sb = new StringBuilder(); + pos = passage.startOffset; + for (int i = 0; i < passage.numMatches; i++) { + int start = passage.matchStarts[i]; + int end = passage.matchEnds[i]; + // its possible to have overlapping terms + if (start > pos) { + append(sb, content, pos, start); + } + if (end > pos) { + sb.append(preTag); + append(sb, content, Math.max(pos, start), end); + sb.append(postTag); + pos = end; + } + } + // its possible a "term" from the analyzer could span a sentence boundary. 
+ append(sb, content, pos, Math.max(pos, passage.endOffset)); + //we remove the paragraph separator if present at the end of the snippet (we used it as separator between values) + if (sb.charAt(sb.length() - 1) == HighlightUtils.PARAGRAPH_SEPARATOR) { + sb.deleteCharAt(sb.length() - 1); + } + //and we trim the snippets too + snippets[j] = new Snippet(sb.toString().trim(), passage.score, passage.numMatches > 0); + } + return snippets; + } + + protected void append(StringBuilder dest, String content, int start, int end) { + dest.append(encoder.encodeText(content.substring(start, end))); + } +} diff --git a/src/main/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighter.java b/src/main/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighter.java new file mode 100644 index 00000000000..b48d29bc2b3 --- /dev/null +++ b/src/main/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighter.java @@ -0,0 +1,187 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. 
+ */ + +package org.apache.lucene.search.postingshighlight; + +import org.apache.lucene.index.AtomicReaderContext; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.IndexReaderContext; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.util.BytesRef; +import org.elasticsearch.common.Strings; +import org.elasticsearch.search.highlight.HighlightUtils; + +import java.io.IOException; +import java.text.BreakIterator; +import java.util.List; +import java.util.Map; + +/** + * Subclass of the {@link XPostingsHighlighter} that works for a single field in a single document. + * It receives the field values as input and it performs discrete highlighting on each single value + * calling the highlightDoc method multiple times. + * It allows to pass in the query terms to avoid calling extract terms multiple times. + * + * The use that we make of the postings highlighter is not optimal. It would be much better to + * highlight multiple docs in a single call, as we actually lose its sequential IO. 
But that would require: + * 1) to make our fork more complex and harder to maintain to perform discrete highlighting (needed to return + * a different snippet per value when number_of_fragments=0 and the field has multiple values) + * 2) refactoring of the elasticsearch highlight api which currently works per hit + * + */ +public final class CustomPostingsHighlighter extends XPostingsHighlighter { + + private static final Snippet[] EMPTY_SNIPPET = new Snippet[0]; + private static final Passage[] EMPTY_PASSAGE = new Passage[0]; + + private final CustomPassageFormatter passageFormatter; + private final int noMatchSize; + private final int totalContentLength; + private final String[] fieldValues; + private final int[] fieldValuesOffsets; + private int currentValueIndex = 0; + + private BreakIterator breakIterator; + + public CustomPostingsHighlighter(CustomPassageFormatter passageFormatter, List fieldValues, boolean mergeValues, int maxLength, int noMatchSize) { + super(maxLength); + this.passageFormatter = passageFormatter; + this.noMatchSize = noMatchSize; + + if (mergeValues) { + String rawValue = Strings.collectionToDelimitedString(fieldValues, String.valueOf(getMultiValuedSeparator(""))); + String fieldValue = rawValue.substring(0, Math.min(rawValue.length(), maxLength)); + this.fieldValues = new String[]{fieldValue}; + this.fieldValuesOffsets = new int[]{0}; + this.totalContentLength = fieldValue.length(); + } else { + this.fieldValues = new String[fieldValues.size()]; + this.fieldValuesOffsets = new int[fieldValues.size()]; + int contentLength = 0; + int offset = 0; + int previousLength = -1; + for (int i = 0; i < fieldValues.size(); i++) { + String rawValue = fieldValues.get(i).toString(); + String fieldValue = rawValue.substring(0, Math.min(rawValue.length(), maxLength)); + this.fieldValues[i] = fieldValue; + contentLength += fieldValue.length(); + offset += previousLength + 1; + this.fieldValuesOffsets[i] = offset; + previousLength = fieldValue.length(); + } 
+ this.totalContentLength = contentLength; + } + } + + /* + Our own api to highlight a single document field, passing in the query terms, and get back our own Snippet object + */ + public Snippet[] highlightDoc(String field, BytesRef[] terms, IndexSearcher searcher, int docId, int maxPassages) throws IOException { + IndexReader reader = searcher.getIndexReader(); + IndexReaderContext readerContext = reader.getContext(); + List leaves = readerContext.leaves(); + + String[] contents = new String[]{loadCurrentFieldValue()}; + Map snippetsMap = highlightField(field, contents, getBreakIterator(field), terms, new int[]{docId}, leaves, maxPassages); + + //increment the current value index so that next time we'll highlight the next value if available + currentValueIndex++; + + Object snippetObject = snippetsMap.get(docId); + if (snippetObject != null && snippetObject instanceof Snippet[]) { + return (Snippet[]) snippetObject; + } + return EMPTY_SNIPPET; + } + + /* + Method provided through our own fork: allows to do proper scoring when doing per value discrete highlighting. + Used to provide the total length of the field (all values) for proper scoring. + */ + @Override + protected int getContentLength(String field, int docId) { + return totalContentLength; + } + + /* + Method provided through our own fork: allows to perform proper per value discrete highlighting. + Used to provide the offset for the current value. 
+ */ + @Override + protected int getOffsetForCurrentValue(String field, int docId) { + if (currentValueIndex < fieldValuesOffsets.length) { + return fieldValuesOffsets[currentValueIndex]; + } + throw new IllegalArgumentException("No more values offsets to return"); + } + + public void setBreakIterator(BreakIterator breakIterator) { + this.breakIterator = breakIterator; + } + + @Override + protected XPassageFormatter getFormatter(String field) { + return passageFormatter; + } + + @Override + protected BreakIterator getBreakIterator(String field) { + if (breakIterator == null) { + return super.getBreakIterator(field); + } + return breakIterator; + } + + @Override + protected char getMultiValuedSeparator(String field) { + //U+2029 PARAGRAPH SEPARATOR (PS): each value holds a discrete passage for highlighting + return HighlightUtils.PARAGRAPH_SEPARATOR; + } + + /* + By default the postings highlighter returns non highlighted snippet when there are no matches. + We want to return no snippets by default, unless no_match_size is greater than 0 + */ + @Override + protected Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) { + if (noMatchSize > 0) { + //we want to return the first sentence of the first snippet only + return super.getEmptyHighlight(fieldName, bi, 1); + } + return EMPTY_PASSAGE; + } + + /* + Not needed since we call our own loadCurrentFieldValue explicitly, but we override it anyway for consistency. + */ + @Override + protected String[][] loadFieldValues(IndexSearcher searcher, String[] fields, int[] docids, int maxLength) throws IOException { + return new String[][]{new String[]{loadCurrentFieldValue()}}; + } + + /* + Our own method that returns the field values, which relies on the content that was provided when creating the highlighter. + Supports per value discrete highlighting calling the highlightDoc method multiple times, one per value. 
+ */ + protected String loadCurrentFieldValue() { + if (currentValueIndex < fieldValues.length) { + return fieldValues[currentValueIndex]; + } + throw new IllegalArgumentException("No more values to return"); + } +} diff --git a/src/main/java/org/apache/lucene/search/postingshighlight/Snippet.java b/src/main/java/org/apache/lucene/search/postingshighlight/Snippet.java new file mode 100644 index 00000000000..42904ecce0c --- /dev/null +++ b/src/main/java/org/apache/lucene/search/postingshighlight/Snippet.java @@ -0,0 +1,50 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. + */ + +package org.apache.lucene.search.postingshighlight; + +/** + * Represents a scored highlighted snippet. + * It's our own arbitrary object that we get back from the postings highlighter when highlighting a document. + * Every snippet contains its formatted text and its score. + * The score is needed since we highlight every single value separately and we might want to return snippets sorted by score. 
+ */ +public class Snippet { + + private final String text; + private final float score; + private final boolean isHighlighted; + + public Snippet(String text, float score, boolean isHighlighted) { + this.text = text; + this.score = score; + this.isHighlighted = isHighlighted; + } + + public String getText() { + return text; + } + + public float getScore() { + return score; + } + + public boolean isHighlighted() { + return isHighlighted; + } +} diff --git a/src/main/java/org/apache/lucene/search/postingshighlight/XDefaultPassageFormatter.java b/src/main/java/org/apache/lucene/search/postingshighlight/XDefaultPassageFormatter.java new file mode 100644 index 00000000000..b3ab2e68c6d --- /dev/null +++ b/src/main/java/org/apache/lucene/search/postingshighlight/XDefaultPassageFormatter.java @@ -0,0 +1,147 @@ +package org.apache.lucene.search.postingshighlight; + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import org.elasticsearch.Version; + +/** + * Creates a formatted snippet from the top passages. + *

+ * The default implementation marks the query terms as bold, and places + * ellipses between unconnected passages. + */ +//LUCENE MONITOR - REMOVE ME WHEN LUCENE 4.6 IS OUT +//Applied LUCENE-4906 to be able to return arbitrary objects +public class XDefaultPassageFormatter extends XPassageFormatter { + + static { + assert Version.CURRENT.luceneVersion.compareTo(org.apache.lucene.util.Version.LUCENE_45) == 0 : "Remove XDefaultPassageFormatter once 4.6 is out"; + } + + /** text that will appear before highlighted terms */ + protected final String preTag; + /** text that will appear after highlighted terms */ + protected final String postTag; + /** text that will appear between two unconnected passages */ + protected final String ellipsis; + /** true if we should escape for html */ + protected final boolean escape; + + /** + * Creates a new DefaultPassageFormatter with the default tags. + */ + public XDefaultPassageFormatter() { + this("<b>", "</b>", "... ", false); + } + + /** + * Creates a new DefaultPassageFormatter with custom tags. + * @param preTag text which should appear before a highlighted term. + * @param postTag text which should appear after a highlighted term. + * @param ellipsis text which should be used to connect two unconnected passages. + * @param escape true if text should be html-escaped + */ + public XDefaultPassageFormatter(String preTag, String postTag, String ellipsis, boolean escape) { + if (preTag == null || postTag == null || ellipsis == null) { + throw new NullPointerException(); + } + this.preTag = preTag; + this.postTag = postTag; + this.ellipsis = ellipsis; + this.escape = escape; + } + + @Override + public String format(Passage passages[], String content) { + StringBuilder sb = new StringBuilder(); + int pos = 0; + for (Passage passage : passages) { + // don't add ellipsis if its the first one, or if its connected. 
+ if (passage.startOffset > pos && pos > 0) { + sb.append(ellipsis); + } + pos = passage.startOffset; + for (int i = 0; i < passage.numMatches; i++) { + int start = passage.matchStarts[i]; + int end = passage.matchEnds[i]; + // its possible to have overlapping terms + if (start > pos) { + append(sb, content, pos, start); + } + if (end > pos) { + sb.append(preTag); + append(sb, content, Math.max(pos, start), end); + sb.append(postTag); + pos = end; + } + } + // its possible a "term" from the analyzer could span a sentence boundary. + append(sb, content, pos, Math.max(pos, passage.endOffset)); + pos = passage.endOffset; + } + return sb.toString(); + } + + /** + * Appends original text to the response. + * @param dest resulting text, possibly transformed or encoded + * @param content original text content + * @param start index of the first character in content + * @param end index of the character following the last character in content + */ + protected void append(StringBuilder dest, String content, int start, int end) { + if (escape) { + // note: these are the rules from owasp.org + for (int i = start; i < end; i++) { + char ch = content.charAt(i); + switch(ch) { + case '&': + dest.append("&amp;"); + break; + case '<': + dest.append("&lt;"); + break; + case '>': + dest.append("&gt;"); + break; + case '"': + dest.append("&quot;"); + break; + case '\'': + dest.append("&#x27;"); + break; + case '/': + dest.append("&#x2F;"); + break; + default: + if (ch >= 0x30 && ch <= 0x39 || ch >= 0x41 && ch <= 0x5A || ch >= 0x61 && ch <= 0x7A) { + dest.append(ch); + } else if (ch < 0xff) { + dest.append("&#"); + dest.append((int)ch); + dest.append(";"); + } else { + dest.append(ch); + } + } + } + } else { + dest.append(content, start, end); + } + } +} diff --git a/src/main/java/org/apache/lucene/search/postingshighlight/XPassageFormatter.java b/src/main/java/org/apache/lucene/search/postingshighlight/XPassageFormatter.java new file mode 100644 index 00000000000..456c27fe32a --- /dev/null +++ 
b/src/main/java/org/apache/lucene/search/postingshighlight/XPassageFormatter.java @@ -0,0 +1,46 @@ +package org.apache.lucene.search.postingshighlight; + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import org.elasticsearch.Version; + +/** + * Creates a formatted snippet from the top passages. + * + * @lucene.experimental + */ +//LUCENE MONITOR - REMOVE ME WHEN LUCENE 4.6 IS OUT +//Applied LUCENE-4906 to be able to return arbitrary objects +public abstract class XPassageFormatter { + + static { + assert Version.CURRENT.luceneVersion.compareTo(org.apache.lucene.util.Version.LUCENE_45) == 0 : "Remove XPassageFormatter once 4.6 is out"; + } + + /** + * Formats the top passages from content + * into a human-readable text snippet. + * + * @param passages top-N passages for the field. Note these are sorted in + * the order that they appear in the document for convenience. + * @param content content for the field. 
+ * @return formatted highlight + */ + public abstract Object format(Passage passages[], String content); + +} \ No newline at end of file diff --git a/src/main/java/org/apache/lucene/search/postingshighlight/XPostingsHighlighter.java b/src/main/java/org/apache/lucene/search/postingshighlight/XPostingsHighlighter.java new file mode 100644 index 00000000000..5fd0792b241 --- /dev/null +++ b/src/main/java/org/apache/lucene/search/postingshighlight/XPostingsHighlighter.java @@ -0,0 +1,777 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. 
+ */ + +package org.apache.lucene.search.postingshighlight; + +import org.apache.lucene.index.*; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.InPlaceMergeSorter; +import org.apache.lucene.util.UnicodeUtil; + +import java.io.IOException; +import java.text.BreakIterator; +import java.util.*; + +/* +FORKED from Lucene 4.5 to be able to: +1) support discrete highlighting for multiple values, so that we can return a different snippet per value when highlighting the whole text +2) call the highlightField method directly from subclasses and provide the terms by ourselves +3) Applied LUCENE-4906 to allow PassageFormatter to return arbitrary objects (LUCENE 4.6) + +All our changes start with //BEGIN EDIT + */ +public class XPostingsHighlighter { + + //BEGIN EDIT added method to override offset for current value (default 0) + //we need this to perform discrete highlighting per field + protected int getOffsetForCurrentValue(String field, int docId) { + return 0; + } + //END EDIT + + //BEGIN EDIT + //we need this to fix scoring when highlighting every single value separately, since the score depends on the total length of the field (all values rather than only the current one) + protected int getContentLength(String field, int docId) { + return -1; + } + //END EDIT + + + // TODO: maybe allow re-analysis for tiny fields? currently we require offsets, + // but if the analyzer is really fast and the field is tiny, this might really be + // unnecessary. + + /** for rewriting: we don't want slow processing from MTQs */ + private static final IndexReader EMPTY_INDEXREADER = new MultiReader(); + + /** Default maximum content size to process. 
Typically snippets + * closer to the beginning of the document better summarize its content */ + public static final int DEFAULT_MAX_LENGTH = 10000; + + private final int maxLength; + + /** Set the first time {@link #getFormatter} is called, + * and then reused. */ + private XPassageFormatter defaultFormatter; + + /** Set the first time {@link #getScorer} is called, + * and then reused. */ + private PassageScorer defaultScorer; + + /** + * Creates a new highlighter with default parameters. + */ + public XPostingsHighlighter() { + this(DEFAULT_MAX_LENGTH); + } + + /** + * Creates a new highlighter, specifying maximum content length. + * @param maxLength maximum content size to process. + * @throws IllegalArgumentException if maxLength is negative or Integer.MAX_VALUE + */ + public XPostingsHighlighter(int maxLength) { + if (maxLength < 0 || maxLength == Integer.MAX_VALUE) { + // two reasons: no overflow problems in BreakIterator.preceding(offset+1), + // our sentinel in the offsets queue uses this value to terminate. + throw new IllegalArgumentException("maxLength must be < Integer.MAX_VALUE"); + } + this.maxLength = maxLength; + } + + /** Returns the {@link java.text.BreakIterator} to use for + * dividing text into passages. This returns + * {@link java.text.BreakIterator#getSentenceInstance(java.util.Locale)} by default; + * subclasses can override to customize. */ + protected BreakIterator getBreakIterator(String field) { + return BreakIterator.getSentenceInstance(Locale.ROOT); + } + + /** Returns the {@link PassageFormatter} to use for + * formatting passages into highlighted snippets. This + * returns a new {@code PassageFormatter} by default; + * subclasses can override to customize. */ + protected XPassageFormatter getFormatter(String field) { + if (defaultFormatter == null) { + defaultFormatter = new XDefaultPassageFormatter(); + } + return defaultFormatter; + } + + /** Returns the {@link PassageScorer} to use for + * ranking passages. 
This + * returns a new {@code PassageScorer} by default; + * subclasses can override to customize. */ + protected PassageScorer getScorer(String field) { + if (defaultScorer == null) { + defaultScorer = new PassageScorer(); + } + return defaultScorer; + } + + /** + * Highlights the top passages from a single field. + * + * @param field field name to highlight. + * Must have a stored string value and also be indexed with offsets. + * @param query query to highlight. + * @param searcher searcher that was previously used to execute the query. + * @param topDocs TopDocs containing the summary result documents to highlight. + * @return Array of formatted snippets corresponding to the documents in topDocs. + * If no highlights were found for a document, the + * first sentence for the field will be returned. + * @throws java.io.IOException if an I/O error occurred during processing + * @throws IllegalArgumentException if field was indexed without + * {@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS} + */ + public String[] highlight(String field, Query query, IndexSearcher searcher, TopDocs topDocs) throws IOException { + return highlight(field, query, searcher, topDocs, 1); + } + + /** + * Highlights the top-N passages from a single field. + * + * @param field field name to highlight. + * Must have a stored string value and also be indexed with offsets. + * @param query query to highlight. + * @param searcher searcher that was previously used to execute the query. + * @param topDocs TopDocs containing the summary result documents to highlight. + * @param maxPassages The maximum number of top-N ranked passages used to + * form the highlighted snippets. + * @return Array of formatted snippets corresponding to the documents in topDocs. + * If no highlights were found for a document, the + * first {@code maxPassages} sentences from the + * field will be returned. 
+ * @throws IOException if an I/O error occurred during processing + * @throws IllegalArgumentException if field was indexed without + * {@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS} + */ + public String[] highlight(String field, Query query, IndexSearcher searcher, TopDocs topDocs, int maxPassages) throws IOException { + Map res = highlightFields(new String[] { field }, query, searcher, topDocs, new int[] { maxPassages }); + return res.get(field); + } + + /** + * Highlights the top passages from multiple fields. + *

+ * Conceptually, this behaves as a more efficient form of: + *

+     * Map m = new HashMap();
+     * for (String field : fields) {
+     *   m.put(field, highlight(field, query, searcher, topDocs));
+     * }
+     * return m;
+     * 
+ * + * @param fields field names to highlight. + * Must have a stored string value and also be indexed with offsets. + * @param query query to highlight. + * @param searcher searcher that was previously used to execute the query. + * @param topDocs TopDocs containing the summary result documents to highlight. + * @return Map keyed on field name, containing the array of formatted snippets + * corresponding to the documents in topDocs. + * If no highlights were found for a document, the + * first sentence from the field will be returned. + * @throws IOException if an I/O error occurred during processing + * @throws IllegalArgumentException if field was indexed without + * {@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS} + */ + public Map highlightFields(String fields[], Query query, IndexSearcher searcher, TopDocs topDocs) throws IOException { + int maxPassages[] = new int[fields.length]; + Arrays.fill(maxPassages, 1); + return highlightFields(fields, query, searcher, topDocs, maxPassages); + } + + /** + * Highlights the top-N passages from multiple fields. + *

+ * Conceptually, this behaves as a more efficient form of: + *

+     * Map m = new HashMap();
+     * for (String field : fields) {
+     *   m.put(field, highlight(field, query, searcher, topDocs, maxPassages));
+     * }
+     * return m;
+     * 
+ * + * @param fields field names to highlight. + * Must have a stored string value and also be indexed with offsets. + * @param query query to highlight. + * @param searcher searcher that was previously used to execute the query. + * @param topDocs TopDocs containing the summary result documents to highlight. + * @param maxPassages The maximum number of top-N ranked passages per-field used to + * form the highlighted snippets. + * @return Map keyed on field name, containing the array of formatted snippets + * corresponding to the documents in topDocs. + * If no highlights were found for a document, the + * first {@code maxPassages} sentences from the + * field will be returned. + * @throws IOException if an I/O error occurred during processing + * @throws IllegalArgumentException if field was indexed without + * {@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS} + */ + public Map highlightFields(String fields[], Query query, IndexSearcher searcher, TopDocs topDocs, int maxPassages[]) throws IOException { + final ScoreDoc scoreDocs[] = topDocs.scoreDocs; + int docids[] = new int[scoreDocs.length]; + for (int i = 0; i < docids.length; i++) { + docids[i] = scoreDocs[i].doc; + } + + return highlightFields(fields, query, searcher, docids, maxPassages); + } + + /** + * Highlights the top-N passages from multiple fields, + * for the provided int[] docids. + * + * @param fieldsIn field names to highlight. + * Must have a stored string value and also be indexed with offsets. + * @param query query to highlight. + * @param searcher searcher that was previously used to execute the query. + * @param docidsIn containing the document IDs to highlight. + * @param maxPassagesIn The maximum number of top-N ranked passages per-field used to + * form the highlighted snippets. + * @return Map keyed on field name, containing the array of formatted snippets + * corresponding to the documents in topDocs. 
+ * If no highlights were found for a document, the + * first {@code maxPassages} from the field will + * be returned. + * @throws IOException if an I/O error occurred during processing + * @throws IllegalArgumentException if field was indexed without + * {@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS} + */ + public Map highlightFields(String fieldsIn[], Query query, IndexSearcher searcher, int[] docidsIn, int maxPassagesIn[]) throws IOException { + Map snippets = new HashMap(); + for(Map.Entry ent : highlightFieldsAsObjects(fieldsIn, query, searcher, docidsIn, maxPassagesIn).entrySet()) { + Object[] snippetObjects = ent.getValue(); + String[] snippetStrings = new String[snippetObjects.length]; + snippets.put(ent.getKey(), snippetStrings); + for(int i=0;i highlightFieldsAsObjects(String fieldsIn[], Query query, IndexSearcher searcher, int[] docidsIn, int maxPassagesIn[]) throws IOException { + if (fieldsIn.length < 1) { + throw new IllegalArgumentException("fieldsIn must not be empty"); + } + if (fieldsIn.length != maxPassagesIn.length) { + throw new IllegalArgumentException("invalid number of maxPassagesIn"); + } + final IndexReader reader = searcher.getIndexReader(); + query = rewrite(query); + SortedSet queryTerms = new TreeSet(); + query.extractTerms(queryTerms); + + IndexReaderContext readerContext = reader.getContext(); + List leaves = readerContext.leaves(); + + // Make our own copies because we sort in-place: + int[] docids = new int[docidsIn.length]; + System.arraycopy(docidsIn, 0, docids, 0, docidsIn.length); + final String fields[] = new String[fieldsIn.length]; + System.arraycopy(fieldsIn, 0, fields, 0, fieldsIn.length); + final int maxPassages[] = new int[maxPassagesIn.length]; + System.arraycopy(maxPassagesIn, 0, maxPassages, 0, maxPassagesIn.length); + + // sort for sequential io + Arrays.sort(docids); + new InPlaceMergeSorter() { + + @Override + protected void swap(int i, int j) { + String tmp = 
fields[i]; + fields[i] = fields[j]; + fields[j] = tmp; + int tmp2 = maxPassages[i]; + maxPassages[i] = maxPassages[j]; + maxPassages[j] = tmp2; + } + + @Override + protected int compare(int i, int j) { + return fields[i].compareTo(fields[j]); + } + + }.sort(0, fields.length); + + // pull stored data: + String[][] contents = loadFieldValues(searcher, fields, docids, maxLength); + + Map highlights = new HashMap(); + for (int i = 0; i < fields.length; i++) { + String field = fields[i]; + int numPassages = maxPassages[i]; + + Term floor = new Term(field, ""); + Term ceiling = new Term(field, UnicodeUtil.BIG_TERM); + SortedSet fieldTerms = queryTerms.subSet(floor, ceiling); + // TODO: should we have some reasonable defaults for term pruning? (e.g. stopwords) + + // Strip off the redundant field: + BytesRef terms[] = new BytesRef[fieldTerms.size()]; + int termUpto = 0; + for(Term term : fieldTerms) { + terms[termUpto++] = term.bytes(); + } + Map fieldHighlights = highlightField(field, contents[i], getBreakIterator(field), terms, docids, leaves, numPassages); + + Object[] result = new Object[docids.length]; + for (int j = 0; j < docidsIn.length; j++) { + result[j] = fieldHighlights.get(docidsIn[j]); + } + highlights.put(field, result); + } + return highlights; + } + + /** Loads the String values for each field X docID to be + * highlighted. By default this loads from stored + * fields, but a subclass can change the source. This + * method should allocate the String[fields.length][docids.length] + * and fill all values. The returned Strings must be + * identical to what was indexed. 
*/ + protected String[][] loadFieldValues(IndexSearcher searcher, String[] fields, int[] docids, int maxLength) throws IOException { + String contents[][] = new String[fields.length][docids.length]; + char valueSeparators[] = new char[fields.length]; + for (int i = 0; i < fields.length; i++) { + valueSeparators[i] = getMultiValuedSeparator(fields[i]); + } + LimitedStoredFieldVisitor visitor = new LimitedStoredFieldVisitor(fields, valueSeparators, maxLength); + for (int i = 0; i < docids.length; i++) { + searcher.doc(docids[i], visitor); + for (int j = 0; j < fields.length; j++) { + contents[j][i] = visitor.getValue(j); + } + visitor.reset(); + } + return contents; + } + + /** + * Returns the logical separator between values for multi-valued fields. + * The default value is a space character, which means passages can span across values, + * but a subclass can override, for example with {@code U+2029 PARAGRAPH SEPARATOR (PS)} + * if each value holds a discrete passage for highlighting. + */ + protected char getMultiValuedSeparator(String field) { + return ' '; + } + + //BEGIN EDIT: made protected so that we can call from our subclass and pass in the terms by ourselves + protected Map highlightField(String field, String contents[], BreakIterator bi, BytesRef terms[], int[] docids, List leaves, int maxPassages) throws IOException { + //private Map highlightField(String field, String contents[], BreakIterator bi, BytesRef terms[], int[] docids, List leaves, int maxPassages) throws IOException { + //END EDIT + + Map highlights = new HashMap(); + + // reuse in the real sense... 
for docs in same segment we just advance our old enum + DocsAndPositionsEnum postings[] = null; + TermsEnum termsEnum = null; + int lastLeaf = -1; + + XPassageFormatter fieldFormatter = getFormatter(field); + if (fieldFormatter == null) { + throw new NullPointerException("PassageFormatter cannot be null"); + } + + for (int i = 0; i < docids.length; i++) { + String content = contents[i]; + if (content.length() == 0) { + continue; // nothing to do + } + bi.setText(content); + int doc = docids[i]; + int leaf = ReaderUtil.subIndex(doc, leaves); + AtomicReaderContext subContext = leaves.get(leaf); + AtomicReader r = subContext.reader(); + Terms t = r.terms(field); + if (t == null) { + continue; // nothing to do + } + if (leaf != lastLeaf) { + termsEnum = t.iterator(null); + postings = new DocsAndPositionsEnum[terms.length]; + } + Passage passages[] = highlightDoc(field, terms, content.length(), bi, doc - subContext.docBase, termsEnum, postings, maxPassages); + if (passages.length == 0) { + passages = getEmptyHighlight(field, bi, maxPassages); + } + if (passages.length > 0) { + // otherwise a null snippet (eg if field is missing + // entirely from the doc) + highlights.put(doc, fieldFormatter.format(passages, content)); + } + lastLeaf = leaf; + } + + return highlights; + } + + // algorithm: treat sentence snippets as miniature documents + // we can intersect these with the postings lists via BreakIterator.preceding(offset),s + // score each sentence as norm(sentenceStartOffset) * sum(weight * tf(freq)) + private Passage[] highlightDoc(String field, BytesRef terms[], int contentLength, BreakIterator bi, int doc, + TermsEnum termsEnum, DocsAndPositionsEnum[] postings, int n) throws IOException { + + //BEGIN EDIT added call to method that returns the offset for the current value (discrete highlighting) + int valueOffset = getOffsetForCurrentValue(field, doc); + //END EDIT + + PassageScorer scorer = getScorer(field); + if (scorer == null) { + throw new 
NullPointerException("PassageScorer cannot be null"); + } + + + //BEGIN EDIT discrete highlighting + // the scoring needs to be based on the length of the whole field (all values rather than only the current one) + int totalContentLength = getContentLength(field, doc); + if (totalContentLength == -1) { + totalContentLength = contentLength; + } + //END EDIT + + + PriorityQueue pq = new PriorityQueue(); + float weights[] = new float[terms.length]; + // initialize postings + for (int i = 0; i < terms.length; i++) { + DocsAndPositionsEnum de = postings[i]; + int pDoc; + if (de == EMPTY) { + continue; + } else if (de == null) { + postings[i] = EMPTY; // initially + if (!termsEnum.seekExact(terms[i])) { + continue; // term not found + } + de = postings[i] = termsEnum.docsAndPositions(null, null, DocsAndPositionsEnum.FLAG_OFFSETS); + if (de == null) { + // no positions available + throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); + } + pDoc = de.advance(doc); + } else { + pDoc = de.docID(); + if (pDoc < doc) { + pDoc = de.advance(doc); + } + } + + if (doc == pDoc) { + //BEGIN EDIT we take into account the length of the whole field (all values) to properly score the snippets + weights[i] = scorer.weight(totalContentLength, de.freq()); + //weights[i] = scorer.weight(contentLength, de.freq()); + //END EDIT + de.nextPosition(); + pq.add(new OffsetsEnum(de, i)); + } + } + + pq.add(new OffsetsEnum(EMPTY, Integer.MAX_VALUE)); // a sentinel for termination + + PriorityQueue passageQueue = new PriorityQueue(n, new Comparator() { + @Override + public int compare(Passage left, Passage right) { + if (left.score < right.score) { + return -1; + } else if (left.score > right.score) { + return 1; + } else { + return left.startOffset - right.startOffset; + } + } + }); + Passage current = new Passage(); + + OffsetsEnum off; + while ((off = pq.poll()) != null) { + final DocsAndPositionsEnum dp = off.dp; + + int start = 
dp.startOffset(); + if (start == -1) { + throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); + } + int end = dp.endOffset(); + // LUCENE-5166: this hit would span the content limit... however more valid + // hits may exist (they are sorted by start). so we pretend like we never + // saw this term, it won't cause a passage to be added to passageQueue or anything. + assert EMPTY.startOffset() == Integer.MAX_VALUE; + if (start < contentLength && end > contentLength) { + continue; + } + + + //BEGIN EDIT support for discrete highlighting (added block code) + //switch to the first match in the current value if there is one + boolean seenEnough = false; + while (start < valueOffset) { + if (off.pos == dp.freq()) { + seenEnough = true; + break; + } else { + off.pos++; + dp.nextPosition(); + start = dp.startOffset(); + end = dp.endOffset(); + } + } + + //continue with next term if we've already seen the current one all the times it appears + //that means that the current value doesn't hold matches for the current term + if (seenEnough) { + continue; + } + + //we now subtract the offset of the current value to both start and end + start -= valueOffset; + end -= valueOffset; + //END EDIT + + + if (start >= current.endOffset) { + if (current.startOffset >= 0) { + // finalize current + //BEGIN EDIT we take into account the value offset when scoring the snippet based on its position + current.score *= scorer.norm(current.startOffset + valueOffset); + //current.score *= scorer.norm(current.startOffset); + //END EDIT + // new sentence: first add 'current' to queue + if (passageQueue.size() == n && current.score < passageQueue.peek().score) { + current.reset(); // can't compete, just reset it + } else { + passageQueue.offer(current); + if (passageQueue.size() > n) { + current = passageQueue.poll(); + current.reset(); + } else { + current = new Passage(); + } + } + } + // if we exceed limit, we are done + if (start >= 
contentLength) { + Passage passages[] = new Passage[passageQueue.size()]; + passageQueue.toArray(passages); + for (Passage p : passages) { + p.sort(); + } + // sort in ascending order + Arrays.sort(passages, new Comparator() { + @Override + public int compare(Passage left, Passage right) { + return left.startOffset - right.startOffset; + } + }); + return passages; + } + // advance breakiterator + assert BreakIterator.DONE < 0; + current.startOffset = Math.max(bi.preceding(start+1), 0); + current.endOffset = Math.min(bi.next(), contentLength); + } + int tf = 0; + while (true) { + tf++; + current.addMatch(start, end, terms[off.id]); + if (off.pos == dp.freq()) { + break; // removed from pq + } else { + off.pos++; + dp.nextPosition(); + //BEGIN EDIT support for discrete highlighting + start = dp.startOffset() - valueOffset; + end = dp.endOffset() - valueOffset; + //start = dp.startOffset(); + //end = dp.endOffset(); + //END EDIT + } + if (start >= current.endOffset || end > contentLength) { + pq.offer(off); + break; + } + } + current.score += weights[off.id] * scorer.tf(tf, current.endOffset - current.startOffset); + } + + // Dead code but compiler disagrees: + assert false; + return null; + } + + /** Called to summarize a document when no hits were + * found. By default this just returns the first + * {@code maxPassages} sentences; subclasses can override + * to customize. 
*/ + protected Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) { + // BreakIterator should be un-next'd: + List passages = new ArrayList(); + int pos = bi.current(); + assert pos == 0; + while (passages.size() < maxPassages) { + int next = bi.next(); + if (next == BreakIterator.DONE) { + break; + } + Passage passage = new Passage(); + passage.score = Float.NaN; + passage.startOffset = pos; + passage.endOffset = next; + passages.add(passage); + pos = next; + } + + return passages.toArray(new Passage[passages.size()]); + } + + private static class OffsetsEnum implements Comparable { + DocsAndPositionsEnum dp; + int pos; + int id; + + OffsetsEnum(DocsAndPositionsEnum dp, int id) throws IOException { + this.dp = dp; + this.id = id; + this.pos = 1; + } + + @Override + public int compareTo(OffsetsEnum other) { + try { + int off = dp.startOffset(); + int otherOff = other.dp.startOffset(); + if (off == otherOff) { + return id - other.id; + } else { + return Long.signum(((long)off) - otherOff); + } + } catch (IOException e) { + throw new RuntimeException(e); + } + } + } + + private static final DocsAndPositionsEnum EMPTY = new DocsAndPositionsEnum() { + + @Override + public int nextPosition() throws IOException { return 0; } + + @Override + public int startOffset() throws IOException { return Integer.MAX_VALUE; } + + @Override + public int endOffset() throws IOException { return Integer.MAX_VALUE; } + + @Override + public BytesRef getPayload() throws IOException { return null; } + + @Override + public int freq() throws IOException { return 0; } + + @Override + public int docID() { return NO_MORE_DOCS; } + + @Override + public int nextDoc() throws IOException { return NO_MORE_DOCS; } + + @Override + public int advance(int target) throws IOException { return NO_MORE_DOCS; } + + @Override + public long cost() { return 0; } + }; + + /** + * we rewrite against an empty indexreader: as we don't want things like + * rangeQueries that don't 
summarize the document + */ + private static Query rewrite(Query original) throws IOException { + Query query = original; + for (Query rewrittenQuery = query.rewrite(EMPTY_INDEXREADER); rewrittenQuery != query; + rewrittenQuery = query.rewrite(EMPTY_INDEXREADER)) { + query = rewrittenQuery; + } + return query; + } + + private static class LimitedStoredFieldVisitor extends StoredFieldVisitor { + private final String fields[]; + private final char valueSeparators[]; + private final int maxLength; + private final StringBuilder builders[]; + private int currentField = -1; + + public LimitedStoredFieldVisitor(String fields[], char valueSeparators[], int maxLength) { + assert fields.length == valueSeparators.length; + this.fields = fields; + this.valueSeparators = valueSeparators; + this.maxLength = maxLength; + builders = new StringBuilder[fields.length]; + for (int i = 0; i < builders.length; i++) { + builders[i] = new StringBuilder(); + } + } + + @Override + public void stringField(FieldInfo fieldInfo, String value) throws IOException { + assert currentField >= 0; + StringBuilder builder = builders[currentField]; + if (builder.length() > 0 && builder.length() < maxLength) { + builder.append(valueSeparators[currentField]); + } + if (builder.length() + value.length() > maxLength) { + builder.append(value, 0, maxLength - builder.length()); + } else { + builder.append(value); + } + } + + @Override + public Status needsField(FieldInfo fieldInfo) throws IOException { + currentField = Arrays.binarySearch(fields, fieldInfo.name); + if (currentField < 0) { + return Status.NO; + } else if (builders[currentField].length() > maxLength) { + return fields.length == 1 ? 
Status.STOP : Status.NO; + } + return Status.YES; + } + + String getValue(int i) { + return builders[i].toString(); + } + + void reset() { + currentField = -1; + for (int i = 0; i < fields.length; i++) { + builders[i].setLength(0); + } + } + } +} diff --git a/src/main/java/org/elasticsearch/search/highlight/FastVectorHighlighter.java b/src/main/java/org/elasticsearch/search/highlight/FastVectorHighlighter.java index c10bbec0a96..b0eb6e287c1 100644 --- a/src/main/java/org/elasticsearch/search/highlight/FastVectorHighlighter.java +++ b/src/main/java/org/elasticsearch/search/highlight/FastVectorHighlighter.java @@ -19,9 +19,7 @@ package org.elasticsearch.search.highlight; import com.google.common.collect.Maps; -import org.apache.lucene.search.highlight.DefaultEncoder; import org.apache.lucene.search.highlight.Encoder; -import org.apache.lucene.search.highlight.SimpleHTMLEncoder; import org.apache.lucene.search.vectorhighlight.*; import org.apache.lucene.search.vectorhighlight.FieldPhraseList.WeightedPhraseInfo; import org.elasticsearch.ElasticSearchIllegalArgumentException; @@ -70,7 +68,7 @@ public class FastVectorHighlighter implements Highlighter { throw new ElasticSearchIllegalArgumentException("the field [" + field.field() + "] should be indexed with term vector with position offsets to be used with fast vector highlighter"); } - Encoder encoder = field.encoder().equals("html") ? Encoders.HTML : Encoders.DEFAULT; + Encoder encoder = field.encoder().equals("html") ? 
HighlightUtils.Encoders.HTML : HighlightUtils.Encoders.DEFAULT; if (!hitContext.cache().containsKey(CACHE_KEY)) { hitContext.cache().put(CACHE_KEY, new HighlighterEntry()); @@ -185,9 +183,4 @@ public class FastVectorHighlighter implements Highlighter { public FieldQuery fieldMatchFieldQuery; public Map mappers = Maps.newHashMap(); } - - private static class Encoders { - public static Encoder DEFAULT = new DefaultEncoder(); - public static Encoder HTML = new SimpleHTMLEncoder(); - } } diff --git a/src/main/java/org/elasticsearch/search/highlight/HighlightBuilder.java b/src/main/java/org/elasticsearch/search/highlight/HighlightBuilder.java index 879c7903a32..afef2d73716 100644 --- a/src/main/java/org/elasticsearch/search/highlight/HighlightBuilder.java +++ b/src/main/java/org/elasticsearch/search/highlight/HighlightBuilder.java @@ -190,7 +190,7 @@ public class HighlightBuilder implements ToXContent { /** * Set type of highlighter to use. Supported types - * are highlighter and fast-vector-highlighter. + * are highlighter, fast-vector-highlighter and postings-highlighter. */ public HighlightBuilder highlighterType(String highlighterType) { this.highlighterType = highlighterType; @@ -420,7 +420,7 @@ public class HighlightBuilder implements ToXContent { /** * Set type of highlighter to use. Supported types - * are highlighter and fast-vector-highlighter. + * are highlighter, fast-vector-highlighter and postings-highlighter. * This overrides global settings set by {@link HighlightBuilder#highlighterType(String)}. 
*/ public Field highlighterType(String highlighterType) { diff --git a/src/main/java/org/elasticsearch/search/highlight/HighlightModule.java b/src/main/java/org/elasticsearch/search/highlight/HighlightModule.java index 1eb6c30448f..8deaee0de34 100644 --- a/src/main/java/org/elasticsearch/search/highlight/HighlightModule.java +++ b/src/main/java/org/elasticsearch/search/highlight/HighlightModule.java @@ -34,6 +34,7 @@ public class HighlightModule extends AbstractModule { public HighlightModule() { registerHighlighter(FastVectorHighlighter.class); registerHighlighter(PlainHighlighter.class); + registerHighlighter(PostingsHighlighter.class); } public void registerHighlighter(Class clazz) { diff --git a/src/main/java/org/elasticsearch/search/highlight/HighlightPhase.java b/src/main/java/org/elasticsearch/search/highlight/HighlightPhase.java index 8251c49a046..a580770d27a 100644 --- a/src/main/java/org/elasticsearch/search/highlight/HighlightPhase.java +++ b/src/main/java/org/elasticsearch/search/highlight/HighlightPhase.java @@ -22,6 +22,7 @@ package org.elasticsearch.search.highlight; import com.google.common.collect.ImmutableMap; import com.google.common.collect.ImmutableSet; import org.apache.lucene.search.Query; +import org.apache.lucene.index.FieldInfo; import org.elasticsearch.ElasticSearchException; import org.elasticsearch.ElasticSearchIllegalArgumentException; import org.elasticsearch.common.component.AbstractComponent; @@ -46,7 +47,7 @@ import static com.google.common.collect.Maps.newHashMap; */ public class HighlightPhase extends AbstractComponent implements FetchSubPhase { - private Highlighters highlighters; + private final Highlighters highlighters; @Inject public HighlightPhase(Settings settings, Highlighters highlighters) { @@ -93,7 +94,13 @@ public class HighlightPhase extends AbstractComponent implements FetchSubPhase { if (field.highlighterType() == null) { boolean useFastVectorHighlighter = fieldMapper.fieldType().storeTermVectors() && 
fieldMapper.fieldType().storeTermVectorOffsets() && fieldMapper.fieldType().storeTermVectorPositions(); - field.highlighterType(useFastVectorHighlighter ? "fvh" : "plain"); + if (useFastVectorHighlighter) { + field.highlighterType("fvh"); + } else if (fieldMapper.fieldType().indexOptions() == FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) { + field.highlighterType("postings"); + } else { + field.highlighterType("plain"); + } } Highlighter highlighter = highlighters.get(field.highlighterType()); diff --git a/src/main/java/org/elasticsearch/search/highlight/HighlightUtils.java b/src/main/java/org/elasticsearch/search/highlight/HighlightUtils.java new file mode 100644 index 00000000000..4fbab0c34ca --- /dev/null +++ b/src/main/java/org/elasticsearch/search/highlight/HighlightUtils.java @@ -0,0 +1,68 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. 
+ */ + +package org.elasticsearch.search.highlight; + +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ImmutableSet; +import org.apache.lucene.search.highlight.DefaultEncoder; +import org.apache.lucene.search.highlight.Encoder; +import org.apache.lucene.search.highlight.SimpleHTMLEncoder; +import org.elasticsearch.index.fieldvisitor.CustomFieldsVisitor; +import org.elasticsearch.index.mapper.FieldMapper; +import org.elasticsearch.search.fetch.FetchSubPhase; +import org.elasticsearch.search.internal.SearchContext; +import org.elasticsearch.search.lookup.SearchLookup; + +import java.io.IOException; +import java.util.List; + +public final class HighlightUtils { + + //U+2029 PARAGRAPH SEPARATOR (PS): each value holds a discrete passage for highlighting (postings highlighter) + public static final char PARAGRAPH_SEPARATOR = 8233; + + private HighlightUtils() { + + } + + static List loadFieldValues(FieldMapper mapper, SearchContext searchContext, FetchSubPhase.HitContext hitContext) throws IOException { + List textsToHighlight; + if (mapper.fieldType().stored()) { + CustomFieldsVisitor fieldVisitor = new CustomFieldsVisitor(ImmutableSet.of(mapper.names().indexName()), false); + hitContext.reader().document(hitContext.docId(), fieldVisitor); + textsToHighlight = fieldVisitor.fields().get(mapper.names().indexName()); + if (textsToHighlight == null) { + // Can happen if the document doesn't have the field to highlight + textsToHighlight = ImmutableList.of(); + } + } else { + SearchLookup lookup = searchContext.lookup(); + lookup.setNextReader(hitContext.readerContext()); + lookup.setNextDocId(hitContext.docId()); + textsToHighlight = lookup.source().extractRawValues(mapper.names().sourcePath()); + } + assert textsToHighlight != null; + return textsToHighlight; + } + + static class Encoders { + static Encoder DEFAULT = new DefaultEncoder(); + static Encoder HTML = new SimpleHTMLEncoder(); + } +} diff --git 
a/src/main/java/org/elasticsearch/search/highlight/HighlighterContext.java b/src/main/java/org/elasticsearch/search/highlight/HighlighterContext.java index 080929f5e94..b1d39210d53 100644 --- a/src/main/java/org/elasticsearch/search/highlight/HighlighterContext.java +++ b/src/main/java/org/elasticsearch/search/highlight/HighlighterContext.java @@ -28,12 +28,12 @@ import org.elasticsearch.search.internal.SearchContext; */ public class HighlighterContext { - public String fieldName; - public SearchContextHighlight.Field field; - public FieldMapper mapper; - public SearchContext context; - public FetchSubPhase.HitContext hitContext; - public Query highlightQuery; + public final String fieldName; + public final SearchContextHighlight.Field field; + public final FieldMapper mapper; + public final SearchContext context; + public final FetchSubPhase.HitContext hitContext; + public final Query highlightQuery; public HighlighterContext(String fieldName, SearchContextHighlight.Field field, FieldMapper mapper, SearchContext context, FetchSubPhase.HitContext hitContext, Query highlightQuery) { diff --git a/src/main/java/org/elasticsearch/search/highlight/PlainHighlighter.java b/src/main/java/org/elasticsearch/search/highlight/PlainHighlighter.java index d12250837e9..4a3ee6cc045 100644 --- a/src/main/java/org/elasticsearch/search/highlight/PlainHighlighter.java +++ b/src/main/java/org/elasticsearch/search/highlight/PlainHighlighter.java @@ -18,8 +18,6 @@ */ package org.elasticsearch.search.highlight; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ImmutableSet; import com.google.common.collect.Maps; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.TokenStream; @@ -31,12 +29,10 @@ import org.apache.lucene.util.CollectionUtil; import org.elasticsearch.ElasticSearchIllegalArgumentException; import org.elasticsearch.common.text.StringText; import org.elasticsearch.common.text.Text; -import 
org.elasticsearch.index.fieldvisitor.CustomFieldsVisitor; import org.elasticsearch.index.mapper.FieldMapper; import org.elasticsearch.search.fetch.FetchPhaseExecutionException; import org.elasticsearch.search.fetch.FetchSubPhase; import org.elasticsearch.search.internal.SearchContext; -import org.elasticsearch.search.lookup.SearchLookup; import java.io.IOException; import java.util.ArrayList; @@ -62,7 +58,7 @@ public class PlainHighlighter implements Highlighter { FetchSubPhase.HitContext hitContext = highlighterContext.hitContext; FieldMapper mapper = highlighterContext.mapper; - Encoder encoder = field.encoder().equals("html") ? Encoders.HTML : Encoders.DEFAULT; + Encoder encoder = field.encoder().equals("html") ? HighlightUtils.Encoders.HTML : HighlightUtils.Encoders.DEFAULT; if (!hitContext.cache().containsKey(CACHE_KEY)) { Map, org.apache.lucene.search.highlight.Highlighter> mappers = Maps.newHashMap(); @@ -97,31 +93,14 @@ public class PlainHighlighter implements Highlighter { cache.put(mapper, entry); } - List textsToHighlight; - if (mapper.fieldType().stored()) { - try { - CustomFieldsVisitor fieldVisitor = new CustomFieldsVisitor(ImmutableSet.of(mapper.names().indexName()), false); - hitContext.reader().document(hitContext.docId(), fieldVisitor); - textsToHighlight = fieldVisitor.fields().get(mapper.names().indexName()); - if (textsToHighlight == null) { - // Can happen if the document doesn't have the field to highlight - textsToHighlight = ImmutableList.of(); - } - } catch (Exception e) { - throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e); - } - } else { - SearchLookup lookup = context.lookup(); - lookup.setNextReader(hitContext.readerContext()); - lookup.setNextDocId(hitContext.docId()); - textsToHighlight = lookup.source().extractRawValues(mapper.names().sourcePath()); - } - assert textsToHighlight != null; - // a HACK to make highlighter do highlighting, even though its using the 
single frag list builder int numberOfFragments = field.numberOfFragments() == 0 ? 1 : field.numberOfFragments(); ArrayList fragsList = new ArrayList(); + List textsToHighlight; + try { + textsToHighlight = HighlightUtils.loadFieldValues(mapper, context, hitContext); + for (Object textToHighlight : textsToHighlight) { String text = textToHighlight.toString(); Analyzer analyzer = context.mapperService().documentMapper(hitContext.hit().type()).mappers().indexAnalyzer(); @@ -185,7 +164,7 @@ public class PlainHighlighter implements Highlighter { return null; } - private int findGoodEndForNoHighlightExcerpt(int noMatchSize, TokenStream tokenStream) throws IOException { + private static int findGoodEndForNoHighlightExcerpt(int noMatchSize, TokenStream tokenStream) throws IOException { try { if (!tokenStream.hasAttribute(OffsetAttribute.class)) { // Can't split on term boundaries without offsets @@ -211,9 +190,4 @@ public class PlainHighlighter implements Highlighter { tokenStream.close(); } } - - private static class Encoders { - public static Encoder DEFAULT = new DefaultEncoder(); - public static Encoder HTML = new SimpleHTMLEncoder(); - } } diff --git a/src/main/java/org/elasticsearch/search/highlight/PostingsHighlighter.java b/src/main/java/org/elasticsearch/search/highlight/PostingsHighlighter.java new file mode 100644 index 00000000000..a4cf31e4799 --- /dev/null +++ b/src/main/java/org/elasticsearch/search/highlight/PostingsHighlighter.java @@ -0,0 +1,237 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. + */ + +package org.elasticsearch.search.highlight; + +import com.google.common.collect.Maps; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.MultiReader; +import org.apache.lucene.index.Term; +import org.apache.lucene.search.IndexSearcher; +import org.apache.lucene.search.Query; +import org.apache.lucene.search.highlight.Encoder; +import org.apache.lucene.search.postingshighlight.CustomPassageFormatter; +import org.apache.lucene.search.postingshighlight.CustomPostingsHighlighter; +import org.apache.lucene.search.postingshighlight.Snippet; +import org.apache.lucene.search.postingshighlight.WholeBreakIterator; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.CollectionUtil; +import org.apache.lucene.util.UnicodeUtil; +import org.elasticsearch.ElasticSearchIllegalArgumentException; +import org.elasticsearch.common.Strings; +import org.elasticsearch.common.text.StringText; +import org.elasticsearch.index.mapper.FieldMapper; +import org.elasticsearch.search.fetch.FetchPhaseExecutionException; +import org.elasticsearch.search.fetch.FetchSubPhase; +import org.elasticsearch.search.internal.SearchContext; + +import java.io.IOException; +import java.text.BreakIterator; +import java.util.*; + +public class PostingsHighlighter implements Highlighter { + + private static final String CACHE_KEY = "highlight-postings"; + + @Override + public String[] names() { + return new String[]{"postings", "postings-highlighter"}; + } + + @Override + public HighlightField 
highlight(HighlighterContext highlighterContext) { + + FieldMapper fieldMapper = highlighterContext.mapper; + SearchContextHighlight.Field field = highlighterContext.field; + if (fieldMapper.fieldType().indexOptions() != FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) { + throw new ElasticSearchIllegalArgumentException("the field [" + field.field() + "] should be indexed with positions and offsets in the postings list to be used with postings highlighter"); + } + + SearchContext context = highlighterContext.context; + FetchSubPhase.HitContext hitContext = highlighterContext.hitContext; + + if (!hitContext.cache().containsKey(CACHE_KEY)) { + Query query; + try { + query = rewrite(context.query()); + } catch (IOException e) { + throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e); + } + SortedSet queryTerms = extractTerms(query); + hitContext.cache().put(CACHE_KEY, new HighlighterEntry(queryTerms)); + } + + HighlighterEntry highlighterEntry = (HighlighterEntry) hitContext.cache().get(CACHE_KEY); + MapperHighlighterEntry mapperHighlighterEntry = highlighterEntry.mappers.get(fieldMapper); + + if (mapperHighlighterEntry == null) { + Encoder encoder = field.encoder().equals("html") ? HighlightUtils.Encoders.HTML : HighlightUtils.Encoders.DEFAULT; + CustomPassageFormatter passageFormatter = new CustomPassageFormatter(field.preTags()[0], field.postTags()[0], encoder); + BytesRef[] filteredQueryTerms = filterTerms(highlighterEntry.queryTerms, highlighterContext.fieldName, field.requireFieldMatch()); + mapperHighlighterEntry = new MapperHighlighterEntry(passageFormatter, filteredQueryTerms); + } + + //we merge back multiple values into a single value using the paragraph separator, unless we have to highlight every single value separately (number_of_fragments=0). 
+ boolean mergeValues = field.numberOfFragments() != 0; + List snippets = new ArrayList(); + int numberOfFragments; + + try { + //we manually load the field values (from source if needed) + List textsToHighlight = HighlightUtils.loadFieldValues(fieldMapper, context, hitContext); + CustomPostingsHighlighter highlighter = new CustomPostingsHighlighter(mapperHighlighterEntry.passageFormatter, textsToHighlight, mergeValues, Integer.MAX_VALUE-1, field.noMatchSize()); + + if (field.numberOfFragments() == 0) { + highlighter.setBreakIterator(new WholeBreakIterator()); + numberOfFragments = 1; //1 per value since we highlight per value + } else { + numberOfFragments = field.numberOfFragments(); + } + + //we highlight every value separately calling the highlight method multiple times, only if we need to have back a snippet per value (whole value) + int values = mergeValues ? 1 : textsToHighlight.size(); + for (int i = 0; i < values; i++) { + Snippet[] fieldSnippets = highlighter.highlightDoc(highlighterContext.fieldName, mapperHighlighterEntry.filteredQueryTerms, new IndexSearcher(hitContext.reader()), hitContext.docId(), numberOfFragments); + if (fieldSnippets != null) { + for (Snippet fieldSnippet : fieldSnippets) { + if (Strings.hasText(fieldSnippet.getText())) { + snippets.add(fieldSnippet); + } + } + } + } + + } catch(IOException e) { + throw new FetchPhaseExecutionException(context, "Failed to highlight field [" + highlighterContext.fieldName + "]", e); + } + + snippets = filterSnippets(snippets, field.numberOfFragments()); + + if (field.scoreOrdered()) { + //let's sort the snippets by score if needed + CollectionUtil.introSort(snippets, new Comparator() { + public int compare(Snippet o1, Snippet o2) { + return (int) Math.signum(o2.getScore() - o1.getScore()); + } + }); + } + + String[] fragments = new String[snippets.size()]; + for (int i = 0; i < fragments.length; i++) { + fragments[i] = snippets.get(i).getText(); + } + + if (fragments.length > 0) { + return new 
HighlightField(highlighterContext.fieldName, StringText.convertFromStringArray(fragments)); + } + + return null; + } + + private static final IndexReader EMPTY_INDEXREADER = new MultiReader(); + + private static Query rewrite(Query original) throws IOException { + Query query = original; + for (Query rewrittenQuery = query.rewrite(EMPTY_INDEXREADER); rewrittenQuery != query; + rewrittenQuery = query.rewrite(EMPTY_INDEXREADER)) { + query = rewrittenQuery; + } + return query; + } + + private static SortedSet extractTerms(Query query) { + SortedSet queryTerms = new TreeSet(); + query.extractTerms(queryTerms); + return queryTerms; + } + + private static BytesRef[] filterTerms(SortedSet queryTerms, String field, boolean requireFieldMatch) { + SortedSet fieldTerms; + if (requireFieldMatch) { + Term floor = new Term(field, ""); + Term ceiling = new Term(field, UnicodeUtil.BIG_TERM); + fieldTerms = queryTerms.subSet(floor, ceiling); + } else { + fieldTerms = queryTerms; + } + + BytesRef terms[] = new BytesRef[fieldTerms.size()]; + int termUpto = 0; + for(Term term : fieldTerms) { + terms[termUpto++] = term.bytes(); + } + + return terms; + } + + private static List filterSnippets(List snippets, int numberOfFragments) { + + //We need to filter the snippets: due to no_match_size we could have + //highlighted snippets together with non highlighted ones. + //We don't want to mix those up + List filteredSnippets = new ArrayList(snippets.size()); + for (Snippet snippet : snippets) { + if (snippet.isHighlighted()) { + filteredSnippets.add(snippet); + } + } + + //if there's at least one highlighted snippet, we return all the highlighted ones + //otherwise we return the first non highlighted one if available + if (filteredSnippets.size() == 0) { + if (snippets.size() > 0) { + Snippet snippet = snippets.get(0); + //if we did discrete per value highlighting using whole break iterator (as number_of_fragments was 0) + //we need to obtain the first sentence of the first value + if 
(numberOfFragments == 0) { + BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT); + String text = snippet.getText(); + bi.setText(text); + int next = bi.next(); + if (next != BreakIterator.DONE) { + String newText = text.substring(0, next).trim(); + snippet = new Snippet(newText, snippet.getScore(), snippet.isHighlighted()); + } + } + filteredSnippets.add(snippet); + } + } + + return filteredSnippets; + } + + private static class HighlighterEntry { + final SortedSet queryTerms; + Map, MapperHighlighterEntry> mappers = Maps.newHashMap(); + + private HighlighterEntry(SortedSet queryTerms) { + this.queryTerms = queryTerms; + } + } + + private static class MapperHighlighterEntry { + final CustomPassageFormatter passageFormatter; + final BytesRef[] filteredQueryTerms; + + private MapperHighlighterEntry(CustomPassageFormatter passageFormatter, BytesRef[] filteredQueryTerms) { + this.passageFormatter = passageFormatter; + this.filteredQueryTerms = filteredQueryTerms; + } + } +} diff --git a/src/test/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatterTests.java b/src/test/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatterTests.java new file mode 100644 index 00000000000..03c2fdb4e9b --- /dev/null +++ b/src/test/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatterTests.java @@ -0,0 +1,107 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. + */ + +package org.apache.lucene.search.postingshighlight; + +import org.apache.lucene.search.highlight.DefaultEncoder; +import org.apache.lucene.search.highlight.SimpleHTMLEncoder; +import org.apache.lucene.util.BytesRef; +import org.junit.Test; + +import static org.hamcrest.CoreMatchers.equalTo; +import static org.hamcrest.CoreMatchers.notNullValue; +import static org.hamcrest.MatcherAssert.assertThat; + + +public class CustomPassageFormatterTests { + + @Test + public void testSimpleFormat() { + String content = "This is a really cool highlighter. Postings highlighter gives nice snippets back. 
No matches here."; + + CustomPassageFormatter passageFormatter = new CustomPassageFormatter("", "", new DefaultEncoder()); + + Passage[] passages = new Passage[3]; + String match = "highlighter"; + BytesRef matchBytesRef = new BytesRef(match); + + Passage passage1 = new Passage(); + int start = content.indexOf(match); + int end = start + match.length(); + passage1.startOffset = 0; + passage1.endOffset = end + 2; //let's include the whitespace at the end to make sure we trim it + passage1.addMatch(start, end, matchBytesRef); + passages[0] = passage1; + + Passage passage2 = new Passage(); + start = content.lastIndexOf(match); + end = start + match.length(); + passage2.startOffset = passage1.endOffset; + passage2.endOffset = end + 26; + passage2.addMatch(start, end, matchBytesRef); + passages[1] = passage2; + + Passage passage3 = new Passage(); + passage3.startOffset = passage2.endOffset; + passage3.endOffset = content.length(); + passages[2] = passage3; + + Snippet[] fragments = passageFormatter.format(passages, content); + assertThat(fragments, notNullValue()); + assertThat(fragments.length, equalTo(3)); + assertThat(fragments[0].getText(), equalTo("This is a really cool highlighter.")); + assertThat(fragments[0].isHighlighted(), equalTo(true)); + assertThat(fragments[1].getText(), equalTo("Postings highlighter gives nice snippets back.")); + assertThat(fragments[1].isHighlighted(), equalTo(true)); + assertThat(fragments[2].getText(), equalTo("No matches here.")); + assertThat(fragments[2].isHighlighted(), equalTo(false)); + } + + @Test + public void testHtmlEncodeFormat() { + String content = "This is a really cool highlighter. 
Postings highlighter gives nice snippets back."; + + CustomPassageFormatter passageFormatter = new CustomPassageFormatter("", "", new SimpleHTMLEncoder()); + + Passage[] passages = new Passage[2]; + String match = "highlighter"; + BytesRef matchBytesRef = new BytesRef(match); + + Passage passage1 = new Passage(); + int start = content.indexOf(match); + int end = start + match.length(); + passage1.startOffset = 0; + passage1.endOffset = end + 6; //let's include the whitespace at the end to make sure we trim it + passage1.addMatch(start, end, matchBytesRef); + passages[0] = passage1; + + Passage passage2 = new Passage(); + start = content.lastIndexOf(match); + end = start + match.length(); + passage2.startOffset = passage1.endOffset; + passage2.endOffset = content.length(); + passage2.addMatch(start, end, matchBytesRef); + passages[1] = passage2; + + Snippet[] fragments = passageFormatter.format(passages, content); + assertThat(fragments, notNullValue()); + assertThat(fragments.length, equalTo(2)); + assertThat(fragments[0].getText(), equalTo("<b>This is a really cool highlighter.</b>")); + assertThat(fragments[1].getText(), equalTo("Postings highlighter gives nice snippets back.")); + } +} diff --git a/src/test/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighterTests.java b/src/test/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighterTests.java new file mode 100644 index 00000000000..e43c9e4c502 --- /dev/null +++ b/src/test/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighterTests.java @@ -0,0 +1,487 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. + */ + +package org.apache.lucene.search.postingshighlight; + +import org.apache.lucene.analysis.MockAnalyzer; +import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field; +import org.apache.lucene.document.FieldType; +import org.apache.lucene.document.TextField; +import org.apache.lucene.index.*; +import org.apache.lucene.search.*; +import org.apache.lucene.search.highlight.DefaultEncoder; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.LuceneTestCase; +import org.apache.lucene.util.UnicodeUtil; +import org.elasticsearch.search.highlight.HighlightUtils; +import org.elasticsearch.test.ElasticsearchLuceneTestCase; +import org.junit.Test; + +import java.util.*; + +import static org.hamcrest.CoreMatchers.equalTo; +import static org.hamcrest.CoreMatchers.notNullValue; + +@LuceneTestCase.SuppressCodecs({"MockFixedIntBlock", "MockVariableIntBlock", "MockSep", "MockRandom", "Lucene3x"}) +public class CustomPostingsHighlighterTests extends ElasticsearchLuceneTestCase { + + @Test + public void testDiscreteHighlightingPerValue() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new 
Field("body", "", offsetsType); + final String firstValue = "This is a test. Just a test highlighting from postings highlighter."; + Document doc = new Document(); + doc.add(body); + body.setStringValue(firstValue); + + final String secondValue = "This is the second value to perform highlighting on."; + Field body2 = new Field("body", "", offsetsType); + doc.add(body2); + body2.setStringValue(secondValue); + + final String thirdValue = "This is the third value to test highlighting with postings."; + Field body3 = new Field("body", "", offsetsType); + doc.add(body3); + body3.setStringValue(thirdValue); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + List fieldValues = new ArrayList(); + fieldValues.add(firstValue); + fieldValues.add(secondValue); + fieldValues.add(thirdValue); + + + IndexSearcher searcher = newSearcher(ir); + + Query query = new TermQuery(new Term("body", "highlighting")); + BytesRef[] queryTerms = filterTerms(extractTerms(query), "body", true); + + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + int docId = topDocs.scoreDocs[0].doc; + + //highlighting per value, considering whole values (simulating number_of_fragments=0) + CustomPostingsHighlighter highlighter = new CustomPostingsHighlighter(new CustomPassageFormatter("", "", new DefaultEncoder()), fieldValues, false, Integer.MAX_VALUE - 1, 0); + highlighter.setBreakIterator(new WholeBreakIterator()); + + Snippet[] snippets = highlighter.highlightDoc("body", queryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0].getText(), equalTo("This is a test. 
Just a test highlighting from postings highlighter.")); + + snippets = highlighter.highlightDoc("body", queryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0].getText(), equalTo("This is the second value to perform highlighting on.")); + + snippets = highlighter.highlightDoc("body", queryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0].getText(), equalTo("This is the third value to test highlighting with postings.")); + + + //let's try without whole break iterator as well, to prove that highlighting works the same when working per value (not optimized though) + highlighter = new CustomPostingsHighlighter(new CustomPassageFormatter("", "", new DefaultEncoder()), fieldValues, false, Integer.MAX_VALUE - 1, 0); + + snippets = highlighter.highlightDoc("body", queryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0].getText(), equalTo("Just a test highlighting from postings highlighter.")); + + snippets = highlighter.highlightDoc("body", queryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0].getText(), equalTo("This is the second value to perform highlighting on.")); + + snippets = highlighter.highlightDoc("body", queryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0].getText(), equalTo("This is the third value to test highlighting with postings.")); + + ir.close(); + dir.close(); + } + + /* + Tests that scoring works properly even when using discrete per value highlighting + */ + @Test + public void testDiscreteHighlightingScoring() throws Exception { + + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new 
FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + + //good position but only one match + final String firstValue = "This is a test. Just a test1 highlighting from postings highlighter."; + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + body.setStringValue(firstValue); + + //two matches, not the best snippet due to its length though + final String secondValue = "This is the second highlighting value to perform highlighting on a longer text that gets scored lower."; + Field body2 = new Field("body", "", offsetsType); + doc.add(body2); + body2.setStringValue(secondValue); + + //two matches and short, will be scored highest + final String thirdValue = "This is highlighting the third short highlighting value."; + Field body3 = new Field("body", "", offsetsType); + doc.add(body3); + body3.setStringValue(thirdValue); + + //one match, same as first but at the end, will be scored lower due to its position + final String fourthValue = "Just a test4 highlighting from postings highlighter."; + Field body4 = new Field("body", "", offsetsType); + doc.add(body4); + body4.setStringValue(fourthValue); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + + String firstHlValue = "Just a test1 highlighting from postings highlighter."; + String secondHlValue = "This is the second highlighting value to perform highlighting on a longer text that gets scored lower."; + String thirdHlValue = "This is highlighting the third short highlighting value."; + String fourthHlValue = "Just a test4 highlighting from postings highlighter."; + + + IndexSearcher searcher = newSearcher(ir); + Query query = new TermQuery(new Term("body", "highlighting")); + BytesRef[] queryTerms = filterTerms(extractTerms(query), "body", true); + + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); 
+ + int docId = topDocs.scoreDocs[0].doc; + + List fieldValues = new ArrayList(); + fieldValues.add(firstValue); + fieldValues.add(secondValue); + fieldValues.add(thirdValue); + fieldValues.add(fourthValue); + + boolean mergeValues = true; + CustomPostingsHighlighter highlighter = new CustomPostingsHighlighter(new CustomPassageFormatter("", "", new DefaultEncoder()), fieldValues, mergeValues, Integer.MAX_VALUE-1, 0); + Snippet[] snippets = highlighter.highlightDoc("body", queryTerms, searcher, docId, 5); + + assertThat(snippets.length, equalTo(4)); + + assertThat(snippets[0].getText(), equalTo(firstHlValue)); + assertThat(snippets[1].getText(), equalTo(secondHlValue)); + assertThat(snippets[2].getText(), equalTo(thirdHlValue)); + assertThat(snippets[3].getText(), equalTo(fourthHlValue)); + + + //Let's highlight each separate value and check how the snippets are scored + mergeValues = false; + highlighter = new CustomPostingsHighlighter(new CustomPassageFormatter("", "", new DefaultEncoder()), fieldValues, mergeValues, Integer.MAX_VALUE-1, 0); + List snippets2 = new ArrayList(); + for (int i = 0; i < fieldValues.size(); i++) { + snippets2.addAll(Arrays.asList(highlighter.highlightDoc("body", queryTerms, searcher, docId, 5))); + } + + assertThat(snippets2.size(), equalTo(4)); + assertThat(snippets2.get(0).getText(), equalTo(firstHlValue)); + assertThat(snippets2.get(1).getText(), equalTo(secondHlValue)); + assertThat(snippets2.get(2).getText(), equalTo(thirdHlValue)); + assertThat(snippets2.get(3).getText(), equalTo(fourthHlValue)); + + Comparator comparator = new Comparator() { + @Override + public int compare(Snippet o1, Snippet o2) { + return (int)Math.signum(o1.getScore() - o2.getScore()); + } + }; + + //sorting both groups of snippets + Arrays.sort(snippets, comparator); + Collections.sort(snippets2, comparator); + + //checking that the snippets are in the same order, regardless of whether we used per value discrete highlighting or not + //we can't compare the 
scores directly since they are slightly different due to the multiValued separator added when merging values together + //That causes slightly different lengths and start offsets, thus a slightly different score. + //Anyways, that's not an issue. What's important is that the score is computed the same way, so that the produced order is always the same. + for (int i = 0; i < snippets.length; i++) { + assertThat(snippets[i].getText(), equalTo(snippets2.get(i).getText())); + } + + ir.close(); + dir.close(); + } + + /* + Tests that we produce the same snippets and scores when manually merging values in our own custom highlighter rather than using the built-in code + */ + @Test + public void testMergeValuesScoring() throws Exception { + + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + + //good position but only one match + final String firstValue = "This is a test. 
Just a test1 highlighting from postings highlighter."; + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + body.setStringValue(firstValue); + + //two matches, not the best snippet due to its length though + final String secondValue = "This is the second highlighting value to perform highlighting on a longer text that gets scored lower."; + Field body2 = new Field("body", "", offsetsType); + doc.add(body2); + body2.setStringValue(secondValue); + + //two matches and short, will be scored highest + final String thirdValue = "This is highlighting the third short highlighting value."; + Field body3 = new Field("body", "", offsetsType); + doc.add(body3); + body3.setStringValue(thirdValue); + + //one match, same as first but at the end, will be scored lower due to its position + final String fourthValue = "Just a test4 highlighting from postings highlighter."; + Field body4 = new Field("body", "", offsetsType); + doc.add(body4); + body4.setStringValue(fourthValue); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + + String firstHlValue = "Just a test1 highlighting from postings highlighter."; + String secondHlValue = "This is the second highlighting value to perform highlighting on a longer text that gets scored lower."; + String thirdHlValue = "This is highlighting the third short highlighting value."; + String fourthHlValue = "Just a test4 highlighting from postings highlighter."; + + + IndexSearcher searcher = newSearcher(ir); + Query query = new TermQuery(new Term("body", "highlighting")); + BytesRef[] queryTerms = filterTerms(extractTerms(query), "body", true); + + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + + int docId = topDocs.scoreDocs[0].doc; + + List fieldValues = new ArrayList(); + fieldValues.add(firstValue); + fieldValues.add(secondValue); + fieldValues.add(thirdValue); + fieldValues.add(fourthValue); + + 
boolean mergeValues = true; + CustomPostingsHighlighter highlighter = new CustomPostingsHighlighter(new CustomPassageFormatter("", "", new DefaultEncoder()), fieldValues, mergeValues, Integer.MAX_VALUE-1, 0); + Snippet[] snippets = highlighter.highlightDoc("body", queryTerms, searcher, docId, 5); + + assertThat(snippets.length, equalTo(4)); + + assertThat(snippets[0].getText(), equalTo(firstHlValue)); + assertThat(snippets[1].getText(), equalTo(secondHlValue)); + assertThat(snippets[2].getText(), equalTo(thirdHlValue)); + assertThat(snippets[3].getText(), equalTo(fourthHlValue)); + + + //testing now our fork / normal postings highlighter, which merges multiple values together using the paragraph separator + XPostingsHighlighter highlighter2 = new XPostingsHighlighter(Integer.MAX_VALUE - 1) { + @Override + protected char getMultiValuedSeparator(String field) { + return HighlightUtils.PARAGRAPH_SEPARATOR; + } + + @Override + protected XPassageFormatter getFormatter(String field) { + return new CustomPassageFormatter("", "", new DefaultEncoder()); + } + }; + + Map<String, Object[]> highlightMap = highlighter2.highlightFieldsAsObjects(new String[]{"body"}, query, searcher, new int[]{docId}, new int[]{5}); + Object[] objects = highlightMap.get("body"); + assertThat(objects, notNullValue()); + assertThat(objects.length, equalTo(1)); + Snippet[] normalSnippets = (Snippet[])objects[0]; + + assertThat(normalSnippets.length, equalTo(4)); + + assertThat(normalSnippets[0].getText(), equalTo(firstHlValue)); + assertThat(normalSnippets[1].getText(), equalTo(secondHlValue)); + assertThat(normalSnippets[2].getText(), equalTo(thirdHlValue)); + assertThat(normalSnippets[3].getText(), equalTo(fourthHlValue)); + + + //compare every snippet pair, not just the first one: same text and same score in both highlighters + for (int i = 0; i < normalSnippets.length; i++) { + Snippet normalSnippet = normalSnippets[i]; + Snippet customSnippet = snippets[i]; + assertThat(customSnippet.getText(), equalTo(normalSnippet.getText())); + assertThat(customSnippet.getScore(), equalTo(normalSnippet.getScore())); + } + + 
ir.close(); + dir.close(); + } + + @Test + public void testRequireFieldMatch() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Field none = new Field("none", "", offsetsType); + Document doc = new Document(); + doc.add(body); + doc.add(none); + + String firstValue = "This is a test. Just a test highlighting from postings. Feel free to ignore."; + body.setStringValue(firstValue); + none.setStringValue(firstValue); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + Query query = new TermQuery(new Term("none", "highlighting")); + SortedSet queryTerms = extractTerms(query); + + IndexSearcher searcher = newSearcher(ir); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + int docId = topDocs.scoreDocs[0].doc; + + List values = new ArrayList(); + values.add(firstValue); + + CustomPassageFormatter passageFormatter = new CustomPassageFormatter("", "", new DefaultEncoder()); + CustomPostingsHighlighter highlighter = new CustomPostingsHighlighter(passageFormatter, values, true, Integer.MAX_VALUE - 1, 0); + + //no snippets with simulated require field match (we filter the terms ourselves) + boolean requireFieldMatch = true; + BytesRef[] filteredQueryTerms = filterTerms(queryTerms, "body", requireFieldMatch); + Snippet[] snippets = highlighter.highlightDoc("body", filteredQueryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(0)); + + + highlighter = new CustomPostingsHighlighter(passageFormatter, values, true, Integer.MAX_VALUE - 1, 0); + //one 
snippet without require field match, just passing in the query terms with no filtering on our side + requireFieldMatch = false; + filteredQueryTerms = filterTerms(queryTerms, "body", requireFieldMatch); + snippets = highlighter.highlightDoc("body", filteredQueryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0].getText(), equalTo("Just a test highlighting from postings.")); + + ir.close(); + dir.close(); + } + + @Test + public void testNoMatchSize() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Field none = new Field("none", "", offsetsType); + Document doc = new Document(); + doc.add(body); + doc.add(none); + + String firstValue = "This is a test. Just a test highlighting from postings. 
Feel free to ignore."; + body.setStringValue(firstValue); + none.setStringValue(firstValue); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + Query query = new TermQuery(new Term("none", "highlighting")); + SortedSet queryTerms = extractTerms(query); + + IndexSearcher searcher = newSearcher(ir); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + int docId = topDocs.scoreDocs[0].doc; + + List values = new ArrayList(); + values.add(firstValue); + + BytesRef[] filteredQueryTerms = filterTerms(queryTerms, "body", true); + CustomPassageFormatter passageFormatter = new CustomPassageFormatter("", "", new DefaultEncoder()); + + CustomPostingsHighlighter highlighter = new CustomPostingsHighlighter(passageFormatter, values, true, Integer.MAX_VALUE - 1, 0); + Snippet[] snippets = highlighter.highlightDoc("body", filteredQueryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(0)); + + highlighter = new CustomPostingsHighlighter(passageFormatter, values, true, Integer.MAX_VALUE - 1, atLeast(1)); + snippets = highlighter.highlightDoc("body", filteredQueryTerms, searcher, docId, 5); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0].getText(), equalTo("This is a test.")); + + ir.close(); + dir.close(); + } + + private static SortedSet extractTerms(Query query) { + SortedSet queryTerms = new TreeSet(); + query.extractTerms(queryTerms); + return queryTerms; + } + + private static BytesRef[] filterTerms(SortedSet queryTerms, String field, boolean requireFieldMatch) { + SortedSet fieldTerms; + if (requireFieldMatch) { + Term floor = new Term(field, ""); + Term ceiling = new Term(field, UnicodeUtil.BIG_TERM); + fieldTerms = queryTerms.subSet(floor, ceiling); + } else { + fieldTerms = queryTerms; + } + + BytesRef terms[] = new BytesRef[fieldTerms.size()]; + int termUpto = 0; + for(Term term : fieldTerms) { + terms[termUpto++] = term.bytes(); + } + + return 
terms; + } +} diff --git a/src/test/java/org/apache/lucene/search/postingshighlight/XPostingsHighlighterTests.java b/src/test/java/org/apache/lucene/search/postingshighlight/XPostingsHighlighterTests.java new file mode 100644 index 00000000000..74048cb6e3b --- /dev/null +++ b/src/test/java/org/apache/lucene/search/postingshighlight/XPostingsHighlighterTests.java @@ -0,0 +1,1693 @@ +/* + * Licensed to ElasticSearch and Shay Banon under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. ElasticSearch licenses this + * file to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations under + * the License. 
+ */ + +package org.apache.lucene.search.postingshighlight; + +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.MockAnalyzer; +import org.apache.lucene.analysis.MockTokenizer; +import org.apache.lucene.document.*; +import org.apache.lucene.index.*; +import org.apache.lucene.search.*; +import org.apache.lucene.search.highlight.DefaultEncoder; +import org.apache.lucene.store.Directory; +import org.apache.lucene.util.LuceneTestCase; +import org.elasticsearch.test.ElasticsearchLuceneTestCase; +import org.junit.Test; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStreamReader; +import java.text.BreakIterator; +import java.util.Arrays; +import java.util.Iterator; +import java.util.Map; + +import static org.hamcrest.CoreMatchers.*; + +@LuceneTestCase.SuppressCodecs({"MockFixedIntBlock", "MockVariableIntBlock", "MockSep", "MockRandom", "Lucene3x"}) +public class XPostingsHighlighterTests extends ElasticsearchLuceneTestCase { + + /* + Tests changes needed to make possible to perform discrete highlighting. + We want to highlight every field value separately in case of multiple values, at least when needing to return the whole field content + This is needed to be able to get back a single snippet per value when number_of_fragments=0 + */ + @Test + public void testDiscreteHighlightingPerValue() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + final String firstValue = "This is a test. 
Just a test highlighting from postings highlighter."; + Document doc = new Document(); + doc.add(body); + body.setStringValue(firstValue); + + final String secondValue = "This is the second value to perform highlighting on."; + Field body2 = new Field("body", "", offsetsType); + doc.add(body2); + body2.setStringValue(secondValue); + + final String thirdValue = "This is the third value to test highlighting with postings."; + Field body3 = new Field("body", "", offsetsType); + doc.add(body3); + body3.setStringValue(thirdValue); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter() { + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + + @Override + protected char getMultiValuedSeparator(String field) { + //U+2029 PARAGRAPH SEPARATOR (PS): each value holds a discrete passage for highlighting + return 8233; + } + }; + Query query = new TermQuery(new Term("body", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + + String firstHlValue = "This is a test. 
Just a test highlighting from postings highlighter."; + String secondHlValue = "This is the second value to perform highlighting on."; + String thirdHlValue = "This is the third value to test highlighting with postings."; + + //default behaviour: using the WholeBreakIterator, despite the multi valued paragraph separator we get back a single snippet for multiple values + assertThat(snippets[0], equalTo(firstHlValue + (char)8233 + secondHlValue + (char)8233 + thirdHlValue)); + + + + highlighter = new XPostingsHighlighter() { + Iterator valuesIterator = Arrays.asList(firstValue, secondValue, thirdValue).iterator(); + Iterator offsetsIterator = Arrays.asList(0, firstValue.length() + 1, firstValue.length() + secondValue.length() + 2).iterator(); + + @Override + protected String[][] loadFieldValues(IndexSearcher searcher, String[] fields, int[] docids, int maxLength) throws IOException { + return new String[][]{new String[]{valuesIterator.next()}}; + } + + @Override + protected int getOffsetForCurrentValue(String field, int docId) { + return offsetsIterator.next(); + } + + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + }; + + //first call using the WholeBreakIterator, we get now only the first value properly highlighted as we wish + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], equalTo(firstHlValue)); + + //second call using the WholeBreakIterator, we get now only the second value properly highlighted as we wish + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], equalTo(secondHlValue)); + + //third call using the WholeBreakIterator, we get now only the third value properly highlighted as we wish + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], 
equalTo(thirdHlValue)); + + ir.close(); + dir.close(); + } + + @Test + public void testDiscreteHighlightingPerValue_secondValueWithoutMatches() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + final String firstValue = "This is a test. Just a test highlighting from postings highlighter."; + Document doc = new Document(); + doc.add(body); + body.setStringValue(firstValue); + + final String secondValue = "This is the second value without matches."; + Field body2 = new Field("body", "", offsetsType); + doc.add(body2); + body2.setStringValue(secondValue); + + final String thirdValue = "This is the third value to test highlighting with postings."; + Field body3 = new Field("body", "", offsetsType); + doc.add(body3); + body3.setStringValue(thirdValue); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + + Query query = new TermQuery(new Term("body", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + + XPostingsHighlighter highlighter = new XPostingsHighlighter() { + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + + @Override + protected char getMultiValuedSeparator(String field) { + //U+2029 PARAGRAPH SEPARATOR (PS): each value holds a discrete passage for highlighting + return 8233; + } + + @Override + protected Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) { + return new Passage[0]; + } + }; + String 
snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + String firstHlValue = "This is a test. Just a test highlighting from postings highlighter."; + String thirdHlValue = "This is the third value to test highlighting with postings."; + //default behaviour: using the WholeBreakIterator, despite the multi valued paragraph separator we get back a single snippet for multiple values + //all three values are returned; the second one has no matches, so it comes back without any highlighting. + assertThat(snippets[0], equalTo(firstHlValue + (char)8233 + secondValue + (char)8233 + thirdHlValue)); + + + highlighter = new XPostingsHighlighter() { + Iterator<String> valuesIterator = Arrays.asList(firstValue, secondValue, thirdValue).iterator(); + Iterator<Integer> offsetsIterator = Arrays.asList(0, firstValue.length() + 1, firstValue.length() + secondValue.length() + 2).iterator(); + + @Override + protected String[][] loadFieldValues(IndexSearcher searcher, String[] fields, int[] docids, int maxLength) throws IOException { + return new String[][]{new String[]{valuesIterator.next()}}; + } + + @Override + protected int getOffsetForCurrentValue(String field, int docId) { + return offsetsIterator.next(); + } + + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + + @Override + protected Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) { + return new Passage[0]; + } + }; + + //first call using the WholeBreakIterator, we get now only the first value properly highlighted as we wish + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], equalTo(firstHlValue)); + + //second call using the WholeBreakIterator, we get now nothing back because there's nothing to highlight in the second value + snippets = highlighter.highlight("body", query, searcher, topDocs); + 
assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], nullValue()); + + //third call using the WholeBreakIterator, we get now only the third value properly highlighted as we wish + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], equalTo(thirdHlValue)); + + ir.close(); + dir.close(); + } + + @Test + public void testDiscreteHighlightingPerValue_MultipleMatches() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + final String firstValue = "This is a highlighting test. Just a test highlighting from postings highlighter."; + Document doc = new Document(); + doc.add(body); + body.setStringValue(firstValue); + + final String secondValue = "This is the second highlighting value to test highlighting with postings."; + Field body2 = new Field("body", "", offsetsType); + doc.add(body2); + body2.setStringValue(secondValue); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + + Query query = new TermQuery(new Term("body", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + + String firstHlValue = "This is a highlighting test. 
Just a test highlighting from postings highlighter."; + String secondHlValue = "This is the second highlighting value to test highlighting with postings."; + + XPostingsHighlighter highlighter = new XPostingsHighlighter() { + Iterator valuesIterator = Arrays.asList(firstValue, secondValue).iterator(); + Iterator offsetsIterator = Arrays.asList(0, firstValue.length() + 1).iterator(); + + @Override + protected String[][] loadFieldValues(IndexSearcher searcher, String[] fields, int[] docids, int maxLength) throws IOException { + return new String[][]{new String[]{valuesIterator.next()}}; + } + + @Override + protected int getOffsetForCurrentValue(String field, int docId) { + return offsetsIterator.next(); + } + + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + + @Override + protected Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) { + return new Passage[0]; + } + }; + + //first call using the WholeBreakIterator, we get now only the first value properly highlighted as we wish + String[] snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], equalTo(firstHlValue)); + + //second call using the WholeBreakIterator, we get now only the second value properly highlighted as we wish + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], equalTo(secondHlValue)); + + ir.close(); + dir.close(); + } + + @Test + public void testDiscreteHighlightingPerValue_MultipleQueryTerms() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + 
offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + final String firstValue = "This is the first sentence. This is the second sentence."; + Document doc = new Document(); + doc.add(body); + body.setStringValue(firstValue); + + final String secondValue = "This is the third sentence. This is the fourth sentence."; + Field body2 = new Field("body", "", offsetsType); + doc.add(body2); + body2.setStringValue(secondValue); + + final String thirdValue = "This is the fifth sentence"; + Field body3 = new Field("body", "", offsetsType); + doc.add(body3); + body3.setStringValue(thirdValue); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + + BooleanQuery query = new BooleanQuery(); + query.add(new BooleanClause(new TermQuery(new Term("body", "third")), BooleanClause.Occur.SHOULD)); + query.add(new BooleanClause(new TermQuery(new Term("body", "seventh")), BooleanClause.Occur.SHOULD)); + query.add(new BooleanClause(new TermQuery(new Term("body", "fifth")), BooleanClause.Occur.SHOULD)); + query.setMinimumNumberShouldMatch(1); + + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + + String secondHlValue = "This is the third sentence. 
This is the fourth sentence."; + String thirdHlValue = "This is the fifth sentence"; + + XPostingsHighlighter highlighter = new XPostingsHighlighter() { + Iterator<String> valuesIterator = Arrays.asList(firstValue, secondValue, thirdValue).iterator(); + //start offset of each value within the merged content: lengths of the previous values plus one separator char each + Iterator<Integer> offsetsIterator = Arrays.asList(0, firstValue.length() + 1, firstValue.length() + secondValue.length() + 2).iterator(); + + @Override + protected String[][] loadFieldValues(IndexSearcher searcher, String[] fields, int[] docids, int maxLength) throws IOException { + return new String[][]{new String[]{valuesIterator.next()}}; + } + + @Override + protected int getOffsetForCurrentValue(String field, int docId) { + return offsetsIterator.next(); + } + + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + + @Override + protected Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) { + return new Passage[0]; + } + }; + + //first call using the WholeBreakIterator, we get now null as the first value doesn't hold any match + String[] snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], nullValue()); + + //second call using the WholeBreakIterator, we get now only the second value properly highlighted as we wish + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], equalTo(secondHlValue)); + + //third call using the WholeBreakIterator, we get now only the third value properly highlighted as we wish + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + assertThat(snippets[0], equalTo(thirdHlValue)); + + ir.close(); + dir.close(); + } + + /* + The following are tests that we added to make sure that certain behaviours are possible using the postings highlighter + They don't require our forked version, but only custom versions of methods 
that can be overridden and are already exposed to subclasses + */ + + /* + Tests that it's possible to obtain different fragments per document instead of a big string of concatenated fragments. + We use our own PassageFormatter for that and override the getFormatter method. + */ + @Test + public void testCustomPassageFormatterMultipleFragments() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + body.setStringValue("This test is another test. Not a good sentence. Test test test test."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + IndexSearcher searcher = newSearcher(ir); + Query query = new TermQuery(new Term("body", "test")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 5); + assertThat(snippets.length, equalTo(1)); + //default behaviour that we want to change + assertThat(snippets[0], equalTo("This test is another test. ... 
Test test test test.")); + + + final CustomPassageFormatter passageFormatter = new CustomPassageFormatter("", "", new DefaultEncoder()); + highlighter = new XPostingsHighlighter() { + @Override + protected XPassageFormatter getFormatter(String field) { + return passageFormatter; + } + }; + + final ScoreDoc scoreDocs[] = topDocs.scoreDocs; + int docids[] = new int[scoreDocs.length]; + int maxPassages[] = new int[scoreDocs.length]; + for (int i = 0; i < docids.length; i++) { + docids[i] = scoreDocs[i].doc; + maxPassages[i] = 5; + } + Map highlights = highlighter.highlightFieldsAsObjects(new String[]{"body"}, query, searcher, docids, maxPassages); + assertThat(highlights, notNullValue()); + assertThat(highlights.size(), equalTo(1)); + Object[] objectSnippets = highlights.get("body"); + assertThat(objectSnippets, notNullValue()); + assertThat(objectSnippets.length, equalTo(1)); + assertThat(objectSnippets[0], instanceOf(Snippet[].class)); + + Snippet[] snippetsSnippet = (Snippet[]) objectSnippets[0]; + assertThat(snippetsSnippet.length, equalTo(2)); + //multiple fragments as we wish + assertThat(snippetsSnippet[0].getText(), equalTo("This test is another test.")); + assertThat(snippetsSnippet[1].getText(), equalTo("Test test test test.")); + + ir.close(); + dir.close(); + } + + /* + Tests that it's possible to return no fragments when there's nothing to highlight + We do that by overriding the getEmptyHighlight method + */ + @Test + public void testHighlightWithNoMatches() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Field none = new 
Field("none", "", offsetsType); + Document doc = new Document(); + doc.add(body); + doc.add(none); + + body.setStringValue("This is a test. Just a test highlighting from postings. Feel free to ignore."); + none.setStringValue(body.stringValue()); + iw.addDocument(doc); + body.setStringValue("Highlighting the first term. Hope it works."); + none.setStringValue(body.stringValue()); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("none", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(2)); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 1); + //one snippet per document, even though the query matches nothing in the body field + assertThat(snippets.length, equalTo(2)); + //default behaviour: returns the first sentence with num passages = 1 + assertThat(snippets[0], equalTo("This is a test. ")); + assertThat(snippets[1], equalTo("Highlighting the first term. 
")); + + highlighter = new XPostingsHighlighter() { + @Override + protected Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) { + return new Passage[0]; + } + }; + snippets = highlighter.highlight("body", query, searcher, topDocs); + //Two null snippets if there are no matches, as we wish + assertThat(snippets.length, equalTo(2)); + assertThat(snippets[0], nullValue()); + assertThat(snippets[1], nullValue()); + + ir.close(); + dir.close(); + } + + /* + Tests that it's possible to avoid having fragments that span across different values + We do that by overriding the getMultiValuedSeparator and using a proper separator between values + */ + @Test + public void testCustomMultiValuedSeparator() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test. 
Just a test highlighting from postings"); + + Field body2 = new Field("body", "", offsetsType); + doc.add(body2); + body2.setStringValue("highlighter."); + iw.addDocument(doc); + + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertThat(topDocs.totalHits, equalTo(1)); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + //default behaviour: getting a fragment that spans across different values + assertThat(snippets[0], equalTo("Just a test highlighting from postings highlighter.")); + + + highlighter = new XPostingsHighlighter() { + @Override + protected char getMultiValuedSeparator(String field) { + //U+2029 PARAGRAPH SEPARATOR (PS): each value holds a discrete passage for highlighting + return 8233; + } + }; + snippets = highlighter.highlight("body", query, searcher, topDocs); + assertThat(snippets.length, equalTo(1)); + //getting a fragment that doesn't span across different values since we used the paragraph separator between the different values + assertThat(snippets[0], equalTo("Just a test highlighting from postings" + (char)8233)); + + ir.close(); + dir.close(); + } + + + + + /* + The following are all the existing postings highlighter tests, to make sure we don't introduce regressions in our own fork + */ + + @Test + public void testBasics() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + 
offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test. Just a test highlighting from postings. Feel free to ignore."); + iw.addDocument(doc); + body.setStringValue("Highlighting the first term. Hope it works."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(2, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertEquals(2, snippets.length); + assertEquals("Just a test highlighting from postings. ", snippets[0]); + assertEquals("Highlighting the first term. ", snippets[1]); + + ir.close(); + dir.close(); + } + + public void testFormatWithMatchExceedingContentLength2() throws Exception { + + String bodyText = "123 TEST 01234 TEST"; + + String[] snippets = formatWithMatchExceedingContentLength(bodyText); + + assertEquals(1, snippets.length); + assertEquals("123 TEST 01234 TE", snippets[0]); + } + + public void testFormatWithMatchExceedingContentLength3() throws Exception { + + String bodyText = "123 5678 01234 TEST TEST"; + + String[] snippets = formatWithMatchExceedingContentLength(bodyText); + + assertEquals(1, snippets.length); + assertEquals("123 5678 01234 TE", snippets[0]); + } + + public void testFormatWithMatchExceedingContentLength() throws Exception { + + String bodyText = "123 5678 01234 TEST"; + + String[] snippets = formatWithMatchExceedingContentLength(bodyText); + + assertEquals(1, snippets.length); + // LUCENE-5166: no snippet + assertEquals("123 5678 01234 TE", snippets[0]); + } + + private String[] 
formatWithMatchExceedingContentLength(String bodyText) throws IOException { + + int maxLength = 17; + + final Analyzer analyzer = new MockAnalyzer(random()); + + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, analyzer); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + final FieldType fieldType = new FieldType(TextField.TYPE_STORED); + fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + final Field body = new Field("body", bodyText, fieldType); + + Document doc = new Document(); + doc.add(body); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + + Query query = new TermQuery(new Term("body", "test")); + + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + + XPostingsHighlighter highlighter = new XPostingsHighlighter(maxLength); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + + + ir.close(); + dir.close(); + return snippets; + } + + // simple test highlighting last word. 
+ public void testHighlightLastWord() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test"); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "test")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertEquals(1, snippets.length); + assertEquals("This is a test", snippets[0]); + + ir.close(); + dir.close(); + } + + // simple test with one sentence documents. + @Test + public void testOneSentence() throws Exception { + Directory dir = newDirectory(); + // use simpleanalyzer for more natural tokenization (else "test." 
is a token) + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test."); + iw.addDocument(doc); + body.setStringValue("Test a one sentence document."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "test")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(2, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertEquals(2, snippets.length); + assertEquals("This is a test.", snippets[0]); + assertEquals("Test a one sentence document.", snippets[1]); + + ir.close(); + dir.close(); + } + + // simple test with multiple values that make a result longer than maxLength. + @Test + public void testMaxLengthWithMultivalue() throws Exception { + Directory dir = newDirectory(); + // use simpleanalyzer for more natural tokenization (else "test." 
is a token) + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Document doc = new Document(); + + for(int i = 0; i < 3 ; i++) { + Field body = new Field("body", "", offsetsType); + body.setStringValue("This is a multivalued field"); + doc.add(body); + } + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(40); + Query query = new TermQuery(new Term("body", "field")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertEquals(1, snippets.length); + assertTrue("Snippet should have maximum 40 characters plus the pre and post tags", + snippets[0].length() == (40 + "".length())); + + ir.close(); + dir.close(); + } + + @Test + public void testMultipleFields() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Field title = new Field("title", "", offsetsType); + Document doc = new Document(); + doc.add(body); + doc.add(title); + + body.setStringValue("This is a test. Just a test highlighting from postings. 
Feel free to ignore."); + title.setStringValue("I am hoping for the best."); + iw.addDocument(doc); + body.setStringValue("Highlighting the first term. Hope it works."); + title.setStringValue("But best may not be good enough."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + BooleanQuery query = new BooleanQuery(); + query.add(new TermQuery(new Term("body", "highlighting")), BooleanClause.Occur.SHOULD); + query.add(new TermQuery(new Term("title", "best")), BooleanClause.Occur.SHOULD); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(2, topDocs.totalHits); + Map snippets = highlighter.highlightFields(new String [] { "body", "title" }, query, searcher, topDocs); + assertEquals(2, snippets.size()); + assertEquals("Just a test highlighting from postings. ", snippets.get("body")[0]); + assertEquals("Highlighting the first term. ", snippets.get("body")[1]); + assertEquals("I am hoping for the best.", snippets.get("title")[0]); + assertEquals("But best may not be good enough.", snippets.get("title")[1]); + ir.close(); + dir.close(); + } + + @Test + public void testMultipleTerms() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test. Just a test highlighting from postings. Feel free to ignore."); + iw.addDocument(doc); + body.setStringValue("Highlighting the first term. 
Hope it works."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + BooleanQuery query = new BooleanQuery(); + query.add(new TermQuery(new Term("body", "highlighting")), BooleanClause.Occur.SHOULD); + query.add(new TermQuery(new Term("body", "just")), BooleanClause.Occur.SHOULD); + query.add(new TermQuery(new Term("body", "first")), BooleanClause.Occur.SHOULD); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(2, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertEquals(2, snippets.length); + assertEquals("Just a test highlighting from postings. ", snippets[0]); + assertEquals("Highlighting the first term. ", snippets[1]); + + ir.close(); + dir.close(); + } + + @Test + public void testMultiplePassages() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test. Just a test highlighting from postings. Feel free to ignore."); + iw.addDocument(doc); + body.setStringValue("This test is another test. Not a good sentence. 
Test test test test."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "test")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(2, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 2); + assertEquals(2, snippets.length); + assertEquals("This is a test. Just a test highlighting from postings. ", snippets[0]); + assertEquals("This test is another test. ... Test test test test.", snippets[1]); + + ir.close(); + dir.close(); + } + + @Test + public void testUserFailedToIndexOffsets() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType positionsType = new FieldType(TextField.TYPE_STORED); + positionsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); + Field body = new Field("body", "", positionsType); + Field title = new StringField("title", "", Field.Store.YES); + Document doc = new Document(); + doc.add(body); + doc.add(title); + + body.setStringValue("This is a test. Just a test highlighting from postings. Feel free to ignore."); + title.setStringValue("test"); + iw.addDocument(doc); + body.setStringValue("This test is another test. Not a good sentence. 
Test test test test."); + title.setStringValue("test"); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "test")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(2, topDocs.totalHits); + try { + highlighter.highlight("body", query, searcher, topDocs, 2); + fail("did not hit expected exception"); + } catch (IllegalArgumentException iae) { + // expected + } + + try { + highlighter.highlight("title", new TermQuery(new Term("title", "test")), searcher, topDocs, 2); + fail("did not hit expected exception"); + } catch (IllegalArgumentException iae) { + // expected + } + ir.close(); + dir.close(); + } + + @Test + public void testBuddhism() throws Exception { + String text = "This eight-volume set brings together seminal papers in Buddhist studies from a vast " + + "range of academic disciplines published over the last forty years. With a new introduction " + + "by the editor, this collection is a unique and unrivalled research resource for both " + + "student and scholar. 
Coverage includes: - Buddhist origins; early history of Buddhism in " + + "South and Southeast Asia - early Buddhist Schools and Doctrinal History; Theravada Doctrine " + + "- the Origins and nature of Mahayana Buddhism; some Mahayana religious topics - Abhidharma " + + "and Madhyamaka - Yogacara, the Epistemological tradition, and Tathagatagarbha - Tantric " + + "Buddhism (Including China and Japan); Buddhism in Nepal and Tibet - Buddhism in South and " + + "Southeast Asia, and - Buddhism in China, East Asia, and Japan."; + Directory dir = newDirectory(); + Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.SIMPLE, true); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, analyzer); + + FieldType positionsType = new FieldType(TextField.TYPE_STORED); + positionsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", text, positionsType); + Document document = new Document(); + document.add(body); + iw.addDocument(document); + IndexReader ir = iw.getReader(); + iw.close(); + IndexSearcher searcher = newSearcher(ir); + PhraseQuery query = new PhraseQuery(); + query.add(new Term("body", "buddhist")); + query.add(new Term("body", "origins")); + TopDocs topDocs = searcher.search(query, 10); + assertEquals(1, topDocs.totalHits); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 2); + assertEquals(1, snippets.length); + assertTrue(snippets[0].contains("Buddhist origins")); + ir.close(); + dir.close(); + } + + @Test + public void testCuriousGeorge() throws Exception { + String text = "It’s the formula for success for preschoolers—Curious George and fire trucks! " + + "Curious George and the Firefighters is a story based on H. A. and Margret Rey’s " + + "popular primate and painted in the original watercolor and charcoal style. 
" + + "Firefighters are a famously brave lot, but can they withstand a visit from one curious monkey?"; + Directory dir = newDirectory(); + Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.SIMPLE, true); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, analyzer); + FieldType positionsType = new FieldType(TextField.TYPE_STORED); + positionsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", text, positionsType); + Document document = new Document(); + document.add(body); + iw.addDocument(document); + IndexReader ir = iw.getReader(); + iw.close(); + IndexSearcher searcher = newSearcher(ir); + PhraseQuery query = new PhraseQuery(); + query.add(new Term("body", "curious")); + query.add(new Term("body", "george")); + TopDocs topDocs = searcher.search(query, 10); + assertEquals(1, topDocs.totalHits); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 2); + assertEquals(1, snippets.length); + assertFalse(snippets[0].contains("CuriousCurious")); + ir.close(); + dir.close(); + } + + @Test + public void testCambridgeMA() throws Exception { + BufferedReader r = new BufferedReader(new InputStreamReader( + this.getClass().getResourceAsStream("CambridgeMA.utf8"), "UTF-8")); + String text = r.readLine(); + r.close(); + Directory dir = newDirectory(); + Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.SIMPLE, true); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, analyzer); + FieldType positionsType = new FieldType(TextField.TYPE_STORED); + positionsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", text, positionsType); + Document document = new Document(); + document.add(body); + iw.addDocument(document); + IndexReader ir = iw.getReader(); + iw.close(); + IndexSearcher searcher = newSearcher(ir); + 
BooleanQuery query = new BooleanQuery(); + query.add(new TermQuery(new Term("body", "porter")), BooleanClause.Occur.SHOULD); + query.add(new TermQuery(new Term("body", "square")), BooleanClause.Occur.SHOULD); + query.add(new TermQuery(new Term("body", "massachusetts")), BooleanClause.Occur.SHOULD); + TopDocs topDocs = searcher.search(query, 10); + assertEquals(1, topDocs.totalHits); + XPostingsHighlighter highlighter = new XPostingsHighlighter(Integer.MAX_VALUE-1); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 2); + assertEquals(1, snippets.length); + assertTrue(snippets[0].contains("Square")); + assertTrue(snippets[0].contains("Porter")); + ir.close(); + dir.close(); + } + + @Test + public void testPassageRanking() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test. Just highlighting from postings. This is also a much sillier test. Feel free to test test test test test test test."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "test")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 2); + assertEquals(1, snippets.length); + assertEquals("This is a test. ... 
Feel free to test test test test test test test.", snippets[0]); + + ir.close(); + dir.close(); + } + + @Test + public void testBooleanMustNot() throws Exception { + Directory dir = newDirectory(); + Analyzer analyzer = new MockAnalyzer(random(), MockTokenizer.SIMPLE, true); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, analyzer); + FieldType positionsType = new FieldType(TextField.TYPE_STORED); + positionsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "This sentence has both terms. This sentence has only terms.", positionsType); + Document document = new Document(); + document.add(body); + iw.addDocument(document); + IndexReader ir = iw.getReader(); + iw.close(); + IndexSearcher searcher = newSearcher(ir); + BooleanQuery query = new BooleanQuery(); + query.add(new TermQuery(new Term("body", "terms")), BooleanClause.Occur.SHOULD); + BooleanQuery query2 = new BooleanQuery(); + query.add(query2, BooleanClause.Occur.SHOULD); + query2.add(new TermQuery(new Term("body", "both")), BooleanClause.Occur.MUST_NOT); + TopDocs topDocs = searcher.search(query, 10); + assertEquals(1, topDocs.totalHits); + XPostingsHighlighter highlighter = new XPostingsHighlighter(Integer.MAX_VALUE-1); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 2); + assertEquals(1, snippets.length); + assertFalse(snippets[0].contains("both")); + ir.close(); + dir.close(); + } + + @Test + public void testHighlightAllText() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new 
Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test. Just highlighting from postings. This is also a much sillier test. Feel free to test test test test test test test."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(10000) { + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + }; + Query query = new TermQuery(new Term("body", "test")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 2); + assertEquals(1, snippets.length); + assertEquals("This is a test. Just highlighting from postings. This is also a much sillier test. Feel free to test test test test test test test.", snippets[0]); + + ir.close(); + dir.close(); + } + + @Test + public void testSpecificDocIDs() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test. Just a test highlighting from postings. Feel free to ignore."); + iw.addDocument(doc); + body.setStringValue("Highlighting the first term. 
Hope it works."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(2, topDocs.totalHits); + ScoreDoc[] hits = topDocs.scoreDocs; + int[] docIDs = new int[2]; + docIDs[0] = hits[0].doc; + docIDs[1] = hits[1].doc; + String snippets[] = highlighter.highlightFields(new String[] {"body"}, query, searcher, docIDs, new int[] { 1 }).get("body"); + assertEquals(2, snippets.length); + assertEquals("Just a test highlighting from postings. ", snippets[0]); + assertEquals("Highlighting the first term. ", snippets[1]); + + ir.close(); + dir.close(); + } + + @Test + public void testCustomFieldValueSource() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + Document doc = new Document(); + + FieldType offsetsType = new FieldType(TextField.TYPE_NOT_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + final String text = "This is a test. Just highlighting from postings. This is also a much sillier test. 
Feel free to test test test test test test test."; + Field body = new Field("body", text, offsetsType); + doc.add(body); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + + XPostingsHighlighter highlighter = new XPostingsHighlighter(10000) { + @Override + protected String[][] loadFieldValues(IndexSearcher searcher, String[] fields, int[] docids, int maxLength) throws IOException { + assert fields.length == 1; + assert docids.length == 1; + String[][] contents = new String[1][1]; + contents[0][0] = text; + return contents; + } + + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + }; + + Query query = new TermQuery(new Term("body", "test")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs, 2); + assertEquals(1, snippets.length); + assertEquals("This is a test. Just highlighting from postings. This is also a much sillier test. Feel free to test test test test test test test.", snippets[0]); + + ir.close(); + dir.close(); + } + + /** Make sure highlighter returns first N sentences if + * there were no hits. */ + @Test + public void testEmptyHighlights() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Document doc = new Document(); + + Field body = new Field("body", "test this is. another sentence this test has. 
far away is that planet.", offsetsType); + doc.add(body); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("body", "highlighting")); + int[] docIDs = new int[] {0}; + String snippets[] = highlighter.highlightFields(new String[] {"body"}, query, searcher, docIDs, new int[] { 2 }).get("body"); + assertEquals(1, snippets.length); + assertEquals("test this is. another sentence this test has. ", snippets[0]); + + ir.close(); + dir.close(); + } + + /** Make sure we can customize how an empty + * highlight is returned. */ + @Test + public void testCustomEmptyHighlights() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Document doc = new Document(); + + Field body = new Field("body", "test this is. another sentence this test has. 
far away is that planet.", offsetsType); + doc.add(body); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter() { + @Override + public Passage[] getEmptyHighlight(String fieldName, BreakIterator bi, int maxPassages) { + return new Passage[0]; + } + }; + Query query = new TermQuery(new Term("body", "highlighting")); + int[] docIDs = new int[] {0}; + String snippets[] = highlighter.highlightFields(new String[] {"body"}, query, searcher, docIDs, new int[] { 2 }).get("body"); + assertEquals(1, snippets.length); + assertNull(snippets[0]); + + ir.close(); + dir.close(); + } + + /** Make sure highlighter returns whole text when there + * are no hits and BreakIterator is null. */ + @Test + public void testEmptyHighlightsWhole() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Document doc = new Document(); + + Field body = new Field("body", "test this is. another sentence this test has. 
far away is that planet.", offsetsType); + doc.add(body); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(10000) { + @Override + protected BreakIterator getBreakIterator(String field) { + return new WholeBreakIterator(); + } + }; + Query query = new TermQuery(new Term("body", "highlighting")); + int[] docIDs = new int[] {0}; + String snippets[] = highlighter.highlightFields(new String[] {"body"}, query, searcher, docIDs, new int[] { 2 }).get("body"); + assertEquals(1, snippets.length); + assertEquals("test this is. another sentence this test has. far away is that planet.", snippets[0]); + + ir.close(); + dir.close(); + } + + /** Make sure highlighter is OK with entirely missing + * field. */ + @Test + public void testFieldIsMissing() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Document doc = new Document(); + + Field body = new Field("body", "test this is. another sentence this test has. 
far away is that planet.", offsetsType); + doc.add(body); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + Query query = new TermQuery(new Term("bogus", "highlighting")); + int[] docIDs = new int[] {0}; + String snippets[] = highlighter.highlightFields(new String[] {"bogus"}, query, searcher, docIDs, new int[] { 2 }).get("bogus"); + assertEquals(1, snippets.length); + assertNull(snippets[0]); + + ir.close(); + dir.close(); + } + + @Test + public void testFieldIsJustSpace() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + + Document doc = new Document(); + doc.add(new Field("body", " ", offsetsType)); + doc.add(new Field("id", "id", offsetsType)); + iw.addDocument(doc); + + doc = new Document(); + doc.add(new Field("body", "something", offsetsType)); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + int docID = searcher.search(new TermQuery(new Term("id", "id")), 1).scoreDocs[0].doc; + + Query query = new TermQuery(new Term("body", "highlighting")); + int[] docIDs = new int[1]; + docIDs[0] = docID; + String snippets[] = highlighter.highlightFields(new String[] {"body"}, query, searcher, docIDs, new int[] { 2 }).get("body"); + assertEquals(1, snippets.length); + assertEquals(" ", snippets[0]); + + ir.close(); + dir.close(); + } + + @Test + public void testFieldIsEmptyString() throws Exception { + Directory dir = newDirectory(); 
+ IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + + Document doc = new Document(); + doc.add(new Field("body", "", offsetsType)); + doc.add(new Field("id", "id", offsetsType)); + iw.addDocument(doc); + + doc = new Document(); + doc.add(new Field("body", "something", offsetsType)); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter(); + int docID = searcher.search(new TermQuery(new Term("id", "id")), 1).scoreDocs[0].doc; + + Query query = new TermQuery(new Term("body", "highlighting")); + int[] docIDs = new int[1]; + docIDs[0] = docID; + String snippets[] = highlighter.highlightFields(new String[] {"body"}, query, searcher, docIDs, new int[] { 2 }).get("body"); + assertEquals(1, snippets.length); + assertNull(snippets[0]); + + ir.close(); + dir.close(); + } + + @Test + public void testMultipleDocs() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + + int numDocs = atLeast(100); + for(int i=0;i snippets = highlighter.highlightFields(new String[] { "title", "body" }, query, searcher, new int[] { 0 }, new int[] { 1, 2 }); + String titleHighlight = snippets.get("title")[0]; + String bodyHighlight = snippets.get("body")[0]; + assertEquals("This is a test. 
", titleHighlight); + assertEquals("This is a test. Just a test highlighting from postings. ", bodyHighlight); + ir.close(); + dir.close(); + } + + public void testEncode() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + body.setStringValue("This is a test. Just a test highlighting from postings. Feel free to ignore."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + PostingsHighlighter highlighter = new PostingsHighlighter() { + @Override + protected PassageFormatter getFormatter(String field) { + return new DefaultPassageFormatter("", "", "... ", true); + } + }; + Query query = new TermQuery(new Term("body", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertEquals(1, snippets.length); + assertEquals("Just a test highlighting from <i>postings</i>. ", snippets[0]); + + ir.close(); + dir.close(); + } + + /** customizing the gap separator to force a sentence break */ + public void testGapSeparator() throws Exception { + Directory dir = newDirectory(); + // use simpleanalyzer for more natural tokenization (else "test." 
is a token) + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random(), MockTokenizer.SIMPLE, true)); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Document doc = new Document(); + + Field body1 = new Field("body", "", offsetsType); + body1.setStringValue("This is a multivalued field"); + doc.add(body1); + + Field body2 = new Field("body", "", offsetsType); + body2.setStringValue("This is something different"); + doc.add(body2); + + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + PostingsHighlighter highlighter = new PostingsHighlighter() { + @Override + protected char getMultiValuedSeparator(String field) { + assert field.equals("body"); + return '\u2029'; + } + }; + Query query = new TermQuery(new Term("body", "field")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + String snippets[] = highlighter.highlight("body", query, searcher, topDocs); + assertEquals(1, snippets.length); + assertEquals("This is a multivalued field\u2029", snippets[0]); + + ir.close(); + dir.close(); + } + + // LUCENE-4906 + public void testObjectFormatter() throws Exception { + Directory dir = newDirectory(); + IndexWriterConfig iwc = newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())); + iwc.setMergePolicy(newLogMergePolicy()); + RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc); + + FieldType offsetsType = new FieldType(TextField.TYPE_STORED); + offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); + Field body = new Field("body", "", offsetsType); + Document doc = new Document(); + doc.add(body); + + 
body.setStringValue("This is a test. Just a test highlighting from postings. Feel free to ignore."); + iw.addDocument(doc); + + IndexReader ir = iw.getReader(); + iw.close(); + + IndexSearcher searcher = newSearcher(ir); + XPostingsHighlighter highlighter = new XPostingsHighlighter() { + @Override + protected XPassageFormatter getFormatter(String field) { + return new XPassageFormatter() { + XPassageFormatter defaultFormatter = new XDefaultPassageFormatter(); + + @Override + public String[] format(Passage passages[], String content) { + // Just turns the String snippet into a length 2 + // array of String + return new String[] {"blah blah", defaultFormatter.format(passages, content).toString()}; + } + }; + } + }; + + Query query = new TermQuery(new Term("body", "highlighting")); + TopDocs topDocs = searcher.search(query, null, 10, Sort.INDEXORDER); + assertEquals(1, topDocs.totalHits); + int[] docIDs = new int[1]; + docIDs[0] = topDocs.scoreDocs[0].doc; + Map snippets = highlighter.highlightFieldsAsObjects(new String[]{"body"}, query, searcher, docIDs, new int[] {1}); + Object[] bodySnippets = snippets.get("body"); + assertEquals(1, bodySnippets.length); + assertTrue(Arrays.equals(new String[] {"blah blah", "Just a test highlighting from postings. 
"}, (String[]) bodySnippets[0])); + + ir.close(); + dir.close(); + } +} diff --git a/src/test/java/org/elasticsearch/search/highlight/HighlighterSearchTests.java b/src/test/java/org/elasticsearch/search/highlight/HighlighterSearchTests.java index e45bf1cf341..c00dbb9dd5b 100644 --- a/src/test/java/org/elasticsearch/search/highlight/HighlighterSearchTests.java +++ b/src/test/java/org/elasticsearch/search/highlight/HighlighterSearchTests.java @@ -25,24 +25,29 @@ import org.elasticsearch.action.search.SearchPhaseExecutionException; import org.elasticsearch.action.search.SearchRequestBuilder; import org.elasticsearch.action.search.SearchResponse; import org.elasticsearch.action.search.SearchType; +import org.elasticsearch.action.search.ShardSearchFailure; import org.elasticsearch.common.Priority; import org.elasticsearch.common.settings.ImmutableSettings; import org.elasticsearch.common.settings.ImmutableSettings.Builder; +import org.elasticsearch.common.text.Text; import org.elasticsearch.common.xcontent.XContentBuilder; import org.elasticsearch.common.xcontent.XContentFactory; -import org.elasticsearch.index.query.*; +import org.elasticsearch.index.query.FilterBuilders; +import org.elasticsearch.index.query.IdsQueryBuilder; +import org.elasticsearch.index.query.MatchQueryBuilder; import org.elasticsearch.index.query.MatchQueryBuilder.Operator; import org.elasticsearch.index.query.MatchQueryBuilder.Type; +import org.elasticsearch.index.query.QueryBuilders; import org.elasticsearch.rest.RestStatus; import org.elasticsearch.search.SearchHit; import org.elasticsearch.search.builder.SearchSourceBuilder; import org.elasticsearch.test.AbstractIntegrationTest; -import org.elasticsearch.test.hamcrest.ElasticsearchAssertions; import org.hamcrest.Matcher; import org.junit.Test; import java.io.IOException; import java.util.Arrays; +import java.util.Map; import static org.elasticsearch.action.search.SearchType.QUERY_THEN_FETCH; import static 
org.elasticsearch.client.Requests.searchRequest; @@ -177,19 +182,21 @@ public class HighlighterSearchTests extends AbstractIntegrationTest { endObject(). endObject(). endObject(); - ElasticsearchAssertions.assertAcked(prepareCreate("test").addMapping("test", builder).setSettings( - ImmutableSettings.settingsBuilder() .put("index.number_of_shards", 1) - .put("index.number_of_replicas", 0) - .put("analysis.filter.wordDelimiter.type", "word_delimiter") - .put("analysis.filter.wordDelimiter.type.split_on_numerics", false) - .put("analysis.filter.wordDelimiter.generate_word_parts", true) - .put("analysis.filter.wordDelimiter.generate_number_parts", true) - .put("analysis.filter.wordDelimiter.catenate_words", true) - .put("analysis.filter.wordDelimiter.catenate_numbers", true) - .put("analysis.filter.wordDelimiter.catenate_all", false) - .put("analysis.analyzer.custom_analyzer.tokenizer", "whitespace") - .putArray("analysis.analyzer.custom_analyzer.filter", "lowercase", "wordDelimiter")) - ); + + assertAcked(prepareCreate("test").addMapping("test", builder).setSettings( + ImmutableSettings.settingsBuilder().put("index.number_of_shards", 1) + .put("index.number_of_replicas", 0) + .put("analysis.filter.wordDelimiter.type", "word_delimiter") + .put("analysis.filter.wordDelimiter.type.split_on_numerics", false) + .put("analysis.filter.wordDelimiter.generate_word_parts", true) + .put("analysis.filter.wordDelimiter.generate_number_parts", true) + .put("analysis.filter.wordDelimiter.catenate_words", true) + .put("analysis.filter.wordDelimiter.catenate_numbers", true) + .put("analysis.filter.wordDelimiter.catenate_all", false) + .put("analysis.analyzer.custom_analyzer.tokenizer", "whitespace") + .putArray("analysis.analyzer.custom_analyzer.filter", "lowercase", "wordDelimiter")) + ); + ensureGreen(); client().prepareIndex("test", "test", "1") .setSource(XContentFactory.jsonBuilder() @@ -520,6 +527,80 @@ public class HighlighterSearchTests extends AbstractIntegrationTest { } } + 
@Test + public void testSourceLookupHighlightingUsingPostingsHighlighter() throws Exception { + client().admin().indices().prepareCreate("test").setSettings(ImmutableSettings.settingsBuilder().put("index.number_of_shards", 2)) + .addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties") + // we don't store title, now lets see if it works... + .startObject("title").field("type", "string").field("store", "no").field("index_options", "offsets").endObject() + .startObject("attachments").startObject("properties").startObject("body").field("type", "string").field("store", "no").field("index_options", "offsets").endObject().endObject().endObject() + .endObject().endObject().endObject()) + .execute().actionGet(); + client().admin().cluster().prepareHealth().setWaitForEvents(Priority.LANGUID).setWaitForYellowStatus().execute().actionGet(); + + for (int i = 0; i < 5; i++) { + client().prepareIndex("test", "type1", Integer.toString(i)) + .setSource(XContentFactory.jsonBuilder().startObject() + .array("title", "This is a test on the highlighting bug present in elasticsearch. 
Hopefully it works.", + "This is the second bug to perform highlighting on.") + .startArray("attachments").startObject().field("body", "attachment for this test").endObject().startObject().field("body", "attachment 2").endObject().endArray() + .endObject()) + .setRefresh(true).execute().actionGet(); + } + + SearchResponse search = client().prepareSearch() + .setQuery(matchQuery("title", "bug")) + //asking for the whole field to be highlighted + .addHighlightedField("title", -1, 0) + .execute().actionGet(); + + assertNoFailures(search); + + assertThat(search.getHits().totalHits(), equalTo(5l)); + assertThat(search.getHits().hits().length, equalTo(5)); + + for (SearchHit hit : search.getHits()) { + Text[] fragments = hit.highlightFields().get("title").fragments(); + assertThat(fragments.length, equalTo(2)); + assertThat(fragments[0].string(), equalTo("This is a test on the highlighting bug present in elasticsearch. Hopefully it works.")); + assertThat(fragments[1].string(), equalTo("This is the second bug to perform highlighting on.")); + } + + search = client().prepareSearch() + .setQuery(matchQuery("title", "bug")) + //sentences will be generated out of each value + .addHighlightedField("title") + .execute().actionGet(); + + assertNoFailures(search); + + assertThat(search.getHits().totalHits(), equalTo(5l)); + assertThat(search.getHits().hits().length, equalTo(5)); + + for (SearchHit hit : search.getHits()) { + Text[] fragments = hit.highlightFields().get("title").fragments(); + assertThat(fragments.length, equalTo(2)); + assertThat(fragments[0].string(), equalTo("This is a test on the highlighting bug present in elasticsearch.")); + assertThat(fragments[1].string(), equalTo("This is the second bug to perform highlighting on.")); + } + + search = client().prepareSearch() + .setQuery(matchQuery("attachments.body", "attachment")) + .addHighlightedField("attachments.body", -1, 2) + .execute().actionGet(); + + assertNoFailures(search); + + 
assertThat(search.getHits().totalHits(), equalTo(5l)); + assertThat(search.getHits().hits().length, equalTo(5)); + + for (SearchHit hit : search.getHits()) { + //shorter fragments are scored higher + assertThat(hit.highlightFields().get("attachments.body").fragments()[0].string(), equalTo("attachment for this test")); + assertThat(hit.highlightFields().get("attachments.body").fragments()[1].string(), equalTo("attachment 2")); + } + } + @Test public void testHighlightIssue1994() throws Exception { client().admin().indices().prepareCreate("test").setSettings(ImmutableSettings.settingsBuilder().put("index.number_of_shards", 2)) @@ -1662,11 +1743,11 @@ public class HighlighterSearchTests extends AbstractIntegrationTest { public void testHighlightNoMatchSize() throws IOException { prepareCreate("test") - .addMapping("type1", "text", "type=string," + randomStoreField() + "term_vector=with_positions_offsets") + .addMapping("type1", "text", "type=string," + randomStoreField() + "term_vector=with_positions_offsets,index_options=offsets") .get(); ensureGreen(); - String text = "I am pretty long so some of me should get cut off"; + String text = "I am pretty long so some of me should get cut off. 
Second sentence"; index("test", "type1", "1", "text", text); refresh(); @@ -1682,6 +1763,10 @@ public class HighlighterSearchTests extends AbstractIntegrationTest { response = client().prepareSearch("test").addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); + // When noMatchSize is set to 0 you also shouldn't get any field.highlighterType("plain").noMatchSize(0); response = client().prepareSearch("test").addHighlightedField(field).get(); @@ -1691,55 +1776,88 @@ public class HighlighterSearchTests extends AbstractIntegrationTest { response = client().prepareSearch("test").addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); + // When noMatchSize is between 0 and the size of the string field.highlighterType("plain").noMatchSize(21); response = client().prepareSearch("test").addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo("I am pretty long so")); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so")); // The FVH also works but the fragment is longer than the plain highlighter because of boundary_max_scan field.highlighterType("fvh"); response = client().prepareSearch("test").addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo("I am pretty long so some")); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so some")); + + // Postings hl also works but the fragment is the whole first sentence (size ignored) + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so some 
of me should get cut off.")); // We can also ask for a fragment longer than the input string and get the whole string field.highlighterType("plain").noMatchSize(text.length() * 2); response = client().prepareSearch("test").addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo(text)); + assertHighlight(response, 0, "text", 0, 1, equalTo(text)); - // Same for the fvh field.highlighterType("fvh"); response = client().prepareSearch("test").addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo(text)); + assertHighlight(response, 0, "text", 0, 1, equalTo(text)); + + //no difference using postings hl as the noMatchSize is ignored (just needs to be greater than 0) + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so some of me should get cut off.")); // We can also ask for a fragment exactly the size of the input field and get the whole field field.highlighterType("plain").noMatchSize(text.length()); response = client().prepareSearch("test").addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo(text)); + assertHighlight(response, 0, "text", 0, 1, equalTo(text)); - // Same for the fvh field.highlighterType("fvh"); response = client().prepareSearch("test").addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo(text)); + assertHighlight(response, 0, "text", 0, 1, equalTo(text)); + + //no difference using postings hl as the noMatchSize is ignored (just needs to be greater than 0) + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so some of me should get cut off.")); // You can set noMatchSize globally in the highlighter as well field.highlighterType("plain").noMatchSize(null); response = 
client().prepareSearch("test").setHighlighterNoMatchSize(21).addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo("I am pretty long so")); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so")); - // Same for the fvh field.highlighterType("fvh"); response = client().prepareSearch("test").setHighlighterNoMatchSize(21).addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo("I am pretty long so some")); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so some")); + + field.highlighterType("postings"); + response = client().prepareSearch("test").setHighlighterNoMatchSize(21).addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so some of me should get cut off.")); + + // We don't break if noMatchSize is less than zero though + field.highlighterType("plain").noMatchSize(randomIntBetween(Integer.MIN_VALUE, -1)); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); + + field.highlighterType("fvh"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); + + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); } @Test public void testHighlightNoMatchSizeWithMultivaluedFields() throws IOException { prepareCreate("test") - .addMapping("type1", "text", "type=string," + randomStoreField() + "term_vector=with_positions_offsets") + .addMapping("type1", "text", "type=string," + randomStoreField() + "term_vector=with_positions_offsets,index_options=offsets") .get(); ensureGreen(); - String text1 = "I am pretty long so some of me should get cut off"; + String text1 = "I am pretty long so some of me should get cut off. 
We'll see how that goes."; String text2 = "I am short"; index("test", "type1", "1", "text", new String[] {text1, text2}); refresh(); @@ -1751,56 +1869,85 @@ public class HighlighterSearchTests extends AbstractIntegrationTest { .highlighterType("plain") .noMatchSize(21); SearchResponse response = client().prepareSearch("test").addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo("I am pretty long so")); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so")); - // And the fvh should work as well field.highlighterType("fvh"); response = client().prepareSearch("test").addHighlightedField(field).get(); - assertHighlight(response, 0, "text", 0, equalTo("I am pretty long so some")); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so some")); + + // Postings hl also works but the fragment is the whole first sentence (size ignored) + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 1, equalTo("I am pretty long so some of me should get cut off.")); // And noMatchSize returns nothing when the first entry is empty string! 
index("test", "type1", "2", "text", new String[] {"", text2}); refresh(); + IdsQueryBuilder idsQueryBuilder = QueryBuilders.idsQuery("type1").addIds("2"); field.highlighterType("plain"); response = client().prepareSearch("test") - .setQuery(QueryBuilders.idsQuery("type1").addIds("2")) + .setQuery(idsQueryBuilder) .addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); - // And the fvh should do the same field.highlighterType("fvh"); - response = client().prepareSearch("test").addHighlightedField(field).get(); + response = client().prepareSearch("test") + .setQuery(idsQueryBuilder) + .addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); + + field.highlighterType("postings"); + response = client().prepareSearch("test") + .setQuery(idsQueryBuilder) + .addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); // But if the field was actually empty then you should get no highlighting field index("test", "type1", "3", "text", new String[] {}); refresh(); + idsQueryBuilder = QueryBuilders.idsQuery("type1").addIds("3"); field.highlighterType("plain"); response = client().prepareSearch("test") - .setQuery(QueryBuilders.idsQuery("type1").addIds("3")) + .setQuery(idsQueryBuilder) .addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); - // And the fvh should do the same field.highlighterType("fvh"); - response = client().prepareSearch("test").addHighlightedField(field).get(); + response = client().prepareSearch("test") + .setQuery(idsQueryBuilder) + .addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); + + field.highlighterType("postings"); + response = client().prepareSearch("test") + .setQuery(idsQueryBuilder) + .addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); // Same for if the field doesn't even exist on the document index("test", "type1", "4"); refresh(); + + idsQueryBuilder = QueryBuilders.idsQuery("type1").addIds("4"); 
field.highlighterType("plain"); response = client().prepareSearch("test") - .setQuery(QueryBuilders.idsQuery("type1").addIds("4")) + .setQuery(idsQueryBuilder) .addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); - // And the fvh should do the same field.highlighterType("fvh"); - response = client().prepareSearch("test").addHighlightedField(field).get(); + response = client().prepareSearch("test") + .setQuery(idsQueryBuilder) + .addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); + field.highlighterType("postings"); + response = client().prepareSearch("test") + .setQuery(idsQueryBuilder) + .addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); + // Again same if the field isn't mapped field = new HighlightBuilder.Field("unmapped") .highlighterType("plain") @@ -1808,9 +1955,573 @@ public class HighlighterSearchTests extends AbstractIntegrationTest { response = client().prepareSearch("test").addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); - // And the fvh should work as well field.highlighterType("fvh"); response = client().prepareSearch("test").addHighlightedField(field).get(); assertNotHighlighted(response, 0, "text"); + + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertNotHighlighted(response, 0, "text"); + } + + @Test + public void testHighlightNoMatchSizeNumberOfFragments() throws IOException { + prepareCreate("test") + .addMapping("type1", "text", "type=string," + randomStoreField() + "term_vector=with_positions_offsets,index_options=offsets") + .get(); + ensureGreen(); + + String text1 = "This is the first sentence. This is the second sentence."; + String text2 = "This is the third sentence. 
This is the fourth sentence."; + String text3 = "This is the fifth sentence"; + index("test", "type1", "1", "text", new String[] {text1, text2, text3}); + refresh(); + + // The no match fragment should come from the first value of a multi-valued field + HighlightBuilder.Field field = new HighlightBuilder.Field("text") + .fragmentSize(1) + .numOfFragments(0) + .highlighterType("plain") + .noMatchSize(20); + SearchResponse response = client().prepareSearch("test").addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 1, equalTo("This is the first")); + + field.highlighterType("fvh"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 1, equalTo("This is the first sentence")); + + // Postings hl also works but the fragment is the whole first sentence (size ignored) + field.highlighterType("postings"); + response = client().prepareSearch("test").addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 1, equalTo("This is the first sentence.")); + + //if there's a match we only return the values with matches (whole value as number_of_fragments == 0) + MatchQueryBuilder queryBuilder = QueryBuilders.matchQuery("text", "third fifth"); + field.highlighterType("plain"); + response = client().prepareSearch("test").setQuery(queryBuilder).addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 2, equalTo("This is the third sentence. This is the fourth sentence.")); + assertHighlight(response, 0, "text", 1, 2, equalTo("This is the fifth sentence")); + + field.highlighterType("fvh"); + response = client().prepareSearch("test").setQuery(queryBuilder).addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 2, equalTo("This is the third sentence. 
This is the fourth sentence.")); + assertHighlight(response, 0, "text", 1, 2, equalTo("This is the fifth sentence")); + + field.highlighterType("postings"); + response = client().prepareSearch("test").setQuery(queryBuilder).addHighlightedField(field).get(); + assertHighlight(response, 0, "text", 0, 2, equalTo("This is the third sentence. This is the fourth sentence.")); + assertHighlight(response, 0, "text", 1, 2, equalTo("This is the fifth sentence")); + } + + @Test + public void testPostingsHighlighter() throws Exception { + client().admin().indices().prepareCreate("test").addMapping("type1", type1PostingsffsetsMapping()).get(); + ensureGreen(); + + client().prepareIndex("test", "type1") + .setSource("field1", "this is a test", "field2", "The quick brown fox jumps over the lazy dog").setRefresh(true).get(); + + logger.info("--> highlighting and searching on field1"); + SearchSourceBuilder source = searchSource() + .query(termQuery("field1", "test")) + .highlight(highlight().field("field1").preTags("").postTags("")); + + SearchResponse searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field1").fragments()[0].string(), equalTo("this is a test")); + + logger.info("--> searching on _all, highlighting on field1"); + source = searchSource() + .query(termQuery("_all", "test")) + .highlight(highlight().field("field1").preTags("").postTags("")); + + searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field1").fragments()[0].string(), equalTo("this is a test")); + + logger.info("--> searching on _all, highlighting on field2"); + source = searchSource() + .query(termQuery("_all", "quick")) + .highlight(highlight().field("field2").order("score").preTags("").postTags("")); + + searchResponse = 
client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field2").fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog")); + + logger.info("--> searching on _all, highlighting on field2"); + source = searchSource() + .query(prefixQuery("_all", "qui")) + .highlight(highlight().field("field2").preTags("").postTags("")); + + searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + //no snippets produced for prefix query, not supported by postings highlighter + assertThat(searchResponse.getHits().getAt(0).highlightFields().size(), equalTo(0)); + + //lets fall back to the standard highlighter then, what people would do with unsupported queries + logger.info("--> searching on _all, highlighting on field2, falling back to the plain highlighter"); + source = searchSource() + .query(prefixQuery("_all", "qui")) + .highlight(highlight().field("field2").preTags("").postTags("").highlighterType("highlighter")); + + searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field2").fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog")); + } + + @Test + public void testPostingsHighlighterMultipleFields() throws Exception { + client().admin().indices().prepareCreate("test").addMapping("type1", type1PostingsffsetsMapping()).get(); + ensureGreen(); + + client().prepareIndex("test", "type1") + .setSource("field1", "this is a test1", "field2", "this is a test2", "field3", "this is a test3").setRefresh(true).get(); + + logger.info("--> highlighting and searching on field1"); + SearchSourceBuilder source = searchSource() + .query(boolQuery() + .should(termQuery("field1", "test1")) + .should(termQuery("field2", "test2")) 
+ .should(termQuery("field3", "test3"))) + .highlight(highlight().preTags("").postTags("").requireFieldMatch(false) + .field("field1").field("field2").field(new HighlightBuilder.Field("field3").preTags("").postTags(""))); + + SearchResponse searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field1").fragments()[0].string(), equalTo("this is a test1")); + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field2").fragments()[0].string(), equalTo("this is a test2")); + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field3").fragments()[0].string(), equalTo("this is a test3")); + } + + @Test + public void testPostingsHighlighterNumberOfFragments() throws Exception { + client().admin().indices().prepareCreate("test").addMapping("type1", type1PostingsffsetsMapping()).get(); + ensureGreen(); + + client().prepareIndex("test", "type1", "1") + .setSource("field1", "The quick brown fox jumps over the lazy dog. The lazy red fox jumps over the quick dog. The quick brown dog jumps over the lazy fox.", + "field2", "The quick brown fox jumps over the lazy dog. The lazy red fox jumps over the quick dog. 
The quick brown dog jumps over the lazy fox.") + .setRefresh(true).get(); + + logger.info("--> highlighting and searching on field1"); + SearchSourceBuilder source = searchSource() + .query(termQuery("field1", "fox")) + .highlight(highlight() + .field(new HighlightBuilder.Field("field1").numOfFragments(5).preTags("").postTags("")) + .field(new HighlightBuilder.Field("field2").numOfFragments(2).preTags("").postTags(""))); + + SearchResponse searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + + Map<String, HighlightField> highlightFieldMap = searchResponse.getHits().getAt(0).highlightFields(); + assertThat(highlightFieldMap.size(), equalTo(2)); + HighlightField field1 = highlightFieldMap.get("field1"); + assertThat(field1.fragments().length, equalTo(3)); + assertThat(field1.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog.")); + assertThat(field1.fragments()[1].string(), equalTo("The lazy red fox jumps over the quick dog.")); + assertThat(field1.fragments()[2].string(), equalTo("The quick brown dog jumps over the lazy fox.")); + + HighlightField field2 = highlightFieldMap.get("field2"); + assertThat(field2.fragments().length, equalTo(2)); + assertThat(field2.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog.")); + assertThat(field2.fragments()[1].string(), equalTo("The lazy red fox jumps over the quick dog.")); + + + client().prepareIndex("test", "type1", "2") + .setSource("field1", new String[]{"The quick brown fox jumps over the lazy dog. 
Second sentence not finished", "The lazy red fox jumps over the quick dog.", "The quick brown dog jumps over the lazy fox."}) + .setRefresh(true).get(); + + source = searchSource() + .query(termQuery("field1", "fox")) + .highlight(highlight() + .field(new HighlightBuilder.Field("field1").numOfFragments(0).preTags("").postTags(""))); + + searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 2l); + + for (SearchHit searchHit : searchResponse.getHits()) { + highlightFieldMap = searchHit.highlightFields(); + assertThat(highlightFieldMap.size(), equalTo(1)); + field1 = highlightFieldMap.get("field1"); + assertThat(field1, notNullValue()); + if ("1".equals(searchHit.id())) { + assertThat(field1.fragments().length, equalTo(1)); + assertThat(field1.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog. The lazy red fox jumps over the quick dog. The quick brown dog jumps over the lazy fox.")); + } else if ("2".equals(searchHit.id())) { + assertThat(field1.fragments().length, equalTo(3)); + assertThat(field1.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog. Second sentence not finished")); + assertThat(field1.fragments()[1].string(), equalTo("The lazy red fox jumps over the quick dog.")); + assertThat(field1.fragments()[2].string(), equalTo("The quick brown dog jumps over the lazy fox.")); + } else { + fail("Only hits with id 1 and 2 are returned"); + } + } + } + + @Test + public void testPostingsHighlighterRequireFieldMatch() throws Exception { + client().admin().indices().prepareCreate("test").addMapping("type1", type1PostingsffsetsMapping()).get(); + ensureGreen(); + + client().prepareIndex("test", "type1") + .setSource("field1", "The quick brown fox jumps over the lazy dog. The lazy red fox jumps over the quick dog. The quick brown dog jumps over the lazy fox.", + "field2", "The quick brown fox jumps over the lazy dog. 
The lazy red fox jumps over the quick dog. The quick brown dog jumps over the lazy fox.") + .setRefresh(true).get(); + + logger.info("--> highlighting and searching on field1"); + SearchSourceBuilder source = searchSource() + .query(termQuery("field1", "fox")) + .highlight(highlight() + .field(new HighlightBuilder.Field("field1").requireFieldMatch(true).preTags("").postTags("")) + .field(new HighlightBuilder.Field("field2").requireFieldMatch(true).preTags("").postTags(""))); + + SearchResponse searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + //field2 is not returned highlighted because of the require field match option set to true + Map<String, HighlightField> highlightFieldMap = searchResponse.getHits().getAt(0).highlightFields(); + assertThat(highlightFieldMap.size(), equalTo(1)); + HighlightField field1 = highlightFieldMap.get("field1"); + assertThat(field1.fragments().length, equalTo(3)); + assertThat(field1.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog.")); + assertThat(field1.fragments()[1].string(), equalTo("The lazy red fox jumps over the quick dog.")); + assertThat(field1.fragments()[2].string(), equalTo("The quick brown dog jumps over the lazy fox.")); + + + logger.info("--> highlighting and searching on field1 and field2 - require field match set to false"); + source = searchSource() + .query(termQuery("field1", "fox")) + .highlight(highlight() + .field(new HighlightBuilder.Field("field1").requireFieldMatch(false).preTags("").postTags("")) + .field(new HighlightBuilder.Field("field2").requireFieldMatch(false).preTags("").postTags(""))); + + searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + //field2 is now returned highlighted because the require field match option is set to false + highlightFieldMap = searchResponse.getHits().getAt(0).highlightFields(); + assertThat(highlightFieldMap.size(), equalTo(2)); 
+ field1 = highlightFieldMap.get("field1"); + assertThat(field1.fragments().length, equalTo(3)); + assertThat(field1.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog.")); + assertThat(field1.fragments()[1].string(), equalTo("The lazy red fox jumps over the quick dog.")); + assertThat(field1.fragments()[2].string(), equalTo("The quick brown dog jumps over the lazy fox.")); + + HighlightField field2 = highlightFieldMap.get("field2"); + assertThat(field2.fragments().length, equalTo(3)); + assertThat(field2.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog.")); + assertThat(field2.fragments()[1].string(), equalTo("The lazy red fox jumps over the quick dog.")); + assertThat(field2.fragments()[2].string(), equalTo("The quick brown dog jumps over the lazy fox.")); + + + logger.info("--> highlighting and searching on field1 and field2 via multi_match query"); + source = searchSource() + .query(multiMatchQuery("fox", "field1", "field2")) + .highlight(highlight() + .field(new HighlightBuilder.Field("field1").requireFieldMatch(true).preTags("").postTags("")) + .field(new HighlightBuilder.Field("field2").requireFieldMatch(true).preTags("").postTags(""))); + + searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + //field2 is now returned highlighted thanks to the multi_match query on both fields + highlightFieldMap = searchResponse.getHits().getAt(0).highlightFields(); + assertThat(highlightFieldMap.size(), equalTo(2)); + field1 = highlightFieldMap.get("field1"); + assertThat(field1.fragments().length, equalTo(3)); + assertThat(field1.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog.")); + assertThat(field1.fragments()[1].string(), equalTo("The lazy red fox jumps over the quick dog.")); + assertThat(field1.fragments()[2].string(), equalTo("The quick brown dog jumps over the lazy fox.")); + + field2 = 
highlightFieldMap.get("field2"); + assertThat(field2.fragments().length, equalTo(3)); + assertThat(field2.fragments()[0].string(), equalTo("The quick brown fox jumps over the lazy dog.")); + assertThat(field2.fragments()[1].string(), equalTo("The lazy red fox jumps over the quick dog.")); + assertThat(field2.fragments()[2].string(), equalTo("The quick brown dog jumps over the lazy fox.")); + } + + @Test + public void testPostingsHighlighterOrderByScore() throws Exception { + client().admin().indices().prepareCreate("test").addMapping("type1", type1PostingsffsetsMapping()).get(); + ensureGreen(); + + client().prepareIndex("test", "type1") + .setSource("field1", new String[]{"This sentence contains one match, not that short. This sentence contains two sentence matches. This one contains no matches.", + "This is the second value's first sentence. This one contains no matches. This sentence contains three sentence occurrences (sentence).", + "One sentence match here and scored lower since the text is quite long, not that appealing. 
This one contains no matches."}) + .setRefresh(true).get(); + + logger.info("--> highlighting and searching on field1"); + SearchSourceBuilder source = searchSource() + .query(termQuery("field1", "sentence")) + .highlight(highlight().field("field1").order("score")); + + SearchResponse searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + Map<String, HighlightField> highlightFieldMap = searchResponse.getHits().getAt(0).highlightFields(); + assertThat(highlightFieldMap.size(), equalTo(1)); + HighlightField field1 = highlightFieldMap.get("field1"); + assertThat(field1.fragments().length, equalTo(5)); + assertThat(field1.fragments()[0].string(), equalTo("This sentence contains three sentence occurrences (sentence).")); + assertThat(field1.fragments()[1].string(), equalTo("This sentence contains two sentence matches.")); + assertThat(field1.fragments()[2].string(), equalTo("This is the second value's first sentence.")); + assertThat(field1.fragments()[3].string(), equalTo("This sentence contains one match, not that short.")); + assertThat(field1.fragments()[4].string(), equalTo("One sentence match here and scored lower since the text is quite long, not that appealing.")); + + //let's now use number_of_fragments = 0, so that we highlight each value as a whole without breaking it into snippets, while still sorting the values by score + source = searchSource() + .query(termQuery("field1", "sentence")) + .highlight(highlight().field("field1", -1, 0).order("score")); + + searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + highlightFieldMap = searchResponse.getHits().getAt(0).highlightFields(); + assertThat(highlightFieldMap.size(), equalTo(1)); + field1 = highlightFieldMap.get("field1"); + assertThat(field1.fragments().length, equalTo(3)); + assertThat(field1.fragments()[0].string(), equalTo("This is the second value's first sentence. This one contains no matches. 
This sentence contains three sentence occurrences (sentence).")); + assertThat(field1.fragments()[1].string(), equalTo("This sentence contains one match, not that short. This sentence contains two sentence matches. This one contains no matches.")); + assertThat(field1.fragments()[2].string(), equalTo("One sentence match here and scored lower since the text is quite long, not that appealing. This one contains no matches.")); + } + + @Test + public void testPostingsHighlighterEscapeHtml() throws Exception { + client().admin().indices().prepareCreate("test").setSettings(ImmutableSettings.settingsBuilder().put("index.number_of_shards", 2)) + .addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties") + .startObject("title").field("type", "string").field("store", "yes").field("index_options", "offsets").endObject() + .endObject().endObject().endObject()) + .get(); + ensureYellow(); + for (int i = 0; i < 5; i++) { + client().prepareIndex("test", "type1", Integer.toString(i)) + .setSource("title", "This is a html escaping highlighting test for *&? 
elasticsearch").setRefresh(true).execute().actionGet(); + } + + SearchResponse searchResponse = client().prepareSearch() + .setQuery(matchQuery("title", "test")) + .setHighlighterEncoder("html") + .addHighlightedField("title").get(); + + assertHitCount(searchResponse, 5l); + assertThat(searchResponse.getHits().hits().length, equalTo(5)); + + for (SearchHit hit : searchResponse.getHits()) { + assertThat(hit.highlightFields().get("title").fragments()[0].string(), equalTo("This is a html escaping highlighting test for *&?")); + } + } + + @Test + public void testPostingsHighlighterMultiMapperWithStore() throws Exception { + client().admin().indices().prepareCreate("test").setSettings(ImmutableSettings.settingsBuilder().put("index.number_of_shards", 2)) + .addMapping("type1", jsonBuilder().startObject().startObject("type1") + //just to make sure that we hit the stored fields rather than the _source + .startObject("_source").field("enabled", false).endObject() + .startObject("properties") + .startObject("title").field("type", "multi_field").startObject("fields") + .startObject("title").field("type", "string").field("store", "yes").field("index_options", "offsets").endObject() + .startObject("key").field("type", "string").field("store", "yes").field("index_options", "offsets").field("analyzer", "whitespace").endObject() + .endObject().endObject() + .endObject().endObject().endObject()) + .execute().actionGet(); + ensureGreen(); + client().prepareIndex("test", "type1", "1").setSource("title", "this is a test . 
Second sentence.").get(); + refresh(); + // simple search on body with standard analyzer with a simple field query + SearchResponse searchResponse = client().prepareSearch() + //lets make sure we analyze the query and we highlight the resulting terms + .setQuery(matchQuery("title", "This is a Test")) + .addHighlightedField("title").get(); + + assertHitCount(searchResponse, 1l); + SearchHit hit = searchResponse.getHits().getAt(0); + assertThat(hit.source(), nullValue()); + + //stopwords are not highlighted since not indexed + assertThat(hit.highlightFields().get("title").fragments()[0].string(), equalTo("this is a test .")); + + // search on title.key and highlight on title + searchResponse = client().prepareSearch() + .setQuery(matchQuery("title.key", "this is a test")) + .addHighlightedField("title.key").get(); + assertHitCount(searchResponse, 1l); + + hit = searchResponse.getHits().getAt(0); + //stopwords are now highlighted since we used only whitespace analyzer here + assertThat(hit.highlightFields().get("title.key").fragments()[0].string(), equalTo("this is a test .")); + } + + @Test + public void testPostingsHighlighterMultiMapperFromSource() throws Exception { + client().admin().indices().prepareCreate("test").setSettings(ImmutableSettings.settingsBuilder().put("index.number_of_shards", 2)) + .addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties") + .startObject("title").field("type", "multi_field").startObject("fields") + .startObject("title").field("type", "string").field("store", "no").field("index_options", "offsets").endObject() + .startObject("key").field("type", "string").field("store", "no").field("index_options", "offsets").field("analyzer", "whitespace").endObject() + .endObject().endObject() + .endObject().endObject().endObject()) + .get(); + ensureGreen(); + + client().prepareIndex("test", "type1", "1").setSource("title", "this is a test").get(); + refresh(); + + // simple search on body with standard 
analyzer with a simple field query + SearchResponse searchResponse = client().prepareSearch() + .setQuery(matchQuery("title", "this is a test")) + .addHighlightedField("title") + .execute().actionGet(); + + assertHitCount(searchResponse, 1l); + + SearchHit hit = searchResponse.getHits().getAt(0); + assertThat(hit.highlightFields().get("title").fragments()[0].string(), equalTo("this is a test")); + + // search on title.key and highlight on title.key + searchResponse = client().prepareSearch() + .setQuery(matchQuery("title.key", "this is a test")) + .addHighlightedField("title.key") + .get(); + assertHitCount(searchResponse, 1l); + + hit = searchResponse.getHits().getAt(0); + assertThat(hit.highlightFields().get("title.key").fragments()[0].string(), equalTo("this is a test")); + } + + @Test + public void testPostingsHighlighterShouldFailIfNoOffsets() throws Exception { + client().admin().indices().prepareCreate("test").setSettings(ImmutableSettings.settingsBuilder().put("index.number_of_shards", 2)) + .addMapping("type1", jsonBuilder().startObject().startObject("type1").startObject("properties") + .startObject("title").field("type", "string").field("store", "yes").field("index_options", "docs").endObject() + .endObject().endObject().endObject()) + .get(); + ensureGreen(); + + for (int i = 0; i < 5; i++) { + client().prepareIndex("test", "type1", Integer.toString(i)) + .setSource("title", "This is a test for the postings highlighter").setRefresh(true).get(); + } + refresh(); + + SearchResponse search = client().prepareSearch() + .setQuery(matchQuery("title", "this is a test")) + .addHighlightedField("title") + .get(); + assertNoFailures(search); + + search = client().prepareSearch() + .setQuery(matchQuery("title", "this is a test")) + .addHighlightedField("title") + .setHighlighterType("postings-highlighter") + .get(); + assertThat(search.getFailedShards(), equalTo(2)); + for (ShardSearchFailure shardSearchFailure : search.getShardFailures()) { + 
assertThat(shardSearchFailure.reason(), containsString("the field [title] should be indexed with positions and offsets in the postings list to be used with postings highlighter")); + } + + search = client().prepareSearch() + .setQuery(matchQuery("title", "this is a test")) + .addHighlightedField("title") + .setHighlighterType("postings") + .get(); + + assertThat(search.getFailedShards(), equalTo(2)); + for (ShardSearchFailure shardSearchFailure : search.getShardFailures()) { + assertThat(shardSearchFailure.reason(), containsString("the field [title] should be indexed with positions and offsets in the postings list to be used with postings highlighter")); + } + } + + @Test + public void testPostingsHighlighterBoostingQuery() throws ElasticSearchException, IOException { + client().admin().indices().prepareCreate("test").addMapping("type1", type1PostingsffsetsMapping()).get(); + ensureGreen(); + client().prepareIndex("test", "type1").setSource("field1", "this is a test", "field2", "The quick brown fox jumps over the lazy dog! 
Second sentence.") + .get(); + refresh(); + + logger.info("--> highlighting and searching on field1"); + SearchSourceBuilder source = searchSource() + .query(boostingQuery().positive(termQuery("field2", "brown")).negative(termQuery("field2", "foobar")).negativeBoost(0.5f)) + .highlight(highlight().field("field2").preTags("").postTags("")); + + SearchResponse searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field2").fragments()[0].string(), + equalTo("The quick brown fox jumps over the lazy dog!")); + } + + @Test + public void testPostingsHighlighterCommonTermsQuery() throws ElasticSearchException, IOException { + client().admin().indices().prepareCreate("test").addMapping("type1", type1PostingsffsetsMapping()).get(); + ensureGreen(); + + client().prepareIndex("test", "type1").setSource("field1", "this is a test", "field2", "The quick brown fox jumps over the lazy dog! 
Second sentence.").get(); + refresh(); + logger.info("--> highlighting and searching on field1"); + SearchSourceBuilder source = searchSource().query(commonTerms("field2", "quick brown").cutoffFrequency(100)) + .highlight(highlight().field("field2").preTags("").postTags("")); + + SearchResponse searchResponse = client().search(searchRequest("test").source(source)).actionGet(); + assertHitCount(searchResponse, 1l); + + assertThat(searchResponse.getHits().getAt(0).highlightFields().get("field2").fragments()[0].string(), + equalTo("The quick brown fox jumps over the lazy dog!")); + } + + public XContentBuilder type1PostingsffsetsMapping() throws IOException { + return XContentFactory.jsonBuilder().startObject().startObject("type1") + .startObject("_all").field("store", "yes").field("index_options", "offsets").endObject() + .startObject("properties") + .startObject("field1").field("type", "string").field("index_options", "offsets").endObject() + .startObject("field2").field("type", "string").field("index_options", "offsets").endObject() + .endObject() + .endObject().endObject(); + } + + @Test + @Slow + public void testPostingsHighlighterManyDocs() throws Exception { + client().admin().indices().prepareCreate("test").addMapping("type1", type1PostingsffsetsMapping()).get(); + ensureGreen(); + + int COUNT = between(20, 100); + logger.info("--> indexing docs"); + for (int i = 0; i < COUNT; i++) { + client().prepareIndex("test", "type1", Integer.toString(i)).setSource("field1", "Sentence test " + i + ". 
Sentence two.").get(); + } + refresh(); + + logger.info("--> searching explicitly on field1 and highlighting on it"); + SearchResponse searchResponse = client().prepareSearch() + .setSize(COUNT) + .setQuery(termQuery("field1", "test")) + .addHighlightedField("field1") + .get(); + assertHitCount(searchResponse, (long)COUNT); + assertThat(searchResponse.getHits().hits().length, equalTo(COUNT)); + for (SearchHit hit : searchResponse.getHits()) { + assertThat(hit.highlightFields().get("field1").fragments()[0].string(), equalTo("Sentence test " + hit.id() + ".")); + } + + logger.info("--> searching explicitly on field1 and highlighting on it, with DFS"); + searchResponse = client().prepareSearch() + .setSearchType(SearchType.DFS_QUERY_THEN_FETCH) + .setSize(COUNT) + .setQuery(termQuery("field1", "test")) + .addHighlightedField("field1") + .get(); + assertHitCount(searchResponse, (long)COUNT); + assertThat(searchResponse.getHits().hits().length, equalTo(COUNT)); + for (SearchHit hit : searchResponse.getHits()) { + assertThat(hit.highlightFields().get("field1").fragments()[0].string(), equalTo("Sentence test " + hit.id() + ".")); + } } } diff --git a/src/test/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertions.java b/src/test/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertions.java index 803d608f43b..c941076842d 100644 --- a/src/test/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertions.java +++ b/src/test/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertions.java @@ -187,10 +187,18 @@ public class ElasticsearchAssertions { } public static void assertHighlight(SearchResponse resp, int hit, String field, int fragment, Matcher matcher) { + assertHighlight(resp, hit, field, fragment, greaterThan(fragment), matcher); + } + + public static void assertHighlight(SearchResponse resp, int hit, String field, int fragment, int totalFragments, Matcher matcher) { + assertHighlight(resp, hit, field, fragment, equalTo(totalFragments), matcher); + } + + 
private static void assertHighlight(SearchResponse resp, int hit, String field, int fragment, Matcher fragmentsMatcher, Matcher matcher) { assertNoFailures(resp); assertThat("not enough hits", resp.getHits().hits().length, greaterThan(hit)); assertThat(resp.getHits().hits()[hit].getHighlightFields(), hasKey(field)); - assertThat(resp.getHits().hits()[hit].getHighlightFields().get(field).fragments().length, greaterThan(fragment)); + assertThat(resp.getHits().hits()[hit].getHighlightFields().get(field).fragments().length, fragmentsMatcher); assertThat(resp.getHits().hits()[hit].highlightFields().get(field).fragments()[fragment].string(), matcher); assertVersionSerializable(resp); } diff --git a/src/test/resources/org/apache/lucene/search/postingshighlight/CambridgeMA.utf8 b/src/test/resources/org/apache/lucene/search/postingshighlight/CambridgeMA.utf8 new file mode 100644 index 00000000000..d60b6fa15d8 --- /dev/null +++ b/src/test/resources/org/apache/lucene/search/postingshighlight/CambridgeMA.utf8 @@ -0,0 +1 @@ +{{Distinguish|Cambridge, England}} {{primary sources|date=June 2012}} {{Use mdy dates|date=January 2011}} {{Infobox settlement |official_name = Cambridge, Massachusetts |nickname = |motto = "Boston's Left Bank"{{cite web|url= http://www.epodunk.com/cgi-bin/genInfo.php?locIndex=2894|title=Profile for Cambridge, Massachusetts, MA|publisher= ePodunk |accessdate= November 1, 2012}} |image_skyline = CambridgeMACityHall2.jpg |imagesize = 175px |image_caption = Cambridge City Hall |image_seal = |image_flag = |image_map = Cambridge ma highlight.png |mapsize = 250px |map_caption = Location in Middlesex County in Massachusetts |image_map1 = |mapsize1 = |map_caption1 = |coordinates_region = US-MA |subdivision_type = Country |subdivision_name = United States |subdivision_type1 = State |subdivision_name1 = [[Massachusetts]] |subdivision_type2 = [[List of counties in Massachusetts|County]] |subdivision_name2 = [[Middlesex County, Massachusetts|Middlesex]] 
|established_title = Settled |established_date = 1630 |established_title2 = Incorporated |established_date2 = 1636 |established_title3 = |established_date3 = |government_type = [[Council-manager government|Council-City Manager]] |leader_title = Mayor |leader_name = Henrietta Davis |leader_title1 = [[City manager|City Manager]] |leader_name1 = [[Robert W. Healy]] |area_magnitude = |area_total_km2 = 18.47 |area_total_sq_mi = 7.13 |area_land_km2 = 16.65 |area_land_sq_mi = 6.43 |area_water_km2 = 1.81 |area_water_sq_mi = 0.70 |population_as_of = 2010 |population_blank2_title = [[Demonym]] |population_blank2 = [[Cantabrigian]] |settlement_type = City |population_total = 105,162 |population_density_km2 = 6,341.98 |population_density_sq_mi = 16,422.08 |elevation_m = 12 |elevation_ft = 40 |timezone = [[Eastern Time Zone|Eastern]] |utc_offset = -5 |timezone_DST = [[Eastern Time Zone|Eastern]] |utc_offset_DST = -4 |coordinates_display = display=inline,title |latd = 42 |latm = 22 |lats = 25 |latNS = N |longd = 71 |longm = 06 |longs = 38 |longEW = W |website = [http://www.cambridgema.gov/ www.cambridgema.gov] |postal_code_type = ZIP code |postal_code = 02138, 02139, 02140, 02141, 02142 |area_code = [[Area code 617|617]] / [[Area code 857|857]] |blank_name = [[Federal Information Processing Standard|FIPS code]] |blank_info = 25-11000 |blank1_name = [[Geographic Names Information System|GNIS]] feature ID |blank1_info = 0617365 |footnotes = }} '''Cambridge''' is a city in [[Middlesex County, Massachusetts|Middlesex County]], [[Massachusetts]], [[United States]], in the [[Greater Boston]] area. 
It was named in honor of the [[University of Cambridge]] in [[England]], an important center of the [[Puritan]] theology embraced by the town's founders.{{cite book|last=Degler|first=Carl Neumann|title=Out of Our Pasts: The Forces That Shaped Modern America|publisher=HarperCollins|location=New York|year=1984|url=http://books.google.com/books?id=NebLe1ueuGQC&pg=PA18&lpg=PA18&dq=cambridge+university+puritans+newtowne#v=onepage&q=&f=false|accessdate=September 9, 2009 | isbn=978-0-06-131985-3}} Cambridge is home to two of the world's most prominent universities, [[Harvard University]] and the [[Massachusetts Institute of Technology]]. According to the [[2010 United States Census]], the city's population was 105,162.{{cite web|url=http://2010.census.gov/news/releases/operations/cb11-cn104.html |title=Census 2010 News | U.S. Census Bureau Delivers Massachusetts' 2010 Census Population Totals, Including First Look at Race and Hispanic Origin Data for Legislative Redistricting |publisher=2010.census.gov |date=2011-03-22 |accessdate=2012-04-28}} It is the fifth most populous city in the state, behind [[Boston]], [[Worcester, MA|Worcester]], [[Springfield, MA|Springfield]], and [[Lowell, Massachusetts|Lowell]]. Cambridge was one of the two [[county seat]]s of Middlesex County prior to the abolition of county government in 1997; [[Lowell, Massachusetts|Lowell]] was the other. ==History== {{See also|Timeline of Cambridge, Massachusetts history}} [[File:Formation of Massachusetts towns.svg|thumb|A map showing the original boundaries of Cambridge]] The site for what would become Cambridge was chosen in December 1630, because it was located safely upriver from Boston Harbor, which made it easily defensible from attacks by enemy ships. 
Also, the water from the local spring was so good that the local Native Americans believed it had medicinal properties.{{Citation needed|date=November 2009}} [[Thomas Dudley]], his daughter [[Anne Bradstreet]] and her husband Simon were among the first settlers of the town. The first houses were built in the spring of 1631. The settlement was initially referred to as "the newe towne".{{cite book|last=Drake|first=Samuel Adams|title=History of Middlesex County, Massachusetts|publisher=Estes and Lauriat|location=Boston|year=1880|volume=1|pages=305–16|url=http://books.google.com/books?id=QGolOAyd9RMC&pg=PA316&lpg=PA305&dq=newetowne&ct=result#PPA305,M1|accessdate=December 26, 2008}} Official Massachusetts records show the name capitalized as '''Newe Towne''' by 1632.{{cite book|title=Report on the Custody and Condition of the Public Records of Parishes|publisher=Massachusetts Secretary of the Commonwealth|url=http://books.google.com/books?id=IyYWAAAAYAAJ&pg=RA1-PA298&lpg=RA1-PA298&dq=%22Ordered+That+Newtowne+shall+henceforward+be+called%22|location=Boston|year=1889|page=298|accessdate=December 24, 2008}} Located at the first convenient [[Charles River]] crossing west of [[Boston]], Newe Towne was one of a number of towns (including Boston, [[Dorchester, Massachusetts|Dorchester]], [[Watertown, Massachusetts|Watertown]], and [[Weymouth, Massachusetts|Weymouth]]) founded by the 700 original [[Puritan]] colonists of the [[Massachusetts Bay Colony]] under governor [[John Winthrop]]. The original village site is in the heart of today's [[Harvard Square]]. The marketplace where farmers brought in crops from surrounding towns to sell survives today as the small park at the corner of John F. Kennedy (J.F.K.) and Winthrop Streets, then at the edge of a salt marsh, since filled. 
The town included a much larger area than the present city, with various outlying parts becoming independent towns over the years: [[Newton, Massachusetts|Newton (originally Cambridge Village, then Newtown)]] in 1688,{{cite book |last= Ritter |first= Priscilla R. |coauthors= Thelma Fleishman |title= Newton, Massachusetts 1679–1779: A Biographical Directory |year= 1982 |publisher= New England Historic Genealogical Society }} [[Lexington, Massachusetts|Lexington (Cambridge Farms)]] in 1712, and both [[Arlington, Massachusetts|West Cambridge (originally Menotomy)]] and [[Brighton, Massachusetts|Brighton (Little Cambridge)]] in 1807.{{cite web |url=http://www.brightonbot.com/history.php |title=A Short History of Allston-Brighton |first=Marchione |last=William P. |author= |authorlink= |coauthors= |date= |month= |year=2011 |work=Brighton-Allston Historical Society |publisher=Brighton Board of Trade |location= |page= |pages= |at= |language= |trans_title= |arxiv= |asin= |bibcode= |doi= |doibroken= |isbn= |issn= |jfm= |jstor= |lccn= |mr= |oclc= |ol= |osti= |pmc = |pmid= |rfc= |ssrn= |zbl= |id= |archiveurl= |archivedate= |deadurl= |accessdate=December 21, 2011 |quote= |ref= |separator= |postscript=}} Part of West Cambridge joined the new town of [[Belmont, Massachusetts|Belmont]] in 1859, and the rest of West Cambridge was renamed Arlington in 1867; Brighton was annexed by Boston in 1874. 
In the late 19th century, various schemes for annexing Cambridge itself to the City of Boston were pursued and rejected.{{cite news |title=ANNEXATION AND ITS FRUITS |author=Staff writer |first= |last= |authorlink= |url=http://query.nytimes.com/gst/abstract.html?res=9901E4DC173BEF34BC4D52DFB766838F669FDE |agency= |newspaper=[[The New York Times]] |publisher= |isbn= |issn= |pmid= |pmd= |bibcode= |doi= |date=January 15, 1874, Wednesday |page= 4 |pages= |accessdate=|archiveurl=http://query.nytimes.com/mem/archive-free/pdf?res=9901E4DC173BEF34BC4D52DFB766838F669FDE |archivedate=January 15, 1874 |ref= }}{{cite news |title=BOSTON'S ANNEXATION SCHEMES.; PROPOSAL TO ABSORB CAMBRIDGE AND OTHER NEAR-BY TOWNS |author=Staff writer |first= |last= |authorlink= |url=http://query.nytimes.com/gst/abstract.html?res=9C05E1DC1F39E233A25754C2A9659C94639ED7CF |agency= |newspaper=[[The New York Times]] |publisher= |isbn= |issn= |pmid= |pmd= |bibcode= |doi= |date=March 26, 1892, Wednesday |page= 11 |pages= |accessdate=August 21, 2010|archiveurl=http://query.nytimes.com/mem/archive-free/pdf?res=9C05E1DC1F39E233A25754C2A9659C94639ED7CF |archivedate=March 27, 1892 |ref= }} In 1636, [[Harvard College]] was founded by the colony to train [[minister (religion)|ministers]] and the new town was chosen for its site by [[Thomas Dudley]]. By 1638, the name "Newe Towne" had "compacted by usage into 'Newtowne'." 
In May 1638{{cite book|title=The Cambridge of Eighteen Hundred and Ninety-six|editor=Arthur Gilman, ed.|publisher=Committee on the Memorial Volume|location=Cambridge|year=1896|page=8}}{{cite web|author=Harvard News Office |url=http://news.harvard.edu/gazette/2002/05.02/02-history.html |title=''Harvard Gazette'' historical calendar giving May 12, 1638 as date of name change; certain other sources say May 2, 1638 or late 1637 |publisher=News.harvard.edu |date=2002-05-02 |accessdate=2012-04-28}} the name was changed to '''Cambridge''' in honor of the [[University of Cambridge|university]] in [[Cambridge, England]].{{cite book |last= Hannah Winthrop Chapter, D.A.R. |title= Historic Guide to Cambridge |edition= Second |year= 1907 |publisher= Hannah Winthrop Chapter, D.A.R. |location= Cambridge, Mass. |pages= 20–21 |quote= On October 15, 1637, the Great and General Court passed a vote that: "The college is ordered to bee at Newetowne." In this same year the name of Newetowne was changed to Cambridge, ("It is ordered that Newetowne shall henceforward be called Cambridge") in honor of the university in Cambridge, England, where many of the early settlers were educated. }} The first president ([[Henry Dunster]]), the first benefactor ([[John Harvard (clergyman)|John Harvard]]), and the first schoolmaster ([[Nathaniel Eaton]]) of Harvard were all Cambridge University alumni, as was the then ruling (and first) governor of the [[Massachusetts Bay Colony]], John Winthrop. 
In 1629, Winthrop had led the signing of the founding document of the city of Boston, which was known as the [[Cambridge Agreement]], after the university.{{cite web|url=http://www.winthropsociety.org/doc_cambr.php|publisher=The Winthrop Society|title=Descendants of the Great Migration|accessdate=September 8, 2008}} It was Governor Thomas Dudley who, in 1650, signed the charter creating the corporation which still governs Harvard College.{{cite web|url=http://hul.harvard.edu/huarc/charter.html |title=Harvard Charter of 1650, Harvard University Archives, Harvard University, harvard.edu |publisher=Hul.harvard.edu |date= |accessdate=2012-04-28}}{{cite book |last1= |first1= |authorlink1= |editor1-first= |editor1-last= |editor1-link= |others= |title=Constitution of the Commonwealth of Massachusetts|url=http://www.mass.gov/legis/const.htm |accessdate=December 13, 2009 |edition= |series= |volume= |date=September 1, 1779 |publisher=The General Court of Massachusetts |location= |isbn= |oclc= |doi= |page= |pages=|chapter=Chapter V: The University at Cambridge, and encouragement of literature, etc. |chapterurl= |ref= |bibcode= }} [[Image:Washington taking command of the American Army at Cambridge, 1775 - NARA - 532874.tif|thumb|right|George Washington in Cambridge, 1775]] Cambridge grew slowly as an agricultural village eight miles (13 km) by road from Boston, the capital of the colony. By the [[American Revolution]], most residents lived near the [[Cambridge Common|Common]] and Harvard College, with farms and estates comprising most of the town. Most of the inhabitants were descendants of the original Puritan colonists, but there was also a small elite of [[Anglicans|Anglican]] "worthies" who were not involved in village life, who made their livings from estates, investments, and trade, and lived in mansions along "the Road to Watertown" (today's [[Brattle Street (Cambridge, Massachusetts)|Brattle Street]], still known as [[Tory Row]]). 
In 1775, [[George Washington]] came up from [[Virginia]] to take command of fledgling volunteer American soldiers camped on the [[Cambridge Common]], today called the birthplace of the [[U.S. Army]]. (The name of today's nearby Sheraton Commander Hotel refers to that event.) Most of the Tory estates were confiscated after the Revolution. On January 24, 1776, [[Henry Knox]] arrived with artillery captured from [[Fort Ticonderoga]], which enabled Washington to drive the British army out of Boston. [[File:Cambridge 1873 WardMap.jpg|thumb|300px|left|A map of Cambridge from 1873]] Between 1790 and 1840, Cambridge began to grow rapidly after the construction of the [[West Boston Bridge]] in 1792, which connected Cambridge directly to Boston and made it unnecessary to travel eight miles (13 km) through the [[Boston Neck]], [[Roxbury, Massachusetts|Roxbury]], and [[Brookline, Massachusetts|Brookline]] to cross the [[Charles River]]. A second bridge, the Canal Bridge, opened in 1809 alongside the new [[Middlesex Canal]]. The new bridges and roads made what were formerly estates and [[marsh]]land into prime industrial and residential districts. In the mid-19th century, Cambridge was the center of a literary revolution when it gave the country a new identity through poetry and literature. Cambridge was home to the famous Fireside Poets, so called because their poems would often be read aloud by families in front of their evening fires. In their day, the [[Fireside Poets]] ([[Henry Wadsworth Longfellow]], [[James Russell Lowell]], and [[Oliver Wendell Holmes, Sr.|Oliver Wendell Holmes]]) were as popular and influential as rock stars are today.{{Citation needed|date=November 2009}} Soon after, [[Toll road|turnpikes]] were built: the [[Cambridge and Concord Turnpike]] (today's Broadway and Concord Ave.), the [[Middlesex Turnpike (Massachusetts)|Middlesex Turnpike]] (Hampshire St.
and [[Massachusetts Avenue (Boston)|Massachusetts Ave.]] northwest of [[Porter Square]]), and what are today's Cambridge, Main, and Harvard Streets, which connected various areas of Cambridge to the bridges. In addition, railroads crisscrossed the town during the same era, leading to the development of Porter Square as well as the creation of neighboring town [[Somerville, Massachusetts|Somerville]] from the formerly rural parts of [[Charlestown, Massachusetts|Charlestown]]. [[File:Middlesex Canal (Massachusetts) map, 1852.jpg|thumb|1852 Map of Boston area showing Cambridge and rail lines.]] Cambridge was incorporated as a city in 1846. This was despite noticeable tensions between East Cambridge, Cambridgeport, and Old Cambridge that stemmed from differences in each area's culture, sources of income, and the national origins of the residents. (Cambridge Considered: A Very Brief History of Cambridge, 1800-1900, Part I. http://cambridgeconsidered.blogspot.com/2011/01/very-brief-history-of-cambridge-1800.html) The city's commercial center began to shift from Harvard Square to Central Square, which became the downtown of the city around this time. Between 1850 and 1900, Cambridge took on much of its present character: [[streetcar suburb]]an development along the turnpikes, with working-class and industrial neighborhoods focused on East Cambridge, comfortable middle-class housing being built on old estates in Cambridgeport and Mid-Cambridge, and upper-class enclaves near Harvard University and on the minor hills of the city. The coming of the railroad to North Cambridge and Northwest Cambridge then led to three major changes in the city: the development of massive brickyards and brickworks between Massachusetts Ave., Concord Ave.
and [[Alewife Brook]]; the ice-cutting industry launched by [[Frederic Tudor]] on [[Fresh Pond, Cambridge, Massachusetts|Fresh Pond]]; and the carving up of the last estates into residential subdivisions to provide housing to the thousands of immigrants that arrived to work in the new industries. For many years, the city's largest employer was the [[New England Glass Company]], founded in 1818. By the middle of the 19th century it was the largest and most modern glassworks in the world. In 1888, all production was moved, by [[Edward Libbey|Edward Drummond Libbey]], to [[Toledo, Ohio]], where it continues today under the name Owens Illinois. Flint glassware with heavy lead content, produced by that company, is prized by antique glass collectors. There is none on public display in Cambridge, but there is a large collection in the [[Toledo Museum of Art]]. Among the largest businesses located in Cambridge was the firm of [[Carter's Ink Company]], whose neon sign long adorned the [[Charles River]] and which was for many years the largest manufacturer of ink in the world. By 1920, Cambridge was one of the main industrial cities of [[New England]], with nearly 120,000 residents. As industry in New England began to decline during the [[Great Depression]] and after World War II, Cambridge lost much of its industrial base. It also began the transition to being an intellectual, rather than an industrial, center. Harvard University had always been important in the city (both as a landowner and as an institution), but it began to play a more dominant role in the city's life and culture. Also, the move of the [[Massachusetts Institute of Technology]] from Boston in 1916 ensured Cambridge's status as an intellectual center of the United States. After the 1950s, the city's population began to decline slowly, as families tended to be replaced by single people and young couples. 
The 1980s brought a wave of high-technology startups, creating software such as [[Visicalc]] and [[Lotus 1-2-3]], and advanced computers, but many of these companies fell into decline with the fall of the minicomputer and [[DOS]]-based systems. However, the city continues to be home to many startups as well as a thriving biotech industry. By the end of the 20th century, Cambridge had one of the most expensive housing markets in the Northeastern United States. While the city maintained much diversity in class, race, and age, it became harder and harder for those who grew up there to afford to stay. The end of [[rent control]] in 1994 prompted many Cambridge renters to move to more affordable housing in Somerville and other communities. In 2005, a reassessment of residential property values resulted in a disproportionate number of houses owned by non-affluent people jumping in value relative to other houses, with hundreds having their property tax increased by over 100%; this forced many homeowners in Cambridge to move elsewhere. (''Cambridge Chronicle'', October 6, 13, 20, 27, 2005.) As of 2012, Cambridge's mix of amenities and proximity to Boston has kept housing prices relatively stable. ==Geography== [[File:Charles River Cambridge USA.jpg|thumb|upright|A view from Boston of Harvard's [[Weld Boathouse]] and Cambridge in winter. The [[Charles River]] is in the foreground.]] According to the [[United States Census Bureau]], Cambridge has a total area of {{convert|7.1|sqmi|km2}}, of which {{convert|6.4|sqmi|km2}} is land and {{convert|0.7|sqmi|km2}} (9.82%) is water.
===Adjacent municipalities=== Cambridge is located in eastern Massachusetts, bordered by: *the city of [[Boston]] to the south (across the [[Charles River]]) and east *the city of [[Somerville, Massachusetts|Somerville]] to the north *the town of [[Arlington, Massachusetts|Arlington]] to the northwest *the town of [[Belmont, Massachusetts|Belmont]] and *the city of [[Watertown, Massachusetts|Watertown]] to the west The border between Cambridge and the neighboring city of [[Somerville, Massachusetts|Somerville]] passes through densely populated neighborhoods which are connected by the [[Red Line (MBTA)|MBTA Red Line]]. Some of the main squares, [[Inman Square|Inman]], [[Porter Square|Porter]], and to a lesser extent, [[Harvard Square|Harvard]], are very close to the city line, as are Somerville's [[Union Square (Somerville)|Union]] and [[Davis Square]]s. ===Neighborhoods=== ====Squares==== [[File:Centralsquarecambridgemass.jpg|thumb|[[Central Square (Cambridge)|Central Square]]]] [[File:Harvard square 2009j.JPG|thumb|[[Harvard Square]]]] [[File:Cambridge MA Inman Square.jpg|thumb|[[Inman Square]]]] Cambridge has been called the "City of Squares" by some,{{cite web|author=No Writer Attributed |url=http://www.thecrimson.com/article/1969/9/18/cambridge-a-city-of-squares-pcambridge/ |title="Cambridge: A City of Squares" Harvard Crimson, Sept. 18, 1969 |publisher=Thecrimson.com |date=1969-09-18 |accessdate=2012-04-28}}{{cite web|url=http://www.travelwritersmagazine.com/RonBernthal/Cambridge.html |title=Cambridge Journal: Massachusetts City No Longer in Boston's Shadow |publisher=Travelwritersmagazine.com |date= |accessdate=2012-04-28}} as most of its commercial districts are major street intersections known as [[Town square|squares]]. Each of the squares acts as a neighborhood center. 
These include: * [[Kendall Square]], formed by the junction of Broadway, Main Street, and Third Street, is also known as '''Technology Square''', a name shared with an office and laboratory building cluster in the neighborhood. Just over the [[Longfellow Bridge]] from Boston, at the eastern end of the [[Massachusetts Institute of Technology|MIT]] campus, it is served by the [[Kendall (MBTA station)|Kendall/MIT]] station on the [[Massachusetts Bay Transportation Authority|MBTA]] [[Red Line (MBTA)|Red Line]] subway. Most of Cambridge's large office towers are located here, giving the area somewhat of an office park feel. A flourishing [[biotech]] industry has grown up around this area. The "One Kendall Square" complex is nearby, but—confusingly—not actually in Kendall Square. Also, the "Cambridge Center" office complex is located here, and not at the actual center of Cambridge. * [[Central Square (Cambridge)|Central Square]], formed by the junction of Massachusetts Avenue, Prospect Street, and Western Avenue, is well known for its wide variety of ethnic restaurants. As recently as the late 1990s it was rather run-down; it underwent a controversial [[gentrification]] in recent years (in conjunction with the development of the nearby [[University Park at MIT]]), and continues to grow more expensive. It is served by the [[Central (MBTA station)|Central Station]] stop on the MBTA Red Line subway. '''Lafayette Square''', formed by the junction of Massachusetts Avenue, Columbia Street, Sidney Street, and Main Street, is considered part of the Central Square area. [[Cambridgeport]] is south of Central Square along Magazine Street and Brookline Street. * [[Harvard Square]], formed by the junction of Massachusetts Avenue, Brattle Street, and JFK Street. This is the primary site of [[Harvard University]], and is a major Cambridge shopping area. It is served by a [[Harvard (MBTA station)|Red Line station]]. 
Harvard Square was originally the northwestern terminus of the Red Line and a major transfer point to streetcars that also operated in a short [[Harvard Bus Tunnel|tunnel]]—which is still a major bus terminal, although the area under the Square was reconfigured dramatically in the 1980s when the Red Line was extended. The Harvard Square area includes '''Brattle Square''' and '''Eliot Square'''. A short distance away from the square lies the [[Cambridge Common]], while the neighborhood north of Harvard and east of Massachusetts Avenue is known as Agassiz in honor of the famed scientist [[Louis Agassiz]]. * [[Porter Square]], about a mile north on Massachusetts Avenue from Harvard Square, is formed by the junction of Massachusetts and Somerville Avenues, and includes part of the city of [[Somerville, Massachusetts|Somerville]]. It is served by the [[Porter (MBTA station)|Porter Square Station]], a complex housing a [[Red Line (MBTA)|Red Line]] stop and a [[Fitchburg Line]] [[MBTA commuter rail|commuter rail]] stop. [[Lesley University]]'s University Hall and Porter campus are located at Porter Square. * [[Inman Square]], at the junction of Cambridge and Hampshire streets in Mid-Cambridge. Inman Square is home to many diverse restaurants, bars, music venues and boutiques. The funky street scene still holds some urban flair, but was dressed up recently with Victorian streetlights, benches and bus stops. A new community park was installed and is a favorite place to enjoy some takeout food from the nearby restaurants and ice cream parlor. * [[Lechmere Square]], at the junction of Cambridge and First streets, adjacent to the CambridgeSide Galleria shopping mall. Perhaps best known as the northern terminus of the [[Massachusetts Bay Transportation Authority|MBTA]] [[Green Line (MBTA)|Green Line]] subway, at [[Lechmere (MBTA station)|Lechmere Station]]. 
====Other neighborhoods==== The residential neighborhoods ([http://www.cambridgema.gov/CPD/publications/neighborhoods.cfm map]) in Cambridge border, but are not defined by the squares. These include: * [[East Cambridge, Massachusetts|East Cambridge]] (Area 1) is bordered on the north by the [[Somerville, Massachusetts|Somerville]] border, on the east by the Charles River, on the south by Broadway and Main Street, and on the west by the [[Grand Junction Railroad]] tracks. It includes the [[NorthPoint (Cambridge, Massachusetts)|NorthPoint]] development. * [[Massachusetts Institute of Technology|MIT]] Campus ([[MIT Campus (Area 2), Cambridge|Area 2]]) is bordered on the north by Broadway, on the south and east by the Charles River, and on the west by the Grand Junction Railroad tracks. * [[Wellington-Harrington]] (Area 3) is bordered on the north by the [[Somerville, Massachusetts|Somerville]] border, on the south and west by Hampshire Street, and on the east by the Grand Junction Railroad tracks. Referred to as "Mid-Block".{{clarify|What is? By whom? A full sentence would help.|date=September 2011}} * [[Area 4, Cambridge|Area 4]] is bordered on the north by Hampshire Street, on the south by Massachusetts Avenue, on the west by Prospect Street, and on the east by the Grand Junction Railroad tracks. Residents of Area 4 often refer to their neighborhood simply as "The Port", and refer to the area of Cambridgeport and Riverside as "The Coast". * [[Cambridgeport]] (Area 5) is bordered on the north by Massachusetts Avenue, on the south by the Charles River, on the west by River Street, and on the east by the Grand Junction Railroad tracks. * [[Mid-Cambridge]] (Area 6) is bordered on the north by Kirkland and Hampshire Streets and the [[Somerville, Massachusetts|Somerville]] border, on the south by Massachusetts Avenue, on the west by Peabody Street, and on the east by Prospect Street. 
* [[Riverside, Cambridge|Riverside]] (Area 7), an area sometimes referred to as "The Coast," is bordered on the north by Massachusetts Avenue, on the south by the Charles River, on the west by JFK Street, and on the east by River Street. * [[Agassiz, Cambridge, Massachusetts|Agassiz (Harvard North)]] (Area 8) is bordered on the north by the [[Somerville, Massachusetts|Somerville]] border, on the south and east by Kirkland Street, and on the west by Massachusetts Avenue. * [[Peabody, Cambridge, Massachusetts|Peabody]] (Area 9) is bordered on the north by railroad tracks, on the south by Concord Avenue, on the west by railroad tracks, and on the east by Massachusetts Avenue. The Avon Hill sub-neighborhood consists of the higher elevations bounded by Upland Road, Raymond Street, Linnaean Street and Massachusetts Avenue. * Brattle area/[[West Cambridge (neighborhood)|West Cambridge]] (Area 10) is bordered on the north by Concord Avenue and Garden Street, on the south by the Charles River and the [[Watertown, Massachusetts|Watertown]] border, on the west by Fresh Pond and the Collins Branch Library, and on the east by JFK Street. It includes the sub-neighborhoods of Brattle Street (formerly known as [[Tory Row]]) and Huron Village. * [[North Cambridge, Massachusetts|North Cambridge]] (Area 11) is bordered on the north by the [[Arlington, Massachusetts|Arlington]] and [[Somerville, Massachusetts|Somerville]] borders, on the south by railroad tracks, on the west by the [[Belmont, Massachusetts|Belmont]] border, and on the east by the [[Somerville, Massachusetts|Somerville]] border. * [[Cambridge Highlands]] (Area 12) is bordered on the north and east by railroad tracks, on the south by Fresh Pond, and on the west by the [[Belmont, Massachusetts|Belmont]] border. 
* [[Strawberry Hill, Cambridge|Strawberry Hill]] (Area 13) is bordered on the north by Fresh Pond, on the south by the [[Watertown, Massachusetts|Watertown]] border, on the west by the [[Belmont, Massachusetts|Belmont]] border, and on the east by railroad tracks. ===Parks and outdoors=== [[File:Alewife Brook Reservation.jpg|thumb|Alewife Brook Reservation]] Consisting largely of densely built residential space, Cambridge lacks significant tracts of public parkland. This is partly compensated for, however, by the presence of easily accessible open space on the university campuses, including [[Harvard Yard]] and MIT's Great Lawn, as well as the considerable open space of [[Mount Auburn Cemetery]]. At the western edge of Cambridge, the cemetery is well known as the first garden cemetery, for its distinguished inhabitants, for its superb landscaping (the oldest planned landscape in the country), and as a first-rate [[arboretum]]. Although the cemetery is known as a Cambridge landmark, much of it lies within the bounds of Watertown (http://www2.cambridgema.gov/CityOfCambridge_Content/documents/CambridgeStreetMap18x24_032007.pdf). It is also a significant [[Important Bird Area]] (IBA) in the Greater Boston area. Public parkland includes the esplanade along the Charles River, which mirrors its [[Charles River Esplanade|Boston counterpart]]; [[Cambridge Common]], a busy and historic public park immediately adjacent to the Harvard campus; and the [[Alewife Brook Reservation]] and [[Fresh Pond, Cambridge, Massachusetts|Fresh Pond]] in the western part of the city.
==Demographics== {{Historical populations | type=USA | align=right | 1790|2115 | 1800|2453 | 1810|2323 | 1820|3295 | 1830|6072 | 1840|8409 | 1850|15215 | 1860|26060 | 1870|39634 | 1880|52669 | 1890|70028 | 1900|91886 | 1910|104839 | 1920|109694 | 1930|113643 | 1940|110879 | 1950|120740 | 1960|107716 | 1970|100361 | 1980|95322 | 1990|95802 | 2000|101355 | 2010|105162 | footnote= {{Historical populations/Massachusetts municipalities references}}{{cite journal | title=1950 Census of Population | volume=1: Number of Inhabitants | at=Section 6, Pages 21-7 through 21-09, Massachusetts Table 4. Population of Urban Places of 10,000 or more from Earliest Census to 1920 | publisher=Bureau of the Census | accessdate=July 12, 2011 | year=1952 | url=http://www2.census.gov/prod2/decennial/documents/23761117v1ch06.pdf}} }} As of the census{{GR|2}} of 2010, there were 105,162 people, 44,032 households, and 17,420 families residing in the city. The population density was 16,422.08 people per square mile (6,341.98/km²), making Cambridge the fifth most densely populated city in the US (County and City Data Book: 2000. Washington, DC: US Department of Commerce, Bureau of the Census. Table C-1) and the second most densely populated city in [[Massachusetts]] behind neighboring [[Somerville, Massachusetts|Somerville]]. [http://www.boston.com/realestate/news/articles/2008/07/13/highest_population_density/ Highest Population Density, The Boston Globe] There were 47,291 housing units at an average density of 7,354.7 per square mile (2,840.3/km²). The racial makeup of the city was 66.60% [[White (U.S. Census)|White]], 11.70% [[Black (people)|Black]] or [[Race (United States Census)|African American]], 0.20% [[Native American (U.S. Census)|Native American]], 15.10% [[Asian (U.S. Census)|Asian]], 0.01% [[Pacific Islander (U.S. Census)|Pacific Islander]], 2.10% from [[Race (United States Census)|other races]], and 4.30% from two or more races.
7.60% of the population were [[Hispanics in the United States|Hispanic]] or [[Latino (U.S. Census)|Latino]] of any race. [[Non-Hispanic Whites]] were 62.1% of the population in 2010,{{cite web |url=http://quickfacts.census.gov/qfd/states/25/2511000.html |title=Cambridge (city), Massachusetts |work=State & County QuickFacts |publisher=U.S. Census Bureau}} down from 89.7% in 1970.{{cite web|title=Massachusetts - Race and Hispanic Origin for Selected Cities and Other Places: Earliest Census to 1990|publisher=U.S. Census Bureau|url=http://www.census.gov/population/www/documentation/twps0076/twps0076.html}} This rather closely parallels the average [[racial demographics of the United States]] as a whole, although Cambridge has significantly more Asians than the average, and fewer Hispanics and Caucasians. 11.0% were of [[irish people|Irish]], 7.2% English, 6.9% [[italians|Italian]], 5.5% [[West Indian]] and 5.3% [[germans|German]] ancestry according to [[Census 2000]]. 69.4% spoke English, 6.9% Spanish, 3.2% [[Standard Mandarin|Chinese]] or [[Standard Mandarin|Mandarin]], 3.0% [[portuguese language|Portuguese]], 2.9% [[French-based creole languages|French Creole]], 2.3% French, 1.5% [[korean language|Korean]], and 1.0% [[italian language|Italian]] as their first language. There were 44,032 households out of which 16.9% had children under the age of 18 living with them, 28.9% were married couples living together, 8.4% had a female householder with no husband present, and 60.4% were non-families. 40.7% of all households were made up of individuals and 9.6% had someone living alone who was 65 years of age or older. The average household size was 2.00 and the average family size was 2.76. In the city the population was spread out with 13.3% under the age of 18, 21.2% from 18 to 24, 38.6% from 25 to 44, 17.8% from 45 to 64, and 9.2% who were 65 years of age or older. The median age was 30.5 years. For every 100 females, there were 96.1 males. 
For every 100 females age 18 and over, there were 94.7 males. The median income for a household in the city was $47,979, and the median income for a family was $59,423 (these figures had risen to $58,457 and $79,533 respectively {{as of|2007|alt=as of a 2007 estimate}}{{cite web|url=http://factfinder.census.gov/servlet/ACSSAFFFacts?_event=Search&geo_id=16000US2418750&_geoContext=01000US%7C04000US24%7C16000US2418750&_street=&_county=cambridge&_cityTown=cambridge&_state=04000US25&_zip=&_lang=en&_sse=on&ActiveGeoDiv=geoSelect&_useEV=&pctxt=fph&pgsl=160&_submenuId=factsheet_1&ds_name=ACS_2007_3YR_SAFF&_ci_nbr=null&qr_name=null&reg=null%3Anull&_keyword=&_industry= |title=U.S. Census, 2000 |publisher=Factfinder.census.gov |date= |accessdate=2012-04-28}}). Males had a median income of $43,825 versus $38,489 for females. The per capita income for the city was $31,156. About 8.7% of families and 12.9% of the population were below the poverty line, including 15.1% of those under age 18 and 12.9% of those age 65 or over.
Cambridge was ranked as one of the most liberal cities in America.{{cite web|author=Aug 16, 2005 12:00 AM |url=http://www.govpro.com/News/Article/31439/ |title=Study Ranks America’s Most Liberal and Conservative Cities |publisher=Govpro.com |date=2005-08-16 |accessdate=2012-04-28}} Locals living in and near the city jokingly refer to it as "The People's Republic of Cambridge." ([http://www.universalhub.com/glossary/peoples_republic_the.html Wicked Good Guide to Boston English], accessed February 2, 2009.) For 2012, the residential property tax rate in Cambridge is $8.48 per $1,000.{{cite web|url=http://www.cambridgema.gov/finance/propertytaxinformation/fy12propertytaxinformation.aspx |title=FY12 Property Tax Information - City of Cambridge, Massachusetts |publisher=Cambridgema.gov |date= |accessdate=2012-04-28}} Cambridge enjoys the highest possible [[bond credit rating]], AAA, with all three Wall Street rating agencies (http://www.cambridgema.gov/CityOfCambridge_Content/documents/Understanding_Your_Taxes_2007.pdf). Cambridge is noted for its diverse population, both racially and economically. Residents, known as ''Cantabrigians'', include affluent [[MIT]] and Harvard professors. The first legal applications in America for same-sex marriage licenses were issued at Cambridge's City Hall.{{cite web|url=http://www.boston.com/news/local/articles/2004/05/17/free_to_marry/ |title=Free to Marry |work=[[The Boston Globe]] |date=2004-05-17 |accessdate=2012-07-18}} Cambridge is also the birthplace of [[Thailand|Thai]] king [[Bhumibol Adulyadej|Bhumibol Adulyadej (Rama IX)]], who as of 2010, at age 82, is the world's longest-reigning monarch as well as the longest-reigning monarch in Thai history. He is also the first king of a foreign country to be born in the United States. ==Government== ===Federal and state representation=== {| class=wikitable !
colspan = 6 | Voter registration and party enrollment {{as of|lc=y|df=US|2008|10|15}}{{cite web|title = 2008 State Party Election Party Enrollment Statistics | publisher = Massachusetts Elections Division | format = PDF | accessdate = July 7, 2010 | url = http://www.sec.state.ma.us/ele/elepdf/st_county_town_enroll_breakdown_08.pdf}} |- ! colspan = 2 | Party ! Number of voters ! Percentage {{American politics/party colors/Democratic/row}} | [[Democratic Party (United States)|Democratic]] | style="text-align:center;"| 37,822 | style="text-align:center;"| 58.43% {{American politics/party colors/Republican/row}} | [[Republican Party (United States)|Republican]] | style="text-align:center;"| 3,280 | style="text-align:center;"| 5.07% {{American politics/party colors/Independent/row}} | Unaffiliated | style="text-align:center;"| 22,935 | style="text-align:center;"| 35.43% {{American politics/party colors/Libertarian/row}} | Minor Parties | style="text-align:center;"| 690 | style="text-align:center;"| 1.07% |- ! colspan = 2 | Total ! style="text-align:center;"| 64,727 ! style="text-align:center;"| 100% |} Cambridge is part of [[Massachusetts's 8th congressional district]], represented by Democrat [[Mike Capuano]], elected in 1998. The state's senior member of the [[United States Senate]] is Democrat [[John Kerry]], elected in 1984. The state's junior member is Republican [[Scott Brown]], [[United States Senate special election in Massachusetts, 2010|elected in 2010]] to fill the vacancy caused by the death of long-time Democratic Senator [[Ted Kennedy]]. The Governor of Massachusetts is Democrat [[Deval Patrick]], elected in 2006 and re-elected in 2010. 
At the state level, Cambridge is represented in six districts in the [[Massachusetts House of Representatives]]: the 24th Middlesex (which includes parts of Belmont and Arlington), the 25th and 26th Middlesex (the latter of which includes a portion of Somerville), the 29th Middlesex (which includes a small part of Watertown), and the Eighth and Ninth Suffolk (both including parts of the City of Boston). The city is represented in the [[Massachusetts Senate]] as a part of the "First Suffolk and Middlesex" district (which contains parts of Boston, Revere and Winthrop, each in Suffolk County); the "Middlesex, Suffolk and Essex" district, which includes Everett and Somerville, with Boston, Chelsea, and Revere of Suffolk, and Saugus in Essex; and the "Second Suffolk and Middlesex" district, containing parts of the City of Boston in Suffolk county, and Cambridge, Belmont and Watertown in Middlesex county.{{cite web|url=http://www.malegislature.gov/ |title=Index of Legislative Representation by City and Town, from |publisher=Mass.gov |date= |accessdate=2012-04-28}} In addition to the [[Cambridge Police Department (Massachusetts)|Cambridge Police Department]], the city is patrolled by the Fifth (Brighton) Barracks of Troop H of the [[Massachusetts State Police]].[http://www.mass.gov/?pageID=eopsterminal&L=5&L0=Home&L1=Law+Enforcement+%26+Criminal+Justice&L2=Law+Enforcement&L3=State+Police+Troops&L4=Troop+H&sid=Eeops&b=terminalcontent&f=msp_divisions_field_services_troops_troop_h_msp_field_troop_h_station_h5&csid=Eeops Station H-5, SP Brighton]{{dead link|date=April 2012}} Owing to their close proximity, however, the city also cooperates with the Fourth (Boston) Barracks of Troop H.[http://www.mass.gov/?pageID=eopsterminal&L=5&L0=Home&L1=Law+Enforcement+%26+Criminal+Justice&L2=Law+Enforcement&L3=State+Police+Troops&L4=Troop+H&sid=Eeops&b=terminalcontent&f=msp_divisions_field_services_troops_troop_h_msp_field_troop_h_station_h4&csid=Eeops Station H-4, SP 
Boston]{{dead link|date=April 2012}} ===City government=== [[File:CambridgeMACityHall1.jpg|thumb|right|[[Cambridge, Massachusetts City Hall|Cambridge City Hall]] in the 1980s]] Cambridge has a city government led by a [[List of mayors of Cambridge, Massachusetts|Mayor]] and nine-member City Council. There is also a six-member School Committee which functions alongside the Superintendent of public schools. The councilors and school committee members are elected every two years using the [[single transferable vote]] (STV) system.{{cite web|url=http://www.cambridgema.gov/election/Proportional_Representation.cfm |title=Proportional Representation Voting in Cambridge |publisher=Cambridgema.gov |date= |accessdate=2012-04-28}} Once a laborious process that took several days to complete by hand, ballot sorting and calculations to determine the outcome of elections are now quickly performed by computer, after the ballots have been [[Optical scan voting system|optically scanned]]. The mayor is elected by the city councilors from amongst themselves, and serves as the chair of City Council meetings. The mayor also sits on the School Committee. However, the Mayor is not the Chief Executive of the City. Rather, the City Manager, who is appointed by the City Council, serves in that capacity. Under the City's Plan E form of government the city council does not have the power to appoint or remove city officials who are under direction of the city manager. The city council and its individual members are also forbidden from giving orders to any subordinate of the city manager.http://www.cambridgema.gov/CityOfCambridge_Content/documents/planE.pdf [[Robert W. Healy]] is the City Manager; he has served in the position since 1981. 
In recent history, the media has highlighted the salary of the City Manager as being one of the highest in the State of Massachusetts.{{cite news |title=Cambridge city manager's salary almost as much as Obama's pay |url=http://www.wickedlocal.com/cambridge/features/x1837730973/Cambridge-city-managers-salary-almost-as-much-as-Obamas |agency= |newspaper=Wicked Local: Cambridge |publisher= |date=August 11, 2011 |accessdate=December 30, 2011 |quote= |archiveurl= |archivedate= |deadurl= |ref=}} The city council consists of:{{cite web|url=http://www.cambridgema.gov/ccouncil/citycouncilmembers.aspx |title=City of Cambridge – City Council Members |publisher=Cambridgema.gov |date= |accessdate=2012-04-28}}{{Refbegin|3}} *[[Leland Cheung]] (Jan. 2010–present) *Henrietta Davis (Jan. 1996–present)* *Marjorie C. Decker (Jan. 2000–present){{cite web |url= http://www.wickedlocal.com/cambridge/news/x738245499/Marjorie-Decker-announces-she-will-run-for-Alice-Wolfs-Cambridge-State-Representative-seat |title= Marjorie Decker announces she will run for Alice Wolf's Cambridge State Representative seat |date= 22 March 2012 |work= Wicked Local Cambridge |publisher= GateHouse Media, Inc. |accessdate= 4 April 2012 }} *Craig A. Kelley (Jan. 2006–present) *David Maher (Jan. 2000-Jan. 2006, Sept. 2007–present{{cite web|author=By ewelin, on September 5th, 2007 |url=http://www.cambridgehighlands.com/2007/09/david-p-maher-elected-to-fill-michael-sullivans-vacated-city-council-seat |title=David P. Maher Elected to fill Michael Sullivan’s Vacated City Council Seat • Cambridge Highlands Neighborhood Association |publisher=Cambridgehighlands.com |date=2007-09-05 |accessdate=2012-04-28}})** *[[Kenneth Reeves]] (Jan. 1990–present)** *[[E. Denise Simmons]] (Jan. 2002–present)** *[[Timothy J. Toomey, Jr.]] (Jan. 1990–present) *Minka vanBeuzekom (Jan. 2012–present){{Refend}} ''* = Current Mayor''
''** = former Mayor'' ===Fire Department=== The city of Cambridge is protected full-time by the 274 professional firefighters of the Cambridge Fire Department. The current Chief of Department is Gerald R. Reardon. The Cambridge Fire Department operates out of eight fire stations, located throughout the city, under the command of two divisions. The CFD also maintains and operates a front-line fire apparatus fleet of eight engines, four ladders, two Non-Transport Paramedic EMS units, a Haz-Mat unit, a Tactical Rescue unit, a Dive Rescue unit, two Marine units, and numerous special, support, and reserve units. John J. Gelinas, Chief of Operations, is in charge of day to day operation of the department.{{cite web|url=http://www2.cambridgema.gov/cfd/ |title=City of Cambridge Fire Department |publisher=.cambridgema.gov |date=2005-03-13 |accessdate=2012-06-26}} The CFD is rated as a Class 1 fire department by the [[Insurance Services Office]] (ISO), and is one of only 32 fire departments so rated, out of 37,000 departments in the United States. The other class 1 departments in New England are in [[Hartford, Connecticut]] and [[Milford, Connecticut]]. Class 1 signifies the highest level of fire protection according to various criteria.{{cite web|url=http://www2.cambridgema.gov/CFD/Class1FD.cfm |title=Class 1 Fire Department |publisher=.cambridgema.gov |date=1999-07-01 |accessdate=2012-06-26}} The CFD responds to approximately 15,000 emergency calls annually. {| class=wikitable |- valign=bottom ! Engine Company ! Ladder Company ! Special Unit ! Division ! Address ! Neighborhood |- | Engine 1 || Ladder 1 || || || 491 Broadway || Harvard Square |- | Engine 2 || Ladder 3 || Squad 2 || || 378 Massachusetts Ave. || Lafayette Square |- | Engine 3 || Ladder 2 || || || 175 Cambridge St. || East Cambridge |- | Engine 4 || || Squad 4 || || 2029 Massachusetts Ave. || Porter Square |- | Engine 5 || || || Division 1 || 1384 Cambridge St. 
|| Inman Square |- | Engine 6 || || || || 176 River St. || Cambridgeport |- | Engine 8 || Ladder 4 || || Division 2 || 113 Garden St. || Taylor Square |- | Engine 9 || || || || 167 Lexington Ave || West Cambridge |- | Maintenance Facility || || || || 100 Smith Pl. || |} ===Water Department=== Cambridge is unusual among cities inside Route 128 in having a non-[[MWRA]] water supply. City water is obtained from [[Hobbs Brook]] (in [[Lincoln, Massachusetts|Lincoln]] and [[Waltham, Massachusetts|Waltham]]), [[Stony Brook (Boston)|Stony Brook]] (Waltham and [[Weston, Massachusetts|Weston]]), and [[Fresh Pond (Cambridge, Massachusetts)|Fresh Pond]] (Cambridge). The city owns over 1200 acres of land in other towns that includes these reservoirs and portions of their watershed.{{cite web|url=http://www2.cambridgema.gov/CWD/wat_lands.cfm |title=Cambridge Watershed Lands & Facilities |publisher=.cambridgema.gov |date= |accessdate=2012-04-28}} Water is treated at Fresh Pond, then pumped uphill to an elevation of {{convert|176|ft|m}} [[above sea level]] at the Payson Park Reservoir ([[Belmont, Massachusetts|Belmont]]); from there, the water is redistributed downhill via gravity to individual users in the city.{{cite web|url=http://www.cambridgema.gov/CityOfCambridge_Content/documents/CWD_March_2010.pdf |title=Water supply system |format=PDF |date= |accessdate=2012-04-28}}[http://www.cambridgema.gov/CWD/fpfaqs.cfm Is Fresh Pond really used for drinking water?], Cambridge Water Department ===County government=== Cambridge is a [[county seat]] of [[Middlesex County, Massachusetts]], along with [[Lowell, Massachusetts|Lowell]]. Though the county government was abolished in 1997, the county still exists as a geographical and political region. The employees of Middlesex County courts, jails, registries, and other county agencies now work directly for the state. 
At present, the county's registrars of [[Deed]]s and Probate remain in Cambridge; however, the Superior Court and District Attorney have had their base of operations transferred to [[Woburn, Massachusetts|Woburn]]. Third District court has shifted operations to [[Medford, Massachusetts|Medford]], and the Sheriff's office for the county is still awaiting a near-term relocation.{{cite news | url=http://www.boston.com/news/local/massachusetts/articles/2008/02/14/court_move_a_hassle_for_commuters/ |title=Court move a hassle for commuters |accessdate=July 25, 2009 |first=Eric |last=Moskowitz |authorlink= |coauthors= |date=February 14, 2008 |work=[[Boston Globe|The Boston Globe]] |pages= |archiveurl= |archivedate= |quote=In a little more than a month, Middlesex Superior Court will open in Woburn after nearly four decades at the Edward J. Sullivan Courthouse in Cambridge. With it, the court will bring the roughly 500 people who pass through its doors each day – the clerical staff, lawyers, judges, jurors, plaintiffs, defendants, and others who use or work in the system.}}{{cite news | url=http://www.wickedlocal.com/cambridge/homepage/x135741754/Cambridges-Middlesex-Jail-courts-may-be-shuttered-for-good |title=Cambridge's Middlesex Jail, courts may be shuttered for good |accessdate=July 25, 2009 |first=Charlie |last=Breitrose |authorlink= |coauthors= |date=July 7, 2009 |work=Wicked Local News: Cambridge |pages= |archiveurl= |archivedate= |quote=The courts moved out of the building to allow workers to remove asbestos. 
Superior Court moved to Woburn in March 2008, and in February, the Third District Court moved to Medford.}} ==Education== [[File:MIT Main Campus Aerial.jpg|thumb|Aerial view of part of [[MIT]]'s main campus]] [[File:Dunster House.jpg|thumb|[[Dunster House]], Harvard]] ===Higher education=== Cambridge is perhaps best known as an academic and intellectual center, owing to its colleges and universities, which include: *[[Cambridge College]] *[[Cambridge School of Culinary Arts]] *[[Episcopal Divinity School]] *[[Harvard University]] *[[Hult International Business School]] *[[Lesley University]] *[[Longy School of Music]] *[[Massachusetts Institute of Technology]] *[[Le Cordon Bleu College of Culinary Arts in Boston]] [[Nobel laureates by university affiliation|At least 129]] of the world's total 780 [[Nobel Prize]] winners have been, at some point in their careers, affiliated with universities in Cambridge. The [[American Academy of Arts and Sciences]] is also based in Cambridge. ===Primary and secondary public education=== The Cambridge Public School District encompasses 12 elementary schools that follow a variety of different educational systems and philosophies. All but one of the elementary schools extend up to the [[middle school]] grades as well. The 12 elementary schools are: *[[Amigos School]] *Baldwin School *Cambridgeport School *Fletcher-Maynard Academy *Graham and Parks Alternative School *Haggerty School *Kennedy-Longfellow School *King Open School *Martin Luther King, Jr. 
School *Morse School (a [[Core Knowledge Foundation|Core Knowledge]] school) *Peabody School *Tobin School (a [[Montessori school]]) There are three public high schools serving Cambridge students, including the [[Cambridge Rindge and Latin School]]{{cite web|url=http://www.cpsd.us/Web/PubInfo/SchoolsAtAGlance06-07.pdf|title=Cambridge Public Schools at a Glance|format=PDF}}{{dead link|date=June 2012}} and Community Charter School of Cambridge (www.ccscambridge.org). In 2003, CRLS, also known as Rindge, came close to losing its educational accreditation when it was placed on probation by the [[New England Association of Schools and Colleges]].{{cite web|url=http://www.thecrimson.com/article.aspx?ref=512061|title=School Fights Achievement Gap|publisher=The Harvard Crimson|accessdate=May 14, 2009}} The school has improved under Principal Chris Saheed: graduation rates hover around 98%, and 70% of students gain college admission. Community Charter School of Cambridge serves 350 students, primarily from Boston and Cambridge, and is a tuition-free public charter school with a college preparatory curriculum. All students from the classes of 2009 and 2010 gained admission to college. Outside of the main public schools are public charter schools including: [[Benjamin Banneker Charter School]], which serves students in grades K-6,{{cite web|url=http://www.banneker.org/ |title=The Benjamin Banneker Charter Public School |publisher=Banneker.org |date=2012-03-01 |accessdate=2012-04-28}} [[Community Charter School of Cambridge]],{{cite web|url=http://www.ccscambridge.org/ |title=Community Charter School of Cambridge |publisher=Ccscambridge.org |date= |accessdate=2012-04-28}} which is located in Kendall Square and serves students in grades 7–12, and [[Prospect Hill Academy]], a [[charter school]] whose upper school is in [[Central Square (Cambridge)|Central Square]], though it is not a part of the Cambridge Public School District. 
===Primary and secondary private education=== [[File:Cambridge Public Library, Cambridge, Massachusetts.JPG|thumb|right|[[Cambridge Public Library]] original building, part of an expanded facility]] There are also many private schools in the city including: *[[Boston Archdiocesan Choir School]] (BACS) *[[Buckingham Browne & Nichols]] (BB&N) *[[Cambridge montessori school|Cambridge Montessori School]] (CMS) *Cambridge [[Religious Society of Friends|Friends]] School. Thomas Waring served as founding headmaster of the school. *Fayerweather Street School (FSS)[http://www.fayerweather.org/ ] *[[International School of Boston]] (ISB, formerly École Bilingue) *[[Matignon High School]] *[[North Cambridge Catholic High School]] (re-branded as Cristo Rey Boston and relocated to Dorchester, MA in 2010) *[[Shady Hill School]] *St. Peter School ==Economy== [[File:Cambridge Skyline.jpg|thumb|Buildings of [[Kendall Square]], center of Cambridge's [[biotech]] economy, seen from the [[Charles River]]]] Manufacturing was an important part of the economy in the late 19th and early 20th century, but educational institutions are the city's biggest employers today. Harvard and [[Massachusetts Institute of Technology|MIT]] together employ about 20,000.[http://www2.cambridgema.gov/cdd/data/labor/top25/top25_2008.html Top 25 Cambridge Employers: 2008], City of Cambridge As a cradle of technological innovation, Cambridge was home to technology firms [[Analog Devices]], [[Akamai Technologies|Akamai]], [[BBN Technologies|Bolt, Beranek, and Newman (BBN Technologies)]] (now part of Raytheon), [[General Radio|General Radio (later GenRad)]], [[Lotus Development Corporation]] (now part of [[IBM]]), [[Polaroid Corporation|Polaroid]], [[Symbolics]], and [[Thinking Machines]]. In 1996, [[Polaroid Corporation|Polaroid]], [[Arthur D. Little]], and [[Lotus Development Corporation|Lotus]] were top employers with over 1,000 employees in Cambridge, but faded out a few years later. 
Health care and biotechnology firms such as [[Genzyme]], [[Biogen Idec]], [[Millennium Pharmaceuticals]], [[Sanofi]], [[Pfizer]] and [[Novartis]]{{cite news |title=Novartis doubles plan for Cambridge |author=Casey Ross and Robert Weisman |first= |last= |authorlink= |authorlink2= |url=http://articles.boston.com/2010-10-27/business/29323650_1_french-drug-maker-astrazeneca-plc-research-operations |agency= |newspaper=[[The Boston Globe]] |publisher= |isbn= |issn= |pmid= |pmd= |bibcode= |doi= |date=October 27, 2010 |page= |pages= |accessdate=April 12, 2011|quote=Already Cambridge’s largest corporate employer, the Swiss firm expects to hire an additional 200 to 300 employees over the next five years, bringing its total workforce in the city to around 2,300. Novartis’s global research operations are headquartered in Cambridge, across Massachusetts Avenue from the site of the new four-acre campus. |archiveurl= |archivedate= |ref=}} have significant presences in the city. Though headquartered in Switzerland, Novartis continues to expand its operations in Cambridge. Other major biotech and pharmaceutical firms expanding their presence in Cambridge include [[GlaxoSmithKline]], [[AstraZeneca]], [[Shire plc|Shire]], and [[Pfizer]].{{cite news|title=Novartis Doubles Plan for Cambridge|url=http://www.boston.com/business/healthcare/articles/2010/10/27/novartis_doubles_plan_for_cambridge/|accessdate=23 February 2012 | work=The Boston Globe|first1=Casey|last1=Ross|first2=Robert|last2=Weisman|date=October 27, 2010}} Most biotech firms in Cambridge are located around [[Kendall Square]] and [[East Cambridge, Massachusetts|East Cambridge]], which decades ago were the city's center of manufacturing. A number of biotechnology companies are also located in [[University Park at MIT]], a new development in another former manufacturing area. 
None of the high technology firms that once dominated the economy was among the 25 largest employers in 2005, but by 2008 high tech companies [[Akamai Technologies|Akamai]] and [[ITA Software]] had grown to be among the largest 25 employers. [[Google]],{{cite web|url=http://www.google.com/corporate/address.html |title=Google Offices |publisher=Google.com |date= |accessdate=2012-07-18}} [[IBM Research]], and [[Microsoft Research]] maintain offices in Cambridge. In late January 2012, less than a year after acquiring the [[Billerica, Massachusetts|Billerica]]-based analytic database management company [[Vertica]], [[Hewlett-Packard]] announced it would also be opening its first offices in Cambridge.{{cite web|last=Huang|first=Gregory|title=Hewlett-Packard Expands to Cambridge via Vertica’s "Big Data" Center|url=http://www.xconomy.com/boston/2012/01/23/hewlett-packard-expands-to-cambridge-via-verticas-big-data-center/?single_page=true}} Around this same time, e-commerce giants [[Staples Inc.|Staples]]{{cite web|title=Staples to bring e-commerce office to Cambridge's Kendall Square|url=http://www.wickedlocal.com/cambridge/news/x690035936/Staples-to-bring-E-commerce-office-to-Cambridges-Kendall-Square#axzz1kg3no7Zg}} and [[Amazon.com]]{{cite web|title=Amazon Seeks Brick-And-Mortar Presence In Boston Area|url=http://www.wbur.org/2011/12/22/amazon-boston}} said they would be opening research and innovation centers in Kendall Square. Video game developer [[Harmonix Music Systems]] is based in [[Central Square (Cambridge)|Central Square]]. 
The proximity of Cambridge's universities has also made the city a center for nonprofit groups and think tanks, including the [[National Bureau of Economic Research]], the [[Smithsonian Astrophysical Observatory]], the [[Lincoln Institute of Land Policy]], [[Cultural Survival]], and [[One Laptop per Child]]. In September 2011, an initiative by the City of Cambridge called the "[[Entrepreneur Walk of Fame]]" was launched. It seeks to highlight individuals who have made contributions to innovation in the global business community.{{cite news |title=Stars of invention |author= |first=Kathleen |last=Pierce |url=http://articles.boston.com/2011-09-16/business/30165912_1_gates-and-jobs-microsoft-granite-stars |agency= |newspaper=The Boston Globe|date=September 16, 2011 |page= |pages= |at= |accessdate=October 1, 2011}} ===Top employers=== The top ten employers in the city are:{{cite web|url=http://cambridgema.gov/citynewsandpublications/news/2012/01/fy11comprehensiveannualfinancialreportnowavailable.aspx |title=City of Cambridge, Massachusetts Comprehensive Annual Financial Report July 1, 2010—June 30, 2011 |publisher=Cambridgema.gov |date=2011-06-30 |accessdate=2012-04-28}} {| class="wikitable" |- ! # ! Employer ! 
# of employees |- | 1 |[[Harvard University]] |10,718 |- |2 |[[Massachusetts Institute of Technology]] |7,604 |- |3 |City of Cambridge |2,922 |- |4 |[[Novartis]] Institutes for BioMedical Research |2,095 |- |5 |[[Mount Auburn Hospital]] |1,665 |- |6 |[[Vertex Pharmaceuticals]] |1,600 |- |7 |[[Genzyme]] |1,504 |- |8 |[[Biogen Idec]] |1,350 |- |9 |[[Federal government of the United States|Federal Government]] |1,316 |- |10 |[[Pfizer]] |1,300 |} ==Transportation== {{See also|Boston transportation}} ===Road=== [[File:Harvard Square at Peabody Street and Mass Avenue.jpg|thumb|[[Massachusetts Avenue (Boston)|Massachusetts Avenue]] in [[Harvard Square]]]] Several major roads lead to Cambridge, including [[Massachusetts State Highway 2|Route 2]], [[Massachusetts State Highway 16|Route 16]] and the [[Massachusetts State Highway 28|McGrath Highway (Route 28)]]. The [[Massachusetts Turnpike]] does not pass through Cambridge, but provides access by an exit in nearby [[Allston, Massachusetts|Allston]]. Both [[U.S. Route 1]] and [[I-93 (MA)]] also provide additional access on the eastern end of Cambridge at Leverett Circle in [[Boston]]. [[Massachusetts State Highway 2A|Route 2A]] runs the length of the city, chiefly along Massachusetts Avenue. The Charles River forms the southern border of Cambridge and is crossed by 11 bridges connecting Cambridge to Boston, including the [[Longfellow Bridge]] and the [[Harvard Bridge]], eight of which are open to motorized road traffic. Cambridge has an irregular street network because many of the roads date from the colonial era. Contrary to popular belief, the road system did not evolve from longstanding cow-paths. Roads connected various village settlements with each other and nearby towns, and were shaped by geographic features, most notably streams, hills, and swampy areas. 
Today, the major "squares" are typically connected by long, mostly straight roads, such as Massachusetts Avenue between [[Harvard Square]] and [[Central Square (Cambridge)|Central Square]], or Hampshire Street between [[Kendall Square]] and [[Inman Square]]. ===Mass transit=== [[File:Central MBTA station.jpg|thumb|[[Central (MBTA)|Central station on the MBTA Red Line]]]] Cambridge is well served by the [[MBTA]], including the [[Porter (MBTA station)|Porter Square stop]] on the regional [[MBTA Commuter Rail|Commuter Rail]], the [[Lechmere (MBTA station)|Lechmere stop]] on the [[Green Line (MBTA)|Green Line]], and five stops on the [[Red Line (MBTA)|Red Line]] ([[Alewife Station (MBTA)|Alewife]], [[Porter (MBTA)|Porter Square]], [[Harvard (MBTA station)|Harvard Square]], [[Central (MBTA station)|Central Square]], and [[Kendall/MIT (MBTA station)|Kendall Square/MIT]]). Alewife Station, the current terminus of the Red Line, has a large multi-story parking garage (at a rate of $7 per day {{as of|lc=y|2009}}).{{cite web|url=http://www.mbta.com/schedules_and_maps/subway/lines/stations/?stopId=10029 |title=> Schedules & Maps > Subway > Alewife Station |publisher=MBTA |date= |accessdate=2012-04-28}} The [[Harvard Bus Tunnel]], under Harvard Square, reduces traffic congestion on the surface, and connects to the Red Line underground. This tunnel was originally opened for streetcars in 1912, and served trackless trolleys and buses as the routes were converted. The tunnel was partially reconfigured when the Red Line was extended to Alewife in the early 1980s. 
Outside of the state-owned transit agency, the city is also served by the Charles River Transportation Management Agency (CRTMA) shuttles, which are supported by some of the largest companies operating in the city, in addition to the municipal government itself.{{cite web |url=http://www.charlesrivertma.org/members.htm |title=Charles River TMA Members |author=Staff writer |date=(As of) January 1, 2013 |work=CRTMA |publisher= |language= |trans_title= |type= |archiveurl= |archivedate= |deadurl= |accessdate=January 1, 2013 |quote= |ref= |separator= |postscript=}} ===Cycling=== Cambridge has several [[bike path]]s, including one along the Charles River,{{cite web|url=http://www.mass.gov/dcr/parks/metroboston/maps/bikepaths_dudley.gif |title=Dr. Paul Dudley White Bikepath |date= |accessdate=2012-04-28}} and the [[Cambridge Linear Park|Linear Park]] connecting the [[Minuteman Bikeway]] at Alewife with the [[Somerville Community Path]]. Bike parking is common and there are bike lanes on many streets, although concerns have been expressed regarding the suitability of many of the lanes. On several central MIT streets, bike lanes transfer onto the sidewalk. 
Cambridge bans cycling on certain sections of sidewalk where pedestrian traffic is heavy.{{cite web|url=http://www.cambridgema.gov/cdd/et/bike/bike_ban.html |title=Sidewalk Bicycling Banned Areas – Cambridge Massachusetts |publisher=Cambridgema.gov |date= |accessdate=2012-04-28}}{{cite web|url=http://www.cambridgema.gov/cdd/et/bike/bike_reg.html |title=Traffic Regulations for Cyclists – Cambridge Massachusetts |publisher=Cambridgema.gov |date=1997-05-01 |accessdate=2012-04-28}} While ''[[Bicycling Magazine]]'' has rated Boston as one of the worst cities in the nation for bicycling (in its words, for "lousy roads, scarce and unconnected bike lanes and bike-friendly gestures from City Hall that go nowhere—such as hiring a bike coordinator in 2001, only to cut the position two years later"),[http://www.bicycling.com/article/1,6610,s1-2-16-14593-11,00.html Urban Treasures – bicycling.com]{{dead link|date=April 2012}} it has given Cambridge an honorable mention as one of the best[http://www.bicycling.com/article/1,6610,s1-2-16-14593-9,00.html Urban Treasures – bicycling.com]{{dead link|date=April 2012}} and called the city "Boston's Great Hope." Cambridge has an active, official bicycle committee. ===Walking=== [[File:Weeks Footbridge Cambridge, MA.jpg|thumb|The [[John W. Weeks Bridge|Weeks Bridge]] provides a pedestrian-only connection between Boston's Allston-Brighton neighborhood and Cambridge over the Charles River]] Walking is a popular activity in Cambridge. Per year 2000 data, of the communities in the U.S. 
with more than 100,000 residents, Cambridge has the highest percentage of commuters who walk to work.{{cite web|url=http://www.bikesatwork.com/carfree/census-lookup.php?state_select=ALL_STATES&lower_pop=100000&upper_pop=99999999&sort_num=2&show_rows=25&first_row=0 |title=The Carfree Census Database: Result of search for communities in any state with population over 100,000, sorted in descending order by % Pedestrian Commuters |publisher=Bikesatwork.com |date= |accessdate=2012-04-28}} Cambridge receives a "Walk Score" of 100 out of 100 possible points.[http://www.walkscore.com/get-score.php?street=cambridge%2C+ma&go=Go Walk Score site] Accessed July 28, 2009 Cambridge's major historic squares have been recently changed into a modern walking landscape, which has sparked a traffic calming program based on the needs of pedestrians rather than of motorists. ===Intercity=== The Boston intercity bus and train stations at [[South Station]], Boston, and [[Logan International Airport]] in [[East Boston]], are accessible by [[Red Line (MBTA)|subway]]. The [[Fitchburg Line]] rail service from [[Porter (MBTA station)|Porter Square]] connects to some western suburbs. Since October 2010, there has also been intercity bus service between [[Alewife (MBTA station)|Alewife Station]] (Cambridge) and [[New York City]].{{cite web|last=Thomas |first=Sarah |url=http://www.boston.com/yourtown/news/cambridge/2010/10/warren_mbta_welcome_world_wide.html |title=NYC-bound buses will roll from Newton, Cambridge |publisher=Boston.com |date=2010-10-19 |accessdate=2012-04-28}} ==Media== ===Newspapers=== Cambridge is served by several weekly newspapers. The most prominent is the ''[[Cambridge Chronicle]]'', which is also the oldest surviving weekly paper in the United States. ===Radio=== Cambridge is home to the following commercially licensed and student-run radio stations: {| class=wikitable |- ! [[Callsign]] !! Frequency !! City/town !! Licensee !! 
Format |- | [[WHRB]] || align=right | 95.3 FM || Cambridge (Harvard) || Harvard Radio Broadcasting Co., Inc. || [[Variety (US radio)|Musical variety]] |- | [[WJIB]] || align=right | 740 AM || Cambridge || Bob Bittner Broadcasting || [[Adult Standards]]/Pop |- | [[WMBR]] || align=right | 88.1 FM || Cambridge (MIT) || Technology Broadcasting Corporation || [[College radio]] |} ===Television=== Cambridge Community Television (CCTV) has served the Cambridge community since its inception in 1988. CCTV operates Cambridge's public access television facility and programs three television channels, 8, 9, and 96 on the Cambridge cable system (Comcast). ===Social media=== As of 2011, a growing number of social media efforts provide means for participatory engagement with the locality of Cambridge, such as Localocracy"Localocracy is an online town common where registered voters using real names can weigh in on local issues." [http://cambridge.localocracy.com/ Localocracy Cambridge, Massachusetts]. Accessed 2011-10-01 and [[foursquare (website)|Foursquare]]. ==Culture, art and architecture== [[File:Fogg.jpg|thumb|[[Fogg Museum]], Harvard]] ===Museums=== * [[Harvard Art Museum]], including the [[Busch-Reisinger Museum]], a collection of Germanic art, the [[Fogg Art Museum]], a comprehensive collection of Western art, and the [[Arthur M. 
Sackler Museum]], a collection of Middle East and Asian art * [[Harvard Museum of Natural History]], including the [[Glass Flowers]] collection * [[Peabody Museum of Archaeology and Ethnology]], Harvard *[[Semitic Museum]], Harvard * [[MIT Museum]] * [[List Visual Arts Center]], MIT ===Public art=== Cambridge has a large and varied collection of permanent public art, both on city property (managed by the Cambridge Arts Council),{{cite web|url=http://www.cambridgema.gov/CAC/Public/overview.cfm |title=CAC Public Art Program |publisher=Cambridgema.gov |date=2007-03-13 |accessdate=2012-04-28}} and on the campuses of Harvard{{cite web|url=http://ofa.fas.harvard.edu/visualarts/pubart.php |title=Office for the Arts at Harvard: Public Art |publisher=Ofa.fas.harvard.edu |date= |accessdate=2012-04-28}} and MIT.{{cite web|url=http://listart.mit.edu/map |title=MIT Public Art Collection Map |publisher=Listart.mit.edu |date= |accessdate=2012-04-28}} Temporary public artworks are displayed as part of the annual Cambridge River Festival on the banks of the Charles River, during winter celebrations in Harvard and Central Squares, and at university campus sites. Experimental forms of public artistic and cultural expression include the Central Square World's Fair, the Somerville-based annual Honk! Festival,{{cite web|url=http://honkfest.org/ |title= Honk Fest}} and [[If This House Could Talk]],{{cite web|url=http://cambridgehistory.org/discover/ifthishousecouldtalk/index.html |title=The Cambridge Historical Society}} a neighborhood art and history event. {{or|date=April 2012}} {{Citation needed|date=April 2012}} An active tradition of street musicians and other performers in Harvard Square entertains an audience of tourists and local residents during the warmer months of the year. 
The performances are coordinated through a public process that has been developed collaboratively by the performers,{{cite web|url=http://www.buskersadvocates.org/ | title= Street Arts & Buskers Advocates}} city administrators, private organizations and business groups.{{cite web|url=http://harvardsquare.com/Home/Arts-and-Entertainment/Street-Arts-and-Buskers-Advocates.aspx |title=Street Arts and Buskers Advocates |publisher=Harvardsquare.com |date= |accessdate=2012-04-28}} [[File:Longfellow National Historic Site, Cambridge, Massachusetts.JPG|thumb|right|The [[Longfellow National Historic Site]]]] [[File:Wfm stata center.jpg|thumb|[[Stata Center]], MIT]] [[File:Simmons Hall, MIT, Cambridge, Massachusetts.JPG|thumb|[[List of MIT undergraduate dormitories|Simmons Hall]], MIT]] ===Architecture=== Despite intensive urbanization during the late 19th century and 20th century, Cambridge has preserved an unusual number of historic buildings, including some dating to the 17th century. The city also contains an abundance of innovative contemporary architecture, largely built by Harvard and MIT. 
;Notable historic buildings in the city include: * The [[Asa Gray House]] (1810) * [[Austin Hall, Harvard University]] (1882–84) * [[Cambridge, Massachusetts City Hall|Cambridge City Hall]] (1888–89) * [[Cambridge Public Library]] (1888) * [[Christ Church, Cambridge]] (1761) * [[Cooper-Frost-Austin House]] (1689–1817) * [[Elmwood (Cambridge, Massachusetts)|Elmwood House]] (1767), residence of the [[President of Harvard University]] * [[First Church of Christ, Scientist (Cambridge, Massachusetts)|First Church of Christ, Scientist]] (1924–30) * [[The First Parish in Cambridge]] (1833) * [[Harvard-Epworth United Methodist Church]] (1891–93) * [[Harvard Lampoon Building]] (1909) * The [[Hooper-Lee-Nichols House]] (1685–1850) * [[Longfellow National Historic Site]] (1759), former home of poet [[Henry Wadsworth Longfellow]] * [[The Memorial Church of Harvard University]] (1932) * [[Memorial Hall, Harvard University]] (1870–77) * [[Middlesex County Courthouse (Massachusetts)|Middlesex County Courthouse]] (1814–48) * [[Urban Rowhouse (40-48 Pearl Street, Cambridge, Massachusetts)|Urban Rowhouse]] (1875) * [[spite house|O'Reilly Spite House]] (1908), built to spite a neighbor who would not sell his adjacent landBloom, Jonathan. (February 2, 2003) [[Boston Globe]] ''[http://nl.newsbank.com/nl-search/we/Archives?p_product=BG&p_theme=bg&p_action=search&p_maxdocs=200&p_topdoc=1&p_text_direct-0=0F907F2342522B5D&p_field_direct-0=document_id&p_perpage=10&p_sort=YMD_date:D Existing by the Thinnest of Margins. A Concord Avenue Landmark Gives New Meaning to Cozy.]'' Section: City Weekly; Page 11. Location: 260 Concord Ave, Cambridge, MA 02138. 
{{See also|List of Registered Historic Places in Cambridge, Massachusetts}} ;Contemporary architecture: * [[List of MIT undergraduate dormitories#Baker House|Baker House]] dormitory, MIT, by Finnish architect [[Alvar Aalto]], one of only two buildings by Aalto in the US * Harvard Graduate Center/Harkness Commons, by [[The Architects Collaborative]] (TAC, with [[Walter Gropius]]) * [[Carpenter Center for the Visual Arts]], Harvard, the only building in North America by [[Le Corbusier]] * [[Kresge Auditorium]], MIT, by [[Eero Saarinen]] * [[MIT Chapel]], by [[Eero Saarinen]] * [[Design Research Building]], by [[Benjamin Thompson and Associates]] * [[American Academy of Arts and Sciences]], by [[Kallmann McKinnell and Wood]], also architects of Boston City Hall * [[Arthur M. Sackler Museum]], Harvard, one of the few buildings in the U.S. by [[James Stirling (architect)|James Stirling]], winner of the [[Pritzker Prize]] * [[Stata Center]], MIT, by [[Frank Gehry]] * [[List of MIT undergraduate dormitories#Simmons Hall|Simmons Hall]], MIT, by [[Steven Holl]] ===Music=== The city has an active music scene from classical performances to the latest popular bands. ==Sister cities== Cambridge has 8 active, official [[Twin towns and sister cities|sister cities]], and an unofficial relationship with [[Cambridge]], England:"A message from the Peace Commission" [http://www.cambridgema.gov/peace/newsandpublications/news/detail.aspx?path=%2fsitecore%2fcontent%2fhome%2fpeace%2fnewsandpublications%2fnews%2f2008%2f02%2finformationoncambridgessistercities]. 
*{{Flagicon|PRT}} [[Coimbra]], [[Portugal]] *{{Flagicon|CUB}} [[Cienfuegos]], [[Cuba]] *{{Flagicon|ITA}} [[Gaeta]], [[Italy]] *{{Flagicon|IRL}} [[Galway]], [[Republic of Ireland|Ireland]] *{{Flagicon|ARM}} [[Yerevan]], [[Armenia]]{{cite web|url=http://www.cysca.org/ |title=Cambridge-Yerevan Sister City Association |publisher=Cysca.org |date= |accessdate=2012-04-28}} *{{Flagicon|SLV}} [[San José Las Flores, Chalatenango|San José Las Flores]], [[El Salvador]] *{{Flagicon|JPN}} [[Tsukuba, Ibaraki|Tsukuba Science City]], Japan *{{Flagicon|POL}} [[Kraków]], [[Poland]] *{{Flagicon|CHN}} [[Haidian District]], [[China]] Ten other official sister city relationships are inactive: [[Dublin]], Ireland; [[Ischia]], [[Catania]], and [[Florence]], Italy; [[Kraków]], Poland; [[Santo Domingo Oeste]], Dominican Republic; [[Southwark]], London, England; [[Yuseong]], Daejeon, Korea; and [[Haidian District|Haidian]], Beijing, China. There has also been an unofficial relationship with: *{{Flagicon|GBR}} [[Cambridge]], England, UK{{cite web|url=http://www.cambridgema.gov/peace/newsandpublications/news/detail.aspx?path=%2fsitecore%2fcontent%2fhome%2fpeace%2fnewsandpublications%2fnews%2f2008%2f02%2finformationoncambridgessistercities |title="Sister Cities", Cambridge Peace Commission |publisher=Cambridgema.gov |date=2008-02-15 |accessdate=2012-07-18}} ==Zip codes== *02138—Harvard Square/West Cambridge *02139—Central Square/Inman Square/MIT *02140—Porter Square/North Cambridge *02141—East Cambridge *02142—Kendall Square ==References== {{reflist|30em}} ==General references== * ''History of Middlesex County, Massachusetts'', [http://books.google.com/books?id=QGolOAyd9RMC&dq=intitle:History+intitle:of+intitle:Middlesex+intitle:County+intitle:Massachusetts&lr=&num=50&as_brr=0&source=gbs_other_versions_sidebar_s&cad=5 Volume 1 (A-H)], 
[http://books.google.com/books?id=hNaAnwRMedUC&pg=PA506&dq=intitle:History+intitle:of+intitle:Middlesex+intitle:County+intitle:Massachusetts&lr=&num=50&as_brr=0#PPA3,M1 Volume 2 (L-W)] compiled by Samuel Adams Drake, published 1879–1880. ** [http://books.google.com/books?id=QGolOAyd9RMC&printsec=titlepage#PPA305,M1 Cambridge article] by Rev. Edward Abbott in volume 1, pages 305–358. *Eliot, Samuel Atkins. ''A History of Cambridge, Massachusetts: 1630–1913''. Cambridge: The Cambridge Tribune, 1913. *Hiestand, Emily. "Watershed: An Excursion in Four Parts" The Georgia Review Spring 1998 pages 7–28 *[[Lucius Robinson Paige|Paige, Lucius]]. ''History of Cambridge, Massachusetts: 1630–1877''. Cambridge: The Riverside Press, 1877. *Survey of Architectural History in Cambridge: Mid Cambridge, 1967, Cambridge Historical Commission, Cambridge, Mass.{{ISBN missing}} *Survey of Architectural History in Cambridge: Cambridgeport, 1971 ISBN 0-262-53013-9, Cambridge Historical Commission, Cambridge, Mass. *Survey of Architectural History in Cambridge: Old Cambridge, 1973 ISBN 0-262-53014-7, Cambridge Historical Commission, Cambridge, Mass. *Survey of Architectural History in Cambridge: Northwest Cambridge, 1977 ISBN 0-262-53032-5, Cambridge Historical Commission, Cambridge, Mass. *Survey of Architectural History in Cambridge: East Cambridge, 1988 (revised) ISBN 0-262-53078-3, Cambridge Historical Commission, Cambridge, Mass. *{{cite book|last=Sinclair|first=Jill|title=Fresh Pond: The History of a Cambridge Landscape|publisher=MIT Press|location=Cambridge, Mass.|date=April 2009|isbn=978-0-262-19591-1 }} *{{cite book|last=Seaburg|first=Alan|title=Cambridge on the Charles|url=http://books.google.com/books?id=c7_oCS782-8C|publisher=Anne Miniver Press|location=Billerica, Mass.|year=2001|author=Seaburg, A. and Dahill, T. 
and Rose, C.H.|isbn=978-0-9625794-9-3}} ==External links== {{Wikivoyage|Cambridge (Massachusetts)}} {{Portal|Boston}} {{Commons category|Cambridge, Massachusetts}} *{{Official website|http://www.cambridgema.gov/}} *[http://www.cambridge-usa.org/ Cambridge Office for Tourism] *[http://www.city-data.com/city/Cambridge-Massachusetts.html City-Data.com] *[http://www.epodunk.com/cgi-bin/genInfo.php?locIndex=2894 ePodunk: Profile for Cambridge, Massachusetts] *{{dmoz|Regional/North_America/United_States/Massachusetts/Localities/C/Cambridge}}
===Maps=== *[http://www.cambridgema.gov/GIS/FindMapAtlas.cfm Cambridge Maps] *[http://www.cambridgema.gov/GIS City of Cambridge Geographic Information System (GIS)] *[http://www.salemdeeds.com/atlases_results.asp?ImageType=index&atlastype=MassWorld&atlastown=&atlas=MASSACHUSETTS+1871&atlas_desc=MASSACHUSETTS+1871 ''1871 Atlas of Massachusetts''.] by Wall & Gray. [http://www.salemdeeds.com/atlases_pages.asp?ImageName=PAGE_0010_0011.jpg&atlastype=MassWorld&atlastown=&atlas=MASSACHUSETTS+1871&atlas_desc=MASSACHUSETTS+1871&pageprefix= Map of Massachusetts.] [http://www.salemdeeds.com/atlases_pages.asp?ImageName=PAGE_0044_0045.jpg&atlastype=MassWorld&atlastown=&atlas=MASSACHUSETTS+1871&atlas_desc=MASSACHUSETTS+1871&pageprefix= Map of Middlesex County.] *Dutton, E.P. [http://maps.bpl.org/details_10717/?srch_query=Dutton%2C+E.P.&srch_fields=all&srch_author=on&srch_style=exact&srch_fa=save&srch_ok=Go+Search Chart of Boston Harbor and Massachusetts Bay with Map of Adjacent Country.] Published 1867. A good map of roads and rail lines around Cambridge. *[http://www.citymap.com/cambridge/index.htm Cambridge Citymap – Community, Business, and Visitor Map.] *[http://docs.unh.edu/towns/CambridgeMassachusettsMapList.htm Old USGS maps of Cambridge area.] 
{{Greater Boston}} {{Middlesex County, Massachusetts}} {{Massachusetts}} {{New England}} {{Massachusetts cities and mayors of 100,000 population}} [[Category:Cambridge, Massachusetts| ]] [[Category:University towns in the United States]] [[Category:County seats in Massachusetts]] [[Category:Populated places established in 1630]] [[Category:Charles River]] [[Category:Place names of English origin in the United States]] [[af:Cambridge, Massachusetts]] [[ar:كامبريدج، ماساتشوستس]] [[zh-min-nan:Cambridge, Massachusetts]] [[be:Горад Кембрыдж, Масачусетс]] [[be-x-old:Кембрыдж (Масачусэтс)]] [[bg:Кеймбридж (Масачузетс)]] [[br:Cambridge (Massachusetts)]] [[ca:Cambridge (Massachusetts)]] [[cs:Cambridge (Massachusetts)]] [[cy:Cambridge, Massachusetts]] [[da:Cambridge (Massachusetts)]] [[de:Cambridge (Massachusetts)]] [[et:Cambridge (Massachusetts)]] [[es:Cambridge (Massachusetts)]] [[eo:Kembriĝo (Masaĉuseco)]] [[eu:Cambridge (Massachusetts)]] [[fa:کمبریج (ماساچوست)]] [[fr:Cambridge (Massachusetts)]] [[gd:Cambridge (MA)]] [[ko:케임브리지 (매사추세츠 주)]] [[hy:Քեմբրիջ (Մասաչուսեթս)]] [[id:Cambridge, Massachusetts]] [[it:Cambridge (Massachusetts)]] [[he:קיימברידג' (מסצ'וסטס)]] [[jv:Cambridge, Massachusetts]] [[kk:Кэмбридж (Массачусетс)]] [[kw:Cambridge, Massachusetts]] [[sw:Cambridge, Massachusetts]] [[ht:Cambridge, Massachusetts]] [[la:Cantabrigia (Massachusetta)]] [[lv:Keimbridža]] [[lb:Cambridge (Massachusetts)]] [[hu:Cambridge (Massachusetts)]] [[mr:केंब्रिज, मॅसेच्युसेट्स]] [[ms:Cambridge, Massachusetts]] [[nl:Cambridge (Massachusetts)]] [[ja:ケンブリッジ (マサチューセッツ州)]] [[no:Cambridge (Massachusetts)]] [[pl:Cambridge (Massachusetts)]] [[pt:Cambridge (Massachusetts)]] [[ro:Cambridge, Massachusetts]] [[ru:Кембридж (Массачусетс)]] [[scn:Cambridge (Massachusetts), USA]] [[simple:Cambridge, Massachusetts]] [[sk:Cambridge (Massachusetts)]] [[sl:Cambridge, Massachusetts]] [[sr:Кембриџ (Масачусетс)]] [[fi:Cambridge (Massachusetts)]] [[sv:Cambridge, Massachusetts]] [[tl:Cambridge, Massachusetts]] 
[[ta:கேம்பிரிஜ், மாசசூசெட்ஸ்]] [[th:เคมบริดจ์ (รัฐแมสซาชูเซตส์)]] [[tg:Кембриҷ (Массачусетс)]] [[tr:Cambridge, Massachusetts]] [[uk:Кембридж (Массачусетс)]] [[vi:Cambridge, Massachusetts]] [[vo:Cambridge (Massachusetts)]] [[war:Cambridge, Massachusetts]] [[yi:קעמברידזש, מאסאטשוסעטס]] [[zh:剑桥 (马萨诸塞州)]] \ No newline at end of file