OpenSearch

Commit Graph

Author	SHA1	Message	Date
David Pilato	931be57da9	[test] Add standalone runner It can sometimes be useful to have a standalone runner to see exactly how Tika extracts content from a given file. You can run the `StandaloneRunner` class using: * `-u file://URL/TO/YOUR/DOC` * `--size` set extracted size (defaults to mapper attachment size) * `BASE64` encoded binary Example: ```sh StandaloneRunner BASE64Text StandaloneRunner -u /tmp/mydoc.pdf StandaloneRunner -u /tmp/mydoc.pdf --size 1000000 ``` It produces something like: ``` ## Extracted text --------------------- BEGIN ----------------------- This is the extracted text ---------------------- END ------------------------ ## Metadata - author: null - content_length: null - content_type: application/pdf - date: null - keywords: null - language: null - name: null - title: null ``` Closes #99. (cherry picked from commit 720b3bf) (cherry picked from commit 990fa15)	2015-02-09 17:45:07 +01:00
David Pilato	c353936b58	Add sonatype snapshot repository	2015-01-02 19:05:18 +01:00
David Pilato	33c9828385	Depend on elasticsearch-parent To simplify plugins maintenance and provide more value in the future, we are starting to build an `elasticsearch-parent` project. This commit is the first step for this plugin to depend on this new `pom` maven project.	2014-12-14 19:59:15 +01:00
David Pilato	c338ae0dbe	[Test] copyToByteArray has been removed in master	2014-12-03 18:42:14 +01:00
David Pilato	e3d80af54e	Test: Fix removed queryString -> queryStringQuery	2014-12-03 18:31:53 +01:00
Adrien Grand	11b1287610	Upgrade to Lucene 5.0.0-snapshot-1642891	2014-12-02 18:16:59 +01:00
Colin Goodheart-Smithe	bbd4a62e50	Updated AttachmentMapper to work with new validation in ES 2.0	2014-11-28 16:04:31 +00:00
Michael McCandless	abb03dc3d9	Upgrade to Lucene 5.0.0-snapshot-1641343	2014-11-24 05:51:40 -05:00
Michael McCandless	55042f0f23	Upgrade to Lucene 5.0.0-snapshot-1637347	2014-11-10 16:45:44 -05:00
Robert Muir	4c1b27f544	upgrade to lucene 5 snapshot	2014-11-05 16:48:10 -05:00
tlrx	a5ed51533c	update documentation with release 2.4.1	2014-11-05 20:38:24 +01:00
Jun Ohtani	94880aae3e	Tests: thread leaks detected * exclude StandaloneTest.class from test target * add cleanup to MultifieldAttachementMapperTests for terminating ThreadPool * Modify MapperTestUtils.newMapperService for adding ThreadPool Closes #88	2014-11-03 02:22:45 +09:00
Jun Ohtani	d3f2df6d62	Tests: Fix randomizedtest fail Closes #90	2014-11-03 02:15:59 +09:00
Michael McCandless	4dae1879ad	Upgrade to Lucene 4.10.2	2014-10-30 05:55:35 -04:00
David Pilato	a0d7aafdac	Fix test Related to #89	2014-10-27 22:18:50 +01:00
David Pilato	92bdc23c78	Fix test Related to #89	2014-10-27 22:13:15 +01:00
David Pilato	faf34d745d	Fix test Related to #89	2014-10-27 22:08:41 +01:00
David Pilato	d08e9c7080	Test: add a standalone tool which processes content This tool is a simple main class which can be used to test what is extracted from a given binary file or from its base64 equivalent. You can give the BASE64 content as the first argument. Available options: -u file:/URL/TO/YOUR/DOC (in place of BASE64 content) -s set extracted size (defaults to mapper attachment size) Examples: ``` StandaloneTest BASE64Text StandaloneTest BASE64Text -s 1000000 StandaloneTest -u /tmp/mydoc.pdf StandaloneTest -u /tmp/mydoc.pdf -s 1000000 ``` Closes #89.	2014-10-27 22:01:22 +01:00
David Pilato	c3bf3b1ce9	Tests: AnalysisService constructor signature change Due to this [change](https://github.com/elasticsearch/elasticsearch/pull/8018), we need to fix our tests for elasticsearch 1.4.0 and above. Closes #87. (cherry picked from commit b3b0d34)	2014-10-15 13:05:41 +02:00
David Pilato	03b47d5a4c	update documentation with release 2.4.0	2014-10-08 18:50:20 +02:00
mikemccand	2ff4eb58d6	Upgrade to Lucene 4.10.1	2014-09-28 17:57:06 -04:00
Michael McCandless	67a2548441	Upgrade to Lucene 4.10.1 snapshot	2014-09-24 17:10:08 -04:00
David Pilato	eef6b61806	Create branch es-1.4 for elasticsearch 1.4.0	2014-09-12 16:08:59 +02:00
David Pilato	ba74fc2b5e	Remove netcdf support Sadly, the netcdf library is not Apache2 License compatible, so we should not package it anymore. Users who want to use it can manually add the [netcdf libraries](http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/) to the `plugins/mapper-attachments` dir and they will get the support back. Closes #84.	2014-09-08 23:51:01 +02:00
David Pilato	888d79075e	Update to Lucene 4.10.0 Closes #85.	2014-09-08 23:47:15 +02:00
David Pilato	20ee711436	parseMultiField() method signature change in es 1.4 and master As seen with https://github.com/elasticsearch/elasticsearch/pull/7474, we need to update mapper attachment plugin with this new signature. Closes #83.	2014-09-04 11:23:09 +02:00
David Pilato	c0d053d283	Update to elasticsearch 1.4 Related to #77 (cherry picked from commit ad1742a)	2014-09-01 10:26:38 +02:00
David Pilato	34fe111a2b	update documentation with release 2.3.2	2014-09-01 09:53:26 +02:00
David Pilato	87b38c54eb	Unable to extract text from Word documents With issue #80 we explicitly removed the Apache POI dependency provided by Tika and replaced it with a more recent one. Sadly, we forgot to add this new dependency to the assembly, so the final ZIP file does not contain POI related jars. Closes #82. (cherry picked from commit 49793d5)	2014-09-01 09:41:57 +02:00
David Pilato	cc1a43b5c3	update documentation with release 2.3.1	2014-08-18 21:52:53 +02:00
David Pilato	08454d72f6	update documentation with release 2.2.1	2014-08-18 21:39:31 +02:00
David Pilato	2b172f8ff6	Update a few dependencies Related to #80.	2014-08-18 17:49:36 +02:00
David Pilato	5cf20331a8	Update to elasticsearch 1.4.0 Related to #77. (cherry picked from commit 7e65cfb)	2014-08-18 15:39:19 +02:00
David Pilato	75d03621aa	Update a few dependencies Related to #80. (cherry picked from commit 89d5460)	2014-08-18 15:37:03 +02:00
David Pilato	587e6d3da2	Docs: make the welcome page more obvious Closes #79.	2014-08-18 12:38:03 +02:00
David Pilato	f8d2975946	Update a few dependencies Closes #80. (cherry picked from commit 930c8be)	2014-08-18 12:27:23 +02:00
David Pilato	6edf3447b1	Remove old `content` deprecated field In #73, we deprecated `content` field in favor of `_content` field. In plugin version 2.4.0, we can now remove the old field name. Closes #75. (cherry picked from commit 7a0f838)	2014-07-26 00:33:50 +02:00
David Pilato	e704f68525	Log tika exceptions Currently tika exceptions are swallowed with no log message. We'd like to be able to know when/if this occurs and for what reason. Closes #78. (cherry picked from commit 36b0117)	2014-07-26 00:27:49 +02:00
David Pilato	ad986eb2fc	Add support for multi-fields Now that https://github.com/elasticsearch/elasticsearch/pull/6867 is merged in elasticsearch core code (branch 1.x - es 1.4), we can support multi fields in the mapper attachment plugin. ``` DELETE /test PUT /test { "settings": { "number_of_shards": 1 } } PUT /test/person/_mapping { "person": { "properties": { "file": { "type": "attachment", "path": "full", "fields": { "file": { "type": "string", "fields": { "store": { "type": "string", "store": true } } }, "content_type": { "type": "string", "fields": { "store": { "type": "string", "store": true }, "untouched": { "type": "string", "index": "not_analyzed", "store": true } } } } } } } } PUT /test/person/1?refresh=true { "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg==" } GET /test/person/_search { "fields": [ "file.store", "file.content_type.store" ], "aggs": { "store": { "terms": { "field": "file.content_type.store" } }, "untouched": { "terms": { "field": "file.content_type.untouched" } } } } ``` It gives: ```js { "took": 3, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 1, "max_score": 1, "hits": [ { "_index": "test", "_type": "person", "_id": "1", "_score": 1, "fields": { "file.store": [ "\"God Save the Queen\" (alternatively \"God Save the King\"\n" ], "file.content_type.store": [ "text/plain; charset=ISO-8859-1" ] } } ] }, "aggregations": { "store": { "doc_count_error_upper_bound": 0, "buckets": [ { "key": "1", "doc_count": 1 }, { "key": "8859", "doc_count": 1 }, { "key": "charset", "doc_count": 1 }, { "key": "iso", "doc_count": 1 }, { "key": "plain", "doc_count": 1 }, { "key": "text", "doc_count": 1 } ] }, "untouched": { "doc_count_error_upper_bound": 0, "buckets": [ { "key": "text/plain; charset=ISO-8859-1", "doc_count": 1 } ] } } } ``` Note that using a shorter definition works as well: ``` DELETE /test PUT /test { "settings": { "number_of_shards": 1 } } PUT /test/person/_mapping { "person": { "properties": { "file": { "type": "attachment" } } } } PUT /test/person/1?refresh=true { "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg==" } GET /test/person/_search { "query": { "match": { "file": "king" } } } ``` gives: ```js { "took": 53, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 1, "max_score": 0.095891505, "hits": [ { "_index": "test", "_type": "person", "_id": "1", "_score": 0.095891505, "_source": { "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg==" } } ] } } ``` Closes #57. (cherry picked from commit 432d7c0)	2014-07-26 00:27:28 +02:00
David Pilato	663d4eaddb	Update to elasticsearch 1.4.0 Closes #77. (cherry picked from commit c58516f)	2014-07-26 00:26:41 +02:00
David Pilato	eaccd4383d	Deprecate `content` by `_content` When we want to force some values, we need to set those using `_field` where `field` is the field name we want to force: ``` { "file": { "_name": "myfilename.txt" } } ``` But to set the content itself, we use `content` field name. ``` { "file": { "content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu", "_name": "myfilename.txt" } } ``` For consistency, we set `_content` instead: ``` { "file": { "_content": "VGhpcyBpcyBhbiBlbGFzdGljc2VhcmNoIG1hcHBlciBhdHRhY2htZW50IHRlc3Qu", "_name": "myfilename.txt" } } ``` Closes #73. (cherry picked from commit 2e6be20)	2014-07-25 18:15:37 +02:00
David Pilato	1d1225b87c	Update to Lucene 4.9.0 Update to elasticsearch 1.3.0 Move to java 1.7 Related to #67. Closed #76. (cherry picked from commit 2303932)	2014-07-25 18:15:28 +02:00
David Pilato	310df36bfa	SLF4J dependency version problem This is due to the `edu.ucar:netcdf` lib which comes from the `tika-parsers` dependency. ``` [INFO] +- org.apache.tika:tika-parsers:jar:1.5:compile [INFO] | +- edu.ucar:netcdf:jar:4.2-min:compile [INFO] | | \- org.slf4j:slf4j-api:jar:1.5.6:compile ``` We can exclude this library from the generated ZIP artifact. Closes #41.	2014-06-14 18:56:14 +02:00
David Pilato	51a8f6f1a0	Fix doc typo (cherry picked from commit f70eb1d)	2014-06-03 10:13:12 +02:00
David Pilato	a3bb103297	Remove deprecated `language` forced field With #68 we replaced the `language` field with `_language`. We can now remove the old deprecated name. Closes #69. (cherry picked from commit e39f144)	2014-06-03 10:11:13 +02:00
David Pilato	94cf141108	Use `_language` field instead of `language` When we want to force a language instead of using Tika language detection, we set the `language` field in documents. To be consistent with the other forced fields, `_content_type` and `_name`, we should prefix the `language` field with an underscore `_`. So `language` becomes `_language`. We first deprecate `language` in version 2.1.0 and we remove it in 2.3.0. Closes #68. (cherry picked from commit 2f46343)	2014-06-03 10:10:49 +02:00
David Pilato	7c1c2011bc	Update to elasticsearch 1.3.0 Closes #67. (cherry picked from commit d3eaac9)	2014-06-03 09:49:41 +02:00
David Pilato	c0e7795f1f	Update to elasticsearch 1.2.0 Closes #66. (cherry picked from commit fb3b288)	2014-06-03 09:49:13 +02:00
David Pilato	4b35501cf3	Setting "_content_type" in indexing request has no effect Example below. I set the type as text/plain but it is identified as text/html. ```sh #!/bin/sh echo "\n\n Delete testidx \n" curl -XDELETE "http://localhost:9200/testidx" echo "\n\n Create index and mapping \n" curl -XPUT "http://localhost:9200/testidx" -d' { "mappings": { "session": { "properties": { "Content": { "properties": { "content": { "type": "attachment", "path": "full", "store": "yes", "fields": { "content": { "type": "string", "store": "yes" }, "author": { "type": "string", "store": "yes" }, "title": { "type": "string", "store": "yes" }, "name": { "type": "string", "store": "yes" }, "date": { "type": "date", "format": "dateOptionalTime", "store": "yes" }, "keywords": { "type": "string", "store": "yes" }, "content_type": { "type": "string", "store": "yes" }, "content_length": { "type": "integer", "store": "yes" } } } } } } } } }' echo "\n\n Index document \n" curl -XPOST "http://localhost:9200/_bulk" -d' {"index":{"_index":"testidx","_type":"session"}} {"Content":[{"_content_type":"text/plain","content":"BASE64ENCODED_CONTENT"}]} ' echo "\n\n Refresh \n" curl -XPOST "http://localhost:9200/testidx/_refresh" echo "\n\n Get doc type \n" curl -XPOST "http://localhost:9200/testidx/_search?pretty" -d' { "fields": ["Content.content.content_type","Content.content.content_length","Content.content"] }' ``` Closes #65. (cherry picked from commit 38075dc)	2014-06-03 09:36:10 +02:00
David Pilato	7f8143ff12	Add highlighting documentation Closes #54. (cherry picked from commit efdf8ef)	2014-06-03 09:35:05 +02:00


132 Commits
