OpenSearch

mirror of https://github.com/honeymoose/OpenSearch.git synced 2025-02-08 05:58:44 +00:00

Author	SHA1	Message	Date
Dimitris Athanasiou	8f49d01113	[7.x][ML] Rename df-analytics `_id_copy` to `ml__id_copy` (#43754 ) (#43783 ) Renames `_id_copy` to `ml__id_copy` as field names starting with underscore are deprecated. The new field name `ml__id_copy` was chosen as an obscure enough field that users won't have in their data. Otherwise, this field is only intended to be used by df-analytics.	2019-06-30 19:37:00 +03:00
David Roberts	b599c68d23	[ML] Assert that a no-op job creates no results nor state (#43681 ) If a job is opened and then closed and does nothing in between then it should not persist any results or state documents. This change adapts the no-op job test to assert no results in addition to no state, and to log any documents that cause this assertion to fail. Relates elastic/ml-cpp#512 Relates #43680	2019-06-29 14:57:49 +01:00
Ryan Ernst	28ab77a023	Add StreamableResponseAction to aid in deprecation of Streamable (#43770 ) The Action base class currently works for both Streamable and Writeable response types. This commit introduces StreamableResponseAction, for which only the legacy Action implementations which provide newResponse() will extend. This eliminates the need for overriding newResponse() with an UnsupportedOperationException. relates #34389	2019-06-28 21:40:00 -07:00
David Roberts	7951c63b91	[ML] Mark ml-cpp dependency as regularly changing (#43760 ) Since #41817 was merged the ml-cpp zip file for any given version has been cached indefinitely by Gradle. This is problematic, particularly in the case of the master branch where the version 8.0.0-SNAPSHOT will be in use for more than a year. This change tells Gradle that the ml-cpp zip file is a "changing" dependency, and to check whether it has changed every two hours. Two hours is a compromise between checking on every build and annoying developers with slow internet connections and checking rarely causing bug fixes in the ml-cpp code to take a long time to propagate through to elasticsearch PRs that rely on them.	2019-06-28 21:21:18 +01:00
Dimitris Athanasiou	cab879118d	[7.x][ML] Support multiple source indices for df-analytics (#43702 ) (#43731 ) This commit adds support for multiple source indices. In order to deal with multiple indices having different mappings, it attempts a best-effort approach to merge the mappings assuming there are no conflicts. If conflicts exist, an error will be returned. To allow users to create custom mappings for special use cases, the destination index is now allowed to exist before the analytics job runs. In addition, settings are no longer copied except for the `index.number_of_shards` and `index.number_of_replicas`.	2019-06-28 13:28:03 +03:00
Przemysław Witek	94f18da5df	Add version and create_time to data frame analytics config (#43683 ) (#43712 )	2019-06-28 07:37:21 +02:00
Igor Motov	3607876a71	Geo: Makes coordinate validator in libs/geo pluggable (#43657 ) Moves coordinate validation from Geometry constructors into parser. Relates #43644	2019-06-27 19:53:41 -04:00
Przemysław Witek	68dbbd8793	Deduplicate two similar TimeUtils classes. (#43697 ) * Deduplicate org.elasticsearch.xpack.core.dataframe.utils.TimeUtils and org.elasticsearch.xpack.core.ml.utils.time.TimeUtils into a common class: org.elasticsearch.xpack.core.common.time.TimeUtils. * Add unit tests for parseTimeField and parseTimeFieldToInstant methods	2019-06-27 18:51:48 +02:00
David Roberts	f39619d182	[ML] Don't write timing stats on no-op (#43680 ) Similar to elastic/ml-cpp#512, if a job opens and closes and does nothing in between we shouldn't write timing stats to the results index.	2019-06-27 16:37:54 +01:00
Przemysław Witek	ba518722a2	[7.x] [ML] Tag destination index with data frame metadata (#43567 ) (#43660 )	2019-06-27 08:08:39 +02:00
David Roberts	31dc5b7d3a	[TEST] Wait for replicas before stopping nodes in ML distributed test (#43622 ) If we stop a node before replicas exist then the test can fail because we lose a whole index if we stop the node with the primary on.	2019-06-26 11:52:53 +01:00
David Roberts	558e323c89	[ML] Introduce a setting for the process connect timeout (#43234 ) This change introduces a new setting, xpack.ml.process_connect_timeout, to enable the timeout for one of the external ML processes to connect to the ES JVM to be increased. The timeout may need to be increased if many processes are being started simultaneously on the same machine. This is unlikely in clusters with many ML nodes, as we balance the processes across the ML nodes, but can happen in clusters with a single ML node and a high value for xpack.ml.node_concurrent_job_allocations.	2019-06-26 09:22:04 +01:00
Dimitris Athanasiou	126c2fd2d5	[7.x][ML] Machine learning data frame analytics (#43544 ) (#43592 ) This merges the initial work that adds a framework for performing machine learning analytics on data frames. The feature is currently experimental and requires a platinum license. Note that the original commits can be found in the `feature-ml-data-frame-analytics` branch. A new set of APIs is added which allows the creation of data frame analytics jobs. Configuration allows specifying different types of analysis to be performed on a data frame. At first there is support for outlier detection. The APIs are: - PUT _ml/data_frame/analysis/{id} - GET _ml/data_frame/analysis/{id} - GET _ml/data_frame/analysis/{id}/_stats - POST _ml/data_frame/analysis/{id}/_start - POST _ml/data_frame/analysis/{id}/_stop - DELETE _ml/data_frame/analysis/{id} When a data frame analytics job is started a persistent task is created and started. The main steps of the task are: 1. reindex the source index into the dest index 2. analyze the data through the data_frame_analyzer c++ process 3. merge the results of the process back into the destination index In addition, an evaluation API is added which packages commonly used metrics that provide evaluation of various analyses: - POST _ml/data_frame/_evaluate	2019-06-25 20:29:11 +03:00
Przemysław Witek	b15e40ffad	Extract TimingStats-related functionality into TimingStatsReporter (#43371 ) (#43557 )	2019-06-25 15:48:39 +02:00
David Roberts	9c285ddbab	[ML] Improve message when native controller cannot connect (#43565 ) The error message if the native controller failed to run (for example due to running Elasticsearch on an unsupported platform) was not easy to understand. This change removes pointless detail from the message and adds some hints about likely causes. Fixes #42341	2019-06-25 12:06:54 +01:00
Martijn van Groningen	101cf384ba	Replace Streamable w/ Writable in AcknowledgedResponse and subclasses (backport 7.x) (#43525 ) This commit replaces usages of Streamable with Writeable for the AcknowledgedResponse and its subclasses, plus associated actions. Note that where possible response fields were made final and default constructors were removed. This is a large PR, but the change is mostly mechanical. Relates to #34389 Backport of #43414	2019-06-24 13:47:37 +02:00
David Kyle	73221d2265	[ML] Resolve NetworkDisruptionIT (#43441 ) After the network disruption a partition is created, one side of which can form a cluster while the other can't. Ensure requests are sent to a node on the correct side of the cluster.	2019-06-21 10:24:02 +01:00
Jason Tedor	1f1a035def	Remove stale test logging annotations (#43403 ) This commit removes some very old test logging annotations that appeared to be added to investigate test failures that are long since closed. If these are needed, they can be added back on a case-by-case basis with a comment associating them to a test failure.	2019-06-19 22:58:22 -04:00
Lee Hinman	d81ce9a647	Return 0 for negative "free" and "total" memory reported by the OS (#42725 ) * Return 0 for negative "free" and "total" memory reported by the OS We've had a situation where the MX bean reported negative values for the free memory of the OS, in those rare cases we want to return a value of 0 rather than blowing up later down the pipeline. In the event that there is a serialization or creation error with regard to memory use, this adds asserts so the failure will occur as soon as possible and give us a better location for investigation. Resolves #42157 * Fix test passing in invalid memory value * Fix another test passing in invalid memory value * Also change mem check in MachineLearning.machineMemoryFromStats * Add background documentation for why we prevent negative return values * Clarify comment a bit more	2019-06-19 10:35:48 -06:00
Przemysław Witek	86b58d9ff3	Rename AutoDetectResultProcessor* to AutodetectResultProcessor* for consistency with other classes where the spelling is "Autodetect" (#43359 ) (#43366 )	2019-06-19 15:31:26 +02:00
Jason Tedor	fa09113080	Remove trace logging for ML native multi-node tests This trace logging looks like it was copy/pasted from another test, where the logging in that test was only added to investigate a test failure. This commit removes the trace logging.	2019-06-18 22:28:27 -04:00
Igor Motov	9f7d1ff2de	Geo: Add coerce support to libs/geo WKT parser (#43273 ) Adds support for coercing not closed polygons and ignoring Z value to libs/geo WKT parser. Closes #43173	2019-06-18 14:41:01 -04:00
Alpar Torok	94930d0e84	Testclusters: convert ml qa tests (#43229 ) * Testclusters: convert ml qa tests This PR converts the ML tests to use testclusters.	2019-06-18 11:55:11 +03:00
David Roberts	da97325790	[ML] Speed up persistent task rechecks in ML failover tests (#43291 ) The ML failover tests sometimes need to wait for jobs to be assigned to new nodes following a node failure. They wait 10 seconds for this to happen. However, if the node that failed was the master node and a new master was elected then this 10 seconds might not be long enough as a refresh of the memory stats will delay job assignment. Once the memory refresh completes the persistent task will be assigned when the next cluster state update occurs or after the periodic recheck interval, which defaults to 30 seconds. Rather than increase the length of the wait for assignment to 31 seconds, this change decreases the periodic recheck interval to 1 second. Fixes #43289	2019-06-18 09:19:20 +01:00
David Roberts	3effe264da	[ML] Fix problem with lost shards in distributed failure test (#43153 ) We were stopping a node in the cluster at a time when the replica shards of the .ml-state index might not have been created. This change moves the wait for green status to a point where the .ml-state index exists. Fixes #40546 Fixes #41742 Forward port of #43111	2019-06-17 09:28:56 +01:00
Przemysław Witek	b2613a123d	[7.x] Report exponential_avg_bucket_processing_time which gives more weight to recent buckets (#43189 ) (#43263 )	2019-06-17 08:58:26 +02:00
David Roberts	3928c624a3	[ML] Close sample stream in post_data endpoint (#43235 ) A static code analysis revealed that we are not closing the input stream in the post_data endpoint. This actually makes no difference in practice, as the particular InputStream implementation in this case is org.elasticsearch.common.bytes.BytesReferenceStreamInput and its close() method is a no-op. However, it is good practice to close the stream anyway.	2019-06-14 17:54:54 +01:00
Przemysław Witek	65a584b6fb	[7.x] Report timing stats as part of the Job stats response (#42709 ) (#43193 )	2019-06-14 09:03:14 +02:00
Jason Tedor	5bc3b7f741	Enable node roles to be pluggable (#43175 ) This commit introduces the possibility for a plugin to introduce additional node roles.	2019-06-13 15:15:48 -04:00
Ryan Ernst	c3ce3f6891	Add native code info to ML info api (#43172 ) The machine learning feature of xpack has native binaries with a different commit id than the rest of code. It is currently exposed in the xpack info api. This commit adds that commit information to the ML info api, so that it may be removed from the info api.	2019-06-13 11:38:58 -07:00
David Roberts	43665183c2	[ML] Restrict detection of epoch timestamps in find_file_structure (#43188 ) Previously 10 digit numbers were considered candidates to be timestamps recorded as seconds since the epoch and 13 digit numbers as timestamps recorded as milliseconds since the epoch. However, this meant that we could detect these formats for numbers that would represent times far in the future. As an example ISBN numbers starting with 9 were detected as milliseconds since the epoch since they had 13 digits. This change tweaks the logic for detecting such timestamps to require that they begin with 1 or 2. This means that numbers that would represent times beyond about 2065 are no longer detected as epoch timestamps. (We can add 3 to the definition as we get closer to the cutoff date.)	2019-06-13 13:15:41 +01:00
Dimitris Athanasiou	b28e006f7c	[ML] Lock down extraction method when possible (#43104 ) (#43140 )	2019-06-12 14:07:17 +03:00
Ryan Ernst	172cd4dbfa	Remove description from xpack feature sets (#43065 ) The description field of xpack featuresets is optionally part of the xpack info api, when using the verbose flag. However, this information is unnecessary, as it is better left for documentation (and the existing descriptions do not describe anything meaningful). This commit removes the description field from feature sets.	2019-06-11 09:22:58 -07:00
David Roberts	d3136f99e6	[ML] Fix race condition when closing time checker (#43098 ) The tests for the ML TimeoutChecker rely on threads not being interrupted after the TimeoutChecker is closed. This change ensures this by making the close() and setTimeoutExceeded() methods synchronized so that the code inside them cannot execute simultaneously. Fixes #43097	2019-06-11 16:39:17 +01:00
Benjamin Trent	79052050bf	[ML] Adding support for geo_shape, geo_centroid, geo_point in datafeeds (#42969 ) (#43069 ) * [ML] Adding support for geo_shape, geo_centroid, geo_point in datafeeds * only supporting doc_values for geo_point fields * moving validation into GeoPointField ctor	2019-06-10 21:52:53 -05:00
Benjamin Trent	eadfe05587	[ML] Changes slice specification to auto. See #42996 (#43039 ) (#43070 )	2019-06-10 21:52:22 -05:00
Dimitris Athanasiou	76a92b49a8	[ML] Get resources action should be lenient when sort field is unmapped (#42991 ) (#43046 ) Get resources action sorts on the resource id. When there are no resources at all, then it is possible the index does not contain a mapping for the resource id field. In that case, the search api fails by default. This commit adjusts the search request to ignore unmapped fields. Closes elastic/kibana#37870	2019-06-10 19:50:19 +03:00
Alan Woodward	8e23e4518a	Move construction of custom analyzers into AnalysisRegistry (#42940 ) Both TransportAnalyzeAction and CategorizationAnalyzer have logic to build custom analyzers for index-independent analysis. A lot of this code is duplicated, and it requires the AnalysisRegistry to expose a number of internal provider classes, as well as making some assumptions about when analysis components are constructed. This commit moves the build logic directly into AnalysisRegistry, reducing the registry's API surface considerably.	2019-06-10 14:33:25 +01:00
David Turner	68339f90e9	Mute AutodetectMemoryLimitIT#testTooManyPartitions Relates #43013	2019-06-10 09:20:36 +01:00
Henning Andersen	dea935ac31	Reindex max_docs parameter name (#42942 ) Previously, a reindex request had two different size specifications in the body: * Outer level, determining the maximum documents to process * Inside the source element, determining the scroll/batch size. The outer level size has now been renamed to max_docs to avoid confusion and clarify its semantics, with backwards compatibility and deprecation warnings for using size. Similarly, the size parameter has been renamed to max_docs for update/delete-by-query to keep the 3 interfaces consistent. Finally, all 3 endpoints now support max_docs in both body and URL. Relates #24344	2019-06-07 12:16:36 +02:00
David Roberts	40c827a3b8	[ML] Close sample stream in find_file_structure endpoint (#42896 ) A static code analysis revealed that we are not closing the input stream in the find_file_structure endpoint. This actually makes no difference in practice, as the particular InputStream implementation in this case is org.elasticsearch.common.bytes.BytesReferenceStreamInput and its close() method is a no-op. However, it is good practice to close the stream anyway.	2019-06-06 11:03:45 +01:00
David Roberts	b202a59f88	[ML] Add earliest and latest timestamps to field stats (#42890 ) This change adds the earliest and latest timestamps into the field stats for fields of type "date" in the output of the ML find_file_structure endpoint. This will enable the cards for date fields in the file data visualizer in the UI to be made to look more similar to the cards for date fields in the index data visualizer in the UI.	2019-06-06 08:58:35 +01:00
David Roberts	d5baedb789	[ML] Change dots in CSV column names to underscores (#42839 ) Dots in the column names cause an error in the ingest pipeline, as dots are special characters in ingest pipeline. This PR changes dots into underscores in CSV field names suggested by the ML find_file_structure endpoint _unless_ the field names are specifically overridden. The reason for allowing them in overrides is that fields that are not mentioned in the ingest pipeline can contain dots. But it's more consistent that the default behaviour is to replace them all. Fixes elastic/kibana#26800	2019-06-05 11:28:33 +01:00
Mark Vieira	e44b8b1e2e	[Backport] Remove dependency substitutions 7.x (#42866 ) * Remove unnecessary usage of Gradle dependency substitution rules (#42773) (cherry picked from commit 12d583dbf6f7d44f00aa365e34fc7e937c3c61f7)	2019-06-04 13:50:23 -07:00
David Roberts	b61202b0a8	[ML] Add a limit on line merging in find_file_structure (#42501 ) When analysing a semi-structured text file the find_file_structure endpoint merges lines to form multi-line messages using the assumption that the first line in each message contains the timestamp. However, if the timestamp is misdetected then this can lead to excessive numbers of lines being merged to form massive messages. This commit adds a line_merge_size_limit setting (default 10000 characters) that halts the analysis if a message bigger than this is created. This prevents significant CPU time being spent subsequently trying to determine the internal structure of the huge bogus messages.	2019-06-03 13:45:51 +01:00
David Roberts	10aca87389	[ML] Better detection of binary input in find_file_structure (#42707 ) This change helps to prevent the situation where a binary file uploaded to the find_file_structure endpoint is detected as being text in the UTF-16 character set, and then causes a large amount of CPU to be spent analysing the bogus text structure. The approach is to check the distribution of zero bytes between odd and even file positions, on the grounds that UTF-16BE or UTF-16LE would have a very skewed distribution.	2019-06-03 12:47:22 +01:00
David Roberts	48dc0dca57	[ML] Use map and filter instead of flatMap in find_file_structure (#42534 ) Using map and filter avoids the garbage from all the Stream.of calls that flatMap necessitated. Performance is better when there are masses of fields.	2019-05-24 20:12:06 +01:00
David Roberts	34de68b007	[ML] Fix possible race condition when closing an opening job (#42506 ) This change fixes a race condition that would result in an in-memory data structure becoming out-of-sync with persistent tasks in cluster state. If repeated often enough this could result in it being impossible to open any ML jobs on the affected node, as the master node would think the node had capacity to open another job but the chosen node would error during the open sequence due to its in-memory data structure being full. The race could be triggered by opening a job and then closing it a tiny fraction of a second later. It is unlikely a user of the UI could open and close the job that fast, but a script or program calling the REST API could. The nasty thing is, from the externally observable states and stats everything would appear to be fine - the fast open then close sequence would appear to leave the job in the closed state. It's only later that the leftovers in the in-memory data structure might build up and cause a problem.	2019-05-24 20:11:58 +01:00
David Roberts	f472186b9f	[ML] Improve file structure finder timestamp format determination (#41948 ) This change contains a major refactoring of the timestamp format determination code used by the ML find file structure endpoint. Previously timestamp format determination was done separately for each piece of text supplied to the timestamp format finder. This had the drawback that it was not possible to distinguish dd/MM and MM/dd in the case where both numbers were 12 or less. In order to do this sensibly it is best to look across all the available timestamps and see if one of the numbers is greater than 12 in any of them. This necessitates making the timestamp format finder an instantiable class that can accumulate evidence over time. Another problem with the previous approach was that it was only possible to override the timestamp format to one of a limited set of timestamp formats. There was no way out if a file to be analysed had a timestamp that was sane yet not in the supported set. This is now changed to allow any timestamp format that can be parsed by a combination of these Java date/time formats: yy, yyyy, M, MM, MMM, MMMM, d, dd, EEE, EEEE, H, HH, h, mm, ss, a, XX, XXX, zzz Additionally S letter groups (fractional seconds) are supported providing they occur after ss and separated from the ss by a dot, comma or colon. Spacing and punctuation is also permitted with the exception of the question mark, newline and carriage return characters, together with literal text enclosed in single quotes. 
The full list of changes/improvements in this refactor is: - Make TimestampFormatFinder an instantiable class - Overrides must be specified in Java date/time format - Joda format is no longer accepted - Joda timestamp formats in outputs are now derived from the determined or overridden Java timestamp formats, not stored separately - Functionality for determining the "best" timestamp format in a set of lines has been moved from TextLogFileStructureFinder to TimestampFormatFinder, taking advantage of the fact that TimestampFormatFinder is now an instantiable class with state - The functionality to quickly rule out some possible Grok patterns when looking for timestamp formats has been changed from using simple regular expressions to the much faster approach of using the Shift-And method of sub-string search, but using an "alphabet" consisting of just 1 (representing any digit) and 0 (representing non-digits) - Timestamp format overrides are now much more flexible - Timestamp format overrides that do not correspond to a built-in Grok pattern are mapped to a %{CUSTOM_TIMESTAMP} Grok pattern whose definition is included within the date processor in the ingest pipeline - Grok patterns that correspond to multiple Java date/time patterns are now handled better - the Grok pattern is accepted as matching broadly, and the required set of Java date/time patterns is built up considering all observed samples - As a result of the more flexible acceptance of Grok patterns, when looking for the "best" timestamp in a set of lines timestamps are considered different if they are preceded by a different sequence of punctuation characters (to prevent timestamps far into some lines being considered similar to timestamps near the beginning of other lines) - Out-of-the-box Grok patterns that are considered now include %{DATE} and %{DATESTAMP}, which have indeterminate day/month ordering - The order of day/month in formats with indeterminate day/month order is determined by considering all observed samples (plus the server locale if the observed samples still do not suggest an ordering) Relates #38086 Closes #35137 Closes #35132	2019-05-24 09:10:08 +01:00
Dimitris Athanasiou	a6eb20ad35	[ML] Include node name when native controller cannot start process (#42225 ) (#42338 ) This adds the node name where we fail to start a process via the native controller to facilitate debugging as otherwise it might not be known to which node the job was allocated.	2019-05-22 12:42:04 +03:00
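
The epoch-timestamp rule described in the find_file_structure commit (#43188) in the list above can be illustrated with a short sketch. This is not the project's Java code or its Grok patterns; the function name `classify_epoch` and both regular expressions are hypothetical, written only to show the stated rule: 10 digits for seconds, 13 digits for milliseconds, and a leading 1 or 2 in either case.

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns that mirror the rule in the commit message;
# the real endpoint uses its own patterns, which are not shown here.
EPOCH_SECONDS = re.compile(r"[12][0-9]{9}")
EPOCH_MILLIS = re.compile(r"[12][0-9]{12}")


def classify_epoch(text):
    """Return 'seconds', 'millis', or None for a candidate digit string."""
    if EPOCH_SECONDS.fullmatch(text):
        return "seconds"
    if EPOCH_MILLIS.fullmatch(text):
        return "millis"
    return None


# The largest 10-digit value starting with 2 falls in 2065, which matches
# the "beyond about 2065" cutoff mentioned in the commit message.
LATEST_ACCEPTED = datetime.fromtimestamp(2999999999, tz=timezone.utc)
```

With this rule a 13-digit ISBN beginning with 9, such as 9780306406157, is no longer classified as milliseconds since the epoch, while 1560428141 is still treated as seconds.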


486 Commits
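
The day/month disambiguation described in the timestamp format refactoring (#41948) in the list above looks across all samples for a number greater than 12. A minimal sketch of that idea follows, assuming each sample has already been split into its first and second numeric fields. The function name `guess_day_month_order` and the `fallback` parameter (standing in for the server locale) are hypothetical and are not taken from TimestampFormatFinder.

```python
def guess_day_month_order(samples, fallback=None):
    """Guess 'dd/MM' or 'MM/dd' from (first, second) integer pairs.

    A value above 12 cannot be a month, so seeing one in a position
    across any sample marks that position as the day. If neither or
    both positions exceed 12, the evidence is inconclusive and the
    fallback (for example a locale-derived default) is returned.
    """
    pairs = list(samples)  # allow generators to be scanned twice
    first_over_12 = any(first > 12 for first, _ in pairs)
    second_over_12 = any(second > 12 for _, second in pairs)
    if first_over_12 and not second_over_12:
        return "dd/MM"
    if second_over_12 and not first_over_12:
        return "MM/dd"
    return fallback
```

For example, the samples 01/02 and 25/12 together point to dd/MM, whereas 01/02 alone is ambiguous and yields the fallback.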