mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-03-09 14:34:43 +00:00
* [ML][Inference] lang_ident model (#50292)

This PR contains a Java port of Google's CLD3 compact NN model
https://github.com/google/cld3

The ported model is formatted to fit our inference model format, is stored as a resource in the `:xpack:ml:` plugin, and is under the basic license.

The model is broken up into two major parts:
- Preprocessing through the custom embedding (based on CLD3's embedding layer)
- Pushing the embedded text through the two fully connected layers of the shallow NN

Main differences between this port and CLD3:
- We take advantage of Java's internal Unicode handling where possible (e.g. code points, characters, decoders)
- We do not trim down input text by removing duplicated tokens
- We do not encode doubles/floats as longs/integers
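
For readers unfamiliar with the shape of such a model, the two parts described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the class `LangIdentSketch`, its method names, the use of `String.hashCode()` to pick an embedding row, and the ReLU/softmax activations are all assumptions made for the example. It does reflect the stated differences from CLD3: n-grams are built from Java code points, repeated n-grams are counted rather than trimmed, and all weights stay as plain doubles.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names and hashing scheme are assumptions, not the PR's API.
final class LangIdentSketch {

    // Count n-grams over Unicode code points (not UTF-16 chars), so a
    // supplementary character such as an emoji is treated as one unit.
    // Duplicates are kept as counts instead of being trimmed away.
    static Map<String, Integer> codePointNGrams(String text, int n) {
        int[] cps = text.codePoints().toArray();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= cps.length; i++) {
            counts.merge(new String(cps, i, n), 1, Integer::sum);
        }
        return counts;
    }

    // Frequency-weighted average of embedding rows. Each n-gram is mapped to
    // a row by hashing (assumed scheme: String.hashCode modulo table size).
    static double[] embed(Map<String, Integer> grams, double[][] table) {
        double[] out = new double[table[0].length];
        int total = 0;
        for (int c : grams.values()) {
            total += c;
        }
        if (total == 0) {
            return out; // no n-grams: return the zero vector
        }
        for (Map.Entry<String, Integer> e : grams.entrySet()) {
            int row = Math.floorMod(e.getKey().hashCode(), table.length);
            double weight = e.getValue() / (double) total;
            for (int j = 0; j < out.length; j++) {
                out[j] += weight * table[row][j];
            }
        }
        return out;
    }

    // One fully connected layer: y = W x + b, with W indexed as [output][input].
    static double[] dense(double[] x, double[][] w, double[] b) {
        double[] y = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double sum = b[i];
            for (int j = 0; j < x.length; j++) {
                sum += w[i][j] * x[j];
            }
            y[i] = sum;
        }
        return y;
    }

    // Element-wise max(0, x), assumed here as the hidden-layer activation.
    static double[] relu(double[] x) {
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = Math.max(0.0, x[i]);
        }
        return y;
    }

    // Numerically stable softmax turning output scores into probabilities.
    static double[] softmax(double[] x) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) {
            max = Math.max(max, v);
        }
        double[] y = new double[x.length];
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            y[i] = Math.exp(x[i] - max);
            sum += y[i];
        }
        for (int i = 0; i < y.length; i++) {
            y[i] /= sum;
        }
        return y;
    }
}
```

Under these assumptions, a prediction would chain the steps as `softmax(dense(relu(dense(embed(codePointNGrams(text, 2), table), w1, b1)), w2, b2))`, with one output probability per candidate language.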