From 64a6fc8fc02cdb0c96d7904bc649260d54f2520a Mon Sep 17 00:00:00 2001
From: Gian Merlino
Date: Thu, 25 Apr 2024 22:16:07 -0700
Subject: [PATCH] JSONFlattenerMaker: Speed up charsetFix. (#16212)

JSON parsing has this function "charsetFix" that fixes up strings so they
can round-trip through UTF-8 encoding without loss of fidelity. It was
originally introduced to fix a bug where strings could be sorted, encoded,
then decoded, and the resulting decoded strings could end up no longer in
sorted order (due to character swaps during the encode operation).

The code has been in place for some time, and only applies to JSON. I am
not sure if it needs to apply to other formats; it's certainly more
difficult to get broken strings from other formats. It's easy in JSON
because you can write a JSON string like "foo\uD900".

At any rate, this patch does not revisit whether charsetFix should be
applied to all formats. It merely optimizes it for the JSON case. The
function works by using CharsetEncoder.canEncode, which is a relatively
slow method (just as expensive as actually encoding). This patch adds a
short-circuit to skip canEncode if all chars in a string are in the basic
multilingual plane (i.e. if no chars are surrogates).
---
 .../common/parsers/JSONFlattenerMaker.java    | 44 ++++++++++++++++---
 .../parsers/JSONFlattenerMakerTest.java       | 15 +++++++
 2 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/processing/src/main/java/org/apache/druid/java/util/common/parsers/JSONFlattenerMaker.java b/processing/src/main/java/org/apache/druid/java/util/common/parsers/JSONFlattenerMaker.java
index 5df98199080..84490b19fce 100644
--- a/processing/src/main/java/org/apache/druid/java/util/common/parsers/JSONFlattenerMaker.java
+++ b/processing/src/main/java/org/apache/druid/java/util/common/parsers/JSONFlattenerMaker.java
@@ -203,19 +203,53 @@ public class JSONFlattenerMaker implements ObjectFlatteners.FlattenerMaker