druid/docs/development/extensions-contrib/compressed-big-decimal.md


---
id: compressed-big-decimal
title: "Compressed Big Decimal"
---

Overview

Compressed Big Decimal is an extension that provides a mutable big decimal value, which can be used to accumulate values without losing precision or reallocating memory. The type supports exact arithmetic on large numbers in applications that require a high level of accuracy, such as financial applications and currency-based transactions. It avoids rounding errors that could otherwise cause potentially large amounts of money to be lost.

Accumulation requires that the two numbers have the same scale, but does not require that they have the same size. If the value being accumulated has a larger underlying array than the value it is accumulated into (the result), the higher-order bits are dropped, similar to what happens when adding a long to an int and storing the result in an int. A compressed big decimal holds its data in an embedded array.
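The truncation described above can be illustrated with a short Python sketch. It is not the extension's implementation: it assumes, purely for illustration, that the underlying array consists of 32-bit words holding a two's-complement integer, and the helper names `to_unscaled`, `wrap` and `accumulate` are hypothetical.

```python
from decimal import Decimal

WORD_BITS = 32  # assumption: each array element holds 32 bits


def to_unscaled(value: str, scale: int) -> int:
    """Express a decimal string as an integer count of 10**-scale units."""
    return int(Decimal(value).scaleb(scale))


def wrap(unscaled: int, size: int) -> int:
    """Keep only the low size*WORD_BITS bits, read as two's complement."""
    bits = WORD_BITS * size
    low = unscaled & ((1 << bits) - 1)
    return low - (1 << bits) if low >= (1 << (bits - 1)) else low


def accumulate(total: int, addend: int, size: int) -> int:
    """Add two unscaled values of the same scale into a result of `size` words."""
    return wrap(total + addend, size)


scale = 9  # both operands are converted to the same scale before adding
a = to_unscaled("9999999999.000000000", scale)
b = to_unscaled("5000000000.000000005", scale)

wide = accumulate(a, b, size=3)    # 96 bits: large enough, the sum is exact
narrow = accumulate(a, b, size=1)  # 32 bits: the higher-order bits are dropped

print(Decimal(wide).scaleb(-scale))
print(Decimal(narrow).scaleb(-scale))
```

With three words the sum is preserved exactly, while with a single word the result silently wraps, which is the same effect as casting a Java long to an int.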

Compressed Big Decimal is an absolute-precision complex type based on Java's BigDecimal, and it supports all the functionality that BigDecimal supports. Java's BigDecimal is immutable, so every accumulation step creates a new object, which can cause significant garbage collection overhead. Compressed Big Decimal is mutable so that the value in the accumulator can be updated in place.

Main enhancements provided by this extension:

  1. Functionality: A mutable big decimal type with greater precision
  2. Accuracy: A greater level of accuracy in decimal arithmetic

Operations

To use this extension, make sure druid-compressed-bigdecimal is included in the extensions load list (druid.extensions.loadList) in your config file.

Configuration

There are currently no configuration properties specific to Compressed Big Decimal.

Limitations

  • Compressed Big Decimal does not produce a correct result when the value being accumulated has a larger underlying array than the value it is accumulated into (the result). In that case the higher-order bits are dropped, similar to what happens when adding a long to an int and storing the result in an int.
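One way to reason about this limitation is to estimate how large the result array has to be for the values you expect. The sketch below is a rough sizing aid, not part of the extension: it assumes 32-bit array elements and one reserved sign bit, and `words_needed` is a hypothetical helper name.

```python
from decimal import Decimal

WORD_BITS = 32  # assumption: each array element holds 32 bits


def words_needed(value: str, scale: int) -> int:
    """Estimate how many words a signed value needs at the given scale.

    Conservative: always reserves one extra bit for the sign.
    """
    unscaled = int(Decimal(value).scaleb(scale))
    bits = abs(unscaled).bit_length() + 1  # +1 for the sign bit
    return max(1, -(-bits // WORD_BITS))   # ceiling division


print(words_needed("10.000000000", 9))           # prints 2
print(words_needed("15000000010.000000005", 9))  # prints 3
```

Under these assumptions, the total in the Examples section needs three words at scale 9, which is consistent with the size of 3 used in the example queries.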

Ingestion Spec:

|property|description|required?|
|--------|-----------|---------|
|metricsSpec|Metrics specification. Each metric that should be stored as a Compressed Big Decimal must set its type to compressedBigDecimal, in addition to the usual details such as name.|yes|

Query spec:

  • Most properties in the query spec are derived from the groupBy and timeseries queries; see the documentation for these query types.

|property|description|required?|
|--------|-----------|---------|
|queryType|This String should always be either "groupBy" or "timeseries"; this is the first thing Druid looks at to figure out how to interpret the query.|yes|
|dataSource|A String or Object defining the data source to query, very similar to a table in a relational database. See DataSource for more information.|yes|
|dimensions|A JSON list of DimensionSpec (note that this property is optional).|no|
|limitSpec|See LimitSpec.|no|
|having|See Having.|no|
|granularity|A period granularity; see Period Granularities.|yes|
|filter|See Filters.|no|
|aggregations|See Aggregations. Aggregations of type compressedBigDecimal must specify type, scale and size, as follows: `"aggregations": [{"type": "compressedBigDecimal","name": "..","fieldName": "..","scale": [Numeric],"size": [Numeric]}]`. See the query examples in the Examples section.|yes|
|postAggregations|Supports only aggregations as input; see Post Aggregations.|no|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|context|An additional JSON Object which can be used to specify certain flags.|no|
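As a rough illustration of how the required properties fit together, the following Python sketch assembles a groupBy query body as a dictionary and serializes it to JSON. The property names come from the table above; the helper functions `compressed_big_decimal_agg` and `group_by_query` are hypothetical, and how the resulting JSON is submitted to Druid is left out.

```python
import json


def compressed_big_decimal_agg(name, field_name, scale, size):
    """Build one compressedBigDecimal aggregation with its required fields."""
    return {
        "type": "compressedBigDecimal",
        "name": name,
        "fieldName": field_name,
        "scale": scale,
        "size": size,
    }


def group_by_query(data_source, interval, aggregations):
    """Build a minimal groupBy query containing only the required properties."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "ALL",
        "dimensions": [],
        "aggregations": aggregations,
        "intervals": [interval],
    }


query = group_by_query(
    "invoices",
    "2020-12-08T00:00:00.000Z/P1D",
    [compressed_big_decimal_agg("revenue", "saleAmount", scale=9, size=3)],
)
print(json.dumps(query, indent=4))
```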

Examples

Consider the following data:

Date Item SaleAmount
20201208,ItemA,0.0
20201208,ItemB,10.000000000
20201208,ItemA,-1.000000000
20201208,ItemC,9999999999.000000000
20201208,ItemB,5000000000.000000005
20201208,ItemA,2.0
20201208,ItemD,0.0

IngestionSpec syntax:

{
	"type": "index_parallel",
	"spec": {
		"dataSchema": {
			"dataSource": "invoices",
			"timestampSpec": {
				"column": "timestamp",
				"format": "yyyyMMdd"
			},
			"dimensionsSpec": {
				"dimensions": [{
					"type": "string",
					"name": "itemName"
				}]
			},
			"metricsSpec": [{
				"name": "saleAmount",
				"type": *"compressedBigDecimal"*,
				"fieldName": "saleAmount"
			}],
			"transformSpec": {
				"filter": null,
				"transforms": []
			},
			"granularitySpec": {
				"type": "uniform",
				"rollup": false,
				"segmentGranularity": "DAY",
				"queryGranularity": "none",
				"intervals": ["2020-12-08/2020-12-09"]
			}
		},
		"ioConfig": {
			"type": "index_parallel",
			"inputSource": {
				"type": "local",
				"baseDir": "/home/user/sales/data/staging/invoice-data",
				"filter": "invoice-001.20201208.txt"
			},
			"inputFormat": {
				"type": "tsv",
                                "delimiter": ",",
                                "skipHeaderRows": 0,
				"columns": [
						"timestamp",
						"itemName",
						"saleAmount"
					]
			}
		},
		"tuningConfig": {
			"type": "index_parallel"
		}
	}
}

Group By Query example

Calculate total sales across all items, using granularity ALL and no dimensions.

Query syntax:

{
    "queryType": "groupBy",
    "dataSource": "invoices",
    "granularity": "ALL",
    "dimensions": [
    ],
    "aggregations": [
        {
            "type": "compressedBigDecimal",
            "name": "saleAmount",
            "fieldName": "saleAmount",
            "scale": 9,
            "size": 3

        }
    ],
    "intervals": [
        "2020-01-08T00:00:00.000Z/P1D"
    ]
}

Result:

[ {
  "version" : "v1",
  "timestamp" : "2020-12-08T00:00:00.000Z",
  "event" : {
    "revenue" : 15000000010.000000005
  }
} ]

Had you used doubleSum instead of compressedBigDecimal, the result would be:

[ {
  "timestamp" : "2020-12-08T00:00:00.000Z",
  "result" : {
    "revenue" : 1.500000001E10
  }
} ]

As shown above, the doubleSum result loses the fractional part (0.000000005). In financial data, this kind of precision loss could lead to a loss of money.
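The same effect can be reproduced outside Druid. The sketch below sums the example sale amounts once with Python's `decimal.Decimal` and once with double-precision floats; it only illustrates the arithmetic and does not use the extension or its aggregators.

```python
from decimal import Decimal

# Sale amounts from the example data set above.
amounts = [
    "0.0",
    "10.000000000",
    "-1.000000000",
    "9999999999.000000000",
    "5000000000.000000005",
    "2.0",
    "0.0",
]

exact = sum(Decimal(a) for a in amounts)  # exact decimal arithmetic
approx = sum(float(a) for a in amounts)   # double-precision arithmetic

print(exact)                    # 15000000010.000000005
print(approx)                   # 15000000010.0
print(exact - Decimal(approx))  # the amount lost to rounding
```

A double cannot represent 5000000000.000000005 exactly, so the trailing 0.000000005 disappears as soon as the value is parsed, before any addition takes place.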

TimeSeries Query Example

Query syntax:

{
    "queryType": "timeseries",
    "dataSource": "invoices",
    "granularity": "ALL",
    "aggregations": [
        {
            "type": "compressedBigDecimal",
            "name": "revenue",
            "fieldName": "revenue",
            "scale": 9,
            "size": 3
        }
    ],
    "filter": {
        "type": "not",
        "field": {
            "type": "selector",
            "dimension": "itemName",
            "value": "ItemD"
        }
    },
    "intervals": [
        "2020-12-08T00:00:00.000Z/P1D"
    ]
}

Result:

[ {
  "timestamp" : "2020-12-08T00:00:00.000Z",
  "result" : {
    "revenue" : 15000000010.000000005
  }
} ]