druid

hqx871 a0234c4e13 Add sampling factor for DeterminePartitionsJob (#13840)	2023-08-11 10:42:25 +05:30
..
src	Add sampling factor for DeterminePartitionsJob (#13840)	2023-08-11 10:42:25 +05:30
pom.xml	Prepare master branch for next release, 28.0.0 (#14595)	2023-07-18 09:22:30 +05:30

Add sampling factor for DeterminePartitionsJob (#13840)

There are two types of DeterminePartitionsJob:
- When the input data is not assumed grouped, there may be duplicate rows.
In this case, two MR jobs are launched. The first is a group job that removes duplicate rows.
The second performs a global sort to find the lower and upper bounds for the target segments.
- When the input data is assumed grouped, only the global sorting MR job needs to be
launched to find the lower and upper bounds for segments.

Sampling strategy:
- If the input data is assumed grouped, sample randomly on the mapper side of the global sort MR job.
- If the input data is not assumed grouped, sample at the mapper of the group job. Hash the time
and all dimensions, then take the hash modulo the sampling factor to decide whether to sample the
row. Do not use the random method here, because there may be duplicate rows.
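
The two sampling strategies above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the commit: the class name `SamplingSketch`, the method names, and the choice of `Objects.hash` as the hash function are all assumptions made for the example. It shows why hashing suits ungrouped input: identical rows produce the same hash, so duplicates are either all sampled or all skipped, whereas a random draw could keep some copies and drop others.

```java
import java.util.List;
import java.util.Objects;
import java.util.Random;

// Hypothetical helper; names and hash function are assumptions, not Druid's API.
class SamplingSketch {

    // Input NOT assumed grouped: decide from a hash of the timestamp and all
    // dimension values. Duplicate rows hash identically, so they always get
    // the same decision.
    static boolean sampleByHash(long timestamp, List<String> dimensionValues, int samplingFactor) {
        int hash = Objects.hash(timestamp, dimensionValues);
        // floorMod keeps the result in [0, samplingFactor) even for negative hashes.
        return Math.floorMod(hash, samplingFactor) == 0;
    }

    // Input assumed grouped: no duplicates are expected, so a random draw that
    // keeps roughly one row in samplingFactor is sufficient.
    static boolean sampleByRandom(Random random, int samplingFactor) {
        return random.nextInt(samplingFactor) == 0;
    }
}
```

With a sampling factor of 1, both methods keep every row, which corresponds to sampling being disabled in this sketch.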
