Commit Graph

17 Commits

Author SHA1 Message Date
Kashif Faraz 8ff1b2d5d4
Revert "Add filter in cloud object input source for backward compatibility (#13437)" (#13450)
This reverts commit b12e5f300e.
2022-11-30 16:33:05 +05:30
Tejaswini Bandlamudi b12e5f300e
Add filter in cloud object input source for backward compatibility (#13437)
https://github.com/apache/druid/pull/13027 PR replaces `filter` parameter with
`objectGlob` in ingestion input source. However, this will cause existing ingestion
jobs to fail if they are using a filter already. This PR adds old filter functionality
alongside objectGlob to preserve backward compatibility.
2022-11-28 23:04:33 +05:30
Didip Kerabat 56d5c9780d
Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects (#13027)
* Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects.

Removed:

import org.apache.commons.io.FilenameUtils;

Add:

import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

* Forgot to update CloudObjectInputSource as well.

* Fix tests.

* Removed unused exceptions.

* Able to reduced user mistakes, by removing the protocol and the bucket on filter.

* add 1 more test.

* add comment on filterWithoutProtocolAndBucket

* Fix lint issue.

* Fix another lint issue.

* Replace all mention of filter -> objectGlob per convo here:

https://github.com/apache/druid/pull/13027#issuecomment-1266410707

* fix 1 bad constructor.

* Fix the documentation.

* Don’t do anything clever with the object path.

* Remove unused imports.

* Fix spelling error.

* Fix incorrect search and replace.

* Addressing Gian’s comment.

* add filename on .spelling

* Fix documentation.

* fix documentation again

Co-authored-by: Didip Kerabat <didip@apple.com>
2022-11-10 23:46:40 -08:00
Abhishek Agarwal e3f9a0ed44
Lazy initialization of segment killers, movers and archivers (#13170)
* Lazy initialization of segment killers, movers and archivers

* Add test for lazy killer

* Add more tests

* Intellij fixes
2022-10-04 15:55:46 +05:30
Karan Kumar 275f834b2a
Race in Task report/log streamer (#12931)
* Fixing RACE in HTTP remote task Runner

* Changes in the interface

* Updating documentation

* Adding test cases to SwitchingTaskLogStreamer

* Adding more tests
2022-08-25 17:56:01 -07:00
Didip Kerabat 6ddb828c7a
Able to filter Cloud objects with glob notation. (#12659)
In a heterogeneous environment, sometimes you don't have control over the input folder. Upstream can put any folder they want. In this situation the S3InputSource.java is unusable.

Most people like me solved it by using Airflow to fetch the full list of parquet files and pass it over to Druid. But doing this explodes the JSON spec. We had a situation where 1 of the JSON spec is 16MB and that's simply too much for Overlord.

This patch allows users to pass {"filter": "*.parquet"} and let Druid performs the filtering of the input files.

I am using the glob notation to be consistent with the LocalFirehose syntax.
2022-06-24 11:40:08 +05:30
Karan Kumar b94390ba33
Adding Shared Access resource support for azure (#12266)
Azure Blob storage has multiple modes of authentication. One of them is Shared access resource
. This is very useful in cases when we do not want to add the account key in the druid properties .
2022-02-22 18:27:43 +05:30
Jihoon Son ab3d994a17
Lazy instantiation for segmentKillers, segmentMovers, and segmentArchivers (#12207)
* working

* Lazily load segmentKillers, segmentMovers, and segmentArchivers

* more tests

* test-jar plugin

* more coverage

* lazy client

* clean up changes

* checkstyle

* i did not change the branch condition

* adjust failure rate to run tests faster

* javadocs

* checkstyle
2022-02-08 13:02:06 -08:00
Gian Merlino babf00f8e3
Migrate File.mkdirs to FileUtils.mkdirp. (#11879)
* Migrate File.mkdirs to FileUtils.mkdirp.

* Remove unused imports.

* Fix LookupReferencesManager.

* Simplify.

* Also migrate usages of forceMkdir.

* Fix var name.

* Fix incorrect call.

* Update test.
2021-11-09 11:10:49 -08:00
Parag Jain c7b46671b3
option to use deep storage for storing shuffle data (#11507)
Fixes #11297.
Description

Description and design in the proposal #11297
Key changed/added classes in this PR

    *DataSegmentPusher
    *ShuffleClient
    *PartitionStat
    *PartitionLocation
    *IntermediaryDataManager
2021-08-13 16:40:25 -04:00
Jihoon Son b5b3e6ecce
Add maxNumFiles to splitHintSpec (#10243)
* Add maxNumFiles to splitHintSpec

* missing link

* fix build failure; use maxNumFiles for integration tests

* spelling

* lower default

* Update docs/ingestion/native-batch.md

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>

* address comments; change default maxSplitSize

* spelling

* typos and doc

* same change for segments splitHintSpec

* fix build

* fix build

Co-authored-by: Abhishek Agarwal <1477457+abhishekagarwal87@users.noreply.github.com>
2020-08-21 09:43:58 -07:00
Suneet Saldanha 1ced3b33fb
IntelliJ inspections cleanup (#9339)
* IntelliJ inspections cleanup

* Standard Charset object can be used
* Redundant Collection.addAll() call
* String literal concatenation missing whitespace
* Statement with empty body
* Redundant Collection operation
* StringBuilder can be replaced with String
* Type parameter hides visible type

* fix warnings in test code

* more test fixes

* remove string concatenation inspection error

* fix extra curly brace

* cleanup AzureTestUtils

* fix charsets for RangerAdminClient

* review comments
2020-04-10 10:04:40 -07:00
zachjsh e855c7fe1b
Allow Cloud Deep Storage configs without segment bucket or path specified (#9588)
* Allow Cloud SegmentKillers to be instantiated without segment bucket or path

This change fixes a bug that was introduced that causes ingestion
to fail if data is ingested from one of the supported cloud storages
(Azure, Google, S3), and the user is using another type of storage
for deep storage. In this case the all segment killer implementations
are instantiated. A change recently made forced a dependency between
the supported cloud storage type SegmentKiller classes and the
deep storage configuration for that storage type being set, which
forced the deep storage bucket and prefix to be non-null. This caused
a NullPointerException to be thrown when instantiating the
SegmentKiller classes during ingestion.

To fix this issue, the respective deep storage segment configs for the
cloud storage types supported in druid are now allowed to have nullable
bucket and prefix configurations

* * Allow google deep storage bucket to be null
2020-04-01 11:57:32 -07:00
zachjsh 4870ad7b56
Azure deep storage does not work with datasource name containing non-ASCII chars (#9525)
* Azure deep storage does not work with datasource name containing non-ASCII chars

Fixed a bug where recording the segment file location fails when
using Azure Deep Storage, if the datasource has any special
characters

* * update jacoco thresholds

* * resolve merge conflicts
* address review comments
2020-03-19 12:32:35 -07:00
zachjsh b18dd2b7a9
Ability to Delete task logs and segments from Azure Storage (#9523)
* Ability to Delete task logs and segments from Azure Storage

* implement ability to delete all tasks logs or all task logs
  written before a particular date when written to Azure storage

* implement ability to delete all segments from Azure deep storage

* * Address review comments
2020-03-18 17:59:17 -07:00
Jihoon Son 9466ac7c9b
Skip empty files for local, hdfs, and cloud input sources (#9450)
* Skip empty files for local, hdfs, and cloud input sources

* split hint spec doc

* doc for skipping empty files

* fix typo; adjust tests

* unnecessary fluent iterable

* address comments

* fix test

* use the right lists

* fix test

* fix test
2020-03-03 20:51:06 -08:00
zachjsh d771b42ed1
Move Azure extension into Core (#9394)
* Move Azure extension into Core

Moving the azure extension into Core.

* * Fix build failure

* * Add The MIT License (MIT) to list of compatible licenses

* * Address review comments

* * change reference to contrib azure to core azure

* * Fix spelling mistakes.
2020-02-25 17:49:16 -08:00