- * An Abstract Processor that is intended to simplify the coding required in order to perform Listing operations of remote resources.
- * Those remote resources may be files, "objects", "messages", or any other sort of entity that may need to be listed in such a way that
+ * An Abstract Processor that is intended to simplify the coding required in order to perform Listing operations of remote or local resources.
+ * Those resources may be files, "objects", "messages", or any other sort of entity that may need to be listed in such a way that
* we identity the entity only once. Each of these objects, messages, etc. is referred to as an "entity" for the scope of this Processor.
*
*
@@ -83,6 +87,9 @@ import java.util.stream.Collectors;
* than the last timestamp pulled, then the entity is considered new.
*
*
+ * With 'Tracking Entities' strategy, the size of entity content is also used to determine if an entity is "new". If the size changes the entity is considered "new".
+ *
+ *
* Entity must have a user-readable name that can be used for logging purposes.
*
*
@@ -96,9 +103,10 @@ import java.util.stream.Collectors;
* NOTE: This processor performs migrations of legacy state mechanisms inclusive of locally stored, file-based state and the optional utilization of the Distributed Cache
* Service property to the new {@link StateManager} functionality. Upon successful migration, the associated data from one or both of the legacy mechanisms is purged.
*
+ *
*
* For each new entity that is listed, the Processor will send a FlowFile to the 'success' relationship. The FlowFile will have no content but will have some set
- * of attributes (defined by the concrete implementation) that can be used to fetch those remote resources or interact with them in whatever way makes sense for
+ * of attributes (defined by the concrete implementation) that can be used to fetch those resources or interact with them in whatever way makes sense for
* the configured dataflow.
*
- * Perform a listing of remote resources. The subclass will implement the {@link #performListing(ProcessContext, Long)} method, which creates a listing of all
- * entities on the remote system that have timestamps later than the provided timestamp. If the entities returned have a timestamp before the provided one, those
+ * Perform a listing of resources. The subclass will implement the {@link #performListing(ProcessContext, Long)} method, which creates a listing of all
+ * entities on the target system that have timestamps later than the provided timestamp. If the entities returned have a timestamp before the provided one, those
* entities will be filtered out. It is therefore not necessary to perform the filtering of timestamps but is provided in order to give the implementation the ability
* to filter those resources on the server side rather than pulling back all of the information, if it makes sense to do so in the concrete implementation.
*
@@ -138,9 +146,11 @@ public abstract class AbstractListProcessor extends Ab
public static final PropertyDescriptor DISTRIBUTED_CACHE_SERVICE = new PropertyDescriptor.Builder()
.name("Distributed Cache Service")
- .description("Specifies the Controller Service that should be used to maintain state about what has been pulled from the remote server so that if a new node "
- + "begins pulling data, it won't duplicate all of the work that has been done. If not specified, the information will not be shared across the cluster. "
- + "This property does not need to be set for standalone instances of NiFi but should be configured if NiFi is run within a cluster.")
+ .description("NOTE: This property is used merely for migration from old NiFi version before state management was introduced at version 0.5.0. "
+ + "The stored value in the cache service will be migrated into the state when this processor is started at the first time. "
+ + "The specified Controller Service was used to maintain state about what had been pulled from the remote server so that if a new node "
+ + "begins pulling data, it won't duplicate all of the work that has been done. If not specified, the information was not shared across the cluster. "
+ + "This property did not need to be set for standalone instances of NiFi but was supposed to be configured if NiFi had been running within a cluster.")
.required(false)
.identifiesControllerService(DistributedMapCacheClient.class)
.build();
@@ -169,6 +179,28 @@ public abstract class AbstractListProcessor extends Ab
.description("All FlowFiles that are received are routed to success")
.build();
+ public static final AllowableValue BY_TIMESTAMPS = new AllowableValue("timestamps", "Tracking Timestamps",
+ "This strategy tracks the latest timestamp of listed entity to determine new/updated entities." +
+ " Since it only tracks few timestamps, it can manage listing state efficiently." +
+ " However, any newly added, or updated entity having timestamp older than the tracked latest timestamp can not be picked by this strategy." +
+ " For example, such situation can happen in a file system if a file with old timestamp" +
+ " is copied or moved into the target directory without its last modified timestamp being updated.");
+
+ public static final AllowableValue BY_ENTITIES = new AllowableValue("entities", "Tracking Entities",
+ "This strategy tracks information of all the listed entities within the latest 'Entity Tracking Time Window' to determine new/updated entities." +
+ " This strategy can pick entities having old timestamp that can be missed with 'Tracing Timestamps'." +
+ " However additional DistributedMapCache controller service is required and more JVM heap memory is used." +
+ " See the description of 'Entity Tracking Time Window' property for further details on how it works.");
+
+ public static final PropertyDescriptor LISTING_STRATEGY = new PropertyDescriptor.Builder()
+ .name("listing-strategy")
+ .displayName("Listing Strategy")
+ .description("Specify how to determine new/updated entities. See each strategy descriptions for detail.")
+ .required(true)
+ .allowableValues(BY_TIMESTAMPS, BY_ENTITIES)
+ .defaultValue(BY_TIMESTAMPS.getValue())
+ .build();
+
/**
* Represents the timestamp of an entity which was the latest one within those listed at the previous cycle.
* It does not necessary mean it has been processed as well.
@@ -185,6 +217,8 @@ public abstract class AbstractListProcessor extends Ab
private volatile boolean resetState = false;
private volatile List latestIdentifiersProcessed = new ArrayList<>();
+ private volatile ListedEntityTracker listedEntityTracker;
+
/*
* A constant used in determining an internal "yield" of processing files. Given the logic to provide a pause on the newest
* files according to timestamp, it is ensured that at least the specified millis has been eclipsed to avoid getting scheduled
@@ -206,14 +240,6 @@ public abstract class AbstractListProcessor extends Ab
return new File("conf/state/" + getIdentifier());
}
- @Override
- protected List getSupportedPropertyDescriptors() {
- final List properties = new ArrayList<>();
- properties.add(DISTRIBUTED_CACHE_SERVICE);
- properties.add(TARGET_SYSTEM_TIMESTAMP_PRECISION);
- return properties;
- }
-
@Override
public void onPropertyModified(final PropertyDescriptor descriptor, final String oldValue, final String newValue) {
if (isConfigurationRestored() && isListingResetNecessary(descriptor)) {
@@ -230,6 +256,32 @@ public abstract class AbstractListProcessor extends Ab
return relationships;
}
+ /**
+ * In order to add custom validation at sub-classes, implement {@link #customValidate(ValidationContext, Collection)} method.
+ */
+ @Override
+ protected final Collection customValidate(ValidationContext context) {
+ final Collection results = new ArrayList<>();
+
+ final String listingStrategy = context.getProperty(LISTING_STRATEGY).getValue();
+ if (BY_ENTITIES.equals(listingStrategy)) {
+ ListedEntityTracker.validateProperties(context, results, getStateScope(context));
+ }
+
+ customValidate(context, results);
+ return results;
+ }
+
+
+ /**
+ * Sub-classes can add custom validation by implementing this method.
+ * @param validationContext the validation context
+ * @param validationResults add custom validation result to this collection
+ */
+ protected void customValidate(ValidationContext validationContext, Collection validationResults) {
+
+ }
+
@OnPrimaryNodeStateChange
public void onPrimaryNodeChange(final PrimaryNodeState newState) {
justElectedPrimaryNode = (newState == PrimaryNodeState.ELECTED_PRIMARY_NODE);
@@ -260,7 +312,6 @@ public abstract class AbstractListProcessor extends Ab
if (resetState) {
context.getStateManager().clear(getStateScope(context));
- resetState = false;
}
}
@@ -352,9 +403,24 @@ public abstract class AbstractListProcessor extends Ab
return mapper.readValue(serializedState, EntityListing.class);
}
-
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
+
+ resetState = false;
+
+ final String listingStrategy = context.getProperty(LISTING_STRATEGY).getValue();
+ if (BY_TIMESTAMPS.equals(listingStrategy)) {
+ listByTrackingTimestamps(context, session);
+
+ } else if (BY_ENTITIES.equals(listingStrategy)) {
+ listByTrackingEntities(context, session);
+
+ } else {
+ throw new ProcessException("Unknown listing strategy: " + listingStrategy);
+ }
+ }
+
+ public void listByTrackingTimestamps(final ProcessContext context, final ProcessSession session) throws ProcessException {
Long minTimestampToListMillis = lastListedLatestEntryTimestampMillis;
if (this.lastListedLatestEntryTimestampMillis == null || this.lastProcessedLatestEntryTimestampMillis == null || justElectedPrimaryNode) {
@@ -624,7 +690,7 @@ public abstract class AbstractListProcessor extends Ab
* @param context the ProcessContext to use in order to make a determination
* @return a Scope that specifies where the state should be managed for this Processor
*/
- protected abstract Scope getStateScope(final ProcessContext context);
+ protected abstract Scope getStateScope(final PropertyContext context);
private static class StringSerDe implements Serializer, Deserializer {
@@ -642,4 +708,41 @@ public abstract class AbstractListProcessor extends Ab
out.write(value.getBytes(StandardCharsets.UTF_8));
}
}
+
+ @OnScheduled
+ public void initListedEntityTracker(ProcessContext context) {
+ final boolean isTrackingEntityStrategy = BY_ENTITIES.getValue().equals(context.getProperty(LISTING_STRATEGY).getValue());
+ if (listedEntityTracker != null && (resetState || !isTrackingEntityStrategy)) {
+ try {
+ listedEntityTracker.clearListedEntities();
+ } catch (IOException e) {
+ throw new RuntimeException("Failed to reset previously listed entities due to " + e, e);
+ }
+ }
+
+ if (isTrackingEntityStrategy) {
+ if (listedEntityTracker == null) {
+ listedEntityTracker = createListedEntityTracker();
+ }
+ } else {
+ listedEntityTracker = null;
+ }
+ }
+
+ protected ListedEntityTracker createListedEntityTracker() {
+ return new ListedEntityTracker<>(getIdentifier(), getLogger());
+ }
+
+ private void listByTrackingEntities(ProcessContext context, ProcessSession session) throws ProcessException {
+ listedEntityTracker.trackEntities(context, session, justElectedPrimaryNode, getStateScope(context), minTimestampToList -> {
+ try {
+ return performListing(context, minTimestampToList);
+ } catch (final IOException e) {
+ getLogger().error("Failed to perform listing on remote host due to {}", e);
+ return Collections.emptyList();
+ }
+ }, entity -> createAttributes(entity, context));
+ justElectedPrimaryNode = false;
+ }
+
}
diff --git a/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListableEntity.java b/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListableEntity.java
index 01837cb174..f769e78014 100644
--- a/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListableEntity.java
+++ b/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListableEntity.java
@@ -37,4 +37,9 @@ public interface ListableEntity {
*/
long getTimestamp();
+ /**
+ * @return the size of the entity content.
+ */
+ long getSize();
+
}
diff --git a/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntity.java b/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntity.java
new file mode 100644
index 0000000000..1ae646f3ff
--- /dev/null
+++ b/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntity.java
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.processor.util.list;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+
+public class ListedEntity {
+ /**
+ * Milliseconds.
+ */
+ private final long timestamp;
+ /**
+ * Bytes.
+ */
+ private final long size;
+
+ @JsonCreator
+ public ListedEntity(@JsonProperty("timestamp") long timestamp, @JsonProperty("size") long size) {
+ this.timestamp = timestamp;
+ this.size = size;
+ }
+
+ public long getTimestamp() {
+ return timestamp;
+ }
+
+ public long getSize() {
+ return size;
+ }
+}
diff --git a/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntityTracker.java b/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntityTracker.java
new file mode 100644
index 0000000000..6c36ba95e5
--- /dev/null
+++ b/nifi-nar-bundles/nifi-extension-utils/nifi-processor-utils/src/main/java/org/apache/nifi/processor/util/list/ListedEntityTracker.java
@@ -0,0 +1,342 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nifi.processor.util.list;
+
+import com.fasterxml.jackson.core.type.TypeReference;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.nifi.components.AllowableValue;
+import org.apache.nifi.components.PropertyDescriptor;
+import org.apache.nifi.components.ValidationContext;
+import org.apache.nifi.components.ValidationResult;
+import org.apache.nifi.components.state.Scope;
+import org.apache.nifi.distributed.cache.client.Deserializer;
+import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
+import org.apache.nifi.distributed.cache.client.Serializer;
+import org.apache.nifi.expression.ExpressionLanguageScope;
+import org.apache.nifi.flowfile.FlowFile;
+import org.apache.nifi.logging.ComponentLog;
+import org.apache.nifi.processor.ProcessContext;
+import org.apache.nifi.processor.ProcessSession;
+import org.apache.nifi.processor.exception.ProcessException;
+import org.apache.nifi.processor.util.StandardValidators;
+import org.apache.nifi.stream.io.GZIPOutputStream;
+import org.apache.nifi.util.StringUtils;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Function;
+import java.util.function.Supplier;
+import java.util.stream.Collectors;
+import java.util.zip.GZIPInputStream;
+
+import static java.lang.String.format;
+import static org.apache.nifi.processor.util.list.AbstractListProcessor.REL_SUCCESS;
+
+public class ListedEntityTracker {
+
+ private final ObjectMapper objectMapper = new ObjectMapper();
+ private volatile Map alreadyListedEntities;
+
+ private static final String NOTE = "Used by 'Tracking Entities' strategy.";
+ public static final PropertyDescriptor TRACKING_STATE_CACHE = new PropertyDescriptor.Builder()
+ .name("et-state-cache")
+ .displayName("Entity Tracking State Cache")
+ .description(format("Listed entities are stored in the specified cache storage" +
+ " so that this processor can resume listing across NiFi restart or in case of primary node change." +
+ " 'Tracking Entities' strategy require tracking information of all listed entities within the last 'Tracking Time Window'." +
+ " To support large number of entities, the strategy uses DistributedMapCache instead of managed state." +
+ " Cache key format is 'ListedEntities::{processorId}(::{nodeId})'." +
+ " If it tracks per node listed entities, then the optional '::{nodeId}' part is added to manage state separately." +
+ " E.g. cluster wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b'," +
+ " per node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'" +
+ " The stored cache content is Gzipped JSON string." +
+ " The cache key will be deleted when target listing configuration is changed." +
+ " %s", NOTE))
+ .identifiesControllerService(DistributedMapCacheClient.class)
+ .build();
+
+ public static final PropertyDescriptor TRACKING_TIME_WINDOW = new PropertyDescriptor.Builder()
+ .name("et-time-window")
+ .displayName("Entity Tracking Time Window")
+ .description(format("Specify how long this processor should track already-listed entities." +
+ " 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window." +
+ " For example, if set to '30 minutes', any entity having timestamp in recent 30 minutes will be the listing target when this processor runs." +
+ " A listed entity is considered 'new/updated' and a FlowFile is emitted if one of following condition meets:" +
+ " 1. does not exist in the already-listed entities," +
+ " 2. has newer timestamp than the cached entity," +
+ " 3. has different size than the cached entity." +
+ " If a cached entity's timestamp becomes older than specified time window, that entity will be removed from the cached already-listed entities." +
+ " %s", NOTE))
+ .addValidator(StandardValidators.TIME_PERIOD_VALIDATOR)
+ .expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
+ .defaultValue("3 hours")
+ .build();
+
+ private static final AllowableValue INITIAL_LISTING_TARGET_ALL = new AllowableValue("all", "All Available",
+ "Regardless of entities timestamp, all existing entities will be listed at the initial listing activity.");
+ private static final AllowableValue INITIAL_LISTING_TARGET_WINDOW = new AllowableValue("window", "Tracking Time Window",
+ "Ignore entities having timestamp older than the specified 'Tracking Time Window' at the initial listing activity.");
+
+ public static final PropertyDescriptor INITIAL_LISTING_TARGET = new PropertyDescriptor.Builder()
+ .name("et-initial-listing-target")
+ .displayName("Entity Tracking Initial Listing Target")
+ .description(format("Specify how initial listing should be handled." +
+ " %s", NOTE))
+ .allowableValues(INITIAL_LISTING_TARGET_WINDOW, INITIAL_LISTING_TARGET_ALL)
+ .defaultValue(INITIAL_LISTING_TARGET_ALL.getValue())
+ .build();
+
+ public static final PropertyDescriptor NODE_IDENTIFIER = new PropertyDescriptor.Builder()
+ .name("et-node-identifier")
+ .displayName("Entity Tracking Node Identifier")
+ .description(format("The configured value will be appended to the cache key" +
+ " so that listing state can be tracked per NiFi node rather than cluster wide" +
+ " when tracking state is scoped to LOCAL. %s", NOTE))
+ .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
+ .expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
+ .defaultValue("${hostname()}")
+ .build();
+
+ static final Supplier DEFAULT_CURRENT_TIMESTAMP_SUPPLIER = System::currentTimeMillis;
+ private final Supplier currentTimestampSupplier;
+
+ private final Serializer stringSerializer = (v, o) -> o.write(v.getBytes(StandardCharsets.UTF_8));
+
+ private final Serializer