:batch-asciidoc: ./
:toc: left
:toclevels: 4

[[domainLanguageOfBatch]]

== The Domain Language of Batch

ifndef::onlyonetoggle[]
include::toggle.adoc[]
endif::onlyonetoggle[]

To any experienced batch architect, the overall concepts of batch processing used in Spring Batch should be familiar and comfortable. There are "Jobs" and "Steps" and developer-supplied processing units called `ItemReader` and `ItemWriter`. However, because of the Spring patterns, operations, templates, callbacks, and idioms, there are opportunities for the following:

* Significant improvement in adherence to a clear separation of concerns.
* Clearly delineated architectural layers and services provided as interfaces.
* Simple and default implementations that allow for quick adoption and ease of use out-of-the-box.
* Significantly enhanced extensibility.

The following diagram is a simplified version of the batch reference architecture that has been used for decades. It provides an overview of the components that make up the domain language of batch processing. This architecture framework is a blueprint that has been proven through decades of implementations on the last several generations of platforms (COBOL/Mainframe, C++/Unix, and now Java/anywhere). JCL and COBOL developers are likely to be as comfortable with the concepts as C++, C#, and Java developers. Spring Batch provides a physical implementation of the layers, components, and technical services commonly found in the robust, maintainable systems that are used to address the creation of simple to complex batch applications, with the infrastructure and extensions to address very complex processing needs.

.Batch Stereotypes
image::{batch-asciidoc}images/spring-batch-reference-model.png[Figure 2.1: Batch Stereotypes, scaledwidth="60%"]

The preceding diagram highlights the key concepts that make up the domain language of Spring Batch.
A Job has one to many steps, each of which has exactly one `ItemReader`, one `ItemProcessor`, and one `ItemWriter`. A job needs to be launched (with `JobLauncher`), and metadata about the currently running process needs to be stored (in `JobRepository`).

=== Job

This section describes stereotypes relating to the concept of a batch job. A `Job` is an entity that encapsulates an entire batch process. As is common with other Spring projects, a `Job` is wired together with either an XML configuration file or Java-based configuration. This configuration may be referred to as the "job configuration". However, `Job` is just the top of an overall hierarchy, as shown in the following diagram:

.Job Hierarchy
image::{batch-asciidoc}images/job-heirarchy.png[Job Hierarchy, scaledwidth="60%"]

In Spring Batch, a `Job` is simply a container for `Step` instances. It combines multiple steps that belong logically together in a flow and allows for configuration of properties global to all steps, such as restartability. The job configuration contains:

* The simple name of the job.
* Definition and ordering of `Step` instances.
* Whether or not the job is restartable.

ifdef::backend-html5[]
[role="javaContent"]
A default simple implementation of the `Job` interface is provided by Spring Batch in the form of the `SimpleJob` class, which creates some standard functionality on top of `Job`. When using Java-based configuration, a collection of builders is made available for the instantiation of a `Job`, as shown in the following example:

[source, java, role="javaContent"]
----
@Bean
public Job footballJob() {
    // "footballJob" is the simple name of the job
    return this.jobBuilderFactory.get("footballJob")
                     // steps run in the order in which they are declared
                     .start(playerLoad())
                     .next(gameLoad())
                     .next(playerSummarization())
                     .end()
                     .build();
}
----

[role="xmlContent"]
A default simple implementation of the `Job` interface is provided by Spring Batch in the form of the `SimpleJob` class, which creates some standard functionality on top of `Job`. However, the batch namespace abstracts away the need to instantiate it directly. Instead, the `` tag can be used as shown in the following example:

[source, xml, role="xmlContent"]
----
----

endif::backend-html5[]

ifdef::backend-pdf[]
A default simple implementation of the `Job` interface is provided by Spring Batch in the form of the `SimpleJob` class, which creates some standard functionality on top of `Job`. When using Java-based configuration, a collection of builders is made available for the instantiation of a `Job`, as shown in the following example:

[source, java]
----
@Bean
public Job footballJob() {
    // "footballJob" is the simple name of the job
    return this.jobBuilderFactory.get("footballJob")
                     // steps run in the order in which they are declared
                     .start(playerLoad())
                     .next(gameLoad())
                     .next(playerSummarization())
                     .end()
                     .build();
}
----

However, when using XML configuration, the batch namespace abstracts away the need to instantiate it directly. Instead, the `` tag can be used as shown in the following example:

[source, xml]
----
----

endif::backend-pdf[]

==== JobInstance

A `JobInstance` refers to the concept of a logical job run. Consider a batch job that should be run once at the end of the day, such as the 'EndOfDay' `Job` from the preceding diagram. There is one 'EndOfDay' job, but each individual run of the `Job` must be tracked separately. In the case of this job, there is one logical `JobInstance` per day. For example, there is a January 1st run, a January 2nd run, and so on. If the January 1st run fails the first time and is run again the next day, it is still the January 1st run. (Usually, this corresponds with the data it is processing as well, meaning the January 1st run processes data for January 1st.) Therefore, each `JobInstance` can have multiple executions (`JobExecution` is discussed in more detail later in this chapter), and only one `JobInstance` corresponding to a particular `Job` and identifying `JobParameters` can run at a given time.

The definition of a `JobInstance` has absolutely no bearing on the data to be loaded.
It is entirely up to the `ItemReader` implementation to determine how data is loaded. For example, in the EndOfDay scenario, there may be a column on the data that indicates the 'effective date' or 'schedule date' to which the data belongs. So, the January 1st run would load only data from the 1st, and the January 2nd run would use only data from the 2nd. Because this determination is likely to be a business decision, it is left up to the `ItemReader` to decide. However, using the same `JobInstance` determines whether or not the 'state' (that is, the `ExecutionContext`, which is discussed later in this chapter) from previous executions is used. Using a new `JobInstance` means 'start from the beginning', and using an existing instance generally means 'start from where you left off'.

==== JobParameters

Having discussed `JobInstance` and how it differs from `Job`, the natural question to ask is: "How is one `JobInstance` distinguished from another?" The answer is: `JobParameters`. A `JobParameters` object holds a set of parameters used to start a batch job. They can be used for identification or even as reference data during the run, as shown in the following image:

.Job Parameters
image::{batch-asciidoc}images/job-stereotypes-parameters.png[Job Parameters, scaledwidth="60%"]

In the preceding example, where there are two instances, one for January 1st, and another for January 2nd, there is really only one `Job`, but it has two `JobParameter` objects: one that was started with a job parameter of 01-01-2017 and another that was started with a parameter of 01-02-2017. Thus, the contract can be defined as: `JobInstance` = `Job` + identifying `JobParameters`. This allows a developer to effectively control how a `JobInstance` is defined, since they control what parameters are passed in.

NOTE: Not all job parameters are required to contribute to the identification of a `JobInstance`. By default, they do so.
However, the framework also allows the submission of a `Job` with parameters that do not contribute to the identity of a `JobInstance`.

==== JobExecution

A `JobExecution` refers to the technical concept of a single attempt to run a `Job`. An execution may end in failure or success, but the `JobInstance` corresponding to a given execution is not considered to be complete unless the execution completes successfully. Using the EndOfDay `Job` described previously as an example, consider a `JobInstance` for 01-01-2017 that failed the first time it was run. If it is run again with the same identifying job parameters as the first run (01-01-2017), a new `JobExecution` is created. However, there is still only one `JobInstance`.

A `Job` defines what a job is and how it is to be executed, and a `JobInstance` is a purely organizational object to group executions together, primarily to enable correct restart semantics. A `JobExecution`, however, is the primary storage mechanism for what actually happened during a run and contains many more properties that must be controlled and persisted, as shown in the following table:

.JobExecution Properties
|===
|Property |Definition

|Status
|A `BatchStatus` object that indicates the status of the execution. While running, it is `BatchStatus#STARTED`. If it fails, it is `BatchStatus#FAILED`. If it finishes successfully, it is `BatchStatus#COMPLETED`.

|startTime
|A `java.util.Date` representing the current system time when the execution was started. This field is empty if the job has yet to start.

|endTime
|A `java.util.Date` representing the current system time when the execution finished, regardless of whether or not it was successful. The field is empty if the job has yet to finish.

|exitStatus
|The `ExitStatus`, indicating the result of the run. It is most important, because it contains an exit code that is returned to the caller. See chapter 5 for more details. The field is empty if the job has yet to finish.

|createTime
|A `java.util.Date` representing the current system time when the `JobExecution` was first persisted. The job may not have been started yet (and thus has no start time), but it always has a createTime, which is required by the framework for managing job level `ExecutionContexts`.

|lastUpdated
|A `java.util.Date` representing the last time a `JobExecution` was persisted. This field is empty if the job has yet to start.

|executionContext
|The "property bag" containing any user data that needs to be persisted between executions.

|failureExceptions
|The list of exceptions encountered during the execution of a `Job`. These can be useful if more than one exception is encountered during the failure of a `Job`.
|===

These properties are important because they are persisted and can be used to completely determine the status of an execution. For example, if the EndOfDay job for 01-01 is executed at 9:00 PM and fails at 9:30, the following entries are made in the batch metadata tables:

.BATCH_JOB_INSTANCE
|===
|JOB_INST_ID |JOB_NAME

|1
|EndOfDayJob
|===

.BATCH_JOB_EXECUTION_PARAMS
|===
|JOB_EXECUTION_ID|TYPE_CD|KEY_NAME|DATE_VAL|IDENTIFYING

|1
|DATE
|schedule.Date
|2017-01-01
|TRUE
|===

.BATCH_JOB_EXECUTION
|===
|JOB_EXEC_ID|JOB_INST_ID|START_TIME|END_TIME|STATUS

|1
|1
|2017-01-01 21:00
|2017-01-01 21:30
|FAILED
|===

NOTE: Column names may have been abbreviated or removed for the sake of clarity and formatting.

Now that the job has failed, assume that it took the entire night for the problem to be determined, so that the 'batch window' is now closed. Further assuming that the window starts at 9:00 PM, the job is kicked off again for 01-01, starting where it left off and completing successfully at 9:30. Because it is now the next day, the 01-02 job must be run as well, and it is kicked off just afterwards at 9:31 and completes in its normal one-hour time at 10:30.
There is no requirement that one `JobInstance` be kicked off after another, unless there is potential for the two jobs to attempt to access the same data, causing issues with locking at the database level. It is entirely up to the scheduler to determine when a `Job` should be run. Since they are separate `JobInstances`, Spring Batch makes no attempt to stop them from being run concurrently. (Attempting to run the same `JobInstance` while another is already running results in a `JobExecutionAlreadyRunningException` being thrown.) There should now be an extra entry in the `JobInstance` table and two extra entries in both the `JobParameters` and `JobExecution` tables, as shown in the following tables:

.BATCH_JOB_INSTANCE
|===
|JOB_INST_ID |JOB_NAME

|1
|EndOfDayJob

|2
|EndOfDayJob
|===

.BATCH_JOB_EXECUTION_PARAMS
|===
|JOB_EXECUTION_ID|TYPE_CD|KEY_NAME|DATE_VAL|IDENTIFYING

|1
|DATE
|schedule.Date
|2017-01-01 00:00:00
|TRUE

|2
|DATE
|schedule.Date
|2017-01-01 00:00:00
|TRUE

|3
|DATE
|schedule.Date
|2017-01-02 00:00:00
|TRUE
|===

.BATCH_JOB_EXECUTION
|===
|JOB_EXEC_ID|JOB_INST_ID|START_TIME|END_TIME|STATUS

|1
|1
|2017-01-01 21:00
|2017-01-01 21:30
|FAILED

|2
|1
|2017-01-02 21:00
|2017-01-02 21:30
|COMPLETED

|3
|2
|2017-01-02 21:31
|2017-01-02 22:30
|COMPLETED
|===

NOTE: Column names may have been abbreviated or removed for the sake of clarity and formatting.

=== Step

A `Step` is a domain object that encapsulates an independent, sequential phase of a batch job. Therefore, every `Job` is composed entirely of one or more steps. A `Step` contains all of the information necessary to define and control the actual batch processing. This is a necessarily vague description because the contents of any given `Step` are at the discretion of the developer writing a `Job`. A `Step` can be as simple or complex as the developer desires. A simple `Step` might load data from a file into the database, requiring little or no code (depending upon the implementations used).
A more complex `Step` may have complicated business rules that are applied as part of the processing. As with a `Job`, a `Step` has an individual `StepExecution` that correlates with a unique `JobExecution`, as shown in the following image:

.Job Hierarchy With Steps
image::{batch-asciidoc}images/jobHeirarchyWithSteps.png[Figure 2.1: Job Hierarchy With Steps, scaledwidth="60%"]

==== StepExecution

A `StepExecution` represents a single attempt to execute a `Step`. A new `StepExecution` is created each time a `Step` is run, similar to `JobExecution`. However, if a step fails to execute because the step before it fails, no execution is persisted for it. A `StepExecution` is created only when its `Step` is actually started.

`Step` executions are represented by objects of the `StepExecution` class. Each execution contains a reference to its corresponding step and `JobExecution` and transaction related data, such as commit and rollback counts and start and end times. Additionally, each step execution contains an `ExecutionContext`, which contains any data a developer needs to have persisted across batch runs, such as statistics or state information needed to restart. The following table lists the properties for `StepExecution`:

.StepExecution Properties
|===
|Property|Definition

|Status
|A `BatchStatus` object that indicates the status of the execution. While running, the status is `BatchStatus.STARTED`. If it fails, the status is `BatchStatus.FAILED`. If it finishes successfully, the status is `BatchStatus.COMPLETED`.

|startTime
|A `java.util.Date` representing the current system time when the execution was started. This field is empty if the step has yet to start.

|endTime
|A `java.util.Date` representing the current system time when the execution finished, regardless of whether or not it was successful. This field is empty if the step has yet to exit.

|exitStatus
|The `ExitStatus` indicating the result of the execution. It is most important, because it contains an exit code that is returned to the caller. See chapter 5 for more details. This field is empty if the step has yet to exit.

|executionContext
|The "property bag" containing any user data that needs to be persisted between executions.

|readCount
|The number of items that have been successfully read.

|writeCount
|The number of items that have been successfully written.

|commitCount
|The number of transactions that have been committed for this execution.

|rollbackCount
|The number of times the business transaction controlled by the `Step` has been rolled back.

|readSkipCount
|The number of times `read` has failed, resulting in a skipped item.

|processSkipCount
|The number of times `process` has failed, resulting in a skipped item.

|filterCount
|The number of items that have been 'filtered' by the `ItemProcessor`.

|writeSkipCount
|The number of times `write` has failed, resulting in a skipped item.
|===

=== ExecutionContext

An `ExecutionContext` represents a collection of key/value pairs that are persisted and controlled by the framework in order to allow developers a place to store persistent state that is scoped to a `StepExecution` object or a `JobExecution` object. For those familiar with Quartz, it is very similar to JobDataMap. The best usage example is to facilitate restart. Using flat file input as an example, while processing individual lines, the framework periodically persists the `ExecutionContext` at commit points. Doing so allows the `ItemReader` to store its state in case a fatal error occurs during the run or even if the power goes out.
All that is needed is to put the current number of lines read into the context, as shown in the following example, and the framework will do the rest:

[source, java]
----
// Store the reader's current position under a reader-specific key
executionContext.putLong(getKey(LINES_READ_COUNT), reader.getPosition());
----

Using the EndOfDay example from the `Job` Stereotypes section, assume there is one step, 'loadData', that loads a file into the database. After the first failed run, the metadata tables would look like the following example:

.BATCH_JOB_INSTANCE
|===
|JOB_INST_ID|JOB_NAME

|1
|EndOfDayJob
|===

.BATCH_JOB_EXECUTION_PARAMS
|===
|JOB_INST_ID|TYPE_CD|KEY_NAME|DATE_VAL

|1
|DATE
|schedule.Date
|2017-01-01
|===

.BATCH_JOB_EXECUTION
|===
|JOB_EXEC_ID|JOB_INST_ID|START_TIME|END_TIME|STATUS

|1
|1
|2017-01-01 21:00
|2017-01-01 21:30
|FAILED
|===

.BATCH_STEP_EXECUTION
|===
|STEP_EXEC_ID|JOB_EXEC_ID|STEP_NAME|START_TIME|END_TIME|STATUS

|1
|1
|loadData
|2017-01-01 21:00
|2017-01-01 21:30
|FAILED
|===

.BATCH_STEP_EXECUTION_CONTEXT
|===
|STEP_EXEC_ID|SHORT_CONTEXT

|1
|{piece.count=40321}
|===

In the preceding case, the `Step` ran for 30 minutes and processed 40,321 'pieces', which would represent lines in a file in this scenario. This value is updated just before each commit by the framework and can contain multiple rows corresponding to entries within the `ExecutionContext`. Being notified before a commit requires one of the various `StepListener` implementations (or an `ItemStream`), which are discussed in more detail later in this guide. As with the previous example, it is assumed that the `Job` is restarted the next day. When it is restarted, the values from the `ExecutionContext` of the last run are reconstituted from the database. When the `ItemReader` is opened, it can check to see if it has any stored state in the context and initialize itself from there, as shown in the following example:

[source, java]
----
// Only a restarted execution has a stored line count in its context
if (executionContext.containsKey(getKey(LINES_READ_COUNT))) {
    log.debug("Initializing for restart. Restart data is: " + executionContext);

    long lineCount = executionContext.getLong(getKey(LINES_READ_COUNT));

    LineReader reader = getReader();

    // Skip the lines that were already read, stopping early if the input ends
    Object record = "";
    while (reader.getPosition() < lineCount && record != null) {
        record = readLine();
    }
}
----

In this case, after the above code runs, the current line is 40,322, allowing the `Step` to start again from where it left off. The `ExecutionContext` can also be used for statistics that need to be persisted about the run itself. For example, if a flat file contains orders for processing that exist across multiple lines, it may be necessary to store how many orders have been processed (which is much different from the number of lines read), so that an email can be sent at the end of the `Step` with the total number of orders processed in the body. The framework handles storing this for the developer, in order to correctly scope it with an individual `JobInstance`. It can be very difficult to know whether an existing `ExecutionContext` should be used or not. For example, using the 'EndOfDay' example from above, when the 01-01 run starts again for the second time, the framework recognizes that it is the same `JobInstance` and, on an individual `Step` basis, pulls the `ExecutionContext` out of the database and hands it (as part of the `StepExecution`) to the `Step` itself. Conversely, for the 01-02 run, the framework recognizes that it is a different instance, so an empty context must be handed to the `Step`. There are many of these types of determinations that the framework makes for the developer, to ensure the state is given to them at the correct time. It is also important to note that exactly one `ExecutionContext` exists per `StepExecution` at any given time. Clients of the `ExecutionContext` should be careful, because this creates a shared keyspace. As a result, care should be taken when putting values in to ensure no data is overwritten.
However, the `Step` stores absolutely no data in the context, so there is no way to adversely affect the framework.

It is also important to note that there is at least one `ExecutionContext` per `JobExecution` and one for every `StepExecution`. For example, consider the following code snippet:

[source, java]
----
ExecutionContext ecStep = stepExecution.getExecutionContext();
ExecutionContext ecJob = jobExecution.getExecutionContext();
// ecStep does not equal ecJob
----

As noted in the comment, `ecStep` does not equal `ecJob`. They are two different `ExecutionContexts`. The one scoped to the `Step` is saved at every commit point in the `Step`, whereas the one scoped to the `Job` is saved in between every `Step` execution.

=== JobRepository

`JobRepository` is the persistence mechanism for all of the Stereotypes mentioned above. It provides CRUD operations for `JobLauncher`, `Job`, and `Step` implementations. When a `Job` is first launched, a `JobExecution` is obtained from the repository, and, during the course of execution, `StepExecution` and `JobExecution` implementations are persisted by passing them to the repository.

[role="xmlContent"]
The batch namespace provides support for configuring a `JobRepository` instance with the `` tag, as shown in the following example:

[source, xml, role="xmlContent"]
----
----

[role="javaContent"]
When using Java configuration, the `@EnableBatchProcessing` annotation provides a `JobRepository` as one of the components automatically configured out of the box.
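
To make the bookkeeping described in this chapter more concrete, the following plain-Java sketch models how a repository might map launches onto instances and executions. It is not the Spring Batch `JobRepository` API: `ToyJobRepository`, `Execution`, and `createExecution` are hypothetical names, the storage is an in-memory map, and the sketch assumes only the contract stated earlier (`JobInstance` = `Job` + identifying `JobParameters`).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy in-memory model of instance and execution bookkeeping.
 * All names are hypothetical; this is not the Spring Batch API.
 */
class ToyJobRepository {

    /** One attempt to run a job instance. */
    static final class Execution {
        final long id;
        final long instanceId;
        String status = "STARTED";

        Execution(long id, long instanceId) {
            this.id = id;
            this.instanceId = instanceId;
        }
    }

    // Maps "job name + identifying parameters" to an instance id.
    private final Map<String, Long> instanceIds = new HashMap<>();
    private final List<Execution> executions = new ArrayList<>();

    /**
     * Records a new execution. The instance is reused when the job name and
     * identifying parameters match an earlier launch; otherwise a new
     * instance id is allocated.
     */
    Execution createExecution(String jobName, Map<String, String> identifyingParams) {
        // A sorted copy makes the key independent of parameter ordering.
        String key = jobName + "|" + new TreeMap<>(identifyingParams);
        Long instanceId = instanceIds.get(key);
        if (instanceId == null) {
            instanceId = (long) instanceIds.size() + 1;
            instanceIds.put(key, instanceId);
        }
        Execution execution = new Execution(executions.size() + 1L, instanceId);
        executions.add(execution);
        return execution;
    }

    int instanceCount() {
        return instanceIds.size();
    }

    int executionCount() {
        return executions.size();
    }
}
```

Calling `createExecution` twice for `EndOfDayJob` with `schedule.Date` set to `2017-01-01` and once with `2017-01-02` yields two instances and three executions in this model, which mirrors the metadata tables in the `JobExecution` section.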
=== JobLauncher

`JobLauncher` represents a simple interface for launching a `Job` with a given set of `JobParameters`, as shown in the following example:

[source, java]
----
public interface JobLauncher {

    public JobExecution run(Job job, JobParameters jobParameters)
            throws JobExecutionAlreadyRunningException, JobRestartException,
                   JobInstanceAlreadyCompleteException, JobParametersInvalidException;
}
----

It is expected that implementations obtain a valid `JobExecution` from the `JobRepository` and execute the `Job`.

=== Item Reader

`ItemReader` is an abstraction that represents the retrieval of input for a `Step`, one item at a time. When the `ItemReader` has exhausted the items it can provide, it indicates this by returning `null`. More details about the `ItemReader` interface and its various implementations can be found in <>.

=== Item Writer

`ItemWriter` is an abstraction that represents the output of a `Step`, one batch or chunk of items at a time. Generally, an `ItemWriter` has no knowledge of the input it should receive next and knows only the item that was passed in its current invocation. More details about the `ItemWriter` interface and its various implementations can be found in <>.

=== Item Processor

`ItemProcessor` is an abstraction that represents the business processing of an item. While the `ItemReader` reads one item, and the `ItemWriter` writes them, the `ItemProcessor` provides an access point to transform or apply other business processing. If, while processing the item, it is determined that the item is not valid, returning `null` indicates that the item should not be written out. More details about the `ItemProcessor` interface can be found in <>.

[role="xmlContent"]
=== Batch Namespace

Many of the domain concepts listed previously need to be configured in a Spring `ApplicationContext`.
While there are implementations of the interfaces above that can be used in a standard bean definition, a namespace has been provided for ease of configuration, as shown in the following example:

[source, xml, role="xmlContent"]
----
----

[role="xmlContent"]
As long as the batch namespace has been declared, any of its elements can be used. More information on configuring a `Job` can be found in <>. More information on configuring a `Step` can be found in <>.
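
The reader, processor, and writer contracts described earlier in this chapter can be illustrated with a small plain-Java sketch. The interfaces below are hypothetical stand-ins, not the Spring Batch `ItemReader`, `ItemProcessor`, and `ItemWriter` types, and the `chunkSize` parameter is an assumption introduced only to show items being written one chunk at a time. The loop relies on just two rules from the text: a reader signals exhaustion by returning `null`, and a processor returns `null` for an item that should not be written.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for the contracts described above; not the Spring Batch interfaces. */
final class ChunkSketch {

    /** Returns the next item, or null when the input is exhausted. */
    interface Reader<I> {
        I read();
    }

    /** Transforms an item, or returns null to filter it out. */
    interface Processor<I, O> {
        O process(I item);
    }

    /** Receives one chunk of processed items at a time. */
    interface Writer<O> {
        void write(List<O> chunk);
    }

    /** Runs a simplified read-process-write loop and returns the number of items written. */
    static <I, O> int run(Reader<I> reader, Processor<I, O> processor,
                          Writer<O> writer, int chunkSize) {
        int written = 0;
        boolean exhausted = false;
        while (!exhausted) {
            List<O> chunk = new ArrayList<>();
            // Read up to chunkSize items, processing each one as it arrives.
            for (int i = 0; i < chunkSize; i++) {
                I item = reader.read();
                if (item == null) {
                    exhausted = true;
                    break;
                }
                O processed = processor.process(item);
                if (processed != null) {
                    chunk.add(processed); // filtered items never reach the writer
                }
            }
            if (!chunk.isEmpty()) {
                writer.write(chunk);
                written += chunk.size();
            }
        }
        return written;
    }
}
```

In this sketch each call to `write` stands in for a commit point, which is roughly where the chapter says the step-scoped `ExecutionContext` is saved. Skip handling, transactions, and restart are deliberately left out.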