Batch of goodness

Java EE 7: An Overview of Batch Processing - Part 3

  

Restarting failed jobs

So far, we have been starting batch jobs using the jobOperator.start() method. Let's say that our payroll input file has some errors. Either the ItemReader or the ItemProcessor could detect invalid records and fail the current step and the job. The administrator or the end user can fix the error and restart the batch job. However, launching a new job that starts from the beginning after recovering from errors might not scale if the amount of data to be processed is large. JobOperator provides another method called restart() to solve exactly this problem.

Quick overview of JobInstance and JobExecution

We saw earlier that a job is essentially a container for steps. When a job is started, it must be tracked, so the batch runtime creates a JobInstance. A JobInstance refers to the concept of a logical run. In our example, we have a PayrollJob and if the PayrollJob is run every month, there will be a Jan-2013 JobInstance and there will be another Feb-2013 JobInstance, and so on.

If the payroll processing for Jan-2013 fails, it must be restarted (after presumably fixing the error), but it is still the Jan-2013 run because it is still processing Jan-2013 records.

JobExecution refers to the concept of a single attempt to run a Job. Each time a job is started or restarted, a new JobExecution is created that belongs to the same JobInstance. In our example, if the Jan-2013 JobInstance is restarted, it is still the same Jan-2013 JobInstance but a new JobExecution is created that belongs to the same JobInstance.

In summary, a job can have one or more JobInstances, and each JobInstance can have one or more JobExecutions. Using a new JobInstance means "start from the beginning" and using an existing JobInstance generally means "start from where you left off."
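The relationship between the two can be modeled in a few lines of plain Java. This is only a self-contained sketch: JobRepositoryModel, its methods, and its ID counters are hypothetical names invented for illustration and are not part of javax.batch. A real batch runtime tracks this information in its JobRepository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model (not the javax.batch API): one JobInstance
// per logical run, one JobExecution per attempt at that run.
class JobRepositoryModel {
    private long nextInstanceId = 1;
    private long nextExecutionId = 1;
    // executionId -> instanceId it belongs to
    private final Map<Long, Long> instanceOfExecution = new HashMap<>();
    // instanceId -> all executionIds (attempts), oldest first
    private final Map<Long, List<Long>> executionsOfInstance = new HashMap<>();

    // Like start(): a new logical run, so a new instance AND a new execution.
    long start() {
        long instanceId = nextInstanceId++;
        executionsOfInstance.put(instanceId, new ArrayList<>());
        return newExecution(instanceId);
    }

    // Like restart(): a new execution that joins the failed execution's instance.
    long restart(long failedExecutionId) {
        Long instanceId = instanceOfExecution.get(failedExecutionId);
        if (instanceId == null) {
            throw new IllegalArgumentException("Unknown execution: " + failedExecutionId);
        }
        return newExecution(instanceId);
    }

    long instanceIdOf(long executionId) {
        return instanceOfExecution.get(executionId);
    }

    List<Long> executionsOf(long instanceId) {
        return executionsOfInstance.get(instanceId);
    }

    private long newExecution(long instanceId) {
        long executionId = nextExecutionId++;
        instanceOfExecution.put(executionId, instanceId);
        executionsOfInstance.get(instanceId).add(executionId);
        return executionId;
    }
}
```

In this model, starting the Jan-2013 run and then restarting it yields two execution IDs that map to the same instance ID, while starting the Feb-2013 run yields a new instance ID.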

Resuming failed jobs

If you recall, a chunk-style step executes in a transaction in which item-count entries are read, processed, and written. After the ItemWriter's writeItems() has been invoked, the batch runtime calls the checkpointInfo() method on both ItemReader and ItemWriter. This allows both ItemReader and ItemWriter to bookmark (save) their current progress. The data that is bookmarked for an ItemReader could be anything that will help it to resume reading. For example, our SimpleItemReader needs to save the line number up to which it has read successfully so far.
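To make the bookmarking idea concrete, here is a minimal sketch of a reader whose checkpoint is simply the number of records read so far. It is not the article's SimpleItemReader, whose source is not shown here. The CheckpointableReader interface is a local stand-in that only mirrors the method names discussed above (open(), readItem(), checkpointInfo()) so that the example compiles without the batch API on the classpath, and LineNumberReaderSketch is a hypothetical name.

```java
import java.io.Serializable;
import java.util.List;

// Local stand-in that mirrors the method names the article uses for
// ItemReader; it is NOT the real javax.batch interface.
interface CheckpointableReader {
    void open(Serializable checkpoint);
    Object readItem();
    Serializable checkpointInfo();
}

// Hypothetical reader over in-memory lines; the bookmark is simply the
// number of records read so far.
class LineNumberReaderSketch implements CheckpointableReader {
    private final List<String> lines;
    private int recordNumber;

    LineNumberReaderSketch(List<String> lines) {
        this.lines = lines;
    }

    @Override
    public void open(Serializable checkpoint) {
        // null means a fresh start; otherwise skip what was already read.
        recordNumber = (checkpoint == null) ? 0 : (Integer) checkpoint;
    }

    @Override
    public Object readItem() {
        // Returning null signals that there is no more input.
        return recordNumber < lines.size() ? lines.get(recordNumber++) : null;
    }

    @Override
    public Serializable checkpointInfo() {
        // Integer is Serializable, so it can be stored as the bookmark.
        return Integer.valueOf(recordNumber);
    }
}
```

A second reader opened with the value returned by checkpointInfo() continues with the first unread line instead of starting over.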

Section 10.8 of the JSR 352 specification describes the restart processing in detail.

Let's take a moment to look into the log file where our SimpleItemReader outputs some useful messages from the open() and checkpointInfo() methods. Each message is prefixed with the string [SimpleItemReader] so you can quickly identify the messages. The log file is located at <GlassFish install Dir>/domains/domain1/logs/server.log.

Listing 10 shows the messages that are prefixed by the string [SimpleItemReader]:

 

[SimpleItemReader] Opened Payroll File. Will start reading from record number: 0]]
[SimpleItemReader] checkpointInfo() called. Returning current recordNumber: 2]]
[SimpleItemReader] checkpointInfo() called. Returning current recordNumber: 4]]
[SimpleItemReader] checkpointInfo() called. Returning current recordNumber: 6]]
[SimpleItemReader] checkpointInfo() called. Returning current recordNumber: 8]]
[SimpleItemReader] checkpointInfo() called. Returning current recordNumber: 9]]
[SimpleItemReader] close called.]]

 

Note: You could also use the command tail -f server.log | grep SimpleItemReader.

Because our job XML file (SimplePayrollJob.xml) specifies a value of 2 for item-count as the chunk size, the batch runtime calls checkpointInfo() on our ItemReader every two records. The batch runtime stores this checkpoint information in the JobRepository. So, if an error occurs in the middle of chunk processing, the batch application can resume from the last successful checkpoint.
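The checkpoint cadence can be reproduced with a small simulation. This is a simplified model and not the batch runtime itself: ChunkLoopSketch and its run() method are hypothetical, and the transaction handling and JobRepository persistence that the real runtime performs are reduced to a comment.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a chunk-style step; the real batch runtime also wraps
// each chunk in a transaction and persists checkpoints in the JobRepository.
class ChunkLoopSketch {
    // Returns the checkpoint (records read so far) taken after each chunk.
    static List<Integer> run(List<String> records, int itemCount) {
        List<Integer> checkpoints = new ArrayList<>();
        int recordNumber = 0;
        while (recordNumber < records.size()) {
            List<String> chunk = new ArrayList<>();
            // Read (and process) up to itemCount items.
            while (chunk.size() < itemCount && recordNumber < records.size()) {
                chunk.add(records.get(recordNumber++));
            }
            // Here the runtime would pass the chunk to the writer and then
            // ask the reader and writer for checkpointInfo().
            checkpoints.add(recordNumber);
        }
        return checkpoints;
    }
}
```

With nine records and an itemCount of 2, run() returns [2, 4, 6, 8, 9], the same record numbers that appear in Listing 10. The last chunk holds only one record because the input runs out.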

Let's introduce some errors in our input data file and see how we can recover from input errors.

Our servlet's output displays the location of the input file from which CSV data is read for our payroll application. The file is located at <GlassFish install Dir>/domains/domain1/applications/hello-batch/WEB-INF/classes/payroll-data/payroll-data.csv. Listing 11 shows the content of the file:

 

1, 8100
2, 8200
3, 8300
4, 8400
5, 8500
6, 8600
7, 8700
8, 8800
9, 8900

 

Open your favorite editor and introduce an error. For example, let's say we add a few characters to the salary field on the eighth record, as shown in Listing 12:

1, 8100
2, 8200
3, 8300
4, 8400
5, 8500
6, 8600
7, 8700
8, abc8800
9, 8900

Save the file and quit the editor. Go back to your browser and click the Calculate Payroll button followed by the Refresh button. You will see that the recently submitted job failed, as shown in Figure 6. (Look at the Exit Status column.)

Figure 6

 

You will also notice that a Restart button appears next to the execution ID of the job that just failed. If you click Restart now, the job will fail again (because we haven't fixed the issue yet). Figure 7 shows what is displayed after a few clicks of the Refresh button.

Figure 7

If you look into the GlassFish server log (located under <GlassFish install Dir>/domains/domain1/logs/server.log), you will see an exception, as shown in Listing 13:

Caught exception executing step: com.ibm.jbatch.container.exception.BatchContainerRuntimeException: 
Failure in Read-Process-Write Loop
...
...
Caused by: java.lang.NumberFormatException: For input string: "abc8800"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:492)
        at java.lang.Integer.parseInt(Integer.java:527)
        at com.oracle.javaee7.samples.batch.hello.SimpleItemReader.readItem(SimpleItemReader.java:100)
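The exception is what Integer.parseInt() raises for a non-numeric salary field. The following sketch shows one way such a record could be parsed. PayrollRecordParser is a hypothetical class, and the "id, salary" layout is assumed from Listing 11; the actual parsing code in SimpleItemReader is not shown in this article.

```java
// Hypothetical parser for the "id, salary" records of Listing 11; the
// article's SimpleItemReader source is not shown, so this is only a guess
// at the kind of parsing that raises the exception above.
class PayrollRecordParser {
    // Returns {employeeId, salary}; throws NumberFormatException on bad input.
    static int[] parse(String line) {
        String[] fields = line.split(",");
        if (fields.length != 2) {
            throw new IllegalArgumentException("Expected 2 fields: " + line);
        }
        // trim() removes the space that follows the comma in the data file.
        int id = Integer.parseInt(fields[0].trim());
        int salary = Integer.parseInt(fields[1].trim());
        return new int[] {id, salary};
    }
}
```

Calling parse("8, 8800") returns the pair 8 and 8800, whereas parse("8, abc8800") throws a NumberFormatException, which in a chunk-style step fails the step and the job unless it is handled.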

You should also notice that when you click the Restart button, a new job execution is created but its job instance ID remains the same. When you click the Restart button, our PayrollJobSubmitter servlet calls a method named restartBatchJob(), which is shown in Listing 14:

private long restartBatchJob(long lastExecutionId)
                        throws Exception {
        JobOperator jobOperator = BatchRuntime.getJobOperator();
        Properties props = new Properties();
        props.setProperty("payrollInputDataFileName", payrollInputDataFileName);

        return jobOperator.restart(lastExecutionId, props);
}

The key line in Listing 14 is the call to JobOperator's restart() method. This method takes a Properties object just like start(), but instead of a job XML file name, it takes the execution ID of the most recently failed job. Using that execution ID, the batch runtime can retrieve the previous execution's last successful checkpoint. The retrieved checkpoint data is passed to the open() method of our SimpleItemReader (and ItemWriter) to enable them to resume reading (and writing) from the last successful checkpoint.

While ensuring that your browser shows the page with a Restart button, edit the file again and remove the extraneous characters from the eighth record. Then click the Restart and Refresh buttons. The latest execution should display a COMPLETED status, as shown in Figure 8.

Figure 8

It is time to look into the log file to understand what just happened. Looking again for messages prefixed with [SimpleItemReader], you might see the output shown in Listing 15:

[SimpleItemReader] Opened Payroll File. Will start reading from record number: 7]] 
[SimpleItemReader] checkpointInfo() called. Returning current recordNumber: 9]]
[SimpleItemReader] checkpointInfo() called. Returning current recordNumber: 10]]
[SimpleItemReader] close called.]]

As you can see, our SimpleItemReader's open() method was called with the previous checkpoint value (which was record number 7), allowing our SimpleItemReader to skip the first six records and resume reading from the seventh record.
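The whole fail, fix, and resume cycle can be sketched in plain Java. ResumeSketch is a hypothetical, simplified model, not the batch runtime: it keeps the checkpoint in a field rather than in a JobRepository, and it counts records read (so its checkpoint after six good records is 6), which may differ from the numbering convention that SimpleItemReader uses in its log messages.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of one job execution over "id, salary"
// records: process chunks of itemCount records, remember the last good
// checkpoint, and stop at the first record whose salary is not a number.
class ResumeSketch {
    // Last checkpoint that was reached; survives a failed attempt.
    int lastCheckpoint = 0;
    // IDs of the records that belong to successfully completed chunks.
    final List<String> written = new ArrayList<>();

    // Returns true if the attempt completed, false if it failed.
    boolean runFrom(int checkpoint, List<String> records, int itemCount) {
        int recordNumber = checkpoint;            // skip records already done
        while (recordNumber < records.size()) {
            List<String> chunk = new ArrayList<>();
            try {
                while (chunk.size() < itemCount && recordNumber < records.size()) {
                    String[] fields = records.get(recordNumber).split(",");
                    Integer.parseInt(fields[1].trim());   // validate the salary
                    chunk.add(fields[0].trim());
                    recordNumber++;
                }
            } catch (NumberFormatException e) {
                return false;   // chunk is discarded; checkpoint is unchanged
            }
            written.addAll(chunk);
            lastCheckpoint = recordNumber;
        }
        return true;
    }
}
```

Run against the data of Listing 12 with an itemCount of 2, the first attempt fails in the chunk that contains the eighth record and leaves lastCheckpoint at 6. Calling runFrom(lastCheckpoint, ...) with the corrected data then processes only records 7, 8, and 9, so no record is written twice.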

Viewing Batch Jobs using the GlassFish 4.0 Admin Console

You can view the list of all batch jobs in the JobRepository. Fire up a browser window and go to localhost:4848. Then click server (Admin Server) in the left panel, as shown in Figure 9.

Figure 9

You can click the Batch tab, which should list all the batch jobs submitted to this GlassFish server. Note that the JobRepository is implemented using a database and, hence, the job details survive GlassFish 4.0 server restarts. Figure 10 shows all the batch jobs in the JobRepository.

Figure 10

You can also click one of the IDs listed under Execution IDs. For example, clicking 293 reveals details about just that execution:

Figure 11

More details about the execution can be obtained by clicking the Execution Steps tab on the top.

Figure 12

Look at the statistics provided by this page. It shows how many reads, writes, and commits were performed during this execution.

Viewing Batch Jobs using the GlassFish 4.0 CLI

You can also view the details about jobs running in the GlassFish 4.0 server by using the command-line interface (CLI).

To view the list of batch jobs, open a command window and run the following command:

asadmin list-batch-jobs -l

 

You should see output similar to Figure 13:

Figure 13

To view the list of batch JobExecutions, you can run this command:

asadmin list-batch-job-executions -l 

 

You should see output similar to Figure 14:

Figure 14

The command lists the completion status of each execution and also the job parameters passed to each execution.

Finally, in order to see details about each step in a JobExecution, you could use the following command:

asadmin list-batch-job-steps -l 

 

You should see output similar to Figure 15:

Figure 15

Take note of the STEPMETRICS column. It tells how many times ItemReader and ItemWriter were called and also how many commits and rollbacks were done. These are extremely valuable metrics.

The CLI output should match the Admin Console view because they both query the same JobRepository.

You can use asadmin help <command-name> to get more details about the CLI commands.

Conclusion

In this article, we saw how to write, package, and run simple batch applications that use chunk-style steps. We also saw how the checkpoint feature of the batch runtime allows for the easy restart of failed batch jobs. Yet, we have barely scratched the surface of JSR 352. With the full set of Java EE components and features at your disposal, including servlets, EJB beans, CDI beans, EJB automatic timers, and so on, feature-rich batch applications can be written fairly easily.

This article also covered (briefly) the GlassFish 4.0 Admin Console and CLI support for querying the batch JobRepository. Both the Admin Console and the CLI provide valuable details about jobs and steps that can be used to detect potential bottlenecks.

JSR 352 supports many more exciting features such as batchlets, splits, flows, and custom checkpoints, which will be covered in future articles.

Author Bio

Mahesh Kannan is a senior software engineer with Oracle's Cloud Application Foundation team, and he is the Expert Group Member for the Java Batch JSR. Due to his extensive experience with application servers, containers, and distributed systems, he has served as lead architect and "consultant at large" on many projects that build innovative solutions for Oracle products.

Reprinted with permission from the Oracle Technology Network, Oracle Corporation.

Cover image courtesy of losmininos
