How to run your PDI-based ETL from Java
There are multiple ways to run your PDI-based ETL from Java. In this tutorial, Dreamix’s Veselin Davidov covers three methods.
An enterprise-grade BI solution consists of multiple components. You have the reporting tools, the ETL process, the databases and often some kind of web portal, and all of these should be properly integrated. The ETL is usually a scheduled process, but often we would like to allow business users to initiate it manually. The best way to achieve that is through a simple interface built into our web portal: this way they don’t need to know the infrastructure underneath, and we can handle user management, access control, etc. There are multiple ways to initiate an ETL from a Java program, and I will cover a few of them with their advantages and disadvantages.
The most simple method – run external process
This is the easiest approach, and even though it doesn’t look cool, it works, and that’s what matters most in the end. In its simplest form, the Java code just launches PDI’s command-line tools (Kitchen for jobs, Pan for transformations) as a child process.
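A minimal sketch of such a launcher could look like the following. It assumes a Linux install with `kitchen.sh` in the PDI home folder; the class name `KitchenLauncher`, the paths and the parameter names are made up for illustration, while `-file`, `-level` and `-param:` are standard Kitchen options.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: launches Kitchen (PDI's command-line job runner) as a child process.
class KitchenLauncher {

    /** Builds the Kitchen command line. pdiHome and jobFile are assumed, site-specific paths. */
    static List<String> buildCommand(String pdiHome, String jobFile, Map<String, String> params) {
        List<String> cmd = new ArrayList<>();
        cmd.add(pdiHome + "/kitchen.sh");   // Linux launcher; Windows installs ship Kitchen.bat
        cmd.add("-file=" + jobFile);        // job definition (.kjb) to execute
        cmd.add("-level=Basic");            // logging level
        for (Map.Entry<String, String> p : params.entrySet()) {
            cmd.add("-param:" + p.getKey() + "=" + p.getValue()); // named job parameters
        }
        return cmd;
    }

    /** Runs the command, echoes its log output and returns the exit code (0 means success). */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true);       // merge stderr into stdout so one reader drains both
        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println("[kitchen] " + line);
            }
        }
        return process.waitFor();           // blocks until the ETL has finished
    }
}
```

Calling `KitchenLauncher.run(KitchenLauncher.buildCommand(...))` blocks until the job ends, so in a web portal it would normally be submitted to a background thread or executor rather than run on the request thread.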
You can extend that by running it in a separate thread, making the command configurable so it isn’t platform-dependent, reading the output, etc. PDI must be installed on the machine that runs it. The main drawback of this method is that the ETL runs on the same machine as the web portal’s JVM (as a child process of it), so it competes for the same resources and might slow your web portal. I wouldn’t mind using this method if it suits my requirements, for example for small transformations that the business user needs to run and that don’t take much time or resources.
The cooler approach – use the PDI libraries
Pentaho provides Java libraries that allow us to integrate and execute jobs and transformations directly from our Java code. I will illustrate with a simple example that gets the required libraries using Maven and then executes a simple job.
- First the dependencies
- And then we can use the embedded Kettle environment from our code:
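As a rough sketch of both steps: the code below assumes the `kettle-core` and `kettle-engine` artifacts (Maven group `pentaho-kettle`, resolved from Pentaho’s own Maven repository) are on the classpath, in a version matching your PDI release. It will not compile without them. The class name `EmbeddedJobRunner` and the job path are hypothetical, and the job is read from a `.kjb` file rather than from a repository.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.core.exception.KettleException;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

// Hypothetical wrapper around the embedded Kettle engine.
class EmbeddedJobRunner {

    /** Runs a .kjb file inside the current JVM; returns true if it finished without errors. */
    static boolean runJob(String jobFile) throws KettleException {
        KettleEnvironment.init();                      // initialise the engine and its plugins
        JobMeta jobMeta = new JobMeta(jobFile, null);  // null repository: load from the file system
        Job job = new Job(null, jobMeta);
        job.start();                                   // Job is a Thread, so this runs asynchronously
        job.waitUntilFinished();                       // block until the job has completed
        Result result = job.getResult();
        return result.getNrErrors() == 0;
    }

    public static void main(String[] args) throws KettleException {
        // Assumed path, for illustration only.
        boolean ok = runJob("/etl/load_sales.kjb");
        System.out.println(ok ? "Job finished successfully" : "Job finished with errors");
    }
}
```

Loading from a repository, passing parameters and attaching to the log follow the same pattern but need more setup, so they are left out of this sketch.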
I called this the cool approach because, from my experience working in a custom software development company, it gives us much more control over the execution of the jobs. We can read jobs from a repository, set parameters, read output parameters, monitor the log, etc. It’s basically Kitchen embedded in our application. The possibilities here are limitless: we can even use PDI transformations to handle some of the business logic in our application. The drawback is similar to the previous example, and here it is literal: the job executes inside our JVM, so if that JVM is also our web portal, the excessive load might cause problems. Here we don’t need PDI pre-installed on the running machine, but the libraries will be packed into the application, which will make the distributable larger.
Taking the first two approaches to another level
The good thing about these methods is that they reside in our Java code, which of course means we can do whatever we want with that code and extend it in any way we need to. This is kind of obvious, but I still wanted to mention it because it allows easy workarounds for the disadvantages. For example, the biggest drawback we saw is that the ETL runs in, or right next to, the web portal’s JVM and can overload the web server. With a better architecture for our enterprise application, we can easily move that execution to another JVM instance (another server) or even load-balance it across several servers. A simple solution would be to create a separate web service that executes the ETL and call it from the web portal. Another approach would be to use a messaging service and create listeners that execute jobs using either of the methods above.
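The listener idea can be sketched as follows. To stay self-contained, this example uses an in-memory `BlockingQueue` as a stand-in for a real message broker (in production this would typically be a JMS or AMQP queue on another server), and the `jobRunner` callback is a placeholder for either of the two launch methods above. The class name `EtlRequestListener` and the `STOP` marker are invented for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical listener: takes job names off a queue and hands them to a job runner.
class EtlRequestListener implements Runnable {

    /** Special message that tells the listener to shut down. */
    static final String STOP = "__stop__";

    private final BlockingQueue<String> requests;   // stand-in for a broker queue
    private final Consumer<String> jobRunner;       // e.g. launches Kitchen or the embedded engine

    EtlRequestListener(BlockingQueue<String> requests, Consumer<String> jobRunner) {
        this.requests = requests;
        this.jobRunner = jobRunner;
    }

    @Override
    public void run() {
        try {
            while (true) {
                String jobName = requests.take();    // blocks until a request arrives
                if (STOP.equals(jobName)) {
                    return;
                }
                try {
                    jobRunner.accept(jobName);       // run one ETL job at a time
                } catch (RuntimeException e) {
                    // A failed job must not kill the listener; report it and keep going.
                    System.err.println("Job " + jobName + " failed: " + e.getMessage());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();      // preserve the interrupt and exit
        }
    }
}
```

The web portal only puts a job name on the queue and returns immediately. The listener runs in its own thread, or in a separate JVM when a real broker is used, so the heavy work never touches the portal’s request threads.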
The enterprise way – without writing code
PDI comes with a tool called Carte, which basically provides a web service interface to the Pentaho server, allowing us to execute jobs remotely. Running it is pretty straightforward: in the data-integration/pwd folder you have some basic configuration XMLs for the server, and there is good documentation on how to configure it according to your needs. It will also require a repository set up for the jobs. Once running, it can be accessed through a simple web interface.
To run a job, you send an HTTP request to the Carte server.
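The article’s headline for this method is “without writing code”, and indeed the call can be made from a browser or curl. If the web portal is the caller, though, a small Java client is handy. The sketch below targets Carte’s `executeJob` servlet with its `rep`, `user`, `pass`, `job` and `level` query parameters. The host, port, repository name and credentials are all placeholders, and the exact parameters accepted may vary with your PDI version, so check them against your server’s documentation.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical client for triggering a repository job on a Carte server.
class CarteClient {

    /** Builds the executeJob URL; every argument is a site-specific assumption. */
    static String executeJobUrl(String baseUrl, String repo, String repoUser,
                                String repoPass, String jobPath, String logLevel) {
        return baseUrl + "/kettle/executeJob/"
                + "?rep=" + enc(repo)
                + "&user=" + enc(repoUser)
                + "&pass=" + enc(repoPass)
                + "&job=" + enc(jobPath)
                + "&level=" + enc(logLevel);
    }

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Sends the request with HTTP Basic authentication and returns the HTTP status code. */
    static int call(String url, String carteUser, String cartePass) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        String token = Base64.getEncoder().encodeToString(
                (carteUser + ":" + cartePass).getBytes(StandardCharsets.UTF_8));
        con.setRequestMethod("GET");
        con.setRequestProperty("Authorization", "Basic " + token);
        try {
            return con.getResponseCode();   // 200 means Carte accepted the request
        } finally {
            con.disconnect();
        }
    }
}
```

A typical use would be `CarteClient.call(CarteClient.executeJobUrl("http://etl-host:8081", "etl_repo", "admin", "secret", "/etl/load_sales", "Basic"), carteUser, cartePass)`. Note that the repository credentials travel in the query string, so the connection between the portal and Carte should be on a trusted network or behind HTTPS.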
This method allows remote execution on the server, so it doesn’t suffer from the main drawback of the previous two methods. If you run complicated ETLs that take hours and need to run on different machines and servers, this should be your approach.
There are multiple ways to execute our PDI jobs from Java code. I covered just three, but there are probably more. For enterprise applications, most people should go for the enterprise way because it is the most robust and, once set up, is probably the easiest to use. It does make the infrastructure more complicated: you need a server and a repository, and some of its advanced features even require the enterprise version of PDI. There are scenarios where the other two approaches work just fine, so pick the best one for your application!