ETL Tools vs. SQL and system tools
Built-in tools have some serious advantages, like easier maintenance, fewer bugs, less dependency on the database in use, easier to migrate to another server, etc. But when it comes to performance it is far from the best. What if you wanted to make our ETL even better? Veselin Davidov gives us a full tour!
There are plenty of ETL tools (PDI, Talend, SSIS, Data Integrator, etc.) and they are all undoubtedly useful. We can see how the process works or could use them for scheduling, email notifications, ftp connections and whatever else we might need to build a proper ETL process. They are so sophisticated that we can design our transformations without SQL expertise. Some people even advocate against writing complex SQL for transformations and insist on using the build in tools. There are companies that don’t require in-depth knowledge of SQL for their data warehouse designers if they have experience with the tool.
But what if we want to make our ETL even better
From my experience working in an angular development company, I can agree that there are some advantages for using the built-in tools like easier maintenance, fewer bugs, less dependency on the database in use, easier to migrate to another server, etc. But when it comes to performance it is far from the best. If we are working with big data (although big is a relative term) the performance of our ETL process usually is one of the keys to its success. If our daily ETL is taking more than a day then we have a real problem and even if it takes 5 hours and we can reduce it to 2 we probably should.
Simple ETL Tool vs Pure SQL example
I will illustrate with a simple example using Oracle database and PDI. I ran transformations on
the same tables (1 million records) first using table input and table output steps and then using a
simple SQL command.
And now let’s compare the results
The transformation using PDI steps takes around 10s to finish and the one using pure SQL takes 2-3s.
That’s with the simplest command. If there is more complex logic like joins and data transformations and we can handle them in the database the speed difference grows. I know that these transformation steps have their role especially to merge data from different databases but we shouldn’t overuse them and we should ask ourselves whether there is a better / faster way of doing things.
The same goes for system tools. Usually, a simple sed or awk script will be many times faster than using corresponding string manipulation steps etc. It might make the ETL platform dependent and a bit harder for the developers but in practice, we don’t change the ETL server OS that often and the performance gain is substantial. Sometimes it is even a good idea to write a custom program that solves a particular transformation problem better than the build in tools.
In conclusion, I would like to point out that I am not against ETL tools. They make life easier and I have used them in every ETL project I have designed but I don’t use them for every transformation in the process. Try to write better SQL, learn the advanced system tools you can use, read about external tools that can speed up the process. We should always keep our minds open for other solutions and ideas that can improve the final result. Please do not hesitate to ask, if you have any questions and please share recommendations!