days
-4
-3
hours
-1
-5
minutes
-2
-2
seconds
-1
-5
search
Why are they great?

ETL Tools vs. SQL and system tools

Veselin Davidov
SQL
© Shutterstock / ViblyPhoto  

Built-in tools have some serious advantages, like easier maintenance, fewer bugs, less dependency on the database in use, easier to migrate to another server, etc. But when it comes to performance it is far from the best. What if you wanted to make our ETL even better? Veselin Davidov gives us a full tour!

There are plenty of ETL tools (PDI, Talend, SSIS, Data Integrator, etc.) and they are all undoubtedly useful. We can see how the process works or could use them for scheduling, email notifications, ftp connections and whatever else we might need to build a proper ETL process. They are so sophisticated that we can design our transformations without SQL expertise. Some people even advocate against writing complex SQL for transformations and insist on using the build in tools. There are companies that don’t require in-depth knowledge of SQL for their data warehouse designers if they have experience with the tool.

But what if we want to make our ETL even better

From my experience working in an angular development company, I can agree that there are some advantages for using the built-in tools like easier maintenance, fewer bugs, less dependency on the database in use, easier to migrate to another server, etc. But when it comes to performance it is far from the best. If we are working with big data (although big is a relative term) the performance of our ETL process usually is one of the keys to its success. If our daily ETL is taking more than a day then we have a real problem and even if it takes 5 hours and we can reduce it to 2 we probably should.

Simple ETL Tool vs Pure SQL example

I will illustrate with a simple example using Oracle database and PDI. I ran transformations on
the same tables (1 million records) first using table input and table output steps and then using a
simple SQL command.

And now let’s compare the results

The transformation using PDI steps takes around 10s to finish and the one using pure SQL takes 2-3s.

That’s with the simplest command. If there is more complex logic like joins and data transformations and we can handle them in the database the speed difference grows. I know that these transformation steps have their role especially to merge data from different databases but we shouldn’t overuse them and we should ask ourselves whether there is a better / faster way of doing things.

SEE ALSO: Why SQL bind variables are important for performance

The same goes for system tools. Usually, a simple sed or awk script will be many times faster than using corresponding string manipulation steps etc. It might make the ETL platform dependent and a bit harder for the developers but in practice, we don’t change the ETL server OS that often and the performance gain is substantial. Sometimes it is even a good idea to write a custom program that solves a particular transformation problem better than the build in tools.

Conclusion

In conclusion, I would like to point out that I am not against ETL tools. They make life easier and I have used them in every ETL project I have designed but I don’t use them for every transformation in the process. Try to write better SQL, learn the advanced system tools you can use, read about external tools that can speed up the process. We should always keep our minds open for other solutions and ideas that can improve the final result. Please do not hesitate to ask, if you have any questions and please share recommendations!

Author

Veselin Davidov

Veselin Davidov is a dev-lead and full-stack developer at Dreamix, a custom
software development company . He has experience in all spheres of software development and
has extensive knowledge in designing, developing and delivery of large enterprise systems with
Java, Spring and the surrounding technologies. He is also passionate about data and has led to
success various Big Data and BI projects.


Leave a Reply

10 Comments on "ETL Tools vs. SQL and system tools"

avatar
400
  Subscribe  
Notify of
Aleksei
Guest

Being mostly an SSIS dev working on a system with a kind of framework to call SSIS packages from the filesystem: they are easier to deploy than separate standalone utils.
P.S. Screenshot from the article looks like this:comment image -is it on purpose? :)

veselin davidov
Guest

not on purpose. i hope it will get fixed

Aleksei
Guest

And agreed, massive data processing usually is slower via SSIS than via SQL, but! massive inserts are way faster via SSIS (once tuned a bit) rather than separate tool (at least if it’s not written with a really proficient dev).

veselin davidov
Guest

Well my point is to look for alternatives if you want performance. I am not saying etl tools are always slow and we shouldn’t use them. But having other options is always a good thing

Aleksei
Guest

Some things are harder to implement in SSIS, but I think, the minimum bar needed for a database developer to start maintaining and developing SISS packages is a bit lower than needed to create a [maintainable] data-processing standalone tool.

veselin davidov
Guest

Yeah the minimum bar is lower and the problem is some people don’t try to go over that bar. Ssis is great. The whole SQl server stack including ssis, ssrs, ssas etc work flowlessly and a developer shouldn’t try to write an alternative to it. But sometimes we can extend it and improve a particular task by writing something on our own while still usong our Etl tool for the other parts of the flow

Aleksei
Guest

Thanks for the replies.
Just for the record: I’m no trying to defend any ETL tool (and SSIS particularly) from anything :) Just added personal two cents.

Veselin Davidov
Guest

yeah and I totally agree with your comments

Stoimen
Guest

Thank you, very informative read!

Veselin Davidov
Guest

Thanks!