Snafus

10 common mistakes Java developers make when writing SQL

Lukas Eder
snarf

Java devs operate on a blend of object-oriented and imperative thinking – but when it comes to writing SQL, everything can quickly go to pot. Are you guilty?

This post was originally published over at jooq.org, a blog focusing on Java and SQL
integration from the perspective of the developers of jOOQ.

Java developers mix object-oriented thinking with imperative
thinking, depending on their levels of:

  • Skill (anyone can code imperatively)
  • Dogma (some use the “Pattern-Pattern”, i.e. the pattern of
    applying patterns everywhere and giving them names)
  • Mood (true OO is more clumsy to write than imperative code. At
    first)

But when Java developers write SQL, everything changes. SQL is a
declarative language that has nothing to do with either
object-oriented or imperative thinking. It is very easy to express
a query in SQL. It is not so easy to express it optimally or
correctly. Not only do developers need to re-think their
programming paradigm, they also need to think in terms of set
theory.

Here are common mistakes that a Java developer makes when
writing SQL through JDBC or jOOQ (in
no particular order). For 10 More Common Mistakes, see this article here.

Here are common mistakes that a Java developer makes when
writing SQL (in no particular order):

1. Forgetting about NULL

Misunderstanding NULL is probably the biggest mistake a Java
developer can make when writing SQL. This is also (but not
exclusively) due to the fact that NULL is also called UNKNOWN. If
it were only called UNKNOWN, it would be easier to understand.
Another reason is that JDBC maps SQL NULL to Java null when
fetching data or when binding variables. This may lead to thinking
that NULL = NULL (SQL) would behave the same way as null == null
(Java)

One of the crazier examples of misunderstanding NULL is
when NULL
predicates are used with row value expressions
.

Another, subtle problem appears when misunderstanding the
meaning ofNULL
in NOT IN anti-joins
.

The Cure:

Train yourself. There’s nothing but explicitly thinking about
NULL, every time you write SQL:

  • Is this predicate correct with respect to NULL?
  • Does NULL affect the result of this function?

2. Processing data in Java memory

Few Java developers know SQL very well. The occasional JOIN, the
odd UNION, fine. But window functions? Grouping sets? A lot of Java
developers load SQL data into memory, transform the data into some
appropriate collection type, execute nasty maths on that collection
with verbose loop structures (at least, before Java
8′s Collection improvements
).

But some SQL databases support advanced (and SQL standard!) OLAP
features that tend to perform a lot better and are much easier to
write. A (non-standard) example is Oracle’s awesome MODEL
clause
. Just let the database do the processing and fetch only
the results into Java memory. Because after all some very
smart guys have optimised these expensive products. So in fact, by
moving OLAP to the database, you gain two things:

  • Simplicity. It’s probably easier to write correctly in SQL than
    in Java
  • Performance. The database will probably be faster than your
    algorithm. And more importantly, you don’t have to transmit
    millions of records over the wire.

The Cure:

Every time you implement a data-centric algorithm in Java, ask
yourself: Is there a way to let the database perform that work for
me?

3. Using UNION instead of UNION ALL

It’s a shame that UNION ALL needs an extra keyword compared to
UNION. It would be much better if the SQL standard had been defined
to support:

  • UNION (allowing duplicates)
  • UNION DISTINCT (removing duplicates)

Not only is the removal of duplicates rarely needed (or
sometimes even wrong), it is also quite slow for large result sets
with many columns, as the two subselects need to be ordered, and
each tuple needs to be compared with its subsequent tuple.

Note that even if the SQL standard specifies INTERSECT ALL and
EXCEPT ALL, hardly any database implements these less useful set
operations.

The Cure:

Every time you write a UNION, think if you actually wanted to
write UNION ALL.

4. Using JDBC Paging to page large results

Most databases support some way of paging ordered results
through LIMIT .. OFFSET, TOP .. START AT, OFFSET .. FETCH clauses.
In the absence of support for these clauses, there is still the
possibility for ROWNUM
(Oracle)
 or ROW_NUMBER() OVER()
filtering (DB2, SQL Server 2008 and less)
, which is much faster
than paging in memory. This is specifically true for large
offsets!

The Cure:

Just use those clauses, or a tool (such as jOOQ) that can simulate those clauses
for you.

5. Joining data in Java memory

From early days of SQL, some developers still have an uneasy
feeling when expressing JOINs in their SQL. There is an inherent
fear of JOIN being slow. This can be true if a cost-based optimiser
chooses to perform a nested loop, possibly loading complete tables
into database memory, before creating a joined table source. But
that happens rarely. With appropriate predicates, constraints and
indexes, MERGE JOIN and HASH JOIN operations are extremely fast.
It’s all about the correct metadata (I
cannot cite Tom Kyte often enough for this)
. Nonetheless, there
are probably still quite a few Java developers who will load two
tables from separate queries into maps and join them in Java memory
in one way or another.

The Cure:

If you’re selecting from various tables in various steps, think
again to see if you cannot express your query in a single
statement.

6. Using DISTINCT or UNION to remove duplicates from an
accidental cartesian product

With heavy joining, one can loose track of all the relations
that are playing a role in a SQL statement. Specifically, if
multi-column foreign key relationships are involved, it is possible
to forget to add the relevant predicates in JOIN .. ON clauses.
This might result in duplicate records, but maybe only in
exceptional cases. Some developers may then choose to use DISTINCT
to remove those duplicates again. This is wrong in three ways:

  • It (may) solve the symptoms but not the problem. It may as well
    not solve the symptoms in edge-cases.
  • It is slow for large result sets with many columns. DISTINCT
    performs an ORDER BY operation to remove duplicates.
  • It is slow for large cartesian products, which will still load
    lots of data into memory

The Cure:

As a rule of thumb, when you get unwanted duplicates, always
review your JOIN predicates. There’s probably a subtle cartesian
product in there somewhere.

7. Not using the MERGE statement

This isn’t really a mistake, but probably some lack of knowledge
or some fear towards the powerful
MERGE statement
. Some databases know other forms of UPSERT
statements, e.g. MySQL’s ON DUPLICATE KEY UPDATE clause. But MERGE
is really so powerful, most importantly in databases that heavily
extend the SQL standard, such as SQL
Server
.

The Cure:

If you’re UPSERTING by chaining INSERT and UPDATE or by chaining
SELECT .. FOR UPDATE and then INSERT or UPDATE, think again. Apart
from risking race conditions, you might be able to express a
simpler MERGE statement.

8. Using aggregate functions instead of window functions

Before the introduction of window functions, the only means to
aggregate data in SQL was by using a GROUP BY clause along with
aggregate functions in the projection. This works well in many
cases, and if aggregation data needed to be enriched with regular
data, the grouped query can be pushed down into a joined
subquery.

But SQL:2003 defined window functions, which are implemented by
many popular database vendors. Window functions can aggregate data
on result sets that are not grouped. In fact, each window function
supports its own, independent PARTITION BY clause, which is an
awesome tool for reporting.

Using window functions will:

  • Lead to more readable SQL (less dedicated GROUP BY clauses in
    subqueries)
  • Improve performance, as a RDBMS is likely to optimise window
    functions more easily

The Cure:

When you write a GROUP BY clause in a subquery, think again if
this cannot be done with a window function.

9. Using in-memory sorting for sort indirections

The SQL ORDER BY clause supports many types of expressions,
including CASE statements, which can be very useful for sort
indirections. You should probably never sort data in Java memory
because you think that

  • SQL sorting is too slow
  • SQL sorting cannot do it

The Cure:

If you sort any SQL data in memory, think again if you cannot
push sorting into your database. This goes along well with pushing
paging into the database.

10. Inserting lots of records one by one

JDBC knows batching, and you should use it. Do not INSERT
thousands of records one by one, re-creating a new
PreparedStatement every time. If all of your records go to the same
table, create a batch INSERT statement with a single SQL statement
and multiple bind value sets. Depending on your database and
database configuration, you may need to commit after a certain
amount of inserted records, in order to keep the UNDO log slim.

The Cure:

Always batch-insert large sets of data.

Some interesting books

Some very interesting books on similar topics are

Author
Lukas Eder
Lukas is a Java and SQL aficionado. He’s the founder and head of R&D at Data Geekery GmbH, the company behind jOOQ, the best way to write SQL in Java.
Comments
comments powered by Disqus