If you work with Apache Spark, you have probably run into this: some DML operations, such as updates, upserts or deletes, are difficult or even impossible to perform when you connect to the database via JDBC. It is possible, however, to send a piece of custom SQL to the database over JDBC by means of the connection properties. Read on to find out how.

Thanks to Spark SQL's native JDBC support, users can access most databases through their JDBC drivers. There are, however, some limitations when it comes to loading data over JDBC. It is possible to rewrite an entire table using SaveMode.Overwrite, or simply insert rows into a table using SaveMode.Append, but other DML operations, such as updating a table directly, are not supported.
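As an illustration, here is a minimal sketch of the two supported write modes, assuming the Oracle JDBC driver is on the classpath; the URL, credentials and table name are placeholders:

```scala
import java.util.Properties

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("jdbc-write-modes").getOrCreate()

// Placeholder connection details: substitute your own host, service, user and password.
val url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"
val props = new Properties()
props.put("user", "my_user")
props.put("password", "my_password")

val df = spark.range(5).toDF("id") // small example DataFrame

// Appending rows to a table is supported...
df.write.mode(SaveMode.Append).jdbc(url, "my_table", props)

// ...and so is rewriting the whole table, but a targeted UPDATE of existing rows is not.
df.write.mode(SaveMode.Overwrite).jdbc(url, "my_table", props)
```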

However, it is possible to push custom SQL code over a Spark JDBC connection. To do so, you first need to set up a JDBC connection:
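A minimal sketch of such a connection, again with a placeholder Oracle URL, driver class, credentials and table name:

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-custom-sql").getOrCreate()

// Placeholder connection details: substitute your own database, user and password.
val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"

val connectionProperties = new Properties()
connectionProperties.put("user", "my_user")
connectionProperties.put("password", "my_password")
connectionProperties.put("driver", "oracle.jdbc.driver.OracleDriver")

// A plain JDBC read using these connection properties
val df = spark.read.jdbc(jdbcUrl, "my_table", connectionProperties)
```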

To be able to push custom SQL, add the sessionInitStatement key to the connection properties. This is a statement that fires at the beginning of a JDBC session, but at the moment only for JDBC *reads*. It can contain a custom SQL or T-SQL statement or a PL/SQL code block.
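Continuing the sketch above, the statement is simply a string added to the connection properties; the PL/SQL block below, which updates a hypothetical my_table, is just an example of what it could contain:

```scala
// Custom SQL to run when the JDBC session is initialised.
// This PL/SQL block is a made-up example; any valid statement or block works.
val customSql =
  """
    |BEGIN
    |  UPDATE my_table SET status = 'PROCESSED' WHERE status = 'NEW';
    |  COMMIT;
    |END;
  """.stripMargin

connectionProperties.put("sessionInitStatement", customSql)
```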

Because Spark evaluates lazily, we need to force it to execute the JDBC read, which we can do with, for example, a count action. We read the result of a pushdown query here, which returns the error code of the session. It is expected to always return 0, because if there is an error we get a Spark exception rather than the result of the query.
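A sketch of that forced read, assuming the connection set up above; the pushdown query is a placeholder that simply selects a constant 0 as the error code:

```scala
// Spark wraps the pushdown query in a subquery, so it needs an alias.
// It only returns a single row with value 0; the real work is done by the
// sessionInitStatement that fires when the session is opened.
val pushdownQuery = "(SELECT 0 AS error_code FROM dual) q"

val resultDf = spark.read.jdbc(jdbcUrl, pushdownQuery, connectionProperties)

// Nothing is sent to the database until an action runs;
// count() forces the read and with it the sessionInitStatement.
resultDf.count()
```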

Once the read has been forced, the sessionInitStatement containing the custom SQL is executed. This makes it possible to run updates, deletes, upserts or, more generally, more complex SQL code.
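For example, an upsert could be pushed down in the same way; the MERGE below against the hypothetical tables target_table and staging_table is just an illustration of such a block:

```scala
// A hypothetical upsert: merge rows from a staging table into a target table.
val upsertSql =
  """
    |BEGIN
    |  MERGE INTO target_table t
    |  USING staging_table s
    |  ON (t.id = s.id)
    |  WHEN MATCHED THEN UPDATE SET t.value = s.value
    |  WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);
    |  COMMIT;
    |END;
  """.stripMargin

connectionProperties.put("sessionInitStatement", upsertSql)

// Force the read again so the new init statement is executed.
spark.read.jdbc(jdbcUrl, "(SELECT 0 AS error_code FROM dual) q", connectionProperties).count()
```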