How to Union Non-Matching Data in Spark: A Simple Guide for the Less-Technical

Cody Rich
3 min read · Dec 2, 2020

INTRO

If you’re reading this article then you likely already know the UNION operator is used to combine the result sets of two or more datasets. You also likely know that in order to use the union() method, the dataframes must have the same structure/schema.

This can be a problem if your data is very similar in schema but frequently has slight changes. Such is the case if you are looking at the daily visitor data coming from Google Analytics.

One field missing… one freakin’ field. And every day is like that: always one to three fields missing between the daily files. FML

So in order to easily aggregate the data to a monthly level (without using Google’s very expensive GA360), you need to union the daily datasets together, producing a JSON file whose schema is a super-set of all the varying daily schemas.

If you’re like me, this is usually when you google something along the lines of “spark union non-matching dataframes” and hope that a Stack Overflow answer solves everything. Perhaps that’s even how you ended up on this article! And while there are articles that show various approaches, the simplest way I found was to use Delta Lake. Delta has a lot of really cool features that make life easy, but its schema enforcement and its ability to automatically merge schema changes make it incredibly useful for the task at hand.

Note: If you’re not familiar with Delta Lake, this series is a great intro.

THE HOW-TO

As I mentioned, we’ll be using Delta Lake, so the first thing we need to do is get Delta Lake up and running. Luckily it’s fully integrated with Spark, and its quick-start documentation makes it incredibly easy to set up. Below I’ll show the steps to do this in the spark-shell.
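For example, assuming a Spark 3.x install, the shell can be launched with the Delta package pulled in via --packages. The version coordinate below is an assumption; the Delta Lake quick-start lists the one that matches your Spark and Scala build.

```bash
# Launch spark-shell with the Delta Lake package on the classpath
# (io.delta:delta-core_2.12:0.7.0 assumes Spark 3.0 / Scala 2.12 -- adjust to your build)
bin/spark-shell --packages io.delta:delta-core_2.12:0.7.0
```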

Now that we have a Spark instance with Delta, we can read in all of our files just as we normally would.
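As a rough sketch, with hypothetical paths for the daily Google Analytics exports, that might look like:

```scala
// Read each day's export as its own DataFrame; the schemas may differ slightly day to day.
// The file paths here are illustrative assumptions.
val day1 = spark.read.json("/data/ga/2020-11-01.json")
val day2 = spark.read.json("/data/ga/2020-11-02.json")
val day3 = spark.read.json("/data/ga/2020-11-03.json")
```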

Here’s where Delta makes the union/schema-merging easy. All we need to do is write all of our files to the same delta table, appending them and merging schema as we go.

So for our first dataframe we’ll simply write the data out to a new delta table.
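Something along these lines, using the same assumed names (the table location /delta/ga_monthly is just an illustration):

```scala
// Write the first day's data out as a brand-new Delta table
day1.write.format("delta").save("/delta/ga_monthly")
```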

And the remaining dataframes we will append and merge the schema as we go.
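The key pieces are append mode plus the mergeSchema option, sketched here with the same hypothetical paths:

```scala
// Append the remaining days, letting Delta merge any new columns into the table's schema
day2.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/delta/ga_monthly")

day3.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/delta/ga_monthly")
```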

You can keep appending and merging this way, as I did, until you’ve built up a monthly aggregate of all the daily files.

Once we’ve appended all of the data, we have a single Delta table containing all of our data under a schema that is the super-set of all the combined schemas. We can then read that table back in and write it out to our desired file type, which in this case is JSON.
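A minimal sketch of that last step, again with assumed paths:

```scala
// Read the combined Delta table back and write it out as JSON
val monthly = spark.read.format("delta").load("/delta/ga_monthly")
monthly.write.mode("overwrite").json("/data/ga/monthly_json")
```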

Voila! We have now successfully unioned all of the varying-schema data together with very little technical effort. If this is a recurring issue, you can easily take this concept and set it up as an automated pipeline as well.

IN CONCLUSION

Are there better, more efficient ways to do this task? Sure, but if you are looking for the easiest way to do it programmatically, Delta Lake makes it super simple.

