Apache Spark provides some powerful functionality, abstracting away complex operations of distributed data processing as simple api calls. When working with these transformations and actions, it can be difficult to see exactly what each one does and if it meets your specific use case. Hence moving from smaller example applications to larger ones that contain multiple chained operations combining multiple datasets can leave the developer feeling lost.

We started developing Spark applications with a simplistic approach of writing the full app and testing it afterwards using the unit test approach described in the documentation. But tests inevitably fail and there then came long debugging sessions where count() and collect() statements were inserted to help determine what was was going wrong and how to fix. This grew old fast. Luckily, Test Driven Development (TDD) came to the rescue!

Value of TDD

The benefits TDD brings are well documented. For me the most important point of TDD is how it influences the way you tackle a problem, forcing you to build from the ground up. You get a small piece of the application working and then move onto the next piece with confidence that only well tested code can bring. Spark applications are a good candidate for this approach.

Spark Word Count with TDD example

Take the word count example (written in Scala) from the official Spark examples page:

val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)

I will refactor this here to show how you how you might write it when using TDD. There are 2 transformations (flatMap and map) and an action (reduceByKey). How do we ensure that each one of these does exactly as we want?
The first flatmap takes the file and splits it into words. We extract this into its own method that takes an RDD of lines and transforms it into an RDD of words

def splitLines(wordsByLine: RDD[String]): RDD[String] = {
  wordsByLine.flatMap(line => line.split(" "))

and most importantly write the associated unit test!

test("splitLines should split multi-word lines into words"){
  val fileLines = Array("Line One", "Line Two", "Line Three", "Line Four")
  val inputRDD: RDD[String] = sparkContext.parallelize[String](fileLines)
  val wordsRDD = WordCountExample.splitFile(inputRDD)
  wordsRDD.count() should be (8)

Using sparkContext.parallelize you can easily generate the input RDD. We follow the same process for counting the split words

def countWords(words: RDD[String]): RDD[(String, Int)] = {
  words.map(word => (word, 1)).reduceByKey(_ + _)
test("countWords should count the occurrences of each word"){
  val words = Array("word", "count", "example", "word")
  val inputRDD: RDD[String] = sparkContext.parallelize[String](words)
  val wordCounts = WordCountExample.countWords(inputRDD).collect
  wordCounts should contain ("word", 2)
  wordCounts should contain ("count", 1)
  wordCounts should contain ("example", 1)

As this is Scala we can now use some nice functional composition to combine these two functions (again we will be thorough and test this)

def countWordsInFile = splitLines _ andThen countWords _
test("countWordsInFile should count words") {
  val fileLines = Array("Line One", "Line Two", "Line Three", "Line Four")
  val inputRDD: RDD[String] = sparkContext.parallelize[String](fileLines)
  val results = WordCountExample.countWordsInFile(inputRDD).collect
  results should contain ("Line", 4)

So the original code above now becomes

val file = sc.textFile("sampleWords.txt")

As you can see we have built this code from the ground up, with each part fully tested and, I think, more readable.

Wrapping up

In reality we would not got to this low a level of isolating each task rather we would group at most 2 or 3 operations and have them in a single function (making sure the are all working towards a single logical goal!).
Approaching the writing of Spark applications in this way results in modular, well tested and self documented code which will hopefully make those debug sessions shorter and more productive!

The complete example code is available at spark-tdd-example on GitHub.