Glenn Engstrand

In a previous blog on some big data open source projects covered at OSCON 2014, I reviewed three distributed computing functional programming technologies by using them to write a simple aggregation report. In that blog, I mentioned reproducing the news feed performance map reduce job from a Clojure news feed service that I test ran on AWS in both Clojure/Cascalog and Scala/Spark. Let's dig in and see what there is to see.

At that time, I mentioned using Spark Streaming primarily because that particular technology has gotten a lot of traction in the big data space lately. After an architectural review, I realized that Spark Streaming is just Spark Map Reduce where the data stream is discretized into multiple Resilient Distributed Datasets using a time based window. That approach to streaming isn't really relevant to the sample problem that I am working with so I went back to using normal Spark Map Reduce.

So, what is this news feed performance map reduce job? The input to this job is a little over four million data points that include a time stamp, an operation being performed, an entity type being operated on, and the number of milliseconds that it took to perform that operation on that entity. Each row of the output represents performance data for a single minute. The metrics are broken down by entity and operation and include the number of requests, the median latency of the requests, and the 95th percentile. This report is then used by an ETL program to load an OLAP cube. I picked this because it is too complex to be handled by the default functionality or by simpler, less expressive map reduce technologies such as a single Hive query. We are going to have to roll up our sleeves and do some custom coding.

I originally implemented this as a traditional map reduce job written in Java. You most probably already know that Hadoop Map Reduce has a map phase and a reduce phase but did you know that there is also a combiner phase? Every job provides their own mapper and reducer but almost no jobs provide a combiner because the default combiner already does just what you need.

There has been some push back from young engineers against the Java programming language recently. Why is that? Java is strongly typed and very verbose when it comes to coding. At its inception, Java was considered a high level language but these days people who code in Java usually have to think at a relatively low level of abstraction. That means more lines of code which means more to think about and more opportunities for bugs. At least that is the consensus wisdom.

After that I coded a behaviorally equivalent Cascalog map reduce job written in Clojure. Cascalog is built on top of Cascading which is an application development platform built on top of Hadoop. It features taps, a higher level of abstraction by which data is both input from and output to. A special Cascalog macro allows you to write reduce side aggregators in normal looking Clojure code.

Clojure looks like lisp on steroids but runs in the JVM. Clojure is weakly typed. That means you don't have to specify the types of identifiers so less coding. It also means that the compiler doesn't do any any type checking which means you run into those type of bugs at runtime. Compared to old school Java, Clojure coders get to think about their problems in terms of a higher level of abstraction. A little bit of Clojure code does a lot so it tends to look very dense.

	Java/Map Reduce	Clojure/Cascalog	Scala/Spark
source code size in bytes	4256	2809	1707
run time in seconds	49	77	52
what was awesome	old school is still the fastest	feels like a query language but with easy to code reduce side user defined functions	most natural way to write distributed applications using functional programming
what was disappointing	lots of typing	the reduce side aggregator would not work without a superfluous print statement	Scala Build Tool is immature

Finally, I coded a behaviorally equivalent Apache Spark job written in Scala. Apache Spark is an open source general engine for large scale data processing. You can run it in stand alone or clustered mode. You can run Spark jobs that can read from and/or write to HDFS data. You can configure Spark to integrate with Hadoop Map Reduce either through YARN or Mesos.

Initially, the nomenclature in Spark was a little misleading for me. In Spark, you are always operating on RDDs or Resiliant Distributed Datasets. An RDD has functions such as reduce and reduceByKey. These functions are more akin to the combiner in Hadoop Map Reduce than to the reducer. Use the groupByKey method followed by another map method in order to achieve reduce side functionality.

Apache Spark claims to be faster than Hadoop. It is faster than Hive or Pig. From my limited experience, you can achieve on par performance with Spark if you are willing to write old school Java Map Reduce programs.

Scala looks like Java on steroids and also runs in the JVM. Like Java, Scala is strongly typed. Unlike Java, it takes a lot less lines of code to do in Scala what you can do in Java. How is that? You don't have to specify the types of identifiers so less coding. The compiler is still figuring out the types of identifiers even if you don't identify that. This allows you to have compile time checks on the types of things without you having to specify them. Scala coding is also at a higher level of abstraction which nets fewer lines of code too.

So, what is the most natural fit between functional programming and distributed computing? I would have to say that it is Scala and Apache Spark. That RDD looks just like another lazy sequence with very similar map reduce semantics. Functional programming does seem to be well suited to processing streams of data. Integrating functional programming applications with open source big data application platforms, such as Cascading or Spark, that sit on top of Hadoop does seem to make logical sense to me. It must also make logical sense to the top Hadoop vendors as Cloudera supports both Cascading and Spark and Hortonworks supports both Cascading and Spark too.