Scalding Hot?

Twitter open sources Scalding

Chris Mayer

The social networking giant makes its Scala for Cascading API available to all – jumping on the Hadoop bandwagon?

Scala developers were all aflutter over an announcement from
Twitter’s development team. Why? Well, they’ve just open sourced
Scalding – a Scala version of Cascading, enabling many to analyse
reams and reams of data.

For those unaware, Cascading is a thin Java library that sits
on top of Apache Hadoop’s MapReduce layer and aims to cut out all
the difficult, tiring jobs. Scalding has two major components:

  1. A DSL that makes MapReduce computations look very similar to
    Scala’s collection API
  2. A wrapper for Cascading that makes it simpler to define the
    typical use cases of jobs and tests, and to describe data
    sources on a Hadoop Distributed File System (HDFS) or local disk

Patrick Boykin made the announcement on the development team’s
blog. He also gave an example of Scalding at work – the Scalding
query below shows how many times a URL has been tweeted in a day.


  .flatMapTo('created_date, 'url) { s =>
    for (url <- urls(s.getText))
      yield (RichDate(s.getCreatedAt).toString(DATE_WITH_DASH), url)
  }
  .groupBy('created_date, 'url) {
    _.size('urlCnt) // Count the number of appearances of the URL
  }
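To see why the query above is said to resemble Scala’s collection API, here is a rough analogue written against plain in-memory Scala collections, with no Hadoop or Scalding involved. It is only a sketch: Tweet, extractUrls and urlCounts are hypothetical names invented for this illustration, and the URL extractor is deliberately naive.

```scala
// Plain Scala collections analogue of the URL-count query (no Hadoop, no Scalding).
// Tweet, extractUrls and urlCounts are hypothetical names used only for illustration.
final case class Tweet(createdDate: String, text: String)

object UrlCountSketch {
  // Naive extractor: any whitespace-separated token that starts with http:// or https://
  def extractUrls(text: String): Seq[String] =
    text.split("\\s+").toSeq.filter(t => t.startsWith("http://") || t.startsWith("https://"))

  // flatMap each tweet to (date, url) pairs, then group identical pairs and count them
  def urlCounts(tweets: Seq[Tweet]): Map[(String, String), Int] =
    tweets
      .flatMap(t => for (url <- extractUrls(t.text)) yield (t.createdDate, url))
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

The shape is the same as in the Scalding query: a flatMap that emits one (date, URL) pair per link, followed by a group-and-count. The difference is that this version runs on a list in memory, whereas Scalding hands the equivalent steps to Cascading to run on Hadoop.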

As the entire ecosystem gets excited about Hadoop’s potential to
dominate Big Data, it is surely a good thing that this Scala
adaptation of Cascading is now available to the masses, making
the development, regression and integration testing, and
deployment of enterprise applications simpler for Scala developers.
The supposed simplicity of Scalding was brought up by Boykin – a
rare claim about Scala at the moment. He wrote:

In comparison to languages such as Apache
 that separate the query language from the user
defined functionality, with Scalding everything is integrated into
one language. In most cases, one file will describe your

Twitter’s use of Scala has been well documented since
they switched from Ruby on Rails to a Java/Scala mix as they grew.
Given the amount of data the social networking site has to deal
with on a daily basis, you’d think they’d be a good judge of
what simplifies anything to do with MapReduce.

Why not try Scalding now? It’s available on GitHub, and
you can follow the project on Twitter. Join
the Big Data revolution now; everyone else is.
