Scalding Hot?

Twitter open sources Scalding

Chris Mayer

The social networking giant makes its Scala for Cascading API available to all – jumping on the Hadoop bandwagon?

Scala developers were all aflutter over an announcement from Twitter’s development team. Why? Well, they’ve just open sourced Scalding – a Scala version of Cascading, enabling many to analyse reams and reams of data.

For those unaware, Cascading is a thin Java library that sits on top of Apache Hadoop’s MapReduce layer and aims to cut out the difficult, tedious parts of writing jobs. Scalding builds on it and has two major components:

  1. A DSL to make MapReduce computations look very similar to Scala’s collection API
  2. A wrapper for Cascading that makes it simpler to define the typical use cases of jobs and tests, and to describe data sources on a Hadoop Distributed File System (HDFS) or local disk

Patrick Boykin made the announcement on the development team’s blog. He also gave an example of Scalding at work – the Scalding query below shows how many times a URL has been tweeted in a day.

 

StatusSource()
  .flatMapTo('created_date, 'url) { s =>
    for (url <- urls(s.getText))
      yield (RichDate(s.getCreatedAt).toString(DATE_WITH_DASH), url)
  }
  .groupBy('created_date, 'url) {
    _.size('urlCnt) // Count the number of appearances of the URL
  }
  .write(Tsv(args("output")))
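
To give a feel for what "looks like Scala’s collection API" means, the same count can be written against ordinary in-memory Scala collections. This is only a sketch and not Scalding code: Status, extractUrls and UrlCounts are hypothetical stand-ins invented for this example (Twitter’s real status type and urls helper are not shown in the announcement), and nothing here touches Hadoop or Cascading.

```scala
// Plain-Scala analogue of the query above. No Hadoop, Cascading or Scalding involved.
// Status is a hypothetical stand-in for whatever record type StatusSource() yields.
final case class Status(createdDate: String, text: String)

object UrlCounts {
  // Hypothetical helper (an assumption, not Twitter's urls function):
  // treat every whitespace-separated token starting with "http" as a URL.
  def extractUrls(text: String): Seq[String] =
    text.split("\\s+").toSeq.filter(_.startsWith("http"))

  // Count how many times each (date, url) pair occurs.
  def countByDayAndUrl(statuses: Seq[Status]): Map[(String, String), Int] =
    statuses
      // One (date, url) pair per URL found, like flatMapTo in the query above
      .flatMap(s => extractUrls(s.text).map(url => (s.createdDate, url)))
      // Group identical pairs together, like groupBy('created_date, 'url)
      .groupBy(identity)
      // Size of each group, like _.size('urlCnt)
      .map { case (key, occurrences) => (key, occurrences.size) }
}
```

Calling UrlCounts.countByDayAndUrl on a Seq[Status] returns a map from (date, URL) pairs to counts. The shape of the pipeline (flatMap, then groupBy, then size) mirrors the Scalding query, which is the resemblance the DSL is aiming for; the difference is that Scalding plans the same steps as MapReduce jobs over data on HDFS or local disk.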

As the entire ecosystem gets excited about Hadoop’s potential to dominate Big Data, it is surely a good thing that this Scala adaptation of Cascading is now available to the masses, making it simpler for Scala enthusiasts to develop, regression-test, integration-test and deploy enterprise applications.

The supposed simplicity of Scalding was brought up by Boykin – a rare claim about Scala at the moment. He wrote:

“In comparison to languages such as Apache Pig that separate the query language from the user defined functionality, with Scalding everything is integrated into one language. In most cases, one file will describe your job.”

Twitter’s use of Scala has been well documented: the company switched from Ruby on Rails to a Java/Scala mix as it grew. Given the amount of data the social networking site has to deal with on a daily basis, you’d think it would be a good judge of what simplifies anything to do with MapReduce.

Why not try Scalding now? It’s available on GitHub and you can follow the project on Twitter via @scalding. Embrace the Big Data revolution now; everyone else is.
