How Data Analysis changes Software Development
Hadoop - Think Big - Part 2
Hadoop itself comes with two components: HDFS and MapReduce. HDFS provides a distributed filesystem on top of your GNU/Linux filesystem (Windows is not officially supported yet). It is built on the assumption that the system runs on commodity hardware, so failures do happen; to cope, it comes with automatic failover and built-in replication. The second assumption concerns the hardware you are operating: a sequential disk scan is cheaper than a disk seek – an assumption that holds true even for SSDs. The third assumption is that your system will process huge amounts of data – as moving data is more expensive than moving computation, Hadoop will try to run processing as close to the data as possible, ideally on the same machine the data is stored on. MapReduce then provides the programming library to make efficient use of HDFS’ features.
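The MapReduce programming model itself can be sketched in a few lines of plain Python (the log records and their field layout below are made up for illustration): a map function turns every input record into key/value pairs, the framework shuffles them so that all values for one key meet, and a reduce function collapses each group.

```python
from collections import defaultdict

# Hypothetical input: one log record per line, as "url user-id" pairs.
records = [
    "/home alice", "/cart bob", "/home carol",
    "/home bob", "/cart alice",
]

def map_phase(lines):
    # Map step: emit a (key, value) pair per record – here (url, 1).
    for line in lines:
        url, _user = line.split()
        yield url, 1

def shuffle(pairs):
    # What Hadoop does between the two phases: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # Reduce step: collapse each key's values – here into a visit count.
    for key, values in grouped:
        yield key, sum(values)

visits = dict(reduce_phase(shuffle(map_phase(records))))
# visits == {"/home": 3, "/cart": 2}
```

In a real cluster the map and reduce calls run on many machines in parallel and the shuffle moves data over the network; the dataflow, however, is exactly this.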
As HDFS only provides raw file storage, there are multiple libraries that provide structured storage of data in a compressed binary format with data structures that can be upgraded over time. Examples are Avro, Thrift and Protocol Buffers.
HDFS itself is well suited for background processing, but less so for online use. If you have so many users that standard databases can no longer cope with the traffic, and you would like your system to integrate well with Hadoop (in particular with MapReduce jobs), Apache HBase and Hypertable are the systems to look at for storage. Both provide good online access performance.
When it comes to analysis, developers will need to write MapReduce jobs. A Java API (including a separate library that makes unit testing easier) as well as a C++ API is available. However, Hadoop also comes with a streaming interface that makes it possible to use STDIN and STDOUT for processing, thus enabling the use of, for example, any scripting language.
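As an illustration of the streaming interface, a word count can fit into a single Python script that serves as both mapper and reducer (the file name, invocation and job wiring here are illustrative, not a fixed Hadoop API):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map step of a streaming word count: emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop streaming hands the reducer the mapper's output sorted by key,
    # so consecutive lines with the same word can be summed via groupby.
    parsed = (line.split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Invoked by Hadoop streaming as "wordcount.py map" or "wordcount.py reduce":
# read records from STDIN, write results to STDOUT.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(line.rstrip("\n") for line in sys.stdin):
        print(out)
```

Such a script would be submitted with the streaming jar, roughly `hadoop jar hadoop-streaming.jar -mapper 'wordcount.py map' -reducer 'wordcount.py reduce' -input … -output …` – the exact jar name and options depend on your Hadoop version.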
However, sometimes developers will not need to touch either low-level programming language. Instead, query languages such as Pig Latin or Hive’s query language, which is very close to SQL, are sufficient for most tasks. Where their syntax is not enough, developers have the option to extend the languages with their own operators.
Cascading tries to provide a middle ground between low-level Java MapReduce and high-level Pig. Apache Giraph focuses on processing graph data and includes operators that are particularly important when analysing graphs.
When writing jobs, the need to chain multiple MapReduce phases quickly arises. One might want to mix and match phases written in Java with those written in Pig, or trigger processing based on a timer or on data availability. These tasks can be solved by Oozie – a workflow manager for Hadoop.
What if instead of providing simple statistics the goal is to automatically group users according to the types of products they buy? What if the goal is to classify users by how likely they are to prefer a certain type of product? What if the goal is to predict what queries a user is most likely to type next? Most of the more sophisticated problems can be phrased in a way that makes them simple to solve with automated methods. Apache Mahout helps with problems that are of large scale and comes with Hadoop bindings where necessary.
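To make the grouping task concrete, here is a naive single-machine k-means clustering sketch over made-up vectors of per-category purchase counts – Mahout implements this kind of algorithm (and many more) at scale on top of Hadoop; the data, parameters and function below are purely illustrative.

```python
import math
import random

def kmeans(points, k, rounds=10, seed=0):
    # Naive k-means: repeatedly assign each point (a user's per-category
    # purchase counts) to its nearest centroid, then recompute centroids.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two obviously distinct buying patterns end up in two separate clusters.
users = [[0, 0], [0, 1], [10, 10], [10, 11]]
groups = kmeans(users, 2)
```

The classification and prediction questions above follow the same pattern: phrase the business question as a standard machine-learning task, then let a library do the heavy lifting.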
In terms of visualisation, what remains is to create custom reports and custom dashboards. It is also possible to export data to your preferred file format and visualise the results in common tools such as Gephi for graph visualisation, or to import the data into regular relational databases and use their visualisation tools.
It’s never that simple in the real world
What we have looked at so far is a great, coherent infrastructure entirely based on Apache Hadoop. However, infrastructure in the real world never is all that simple. There will be relational databases left running that are just too expensive to migrate. If not integrated well with a Hadoop setup, those will remain isolated data silos that no one can benefit from except the original developers. MapReduce jobs running directly against the database would bring that system down very quickly – Hadoop is very efficient at generating what amounts to a denial-of-service attack against existing systems. It is therefore common practice to import database content into HDFS on a regular schedule; with Sqoop there is a tool that has proven to work reliably for this.
Now all that is left is a working solution to get log files from distributed systems back into HDFS. With Flume, Scribe, and Chukwa, there are three systems available to choose from that help with distributed logging and provide Hadoop bindings.
Taking one step back, the result already looks quite clean: data storage, analysis and visualisation are available, and existing systems are well integrated. However, the software now makes it possible to scale the system up to tens, hundreds, even thousands of machines. Those numbers are no longer manually manageable by any operations team. Instead, automated configuration management and the option for controlled roll-backs become ever more important. Systems like Chef and Puppet for machine configuration, as well as ZooKeeper for distributed configuration management, suddenly become crucial for scaling (Figure 3).
Fig. 3: A complex but coherent jigsaw
The final picture no longer looks all that simple. In addition to the many different pieces and best practices, there are various steps that involve making a conscious decision for one system and against another, if one is to avoid mixing and matching all sorts of systems that all solve very common problems but differ in important details.
What is to be gained by such a setup? The most important advantage is the integration of data storage. If data is no longer separated into data silos, all sorts of applications become possible and less expensive to build, as data no longer has to be copied from team to team but is available to, and processable by, all. Business reporting also becomes simpler, as all necessary data sources can be integrated in a consistent way.
To be processable, however, means that the data format is known, well documented or at least self-describing. Suddenly logging is no longer optional, needed merely for debugging and post-mortem analysis in case of issues. All of a sudden, log files are an asset in themselves – provided they contain the right information, enough data and valid entries.
In general, it has been observed that use cases for Hadoop are identified easily after its introduction. However, one should not take introduction of such a system all too lightly – the learning curve for developers is still quite steep.
Such systems are still to be run at a DevOps level, which means development and operations have to work closely together to make things work; integration is not yet as polished as with other, more common systems. The most promising route is to identify an existing use case that would greatly benefit from using Hadoop. Introducing Apache Hadoop, and everything needed around it to support it, will provide enough experience to take on and adapt more and more use cases over time. Having one driving project helps focus on real business needs and figure out which features are important. It also helps in deciding which projects to use to solve real technical problems.
Overall, Hadoop and its ecosystem provide unique features that allow scaling to large, distributed systems on commodity hardware at reasonable cost. The system has been used in production by both larger (Yahoo!, Facebook, Twitter) and smaller (Neofonie GmbH, nurago, nuxeo) companies to solve their analytics needs. Being an Apache project, it is possible to get involved and have issues fixed within reasonable time. Though release cycles have grown long in the past, the project has lately issued releases at shorter intervals to get features and bug fixes out to users faster. In addition, the project unites developers from different backgrounds and brings together knowledge from teams around the world with lots of experience in distributed processing, which not only helps harden the implementation but also provides a broad basis to draw from when developing new features.
The release of version 1.0 after several years of development shows the developers’ trust in the maturity of the code base. In upcoming releases the architecture will become much more flexible, separating resource management from the computation libraries. This will enable developers to treat MapReduce as a simple programming library and to evolve the MapReduce APIs independently of cluster versions. 2012 will bring many improvements to the project. Make sure to be part of the process by joining the user and developer mailing lists. You can shape the process by contributing your knowledge and work in the form of valuable patches that solve still-open issues.