Hadoop – Think Big
The data revolution has started. Or has it?
The price for storage has fallen in recent years – even one GB of SSD storage is expected to fall below a dollar by the end of 2012. Clusters with thousands of machines are no longer the standard of big corporations only. With technologies like Apache Hadoop and HBase, a “throw nothing away culture” has become common. Data scientists are happily sifting through all that raw data finding ever more valuable information and extracting business-relevant pieces that provide competitive advantages. Machine learning helps solve problems that even a few years ago seemed intractable. The data revolution has started. Or has it?
Reading publications on big data today gives the impression that development processes have to be reshaped: growing amounts of data have to be stored to remain competitive. Projects that deal with storing, analysing and visualising data are hyped all over the place. When taking a closer look at individual solutions in production today, the view gets much less coherent – often solutions are built from several different pieces at varying maturity. It is not unusual for developers to take longer than originally estimated to put an architecture to work.
This article tries to put data-driven development in context – comparing and contrasting it with what is already being done in practice. It shows how the big data technologies fit together and where yellow elephants and machines that adapt them fit in.
Let’s start with a hypothetical example of building a web shop. When brainstorming, requirements that comes to mind are things like a place to store products, user data, user transactions (those that are active e. g. “Mary bought new headphones, product ID is 9887876, they are in shipment” and also ideally past ones to be able to make better product recommendations for Mary later). Once the shop is in place, we may guess what changes and improvements users might find helpful. We might explicitly ask Mary for her suggestions. However, average users are very lazy: they only provide feedback very rarely.
What is easier is to watch users online: watching what products they look at, which ones they end up buying, investigating what pages have the highest rate of users leaving the site, what pages are typical entry pages. When going down that road, development turns into a cycle of four steps:
The four steps of development:
1. Observe how users interact with the application – that will lead to finding multiple deficiencies with varying benefits when fixed: Product search might not be ideal, in that users often search for headphones in a particular colour, but colour might not yet be indexed. Users might be searching for the “Buy” button as it is not at the very top of the page.
2. Orient by defining in which direction the application should be improved – decide which deficiency to fix in your next iteration and define what criteria to measure how successful your fix was: Make the colour of headphones navigable and expect to double headphone sales within a month.
3. Decide what to implement – Implementation details may differ, in our toy example options may be to make the colour of headphones part of the index so users can search for them, or include them in a faceting user interface.
4. Act – implement the fix.
In the end, the cycle begins again, by observing how users react to your fix. That last step is what feedback can be made on the current and future implementation. The tighter that cycle can be made, the faster feedback can flow back into development, the more likely a project is to outpace its competitors (Figure 1).
Fig. 1: The four steps of development
This development cycle is not surprising, nor should it be particularly new. Instead, it is what has been done implicitly for any successful implementation. However, this explicit form shows that there are steps that can greatly benefit from gathering well-defined additional data sets or using already existing ones.
Observation is greatly supported by keeping track of any user behaviour – both common and unusual, both successful and unsuccessful. Each time a user successfully interacts with the web shop, it shows what functions work particularly well. Unsuccessful interaction reveals deficiencies and room for improvement. A metric is defined by the success of a new feature which then needs to be evaluated in orientation. Making this step explicit and measurable gives clear numbers against which changes should be compared. It also makes the goal of a change explicit – and helps define if and when that goal is reached.
A second goal of collecting interaction data is to use that data to provide better services to individual users: Most likely the user, Mary, will not want to be searching for another pair of headphones she gave a bad review for. So instead she should get different search results. Also she might be interested in a very specific type of music and may be very happy when presented with compilations that she likes.
In the end both types of using interaction data result in the flowchart displayed in Figure 2.
Fig. 2: Two types of interaction data
Is either way of thinking about data gathering new? Not really. Interaction data has been gathered since the early days of the internet. Users tend to do all sorts of interesting stuff within the infrastructure they are given. As a result, service providers started very early with logging user interactions – be it only to diagnose what caused a problem after the effect.
More common types of requirements engineering on that level include machine sizing based on past traffic patterns, hardening a setup against seen intrusion attempts. Typical features based on user data include showing only new content to readers of an online magazine or showing different content based on the origin of the user’s request.
The data sources used usually include web server access logs, health check results and response time logs. All of these come in different but usually text-based formats that are treated with tools like sed, awk or python scripts. Results are then presented in custom dashboards, gnuplot graphs or even semi-standard tooling for log analysis – AWStats being one prominent and well-known example of web server logs.
Let’s go one step further and look at application-specific data: These include customer databases, transaction logs and the like. What can be learnt from this?
Tools used here are standard relational databases. For analysis there is a standard query language, with some custom extensions depending on the database system provider. For visualisation, developers can come up with custom reports or use tools like tableau.
Now what if log sizes outgrow the storage or compute capacity of one single machine? What if analysis of a customer database takes too long on a single machine or is too expensive to speed up by scaling this one machine up? There is a very simple answer: Take Apache Hadoop.
The not-so-simple, but probably more correct answer is: You won’t get around a distributed system for data analysis. When opting for commodity hardware, your best bet is to use Apache Hadoop as the basis of your system. What you need on top depends on your specific case.
Hadoop itself comes with two components: HDFS and MapReduce. HDFS provides a distributed filesystem on top of your GNU/Linux filesystem (Windows is not officially supported yet). It is built on the assumption that the system runs on commodity hardware, so failure does happen. To cope, it comes with auto failover and built-in replication. It also assumes that you are operating hardware; disk scan is cheaper than disk seek – this assumption holds true even for SSDs. The third assumption is that your system will process huge amounts of data – as moving data is more expensive than moving computation, Hadoop will try to run processing as close to the data as possible, ideally on the same machine the data is stored on. MapReduce then provides the programming library to make efficient use of HDFS’ features.
As HDFS only provides raw file storage, there are multiple libraries that provide structured storage of data – in a compressed binary format that provides upgradable data structures. Examples are Avro, Thrift and Protocol Buffers.
HDFS itself is well-suited for background processing, but less suitable for online use. If you have so many users that standard databases can no longer cope with the traffic and you’d like your system to integrate well with Hadoop (in particular with MapReduce jobs) Apache HBase (as well as Hypertable) are the systems to look at for storage. Both provide good online access performance.
When it comes to analysis, developers will need to write MapReduce jobs. There is a Java (including a separate library that makes unit testing easier) as well as a C++ API available. However, Hadoop also comes with a streaming interface that makes it possible to use STDIN and STDOUT for processing thus enabling e. g. any scripting language.
However, sometimes developers will not need to touch either low level programming language. Instead, query languages such as Pig, Latin or Hive’s language, which is very close to SQL, are sufficient for most tasks. Where their syntax is not enough, developers have the option to extend the language with their own operators.
Cascading tries to provide a middle ground between low-level Java MapReduce and high-level Pig. Apache Giraph focuses on processing graph data including operators in particular important when analysing graphs.
When writing jobs, the need to chain mutiple MapReduce phases quickly arises. One might want to mix and match phases writing in Java and those written in Pig. Trigger processing based on a timer or based on data availability. Those tasks can be solved by Oozie – a workflow manager for Hadoop.
What if instead of providing simple statistics the goal is to automatically group users according to the types of products they buy? What if the goal is to classify users by how likely they are to prefer a certain type of product? What if the goal is to predict what queries a user is most likely to type next? Most of the more sophisticated problems can be phrased in a way that makes them simple to solve with automated methods. Apache Mahout helps with problems that are of large scale and comes with Hadoop bindings where necessary.
In terms of visualization, again, what is left to do is to create custom reports and then come up with custom dashboards. Also, it is possible to export data to your preferred file format to then visualise results in common tools like gephi for graph visualisation or import data into regular relational databases and use their tools for visualisation.
It’s never that simple in the real world
What we have looked at so far is a great, coherent infrastructure entirely based on Apache Hadoop. However, infrastructure in the real world never is all that simple. There will be relational databases left running that are just too expensive to migrate. If not integrated well with a Hadoop setup, those will remain isolated data silos that no one can benefit from except the original developers. MapReduce jobs running directly against the database would bring down that system very quickly. Hadoop is very efficient for generating DOS attacks against existing systems. It is common practice to import database content into HDFS on a regular schedule. With Sqoop there is a tool that has been proven to work reliably.
Now all that is left is a working solution to get log files from distributed systems back into HDFS. With Flume, Scribe, and Chukwa, there are three systems available to choose from that help with distributed logging and provide Hadoop bindings.
Taking one step back, the result already looks quite clean: data storage, analysis and visualisation are available, existing systems are well integrated. However, now we have the software to scale the system up to tens, hundreds, even thousands of machines. Those numbers are no longer manually manageable by any operations team. Instead, automated configuration management and the option for controlled roll-back becomes ever more important. Systems like Chef and Puppet for systems, as well as Zookeeper for configuration management suddenly become crucial for scaling (Figure 3).
Fig. 3: A complex but coherent jigsaw
The final picture no longer looks all that simple. In addition to the many different pieces and best practices, there are various steps that involve making a conscious decision for one system and against another, if one is to avoid mixing and matching all sorts of systems that all solve very common problems but differ in important details.
What is to be gained by such a setup? The most important advantage is integration of data storage. If data is no longer separated in data silos, all sorts of applications become possible and less expensive to build as data no longer has to be copied from team to team but is available and processable to all. Also, business reporting becomes simple by making it possible to integrate all necessary data sources in a consistent way.
To be processable however means that the data format is known, well-documented or at least self-describing. Suddenly logging no longer is optional and is needed for debugging and post-mortem analysis in case of issues. All of a sudden, log files are an asset in themselves – if they contain the right information, enough data and valid entries.
In general, it has been observed that use cases for Hadoop are identified easily after its introduction. However, one should not take introduction of such a system all too lightly – the learning curve for developers is still quite steep.
Systems are still to be considered at a DevOps level and that means both development and operations have to work tightly together to make things work. Integration is not yet fully optimised and as well done as with other more common systems. The most promising route is to identify an existing use case that would greatly benefit from using Hadoop. Introducing Apache Hadoop, and all that is needed around to support it, will provide enough experience to simplify and adapting more and more use cases over time. Having one driving project helps focus on real business needs and to figure out what the important features are. It also helps take decisions as to what projects to use to solve real technical problems.
Overall, Hadoop and its ecosystem provide unique features that allow scaling to large, distributed systems on commodity hardware at reasonable cost. The system has been used in production by both larger (Yahoo!, Facebook, Twitter) and smaller (Neofonie GmbH, nurago, nuxeo) companies to solve their analytics needs. Being an Apache project, it is possible to get involved and have issues fixed within reasonable time. Though release cycles have been growing in the past, the project itself has issued faster releases to get features and bug fixes out to uses at shorter intervals. In addition, the project itself unites developers from different backgrounds and brings together knowledge from teams with lots of experience with distributed processing around the world, which does not only help harden the implementation, but also provides a broad basis to draw from when developing new features.
Having released version 1.0 after several years, it shows its developers’ trust into the maturity of the code base. In upcoming releases, the architecture will become much more flexible separating resource management from computation libraries. This will enable developers to treat MapReduce as a simple programming library and evolve MapReduce APIs independently of cluster versions. 2012 will bring many improvements to the project. Make sure to be part of the process by joining the user and developer mailing lists. You can shape the process by putting in your knowledge and work by providing valuable patches that solve still open issues.