How Data Analysis changes Software Development
Hadoop - Think Big
The price of storage has fallen in recent years – even one GB of SSD storage is expected to fall below a dollar by the end of 2012. Clusters with thousands of machines are no longer the preserve of big corporations. With technologies like Apache Hadoop and HBase, a “throw nothing away” culture has become common. Data scientists are happily sifting through all that raw data, finding ever more valuable information and extracting business-relevant pieces that provide competitive advantages. Machine learning helps solve problems that even a few years ago seemed intractable. The data revolution has started. Or has it?
Reading publications on big data today gives the impression that development processes have to be reshaped: growing amounts of data have to be stored to remain competitive. Projects that deal with storing, analysing and visualising data are hyped all over the place. On closer inspection of individual solutions in production today, the picture is much less coherent – solutions are often built from several different pieces of varying maturity. It is not unusual for developers to take longer than originally estimated to put an architecture to work.
This article tries to put data-driven development in context – comparing and contrasting it with what is already being done in practice. It shows how the big data technologies fit together and where yellow elephants and machines that adapt fit in.
Let’s start with a hypothetical example of building a web shop. When brainstorming, the requirements that come to mind are things like a place to store products, user data and user transactions: both active ones (e.g. “Mary bought new headphones, product ID is 9887876, they are in shipment”) and, ideally, past ones, to be able to make better product recommendations for Mary later. Once the shop is in place, we may guess what changes and improvements users might find helpful. We might explicitly ask Mary for her suggestions. However, average users are lazy: they provide feedback only rarely.
What is easier is to watch users online: watching what products they look at, which ones they end up buying, investigating what pages have the highest rate of users leaving the site, what pages are typical entry pages. When going down that road, development turns into a cycle of four steps:
The four steps of development:
1. Observe how users interact with the application – that will lead to finding multiple deficiencies with varying benefits when fixed: Product search might not be ideal, in that users often search for headphones in a particular colour, but colour might not yet be indexed. Users might be searching for the “Buy” button as it is not at the very top of the page.
2. Orient by defining in which direction the application should be improved – decide which deficiency to fix in your next iteration and define the criteria by which to measure how successful your fix was: Make the colour of headphones navigable and expect to double headphone sales within a month.
3. Decide what to implement – Implementation details may differ. In our toy example, the options may be to make the colour of headphones part of the index so users can search for it, or to include it in a faceting user interface.
4. Act – implement the fix.
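The success criterion from step 2 can be written down as a small check. The following is only a minimal sketch: the `Goal` class, its field names and the sales figures are invented for illustration and do not come from a real shop.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """A measurable target fixed during the Orient step (hypothetical structure)."""
    metric: str
    baseline: float       # value observed before the fix
    target_factor: float  # e.g. 2.0 for "double"

    def is_reached(self, observed: float) -> bool:
        # The goal counts as reached once the observed value is at least
        # baseline * target_factor.
        return observed >= self.baseline * self.target_factor


# Made-up numbers: 120 headphone sales in the month before the fix.
goal = Goal(metric="headphone_sales_per_month", baseline=120, target_factor=2.0)

print(goal.is_reached(180))  # False: 180 is below the target of 240
print(goal.is_reached(250))  # True: 250 is at least 240
```

Writing the target down before acting makes it hard to reinterpret the result after the fix has shipped.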
In the end, the cycle begins again by observing how users react to your fix. That last step is where feedback on the current implementation is gathered, and it shapes future ones. The tighter that cycle can be made, the faster feedback can flow back into development and the more likely a project is to outpace its competitors (Figure 1).
Fig. 1: The four steps of development
This development cycle is not surprising, nor should it be particularly new. Instead, it is what has been done implicitly for any successful implementation. However, this explicit form shows that there are steps that can greatly benefit from gathering well-defined additional data sets or using already existing ones.
Observation is greatly supported by keeping track of any user behaviour – both common and unusual, both successful and unsuccessful. Each time a user successfully interacts with the web shop, it shows what functions work particularly well. Unsuccessful interaction reveals deficiencies and room for improvement. In orientation, a metric is defined against which the success of a new feature then needs to be evaluated. Making this step explicit and measurable gives clear numbers against which changes should be compared. It also makes the goal of a change explicit – and helps define if and when that goal is reached.
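Two of the observations mentioned earlier, typical entry pages and pages with a high rate of users leaving the site, can be derived from recorded sessions. This is a minimal sketch under the assumption that each session is available as an ordered list of viewed pages; the page names and the two function names are invented for illustration.

```python
from collections import Counter

# Hypothetical observation data: each session is the ordered list of pages
# one visitor viewed.
sessions = [
    ["/home", "/search", "/product/9887876", "/checkout"],
    ["/search", "/product/9887876"],
    ["/home", "/search"],
    ["/product/9887876", "/checkout"],
]


def entry_pages(sessions):
    """Count how often each page is the first page of a session."""
    return Counter(s[0] for s in sessions if s)


def exit_rates(sessions):
    """For each page, the share of its views that were the last of a session."""
    views = Counter(page for s in sessions for page in s)
    exits = Counter(s[-1] for s in sessions if s)
    return {page: exits[page] / views[page] for page in views}


print(entry_pages(sessions).most_common(1))  # [('/home', 2)]
print(exit_rates(sessions)["/checkout"])     # 1.0
```

In this toy data every visit to the checkout page ends the session, which is expected; a high exit rate on a search page would be the more interesting finding.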
A second goal of collecting interaction data is to use that data to provide better services to individual users: Most likely the user, Mary, will not want her search to turn up a pair of headphones again that she gave a bad review. So instead she should get different search results. Also she might be interested in a very specific type of music and may be very happy when presented with compilations that she likes.
In the end both ways of using interaction data result in the flowchart displayed in Figure 2.
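The headphone example can be sketched as a filter over ranked search results. This is an illustration only: the `personalise` function, the 1 to 5 rating scale, the threshold and all product IDs except 9887876 are assumptions, not part of any real shop system.

```python
def personalise(results, reviews, bad_rating=2):
    """Drop products this user rated at or below bad_rating.

    results: product IDs in ranked order
    reviews: dict mapping product ID -> this user's rating (assumed 1 to 5)
    """
    return [
        pid for pid in results
        if pid not in reviews or reviews[pid] > bad_rating
    ]


# Hypothetical data: Mary rated product 9887876 with one star.
mary_reviews = {9887876: 1, 5550001: 5}
search_results = [9887876, 1234567, 5550001]

print(personalise(search_results, mary_reviews))  # [1234567, 5550001]
```

Products Mary has never reviewed stay in the list, and the original ranking order is preserved.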
Fig. 2: Two types of interaction data
Is either way of thinking about data gathering new? Not really. Interaction data has been gathered since the early days of the internet. Users tend to do all sorts of interesting stuff within the infrastructure they are given. As a result, service providers started logging user interactions very early on, if only to diagnose what caused a problem after the fact.
More common types of requirements engineering on that level include machine sizing based on past traffic patterns and hardening a setup against observed intrusion attempts. Typical features based on user data include showing only new content to readers of an online magazine or showing different content based on the origin of the user’s request.
The data sources used usually include web server access logs, health check results and response time logs. All of these come in different but usually text-based formats that are processed with tools like sed, awk or Python scripts. Results are then presented in custom dashboards, gnuplot graphs or even semi-standard tooling for log analysis – AWStats being one prominent and well-known example for web server logs.
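A Python script of the kind mentioned above might look like the following sketch. It assumes the access log uses the Common Log Format; the regular expression, the `summarise` function and the sample lines are illustrative and would need adapting to a real server's log configuration.

```python
import re
from collections import Counter

# Assumed input: Common Log Format, e.g.
# 192.0.2.1 - - [10/Oct/2012:13:55:36 +0200] "GET /search HTTP/1.1" 200 2326
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarise(lines):
    """Count requests per path and per HTTP status code."""
    paths, statuses = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        paths[match.group("path")] += 1
        statuses[match.group("status")] += 1
    return paths, statuses


# Made-up sample lines for illustration.
sample = [
    '192.0.2.1 - - [10/Oct/2012:13:55:36 +0200] "GET /search?q=headphones HTTP/1.1" 200 2326',
    '192.0.2.2 - - [10/Oct/2012:13:55:40 +0200] "GET /product/9887876 HTTP/1.1" 200 5120',
    '192.0.2.1 - - [10/Oct/2012:13:56:02 +0200] "GET /product/9887876 HTTP/1.1" 200 5120',
    '192.0.2.3 - - [10/Oct/2012:13:57:11 +0200] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]

paths, statuses = summarise(sample)
print(paths.most_common(1))  # [('/product/9887876', 2)]
print(statuses["404"])       # 1
```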
Let’s go one step further and look at application-specific data: These include customer databases, transaction logs and the like. What can be learnt from this?
Tools used here are standard relational databases. For analysis there is SQL as the standard query language, with some custom extensions depending on the database system provider. For visualisation, developers can come up with custom reports or use tools like Tableau.
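As a small illustration of this kind of analysis, the sketch below loads a few transactions into an in-memory SQLite database and counts purchases per product. The table layout, column names and rows are assumptions made up for this example; a real shop schema will differ.

```python
import sqlite3

# In-memory database with a hypothetical transactions table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transactions (
        customer   TEXT,
        product_id INTEGER,
        status     TEXT
    );
    INSERT INTO transactions VALUES
        ('Mary', 9887876, 'in shipment'),
        ('Mary', 1234567, 'delivered'),
        ('John', 9887876, 'delivered');
    """
)

# Which products are bought most often?
rows = conn.execute(
    """
    SELECT product_id, COUNT(*) AS purchases
    FROM transactions
    GROUP BY product_id
    ORDER BY purchases DESC, product_id
    """
).fetchall()

print(rows)  # [(9887876, 2), (1234567, 1)]
```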
Now what if log sizes outgrow the storage or compute capacity of one single machine? What if analysis of a customer database takes too long on a single machine or is too expensive to speed up by scaling this one machine up? There is a very simple answer: Take Apache Hadoop.
The not-so-simple, but probably more correct answer is: You won’t get around a distributed system for data analysis. When opting for commodity hardware, your best bet is to use Apache Hadoop as the basis of your system. What you need on top depends on your specific case.
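The basic idea behind distributing such an analysis can be sketched in plain Python: split the data into chunks, compute a partial result per chunk, then merge the partial results. This is not Hadoop's API. The function names and data are invented, and here the chunks are processed one after another on a single machine, whereas a distributed system would hand each chunk to a different one.

```python
from collections import Counter
from functools import reduce


def count_chunk(pages):
    """Partial result: count page views within one chunk of the data."""
    return Counter(pages)


def merge(left, right):
    """Combine two partial counts into one."""
    return left + right


# Hypothetical page views, already split into two chunks.
chunks = [
    ["/search", "/product/9887876"],
    ["/product/9887876", "/checkout"],
]

partials = [count_chunk(chunk) for chunk in chunks]
total = reduce(merge, partials, Counter())

print(total["/product/9887876"])  # 2
```

Because merging counts is associative, the partial results can be combined in any order, which is what makes this kind of analysis easy to spread across machines.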