What is Apache Hadoop?
A look at the pros and cons of the big data processing framework that took the industry by storm.
Apache Hadoop is often thought of as the original Big Data store. Hadoop and its offshoots is a big subject, worthy of many articles in itself. Here I will cover some of the fundamentals and how it fits into the overall landscape of DBMS engines.
According to Wikipedia, Hadoop is defined as a High-availability distributed object-oriented platform. That is a short, but very comprehensive description. The Hadoop framework and components are based on early papers from Google on distributed computing, namely Google MapReduce and the Google File System (GFS).
Hadoop Distributed File System
At its foundation, Hadoop relies on the Hadoop Distributed File System (HDFS). This is a highly available file system, that automatically replicates and distributes file data across potentially a vast number of servers in a Hadoop cluster. HDFS manages high-availability, such that if a node or nodes fail in the cluster, your data is reliable and processing can continue.
In addition HDFS provides huge bandwidth capabilities for adding data, allowing for very high data input rates. As an example, many organizations use Hadoop to store application server log (e.g., Apache Server logs) – log files that can be generated at incredibly high rates with a large infrastructure of Web servers.
Of course that data is only valuable if you can access it and provide meaning to the data, through analysis, and of course my favorite subject – data relationships.
The primary mechanism offered by Hadoop for distributed processing is the MapReduce paradigm. This allows processing to be divided into fragments, split across 10s or 100s or even 1000s of nodes, to analyze huge data sets.
Here is the Wikipedia definition for MapReduce: “A MapReduce program comprises a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).”
As the name infers, MapReduce uses two primary steps:
- Map: A procedure to find and/or sort the data needed, and put it into a queue for processing.
- Reduce: The procedure that does the actual analysis on the data from the Map step, such as aggregating records or rows.
If you look at it, this is useful for a wide variety of tasks, often very similar to those you would perform with a traditional RDBMS – but on a much larger scale. For example, if you had a Big Data set of Apache Server log files, and wanted to find out the top 10 pages with more than 1000 views visited across all of your web servers, you could do the following in an RDMBS query (consider this pseudo code for sure):
SELECT page_name, COUNT(*) as page_count FROM apache_log HAVING page_count > 1000 GROUP BY page_name ORDER BY page_count LIMIT 10
Using Hadoop the Map step would produce a list of pages for the process to be counted; the Reduce step would do the sorting, producing a list of page_count values in this example. A second Map could then be performed to sort the results from the first Reduce step bypage_count > 1000, and a second Reduce step to filter out only the top 10 rows.
If you look at the above process, its not that much different from how a RDBMS engine works internally, reading and summarizing rows and utilizing temporary table for interim results.
In fact there is a related project called Apache Hive to perform such queries on a Hadoop infrastructure using a SQL-like language called HiveQL.
The advantage of course with MapReduce is the scale at which you can perform such tasks, far outstripping what any traditional RDMBS engine can perform. It’s important to keep in mind that the objective of such processing and the low-level functionality are very similar.
With the scale and power of Hadoop and MapReduce, why wouldn’t you use it for all of your Big Data needs? In fact some organizations have done just that, but there are some limitations:
- You need a large infrastructure to utilize Hadoop effectively.
- While Hadoop is very good at scaling to large workloads, it is primarily designed as a huge “batch processing” engine. Scheduling and running all of those MapReduce jobs takes time, with results typically available in seconds to hours. This can be fine for analytical queries, but if you need a more real-time system it may not fit the bill.
There are many ongoing efforts and extensions to Hadoop to avoid these disadvantages, and over time the Hadoop environment will continue to improve.
Wrapping it up
This was a very brief tour of Apache Hadoop, a key DBMS engine in the Big Data arsenal. I will cover other DBMS engines in future articles, expanding on the categories already covered. Again, choosing the right DBMS engine(s) for the job can save a huge amount of work, and offer great performance and scalability if appropriate selections are made.