There be gold in that there startup!
Treasure Data: Breaking down the Hadoop barrier
We spoke to Treasure Data, a startup with big plans for the lifecycle of Big Data. This article was originally published in the September edition of JAX Magazine.
There are two camps to Big Data. Either you believe it is truly the next big thing in the software industry, with huge potential, or you are sitting firmly in the trough of disillusionment, adamant that it is just another fad.
Unfortunately for those in the latter group, the rise of Big Data and its collection of processing and storage techniques has been pretty inescapable in the past two years. The technology leading the charge is undeniably Hadoop, the Java-coded data processing framework that has matured nicely since notching up its first major release in late 2011.
While MapReduce and HDFS are both steady and fairly performant, enterprise customers are beginning to voice concerns over the speed of Hadoop. With the fast-paced nature of business, they want analytics insights at the drop of a hat, and not several days or weeks later. This is without mentioning the cost of setting up a fully functional infrastructure to begin with, as well as the time it takes to impart Hadoop knowledge to employees, or the cost of maintaining it all.
Put another way, the furniture is sturdily built but is in a thousand pieces on the floor. The savviest Hadoop vendors have recognised the importance of renovating the innards of Hadoop, whether it’s by giving Hive queries a much-needed speed boost (Hortonworks’ Stinger Initiative) or by starting offshoot projects altogether (Cloudera’s Impala). While this is undoubtedly a good thing to see, some companies just don’t have time to wait around for these projects to develop and mature.
Due to this, we’ve seen plenty of Big Data-centric startups emerge, claiming to offer a highly tuned service that covers the entire lifecycle of an application. The platform is already built for you, allowing you to focus on deriving insights rather than fiddling with open source projects (though this might be your thing).
Treasure Data are one such example, springing up right at the height of Hadoopmania in December 2011. Promising to “accelerate your time to insight”, the Mountain View company boast that customers take an average of seven days to be fully up and running on their cloud-based platform, saving them valuable time and money.
Although Treasure Data might have the whiff of a very proprietary system, the two co-founders hold exceptional open source experience in Japan and the United States. CEO Hiro Yoshikawa worked at Red Hat for seven years, mostly marketing various Linux products, while CTO Kaz Ohta’s Hadoop heritage is second to none: he set up one of the largest user groups in Japan and is one of the country’s foremost experts on parallel computing.
“A lot of people wanted to leverage big data outside of the top internet companies and it was actually really hard to do so,” Treasure Data’s marketing director Ankush Rustagi tells JAX Magazine.
“There wasn't a lot of technology that made it really easy to use big data. They really saw this opportunity to create a really easy solution to leverage big data with all the new data sources coming out and a lot of businesses wanted to track a lot of online media and machine data much more efficiently.”
Although second nature to those in the know, putting together the building blocks of Big Data isn’t easy for newcomers and can quickly become an onerous process.
“When we think of ourselves versus a lot of other platforms out there, it's really about providing an end-to-end really full Big Data analytics service,” Rustagi continues. He explains that Treasure Data can either be used for building an entire AWS-deployable infrastructure or in conjunction with an on-premise Hadoop distribution from Cloudera or Hortonworks for example.
To illustrate that time is very much of the essence with Treasure Data, Rustagi outlines the use case of one of their clients, a gaming company. Getting insight into the performance of varying video game titles is critical to making business decisions, as the shelf life of games can be so short.
“They can't wait 6-12 months just to get that infrastructure in place,” Rustagi exclaims. “They might have to close down the games where it's not financially possible.”
The marketing director believes customers primarily fall into one of two big data scenarios. Most of the time it’s the first: they’re trying to figure out which solutions they really need, and the infrastructure on offer comes with a big price tag. The time spent creating a project is also extremely long. Treasure Data therefore swoops in, providing the backbone so that the company in question can focus on what it does best – the application and its insights.
“They want to find out the TLC (true life cost), they want to get up and running very quickly especially on the more enterprise side and they don't know if they can make that kind of investment. They need something quicker to really understand.”
The second scenario is often when the company just want to get as much data inside a platform as possible, but their current infrastructure hinders them from doing so. Rustagi recounts a particular horror story of the game company attempting Big Data with a sharded MySQL solution and a nightly batch process.
“They were considering some of the larger solutions that are out there like an Oracle database and it just wasn't going to scale, it was really costly and it was going to take a while to get there,” he says, adding that with the help of a few open source tools, they could add data sources as JSON objects and stream data across all titles.
“The analyst team for each studio could get game-specific APIs, like user engagement, like coins and goal completion and then manage to get another of that data to understand things like average revenue per user.”
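As a rough illustration of the kind of metric Rustagi mentions, the sketch below computes average revenue per user for each title from a stream of JSON events. The field names (`title`, `user`, `event`, `revenue`) and the sample data are hypothetical and are not taken from Treasure Data or its client.

```python
import json
from collections import defaultdict

# Hypothetical event stream: one JSON object per line, as a game client might emit.
RAW_EVENTS = """\
{"title": "puzzle", "user": "u1", "event": "purchase", "revenue": 2.0}
{"title": "puzzle", "user": "u2", "event": "goal_completed", "revenue": 0.0}
{"title": "racer", "user": "u1", "event": "purchase", "revenue": 5.0}
{"title": "puzzle", "user": "u1", "event": "purchase", "revenue": 1.0}
"""


def arpu_by_title(lines):
    """Return {title: total revenue / number of distinct users} for each title."""
    revenue = defaultdict(float)
    users = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        revenue[event["title"]] += event.get("revenue", 0.0)
        users[event["title"]].add(event["user"])
    return {title: revenue[title] / len(users[title]) for title in revenue}


result = arpu_by_title(RAW_EVENTS.splitlines())
print(result)
```

Because every event carries its own title, a single stream can serve all studios, and each analyst team simply filters on the title it cares about.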
Break the Hadoop barrier down
Members of Treasure Data’s team believe that the barrier to entry with Hadoop is ultimately the biggest challenge it faces. Marketing VP Rich Ghiossi explains that just having people in-house with the capabilities to glue machines together is difficult, because the talent simply isn’t there.
“I was at a conference just recently and it was amazing to me. I was with a number of "experts" in the industry, and they as experts were having trouble keeping track of all of the new technologies that are coming in,” he revealed. “Frankly, technology is happening so quickly that companies can't keep up.”
Rustagi adds that even some experts from where Hadoop started, at Yahoo and Facebook, recognise how difficult and time-consuming the initial processes can be.
“Essentially, even though they understand Hadoop better than anyone else, they saw the value in our product. They completely have the skillset, they know exactly how it works by working in some of the largest Big Data environments out there but they still saw great value in Big Data.”
The duo believe there is “a lot of confusion in Big Data in general”, with many unable to ascertain just what Hadoop actually is. Rustagi says that “Hadoop is just one tool” in the box: it just so happens to be the one they have picked for the problems they are solving, rather than one they are “wedded to”.
“There's some things which Hadoop is really good at; in terms of scalability it's really good. In terms of manageability, there's some things that are left to be desired. We're not wedded to Hadoop but we do understand its value right now.”
Ghiossi is slightly more positive about Hadoop’s impact.
“Even though people are confused and probably don't have all the knowledge they need in terms of how to deploy, it's given the industry a kind of supercharged mindset around how to use all kinds of different data that I think the industry wasn't looking at 10-15 years ago,” he exclaims.
“I think in general Hadoop, the whole buzz around Big Data and Hadoop, has done wonders for moving the mindset forward in terms of using things like sensor data, log data, the kinds of data we knew existed but didn't really have a cost-effective way to process.”
Open source credentials
At the heart of Treasure Data are two important pieces of kit that separate it from the pack. The first is FluentD, a log collection daemon created by another founder of the startup, Sadayuki Furuhashi. FluentD is an open source log collecting tool written in Ruby that is described as “syslog that understands JSON”. It has a small footprint of just 3,000 lines of Ruby, is easy to install and start up, and plugs into any architecture.
“We're actively trying to build that community even greater,” Rustagi states, before adding that the community is starting to see a good turnout.
“From our standpoint, Fluent and our contribution to FluentD is critical to us because the founders and the lineage of the company is really built around open source. It's Kaz and Hiro's background so it's important to use the community as a way to build the product. [But] then also take what they're innovating on and doing with FluentD in terms of the collection layer and provide those benefits to our customers in the enterprise version of td-agent.”
The cloud variant of FluentD is used to stream data in real time from various data sources.
“We'll take the data and you can default that as a buffer every five minutes. Then it packs that data as JSON objects and sends it to any cloud endpoint,” Rustagi says.
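The buffer-then-send behaviour Rustagi describes can be sketched in a few lines. This is not FluentD's or td-agent's actual code (FluentD is written in Ruby); `BufferedForwarder`, its parameters and the event tags are hypothetical names chosen for illustration, and the "endpoint" here is just a Python list standing in for a network call.

```python
import json
import time


class BufferedForwarder:
    """Collect events in memory and send them as one JSON batch per interval."""

    def __init__(self, send, flush_interval=300.0, clock=time.monotonic):
        self._send = send                      # callable that receives one JSON string
        self._flush_interval = flush_interval  # seconds; 300 mirrors the five-minute default
        self._clock = clock                    # injectable so the example runs instantly
        self._buffer = []
        self._last_flush = clock()

    def emit(self, tag, record):
        """Queue one event, then flush if the interval has elapsed."""
        self._buffer.append({"tag": tag, "time": self._clock(), "record": record})
        if self._clock() - self._last_flush >= self._flush_interval:
            self.flush()

    def flush(self):
        """Send everything buffered so far as a single JSON array."""
        if self._buffer:
            self._send(json.dumps(self._buffer))
            self._buffer = []
        self._last_flush = self._clock()


# Usage with a fake clock and an in-memory "endpoint" (both assumptions for the demo).
now = [0.0]
sent = []
forwarder = BufferedForwarder(sent.append, flush_interval=300.0, clock=lambda: now[0])
forwarder.emit("game.login", {"user": "u1"})                 # stays in the buffer
now[0] = 301.0                                               # five minutes pass
forwarder.emit("game.purchase", {"user": "u1", "coins": 50})  # triggers a flush
print(len(sent), "batch sent")
```

Batching trades a few minutes of latency for far fewer network round trips, which is why a short buffer interval is a common default for log shippers.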
As expected, the project has seen fair adoption in Japan thanks to the company’s founders, but also has some interesting business use cases, such as SlideShare. The FluentD community has contributed over 150 plugins for the project, suggesting it is in rude health.
MessagePack, a binary serialisation format, is the other important piece in Treasure Data’s open source puzzle. It’s what they use to compress all the data on their systems, including what is pumped in by FluentD. SlideShare is once again a user, to make their feeds more manageable, while LinkedIn use it for web data.
“Honestly at this point the MessagePack community has grown enough it's self-sustaining without the founders being directly involved, so they just kind of provide overall guidance on the roadmap,” Rustagi explains.
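To give a feel for why a binary format such as MessagePack is more compact than JSON, here is a minimal sketch of an encoder covering only a small subset of the format (nil, booleans, unsigned integers up to 32 bits, strings of up to 31 bytes, and lists and maps of up to 15 entries). It is written from the public format description for illustration only; real projects would use an official MessagePack library, and the function name `pack` is simply this example's choice.

```python
import json
import struct


def pack(value):
    """Encode a small subset of MessagePack (see the lead-in for the limits)."""
    if value is None:
        return b"\xc0"                                  # nil
    if value is True:
        return b"\xc3"                                  # true
    if value is False:
        return b"\xc2"                                  # false
    if isinstance(value, int):
        if 0 <= value <= 0x7F:
            return struct.pack("B", value)              # positive fixint
        if 0 <= value <= 0xFF:
            return b"\xcc" + struct.pack("B", value)    # uint 8
        if 0 <= value <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", value)   # uint 16
        if 0 <= value <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", value)   # uint 32
        raise ValueError("integer out of range for this sketch")
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > 31:
            raise ValueError("only short strings (fixstr) are handled here")
        return struct.pack("B", 0xA0 | len(data)) + data  # fixstr
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("only small lists (fixarray) are handled here")
        return struct.pack("B", 0x90 | len(value)) + b"".join(pack(v) for v in value)
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("only small maps (fixmap) are handled here")
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return struct.pack("B", 0x80 | len(value)) + body  # fixmap
    raise TypeError(f"unsupported type: {type(value).__name__}")


event = {"compact": True, "schema": 0}
packed = pack(event)
as_json = json.dumps(event, separators=(",", ":")).encode("utf-8")
print(len(as_json), "bytes as JSON,", len(packed), "bytes as MessagePack")
```

The saving comes from replacing quotes, colons, commas and spelled-out literals with single type-and-length bytes, which adds up quickly across millions of small log events.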
The differentiator in Treasure Data’s offering is far more proprietary and helpful for speedier querying. Plazma is a multi-tenant columnar storage system, created to reduce HDFS inefficiencies and give the storage layer of Hadoop an ops focus.
“One of the things we wanted to focus on was taking the best parts of Hadoop to build the platform out and also improve upon it,” Rustagi notes.
“[With] the data stored in a variety of text files, it significantly improves our IO and secondly, by having a columnar database, we can compress the data. This helps with query processing.”
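Plazma itself is proprietary, so the sketch below does not show its format. It only illustrates the general reason columnar layouts compress well: values within one column tend to repeat, so even a naive scheme such as run-length encoding (swapped in here purely as an example) collapses them, whereas the same values interleaved row by row barely repeat at all. The rows and field meanings are hypothetical.

```python
from itertools import groupby

# Hypothetical rows: (title, event, coins)
ROWS = [
    ("puzzle", "login", 0),
    ("puzzle", "login", 0),
    ("puzzle", "purchase", 50),
    ("racer", "login", 0),
    ("racer", "login", 0),
    ("racer", "purchase", 50),
]


def to_columns(rows):
    """Transpose row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row-oriented: fields of different columns are interleaved, so nothing repeats.
row_stream = [field for row in ROWS for field in row]
row_runs = run_length_encode(row_stream)

# Column-oriented: each column is encoded on its own.
column_runs = [run_length_encode(column) for column in to_columns(ROWS)]

print(len(row_runs), "runs row-wise,", sum(len(r) for r in column_runs), "runs column-wise")
```

A columnar layout also lets a query that only needs one field, say `coins`, read that column alone instead of scanning every row in full, which is where the IO gain comes from.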
Cloudera, in conjunction with Twitter, released a columnar format of their own called Parquet, which is crucially open source. While Plazma is aiming at a different demographic, it may have lost that lustre it once held, with other vendors releasing similar projects. Plenty of other Hadoop projects feature under Treasure Data’s hood, as they use a modified version of HiveQL for their infrastructure and an optimising version of query project Pig.
Is Hadoop's future
Treasure Data pull no punches when it comes to talking about Big Data’s leading light. Though Hadoop is the basis for Treasure Data’s platform, it certainly isn’t the be-all and end-all, and clearly has a few frailties.
“One of the largest challenges is the perceived vs the real risk security in the cloud. I think that anyone who is in Big Data or dealing with Hadoop and cloud services feels this,” Rustagi says.
“I think that for us this will be one area that we continue to work on in terms of providing additional security and measures to really make sure enterprise customers feel they can reach compliant standards.”
Aside from security and the general confusion over what Hadoop actually is, Rustagi is resolute in his view (and Treasure Data’s) that customers crave easy-access, real-time analytics from their big data infrastructure.
“They're not just comfortable having batch analytics being the core way they can serve their teams and consume the data,” Rustagi states, before adding that Treasure Data plan to release a low-latency query engine to help combat the issue, which will be available later this year.
“People like Hadoop and how cheap it is and the infrastructure model, but ultimately, they want real-time Big Data inside, as jargony as that sounds. That's what they're looking for.”