No more munging

Trifacta CEO: The evolution of data transformation and its impact on the bottom line

The laborious art of data transformation is undergoing a process of evolution, on a parallel - more agile - but nonetheless complementary track to traditional ETL (extract, transform, load). Part of this transition has been driven by a rash of bright young startups - among them Trifacta, which today announced a $25m funding round to help it on its mission to drive “transformation” for data-driven enterprises. In this interview, Joe Hellerstein, CEO and co-founder of Trifacta, explains how his company’s tools enable ultra-rapid info takeaways, why open source isn't always the solution, and how, in 2014 data analytics, it still all comes back to Moore’s law.

JAX: What makes Trifacta’s technology unique, and who do you see as your key market? Can you give us a deep dive into your tech?

Hellerstein: Trifacta's mission is to radically improve the way that people work with data. Our product and technology discussions always begin with users and the work they do to add value in an organization. Trifacta is designed with the notion of agile, purpose-driven data transformation in mind. Our product provides analysts both agility and scale, encouraging more creative and aggressive use of data than was typical with traditional enterprise data management software.

We’ve delivered the initial versions of our Transformation Platform for use with Hadoop, and are targeting organizations that have embraced Big Data infrastructure as a business asset. That covers a broad range of industries, from health care and government to financial services, media and high tech.

Trifacta’s key technology, Predictive Interaction™, combines machine learning, visualization methods and scalable data technologies to change the user experience of working with data, radically reducing the friction users face in expressing how they’d like to transform data for each unique use case. Predictive Interaction™ technology combines lightweight user interactions with advanced machine learning intelligence to elevate data transformation to a visual and intuitive experience that we can scale up to the very largest data sets.

In the Trifacta platform, users visualize data and the effects of algorithmic suggestions, while machine learning methods analyze data and user interactions. By providing an interface that harnesses both human and machine intelligence, Trifacta makes data transformation more intuitive while increasing productivity by as much as 10 times. This enables analysts, data scientists and even IT programmers to focus on the analysis rather than being bogged down in data manipulation.

What’s under the hood?

With respect to what’s under the hood, much of our core technology was developed in house. The interaction and user-facing machine learning methods are implemented in JavaScript; of course we leverage visualization technologies like Jeff Heer’s D3.js, and the Vega package we released to open source. On the machine learning front we use a variety of statistical models and inference techniques to understand data and user interaction.

At the heart of the platform is a domain-specific language for data transformation we call Wrangle, which is reflected in our visualizations to make sure users can literally see what they are doing to their data every step of the way. Wrangle is backed by an extensible set of compilers that can translate down to multiple execution targets. Today these include both JavaScript in the browser and Hadoop as a scalable backend.
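As a rough illustration of one transformation script compiling to more than one execution target, here is a minimal Python sketch. It is not Wrangle, whose syntax and compilers are not described in this interview: the step names (`Split`, `Drop`), the `compile_local` and `compile_plan` functions, and the plan text format are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Row = Dict[str, str]


@dataclass(frozen=True)
class Split:
    """Hypothetical step: split `column` on `delimiter` into new columns."""
    column: str
    delimiter: str
    into: Tuple[str, ...]


@dataclass(frozen=True)
class Drop:
    """Hypothetical step: remove `column` from every row."""
    column: str


Step = Union[Split, Drop]


def apply_step(step: Step, row: Row) -> Row:
    """Apply one step to a copy of the row, leaving the input untouched."""
    out = dict(row)
    if isinstance(step, Split):
        parts = out[step.column].split(step.delimiter)
        for name, value in zip(step.into, parts):
            out[name] = value.strip()
    elif isinstance(step, Drop):
        out.pop(step.column, None)
    return out


def compile_local(steps: List[Step]) -> Callable[[List[Row]], List[Row]]:
    """Target 1: an in-memory function, e.g. for previews on a small sample."""
    def run(rows: List[Row]) -> List[Row]:
        result = []
        for row in rows:
            for step in steps:
                row = apply_step(step, row)
            result.append(row)
        return result
    return run


def compile_plan(steps: List[Step]) -> List[str]:
    """Target 2: a textual plan, standing in for code generated for a cluster."""
    plan = []
    for step in steps:
        if isinstance(step, Split):
            columns = ", ".join(step.into)
            plan.append(f"SPLIT {step.column} ON {step.delimiter!r} INTO {columns}")
        elif isinstance(step, Drop):
            plan.append(f"DROP {step.column}")
    return plan


# The same script is handed to both targets.
script = [Split("name", ",", ("last", "first")), Drop("name")]
preview = compile_local(script)([{"name": "Hellerstein, Joe"}])
plan = compile_plan(script)
```

The design point is that the script is plain data. Each target only has to interpret the same list of steps, so a further backend could be added without changing how a transformation is written.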

Where will you invest this latest round of funding, and why?

The positive reaction that we've received to our new approach to data transformation from a wide variety of users - business analysts, data scientists and Big Data developers - highlights the scope of the opportunity to change how people work with data. The financing will be used to scale the Trifacta team so that we can accelerate the delivery of Trifacta’s Data Transformation Platform to more organizations.

You’ve said before that Trifacta — or at least its approach — could one day end up being credited for lowering the barrier to entry for big data projects. What do you think are the current obstacles in place?

The term “automation” gets used quite a bit when it comes to big data and data analytics, and it could be a red herring for enterprises tackling the transformation issue. People are critical to the data analysis process, and the further they are from the actual data, the harder it is for data to influence business processes.

In an agile environment, the analyst creates links between business problems and relevant data: forming hypotheses, acquiring and preparing data, running analyses and interpreting the resulting numbers. Transformation is a creative human activity, and the more broadly it can be done within a business, the larger the impact of big data on the bottom line. In that sense, Trifacta is lowering the barrier to relevance for big data projects.

Have you seen a spike in interest since your recent team-up with Cloudera?

Cloudera has been an important partner for us given their share of the production Hadoop environments in the market today. Our focus on delivering data transformation at scale, and against data that is not always well structured, means that Trifacta’s Data Transformation Platform is particularly well suited to data transformation against Hadoop as a source. The Cloudera partnership was announced just a little over a month ago, and we are already participating in joint business and sales efforts, have joint customers and are seeing quite a bit of traction together.

Why does Trifacta not open source its tools?

Open source is a means to an end: the development of robust, high-quality software. Sometimes open source is the best means to that end; other times it’s not. At Trifacta we’re taking a mixed strategy.

As a team, we are strong believers in an open source approach where it fits, and we put our money where our mouth is. Our founding team is responsible for open source projects including the D3.js visualization package from Heer’s group and the MADlib machine learning library that I launched. As an organization, Trifacta develops and supports the Vega visualization package as well. We know from experience that it takes a great deal of time and effort to build and engage with a rich open source developer community. That effort necessarily trades away resources that could otherwise be used directly for developing a product in house and engaging with users. That tradeoff makes sense when the community effort can be converted back over time into enhanced value for the product and customer base.

To date, most of the successful open source efforts have been developer-facing software, with a programmatic “interface” consisting of developer APIs. This suits the interests and skills of typical developers. Trifacta’s Data Transformation Platform as a whole, with its integrated visual interface for end-users of data analysis, doesn’t fit the typical template for open source success. So, we’ve chosen a hybrid approach: our core product is closed source, but we’re releasing reusable components like Vega when we see potential for serving a broader developer ecosystem, and leveraging the efforts of that ecosystem to the benefit of our product and customers.

What needs to be done to wake more of the market up to data transformation, in your opinion?

People who work with data on a daily basis are acutely aware of the hurdle presented by manual data transformation. The pain is well recognized. That said, for enterprises to see the complete value of a powerful data transformation solution, they have to look beyond the pain to the upside opportunity: as data professionals develop agility in transforming data for business purposes, their productivity leads directly to an iterative loop of business improvement.

As an organization begins winning in its market by using data, agile transformation becomes a powerful enabler for building and maintaining that competitive advantage. This realization of potential upsides is a very active discussion across a number of sectors, and as the market evolves, those sectors will naturally gravitate to new data transformation technology. Finally, it’s been interesting to see the newer software stacks (Hadoop, NoSQL) begin to address some traditional use cases, technologies and customers. For example, we’re seeing multiple SQL implementations on top of Hadoop, as well as richer consistency (transaction) mechanisms as an option in NoSQL and NewSQL stores. This technology evolution will cause significant disruption in the market, as new stacks and old stacks continue to copy each other’s ideas and trend toward the same unified set of functions.

Can you tell us a little about the background of Trifacta?

The company began with a Berkeley-Stanford research initiative dating back to 2010, as part of my work as a professor of computer science at Berkeley. My expertise is in data and distributed systems. At that time I had realized that there was a broadening gap in my field: innovations in computing infrastructure and algorithms had far outpaced most users’ abilities to do useful work with data at scale. As a result, people had become the bottleneck in data analytics. I got excited about developing new technologies to attack these human bottlenecks head on.

To that end, I kicked off a collaborative project with leading thinkers in human-computer interaction, including data visualization guru Prof. Jeffrey Heer of Stanford, and a brilliant former analyst named Sean Kandel who had joined Stanford's computer science Ph.D. program. Sean did field research, interviewing fellow analysts at 25 different firms across multiple business sectors. The analysts reported that the bulk of their time, up to 80%, was spent on manually transforming (“munging” or “wrangling”) raw data into forms suitable for analysis. Meanwhile, back on campus, we were exploring some radical new approaches for visual data manipulation which blended ideas from machine learning, data visualization, and data management systems.

Those ideas came together in a research prototype called Data Wrangler, an online tool for data manipulation that quickly attracted thousands of users. The prototype was somewhat raw, but it was unlike anything on the market, and was clearly addressing a major need. We knew we were onto something.

We launched Trifacta to extend our technology and broadly change the way that people work with data in the field. Our first target was informed by our field research: knock down the "80%" problem of data transformation. The Trifacta Data Transformation Platform that we shipped this year is addressing this need with an intelligent and massively scalable solution, and an interface that reflects a deep attention to design detail that is unique in our sector. Trifacta launched the Transformation Platform on February 4, 2014. With the $25 million Series C led by Ignition Partners and including existing investors Accel Partners and Greylock Partners, Trifacta has raised a total of $41.3 million to date.

 


Lucy Carey


