No more munging

Trifacta CEO: The evolution of data transformation and its impact on the bottom line

Lucy Carey

Fresh from announcing a $25m funding round this morning, the data analytics young guns take us on a tech deep dive and outline their semi-open approach.

The laborious art of data transformation is
undergoing a process of evolution, on a parallel – more agile – but
nonetheless complementary track to traditional ETL (extract
transform load). Part of this transition has been driven by a rash of
bright young startups – among them Trifacta, which
today announced a $25m funding round to help them on their mission
to drive “transformation” for data-driven enterprises. In this
interview, Joe Hellerstein, CEO and co-founder of Trifacta,
explains how his company’s tools enable ultra-rapid insights from
data, why open source isn’t always the solution, and how, even in
2014, data analytics still comes back to Moore’s law.

JAX: What makes Trifacta’s technology unique,
and who do you see as your key market? Can you give us a deep dive
into your tech?

Hellerstein: Trifacta’s mission
is to radically improve the way that people work with data. Our
product and technology discussions always begin with users and the
work they do to add value in an
organization.

Trifacta is designed with the
notion of agile, purpose-driven data transformation in mind. Our
product provides analysts both agility and scale, encouraging more
creative and aggressive use of data than was typical with traditional
enterprise data management software.

We’ve delivered the initial versions of our
Transformation Platform for use with Hadoop, and are targeting
organizations that have embraced Big Data infrastructure as a
business asset. That covers a broad range of industries, from
health care and government to financial services, media and high
tech.

Trifacta’s key technology, Predictive Interaction™, combines machine
learning, visualization methods and scalable data technologies to
change the user experience of working with data, radically reducing
the friction for users to express how they’d like to transform data
for each unique use case. Predictive Interaction™ pairs lightweight
user interactions with advanced machine learning intelligence to
elevate data transformation to a visual and intuitive experience
that we can scale up to the very largest data sets.

In the Trifacta platform, users visualize data
and the effects of algorithmic suggestions, while machine learning
methods analyze data and user interactions.  By providing an
interface that harnesses both human and machine intelligence,
Trifacta makes data transformation more intuitive while increasing
productivity by as much as 10 times. This enables analysts, data
scientists and even IT programmers to focus on the analysis rather
than getting bogged down in data manipulation.
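The interaction loop described above, in which the system observes what a user highlights and ranks candidate transformations against the rest of the data, can be sketched in miniature. This is a hypothetical illustration, not Trifacta’s actual algorithm: the candidate patterns, scoring, and function names are all invented for the example.

```python
import re

# Hypothetical sketch of a predictive-interaction loop: the user highlights
# an example substring in one record, and the system ranks candidate
# extraction rules by how many other records they also match. Patterns and
# scoring are illustrative only.

CANDIDATE_PATTERNS = {
    "digits": r"\d+",
    "word": r"[A-Za-z]+",
    "date": r"\d{4}-\d{2}-\d{2}",
}

def suggest_rules(records, example):
    """Rank candidate patterns by how well they generalize the example."""
    suggestions = []
    for name, pattern in CANDIDATE_PATTERNS.items():
        if not re.fullmatch(pattern, example):
            continue  # rule doesn't even explain the user's selection
        coverage = sum(1 for r in records if re.search(pattern, r))
        suggestions.append((name, coverage / len(records)))
    return sorted(suggestions, key=lambda s: -s[1])

records = ["order 2014-02-04 #1001", "order 2014-03-01 #1002", "refund pending"]
print(suggest_rules(records, "2014-02-04"))
```

The key idea, as the interview describes it, is that the human supplies a lightweight example while the machine does the generalization, with each suggestion shown visually before it is applied.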

What’s under the hood?

With respect to what’s under the hood—much of
our core technology was developed in house. The interaction and
user-facing machine learning methods are implemented in JavaScript;
of course we leverage visualization technologies like Jeff Heer’s
D3.js, and the Vega package we released to open source. On the
machine learning front we use a variety of statistical models and
inference techniques to understand data and user
interaction.

At the heart of the platform is a
Domain-Specific Language for data transformation we call Wrangle,
which is reflected in our visualizations to make sure users can
literally see what they are doing to their data every step of the
way. Wrangle is backed by an extensible set of compilers that can
translate down to multiple execution targets—today these include
both JavaScript in the browser, and Hadoop as a scalable
backend.
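That compiler architecture, one script with interchangeable execution targets, can be sketched with a toy example: a tiny transformation script compiled either to an in-process function (standing in for the browser-side JavaScript target) or to a declarative string (standing in for a scalable backend). The step names and grammar here are invented; this is not Wrangle’s actual language or compilers.

```python
# Illustrative sketch of a DSL with multiple compilation targets.
# The script is a list of (operation, arguments) steps; each compiler
# walks the same script and emits a different execution form.

SCRIPT = [
    ("drop_empty", {"column": "name"}),
    ("split", {"column": "name", "on": ","}),
]

def compile_to_python(script):
    """Compile the script into an in-process row transformer
    (analogous to running on a data sample in the browser)."""
    def run(rows):
        out = []
        for row in rows:
            for op, args in script:
                if op == "drop_empty" and not row.get(args["column"]):
                    row = None  # discard rows with an empty value
                    break
                if op == "split":
                    col = args["column"]
                    row = dict(row, **{col: row[col].split(args["on"])})
            if row is not None:
                out.append(row)
        return out
    return run

def compile_to_sqlish(script):
    """Compile the same script into a declarative pipeline string
    (analogous to targeting a scalable backend such as Hadoop)."""
    clauses = []
    for op, args in script:
        if op == "drop_empty":
            clauses.append(f"FILTER({args['column']} IS NOT EMPTY)")
        elif op == "split":
            clauses.append(f"SPLIT({args['column']}, '{args['on']}')")
    return " |> ".join(clauses)

rows = [{"name": "Ada,Lovelace"}, {"name": ""}]
print(compile_to_python(SCRIPT)(rows))
print(compile_to_sqlish(SCRIPT))
```

Because both compilers consume the same intermediate script, a user can preview transformations interactively on a sample and then run the identical logic at scale, which is the design the interview describes.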

Where will you invest this latest round of funding, and why?

The positive reaction that we’ve received to our
new approach to Data Transformation from a wide variety of users –
business analysts, data scientists and Big Data developers –
highlights the scope of the opportunity to change how people work
with data.  The financing will be used to scale the Trifacta
team so that we can accelerate the delivery of  Trifacta’s
Data Transformation Platform to more organizations.

You’ve said before that Trifacta — or at least
its approach — could one day end up being credited for lowering the
barrier to entry for big data projects. What do you think are the
current obstacles in place?

The term “automation” gets used quite a bit when it comes to big
data and data analytics, and it can be a red herring for
enterprises tackling the transformation issue. People are critical
to the data analysis process and the further they are from the
actual data, the harder it is for data to influence business
processes.

In an agile environment, the analyst creates links
between business problems and relevant data: forming hypotheses,
acquiring and preparing data, running analyses and interpreting the
resulting numbers. Transformation is a creative human activity, and
the more broadly it can be done within a business, the larger the
impact of big data on the bottom line. In that sense, Trifacta is
lowering the barrier to relevance for big data projects.

Have you seen a spike in interest since your
recent team up with Cloudera?

Cloudera has been an important partner for us
given their share of the production Hadoop environments in the
market today.  Our focus on delivering data transformation at
scale, against data that is not always well structured, means
that Trifacta’s Data Transformation Platform is particularly well
suited to data transformation against Hadoop as a source. The
Cloudera partnership was announced just a little over a month ago,
and we are already participating in joint business and sales
efforts, have joint customers and are seeing quite a bit of
traction together.

Why does Trifacta not open source its
tools?

Open source is a means to an end: the development of
robust, high-quality software. Sometimes open source is the best
means to that end, other times it’s not. At Trifacta we’re taking a
mixed strategy.

As a team, we are strong believers in an open source
approach where it fits, and we put our money where our mouth is.
 Our founding team is responsible for open source projects
including the D3.js visualization package from Heer’s group and the
MADlib machine learning library that I launched. As an
organization, Trifacta develops and supports the Vega visualization
package as well. We know from experience that it
takes a great deal of time and effort to build and engage with a
rich open source developer community. That effort necessarily
trades resources that could otherwise be directly used for
developing a product in house and engaging with users. That
tradeoff makes sense when the community effort can be converted
back over time into enhanced value for the product and customer
base.

To date, most of the successful open source
efforts have been developer-facing software, with a programmatic
“interface” consisting of developer APIs. This suits the interests
and skills of typical developers. Trifacta’s Data Transformation
Platform as a whole, with its integrated visual interface for
end-users of data analysis, doesn’t fit the typical template for
open source success. So, we’ve chosen a hybrid approach: our core
product is closed source, but we’re releasing reusable components
like Vega when we see potential for serving a broader developer
ecosystem, and leveraging the efforts of that ecosystem to the
benefit of our product and customers.

What needs to be done to wake more of the
market up to data transformation, in your opinion?

People who work with data on a daily basis are
acutely aware of the hurdle presented by manual data
transformation. The pain is well recognized. That said, for
enterprises to see the complete value of a powerful data
transformation solution, they have to look beyond the pain to the
upside opportunity: as data professionals develop an agility in
transforming data to business purposes, their productivity leads
directly to an iterative loop of business improvement.

As an organization begins winning in its market
by using data, agile transformation becomes a powerful enabler for
building and maintaining that competitive advantage. This
realization of potential upsides is a very active discussion across
a number of sectors, and as the market evolves, organizations will
naturally gravitate to new data transformation technology.

Finally, it’s been interesting to see
the newer software stacks (Hadoop, NoSQL) begin to address some
traditional use cases, technologies and customers. For example,
we’re seeing multiple SQL implementations on top of Hadoop, as well
as richer consistency (transaction) mechanisms as an option in
NoSQL and NewSQL stores. This technology evolution will cause
significant disruption in the market, as new stacks and old stacks
continue to copy each other’s ideas and trend toward the same
unified set of functions.

Can you tell us a little about the background
of Trifacta?

The company began with a Berkeley-Stanford research
initiative dating back to 2010, as part of my work as a professor
of computer science at Berkeley. My expertise is in data and
distributed systems. At that time I had realized that there was a
broadening gap in my field: innovations in computing infrastructure
and algorithms had far outpaced most users’ abilities to do useful
work with data at scale. As a result, people had become the
bottleneck in data analytics. I got excited about developing new
technologies to attack these human bottlenecks head on.

To that end, I kicked off a collaborative project with
leading thinkers in human-computer interaction, including data
visualization guru Prof. Jeffrey Heer of Stanford, and a brilliant
former analyst named Sean Kandel who had joined Stanford’s computer
science Ph.D. program.  Sean did field research, interviewing
fellow analysts at 25 different firms across multiple business
sectors. The analysts reported that the bulk of their time, up to
80% of it, was spent manually transforming (“munging” or
“wrangling”) raw data into forms suitable for analysis.

Meanwhile, back on campus, we were exploring some radical new
approaches for visual data manipulation which blended ideas from
machine learning, data visualization, and data management systems.
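The “80%” problem is easy to illustrate: before any analysis can happen, raw records typically need delimiting, trimming, case normalization, and type coercion. A minimal sketch, with an invented record format:

```python
# Raw, messy records: inconsistent whitespace, casing, thousands
# separators, and one malformed line. The format is invented for
# the example.
raw = [
    "2014-02-04|acme corp | 1,200 ",
    "2014-02-05|ACME Corp|950",
    "bad line",
]

def wrangle(lines):
    """Turn raw delimited lines into tidy, typed rows."""
    rows = []
    for line in lines:
        parts = line.split("|")
        if len(parts) != 3:
            continue  # discard malformed records
        date, customer, amount = (p.strip() for p in parts)
        rows.append({
            "date": date,
            "customer": customer.lower(),           # normalize casing
            "amount": int(amount.replace(",", "")), # strip separators
        })
    return rows

print(wrangle(raw))
```

Even in this toy case, the transformation code dwarfs any eventual analysis step, which is exactly the imbalance the field research uncovered.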

Those ideas came together in a research prototype
called Data Wrangler, an online tool for data manipulation that
quickly attracted thousands of users. The prototype was somewhat
raw, but it was unlike anything on the market, and was clearly
addressing a major need. We knew we were onto something.

We launched Trifacta to extend our technology and
broadly change the way that people work with data in the field. Our
first target was informed by our field research: knock down the
“80%” problem of data transformation. The Trifacta Data
Transformation Platform that we shipped this year is addressing
this need with an intelligent and massively scalable solution, and
an interface that reflects a deep attention to design detail that
is unique in our sector. Trifacta then launched the
Transformation Platform on February 4, 2014. With the $25 million
series C led by Ignition Partners and including existing investors
Accel Partners and Greylock Partners, Trifacta has raised a total
of $41.3 million to date.

 

