Yahoo enters artificial intelligence race with CaffeOnSpark
Yahoo is the latest tech giant to create a deep learning system for developing predictive applications such as image or speech recognition. CaffeOnSpark is able to perform ‘deep learning’ on the massive amount of data kept in the company’s Hadoop file system. The new internally-built software is now available on GitHub.
The aim of CaffeOnSpark is to allow deep learning training and testing to be embedded into Spark applications, Yahoo wrote in the official announcement of its software. The company explained that deep learning represents a critical capability required by its product teams to acquire intelligence from huge amounts of online data.
Yahoo created CaffeOnSpark to help identify the images posted to its Flickr photo-sharing website and to make the site's search function more useful. The internally built software, released under an open-source license, can detect suitable images based on characteristics common to various kinds of objects, landscapes, animals and so on.
As the name suggests, Yahoo’s CaffeOnSpark unites two existing technologies: the deep learning framework Caffe and Spark, which can run on top of the big data platform Hadoop. In short, Yahoo found a way to run Caffe atop Spark clusters. Yahoo vice president of architecture Andy Feng told Wired that CaffeOnSpark makes it relatively easy to spread deep learning processes across a number of servers, something that its rival, the open-source version of Google’s TensorFlow, cannot yet do.
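To illustrate what spreading training across servers means in general, here is a minimal sketch of synchronous data-parallel training, simulated in plain Python on a toy linear model. It is not CaffeOnSpark code and makes no claim about how CaffeOnSpark synchronizes its workers; the function names (`local_gradient`, `train_data_parallel`), the four simulated workers and the learning rate are all assumptions chosen for illustration.

```python
import random


def local_gradient(w, b, shard):
    """Gradient of mean squared error for y = w*x + b on one worker's shard."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb


def train_data_parallel(data, num_workers=4, steps=500, lr=0.1):
    # Each simulated worker holds an equal-sized slice of the data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Every worker computes a gradient on its own shard ...
        grads = [local_gradient(w, b, shard) for shard in shards]
        # ... and the gradients are averaged before one shared update.
        gw = sum(g[0] for g in grads) / num_workers
        gb = sum(g[1] for g in grads) / num_workers
        w -= lr * gw
        b -= lr * gb
    return w, b


# Hypothetical toy dataset: noiseless points on the line y = 3x + 1.
rng = random.Random(0)
data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(200))]
w, b = train_data_parallel(data)
print(round(w, 3), round(b, 3))
```

In a real cluster each shard would live on a different machine and the averaging step would travel over the network, which is the part a framework has to handle.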
According to Yahoo’s official announcement, CaffeOnSpark was designed as a deep learning package for Spark. Spark MLlib supports a collection of non-deep learning algorithms for tasks such as classification, regression and clustering, to name a few. CaffeOnSpark fills a gap, since Spark MLlib does not include deep learning. The CaffeOnSpark API supports DataFrames, so users can feed in a training dataset prepared by a Spark application and obtain the model's predictions, or features from its intermediate layers, for results and data analysis using MLlib or SQL.
CaffeOnSpark eliminates the unnecessary data movement of traditional solutions and allows deep learning to be conducted directly on big-data clusters. According to Yahoo’s description, many deep learning jobs are long-running, so it is important to handle potential system failures. CaffeOnSpark allows the training state to be snapshotted periodically, so users can resume from the previous state after a CaffeOnSpark job fails.
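The snapshot-and-resume idea can be sketched in a few lines of Python. This is a generic illustration using a toy training loop and a JSON file, not CaffeOnSpark's actual snapshot mechanism or file format; the `train` function, its `fail_at` parameter and the snapshot interval are hypothetical names introduced for the example.

```python
import json
import os
import tempfile


def train(total_steps, snapshot_path, snapshot_every=10, fail_at=None):
    # Resume from the last snapshot if one exists, otherwise start fresh.
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "weight": 0.0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated system failure")
        # Toy update standing in for one real training iteration.
        state["weight"] += 0.1 * (1.0 - state["weight"])
        state["step"] += 1
        if state["step"] % snapshot_every == 0:
            # Write to a temporary file first, then swap it into place,
            # so a crash mid-write never leaves a corrupt snapshot.
            tmp_path = snapshot_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, snapshot_path)
    return state


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "snapshot.json")

try:
    train(50, path, fail_at=25)   # crashes after the step-20 snapshot
except RuntimeError:
    pass

resumed = train(50, path)         # picks up again from step 20
uninterrupted = train(50, os.path.join(workdir, "clean.json"))
```

Only the work done since the last snapshot is repeated after the failure, and the resumed run ends in the same state as a run that was never interrupted.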
Yahoo explained that it has applied CaffeOnSpark to several projects over the last few quarters and has received positive feedback from its internal users. The software benefits not only the deep learning community but also the Spark community, and it can be found on GitHub under the Apache 2.0 license.
Yahoo’s announcement comes a few months after Google open-sourced its TensorFlow machine learning framework and Microsoft open-sourced its CNTK machine learning framework. China’s Baidu and Facebook have also recently open-sourced their own machine learning technologies. One clear advantage of CaffeOnSpark is that it builds on an existing big data processing tool, which eases the transition to a new workflow.