Machine Learning essentials: Best practices, categories and misconceptions
Artificial Intelligence and Machine Learning are all the rage right now. Many companies feel the pressure to invest in an AI strategy before fully understanding what they are aiming to achieve. We talked with JAX London speakers Sumanas Sarma and Rob Hinds about the different types of ML tasks, the most suitable programming language for ML, misconceptions and more.
JAXenter: Among the different types of ML tasks, a very important distinction is drawn between supervised and unsupervised learning. What is the difference between the two?
Sumanas Sarma & Rob Hinds: Supervised Learning and Unsupervised Learning are considered two of the fundamental categories of Machine Learning (or any type of statistical learning).
Supervised Learning
Problems that fall into this category have a set of inputs (these can be real-valued or categorical) and one of two kinds of output:
- Tasks where the goal is to output a category
- Tasks where the goal is to output a real value
The first set of tasks is known as Classification; typical examples might be predicting whether an input text has positive or negative sentiment, or identifying the objects present in a given image. The second set is known as Regression; typical examples are predicting house prices based on a set of input features (like latitude, longitude, number of rooms etc.) or the age of a user given the set of apps on their phone.
Formally, Supervised Learning projects involve a set of variables, denoted as inputs, which are measured or preset and have some influence on one or more outputs. The goal is to use the inputs to predict the values of the outputs.
Algorithms used for these problems: Linear regression, Random Forest Classification, Support Vector Machine Classification, Naive Bayes.
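As a minimal sketch of the Regression case, the house-price example above can be reduced to fitting a straight line (price = w × rooms + b) to labelled examples by ordinary least squares. The data points here are invented purely for illustration:

```python
# Supervised Regression in miniature: learn w and b from known
# (input, output) pairs, then predict outputs for new inputs.

def fit_line(xs, ys):
    """Closed-form least squares for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

rooms = [1, 2, 3, 4, 5]             # input feature
prices = [110, 205, 300, 395, 500]  # known outputs (the "supervision")
w, b = fit_line(rooms, prices)

def predict(x):
    return w * x + b
```

The "supervision" is exactly the `prices` list: every input comes paired with the output the model should learn to reproduce.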
Unsupervised Learning
Problems that fall into this category differ from Supervised Learning in one key aspect – for each given set of inputs, there is no given output. The goal, therefore, is to learn/discover the underlying structure or distribution in the input data without any guidance.
Like Supervised Learning problems there are two broad types of Unsupervised Learning problems:
- Clustering: determine inherent groups in the data. Eg, using customer data to predict behavior, or using web traffic sources to determine target pages on the site
- Association: determine inherent rules in the data. Eg, if a customer buys onions (x) and potatoes (y) within a certain time period, they are likely to then buy hamburger meat (z).
Algorithms used for these problems: k-means clustering, Expectation Maximisation (for Gaussian Mixture Models), Manifold Learning.
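To make the Clustering case concrete, here is a bare-bones k-means sketch in pure Python: no labels are given, and the algorithm simply discovers k groups in the data. The data points are invented for illustration:

```python
# k-means: alternate between assigning points to their nearest centroid
# and moving each centroid to the mean of its assigned points.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            best = min(range(len(centroids)),
                       key=lambda i: abs(p - centroids[i]))
            clusters[best].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# two obvious groups, around 1 and around 10; k-means finds them unaided
data = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
centroids, clusters = kmeans(data, centroids=[0.0, 5.0])
```

Note that nothing in the input says which group a point belongs to; the structure is inferred from the data alone, which is the defining trait of Unsupervised Learning.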
Supervised: All input data is mapped to corresponding output values and the algorithms learn to predict the output from the input data.
Unsupervised: Absence of output values and the algorithms learn the inherent structure from the input data.
There is a third type, Semi-supervised Learning, where some data is labeled but most of it is unlabeled, and a mixture of supervised and unsupervised techniques can be used.
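One simple semi-supervised technique is self-training: a small labeled set seeds a classifier, which then labels the unlabeled pool itself. The sketch below uses a 1-nearest-neighbour rule on invented one-dimensional data, purely for illustration:

```python
# Self-training: the few labeled points bootstrap labels for the many
# unlabeled ones, growing the labeled set as it goes.

def nearest_label(x, labeled):
    """1-NN: copy the label of the closest labeled point."""
    return min(labeled, key=lambda item: abs(x - item[0]))[1]

labeled = [(1.0, "small"), (10.0, "large")]  # the few labeled examples
unlabeled = [0.8, 1.3, 9.5, 10.4]            # the bulk of the data

# label the unlabeled pool and fold each point back into the labeled set
for x in sorted(unlabeled):
    labeled.append((x, nearest_label(x, labeled)))
```

Real self-training would also track confidence and only fold in high-confidence predictions, but the mixing of supervised (the seed labels) and unsupervised (the unlabeled pool's structure) ingredients is already visible here.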
JAXenter: Which programming language is the most suitable for ML?
Sumanas Sarma & Rob Hinds: A controversial one, which like any question about programming languages doesn’t have one true answer.
Several choices can be considered, with some geared towards a research-focused approach whilst others are built for use in industrial applications. Much like any technology choice, the best candidate often depends on what it will be used for. For example, while MATLAB or Octave may make it easier to experiment with Machine Learning and Deep Learning architectures, they are poorly suited to deploying to production.
The most popular language used in ML is Python: most examples you will see in tutorials are in Python, several popular deep learning libraries primarily support it (Google’s TensorFlow, Keras, Theano), and a large number of Data Scientists seem to favor it. Moreover, this year’s IEEE Spectrum ranking of top programming languages has Python in the top spot, with C and Java following closely.
However, there are other options and other considerations that people might want to think about depending on their technology stack or skills availability. There are a few options on the JVM: Apache Spark supports Java, Scala, and Python, and DeepLearning4J is a Java-based, distributed deep learning library that is pitching itself at the Enterprise. In addition, TensorFlow offers Java APIs alongside Python (although Python remains the primary language for its use).
Furthermore, there are a few Java-based traditional ML libraries (not deep learning) such as Stanford’s NLP library or Apache OpenNLP.
Ultimately, the choice of language will hinge on the problem that needs to be solved. For instance, for a computer vision problem, it is advisable to start with a pre-trained model with state-of-the-art performance instead of training a fresh model from scratch. The availability of these pre-trained models can often dictate the choice of language in such cases; TensorFlow, for example, provides a collection of pre-trained models.
Engineering best practices
JAXenter: Could you name three engineering best practices that can be applied to machine learning?
Sumanas Sarma & Rob Hinds:
A typical Machine Learning pipeline has lots of moving parts and lots of steps in the process, and given that it can take a long time to train an ML model, it’s really important to have the individual steps as well tested as possible. Imagine waiting days for a model to train only to find that a pre-processing data clean-up step has a simple bug in it that renders all the training useless! To this end, it’s a good idea to spend some effort early on building out common testing infrastructure; this will save time later when experimenting with different approaches and will help ensure you are testing the different experiments in a consistent manner.
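As a small illustration of the point, here is a hypothetical pre-processing step with its test sitting right next to it; the function name and cleaning rules are invented, but the principle is that catching a bug here takes seconds, while catching it after days of training wastes the whole run:

```python
# A tiny pre-processing step of the kind that feeds a training pipeline.

def clean_text(text):
    """Lowercase, strip surrounding whitespace, collapse runs of spaces."""
    return " ".join(text.lower().split())

# cheap tests that run long before any training does
assert clean_text("  Hello   WORLD ") == "hello world"
assert clean_text("") == ""
```

In practice these assertions would live in a proper test suite, but the cost/benefit is the same: milliseconds of testing protecting days of compute.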
As mentioned above, there are lots of steps in an ML pipeline, and whilst the quickest thing for a one-off experiment might be to hack everything into a script, you will reap the benefits of thinking about how the code is structured so that it can be reused. A simple example is the preprocessing steps that clean up data sets or transform inputs; you might want to re-use those across different experiments.
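One way to keep such steps reusable across experiments is to express each one as a small function and compose them into a pipeline, rather than hard-coding the sequence inside one experiment script. The step names below are invented for illustration:

```python
# Composable pre-processing steps shared between experiments.

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

def make_pipeline(*steps):
    """Compose steps left to right into a single callable."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

# different experiments can now share, reorder, or swap the same steps
preprocess = make_pipeline(lowercase, strip_punctuation)
```

Library equivalents exist (scikit-learn's `Pipeline`, for instance), but even this hand-rolled version makes each experiment's preprocessing explicit and testable in isolation.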
ML pipelines can be complex beasts, so documenting what experiments are doing and why is especially important when working in a professional team: there is a good chance other members of your team will need to understand the work, and someone else may have to pick up the code after you.
JAXenter: Do Agile best practices translate well to artificial intelligence and/or machine learning?
Sumanas Sarma & Rob Hinds: In general, agile best practices can and should be applied to many aspects of machine learning development, but doing so presents a different set of challenges.
For example, early on in a project all the effort will be on research and on investigating possible approaches to tackle a given problem (and confirming that it is actually possible to solve with ML). These may be longer-running tasks than a normal development ticket, with looser criteria for completion.
That said, tying research “spikes” into the existing development cycle helps track the progress of lines of research, and also gives wider development teams more visibility of that progress. It may be that the team has several ML engineers and research fans out early on, following several different lines of investigation. Whilst not always possible, you can aim to have research spikes that last the length of the sprint (a fortnight or so), with a go/no-go call at the end of the sprint as to whether the approach shows enough promise to warrant further research or whether it should be dropped for a different investigation.
Following the sprint cycles also means the ML engineers stay connected with the wider engineering team and can attend stand-ups (although don’t expect too much of a daily update on research tasks) and sprint demos (which are always fun when there is an ML demo!).
As research progresses, and the team starts to narrow in on an approach, then more agile processes can be used, as the engineers start to think about getting models trained, productionised and the infrastructure that needs to be built to support that.
JAXenter: What is the biggest misconception about ML?
Sumanas Sarma & Rob Hinds: That it is applicable for all problems, and that everyone should be doing it. ML is well suited to several distinct types of problems, but there are also some problems that don’t fit so well. Furthermore, even if ML can solve the problem at hand, that still doesn’t mean it’s the right approach — there are still lots and lots of problems that are best solved with traditional algorithms or rule based systems.
JAXenter: According to this year’s Stack Overflow developer survey, machine learning specialists are runners-up in the “best paid” race. Why is this skill so sought-after now?
Sumanas Sarma & Rob Hinds: Artificial intelligence (AI) and machine learning (ML) are really trending concepts right now. At its recent I/O 2017 event, for example, Google put ML front and center, with plans to bake it into all its products. With most of the big players in technology applying ML techniques (whether it be fraud detection, facial recognition in photos or smarter voice-controlled apps), we are continuing to see a trend of new and old companies trying to get on board and work out how they can use AI and Machine Learning. According to a recent Narrative Science survey, 38% of enterprises surveyed were already using AI, with 62% expecting to be using it by 2018.
With this surge in interest in the area amongst companies, there is no surprise that there is fierce competition for talent in an area that is relatively young and very fast paced in terms of new approaches and techniques being developed — so it’s to be expected that those with the skills and experience can demand good salaries.
JAXenter: What does the future hold for ML?
Sumanas Sarma & Rob Hinds: This is an interesting question because the answer depends largely on how we end up using ML solutions. At present, ML solutions are best utilised in conjunction with human oversight. Imagine a customer service desk: it is now common for these operations to offer a text-based chat interface as an alternative to email/phone first-tier support. Instead of replacing this interface entirely with a chatbot, it is better to let the chatbot deal with common queries (like “where is my order?” or “I can’t get X to work!”) and escalate to a human operator when it is apparent that the solution is not simple.
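The escalation pattern described above can be sketched in a few lines. The intents, threshold, and word-overlap scoring here are invented for illustration; a real system would use a trained intent classifier, but the shape of the decision is the same:

```python
# A chatbot that answers what it is confident about and hands
# everything else to a human operator.

CANNED_REPLIES = {
    "where is my order": "Your order is on its way.",
    "reset my password": "A reset link has been sent to your email.",
}

def score(query, intent):
    """Crude word-overlap confidence between the query and a known intent."""
    q, i = set(query.lower().split()), set(intent.split())
    return len(q & i) / len(i)

def handle(query, threshold=0.5):
    intent = max(CANNED_REPLIES, key=lambda i: score(query, i))
    if score(query, intent) >= threshold:
        return CANNED_REPLIES[intent]
    return "Escalating you to a human operator."
```

The key design choice is the confidence threshold: the bot never has to be right about everything, it only has to know when it is unsure.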
On a wider research and industrial scale, it is incredibly fascinating to see Deep Learning and other Machine Learning techniques being applied to solve problems in innovative ways. Consider AtomNet, one of the first deep neural networks aimed at drug design: it discovers chemical features and tries to predict candidate biomolecules for disease targets.
Finally, on a note of caution and trust, it is crucial that people and companies understand what it is that they are aiming to get out of ML, and how it can be used to their advantage. Right now, companies may be scrambling to add it to their products, but when engineers don’t understand how an algorithm gets its results, it can be difficult to trust the system, particularly if the consequences of incorrect or biased results are detrimental to the people using it. EU legislation is trying to address some of these issues at a larger, more formal level, seeking to prevent decisions that may be detrimental to a person “based solely on automated processing” (Article 22).
JAXenter: What fascinates you about ML?
Sumanas Sarma & Rob Hinds: I think it’s a really interesting field at the moment: there is so much data available, and with increasing computing power (and the availability of cloud-based resources) it’s getting really exciting to see the different problems that ML is proving effective on, from self-driving cars and image recognition to natural language voice interfaces. It really seems to be starting to deliver on some of the promises of old Sci-Fi movies!
I think on a very simple level though, the idea that a machine can learn to generalise is pretty mind blowing at first – having been so used to the idea that machines run with very specific instructions. I remember my mind was blown by the concept of Unsupervised Learning – the idea that a computer could just be given a whole lot of data and just learn from that, with no other input about what was right or wrong. Word2Vec is a really neat demonstration of how effective this can be.
JAXenter: What can attendees get out of your JAX London talk?
Sumanas Sarma & Rob Hinds: We will go through some real life stories of how we have been trying to research and build ML models to solve a problem in a commercial environment and what went well, what failed and what we learnt along the way. We should also have a fun demo of an ML model running in the cloud!
Sumanas Sarma and Rob Hinds will be delivering one talk at JAX London which will focus on the engineering best practices that can be applied to ML, how ML research can be integrated with an agile development cycle, and how open-ended research can be managed within project planning.