Sometimes it's hard to see the wood for the trees

Clinton’s campaign: Could machine learning and Java have prevented its failure to handle big data?

Martijn Verburg


Hillary Clinton’s 18-month campaign as the Democratic candidate for the presidency of the United States was famously and resoundingly data-driven. A war room of senior analysts, mathematicians and researchers monitored, tracked and calculated every significant movement of her campaign, adjusting strategy to shift focus between geo-political demographics. The polls affirmed that she was on track to win until the final hours of the campaign, but ultimately the pollsters failed to forecast the US election result correctly. A loss not just for Clinton and the Democrats, but for all those who champion the value of data. Or so the story goes.

In truth, Clinton’s loss, and our collective failure to forecast it, was a failure of data: humans were unable to see the biases in it, or to analyse it accurately at scale and speed. This highlights one of the biggest challenges facing today’s politicians and businesses alike. With so much data being collected and collated on an unprecedented scale, the shock US election result reminds us that this information is only useful if it can be efficiently analysed to produce real, actionable insights that inform decisions.

Machine learning (ML) and artificial intelligence (AI) have the ability to deliver those insights if employed correctly; they offer the ability to analyse vast datasets and identify correlations that humans, given the sheer size of the task, are simply unable to process. Data, and the ability to crunch that data, will be an increasingly vital differentiator for big corporates in the future. In our digitally connected age, even traditional manufacturing businesses, like sportswear specialist Nike, now position themselves as data-led companies.

Don’t believe all the hype

Of course, there’s been plenty of media hype about machine learning and how it is the future solution to practically everything. What the hype fails to address is that effective use of machine learning takes scientific rigour to get meaningful results. This can be a time-consuming, expensive and complex operation as data is cleansed, approaches are trialled and validations are run.

Clean, unbiased data is paramount

It is worth reminding ourselves that machine learning will only ever be as good as the data it uses. Clean data is paramount. Furthermore, for supervised learning, the data needs to be categorised with accuracy and precision for the algorithm to make any sense of it.
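To make that concrete, below is a minimal sketch in plain Java of what a cleansing step might look like. The PollRecord type and its fields are hypothetical stand-ins for whatever a real pipeline ingests; the point is simply that incomplete or implausible records must be dropped (or repaired) before any learning happens.

import java.util.List;
import java.util.stream.Collectors;

// A minimal sketch of the "clean, labelled data" step. PollRecord and its
// fields are hypothetical stand-ins, not any real polling schema.
public class DataCleansing {

    record PollRecord(String state, Integer age, String declaredVote) {}

    // Keep only records complete enough to label; a supervised learner
    // can make nothing of missing or implausible fields.
    static List<PollRecord> cleanse(List<PollRecord> raw) {
        return raw.stream()
                .filter(r -> r.state() != null && !r.state().isBlank())
                .filter(r -> r.age() != null && r.age() >= 18 && r.age() <= 120)
                .filter(r -> r.declaredVote() != null)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<PollRecord> raw = List.of(
                new PollRecord("PA", 44, "R"),
                new PollRecord(null, 31, "D"),    // missing state: dropped
                new PollRecord("WI", 205, "D"));  // implausible age: dropped
        System.out.println(cleanse(raw));          // only the PA record survives
    }
}

Filtering is the easy half, of course; deciding what counts as "implausible", and whether dropping those records itself introduces bias, is exactly where the scientific rigour comes in.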

You only need to look at the pollsters who wrongly forecast the US election result to see how damaging biased data can be. They failed to identify the impact of white middle-class voters in several key states [1].

Their fatal mistake, repeated over and over, was feeding incomplete data into their analytics algorithms, which then produced skewed insights.

Validation is essential

There remains a crucial human element to any machine learning: every algorithm needs to be rigorously validated to ensure it is operating correctly.
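As a rough illustration, holdout validation is the simplest form this can take: hold back a slice of the data, train on the rest, and score the model only on data it has never seen. The sketch below is plain Java against a hypothetical Model interface, not any particular library’s API.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// A minimal holdout-validation harness. Model is a hypothetical stand-in
// for whatever learner you plug in.
public class HoldoutValidation {

    interface Model {
        void train(List<double[]> features, List<Integer> labels);
        int predict(double[] features);
    }

    // Shuffle, split 80/20, train on the 80%, report accuracy on the 20%.
    static double validate(Model model, List<double[]> x, List<Integer> y) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < x.size(); i++) indices.add(i);
        Collections.shuffle(indices, new Random(42)); // fixed seed: repeatable

        int cut = (int) (indices.size() * 0.8);
        List<double[]> trainX = new ArrayList<>(); List<Integer> trainY = new ArrayList<>();
        List<double[]> testX  = new ArrayList<>(); List<Integer> testY  = new ArrayList<>();
        for (int i = 0; i < indices.size(); i++) {
            int idx = indices.get(i);
            if (i < cut) { trainX.add(x.get(idx)); trainY.add(y.get(idx)); }
            else         { testX.add(x.get(idx));  testY.add(y.get(idx)); }
        }

        model.train(trainX, trainY);
        int correct = 0;
        for (int i = 0; i < testX.size(); i++)
            if (model.predict(testX.get(i)) == testY.get(i)) correct++;
        return (double) correct / testX.size(); // accuracy on unseen data only
    }

    public static void main(String[] args) {
        // Toy data: the label is 1 exactly when the single feature is positive.
        List<double[]> x = new ArrayList<>();
        List<Integer> y = new ArrayList<>();
        Random rnd = new Random(7);
        for (int i = 0; i < 200; i++) {
            double f = rnd.nextGaussian();
            x.add(new double[] { f });
            y.add(f > 0 ? 1 : 0);
        }
        // A trivial threshold "model", just to exercise the harness.
        Model threshold = new Model() {
            public void train(List<double[]> fs, List<Integer> ls) { }
            public int predict(double[] f) { return f[0] > 0 ? 1 : 0; }
        };
        System.out.printf("Held-out accuracy: %.2f%n", validate(threshold, x, y));
    }
}

The accuracy figure only means something because the test slice played no part in training; quote a number measured on the training data itself and you are straight back in XKCD territory.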

Without validation, insights risk being based on entirely fictitious output. The danger is neatly summed up by the following XKCD strip, where a very small sample set and a lack of validation lead to... well, read for yourself!


Source: https://xkcd.com/1122/

Understanding the algorithm

The likes of Google, Microsoft and Amazon often use neural nets, a powerful approach for the right domain but one that is notoriously hard to validate. RankBrain, Google’s AI algorithm, which makes ‘educated guesses’ to solve new search queries, is now so advanced that even a top Google engineer admitted earlier this year that he doesn’t know exactly how it works [2]. So although in practice an algorithm can seem to be working well, there may be issues down the line when errors occur, because it is no longer possible to trace the logic of how it arrived at a result.

Winning the talent war

So, how is the IT industry addressing these challenges? Firstly, with talent. Jobs like ‘Data Scientist’ barely existed a decade ago, yet demand for these skills is growing exponentially. Searching academia to find and hire the best data scientists, those with real rigour in their methods, must be part of any hiring strategy. It’s a war for talent out there, and we are one of the companies looking to attract data science and mathematically minded candidates.

Cost reduction – the great leveller

Today, it is much cheaper to run machine learning experiments. Historically, machine learning has been a very compute- and data-intensive activity, but costs are constantly dropping, so boutique outfits like ours can now afford to use it in ways that were impossible five years ago. This has created a plethora of start-ups applying ML in a variety of fields, a strong trend that will continue for many years to come.

How does Java come into the equation?

Java’s stable infrastructure comes to the fore when acting on the output of a machine learning algorithm. Ever since its inception in 1995, Java has been a core tool for writing the business rules that act on the data coming out of such algorithms. The long commercial life and wide adoption of Java have created a robust ecosystem of documentation, libraries and frameworks used in e-commerce, security and complex transactional architectures.

Today, it is Java that connects us to just about every data source on the planet.

Harnessing the power of Java and machine learning

At jClarity, we have used this ultimate language of integration to write business rules from machine learning results and deliver a solution for our customers. The result is reliable data analysis, at speed.
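To illustrate the pattern (and only the pattern; this is a hypothetical sketch, not our actual product code), imagine an ML model that scores a JVM’s health, with plain, auditable Java rules turning that score into an operational decision:

// A hypothetical sketch of "business rules on top of ML output".
// The Diagnosis type and the thresholds are invented for illustration.
public class PerformanceRules {

    // Stand-in for the output of an ML-based diagnosis, e.g. the estimated
    // probability that a JVM is heading for a memory leak.
    record Diagnosis(String jvmId, double leakProbability, double avgGcPauseMs) {}

    enum Action { IGNORE, WATCH, ALERT_OPS }

    // The business-rule layer: deterministic, testable Java sitting on top
    // of a statistical model whose internals may be hard to inspect.
    static Action decide(Diagnosis d) {
        if (d.leakProbability() > 0.9 || d.avgGcPauseMs() > 500) return Action.ALERT_OPS;
        if (d.leakProbability() > 0.5) return Action.WATCH;
        return Action.IGNORE;
    }

    public static void main(String[] args) {
        System.out.println(decide(new Diagnosis("web-07", 0.93, 120.0))); // ALERT_OPS
    }
}

The split matters: the model can be retrained and improved independently, while the rules that actually page someone at 3 a.m. stay simple enough to read, test and reason about.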

We fine-tune Java performance for customers like Rightmove, whose websites receive millions of visits daily. By using machine learning techniques, we’re able to help build a major part of the modern Cloud with intelligent, lightweight performance analysis tools, so IT teams can stop fire-fighting and start building value for their users.

In the case of Rightmove, our recently announced partnership will improve the performance of their property search applications. The property site will benefit from having a specialist Java performance diagnostics engine, machine learning algorithms and Java enhanced search applications.

The key to powerful machine learning is to focus on a narrow enough scope for the problem you are trying to solve. The media waxes lyrical about AI robots and how AI will take over the world, but machine learning is nowhere near being the general-purpose problem solver a human is. A machine can’t switch from solving a maths problem to opening a door.

Tyler Vigen, for example, has famously put together a site on spurious correlations [3]. Did you know that there is a 99% correlation between the divorce rate in Maine and the per capita consumption of margarine? Through humour, Tyler makes it very clear that it’s all too easy to come up with the wrong answer, especially when it comes to correlations!
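It takes only a few lines of Java to see how easily such a number falls out. The sketch below computes Pearson’s r for two invented series that merely drift downward together, in the spirit of Vigen’s chart, and duly reports a near-perfect correlation:

// Pearson correlation over two made-up, downward-drifting series.
// The data is invented for illustration; it is not Vigen's actual dataset.
public class SpuriousCorrelation {

    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov  += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        double[] divorceRate = { 5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1 };
        double[] margarine   = { 8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7 };
        System.out.printf("r = %.2f%n", pearson(divorceRate, margarine)); // ~0.99
    }
}

Correlation measures co-movement and nothing more; without a hypothesis and validation behind it, an impressive r is just a number.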

At jClarity, our success depends on scientific rigour: in data cleansing, in the ability to run lots of experiments and, of course, in validation to independently verify the output.

Sounds simple, doesn’t it? It’s not. But it works.

Links

1. http://www.tylervigen.com/spurious-correlations

2. https://www.seroundtable.com/google-dont-understand-rankbrain-21744.html

3. http://www.tylervigen.com/spurious-correlations

Author
Martijn Verburg
Martijn Verburg is the CEO and co-founder of jClarity, a machine learning-based Java/JVM performance analysis company. He is the co-leader of the London Java User Group (LJC), and leads the global Adopt-a-JSR and Adopt OpenJDK efforts to enable the community to contribute to Java standards and OpenJDK. Martijn is the co-author of "The Well-Grounded Java Developer", which covers Java 7, polyglot programming on the JVM and modern software development techniques. He also acts as a community lead for the PCGen and Ikasan open source projects, moderates at JavaRanch and can be found answering thorny questions on the Programmers Stack Exchange site. He was recently made a Java Champion in recognition of his contribution to the Java ecosystem, and is a popular speaker at major conferences (JavaOne, JFokus, OSCON, Devoxx etc.), where he is known for challenging the industry status quo as "the Diabolical Developer".
