Are you stuck in the past? A case against data sampling Part II
Modern technology can help free you from data sampling. Current computing power has made scalability widely available, and machine learning algorithms have made the discovery of data quality issues automated and easy. Move on from the old ways of data sampling and learn how to enter the new world of big, smart data.
In Part I, which you can find here, we discussed data sampling and its roots, challenges in statistical and machine learning projects, and how samples can undermine business decisions. Here we explore how business analysts can avoid problems with samples outside the context of machine learning and statistics and delve into how modern technologies can free you from samples.
For those business analysts who have a data-driven mindset and a self-service attitude – those who want to leverage modern technology to prepare and shape the data on their own using the entire body of data, not just samples – I offer the following advice.
Be wary of insights that are based on presumptions
Data sampling inherently limits you to a small portion of the data, which means your decisions are based on presumptions drawn from that small sample. This approach can be dangerous, as demonstrated by the sampling errors that surfaced in the 2012 U.S. presidential election. While both Nate Silver and the political pundits used publicly available polling data samples to provide insight about the actual election results – a future event – only one prediction turned out to be accurate.
Iterations will delay your time to value
Realizing that samples are limited, many who work with data samples end up iterating through their data collection and preparation exercises. This involves taking a chunk of the data, then detecting some patterns and anomalies on that sample, only to realize you missed some issues that you can clean up with the next batch that comes in. Depending on the size of the samples, you may have to go through 10 to 20 iterations to fully reach a level of confidence in the data.
Surprises are unavoidable
Data is rarely static, which means underlying issues change over time. In machine learning, this is okay because the model has been trained on a large body of data over time and is therefore optimized to predict based on new data. In non-machine learning projects, any change in the data will throw off results. For example, let’s assume that you get a transaction that is outside your normal range or has been submitted from a region not accounted for. Chances are you would miss that transaction, because neither the regional outlier nor the range outlier would appear in your data sample.
As a result, most organizations choose to leave samples to the experts who know how to separate the signal from the noise in a large data population and how to optimize the data for building predictive analytics. After all, they possess the coding and statistical techniques needed to get this done. As for the rest of us, we need the whole data to accurately assess and analyze our findings.
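To make the missed-outlier scenario concrete, here is a minimal sketch. The transactions, the "known" region, and the amount threshold are all hypothetical values chosen for illustration. It applies the same check to a 10% sample and to the full data set.

```python
# Hypothetical transactions as (id, region, amount); every value is illustrative.
transactions = [(i, "US", 100 + i % 50) for i in range(1000)]
transactions[537] = (537, "APAC", 125)     # region not accounted for
transactions[802] = (802, "US", 25_000)    # amount far outside the normal range

def find_outliers(rows, known_regions=frozenset({"US"}), max_amount=1_000):
    """Return rows from an unknown region or above the assumed amount limit."""
    return [r for r in rows if r[1] not in known_regions or r[2] > max_amount]

# Systematic 10% sample: every tenth record (ids 0, 10, 20, ...).
sample = transactions[::10]

sample_outliers = find_outliers(sample)
full_outliers = find_outliers(transactions)

print("outliers seen in sample:", sample_outliers)
print("outliers seen in full data:", [r[0] for r in full_outliers])
```

Neither unusual record falls on a sampled position, so the check on the sample finds nothing while the check on the full data finds both. A random 10% sample would do no better on average, since each rare record has only about a one-in-ten chance of being drawn.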
Data prep solutions & sampling
Unfortunately, this paradigm of working with samples has been adopted by some data prep solutions, and their users are forced to live with its natural consequences.
In fact, some data preparation/wrangling tools cannot process larger data sets in real time, so they provide business users with an interface that exposes only a small sample of the data (often <10%). After the business user discovers, explores, and applies ad-hoc data preparation steps to clean and shape the sample data, the tool generates a script (i.e., code) that can be applied to the full data via a subsequent batch process. As you can imagine, there are several fundamental flaws with this approach:
- Users are forced to pick samples. Samples are inherently limited and can never fully and accurately guide a data-driven process.
- Beyond the limitation of being forced to sample, choosing which sampling technique to use is complex and suited only to experts.
- Taking an iterative approach to building data transformation logic based on samples is time-consuming and delays the process. It requires interpreting the outcomes from one sample, identifying what could have been missed from that sample, and iterating through the cycle with a different sample or sampling technique.
Bottom line? Even if users take the extreme and impractical step of repeating the sample creation process 10-20 times, there is simply no way to have full confidence that all issues have been identified and resolved.
How modern technology can free you from samples
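A rough back-of-the-envelope calculation shows why repeated sampling still leaves gaps. The sketch below assumes each iteration draws an independent simple random sample containing a fixed fraction of the records. That assumption is ours, not a description of any particular tool. Under it, the chance that one specific problem record never appears in any sample is (1 - fraction) raised to the number of iterations.

```python
def miss_probability(sample_fraction: float, iterations: int) -> float:
    """Probability that one specific record is absent from every one of
    `iterations` independent simple random samples of the given fraction."""
    return (1.0 - sample_fraction) ** iterations

# Assumed 10% sample, repeated 1, 10, and 20 times.
for n in (1, 10, 20):
    print(n, "iterations ->", round(miss_probability(0.10, n), 3))
# 1 iterations -> 0.9
# 10 iterations -> 0.349
# 20 iterations -> 0.122
```

Even after 20 rounds of 10% samples, a single rare record still has roughly a 12% chance of never being seen, and that is for one record only.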
In a world where computing is infinitely flexible, scalable, and affordable – why are people still doing this?
Without all the earlier limitations, we can explore and learn from the full data – without the technical expertise of data scientists/coders or IT specialists.
Newer data preparation products, designed around self-service for business users, leverage built-in NLP and smart algorithms to detect data quality issues at scale using underlying cloud architecture to discover, prepare and process the data in its entirety, rather than just small samples. Technology advancements that make this possible include:
Spark™: The computational framework for processing data volumes at speed
Big data distributed processing frameworks emerged in the mid-2000s and quickly gained popularity because they enabled organizations to efficiently store and process large volumes of data on a cluster of commodity hardware. Components of the Apache Hadoop ecosystem, such as the Hadoop Distributed File System (HDFS), MapReduce, and the YARN resource manager, laid the groundwork for batch processing across large distributed clusters.
Later, Apache Spark™ became the go-to big data computing framework. Spark’s in-memory data engine meant that it could perform tasks up to one hundred times faster than MapReduce in certain situations. Spark thus became the next-generation in-memory computing environment, offering order-of-magnitude speed gains for accessing, retrieving, and processing data at scale.
While this was the beginning of massive-scale computing, it was only available to programmers, not the data and business analysts who needed it. Programmers used native Spark to source, join, aggregate, and filter data, while analysts and less technical users had to stay on desktop applications. In fact, it wasn’t until 2012 that data preparation solutions came on the scene. These tools married ease of use and visual point-and-click interaction with a powerful and scalable Spark™ backend architecture, making large-scale data preparation available to business users who are not programmers.
Kubernetes and dynamic scaling across ephemeral clusters
Today, Kubernetes and new container strategies make large-scale computing even more appealing from a cost and management standpoint.
In large enterprises, central IT can provision a single large compute farm to be shared across teams. In the past, IT had to forecast demand and over-provision capacity for big data workloads. With several teams competing for those resources, managing them became a heavy administrative task.
Kubernetes provides an elegant way to manage such requirements. For example, the solution can carve up the resources based on an initial assessment, and the allocation can then balloon to use more cores when demand is higher. Once the processing is over and the results are rectified, these ephemeral clusters scale down until other jobs need to run.
This optimized resource utilization helps organizations keep costs down while providing large-scale computing power when needed, automatically and with no manual intervention. It also makes large-scale computing much more accessible and affordable – yet another reason why limiting data to samples as a requirement for data transformation is becoming more and more absurd.
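The article does not say which scaling mechanism a given product uses. As one illustration of scaling up and down with demand, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric). The function name, the replica bounds, and the utilization numbers are hypothetical, and the autoscaler's tolerance and stabilization behavior are left out.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Replica count from the documented HPA ratio, clamped to assumed bounds.
    Simplified: ignores the real autoscaler's tolerance and stabilization window."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# Heavy job arrives: 4 workers at 90% CPU against a 50% target -> scale up.
print(desired_replicas(4, 90, 50))   # 8
# Job finished: 8 workers at 10% CPU against a 50% target -> scale down.
print(desired_replicas(8, 10, 50))   # 2
```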
Missing out on intelligent algorithms
Disambiguating data from a data quality perspective, as well as de-duplicating and standardizing it, has gotten a lot easier with algorithmic techniques such as metaphone, fingerprint, n-gram, and other NLP (natural language processing) applications. Users refer to a common term in endless ways, such as the many variations of a word like ‘pharmaceutical’ or the various typos in a value like ‘Berkeley’. Regardless of that variation, these distributed, intelligent algorithms cluster and group data values, discovering permutations and misspellings of a single value that a human being simply could not imagine.
This approach not only discovers data quality issues, but also recommends what users should do to fix them. These techniques are paramount in progressive data preparation platforms. However, if they are applied to a small sample of data rather than the full data, their value is lost. Being limited to data sampling simply undermines the power and impact of these intelligent algorithms. Which raises the question: why settle for less data when you have smart algorithms that can separate signal from noise across all of your data?
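To show how this kind of clustering works, here is a minimal sketch of two key-collision techniques, a word-level fingerprint and a character n-gram fingerprint. It illustrates the general idea and is not the implementation of any particular product. The function names and sample values are hypothetical. Values that normalize to the same key land in the same cluster.

```python
import string
import unicodedata
from collections import defaultdict

def _ascii_lower(value: str) -> str:
    """Strip accents and lowercase the value."""
    folded = unicodedata.normalize("NFKD", value)
    return folded.encode("ascii", "ignore").decode("ascii").lower()

def fingerprint(value: str) -> str:
    """Word-level key: drop punctuation, then dedupe and sort the tokens."""
    text = _ascii_lower(value).translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Character-level key: the sorted set of n-grams of the alphanumeric chars."""
    chars = "".join(ch for ch in _ascii_lower(value) if ch.isalnum())
    grams = {chars[i:i + n] for i in range(len(chars) - n + 1)}
    return "".join(sorted(grams))

def cluster(values, key_fn):
    """Group values by key and keep only groups with more than one member."""
    groups = defaultdict(list)
    for v in values:
        groups[key_fn(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

values = ["Berkeley", "berkeley ", "BERKELEY,", "Berkley", "UC Berkeley", "Berkeley UC"]

print(cluster(values, fingerprint))
# [['Berkeley', 'berkeley ', 'BERKELEY,'], ['UC Berkeley', 'Berkeley UC']]

print(cluster(values, lambda v: ngram_fingerprint(v, n=1)))
# [['Berkeley', 'berkeley ', 'BERKELEY,', 'Berkley'], ['UC Berkeley', 'Berkeley UC']]
```

The word-level key absorbs case, spacing, punctuation, and word-order differences, while the 1-gram key also pulls in the ‘Berkley’ typo. Looser keys catch more variants but risk false merges, which is one reason such tools present clusters as recommendations for a user to confirm. A variant can only join a cluster if it is in the data being scanned, so running the same logic on a sample leaves unseen variants unresolved.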
To truly capitalize on the opportunity to leverage data to transform the business, companies need to arm all their staff, not just the technical subject matter experts and data scientists, to explore, clean, blend, and analyze data on their own. Unfortunately, to date, analyst-friendly applications have provided self-service capabilities only on data samples. This inherently limits self-service and shakes its foundation: to see the truth, the “whole” truth and nothing but the truth (in data), business users still rely on their technical and IT counterparts to apply their sample-based insights to the whole data, then assess and analyze the outcomes, only to discover the gaps and start the cycle again.
In a world where cloud computing has made scalability widely available and intelligent machine learning algorithms have made the discovery of data quality issues automated and easy, why should we be stuck in our old ways? Better to scale and prepare data in full instead of continuing to deal with old barriers. After all, it’s a whole new, big, smart data world!