Building a custom web analytics tool using Amazon Cloud
Web analytics tools have matured rapidly, moving beyond aggregate-level reporting on page views and bounce rates. In this article, Alexey Karavay talks about the appeal of building a custom solution with the AWS cloud.
The advent of a digital age has brought about ubiquitous customers who use myriad channels as part of their purchase process. Building out a consolidated view of these digital interactions in a cost-effective manner is one of the top priorities of senior managers and rightly so.
The implementation options for providing these advanced insights, however, are limited to a set of highly expensive enterprise tools such as Adobe Analytics (SiteCatalyst), IBM Customer Analytics (Coremetrics), WebTrends, and Google Analytics 360. While these tools do provide features to track cross-channel visitor behavior, the total cost of ownership of such solutions (software, hardware, implementation, consulting fees, etc.) is usually prohibitive for large-scale adoption.
In this article, software consultants from Itransition share their experience in building custom web analytics solutions with certain big data stack components from the Amazon Cloud. While this approach certainly involves a higher capital expenditure in terms of software development efforts, we are of the view that the long-term savings in costs and also the highly customized nature of implementations make this option a very promising choice for generating advanced, cross-channel customer intelligence.
The conceptual architecture
As with any software solution, it helps to break down the problem into conceptual building blocks of functionality that are tool-agnostic and will work with any platform (Amazon Cloud, Azure, VMware, Google App Engine, etc.). For our custom web analytics solution, the conceptual architecture consists of five building blocks:
#1. The pixel server
Tracking user activity using pixels is a standard practice in digital analytics. Web pages (and other tracked resources) typically contain a pixel tag, and when a browser loads the parent page, the pixel is also loaded and can create a trace of the information that was requested in this hit. If each such hit can be associated with a unique user id and date/time stamp, then it should be possible to aggregate hits at the visitor level.
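To make the idea concrete, here is a minimal Python sketch of how a hit could carry a user id and a timestamp in the pixel's query string. The host name, the `t.gif` file name, and the `uid`, `page`, and `ts` parameter names are illustrative assumptions, not part of any particular product.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical pixel location, used only for illustration.
PIXEL_URL = "https://pixels.example.com/t.gif"


def build_pixel_url(user_id, page, ts=None):
    """Return a pixel URL that carries the user id, page and UTC timestamp."""
    ts = ts or datetime.now(timezone.utc)
    params = {
        "uid": user_id,
        "page": page,
        "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return PIXEL_URL + "?" + urlencode(params)


def parse_hit(url):
    """Recover the tracked fields from a requested pixel URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in query.items()}
```

Because every hit records who requested the pixel and when, hits that share the same `uid` can later be grouped into visitor-level records.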
The pixels must be physically stored somewhere such as a server filesystem, a cluster of servers, or a content delivery network. The pixel server component specifies the physical location that pixels are served from. A well-designed pixel server must have the ability to serve a very large number of pixels with minimal latency, regardless of where the requesting user is located and also without slowing down the parent app.
#2. Data collection engine
Pixel servers are designed to serve static images quickly and typically cannot store large amounts of log data. For this reason, the data about pixel hits needs to be periodically flushed out to a more specialized data collection layer, which we refer to as the data collection engine. The data there is still in its raw form, exactly as it was logged on the pixel server, but the accumulated volume is much larger than what any pixel server holds at a given time.
#3. Transformer
The transformer performs two functions:
- Constantly fetches raw logs from the pixel server into the data collection engine.
- Performs the ETL to create the final user/session level datasets, which are loaded into the data storage engine. For this, the transformer implements all the business logic to sessionize the data (it defines the duration of a session and pulls together the records created within that window). It then rolls the sessions up further into user-level data (combining data from multiple sessions into a single user-level record).
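The sessionization and roll-up steps described above can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used in any specific project: it assumes hits arrive as `(user_id, timestamp, page)` tuples, and it closes a session after 30 minutes of inactivity, which is one common way to define a session window. The function names and output fields are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity timeout; the real value is a business decision.
SESSION_GAP = timedelta(minutes=30)


def sessionize(hits, gap=SESSION_GAP):
    """Group (user_id, timestamp, page) hits into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, ts, page in hits:
        by_user[user_id].append((ts, page))

    sessions = []
    for user_id, user_hits in by_user.items():
        user_hits.sort()  # order each user's hits by time
        current = None
        for ts, page in user_hits:
            # Start a new session on the first hit or after a long pause.
            if current is None or ts - current["end"] > gap:
                current = {"user_id": user_id, "start": ts, "end": ts, "pages": []}
                sessions.append(current)
            current["end"] = ts
            current["pages"].append(page)
    return sessions


def roll_up_users(sessions):
    """Combine session records into one summary record per user."""
    users = {}
    for session in sessions:
        user = users.setdefault(session["user_id"], {"sessions": 0, "page_views": 0})
        user["sessions"] += 1
        user["page_views"] += len(session["pages"])
    return users
```

At production volumes the same logic would normally run in a distributed batch tool rather than in a single Python process, but the shape of the transformation stays the same.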
#4. Data storage engine
The pixel servers provide hit-level data, which is periodically moved to the collection engine that is designed to store much bigger datasets. The transformer then transforms the raw data into user/session level datasets that all have a certain schema depending upon business requirements.
The data storage engine provides physical storage for the final, transformed data, which can be plugged into business intelligence engines or analytics applications.
#5. Client-side tracker
Using the Amazon Cloud Platform
Amazon Cloud provides almost plug-and-play tools for implementing each of the conceptual building blocks identified above. Let us see how.
Pixel server
Amazon CloudFront provides a content delivery network that can serve as the pixel server. Static pixels can be hosted on Amazon S3 and are automatically cached so that they are served from the edge location closest to the requesting browser. CloudFront can also be configured to write its access logs to Amazon S3, which removes the need to manually migrate raw hit data from the pixel server to the data collection engine.
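Once the access logs are on S3, each log line has to be turned back into a hit record. The Python sketch below assumes the W3C-style layout used by CloudFront standard logs: comment lines starting with `#`, a `#Fields:` header naming the columns, and tab-separated values with `-` marking an empty field. The sample fields and the `uid` and `page` query parameters are illustrative assumptions. Downloading and decompressing the log files is left out, and real logs may apply additional encoding to field values that this sketch does not handle.

```python
from urllib.parse import parse_qs


def parse_access_log(lines):
    """Yield one dict per hit from tab-separated, W3C-style log lines."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Column names are listed once in the header, separated by spaces.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other comment lines
        record = dict(zip(fields, line.split("\t")))
        query = record.get("cs-uri-query", "-")
        # "-" marks an empty query string.
        record["params"] = (
            {} if query == "-" else {k: v[0] for k, v in parse_qs(query).items()}
        )
        yield record
```

Reading the column names from the header rather than hard-coding positions keeps the parser working if the set of logged fields changes.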
Data collection engine
Amazon S3 is an AWS service that provides virtually unlimited capacity for storing raw text data. With Amazon S3, developers do not have to worry about running out of disk space to store raw logs. The pricing for this service is also attractive, which makes it a good fit even when logs reach petabyte scale and come from multiple corporate servers.
Transformer
The transformer component implements all the code to convert raw hit data into a format that can be consumed for business reporting and analysis. Apache Pig is a natural technology choice for this if the underlying storage engine uses HDFS. Other options include Talend for Big Data, Pentaho Kettle, and Informatica, all of which are capable of performing complex batch transformations on large datasets.
Data storage engine
This will store both the raw data that will be processed by the Transformer and the final transformer output that will be used by end users. Possible implementation options could be Hive (part of Amazon EMR), Amazon Redshift, Amazon DynamoDB, or even just plain Amazon RDS (running MySQL or some other RDBMS). The choice would depend entirely on how the information needs to be processed.
For example, for largely static reporting needs, companies might consider using Redshift as the data sink. For highly interactive, exploratory data analysis (such as in embedded BI), it might be better to use Amazon RDS. Similarly, when there is a lot of variation in the kind of metadata that is tracked, it might make sense to use something like DynamoDB.
The appeal of building a custom solution with the AWS cloud lies largely in the fact that all the components above can be up and running with almost zero up-front infrastructure investment. Immediate access to virtually unlimited storage and processing power and, most importantly, a significantly lower total cost of ownership are further value propositions that should be objectively considered when making build vs. buy decisions about advanced digital analytics.