Building a data platform on Google Cloud Platform
At the moment, big data is very popular and there is a wide variety of products available for handling data. In this article, read a case study about a German startup tackled their data problems and built a common data platform into their architecture. The data platform consists of four components: Ingestion, storage, process, and provisioning.
Joblift is a Germany-based startup. Its core business is a meta-search engine for job seekers in Germany, the Netherlands, France, the UK and since 2018, the US. We aggregate jobs from various job boards, agencies and companies. Our aim is to provide job seekers a single page listing all jobs instead of having to browse multiple websites.
Since its beginnings Joblift focused on the so-called arbitrage business, i.e., buying traffic via various marketing channels, and forwarding it to its own customers. The profit is made from the difference in the price. As this is a successfully running business right now, the time has come to take the platform to the next level.
Our vision is to become the job board of the future. Instead of helping our users search for a job, we want to recommend them their perfect match. The foundation for this new line of business is data.
In order to be able to provide the perfect matching job to a user, we need to gain as much knowledge and insights about them as possible. This means every move of a user on our platform needs to be tracked, analyzed, put into context and modeled into information.
However, especially for new users, we do not have much insight yet. Thus, we want to predict their behaviour based on other users showing a similar user journey. In our case, this could be for example another user simply searching for the same job title. Or, in a more complex scenario, we could also consider users attracted via the same marketing campaigns.
To solve this, we want to put data from different users in relation. One possibility to do this, is using machine learning algorithms. They can be used to predict possible relations with a certain level of confidence. Together with facts we already know about a user (e.g., “User clicked on Job XY” or “Users having clicked on Job XY also clicked on Job AB”), we can store those predictions in a knowledge graph (“User is likely to click on Job AB”).
The foundation of all this is big and fast data. The arising challenge is consequently how to ingest, store, process and provide the accumulated data.
To tackle this challenge, we decided to build a common data platform into our architecture. The data platform is a set of infrastructure components, tools and processes with the sole aim of collecting data, storing it, and providing it to all possible use cases in an efficient way.
As Joblift’s existing architecture is deployed on the Google Cloud Platform (GCP), we decided to stay with GCP also for our data platform. Google offers a wide range of managed products, and wherever fitting, we stuck to those managed products. For us, the big advantage is, that once we decide to get started with a technology, we do not need to invest much time setting up the infrastructure. The drawback is that we might not have all the flexibility we have with self-managed products and have a vendor lock-in. However, in our case, velocity clearly wins over flexibility.
In those cases where we decided not to use a managed product, we deployed the respective components directly to our Kubernetes cluster, also running on GCP.
We aimed to clearly separate the data platform, and the use cases using it. Our Data Platform consists of 4 separate components which we will outline in more detail below, namely:
In a first step. we need to get the data into the platform. We have a wide range of different data sources. They provide either continuous data (e.g., in the form of transactional events) or batched data (e.g., daily job updates from various job boards). Looking at the GCP managed products, the obvious choice for the ingestion component would have been Cloud Pub/Sub. This is GCP’s managed solution for message and event ingestion.
However, on Pub/Sub it was at the time not possible to replay or re-read messages, which is a crucial feature for us (that changed in the meantime). Also Pub/Sub does not guarantee a certain ordering. For those reasons, and because we had a fully setup Kafka cluster on Kubernetes in GCP already, we chose Kafka for ingesting streaming data.
By setting the ley properly, we have a guaranteed order on the partitions and because of the retention are able to replay messages. Additionally, we can have several consumers for the same data, without having to do fan-outs as on Pub/sub (which increases costs). If at some point we would need the additional features Cloud Pub/Sub provides, we can take advantage of the available connectors to connect Kafka to Cloud Pub/Sub via Kafka Connect.
For getting batches of data into our system we mainly use ftp or http.
Once the data is in the system we want to store it in its raw format. We opted here for the concept of a two way storage architecture. On one side we have a long term storage, referred to as cold storage. Data on the cold storage has an unlimited retention with slower access times. On the other hand, there is the short term storage, or hot storage, with limited retention but faster access times.
For the hot storage, we opted for Cloud Bigtable, which is GCP’s managed NoSQL wide-column database service, using the HBase API. The drawback of Bigtable/HBase, compared to Cassandra, for example, is that there is no possibility for a secondary index. We can live with this drawback. In case we need to have different row keys, we duplicate our data with appropriate row keys. This also guarantees us fast access.
For the cold storage, we simply use Google Cloud Storage (GCS), which is designed for durable storage.
For the data processing, that happens inside our data platform, we have various approaches.
For streaming data, we mostly rely on Java microservices using Kafka streaming technology as in and output. This gives us a maximum of flexibility, paired with high scalability and low complexity. Using gitlab CI/CD pipelines, we build docker images, and deploy them to our GCP hosted Kubernetes cluster using Helm charts. Adhering to the microservice principle and using asynchronous communication via Kafka, we can ensure a high level of continuous integration and deployment. New requirements can thus be implemented with an appropriate speed.
For modeling our more advanced ETL data pipelines, with opted for Apache Airflow, respectively the GCP managed version of it, Cloud Composer. It allows us to define and schedule pipelines in the form of a direct acyclic graph (DAG). One of the main advantages of using Cloud Composer is that it neatly integrates with other GCP projects, such as GCS. The DAG defined in Python can simply be uploaded to GCS and then they are automatically picked up by Cloud Composer’s scheduling component.
Gathering all that data is well and good, but it is worthless if we are not able to provide our use cases with the right data in an efficient way. Consequently, the aim of our data platform is not only to gather and store data, but also to provide the correct data in the right way.
For providing streaming access to our data, we opted again for Kafka. Combined with Avro and the kafka-schema registry, this allows us to have a clearly defined interface contract including proper versioning.
For batch access, the provisioning depends on the use case and is optimized for the respective use-case.
Having all this data available and being able to provide it, we now want to use it to fulfill our business needs. In the following sections, we outline several of our use-cases.
Before even tackling our new vision, one of the first use cases for our data is analytics. To provide our business analysts with an easy way of creating business reports on one side, and ad-hoc analysis on the other side, we chose to build an analytics data warehouse. In the landingzone of our data warehouse, we store relevant raw and pre-aggregated data, which can then be used for reporting and analysis.
We opted for Google BigQuery for our data warehouse. It is fully managed, and can be queried using simple SQL. Additionally, with its BI Engine it comes with all that is needed to build reports and dashboards. On top of BigQuery, we use Google Data Studio to create interactive dashboards to support our business and salespeople in their daily decision making.
In view of our vision, our goal is to provide the best matching job to every user. The foundation of Joblift is a search engine. Users can search for jobs. The current flow is a user coming to our page and searching either for any job in a given city (location search) or for a specific job (expert search). The executed search that is based on the keywords entered in the search mask. Unfortunately, the results provided by such a simple search often lack some relevance. Thus we want to enhance the results in order to provide the best matching job. One way to do so is via a knowledge graph. A knowledge graph is a way of representing data in relation. It quickly allows us to extract connections between different users or jobs (e.g., “users that applied for this job, also applied to jobs XYZ”).
The technical foundation for our knowledge graph is a graph database. When this article was written, GCP did not offer any managed graph database. We thus opted for a self-managed JanusGraph deployed to GCP, as it nicely integrates with Google BigTable as backend. For the indexes we use Elasticsearch.
To fill our knowledge graph not only with facts but also with probable relations (“user is likely to click on job XY”) we want to make use of machine learning algorithms. We want to predict their journey based on the knowledge we have from other users. Another use for machine learning algorithms is clustering and categorization of our jobs. Machine learning algorithms need data, ideally lots of historic data, and this data is provided via the cold storage previously described.
We have successfully laid the groundwork to operate a big and fast data platform, and while it clearly showed its dependability and the proof of concept was successful we also had some key learnings.
Big data being very popular at the moment, there is a wide range of open- and closed source, managed and unmanaged products available out there. Having such an overwhelming amount of choices, we learned that sometimes, you just have to take a decision on which product to use. After some time of evaluation, we decided that for the sake of velocity, we would use wherever possible GCP managed products. Those products are optimized for a dedicated purpose and fulfill that purpose with bravery. Some of them are simply the managed version of popular open source products, such as for example Cloud Composer being the managed version of Apache Airflow. Also still being a startup, it is quite some relief not to have to the additional effort operating a self-managed component.
Right with that decision came the second learning, managed products often feel like a blackbox. While for self-deployed services you have all the liberty of configuration and full insights, a managed product usually comes with a limited set of configurable parameters.
For example, with Cloud Composer we had the problem that its logs get fed into stackdriver (GCP’s monitoring solution). For our existing microservices, however, we use the complete ELK stack for storing log files, and Prometheus and Grafana for monitoring and alerting. While stackdriver integrates with Grafana, it just doesn’t run as smooth as our existing setup. Especially when analyzing issues, the stackdriver setup proves less flexible for us.
Finally, additional learning is to fail fast. Documentation is often sparse, and every use case different. All theoretical knowledge in spite, it can still be that your chosen solution just doesn’t work out at the end, for reasons you simply weren’t aware of or didn’t consider.
Thus we opted for an agile development process starting with building prototypes quickly to prove feasibility (or the contrary). When something doesn’t work out, don’t with all means try to stick to the initial idea, just because on paper it reads great. Learn from what you did and move on with an improved solution tackling the encountered issues. Even if sometimes this feels like throwing 2 weeks work away, in the end, this shows to be far more efficient and in the end, there is a stable and dependable solution.
Conclusion and outlook
The combination of managed and self-deployed services so far showed to be a good choice for our data platform. Having proven its dependability with the first set of use cases, the next steps are now implementing further use cases. Additionally, we would like to set up metadata management, to get a full data lineage of our data and gain even more insights. Here we could possibly imagine another one of Google’s products to come into play, namely Data Catalog.