Big commercial websites breathe data: they create a lot of it very fast, but they also need feedback based on that very same data in order to become better and better. Logging pros Stefan Schadwinkel and Mike Lohmann give an overview of how to make your business data-driven with "Log Everything".
Every time we asked a co-worker or peer for the logging requirements of a project, the answer we received was never satisfactory. For example, this is one conversation we overheard recently: the technical lead in a company asked his colleague what logging requirements were needed for the task he was about to start working on. To the anger of the technical lead, his colleague had no idea! After a few minutes of cooling down and thinking, he realized that this is probably a common problem, and decided to work with his colleague to hunt down these requirements.
We could argue that the technical lead wasn't specific enough with his question: what exactly is the purpose of logging requirements, and why should his colleague have them ready? What is the purpose of having logging requirements for each task? Based on our own experiences with a big company, the product department knows that the more logging data exists, the better. This is a good place to start when putting together your requirements. Unfortunately, when big companies launch a new product it is not really possible to say which data will be of interest in the future. Therefore, instead of going through every user and their behavior in the database, the logging requirement is easy… Log Everything!
When working in the tech space, it is well known that you have to log events on different servers and applications to explain user behavior on the web applications you are serving. Build an abstraction for the log events and store them multi-dimensionally, as this allows you to create various views of the collected data. If it is done correctly, it is really fulfilling for a technician, and you will find they will keep coming up with innovative ways to make this process easier. The main challenge is to make sure that state-of-the-art technologies are used in all areas of the logging process: creation, transport, transformation, storage and analysis. Don't forget to give a warm thank you to the analytics maniacs, who are the backbone of all critical business decisions made involving C-level employees.
Another challenge is to understand the scope of the project you have been asked to work on: will you need to log everything? Think of the project as a framework; it should be used to build up sites that share the same underlying business concept but target different topics (verticals), like trading and poker. That means that all framework functionalities can be changed to fulfill the requirements of each vertical.
In the project example in this book, all verticals will use content. Content describes different aspects of the project's content management system, like articles, news, lessons, images, videos, comments and posts. These aspects are the different content types the framework provides. When it comes to images and videos, this does not mean the assets themselves but a representation of them in the application, with added metadata such as author, size, quality and length. All of these content types are logged when they are created, moved, edited or deleted.
User activities like page hits, registrations and logins are tracked. Information about the status of the application, as well as errors, also has to be prepared and sent to a location which acts as a single entry point for debugging and error reporting. That means we need a solution that is able to send data to different targets, encoding it to each target's data structure.
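The idea of one record fanning out to several targets, each with its own encoding, can be sketched as follows. This is an illustrative Python sketch, not the authors' actual PHP code; the class and sink names are hypothetical.

```python
import json

class LogDispatcher:
    """Fans a single log record out to several targets,
    encoding the record to each target's data structure."""

    def __init__(self):
        self.targets = []  # list of (encoder, writer) pairs

    def add_target(self, encoder, writer):
        self.targets.append((encoder, writer))

    def dispatch(self, record):
        for encode, write in self.targets:
            write(encode(record))

# Example: one target wants JSON lines, another a tab-separated layout.
debug_sink, analytics_sink = [], []
dispatcher = LogDispatcher()
dispatcher.add_target(json.dumps, debug_sink.append)
dispatcher.add_target(
    lambda r: "\t".join(str(r[k]) for k in sorted(r)),
    analytics_sink.append,
)

dispatcher.dispatch({"event": "customer.register", "uid": 42})
```

The point of the design is that the application emits a record exactly once; adding a new destination means registering one more encoder/writer pair, not touching the emitting code.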
The next challenge is to merge and aggregate all the logged data produced by different systems and to place it into different domains, so we can generate useful analytics. To achieve this you'll need to know which data belongs together within a user's request. Let's introduce the PulseID: it is generated at the beginning of the request and then transported across system boundaries until it finds its way into all the logs generated along the way. The pulse makes it easy to find which logs belong to which request.
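The mechanism can be shown in a few lines. The following is a minimal Python sketch under our own assumptions (function names and the header name are hypothetical): a pulse ID is generated once per incoming request and stamped onto every log entry created while handling it, even when the request crosses into another component (for example by forwarding it in an HTTP header).

```python
import uuid

def new_pulse_id():
    """Generate the request-scoped correlation ID once, at request start."""
    return uuid.uuid4().hex

def log(pulse_id, message, entries):
    """Every log entry carries the pulse of the request it belongs to."""
    entries.append({"pulse": pulse_id, "message": message})

entries = []
pulse = new_pulse_id()                 # created at the beginning of the request
log(pulse, "request received", entries)
log(pulse, "user stored", entries)     # later, possibly in another component

# All entries of one request share the same pulse and can be grouped by it.
assert all(e["pulse"] == pulse for e in entries)
```

Grouping logs later is then just a filter on the `pulse` field, regardless of which system wrote the entry.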
Let's start a project
“Log everything!” started as a subproject of a web project that aimed to create a framework for education communities. The framework was mainly written in PHP and used the Symfony2 framework as a base for things like routing, dependency injection, security features, caching features and logging. We needed to log three different types of messages within the education community framework (ECF). The first message type contained the user interaction on the page; for example, page hits, login attempts, registrations or comments left by a user all needed to be tracked. The second message type was needed for debugging purposes: a developer needs information about the application's insights to be able to debug it successfully.
The third message type was the error message: it was similar to the debugging message but was logged only if something was going wrong within the application. We also had load balancers, reverse proxies, different kinds of databases, HTTP servers, user data caches (like Redis), queues (like RabbitMQ) and the server logs themselves. These log entries were interesting to different parties, meaning that they are also part of the "log everything!" requirement and thus have to be logged somewhere too.
All of these log data streams needed to be collected and sent to the correct destinations. For debugging purposes on the live system, we needed the debug messages in a system which could be easily accessed by different groups and individuals with various access levels. We also needed to track log entries to see who had accessed the system; these had to be stored on a system which could handle a big amount of data and run analyses on it.
When we first approached this project it was okay to just store the produced data in a remote filesystem, but after a while we decided that we needed a solution which was able to translate the log data into different formats. Additionally, it needed a standardized message envelope with metadata information for aggregation and analytics. This envelope contained fields which helped the tools involved in the logging process to sort data, and eventually helped the different groups of developers, business analysts and C-levels to group and aggregate the logged data to answer any questions which might arise. Here are the different fields:
• origin_type: Defines which component emitted the data record.
• origin_host: The name of the server the data record was created on.
• created_timestamp: The timestamp of the system the data record was created on.
• created_date: The date and time of the system the data record was created on.
• envelope_version: The version of the envelope.
• message_loglevel_value: The log level as an integer value.
• message_loglevel: The log level as a string.
• pulse: The pulse ID of a request, which allows grouping of the log entries belonging to one request.
• origin_service_type: The Community Application which is called.
• origin_service_component: The system group that emitted the log message; for example, database, UserStore or Community Application.
• event_handle: The identifier of an event; for example, customer.register.
• event_version: The version of the event_body for the event_handle.
• event_body: This field contains a certain message like a UID or the number of login attempts.
In case an event was triggered by a user's action (like a login, the creation of content or commenting on content), the data fields were filled with specific information about the event itself. However, if the log message was triggered by a framework event (like a thrown exception, debug or information events), the fields were filled with defaults. Only the event_body field contained the important information about the event (like an exception message) which was logged.
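To make the two cases concrete, here is a hedged Python sketch of building such an envelope. The helper name, default values, version strings and level numbers are our own illustrative assumptions, not the project's actual implementation; only the field names follow the list above.

```python
import time

ENVELOPE_VERSION = "1.0"  # assumed version string

def make_envelope(origin_type, origin_host, level, pulse,
                  service_type, service_component,
                  event_handle="framework.event", event_version="1",
                  event_body=None):
    """User-triggered events override the event_* defaults with specifics;
    framework events keep the defaults and carry their payload
    (e.g. an exception message) in event_body only."""
    now = time.time()
    return {
        "origin_type": origin_type,
        "origin_host": origin_host,
        "created_timestamp": int(now),
        "created_date": time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now)),
        "envelope_version": ENVELOPE_VERSION,
        "message_loglevel_value": level[0],  # integer log level
        "message_loglevel": level[1],        # log level as a string
        "pulse": pulse,
        "origin_service_type": service_type,
        "origin_service_component": service_component,
        "event_handle": event_handle,
        "event_version": event_version,
        "event_body": event_body or {},
    }

# A user action fills the event fields with specifics ...
registration = make_envelope(
    "app", "web01", (200, "INFO"), "abc123",
    "CommunityApplication", "UserStore",
    event_handle="customer.register", event_version="2",
    event_body={"uid": 42},
)

# ... while a framework event keeps the defaults.
error = make_envelope(
    "app", "web01", (400, "ERROR"), "abc123",
    "CommunityApplication", "Database",
    event_body={"exception": "ConnectionTimeout"},
)
```

Both envelopes share the same pulse, so a later analysis can still tie the registration attempt and the database error back to the same request.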
About the authors
Mike Lohmann started working with web technologies as a junior system administrator in 1998. In 2004 he moved to Lycos Europe to become a base PHP programmer whilst studying informatics in Bielefeld. Mike has also worked as a freelance developer for various companies like Dannemann and IBM. His next full-time position was at Gruner + Jahr in Hamburg, where he held the position of Software Architect until 2011. Mike currently works for ICANS GmbH as a Software Architect. In his free time Mike writes articles for Heise.de, iX and PHPMagazin and speaks at conferences worldwide.
Stefan Schadwinkel currently works as Senior Analytics Engineer at ICANS GmbH, a Germany-based IT company developing and operating Online Community websites. Stefan’s main interests revolve around sophisticated data analysis techniques and machine learning. Having majored in artificial intelligence during his computer science studies, he later on received his PhD in human sciences for MRI studies on human auditory cognition. Currently, these interests are nourished in a diverse set of fields: developing user-based metrics to enhance user experience, fostering online fraud prevention, simulating financial trading activities within different markets, and developing backend processing tools in Erlang.