Want to transform your business into a data-driven powerhouse? Of course you do. Stefan Schadwinkel and Mike Lohmann show you how.
Big commercial websites breathe data: they create a lot of
it very fast, but also need feedback based on that very same
data to become better and better. Logging pros Stefan Schadwinkel
and Mike Lohmann give an overview of how to make your business
data-driven with “Log Everything!”
Every time we asked a co-worker or peer for the
logging requirements of a project, the answer we received was
never satisfactory. For example, here is one conversation we
overheard recently: the technical lead in a company asked his
colleague what logging requirements were needed for the task he was
about to start working on. To the anger of the technical lead, his
colleague had no idea! After a few minutes of cooling down and
thinking, he realized that this is probably a common problem and
decided to work with his colleague to hunt down these logging requirements.
We could argue that the technical lead wasn’t specific
enough with his question: what exactly is the purpose of logging
requirements, and why should his colleague have them ready for each
task? Based on our own experience in a big company, the product
department knows that the more logging data exists, the better.
This is a good place to start when putting together your
requirements. Unfortunately, when big companies launch a new
product, it is not really possible to say which data will be of
interest in the future. Therefore, instead of going through every
user and their behavior in the database, the logging requirement is
easy… Log Everything!
When working in the tech space, it is well known that
you have to log events on the different servers and applications to
explain user behavior on the web applications we are serving. We
build an abstraction for the log events and store them
multi-dimensionally, as this allows us to create various views of
the collected data. If it is done correctly, it is really
fulfilling for a technician, and you will find they keep coming up
with innovative ways to make this process easier. The main
challenge is to make sure that state-of-the-art technologies are
used in all areas of the logging process: creation, transport,
transformation, storage, and analysis. Don’t forget to give a warm
thank-you to the analytics maniacs, who are the backbone of all
critical business decisions made involving the logged data.
Another challenge is to understand the scope of the
project you have been asked to work on: will you need to log
everything? Think of the project as a framework; the framework is
used to build up sites that share the same underlying business
concept but address different target topics (verticals) like
trading and poker. That means that all framework functionalities
can be changed to fulfill the requirements of each vertical base.
In the project example in this book, all vertical
bases will use content. Content describes different aspects of the
project’s content management system, like articles, news, lessons,
images, videos, comments, and posts. These aspects are the different
content types the framework provides. When it comes to images and
videos, this does not mean the assets themselves but their
representation in the application, with added metadata such as
author, size, quality, and length. All of these content types are
logged when they are created, moved, edited, or deleted.
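To make the lifecycle rule above concrete, it can be sketched roughly like this. Note that the type and action names here are illustrative assumptions, not the framework’s actual identifiers:

```python
# Hypothetical sketch of lifecycle logging for the framework's content types.
# Names are illustrative, not the ECF's actual identifiers.
CONTENT_TYPES = {"article", "news", "lesson", "image", "video", "comment", "post"}
LIFECYCLE_ACTIONS = {"created", "moved", "edited", "deleted"}

def content_event(content_type, action, metadata):
    """Build a log record for one lifecycle action on one content type.

    For images and videos, `metadata` describes the representation of
    the asset in the application (author, size, quality, length), not
    the asset itself.
    """
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"unknown content type: {content_type}")
    if action not in LIFECYCLE_ACTIONS:
        raise ValueError(f"unknown lifecycle action: {action}")
    return {"content_type": content_type, "action": action, **metadata}

event = content_event("video", "created", {"author": "jdoe", "length_seconds": 120})
```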
User activities like page hits, registrations, and
logins are tracked. Information about the status of the
application or its errors also has to be prepared and sent to a
location that acts as a single entry point for debugging and error
reporting. That means we need a solution that is able to send
data to different targets and encode it to each target’s data
structure.
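As a rough sketch of what such a fan-out might look like (the target names and output formats below are made up for illustration; the actual tooling is not shown in this excerpt):

```python
import json

def encode_for_target(record, target):
    """Encode one log record into a target-specific structure.

    The two targets are hypothetical examples: a JSON document store
    and a plain-text metrics line ("component.event count").
    """
    if target == "json_store":
        return json.dumps(record, sort_keys=True)
    if target == "metrics_line":
        return f"{record['component']}.{record['event']} 1"
    raise ValueError(f"unknown target: {target}")

record = {"component": "auth", "event": "login"}
encoded = {t: encode_for_target(record, t) for t in ("json_store", "metrics_line")}
```

The point of the indirection is that a single log record can be emitted once and delivered to every interested destination in its own format.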
The next challenge is to merge and aggregate all the
logged data produced by the different systems, and to place it into
different domains so we can generate useful analytics. To achieve
this, you’ll need to know which data belongs together in a user’s
request. Let’s introduce the PulseID: it is generated at the
beginning of the request and is then transported across system
boundaries until it finds its way into all the logs generated along
the way. The pulse makes it easy to find which logs belong to which
request.
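The idea can be sketched in a few lines. The field names and the choice of a UUID are our illustration; the article does not specify how the PulseID is generated:

```python
import json
import uuid

def start_request():
    """Generate a PulseID once, at the beginning of a request."""
    return uuid.uuid4().hex

def log_event(sink, pulse_id, component, message):
    """Every log line carries the PulseID, so entries emitted by
    different systems can later be grouped per request."""
    sink.append(json.dumps({"pulse": pulse_id, "component": component,
                            "message": message}))

log_sink = []
pulse = start_request()
log_event(log_sink, pulse, "frontend", "page hit")
log_event(log_sink, pulse, "database", "query executed")

# Later, at analysis time: collect all entries belonging to one request.
per_request = [json.loads(line) for line in log_sink
               if json.loads(line)["pulse"] == pulse]
```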
Let’s start a project
“Log everything!” started as a subproject of a web
project that aimed to create a framework for education communities.
The framework was mainly written in PHP and used the Symfony2
framework as a base for things like routing, dependency injection,
security features, caching features, and logging. We needed to log
three different types of messages within the education community
framework (ECF). The first message type contained the user
interaction on the page: for example, page hits, login attempts,
registrations, or comments left by a user all needed to be tracked.
The second message type was needed for debugging purposes: a
developer needs insight into the application to be able to debug it
successfully.
The third message type was the error message: it was
similar to the debugging message but was logged only if something
went wrong within the application. We also had load balancers,
reverse proxies, different kinds of databases, HTTP servers, user
data caches (like Redis), queues (like RabbitMQ), and the server
logs themselves. These log entries were interesting from different
perspectives, meaning that they are also part of the “log
everything!” requirement and thus have to be logged somewhere too.
All of these log data streams needed to be collected
and sent to the correct destinations. For debugging purposes on
the live system, we needed the debug messages in a system that
could be easily accessed by different groups and individuals with
various access levels. We needed to track log entries to see who
had accessed the system; this had to be stored on a system that
could handle a large amount of data and run analyses on it.
When we first approached this project, it was okay to just
store the produced data in a remote filesystem, but after a while
we decided that we needed a solution that was able to translate
the log data into different formats. Additionally, it needed a
standardized message envelope with metadata information for
aggregation and analytics. This envelope contained fields that
helped the tools involved in the logging process to sort data,
which eventually helped the different groups of developers,
business analysts, and C-levels to group and aggregate the logged
data to answer any questions that might arise. Here are the
different envelope fields:
•• origin_type: Defines which component emitted the data record.
•• origin_host: The name of the server the data record was created on.
•• created_timestamp: The timestamp of the system the data record was created on.
•• created_date: The date and time of the system the data record was created on.
•• envelope_version: The version of the envelope.
•• message_log_level_value: The log level as an integer.
•• message_log_level: The log level as a string.
•• pulse: The pulse ID of a request, which allows grouping the log entries belonging to one request.
•• origin_service_type: The community application which is called.
•• origin_service_component: The system group that emitted the log message, for example database, UserStore, or Community.
•• event_handle: The identifier of an event.
•• event_version: The version of the event_body for a given event_handle.
•• event_body: This field contains the actual message, like a UID or the number of login attempts.
In case an event was triggered by a user’s action
(like logging in, creating content, or commenting on content), the
data fields were filled with specific information about the event
itself. However, if the log message was triggered by a framework
event (like a thrown exception, or a debug or information event),
the fields were filled with defaults. Only the event_body field
contained the important information about the event (like an
exception message).
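Putting the envelope fields together, a minimal sketch might look like this. Concrete defaults such as the version string and the log-level numbers are our assumptions, not the authors’ production values:

```python
import socket
import time
from datetime import datetime, timezone

ENVELOPE_VERSION = "1.0"  # assumed version string, for illustration only

def make_envelope(origin_type, service_type, component, pulse,
                  event_handle, event_version, event_body,
                  log_level="INFO", log_level_value=200):
    """Wrap an event in the standardized message envelope.

    Field names follow the list above; for framework events the caller
    passes defaults and puts the details into event_body.
    """
    now = time.time()
    return {
        "origin_type": origin_type,
        "origin_host": socket.gethostname(),
        "created_timestamp": int(now),
        "created_date": datetime.fromtimestamp(now, tz=timezone.utc).isoformat(),
        "envelope_version": ENVELOPE_VERSION,
        "message_log_level_value": log_level_value,
        "message_log_level": log_level,
        "pulse": pulse,
        "origin_service_type": service_type,
        "origin_service_component": component,
        "event_handle": event_handle,
        "event_version": event_version,
        "event_body": event_body,
    }

# A user-triggered event carries specifics in its fields ...
login = make_envelope("application", "community", "UserStore", "abc123",
                      "user.login", 1, {"uid": 42, "login_attempts": 1})

# ... while a framework event uses defaults and an exception message.
error = make_envelope("application", "community", "framework", "abc123",
                      "framework.exception", 1,
                      {"message": "Database connection refused"},
                      log_level="ERROR", log_level_value=400)
```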
About the authors
Mike Lohmann started working with web technologies
as a junior system administrator in 1998. In 2004 he moved to Lycos
Europe to become a PHP programmer whilst studying
“informatics” in Bielefeld. Mike has also worked as a freelance
developer for various companies like Dannemann and IBM. His next
full-time position was at Gruner + Jahr in Hamburg, where he held
the position of Software Architect until 2011. Mike currently works
for ICANS GmbH as a Software Architect. In his free time, Mike
writes articles for Heise.de, iX, and PHPMagazin and speaks at
conferences.
Stefan Schadwinkel currently works as a Senior Analytics
Engineer at ICANS GmbH, a Germany-based IT company developing and
operating online community websites. Stefan’s main interests
revolve around sophisticated data analysis techniques and machine
learning. Having majored in artificial intelligence during his
computer science studies, he later received his PhD in human
sciences for MRI studies on human auditory cognition. Currently,
these interests are nourished in a diverse set of fields:
developing user-based metrics to enhance user experience, fostering
online fraud prevention, simulating financial trading activities
within different markets, and developing backend processing tools.