Data Modelling – 102
In the second installment of Cory Isaacson’s crash-course guide to Data Modelling, we learn how to design a data model good enough to last forever.
In my last article, I talked about the importance of Data Modeling, and the fundamentals of how to build a workable model for your database structure. The article covered the basic components of any Data Model: Entities, Attributes and Relationships. These are the vital components necessary to a well-defined database infrastructure, and the key to maintaining the data elements needed by your application. This brings the idea that all data is relational into real world practice.
In this article, I’m going to walk through building a more complex model, and introduce the concept of database normalization and de-normalization. These concepts are vital for designing a database that will perform as you expect. Further, when considering a Big Data database, you will see how normalization and de-normalization are key components for defining a truly scalable database architecture (the topic of many future articles).
Introducing the Angry Shards Game Database
For a data modeling exercise to be useful at all, we need a real application to model. Over the years I have used all kinds of examples (most of them very boring unfortunately). To make this more interesting and relevant to today’s applications, we’ll look at a game database, specifically one which powers a game called Angry Shards. Angry Shards is a simple game database we use at CodeFutures to demonstrate how database scalability works (admittedly with a bit of fun in the mix).
SEE ALSO: Data Modelling – 101
A game database is an ideal example, because today’s online and social games are high-volume, database intensive, and typically require all common features of a Big Data database. While Angry Shards probably won’t be the next blockbuster in social gaming, it does show the important parts of most data models.
Building the Model
Let’s walk through building the model on a step-by-step basis. I am using a data modeling tool for the design, and at each step I will show the structure of the model to that point.
The first step in any data model is to discover entities. Since we are designing a game, there are two obvious entities we must have:
- Game: A Game instance refers to an individual game being played. Even though our initial game will have just a single Game (Angry Shards), the model is extensible to many games later if desired.
- Player: A game won’t get anywhere without someone to play it, thus the Player entity is vital. Each Player is a unique user that plays the game.
It’s obvious that without a game and a player, there is no game application – thus we used the data modeling technique of discovering the entities. As you walk through any application analysis, it’s generally very straightforward to discover the entities you need. With just these two entities, Figure 1 shows what our model looks like. This is obviously not very useful yet, so let’s add a bit more to our first entities. One important part of the definition of an entity is that it must be uniquely identifiable. This is done by adding a primary key to the entity, i.e., the attribute that uniquely identifies an instance of that entity. There are two ways to do this:
- Natural Key: Some developers like to use an actual data attribute that naturally identifies an instance of an entity. Examples would including a Social Security number, Facebook ID, or other value that is guaranteed to be unique for any instance.
- Sequential Key: More often a sequential key is used, normally supported by the DBMS engine (e.g. the Auto Increment function in MySQL). This is sometimes called an artificial key, in that it is not an actual piece of data about an entity.
I almost always opt for a Sequential Key for all entities, it keeps things consistent and easy to work with across a team of developers. Further, single attribute numeric Sequential Key values are more efficient for indexing and retrieval. Figure 2 shows our model with the primary key attributes.
Next let’s discover some attributes, the actual data elements we will need to store about our entities. In this case, we discover several attributes, including a name for a game, an email, and a screen_name for a player (Figure 3).
OK, now we are closer to a useful model. The next step is to discover relationships. In our game, a player must play a game, therefore that makes a one-to-many relationship between game and player. Figure 4 shows one possible approach.
With this relationship, a player can play a single game. This is limiting, but there is a much bigger problem with this model choice: the player can only play a single instance of a game. In other words, the model restricts player to playing a game only one time (or recording only the last game the player played). This violates two important requirements of our Angry Shards data model:
- In the future, we may want to support another game with the database, other than Angry Shards in the same database.
- It is certainly desirable to have a player play the game many times, not just once, and it’s important to record each instance (time) the player plays the game. I don’t know of any game that would succeed if it only allowed each player to play only once!
To resolve this leads us into the subject of database normalization.
The subject of database normalization is critical when building a data model, so that you end up with a model that is useful, extensible, and efficient. This is a vital skill for the developer, and with practice it can become natural – something that you do throughout the database modeling process. The English definition of the word is: Normalize: to make, conform to or reduce to a norm or standard.
The concept of database normalization was originated by Dr. Codd, the founder of relational databases. The purpose of database normalization is to ensure your model is organized, easy to understand, is flexible for future requirements, and will perform as desired. Further, it ensures that there is no redundant data in the model, and that all attributes are properly associated with the entities they most logically belong to.
With a properly normalized data model, it can serve as the basis of an application for its entire life.
A model to last
This has been very true throughout my career – in fact, surprisingly so. I remember well the first large data model I was asked to design, using these exact principles. At that time, relational databases were new, and there were no data modeling tools available. I drew the model by hand with pencil and a geometry stencil on a large piece of drawing paper. It took me a couple of days to complete (and a significant amount of erasing), but I ended up with a model of about 70 tables. We put the model on the wall to guide the team through a 2 year development process.
How well did it work out? That system is still in use today, and has been extended in 100s of ways by many developers– but the core data model remains, and continues to meet the needs or the organization many years later. I relate this story to underscore just how workable these simple principles are, and how a proper data model can bring a rational, lasting approach to virtually any application.
In my previous example, I purposely introduced a bug into the model, restricting a player to a single instance of a game. The problem is that the relationship between player and game is too limiting, and does not meet the requirements of the game database. To resolve it, we can add a new entity, player_game, allowing the player to play more than one game, and also to play each game as many times as desired. Figure 5 shows how the model looks with this new entity.
This is much better, now we have our new player_game entity, with 2 one-to-many relationships (one from game, the other from player). You can also see some new attributes, to record the status_code (whether the game is in-process, abandoned or complete), the game_start_time, game_end_time, and the player_score for the game.
One other point that is important to note in the model. The data modeling tool shows the primary key (PK) for each table. To reiterate, this is the unique identifier for any instance of a row or object within the table. The tool also shows the foreign key in the related child table, which as you will note is a link to the primary key of the parent table. This is how all one-to-many relationships work, the primary key of the parent table (the one side of the relationship) always links to the foreign key of the child table (the many side of the relationship).
As a general standard, I try to give meaningful names to primary key attributes, so that they are easily recognizable as foreign key attributes in related child tables.
Wrapping it up
Now you have seen a simple step-by-step example of the data modeling process, using the Angry Shards game as an example. There are many other interesting requirements to the Angry Shards application, and several other techniques to be covered for database normalization and de-normalization. That will be the subject of our next article.