Tutorial

Getting Started with Neo4j, the Java graph database

Michael Hunger

Michael Hunger explains how you can get the most out of the growing trend of graph databases, focusing on one of the leading lights, Neo4j. Features a heavy amount of Keanu Reeves.

If you somehow haven’t noticed, graph databases have surged in popularity within the software industry, thanks in part to companies like Google, Facebook and Twitter putting them on the map. In this article from June’s JAX Magazine, Michael Hunger explains how you can get the most out of the growing trend, focusing on one of the leading lights, Neo4j.

There is a whole world of information out there where size is not king, and connectedness assumes the throne. Everything in the real – and digital – world is connected and the amount of value in these relationships is tremendous. Historical events are interconnected with political arenas and individual participants. Gene expression is derived from both DNA and environmental factors. Networks, computers, applications and users form intricate interaction networks. Every aspect of our lives is dominated by connected information and things. Big internet companies are trying to harness this power with efforts like the Google Knowledge Graph or Facebook Graph Search.

Figure 1: Harnessing the power of graph data

Whenever we want to store this real-world data in a database, we have to deal with that connectedness somehow. Usually the connections are ignored, denormalized or aggregated to fit the data model and make operations fast enough. What you lose by doing this is the richness of the information that you could have retained with a different data model and database. That’s where the property graph model and graph databases come in. If graph-shaped data ends up in a relational database, on the other hand, you’ll easily recognize it by the sheer number of intermediate join tables and join statements in your queries (and by dropping performance levels).

Figure 2: Graph theory

Graph theory is much older than most people would think. Treating graphs explicitly with database semantics like ACID transactions is new, however. Graph databases are part of the recent NoSQL movement, a term that mostly means non-relational databases. Most of them are open source, developer-friendly and come with a dedicated data model that suits a certain use case.

Graph databases are well suited to storing, retrieving and quickly querying interesting networks of information. This kind of connected data is also known as a graph (not to be confused with artwork, charts or diagrams). Graphs consist of nodes and directed, typed relationships, both of which can hold arbitrary numbers and types of attributes (key-value properties). That is all there is to the graph model.
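The model just described is small enough to sketch in a few lines. The following is a minimal in-memory illustration in Python, not Neo4j's API: the class names `Node` and `Relationship` and the helper `relate` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(eq=False)
class Node:
    """A node holding arbitrary key-value properties."""
    properties: Dict[str, Any] = field(default_factory=dict)
    # Each relationship is registered on both of its end nodes,
    # so it can be followed in either direction.
    outgoing: List["Relationship"] = field(default_factory=list, repr=False)
    incoming: List["Relationship"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Relationship:
    """A directed, typed relationship that can hold properties of its own."""
    start: Node
    rel_type: str
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


def relate(start: Node, rel_type: str, end: Node, **properties: Any) -> Relationship:
    """Connect two nodes and register the relationship on both of them."""
    rel = Relationship(start, rel_type, end, dict(properties))
    start.outgoing.append(rel)
    end.incoming.append(rel)
    return rel


keanu = Node({"name": "Keanu Reeves"})
matrix = Node({"title": "The Matrix"})
acts_in = relate(keanu, "ACTS_IN", matrix, role="Neo")
```

Nodes and relationships share the same property mechanism, and the relationship itself carries the direction and the type. Nothing else is needed to describe the model.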

Figure 3: Property Graph

You have certainly used graphs in the past, either when modelling relational databases (ER-diagram), when drawing your domain on a whiteboard with some colleagues (circles and lines) or when creatively collecting information (mindmaps). The simplicity of the graph model makes it easy to understand and the direct visualization makes it even easier!

Enter Neo4j 

What is so cool about graph databases? Let’s take Neo4j, the open source database I work on, as an example. It is a native graph database implemented in Java. That means its internal database structure directly represents nodes and relationships (and properties) as records in its database files. It does not sit on top of another database but uses its own infrastructure, which was developed solely to work with graph-shaped data.

How does a graph database achieve high-speed navigation? It uses a cheap trick. Instead of recreating connections every time you issue a query (with a huge impact on memory and CPU), as most other databases do, a graph database materializes the connections once, at insertion time. It takes a one-time hit then, but for each query or traversal, navigating from one node to another is a constant-time operation. It only has to follow the existing, persistent relationship that points to the other end. And that jump is ultrafast.
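The difference between recomputing connections per query and materializing them once can be illustrated with a small Python sketch. It says nothing about how Neo4j's store files are actually laid out; the function names and the three sample rows are assumptions made for the example.

```python
from collections import defaultdict

# Join-table style: connections are a flat list of (actor, movie) rows
# that has to be searched again for every query.
acts_in_rows = [
    ("Keanu Reeves", "The Matrix"),
    ("Carrie-Anne Moss", "The Matrix"),
    ("Keanu Reeves", "Speed"),
]


def movies_by_scan(actor):
    """Recreate the connection at query time by scanning every row."""
    return [movie for (name, movie) in acts_in_rows if name == actor]


# Graph style: each connection is materialized once, at insertion time,
# as a direct reference held by the start node.
outgoing = defaultdict(list)
for actor, movie in acts_in_rows:
    outgoing[actor].append(movie)


def movies_by_traversal(actor):
    """Follow the stored references; the cost depends only on this node's
    own relationships, not on the total number of rows."""
    return outgoing.get(actor, [])
```

Both functions return the same answer. The scan grows with the size of the whole table, while the traversal only touches what is attached to the starting node.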

Neo4j represents Nodes and Relationships as Java objects in its embedded Java-APIs, as JSON objects in its server’s REST-API and as ASCII art in its query language Cypher.

Wait, ASCII-Art? How does that work? Remember the lines and circles on a whiteboard? Well, if we visualize a real domain that way, it can get pretty complex quickly, especially if we don’t see the forest for the trees. But what are we looking for? We want to find patterns in that graph and within those patterns, we want to aggregate and project data so that our questions are answered and use cases handled. So with graph visualization, we can easily highlight patterns by just redrawing them in another color.

Figure 4: Relationships

But how would we do that in a textual, declarative query language? That’s where the ASCII art comes in. We put names in parentheses for nodes and use dashed arrows for relationships. Relationship types appear in square brackets, and properties show up in a JSON-like notation.

Cypher quickstart

Getting started with Cypher is easy. There is a learning track on Cypher, a beautiful Cheat Sheet and a comprehensive section in the Neo4j Manual. In a few minutes you can have a Neo4j Server up and running on your machine and start working with it. Or you can play in the sandbox of the Neo4j Online Console (http://neo4j.org/learn/try).

This example from a movies domain shows what that looks like.

 

(m:Movie {title: "The Matrix"})
<-[:ACTS_IN {role:"Neo"}]-
(a:Actor {name:"Keanu Reeves"})

 

We see Cypher as a humane language optimized for readability and stating intent. You declare which patterns you want to match in the graph and which operations (filtering, aggregation, sorting and paging) you want to apply to the matching results. In Cypher, you do not state how you want to have something done, but rather what you are looking for. In this regard, it is similar to SQL. On the other hand, it is much more powerful when it comes to declaring complex relationships between nodes, specifying paths of variable length, applying graph algorithms, working with collections, chaining subqueries and passing on intermediate results. Cypher not only allows fast and powerful queries, but also permits creating and updating information in the graph.

 

// creates a movie with all its actors in one go
CREATE (m:Movie {title: {movie_title}})
FOREACH (a IN {actors} |
  CREATE (:Actor {name: a[0]})-[:ACTS_IN {role: a[1]}]->(m));

// find the 10 most frequent co-actors of Keanu Reeves
MATCH (keanu:Actor)-[:ACTS_IN]->()<-[:ACTS_IN]-(co:Actor)
WHERE keanu.name = "Keanu Reeves"
RETURN co.name, count(*) AS times
ORDER BY times DESC
LIMIT 10;
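To make explicit what the co-actor query computes, here is a rough in-memory equivalent in Python. It is only a sketch of the query's semantics, not of how Neo4j evaluates it; the `cast` dictionary is illustrative sample data and `co_actors` is a hypothetical helper.

```python
from collections import Counter

# Illustrative sample data: movie title -> actors who ACTS_IN it.
cast = {
    "The Matrix": ["Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne"],
    "The Matrix Reloaded": ["Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne"],
    "Speed": ["Keanu Reeves", "Sandra Bullock"],
}


def co_actors(name, limit=10):
    """Count how often other actors share a movie with `name`."""
    times = Counter()
    for actors in cast.values():
        if name in actors:  # (keanu)-[:ACTS_IN]->()
            # ()<-[:ACTS_IN]-(co), skipping the starting actor
            times.update(a for a in actors if a != name)
    # ORDER BY times DESC LIMIT 10
    return times.most_common(limit)


result = co_actors("Keanu Reeves")
```

The Cypher version states the same pattern in one line and leaves the looping and counting to the database.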

 

Cypher is the easiest way to get up and running with Neo4j, and in both the embedded and the server deployment mode it helps you tremendously with writing your application. To interact with Neo4j in your preferred programming language or style, you can choose from a plethora of drivers.

What’s new in Neo4j 2.0?

First of all, we have changed the data model for the first time in 10 years. Now not only relationships but also nodes can be labeled. This makes it much easier to represent types of things in the graph (multiple per node) and allows several optimizations to hook into this information. On top of these node labels, we can start declaring automatic indexes that are defined per label and property, and we can introduce property constraints (think uniqueness, value ranges, property types).

 

CREATE INDEX ON :Movie(title);

// uses the index
MATCH (m:Movie) WHERE m.title = "The Matrix" RETURN m;

// example of a unique constraint
CREATE CONSTRAINT ON (actor:Actor) ASSERT actor.name IS UNIQUE;

 

Neo4j 2.0 also brings a new MERGE functionality that allows you to match the patterns you specify, create parts of the pattern or the whole pattern if they are not there yet, and run separate update operations depending on whether the pattern was created or matched.

 

MERGE (keanu:Person {name: 'Keanu Reeves'})
ON CREATE SET keanu.created = timestamp()
ON MATCH SET keanu.lastSeen = timestamp()
RETURN keanu;

 

Another thing we’re really thrilled about is the new transactional Cypher HTTP endpoint. So far, we only supported one transaction per request to the Neo4j Server. Now a transaction is created and kept alive until an explicit commit (POST to the /transaction/id/commit URL) or rollback (DELETE /transaction/id) is sent to the endpoint, or until the transaction timeout occurs. You can post as many Cypher statements as you want, as often as you want, to the endpoint in a streaming manner and receive the results streamed back at the same time. This endpoint is also much less verbose than its previous incarnations.

This allows a completely new set of drivers to emerge that focus on Cypher as the sole means of interacting with Neo4j, resemble current SQL tools much more closely and allow better integration with other software.
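As a rough sketch of how a thin, Cypher-only driver might talk to this endpoint, the Python below only plans the sequence of HTTP calls for one transaction and sends nothing. The paths follow the /transaction/id pattern mentioned above. The JSON body shape (a `statements` list of `statement` and `parameters` entries) is an assumption that should be checked against the Neo4j Manual, and both function names are hypothetical. How the transaction is opened and how its id is obtained is not covered here.

```python
import json


def statements_body(*statements):
    """Build a JSON request body from (cypher, parameters) pairs.

    The "statements" layout is an assumption, not taken from the article.
    """
    return json.dumps({
        "statements": [
            {"statement": cypher, "parameters": params}
            for cypher, params in statements
        ]
    })


def plan_requests(tx_id, batches, commit=True):
    """Describe the HTTP calls for one transaction as (method, path, body) tuples."""
    path = "/transaction/%s" % tx_id
    # Any number of statement batches can be posted while the transaction is open.
    requests = [("POST", path, statements_body(*batch)) for batch in batches]
    if commit:
        requests.append(("POST", path + "/commit", statements_body()))
    else:
        requests.append(("DELETE", path, None))  # rollback
    return requests


plan = plan_requests(
    1,
    [[("CREATE (m:Movie {title: {title}})", {"title": "The Matrix"})]],
)
```

Each tuple in `plan` could then be handed to whatever HTTP client the driver uses.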

So how would you get started with a graph database? The first step is to take a step back. It takes a bit of unlearning of the relational model to create a good graph model of your domain. But actually it is not hard at all. Just grab a colleague or two and a whiteboard and start drawing an example of your domain that contains all the information and relationships you need to find answers to your upcoming questions and use cases. This is the model you want to work towards, not the existing, technology-driven database model.

With this model in mind the next step is to get your data into the graph database. So first download Neo4j Server 2.0 and start it up. On the local Web-UI you can visualize your data, but we’re first interested in the interactive console-shell. The shell allows you to run Cypher commands against the running server, much like you would do with any SQL tool. Those commands can both query and update the database as you’ve seen before. Make sure to have the Cypher cheat sheet at hand to check the syntax.

Now it is up to you to get some data out of your existing database (or a data generator) in a format that is workable. One way is to create separate CSV files for the future nodes and relationships. To convert this tabular data into a graph, you just generate some Cypher statements. So either write a small script to output the appropriate CREATE statements or use the string concatenation abilities of a spreadsheet. Once you have the statements available, you might want to wrap them in a "begin CREATE …; CREATE …; commit" block to ensure atomic creation of your data (all or none). Then you can paste this script into the console of the Web-UI or have the neo4j-shell connect to the server and execute it: bin/neo4j-shell -f import.cql.
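The "small script" mentioned above could look roughly like the following Python sketch. The CSV column names (`name`, `title`, `actor`, `movie`, `role`), the sample rows and all function names are assumptions made for the example, and using JSON escaping for Cypher string literals is only an approximation.

```python
import csv
import io
import json


def cypher_literal(value):
    """Quote a value as a string literal (JSON escaping as an approximation)."""
    return json.dumps(value, ensure_ascii=False)


def node_statements(csv_text, label):
    """One CREATE per CSV row; column names become property keys."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        props = ", ".join(
            "%s: %s" % (key, cypher_literal(val)) for key, val in row.items()
        )
        yield "CREATE (:%s {%s});" % (label, props)


def acts_in_statements(csv_text):
    """Look up both end nodes by their key property, then connect them."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield (
            "MATCH (a:Actor {name: %s}), (m:Movie {title: %s}) "
            "CREATE (a)-[:ACTS_IN {role: %s}]->(m);"
            % tuple(cypher_literal(row[k]) for k in ("actor", "movie", "role"))
        )


def import_script(statements):
    """Wrap all statements in one begin/commit block (all or none)."""
    return "\n".join(["begin"] + list(statements) + ["commit"])


actors_csv = "name\nKeanu Reeves\n"
movies_csv = "title\nThe Matrix\n"
roles_csv = "actor,movie,role\nKeanu Reeves,The Matrix,Neo\n"

script = import_script(
    list(node_statements(actors_csv, "Actor"))
    + list(node_statements(movies_csv, "Movie"))
    + list(acts_in_statements(roles_csv))
)
print(script)
```

Nodes are created first so that the relationship statements can find both ends. The printed text can be saved as import.cql and run through the shell as described above.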

Where next?

This is it. You now can query or visualize your graph however you like. Programmatic access to the database is possible through the full selection of drivers for almost all programming languages. Those were graciously contributed by Neo4j’s large developer community. You can of course also use one of the drivers to acquire or generate the data to insert and then populate the graph directly either with Cypher or an equivalent Node & Relationship based API.

Lastly, I want to point out the wide applicability of the graph data model. Every reasonably advanced data model contains a lot of important connections and can easily be represented as a graph. This becomes obvious when you look at domain object models and imagine the pointer references to be relationships and the objects to be nodes.

To illustrate that, I want to point out a few interesting applications:

  • Facebook Graph Search by Max De Marzi imports data from Facebook and converts natural language queries into Cypher statements
  • Rik Van Bruggen’s beer graph shows that even a non-technical person can create a graph model and data and then run interesting queries on it
  • Open Tree Of Life is working on creating a graph of all the organisms in the world
  • Shutl finds the best courier and route for instant (minutes) delivery of goods purchased through e-commerce channels
  • Telenor handles complex ACL resolution algorithms on top of a graph model

If this article sparks your interest in graphs and graph databases and in how they can make your life and development easier, I would recommend reading the more comprehensive Graph Databases, attending a local GraphConnect conference or joining other curious developers in one of our worldwide trainings. There is so much more to learn, and there are many ways to participate in and engage with the Neo4j community. Our site is a good starting point for that too.

Author Bio: Michael Hunger has been passionate about software development for a long time. He is particularly interested in the people who develop software, software craftsmanship, programming languages, and improving code. For the last few years he has been working with Neo Technology on the Neo4j graph database. As the project lead of Spring Data Neo4j, he helped develop the idea into a convenient and complete solution for object graph mapping. He also takes care of Neo4j cloud hosting efforts. Good relationships are everywhere in Michael’s life. His life revolves around his family and children, running his coffee shop and co-working space, having fun in the depths of a text-based multi-user dungeon, tinkering with and without Lego and much more.

This article appeared in June’s edition of JAX Magazine – The Graph Renaissance. For that issue and others, click here.
