
Getting Started with Neo4j, the Java graph database

Michael Hunger

Michael Hunger explains how you can get the most out of the growing trend of graph databases, focusing on one of its leading lights, Neo4j. Features a heavy amount of Keanu Reeves.

If you somehow haven’t noticed, graph databases have surged
in popularity within the software industry, thanks in part to
companies like Google, Facebook and Twitter putting them on
the map. In this article from June’s JAX Magazine, Michael
Hunger explains how you can get the most out of the growing trend,
focusing on one of its leading lights, Neo4j.

There is a whole world of information out there where size is
not king, and connectedness assumes the throne. Everything in the
real – and digital – world is connected and the amount of value in
these relationships is tremendous. Historical events are
interconnected with political arenas and individual participants.
Gene expression is derived from both DNA and environmental factors.
Networks, computers, applications and users form intricate
interaction networks. Every aspect of our lives is dominated by
connected information and things. Big internet companies are trying
to harness this power with efforts like the Google Knowledge Graph
or Facebook Graph Search.

Figure 1: Harnessing the power of graph data

And whenever we want to store this real world data in a
database, we somehow have to take care of this fact. Usually the
connections are ignored, denormalized or aggregated to fit in the
data model and make operations fast enough. What you lose by doing
this is the richness of the information that you could have
retained with a different data model and database. That’s where the
property graph model and graph databases come in. If graph-shaped
data shows up in a relational database, on the other hand, you’ll
easily recognize it by the sheer number of intermediate join tables
and join statements in your queries (and by dropping performance
levels).

Figure 2: Graph theory

Graph theory is much older than anyone would think. Treating graphs
explicitly with database semantics like ACID transactions is new,
however. Graph databases are part of the recent NoSQL movement, a
term that mostly means non-relational databases. Most of them are
open source and developer-friendly, and come with a dedicated data
model that suits a certain use case.

Graph databases are well suited to storing, retrieving and
quickly querying interesting networks of information. This kind of
connected data is also known as a graph, not to be mixed up with
artwork, charts or diagrams. Graphs consist of nodes and directed,
typed relationships, both of which can hold arbitrary numbers and
types of attributes (key-value properties). That is all there is to
the graph model.
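
As a rough illustration, the whole model fits in a few lines of Python. This is only a sketch: the names Node, Relationship and relate are made up for this example and are not part of any Neo4j API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(eq=False)
class Node:
    """A node holding arbitrary key-value properties."""
    properties: Dict[str, Any] = field(default_factory=dict)
    # Relationships that start or end at this node (hypothetical layout).
    outgoing: List["Relationship"] = field(default_factory=list, repr=False)
    incoming: List["Relationship"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Relationship:
    """A directed, typed relationship that can hold properties as well."""
    start: Node
    type: str
    end: Node
    properties: Dict[str, Any] = field(default_factory=dict)


def relate(start: Node, rel_type: str, end: Node, **properties: Any) -> Relationship:
    """Connect two nodes and register the relationship on both ends."""
    rel = Relationship(start, rel_type, end, dict(properties))
    start.outgoing.append(rel)
    end.incoming.append(rel)
    return rel


keanu = Node({"name": "Keanu Reeves"})
matrix = Node({"title": "The Matrix"})
acts_in = relate(keanu, "ACTS_IN", matrix, role="Neo")
```

Nodes and relationships are the only two building blocks here, and both carry properties; everything else in a graph is expressed by combining them.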

Figure 3: Property Graph

You have certainly used graphs in the past, either when
modelling relational databases (ER-diagram), when drawing your
domain on a whiteboard with some colleagues (circles and lines) or
when creatively collecting information (mindmaps). The simplicity
of the graph model makes it easy to understand and the direct
visualization makes it even easier!

Enter Neo4j 

What is so cool about graph databases? Let’s take Neo4j, the open source database I work
on, as an example. It is a native graph database implemented in
Java. That means its internal database structure directly
represents nodes and relationships (and properties) as records in
its database files. It does not sit on top of another database but
uses its own infrastructure that was developed solely to work with
graph-shaped data.

How does a graph database achieve high-speed navigation? It
uses a cheap trick. Instead of recreating connections every time
you issue a query (with a huge impact on memory and CPU), as most
other databases do, a graph database materializes the connections
once at insertion time. It takes a one-time hit then, but for each
query or traversal, navigating from one node to another is a
constant-time operation. It only has to follow the existing,
persistent relationship that points to the other end. And that
jump is ultrafast.
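
The difference can be sketched in plain Python. This is a toy comparison under simplifying assumptions, not how Neo4j stores its records: the join-table function searches all connections on every query, while the traversal only follows references that were stored when the data was inserted. All names are invented for the example.

```python
# Connections as they would sit in a relational join table.
rows = [
    ("Keanu Reeves", "The Matrix"),
    ("Carrie-Anne Moss", "The Matrix"),
    ("Keanu Reeves", "Speed"),
]


def movies_via_join(actor_name):
    # Recreates the connection per query by scanning every row.
    return [movie for actor, movie in rows if actor == actor_name]


class Node:
    def __init__(self, name):
        self.name = name
        self.related = []  # direct references to connected nodes


def connect(a, b):
    # One-time cost at insertion: store the connection on both ends.
    a.related.append(b)
    b.related.append(a)


nodes = {}
for actor, movie in rows:
    a = nodes.setdefault(actor, Node(actor))
    m = nodes.setdefault(movie, Node(movie))
    connect(a, m)


def movies_via_traversal(actor_node):
    # Each hop follows a stored reference; no search over all connections.
    return [node.name for node in actor_node.related]


keanu = nodes["Keanu Reeves"]
```

With three rows the two approaches are indistinguishable, but the scan grows with the total number of connections in the database, while the traversal only touches the relationships of the node you start from.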

Neo4j represents nodes and relationships as Java objects in its
embedded Java APIs, as JSON objects in its server’s REST API and as
ASCII art in its query language, Cypher.

Wait, ASCII-Art? How does that work? Remember the lines and
circles on a whiteboard? Well, if we visualize a real domain that
way, it can get pretty complex quickly, especially if we don’t see
the forest for the trees. But what are we looking for? We want to
find patterns in that graph and within those patterns, we want to
aggregate and project data so that our questions are answered and
use cases handled. So with graph visualization, we can easily
highlight patterns by just redrawing them in another color.

Figure 4: Relationships

But how would we do that in a textual, declarative query
language? That’s where the ASCII art comes in. We use names in
parentheses for nodes and dashed arrows for relationships.
Relationship types appear in square brackets, and properties show
up in a JSON-like notation.

Cypher quickstart

Getting started with Cypher is easy. There is a learning track
on Cypher, a beautiful Cheat Sheet and a comprehensive section
in the Neo4j Manual. In a few minutes you can have a
Neo4j Server up and running
on your machine and start working with it. Or you can play in the
sandbox of the Neo4j Online Console (http://neo4j.org/learn/try).

In this example from a movies domain, you can see what that
looks like.

 

(m:Movie {title: "The Matrix"})
<-[:ACTS_IN {role:"Neo"}]-
(a:Actor {name:"Keanu Reeves"})

 

We see Cypher as a humane language optimized for readability and
stating intent. You declare which patterns you want to match in the
graph and which operations (filtering, aggregation, sorting and
paging) you want to apply to the matching results. In Cypher, you do
not state how you want to have something done, but rather what you
are looking for. In this regard, it is similar to SQL. On the other
hand, it is much more powerful in terms of declaring complex
relationships between nodes, specifying paths of variable length,
applying graph algorithms, working with collections, chaining
subqueries, and passing on intermediate results. Cypher not only
allows fast and powerful queries, but also permits creating and
updating information in the graph.

 

// creates a movie with all its actors in one go
// {movie_title} and {actors} are query parameters;
// each entry of {actors} is a [name, role] pair
CREATE (m:Movie {title:{movie_title}})
FOREACH (a IN {actors} |
  CREATE (:Actor {name:a[0]})-[:ACTS_IN {role:a[1]}]->(m));

// find the 10 most frequent co-actors of Keanu Reeves
MATCH (keanu:Actor)-[:ACTS_IN]->()<-[:ACTS_IN]-(co:Actor)
WHERE keanu.name="Keanu Reeves"
RETURN co.name, count(*) AS times
ORDER BY times DESC
LIMIT 10;
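
To make the co-actor query more concrete, here is a rough Python sketch of what it computes, step by step, on a tiny in-memory data set. The cast dictionary is illustrative sample data and top_co_actors is a name invented for this sketch; a real application would simply send the Cypher query to the database.

```python
from collections import Counter

# Illustrative sample data only: movie title -> actors appearing in it.
cast = {
    "The Matrix": ["Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne"],
    "The Matrix Reloaded": ["Keanu Reeves", "Carrie-Anne Moss", "Laurence Fishburne"],
    "Speed": ["Keanu Reeves", "Sandra Bullock"],
}


def top_co_actors(cast, name, limit=10):
    """Count how often other actors share a movie with `name`."""
    times = Counter()
    for actors in cast.values():
        if name in actors:              # (keanu)-[:ACTS_IN]->()
            for co in actors:           # ()<-[:ACTS_IN]-(co)
                if co != name:          # a match never reuses the same relationship
                    times[co] += 1      # count(*) grouped by co.name
    return times.most_common(limit)     # ORDER BY times DESC LIMIT 10


result = top_co_actors(cast, "Keanu Reeves")
```

The imperative version has to spell out the loops and the bookkeeping, whereas the Cypher pattern only states the shape of the data you are looking for.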

 

Cypher is the easiest way to get up and running with Neo4j and,
in both the embedded and server deployment modes, it helps you
tremendously with writing your application. To interact with Neo4j
in your preferred programming language or style, you can choose
from a plethora of drivers.

What’s new in Neo4j 2.0?

First of all, we have changed the data model for the first time
in 10 years. Now, not only relationships but also nodes can be
labeled. This makes it much easier to represent types of things in
the graph (multiple per node) and allows for multiple
optimizations to hook into this information. On top of these
node labels, we can start declaring automatic indexes that are
defined per label and property, and we can introduce property
constraints (think uniqueness, value ranges, property types).

 

CREATE INDEX ON :Movie(title);

// uses the index
MATCH (m:Movie) WHERE m.title = "The Matrix" RETURN m;

// example of a unique constraint
CREATE CONSTRAINT ON (actor:Actor) ASSERT actor.name IS UNIQUE;
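
Conceptually, an index defined per label and property behaves like a lookup table from property value to nodes, and a uniqueness constraint rejects a second node with the same value. The following Python sketch shows the idea only; LabelPropertyIndex and the dictionary layout of a node are assumptions made for this example and say nothing about Neo4j’s actual index implementation.

```python
from collections import defaultdict


class LabelPropertyIndex:
    """Toy index keyed by one label and one property, optionally unique."""

    def __init__(self, label, prop, unique=False):
        self.label = label
        self.prop = prop
        self.unique = unique
        self._entries = defaultdict(list)  # property value -> nodes

    def add(self, node):
        # Only nodes carrying both the label and the property are indexed.
        if self.label not in node["labels"] or self.prop not in node["properties"]:
            return
        value = node["properties"][self.prop]
        if self.unique and self._entries[value]:
            raise ValueError(f"{self.label}.{self.prop} = {value!r} already exists")
        self._entries[value].append(node)

    def lookup(self, value):
        # Direct lookup by value instead of scanning every node.
        return list(self._entries.get(value, []))


movies_by_title = LabelPropertyIndex("Movie", "title")
matrix = {"labels": {"Movie"}, "properties": {"title": "The Matrix"}}
movies_by_title.add(matrix)

unique_actor_names = LabelPropertyIndex("Actor", "name", unique=True)
unique_actor_names.add({"labels": {"Actor"}, "properties": {"name": "Keanu Reeves"}})
```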

 

Neo4j 2.0 also brings a new MERGE functionality that allows you
to match patterns that you specify, create parts or the whole of
the pattern if they are not there, and run separate update
operations on create or on match.

 

MERGE (keanu:Person {name:'Keanu Reeves'})
ON CREATE SET keanu.created = timestamp()
ON MATCH SET keanu.lastSeen = timestamp()
RETURN keanu;
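
In application terms, MERGE is a get-or-create with two hooks. A minimal Python sketch of those semantics, assuming a plain dictionary stands in for the graph (merge_person and people are hypothetical names for this example):

```python
import time

people = {}  # name -> properties; stands in for the graph


def merge_person(name, now=None):
    """Return the person with this name, creating it if it is missing."""
    now = int(time.time() * 1000) if now is None else now
    person = people.get(name)
    if person is None:
        # Corresponds to the ON CREATE branch.
        person = {"name": name, "created": now}
        people[name] = person
    else:
        # Corresponds to the ON MATCH branch.
        person["lastSeen"] = now
    return person


first = merge_person("Keanu Reeves", now=1000)
first_had_last_seen = "lastSeen" in first
second = merge_person("Keanu Reeves", now=2000)
```

The first call creates the entry and only sets created; the second call finds the same entry and only sets lastSeen.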

 

Another thing we’re really thrilled about is the new
transactional Cypher HTTP endpoint. So far, we only supported one
transaction per request to the Neo4j Server. Now, a transaction is
created and kept alive until an explicit commit (POST to
/transaction/id/commit URL) or rollback (DELETE /transaction/id) is
sent to the endpoint or the transaction timeout occurs. You can
post as many Cypher statements to the endpoint as you want, as
often as you want, in a streaming manner and receive the results
streamed back at the same time. This endpoint is also much less
verbose than the previous incarnations.
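
As a sketch of what a client sends, the snippet below only builds a request body and makes no network call. It assumes the JSON shape used by the Neo4j 2.0 transactional endpoint (a "statements" list whose entries have "statement" and "parameters" keys); check the manual for your server version before relying on it. build_request_body is a name invented for this example.

```python
import json


def build_request_body(statements):
    """Serialize (cypher, parameters) pairs into one JSON request body."""
    return json.dumps({
        "statements": [
            {"statement": cypher, "parameters": params}
            for cypher, params in statements
        ]
    })


# Several statements can travel in a single request to the open transaction.
body = build_request_body([
    ("CREATE (m:Movie {title: {title}})", {"title": "The Matrix"}),
    ("MATCH (m:Movie) RETURN count(m)", {}),
])
```

A driver would POST bodies like this to the transaction URL repeatedly and then issue the commit or rollback request described above.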

This allows a completely new set of drivers to emerge that
focus on Cypher as the sole means of interacting with Neo4j,
resemble current SQL tools much more closely and allow better
integration with other software.

So how would you get started with a graph database? First of
all, take a step back. It takes a bit of unlearning the
relational model to create a good graph model of your domain. But
actually it is not hard at all. Just grab a colleague or two and a
whiteboard and start drawing an example of your domain that
contains all the information and relationships you need to find
answers to your upcoming questions and use cases. This is the model
you want to work towards, not the existing, technology-driven
database model.

With this model in mind, the next step is to get your data into the
graph database. So first download Neo4j Server 2.0 and start it up.
On the local Web-UI you can visualize your data, but we’re first
interested in the interactive console shell. The shell allows you
to run Cypher commands against the running server, much like you
would do with any SQL tool. Those commands can both query and
update the database, as you’ve seen before. Make sure to have the
Cypher cheat sheet at hand to check the syntax.

Now it is up to you to get some data out of your existing
database (or a data generator) in a workable format. One
way is to create separate CSV files for future nodes and
relationships. To convert this tabular data into a graph, you just
generate some Cypher statements. So either write a small script to
output the appropriate create statements or use the string
concatenation abilities of a spreadsheet. Once you have the
statements available, you might
want to wrap them into a begin CREATE …; CREATE …; commit block
to ensure atomic creation of your data (all or none). Then you can
paste this script into the console of the Web-UI or have the
neo4j-shell connect to the server and execute it: bin/neo4j-shell
-f import.cql.
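
Such a small script could look like the following Python sketch. The CSV column names (name, title, actor, movie, role), the function names and the use of JSON escaping for Cypher string literals are all assumptions made for this example; adapt them to your own export and check the escaping against real data.

```python
import csv
import io
import json


def quote(value):
    # Assumption: JSON string escaping is close enough to Cypher's
    # double-quoted string literals for simple values.
    return json.dumps(value, ensure_ascii=False)


def csv_to_cypher(actors_csv, movies_csv, roles_csv):
    """Turn node and relationship CSV texts into one begin/commit script."""
    lines = ["begin"]
    for row in csv.DictReader(io.StringIO(actors_csv)):
        lines.append("CREATE (:Actor {name:%s});" % quote(row["name"]))
    for row in csv.DictReader(io.StringIO(movies_csv)):
        lines.append("CREATE (:Movie {title:%s});" % quote(row["title"]))
    for row in csv.DictReader(io.StringIO(roles_csv)):
        # Look up both ends again, then connect them.
        lines.append(
            "MATCH (a:Actor {name:%s}), (m:Movie {title:%s}) "
            "CREATE (a)-[:ACTS_IN {role:%s}]->(m);"
            % (quote(row["actor"]), quote(row["movie"]), quote(row["role"]))
        )
    lines.append("commit")
    return "\n".join(lines)


script = csv_to_cypher(
    "name\nKeanu Reeves\n",
    "title\nThe Matrix\n",
    "actor,movie,role\nKeanu Reeves,The Matrix,Neo\n",
)
```

Writing the returned text to import.cql gives you a file that can be pasted into the console or passed to the shell as described above.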

Where next?

That is it. You can now query or visualize your graph however
you like. Programmatic access to the database is possible through
the full selection of drivers for almost all programming languages.
Those were graciously contributed by Neo4j’s large developer
community. You can of course also use one of the drivers to acquire
or generate the data to insert and then populate the graph directly
either with Cypher or an equivalent Node & Relationship based
API.

Lastly, I want to point out the wide applicability of the graph
data model. Every reasonably advanced data model contains a lot of
important connections and can be easily represented as a graph.
This becomes obvious when you look at domain object models and
imagine pointer references to be relationships and objects to be
nodes.

To illustrate that, I want to point out a few interesting
applications:


  • Facebook Graph Search by Max De Marzi imports data from Facebook and converts natural language queries into Cypher statements
  • Rik Van Bruggen’s beer graph shows that even a non-technical person can create a graph model and data and then run interesting queries on it
  • Open Tree Of Life is working on creating a graph of all the organisms in the world
  • Shutl finds the best courier and route for instant (minutes) delivery of goods purchased through e-commerce channels
  • Telenor handles complex ACL resolution algorithms on top of a graph model

If this article sparks your interest in graphs and graph
databases and how they can make your life and development easier, I
recommend reading the more comprehensive Graph Databases, attending
a local GraphConnect conference or joining other curious developers
in one of our worldwide trainings. There is so much more to learn
and many ways to participate in and engage with the Neo4j
community. Our site is a good starting point for that too.

Author Bio: Michael Hunger has been
passionate about software development for a long time. He is
particularly interested in the people who develop software,
software craftsmanship, programming languages, and improving code.
For the last few years he has been working with Neo Technology on
the Neo4j graph database. As the project lead of Spring Data Neo4j
he helped develop the idea into a convenient and complete
solution for object graph mapping. He is also taking care of Neo4j
cloud hosting efforts. Good relationships are everywhere in
Michael’s life. His life concerns his family and children, running
his coffee shop and co-working space, having fun in the depths of a
text-based multi-user dungeon, tinkering with and without Lego and
much more.

This article appeared in June’s edition of JAX Magazine – The
Graph Renaissance.
