Mining your information

Get more intelligence from your data using Cypher and Neo4j – UPDATED for Neo4j 2.0

Peter Bell, Michael Hunger
cypher

In this tutorial, Peter Bell and Michael Hunger show you how to turn your data into gold. If you’ve converted over to Neo4j 2.0, you’ll be happy to hear that this article has now been updated to reflect the latest release of the program.

Graphs are on the rise. From the Facebook Graph to the Google Knowledge Graph, large internet companies are taking advantage of graph queries to improve the value they can get from data they already have. Neo4j is one of the new breed of graph databases that allow for similar capabilities in companies that can’t justify building their own graph persistence or querying implementation from scratch.

In this article, we’ll look at Cypher, the language for querying and updating graphs in Neo4j. It’s a declarative, SQL-like language for specifying graph-based queries simply.

This tutorial is designed for the newest version 2.0 of Neo4j, which was released in December 2013.

Getting started 

Start by downloading the Neo4j file at http://neo4j.org/download.

  • On Windows: Run the installer and start the server with the Neo4j Desktop App.

  • On Linux/Mac OS: Unzip the file and, from the extracted directory, start the server with bin/neo4j start (stop it again with bin/neo4j stop).

Open the Neo4j Browser on http://localhost:7474 and you should see the following welcome screen. Feel free to explore the guides that are available.

Figure 1: The Neo4j Browser

Graph basics

Before we can jump into Cypher, we need to cover just a couple of key points about graph databases. In a graph database, we persist information in labelled nodes and typed, directed relationships that connect them. Both nodes and relationships can have arbitrary properties (key: value pairs).

So for example, in Figure 2, nodes are used to represent movies (“Cloud Atlas” and “The Matrix”, labelled “Movie”) and people (actors “Hugo Weaving” and “Tom Hanks” and director “Lana Wachowski”, all labelled “Person”). Relationships are used to capture the fact that the actors have ACTED_IN the movies and the director has DIRECTED the movies. Nodes and relationships alike can carry additional properties, e.g. the roles property on ACTED_IN.

Figure 2: The Movie Domain Model

Note that all relationships are directed. You might argue for a bidirectional relationship for concepts like friendship, but it is not needed for modelling the domain: every relationship has a direction, yet it can be traversed either way. The direction only carries meaning in domains where it is really relevant (e.g. who follows whom on Twitter); in all other cases a single relationship is semantically equivalent to a bidirectional one. Whether I want to know all of the movies Tom Hanks has acted in or see all of the people who’ve worked on the movie “Cloud Atlas”, I’m able to express and run both queries easily.

Also note that unlike a “relational” database where you’d probably add a many:many join table to capture the relationship between actors and movies, in Neo4j we can just add properties like the roles (characters) the actors have played right onto the relationship, making modelling both simpler and more natural.
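As a rough sketch of what this looks like in Cypher (the language introduced below), the following statement creates one actor, one movie and a single ACTED_IN relationship that carries the roles directly. It is an illustration only: the property values are assumptions modelled on the movie dataset, and there is no need to run it, because the import in the next section creates this data for you.

```cypher
// Illustration only: the import in the next section creates this data.
// One actor node, one movie node, and one relationship that holds the
// roles as a property, so no join table is needed.
CREATE (hugo:Person {name: "Hugo Weaving"})
CREATE (matrix:Movie {title: "The Matrix", released: 1999})
CREATE (hugo)-[:ACTED_IN {roles: ["Agent Smith"]}]->(matrix);
```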

Finally, notice that there is only an optional schema in Neo4j, based on labels. Any node or relationship can have any properties. Uniqueness constraints or indexes can be declared on certain label and property combinations: indexes are used for query optimization, and constraints additionally guarantee that a value occurs only once.


Figure 3: A graph to describe a graph

So a graph stores information in both nodes and relationships. Relationships are used to organize (and traverse) nodes, and both nodes and relationships can have any number of key:value properties.

Importing Data

To import the movie dataset we’ll be working with, run :play movies in the Browser’s command line, which brings up a larger import statement. Click on that statement to bring it into the editor and hit the run button on the right side.

After a few seconds the data will have been imported; the visualization shows a single node.
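One way to check that the import worked is to count the nodes per label. This is a minimal sketch; the exact numbers depend on the version of the dataset, so none are shown here. The aliases nodeLabels and nodes are just names chosen for this example.

```cypher
// Count the imported nodes, grouped by their set of labels
MATCH (n)
RETURN labels(n) AS nodeLabels, count(*) AS nodes;
```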

To see a bit more of the graph, open the “star” tab on the left and run the “Get some data” query.

Click on one movie node to bring up a pop-up for styling the graph. On the small tab on the right you can select the color, size and property to show (“title”). Do the same for the people nodes and choose their name as the property to display.

Figure 4: The Browser Visualization and Styling

Understanding Cypher

Historically, writing graph traversals has been painful enough to persuade many developers to put up with the limitations of a relational store for information that’s better fitted to a graph database. With Cypher, that has changed. While it takes a little while to become comfortable with the syntax of Cypher queries, it doesn’t take long to understand the power of the language and to get used to writing queries.

Feel free to run the queries we discuss directly in the browser and look at the results.

At its heart, Cypher allows you to describe the patterns you want to look for in a graph. Let’s say that you wanted to look for any two nodes where the first has a relationship to the second. This is the Cypher query you’d write:

MATCH (a)-->(b)

RETURN a, b;

We’re describing a pattern in the MATCH clause, where the node we’re starting on (a) has an outbound relationship to another node (b). As we want to return the information about both of the nodes and any properties that they might have we just RETURN a, b.

Naturally there are a lot of results, as this is a pretty broad query (it finds all pairs of connected nodes). You see the visualization, and in the tabular view (right icon) you see a rendering of the JSON representation for each row with the properties of the nodes.
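If the full result is more than you want to look at while exploring, one option is to cap the number of rows with LIMIT. A small sketch (the value 10 is arbitrary):

```cypher
// Same pattern as above, but return at most 10 pairs of connected nodes
MATCH (a)-->(b)
RETURN a, b
LIMIT 10;
```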

Try the following query:

MATCH (a)-[r]->()

RETURN a.name, type(r);

There are a number of differences in this query. Firstly, we don’t care about what node we’re connecting to, so we don’t need to assign it a variable in the MATCH clause for referring to it in the RETURN clause.

Secondly, we’re just displaying the name property of the first node rather than returning the entire node. Finally, we’re using square brackets to refer to the relationship and we’re using a special “type()” function that allows us to know what *kind* of relationship connects the two nodes.
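The type() function combines well with aggregation. As a sketch, the following query gives an overview of which relationship types exist in the graph and how often each occurs; the aliases relType and total are names picked for this example.

```cypher
// Neither end node is named because only the relationship matters here
MATCH ()-[r]->()
RETURN type(r) AS relType, count(*) AS total
ORDER BY total DESC;
```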

As an aside, there are only two absolute data integrity constraints in Neo4j. Firstly, a relationship must always directionally connect two nodes (they can be the same node). Secondly, you must always give the relationship a type. The type can be anything you want, but whether it’s a FRIEND, ROUTE, DEPARTMENT or YELLOW_ELEPHANT, you need to provide the type of a relationship when you create it.

Answering Real Questions

The queries above start to give us a sense of the syntax used by Cypher, but how about we look at some real queries to get a sense of how this works in practice?

Let’s say we wanted to find all of the actors in our database and all of the movies they’d acted in. Try entering the following query and see what you get: 

MATCH (a)-[r:ACTED_IN]->(m:Movie)

RETURN a.name, r.roles, m.title;

What does this do? Well, we’re still doing a full graph search – don’t worry – we’ll see later how to use bound nodes to run more performant, local queries. But now we’re limiting the type of relationship to ACTED_IN, so we’re not finding producers, directors or reviewers. Note that the movie node (m) is tagged with the label :Movie so that other engagements (e.g. tv-series) are not considered. Then we’re returning the name of the actor, a collection of the roles they played in the movie and the title of the movie, so we get to see all of the actors, roles and movies in the database.

One of the types supported by Neo4j for properties is “array”, so that’s how we’re getting an arbitrary set of roles. Another approach would have been to create one ACTED_IN relationship for each role an actor played in each movie, but in this case we’re putting all the roles into a single relationship to show that property values in Neo4j are not just limited to numbers and strings.
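Because roles is an array, it can be tested with the IN operator. The following sketch finds who played a given character; the role name “Neo” is assumed to be present in the movie dataset.

```cypher
// Match only ACTED_IN relationships whose roles array contains "Neo"
MATCH (a)-[r:ACTED_IN]->(m:Movie)
WHERE "Neo" IN r.roles
RETURN a.name, m.title;
```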

So, we just wrote a query to find out all of the movie titles and roles that each actor has been associated with. If directors have a relationship with movies of type DIRECTED, what do you think the query would be to show all of the actors’ names, movie titles and directors’ names for every actor/director/movie combination in the database? Take a moment, have a go in the query console and see what you get.

Did you get it working? Here’s a query that would do what we wanted:

MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)

RETURN a.name, m.title, d.name;

So we’re still using a full graph search to start on every single node in the graph. We’re trying to match a more complex pattern where an actor has an ACTED_IN relationship with a movie AND where the movie has a DIRECTED relationship from a director. For example, if a movie didn’t have a director it wouldn’t be a match, and if an actor acted in a movie with two directors, there would be two records – one for the actor, the movie and the first director and another for the actor, the movie and the second director. In the last line we’re deciding what information to include in the result set. In this case it’s the name of the actor, the title of the movie, and the name of the director. Run the query (if you didn’t already) and have a look at what you get.
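The same pattern can also be written as two shorter patterns separated by a comma, which some people find easier to read as patterns grow. This is an equivalent sketch of the query above, not a different technique: the shared variable m ties the two parts together.

```cypher
// Two patterns joined on the shared movie node m
MATCH (a)-[:ACTED_IN]->(m),
      (d)-[:DIRECTED]->(m)
RETURN a.name, m.title, d.name;
```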

Let’s say you wanted to find all of the actors who directed the movie they acted in. What change would you make to the query? Give it a try and see what happens.

Here’s a solution: 

MATCH (a:Person)-[:ACTED_IN]->(m)<-[:DIRECTED]-(a)

RETURN a.name, m.title;

We want to say that whoever ACTED_IN the movie, i.e. person (a), also DIRECTED the movie. By using the same variable name for the far end of the DIRECTED relationship, we’re saying that the same person must both have acted in and directed that same movie to match the pattern and be included in the record set. I also removed d.name from the RETURN clause, as it would no longer be a valid reference now that there is no “d” in the query. Run the query and you’ll see it returns far fewer results.
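An alternative way to express the same idea, shown here only as a sketch for comparison, is to keep two variables and require them to be the same node in a WHERE clause. Reusing the variable, as in the solution above, is the more compact form.

```cypher
// Keep separate variables for actor and director, then require them to match
MATCH (a:Person)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
WHERE a = d
RETURN a.name, m.title;
```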

Starting Somewhere

Sometimes you’ll want to run a full graph search for use cases like graph-wide migrations or exporting all of your data, but most of your queries are likely to be much more constrained. In the case of the movie database, maybe we want to see all of the movies that Tom Hanks has acted in (within the data set). To do that, we need to bind the actor node by constraining one of its properties.

MATCH (tom:Person)-[:ACTED_IN]->(movie)

WHERE tom.name = "Tom Hanks"

RETURN movie.title;

In a production environment, you’d create indexes for looking up nodes by their key properties, e.g. a user by their email address or a product by its inventory number. We’re creating one index and one constraint for our dataset. With those in place, the bound queries are sped up and uniqueness is guaranteed for movies by title (which might not reflect reality).

CREATE INDEX ON :Person(name);

CREATE CONSTRAINT ON (m:Movie) ASSERT m.title IS UNIQUE;

So if you now run the same query again, it will use the index on :Person(name) to quickly look up the starting point and only traverse in the local neighbourhood. There is also a shortcut notation for these lookups without a WHERE clause.

MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(movie)

RETURN movie.title;
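The same shorthand works when starting from a movie instead of a person. As a sketch, this query starts at “Cloud Atlas” and follows incoming relationships of any type back to the people involved, returning what kind of involvement each one had:

```cypher
// Start at one movie and traverse incoming relationships of any type
MATCH (movie:Movie {title:"Cloud Atlas"})<-[r]-(person:Person)
RETURN person.name, type(r);
```

With the uniqueness constraint on :Movie(title) in place, the starting movie is a single lookup as well.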

More Complex Queries

There are lots of other things that can be done in Cypher. Aggregate functions allow you to return distinct records, a sum, average, min, max, count or even a collect (which aggregates values into an array or list). For example:

MATCH (a)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d)

RETURN a.name, d.name, collect(m.title);

This query returns every pair of actors and directors who worked together on a movie and for each pair it returns an array containing all of the titles of the movies that they worked together on.

Figure 5: Tabular Results from a Cypher Query
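The other aggregate functions follow the same shape. As a hedged example, this sketch counts the movies per actor and lists the five busiest actors in the dataset; the alias movies and the limit of 5 are choices made for this example.

```cypher
// count(m) is grouped by the non-aggregated column a.name
MATCH (a)-[:ACTED_IN]->(m:Movie)
RETURN a.name, count(m) AS movies
ORDER BY movies DESC
LIMIT 5;
```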

We can also constrain queries, for instance by a property of a node or relationship:

MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(movie)

WHERE movie.released < 1992

RETURN movie.title;

This returns all of the titles of the movies that Tom Hanks acted in that were released before 1992. We can also constrain based on comparisons. An example would be:

MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(movie)<-[:ACTED_IN]-(a)

WHERE a.born < tom.born

RETURN DISTINCT a.name;

This returns the names of all of the actor colleagues older than Tom Hanks who have acted in at least one movie with him.
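Conditions in a WHERE clause can be combined with AND and OR. As a small sketch, this variation lists the movies Tom Hanks acted in during the 1990s, oldest first; the year range is just an example.

```cypher
// Two comparisons combined with AND to express a range of release years
MATCH (tom:Person {name:"Tom Hanks"})-[:ACTED_IN]->(movie)
WHERE movie.released >= 1990 AND movie.released < 2000
RETURN movie.title, movie.released
ORDER BY movie.released;
```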

It’s even possible to constrain based on patterns. For example:

MATCH (gene:Person {name:"Gene Hackman"})-[:ACTED_IN]->(movie)<-[:ACTED_IN]-(n)

WHERE (n)-[:DIRECTED]->()

RETURN DISTINCT n.name;

This query returns the names of all of the people who have acted in at least one movie with Gene Hackman but who have also directed at least one movie. Hint: it’s “Clint Eastwood”.
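A pattern in a WHERE clause can also be negated with NOT. As a sketch of the opposite question, this query returns Gene Hackman’s co-actors who have never directed a movie in the dataset:

```cypher
// NOT in front of the pattern keeps only people without a DIRECTED relationship
MATCH (gene:Person {name:"Gene Hackman"})-[:ACTED_IN]->(movie)<-[:ACTED_IN]-(n)
WHERE NOT (n)-[:DIRECTED]->()
RETURN DISTINCT n.name;
```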

Conclusion

We’ve barely scratched the surface of Cypher, but hopefully you’re starting to see just how powerful it can be for writing sophisticated graph queries. Cypher also allows you to create and update data in the graph, which we haven’t covered here.

If you’d like to find out more, head on over to neo4j.org where there are a range of resources for learning more about implementing graph solutions using Neo4j.

This article covers the first part of the Neo4j online course. For the graph epiphany, read the comprehensive Graph Databases book by Jim Webber and Ian Robinson, which is also available as a free PDF on http://graphdatabases.com. And last but not least check out how other companies are using Neo4j to create innovative systems.

Peter Bell

CTO/Founder of Speak Geek

Peter is a contract member of the GitHub training team, and provides enterprise corporate training on a range of NoSQL data stores including Neo4j, as well as lean startup/product and mobile training and consulting to enterprises. He’s also the CTO and founder of Speak Geek, which trains business people to more effectively hire and manage development teams.

Updated to Neo4j 2.0 by Michael Hunger

Michael Hunger has been passionate about software development for a long time. He is particularly interested in the people who develop software, software craftsmanship, programming languages, and improving code.

For the last few years he has been working with Neo Technology on the Neo4j graph database. As the project lead of Spring Data Neo4j, he helped develop the idea into a convenient and complete solution for object graph mapping. He also takes care of Neo4j cloud hosting efforts. Michael now looks after the Neo4j community in all regards and is involved with activities in all parts of the company.

 

