The popularity of graph databases is still growing. But how do you choose the right database? And will graph databases work well in every project? In today’s article, I would like to look at the advantages of graph databases and encourage as many developers as possible to take advantage of the benefits they can bring to daily work. I will pay special attention to my favorite representative of this genre, Neo4j.
Neo4j (https://neo4j.com/) is one of the most popular graph databases, if not the most popular. As a reminder, and to organize your knowledge: a graph is composed of two types of elements, nodes and relationships. A node can be of one type or of several types (in Neo4j these types are expressed as labels) and has its own properties. Relationships, on the other hand, apart from a name and their own properties, have a direction, which is their most important feature. The properties mentioned above are collections of key-value pairs. They are used to store important information. For example: if the node is a person, then its properties can be: given name, last name, age or list of favorite books.
Relationships between nodes in Neo4j, as in graph databases in general, are just as important as the nodes in terms of data. We treat them as objects whose existence depends on the presence of the nodes they connect. A relationship cannot exist on its own.
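The node and relationship structure described above can be sketched in a few lines of Python. This is only an illustration of the property graph idea, not Neo4j’s internal representation; the class names, the `LIKES` relationship and all sample values are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    # A node may carry one or several labels (its "types").
    labels: set[str]
    # Properties are plain key-value pairs.
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    # A relationship has a name, a direction (start -> end) and its own properties.
    type: str
    start: Node
    end: Node
    properties: dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data: a person who likes a book.
person = Node(
    {"Person"},
    {"given_name": "Anna", "last_name": "Nowak", "age": 34,
     "favorite_books": ["Solaris"]},
)
book = Node({"Book"}, {"title": "Solaris"})
likes = Relationship("LIKES", start=person, end=book, properties={"since": 2019})

print(likes.start.properties["given_name"], "-[", likes.type, "]->",
      likes.end.properties["title"])
```

Note that a `Relationship` cannot even be constructed without its two nodes, which mirrors the rule that a relationship has no independent existence.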
Relationships known from RDBMS (Relational Database Management System) databases usually mean that data in a row of one table refers to a particular row of another table, typically through a foreign key. This allows for cascading operations when editing or deleting data. When normalizing a data model in an RDBMS, we may need to introduce special intermediate tables between two tables in order to bind whole groups of rows together.
Many communities may still cling to the belief that RDBMS are ideally suited to all kinds of tasks. Decision-makers responsible for database selection may have negative attitudes toward alternative data storage methods, and Neo4j has long been considered such an alternative. How can you reasonably choose the database(s) to use in a given project? First and foremost, the choice should be determined by the tasks facing your application.
As early as the application design stage, when the database is being selected, it is worth considering several issues:
Therefore, database selection depends on the basic function of the application being designed, as well as on the nature of the data. Will it be the data of people employed in a particular organization, who co‑create its organizational structure, and will the application you are creating be used to ensure efficient document circulation between employees? Or maybe you have to develop a flight search engine? Or an application that will provide support for a transport company in terms of logistics? Or maybe you received a super-secret order from government special services to implement a system supporting the management of a network of agents and informants? I admit the last example is indeed slightly over the top, but this is mainly because the possibilities provided by graph databases are really enormous.
However, the principle of matching a database to the purpose of a given application is not always observed, and the supporters of new and often more appropriate solutions have to struggle with the skepticism of managers (“After all, no one uses these solutions”).
How surprised those critics must be when they visit the Neo4j project’s home page and see the logos of numerous world-famous brands and government or research institutions, from medical companies to scientific and research institutes to financial, transport, telecommunication and military companies. A full, extensive list and case studies are available on the project website. This is a treat for all data analysis and modeling professionals as well as application architects.
As you can see, the popularity of Neo4j and graph databases is still growing. In many rankings, including my own, Neo4j is one of the leading solutions of this kind. It is worth mentioning that the creators of Neo4j have made clustering and cloud deployment convenient, and the most common setup is Neo4j hosted on AWS, which also speaks in its favor.
It is worth remembering that Neo4j, as a tool, is strongly supported by both modern development tools and popular frameworks, such as Spring Framework through the “Spring Data Neo4j” project. The fact that graph databases are being developed so intensively bodes well for the future.
Just as RDBMS databases use the SQL query language, Neo4j uses the Cypher language. Both are declarative languages. While Cypher is syntactically similar to SQL in many respects, one of the most frequently cited differences is the use of the MATCH keyword instead of SELECT. Another is that relationships are written, quite literally, as arrows.
Cypher is a very flexible language in terms of query building capabilities. Conditional matching demonstrates this well: in SQL, we usually place the condition in the WHERE clause, whereas in Cypher it can be included right at the stage of node declaration. We can also create interesting and useful queries composed of stages, using the WITH clause.
The readability of queries is very important, often even crucial, and we find it here. Also noteworthy is how easy it is to write queries for people who understand graphs and have written SQL queries before.
The ticket booking application presented below is a real-life example. The problem I faced came up a few years ago, mainly due to decision-makers’ attachment to solutions considered to be proven, and the development team’s reluctance to search for new possibilities.
Project: bus ticket booking application.
End product: seat booking and ticket purchase. Due to the excessive complexity of the topic, however, let’s limit ourselves to the connection search engine only.
Operation of the application: before booking a ticket, the user should first indicate the bus stop from which they want to depart, as well as select the destination stop from the list of available stops.
Modeling: in the case of an RDBMS-compliant approach, we will need at least three tables to model this area. The first one will be a register of bus stops and the second one will be a register of tracks on which the stops are located. In the third table, we will assign particular stops to particular tracks.
Figure 1. RDBMS-compliant table and relationship chart
Here we can notice a typical problem with a table that maps relationships between individual rows of the tables it binds. Its rows contain cells filled with hard-to-read numbers, usually the primary keys of the tables taking part in the relationship. Deciphering the origin and meaning of these numbers can sometimes require considerable effort. The more complex the table, the more effort will be needed. In our example, such a binding table could look like this:
Table 1. Sample fragment of possible content of a ‘track_bus_stop’ table
We can find such “creations” in many-to-many relationships. They can be particularly burdensome when a project has numerous tables, foreign keys and relationships. They also prove problematic when scripts have to carry a large amount of data to populate test database instances for integration tests, which is often a crucial part of development. Preparing new data records while maintaining the relationships in intermediate tables can take a lot of work. The data contained in the rows must be unique or match the master data set, and it must be consistent with other “non-null” data from at least ten other tables. It may also happen that circular references start to appear in the data model, e.g. as a result of inattention. Then, without disabling all constraints on the database, it is not possible to continue adding new data or even to import correct data into it.
Let’s move on to the first step using the connection search engine. A passenger will search for all possible tracks assigned to a selected bus stop.
Listing 1. Query in SQL to search for the names of all tracks to which the sample “Pstrągowa” bus stop belongs
What interesting thing has just happened? At this stage, the query has merged data from three separate tables into one, and we can now choose the rows that contain the stop we are looking for. Since each track is defined by the N stops that make it up, and every stop is paired with the name of its track, we reject all the rows in which the assigned stop is different from the one we indicated.
Below is an example of a result table before stripping away the rows with a different bus stop name than the one given:
Table 2. Table showing a set of the most important data from the track and bus stop tables compiled together using the ‘track_bus_stop’ table
Having rejected the rows in which the bus stop name does not correspond to the one we indicated, we obtain the presented set of rows, which should then be appropriately trimmed by rejecting any repetitions.
Table 3. Set of unique track names to which the indicated bus stop belongs
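The join, filter and de-duplication steps described above can be tried out with Python’s built-in `sqlite3` module. This is a minimal sketch and not the query from Listing 1: only the `track_bus_stop` table name and the “Pstrągowa” stop come from the article, while the other table and column names and all sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE track (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE bus_stop (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Joining table: every row is just a pair of keys.
    CREATE TABLE track_bus_stop (
        track_id    INTEGER NOT NULL REFERENCES track(id),
        bus_stop_id INTEGER NOT NULL REFERENCES bus_stop(id)
    );
    -- Hypothetical sample data.
    INSERT INTO track VALUES (1, 'T1'), (2, 'T2'), (3, 'T3');
    INSERT INTO bus_stop VALUES (10, 'Pstrągowa'), (11, 'Stop A'), (12, 'Stop B');
    INSERT INTO track_bus_stop VALUES
        (1, 10), (1, 11), (2, 10), (2, 12), (3, 11), (3, 12);
""")

# Merge the three tables, keep only rows for the chosen stop,
# then drop repeated track names with DISTINCT.
TRACKS_FOR_STOP = """
    SELECT DISTINCT t.name
    FROM track AS t
    JOIN track_bus_stop AS tbs ON tbs.track_id = t.id
    JOIN bus_stop AS bs ON bs.id = tbs.bus_stop_id
    WHERE bs.name = ?
    ORDER BY t.name
"""
track_names = [name for (name,) in conn.execute(TRACKS_FOR_STOP, ("Pstrągowa",))]
print(track_names)  # ['T1', 'T2']
```

Even in this small form, the stop name only becomes visible after two joins, because the joining table itself holds nothing but numeric keys.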
Let’s imagine now that our passenger needs a list of all the bus stops that can be reached by getting on the bus at the indicated stop. Below is a sample query that meets the expectations of the application user.
Listing 2. Sample query returning a list of unique bus stops that can be reached from stop X
In the above query, we have to use two subqueries for one projection operation. In total, there are six table joins using the JOIN command. The complexity of this query is an undoubted disadvantage. Another disadvantage is that SQL constructs are not intuitive when we want to get a slice of the data set describing the reality we are investigating. The result of the query in our example will be the set of data contained in the table below.
Table 4. Sample result set of possible bus stops to which the user will depart from the given initial stop
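For readers who want to experiment, here is a simplified, runnable variant of the “reachable stops” search, again using Python’s `sqlite3` module. It is not the query from Listing 2: it only covers rides without changing buses, and it assumes a `position` column that stores the order of stops along a track. That column, the other column names and the sample data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE track (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE bus_stop (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- 'position' (order of the stop along the track) is an assumed column.
    CREATE TABLE track_bus_stop (
        track_id    INTEGER NOT NULL REFERENCES track(id),
        bus_stop_id INTEGER NOT NULL REFERENCES bus_stop(id),
        position    INTEGER NOT NULL
    );
    INSERT INTO track VALUES (1, 'T1'), (2, 'T2');
    INSERT INTO bus_stop VALUES
        (10, 'Pstrągowa'), (11, 'Stop A'), (12, 'Stop B'), (13, 'Stop C');
    -- T1 runs Stop A, Pstrągowa, Stop B
    -- T2 runs Pstrągowa, Stop B, Stop C
    INSERT INTO track_bus_stop VALUES
        (1, 11, 1), (1, 10, 2), (1, 12, 3),
        (2, 10, 1), (2, 12, 2), (2, 13, 3);
""")

# For every track that contains the origin stop, take the stops
# that come later on the same track, without duplicates.
REACHABLE = """
    SELECT DISTINCT dest.name
    FROM bus_stop AS origin
    JOIN track_bus_stop AS o ON o.bus_stop_id = origin.id
    JOIN track_bus_stop AS d ON d.track_id = o.track_id
                            AND d.position > o.position
    JOIN bus_stop AS dest ON dest.id = d.bus_stop_id
    WHERE origin.name = ?
    ORDER BY dest.name
"""
reachable = [name for (name,) in conn.execute(REACHABLE, ("Pstrągowa",))]
print(reachable)  # ['Stop B', 'Stop C']
```

The joining table has to be joined to itself, once for the origin and once for the destination, which is where much of the complexity of this kind of query comes from.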
Queries from both previous listings are only an introduction to other operations leading to a ticket purchase. Further queries may sometimes be even more complex – especially once we create additional tables to store information about ticket prices depending on the chosen departure and destination stop, type of track, run, night-time or e.g. periodic discounts on given sections of the journey.
Let’s take a look at the analyzed data model from the perspective of graph objects. What is the significance of the fact that something can be described using a graph? As I said at the beginning, a graph is a set of nodes and their mutual directed relationships. In the context of the analyzed bus connection search engine, our sample nodes and relationships can be presented in this way using Neo4j:
Figure 2. Graph showing the bus stops as nodes, along with their mutual defined relationships
The stops are nodes, and the road that leads from one stop to another naturally determines the relationship that exists between them. Each road between two stops is served by particular tracks, so we will treat the track as a property of the relationship determined by the road.
Listing 3. Query in Cypher that allows us to search for all tracks to which the indicated bus stop has been assigned
Sample result of the above query:
Table 5. Sample result of the query used to find all tracks to which the indicated bus stop belongs
At first glance, the Listing 3 query is much more intuitive than the SQL one. It is certainly also less complicated. In the above query, we try to select all LEADS TO-type relationships to other bus stops – those coming out of the one whose name corresponds to the stop indicated in the query. We then return their tracks using the RETURN keyword.
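To make this description concrete, the sketch below shows what a Cypher query of this shape might look like and mimics its meaning on a small in-memory list of relationships, so that it runs without a Neo4j server. The `BusStop` label, the `LEADS_TO` type, the `track` property and the sample data are assumptions; the Cypher text is kept as a string for comparison only and is not executed.

```python
# Cypher text a query of this kind might use (label, relationship type and
# property names are assumptions; it is shown for comparison, not executed).
TRACKS_QUERY = """
MATCH (s:BusStop {name: $name})-[r:LEADS_TO]->(:BusStop)
RETURN DISTINCT r.track AS track
"""

# Directed relationships as (start stop, end stop, track) triples.
# The track is stored on the relationship, not on the stops.
LEADS_TO = [
    ("Stop A", "Pstrągowa", "T1"),
    ("Pstrągowa", "Stop B", "T1"),
    ("Pstrągowa", "Stop B", "T2"),
    ("Stop B", "Stop C", "T2"),
]


def tracks_from(stop_name):
    """Tracks on relationships that leave the given stop, without duplicates."""
    return sorted({track for start, _end, track in LEADS_TO if start == stop_name})


print(tracks_from("Pstrągowa"))  # ['T1', 'T2']
```

The condition on the stop name sits inside the node pattern itself, and no joining table is involved: the relationship carries the track directly.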
Let’s go to step two, which should be done by the user in order to get a list of possible stops that the bus will drive through. In Neo4j, we will get this list by, for example, executing a query like the one presented in Listing 4.
Listing 4. The query will return the names of all the bus stops that can be reached from the one indicated by the user.
We can interpret the above query as follows: “Since there is a relationship between the indicated bus stop and the subsequent ones in the form of a road connecting them, then return all the subsequent stops to me until the last one.”
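This “follow the road until the last stop” reading corresponds to a variable-length pattern in Cypher. The sketch below shows one possible form of such a query as a string (not executed) and reproduces its meaning with a breadth-first traversal in plain Python. The label, relationship type and sample stops are assumptions made for the illustration.

```python
from collections import deque

# One possible Cypher form (names are assumptions; shown for comparison only).
# The '*' after the relationship type means "one or more hops".
REACHABLE_QUERY = """
MATCH (s:BusStop {name: $name})-[:LEADS_TO*]->(d:BusStop)
RETURN DISTINCT d.name AS name
"""

# Directed LEADS_TO relationships as an adjacency list (hypothetical data).
LEADS_TO = {
    "Stop A": ["Pstrągowa"],
    "Pstrągowa": ["Stop B"],
    "Stop B": ["Stop C"],
    "Stop C": [],
}


def reachable_from(stop_name):
    """All stops that can be reached by following one or more relationships."""
    seen = set()
    queue = deque(LEADS_TO.get(stop_name, []))
    while queue:
        stop = queue.popleft()
        if stop in seen:
            continue  # already visited, avoids endless loops on cycles
        seen.add(stop)
        queue.extend(LEADS_TO.get(stop, []))
    return sorted(seen)


print(reachable_from("Pstrągowa"))  # ['Stop B', 'Stop C']
```

Stops that lie before the starting point, such as "Stop A" here, are not returned because the traversal follows the direction of the relationships.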
Below is a sample result in the form of a graph and a table of names:
Figure 3. Graph presenting possible destination stops and their mutual relationships
Table 6. Set of node
As we can see in the above examples, queries in Cypher are much shorter than their SQL equivalents presented in the earlier listings. At the same time, they allow us to obtain identical results. An additional advantage is that Neo4j, apart from the tabular view, also offers a view of nodes and their relationships.
If the slice of reality we are working on is a collection of objects and their mutual relationships, it is very likely that we will find a graph structure there, and therefore using a graph database will make sense. In this article, I wanted to present, above all, the intuitive character and ease of constructing queries in Neo4j. I provided an example of an application where I faced the problem of a mismatch between the tool and the project. If I had had my current experience and knowledge of graph databases back then, I would have tried to convince the project’s decision-makers to use Neo4j. This would have protected the client from numerous unnecessary problems.