SPARQL and OWL 2 Inference for Neo4j

As part of my master's thesis I developed a new SPARQL plugin for Neo4j.
The existing plugin is implemented as a server plugin and is somewhat
limited, because it does not correctly implement the SPARQL protocol
standards (regarding result formats and RDF input).

The new extension is implemented as an unmanaged extension and fully
implements the SPARQL 1.1 Protocol and the SPARQL 1.1 Graph Store HTTP
Protocol standards. This means that SPARQL 1.1 queries and update queries
are supported, and RDF data can also be updated over HTTP.

Large datasets can be imported in chunks: the extension commits smaller
chunks to the database to reduce memory consumption.
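
If you prefer to split a large upload on the client side as well, the
Graph Store HTTP Protocol allows appending to a graph with POST instead
of PUT. A minimal sketch, assuming a hypothetical large N-Triples file
“large.nt” (N-Triples is line-based, so splitting by lines is safe) and
that the extension accepts the standard “application/n-triples” media
type:

# Split the file into chunks of one million lines each
split -l 1000000 large.nt chunk-

# PUT the first chunk to (re)create the graph, then POST the
# remaining chunks to append them (POST merges per the standard)
curl -X PUT localhost:7474/rdf/graph \
     -H "Content-Type:application/n-triples" --data-binary @chunk-aa
for f in chunk-*; do
  [ "$f" = "chunk-aa" ] && continue
  curl -X POST localhost:7474/rdf/graph \
       -H "Content-Type:application/n-triples" --data-binary @"$f"
done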

Moreover, the extension includes a new approach to OWL 2 inference based
on rewriting SPARQL algebra expressions. For a SPARQL 1.1 query, the
extension rewrites the query in such a way that inferred solutions are
returned as well.

Installing the Extension

To install the extension, download the latest release jar and place it
in the “plugins” folder inside your Neo4j installation. Afterwards, add
the following property to the “conf/neo4j-server.properties” file:

org.neo4j.server.thirdparty_jaxrs_classes=de.unikiel.inf.comsys.neo4j=/rdf

Make sure that your database is empty before starting an import. If you
are unsure, just delete the “graph.db” folder:

rm -rf data/graph.db

Now start the Neo4j database:

bin/neo4j console

This will start the database and log everything to the console.

Now it is possible to interact with the SPARQL extension using the
“/rdf” resources. With the default server configuration, Neo4j listens
on port 7474 on localhost.

Importing RDF Data

For this blog post, let’s use the following two files in Turtle syntax:

tbox.ttl:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix : <http://comsys.uni-kiel.de/sparql/test/> .

: a owl:Ontology .
:hasSpouse a owl:SymmetricProperty .
:hasParent a owl:ObjectProperty .
:hasChildInLaw a owl:ObjectProperty .
:hasParentInLaw a owl:ObjectProperty ;
    owl:inverseOf :hasChildInLaw ;
    owl:propertyChainAxiom ( :hasSpouse :hasParent ) .

abox.ttl:

@prefix : <http://comsys.uni-kiel.de/sparql/test/> .

:Alice :hasParentInLaw :Bob .
:Chris :hasSpouse :Emily .
:Emily :hasParent :Dave .

The first step is to import the ABox into the default graph. This needs
to be done with an HTTP PUT request on the “/rdf/graph” resource:

curl -v -X PUT localhost:7474/rdf/graph \
-H "Content-Type:text/turtle" --data-binary @abox.ttl

Afterwards, the TBox must be imported into the special graph
“urn:sparqlextension:tbox”. This graph will be used to rewrite SPARQL
queries and return inferred solutions. Again, this needs to be an HTTP
PUT request, but this time with the “graph” query parameter set:

curl -v -X PUT \
localhost:7474/rdf/graph\?graph=urn%3Asparqlextension%3Atbox \
-H "Content-Type:text/turtle" --data-binary @tbox.ttl

Querying using SPARQL

Now it is possible to query the data using SPARQL. SPARQL queries must
be sent to the “/rdf/query” resource. The following example uses a POST
request with the query “SELECT ?p1 ?p2 WHERE { ?p1 :hasParentInLaw ?p2 }”
in the HTTP body, which returns all “hasParentInLaw” relations.

curl -v -X POST localhost:7474/rdf/query \
-H "Content-Type: application/sparql-query" \
-H "Accept: text/tab-separated-values" \
-d "PREFIX : <http://comsys.uni-kiel.de/sparql/test/> \
SELECT ?p1 ?p2 WHERE { ?p1 :hasParentInLaw ?p2 }"

The results are returned as tab-separated values for better readability
on the terminal:

?p1 ?p2
<http://comsys.uni-kiel.de/sparql/test/Alice> <http://comsys.uni-kiel.de/sparql/test/Bob>
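
Other standard result formats can be requested via content negotiation,
as defined by the SPARQL 1.1 Protocol. For example, to get the results
in the standard JSON format:

curl -v -X POST localhost:7474/rdf/query \
  -H "Content-Type: application/sparql-query" \
  -H "Accept: application/sparql-results+json" \
  -d "PREFIX : <http://comsys.uni-kiel.de/sparql/test/> \
SELECT ?p1 ?p2 WHERE { ?p1 :hasParentInLaw ?p2 }"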

To also return inferred solutions, the query must be sent to the
resource “/rdf/query/inference”. The extension will then rewrite the
query and return inferred solutions as well:

curl -v -X POST localhost:7474/rdf/query/inference \
-H "Content-Type: application/sparql-query" \
-H "Accept: text/tab-separated-values" \
-d "PREFIX : <http://comsys.uni-kiel.de/sparql/test/> \
SELECT ?p1 ?p2 WHERE { ?p1 :hasParentInLaw ?p2 }"

The results of the inferred request are:

?p1 ?p2
<http://comsys.uni-kiel.de/sparql/test/Alice> <http://comsys.uni-kiel.de/sparql/test/Bob>
<http://comsys.uni-kiel.de/sparql/test/Chris> <http://comsys.uni-kiel.de/sparql/test/Dave>
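
The second row is inferred from the property chain axiom: :Chris
:hasSpouse :Emily and :Emily :hasParent :Dave, hence :Chris
:hasParentInLaw :Dave. Conceptually, the rewritten query resembles the
following sketch (illustrative only, not the extension’s literal
output):

PREFIX : <http://comsys.uni-kiel.de/sparql/test/>
SELECT ?p1 ?p2 WHERE {
    { ?p1 :hasParentInLaw ?p2 }            # asserted triples
  UNION
    { ?p2 :hasChildInLaw ?p1 }             # via owl:inverseOf
  UNION
    { ?p1 (:hasSpouse|^:hasSpouse) ?s .    # symmetric :hasSpouse
      ?s :hasParent ?p2 }                  # via owl:propertyChainAxiom
}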

SPARQL Update queries must be sent to the resource “/rdf/update”.
The following example replaces the triple “:Emily :hasParent :Dave” with
“:Emily :hasParent :Bob”:

curl -v -X POST localhost:7474/rdf/update \
-H "Content-Type: application/sparql-update" \
-d "PREFIX : <http://comsys.uni-kiel.de/sparql/test/> \
DELETE DATA { :Emily :hasParent :Dave }; \
INSERT DATA { :Emily :hasParent :Bob }"
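
The effect of the update can be checked with an ASK query, again using
a standard result format for the boolean answer:

curl -v -X POST localhost:7474/rdf/query \
  -H "Content-Type: application/sparql-query" \
  -H "Accept: application/sparql-results+json" \
  -d "PREFIX : <http://comsys.uni-kiel.de/sparql/test/> \
ASK { :Emily :hasParent :Bob }"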

More information on the supported OWL 2 inference and the available
HTTP resources can be found on the GitHub page.

Mapping RDF to the Property Graph

To store RDF data, it must be mapped to Neo4j’s property graph model.
Fortunately, the Blueprints project already provides an RDF mapping as
part of its Sail implementation. For this mapping, the RDF triples are
interpreted as a directed graph, and each node and edge of that graph is
stored as a node or edge with special properties in the property graph.

Each RDF node is mapped to a Neo4j node. The kind of node (URI, literal,
blank node) is stored in the “kind” property, and the value of the node
is stored in “value”. The following examples show all three variants.

A URI node:

(a {kind: "uri", value: "http://example.com"})

A literal node:

(a {kind: "literal", value: "Text", type: "http://www.w3.org/2001/XMLSchema#string"})

A blank node:

(a {kind: "bnode", value: "genid--b1234"})

The URI of an RDF edge is mapped to the type of the Neo4j edge:

(a)-[p:`http://example.com/property`]->(b)
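
Under this mapping, a triple pattern lookup becomes an ordinary graph
traversal. A hypothetical Cypher equivalent of the “hasParentInLaw”
query from above (illustrative only; the extension translates queries
internally):

// Find all pairs connected by the hasParentInLaw predicate
MATCH (s)-[:`http://comsys.uni-kiel.de/sparql/test/hasParentInLaw`]->(o)
RETURN s.value AS p1, o.value AS p2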

For faster querying, the Blueprints implementation supports indices on
triples. The wiki has more information on this topic.

Performance

The performance of the extension was tested against the Fuseki graph
store. The tests are based on the “Berlin SPARQL benchmark” and were
executed with different dataset sizes (8,498; 75,550; 725,305;
7,073,571 and 70,294,080 triples). While the extension is only about
2.4 times slower than Fuseki on the smallest dataset, it is about 27
times slower on the largest one. The biggest problem seems to be the
inefficient mapping of RDF to Neo4j: the largest dataset, encoded as
N-Triples, is about 17.9 GB in size; imported into Fuseki (using TDB)
it consumes about 9 GB on disk, whereas the same dataset imported with
the extension uses about 390 GB. Because Neo4j has to search and move
much more data, it also needs more time to answer queries.

Future Work

I see two possible directions for future work. One interesting problem
is the mapping from RDF to Neo4j. It may be possible to develop a
better mapping that uses more features of the property graph (e.g.
labels). A more efficient mapping would reduce the size of the data and
could also speed up querying. Another interesting project would be to
port the inference component that rewrites SPARQL algebra expressions
to Jena: the extension is based on Sesame, but Jena also has a SPARQL
algebra implementation that can be used to transform the expressions.
