Position Statement: The RDF* and SPARQL* Approach to Annotate Statements in RDF and to Reconcile RDF and Property Graphs

This post presents my position statement for the W3C Workshop on Web Standardization for Graph Data.

Update (November 18, 2020): Work on a spec that defines the approach has started.

Update (August 6, 2020): Native support for RDF* and SPARQL* has found its way into the following
RDF-related programming libraries: Eclipse RDF4J, Apache Jena, RDF.rb, and N3.js.

Update (April 23, 2020): In addition to the two RDF graph database systems mentioned in
the blog post (Blazegraph and AnzoGraph), two more such systems have added support
for RDF* and SPARQL* in the meantime; these systems are Stardog and GraphDB.

Update (June 27, 2019): We now have a W3C mailing list to discuss questions related to RDF* and SPARQL*.

Update (June 9, 2019): In the meantime, I have defined SPARQL* Update.

The lack of a convenient way to annotate RDF triples and to query such annotations has been a long-standing issue for RDF. Such annotations are a native feature in other contemporary graph data models (e.g., edge properties in the Property Graph model), and there exist a number of popular use cases, including the annotation of statements with certainty scores, weights, temporal restrictions, and provenance information. To mitigate the lack of native support for such annotations in the purely triple-based data model of RDF, several proposals exist to capture such annotations in the RDF context (e.g., RDF reification as proposed in the RDF specifications, singleton properties, single-triple named graphs). However, these proposals have a number of shortcomings, and none of them has yet been adopted as a (de facto) standard.

We are proposing an alternative approach that is based on nesting of RDF triples and of query patterns. This approach has already attracted interest not only in the RDF and Semantic Web research community (as indicated by some blog posts and by winning the People’s Choice Best Poster Award at ISWC 2017) but also among RDF system vendors. In fact, the approach is already supported in two commercial RDF graph database systems (Blazegraph and AnzoGraph) and in an extension of the popular Open Source framework Apache Jena. Important properties of the approach are that

  1. it allows for a compact representation of data and queries,
  2. it is backwards-compatible with the aforementioned existing approaches,
  3. it can serve naturally as a foundation for achieving interoperability between the RDF and the Property Graphs world, and
  4. it can be employed as a common conceptual framework to capture more specific annotation-related extensions of RDF and SPARQL (such as temporal or probabilistic extensions).

The goal of this position statement is to bring the approach to the attention of the workshop attendees and to put on the workshop agenda a discussion regarding standardization opportunities for this approach.

In the remainder of this position statement we outline the approach and elaborate more on its properties.

Overview of the Approach

The basis of the proposed approach is to extend RDF with a notion of nested triples. More precisely, with this extension, called RDF*, any triple that represents metadata about another triple may directly contain this other triple as its subject or its object. For instance, suppose we want to capture a statement indicating the age of Bob together with the metadata fact that we are 90% certain about this statement. RDF* allows us to represent both the data and the metadata by using a nested triple as follows.

   <<:bob foaf:age 23>> ex:certainty 0.9 .

Notice that we write the nested triple using an extension of the RDF Turtle syntax in which any embedded triple is enclosed by the strings ‘<<’ and ‘>>’. This extended syntax is called Turtle* and it is specified in Section 3.3 of our technical report.
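
For contrast, consider how the same data and metadata could be captured with the RDF reification vocabulary mentioned above; the following is one possible rendering (a sketch only, using a blank node for the statement resource and the standard rdf: prefix):

   _:s rdf:type      rdf:Statement ;
       rdf:subject   :bob ;
       rdf:predicate foaf:age ;
       rdf:object    23 ;
       ex:certainty  0.9 .

Instead of a single nested triple, five triples and an auxiliary resource are needed to relate the statement to its certainty score.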

Given the outlined notion of RDF*, which supports (arbitrarily deep) nesting of triples, the crux of the proposed approach is to extend the RDF query language SPARQL accordingly. That is, in the extended query language, called SPARQL*, triple patterns may also be nested, which gives users a query syntax in which accessing specific metadata about a triple is just a matter of mentioning the triple in the subject (or object) position of a metadata-related triple pattern. For instance, by adopting the aforementioned syntax for nesting, we may query for all age-statements and their respective certainty as follows (prefix declarations omitted).

   SELECT ?p ?a ?c WHERE {
     <<?p foaf:age ?a>> ex:certainty ?c .
   }

Notice that the query is represented in a very compact form; in particular, in contrast to the corresponding queries for other proposals (e.g., RDF reification, singleton properties), this compact syntax does not require users to write verbose patterns or other constructs whose only purpose is to match artifacts that these proposals introduce to establish the relationship between a triple and the metadata about it.
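
For comparison, here is a sketch of how the same query could be written against data that uses RDF reification as in the example above (prefix declarations omitted; the exact shape depends on how the reification is modeled):

   SELECT ?p ?a ?c WHERE {
     ?s rdf:type      rdf:Statement .
     ?s rdf:subject   ?p .
     ?s rdf:predicate foaf:age .
     ?s rdf:object    ?a .
     ?s ex:certainty  ?c .
   }

Here, four of the five triple patterns serve only to identify the reified statement.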

In addition to nested triple patterns, SPARQL* introduces a new type of BIND clauses that allows us to express the example query in the following, semantically equivalent form.

   SELECT ?p ?a ?c WHERE {
     BIND (<<?p foaf:age ?a>> AS ?t)
     ?t ex:certainty ?c .
   }

The latter example also highlights the fact that in SPARQL*, variables in query results may be bound not only to IRIs, literals, or blank nodes, but also to full RDF* triples. For a detailed formalization of SPARQL*, including the complete extension of the full W3C specification of SPARQL, refer to Sections 4-5 of the technical report.

Properties of the Approach

We emphasize three orthogonal perspectives on the proposed approach:

  1. On the one hand, RDF* and SPARQL* may be understood, and used, simply as syntactic sugar on top of RDF and SPARQL. That is, any RDF*-specific syntax such as Turtle* may be parsed directly into plain RDF data that uses RDF reification or any of the other approaches to annotate statements in RDF. Likewise, SPARQL* queries may be rewritten into ordinary SPARQL queries. Based on such conversions, RDF* and SPARQL* can easily be supported by implementing wrappers on top of existing RDF triple stores. Then, users can query either RDF* data or RDF data with other forms of statement annotations, both by using SPARQL*. The formal mappings necessary as a foundation of such wrapper-based implementations have already been defined and studied, and there exists an initial set of conversion tools.
  2. On the other hand, the proposal may also be conceived of as a new abstract data model in its own right. As such, it may be implemented by developing techniques to execute SPARQL* queries directly on a physical storage model that is designed to support RDF* natively. The formal foundations of this perspective exist; that is, we have defined the RDF* data model and a formal semantics of SPARQL*. Moreover, the RDF graph database systems Blazegraph and AnzoGraph provide native support for RDF* and SPARQL*, and so does the aforementioned extension of Apache Jena.
  3. A third perspective on the approach is that it presents a step towards closing the gap between the RDF and the Property Graphs world. That is, by extending RDF and SPARQL with a feature that is similar to the notion of edge properties in Property Graphs, the approach may serve as an abstraction for integrating RDF data and Property Graphs. In fact, in addition to the aforementioned RDF*-to-RDF mappings, there already exist formal definitions of direct mappings from RDF* to Property Graphs and vice versa, and these mappings have been implemented in conversion tools.

Lightweight summary of our paper “Semantics and Complexity of GraphQL”

While the ease of use of its query language has certainly contributed to the phenomenal adoption of GraphQL, little has been known about the fundamental computational properties of this language. To fill this gap, together with Jorge Pérez from the Universidad de Chile, we have recently embarked on a formal study of the language. The result of our work is a research paper titled “Semantics and Complexity of GraphQL” (download a pre-print of our paper). After a typical academic review process, this paper has been accepted for presentation at The Web Conference 2018. Due to the highly technical, math-heavy nature of this work, the paper may not be easily digestible for a broader, Web-developer audience. Even the summary of the paper by Adrian Colyer of “the morning paper” is still fairly technical. Therefore, in this post I aim to provide a more lightweight, high-level summary of the results of our work (if you prefer a more visual summary, you may want to check out the video recording of my talk at GraphQL Europe). Hereafter, when I talk about the GraphQL language, I mean the query-specific part of the language that the GraphQL framework introduces to retrieve data from a Web server.

We have studied this language in terms of a number of typical computation-related questions that people focus on when analyzing query languages. In Theoretical Computer Science, such questions are captured in the form of formally defined problems, and the subject of study is to achieve a mathematically proven understanding of how difficult it is to solve any possible instance of a given problem. A typical example of such a problem in the context of database query languages is to decide, for a given query, a given database, and a candidate answer, whether that answer is indeed in the query result. You may notice that this decision problem is related to the task of producing the query result. In fact, the difficulty of this problem (for a given query language) is one of the most commonly used formal measures of how computationally complex a query language is. The problem is usually called the evaluation problem of a query language.

The Evaluation Problem of GraphQL

One of our contributions in the paper has been to show that the evaluation problem of the GraphQL language is “very simple”. More precisely, we show that the problem is NL-complete. It is well known that NL-complete problems can be solved in practice by programs that use a high degree of parallelism. Hence, this gives us our first positive observation about the GraphQL language. Admittedly, this is a very abstract finding. However, to appreciate this property of the GraphQL language we may look at other well-known query languages. For instance, we may consider the Relational Algebra, which is the formal foundation of SQL. It has been shown that the evaluation problem of the Relational Algebra is PSPACE-complete. The same has been shown for version 1.0 of the RDF query language SPARQL. Problems in this class are highly intractable. As another example, for general conjunctive queries the evaluation problem is NP-complete. NP-complete problems are commonly believed to be less complex than PSPACE-complete problems but still intractable. In contrast, there is a class called PTIME which consists of problems that are known to be tractable; that is, for each such problem there exists an algorithm whose runtime is polynomial in the size of the input. This is the case for SPARQL basic graph patterns (which comprise the conjunctive fragment of SPARQL), for which the evaluation problem is in PTIME. How does this compare to the NL-completeness of the evaluation problem of GraphQL? The good news for GraphQL fans is that NL is an even lower complexity class than PTIME.

The Enumeration Problem of GraphQL

After having done this initial theoretical comparison of the GraphQL language to other query languages, we looked at another, more practical computation-related problem. This problem is called the enumeration problem and is concerned with how difficult it is to produce the complete result of a query. To study this problem we made our life a bit easier by considering only a particular class of GraphQL queries. We call the queries in this class “non-redundant queries in ground-typed normal form”. Informally, these are queries for which the process of field collection (as specified in Section 6.3.2 of the GraphQL spec) is not necessary. For these queries, and any arbitrary set of data, we have shown that the respective query results can be produced symbol by symbol with only a constant-time delay between symbols. This property is the best that one can hope for in a query language, and not many query languages possess it!

At this point it has to be emphasized that focusing on non-redundant queries in ground-typed normal form is not a limitation of our work. On the contrary! As we also show in our paper, any GraphQL query can be rewritten into a non-redundant query in ground-typed normal form that is guaranteed to produce the same result. For the necessary rewriting rules refer to Proposition 3.9 in our paper.

Given our positive finding regarding the constant time delay, we can also infer the following property of the GraphQL language: The time required to produce the complete query result (for any non-redundant query in ground-typed normal form) depends linearly on the size of this result. This is another highly desirable property of a query language!

The Issue of Exponential Result Size Blow-up in GraphQL

The only caveat regarding the latter finding is that there are cases in which the size of the result of a GraphQL query depends exponentially on the size of the query. For a simple example, consider querying data with a GraphQL schema whose query type has a field, say start, that allows us to retrieve data about a person, say Alice; the data about Alice has a scalar field called name with the value Alice and another field called knows whose value is an array of two data objects representing two other persons; each of these data objects in turn has a knows field that refers back to Alice. Then, for a query of the form

start { knows { knows { ... { knows { name } } ... } } }

in which 2×N knows fields are nested, the result contains the value Alice 2^N times! In other words, increasing the size of the query linearly (adding two more levels of nesting in each step) causes the size of the query result to increase exponentially. This is not just a theoretical issue. In the introduction of our paper we describe an experiment in which we have observed the issue when querying the public GraphQL API of GitHub. As a side note, and in all fairness to the GraphQL language, such an exponential result size blow-up is possible in many query languages (all it takes is the ability to express conjunctive queries).
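
To make the blow-up concrete, here is a sketch of the response for the smallest case (two nested knows fields, i.e., N=1), assuming the knows field is list-valued in the schema; the value Alice already appears twice, and every additional pair of nesting levels doubles that number:

start: {
  knows: [
    { knows: [ { name: Alice } ] }
    { knows: [ { name: Alice } ] }
  ]
}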

While we are the first to formally quantify the possible extent of the result size blow-up of GraphQL, the basic issue of potentially huge results has been recognized before in the GraphQL community. Existing approaches to address this issue are to restrict queries i) based on their nesting depth, ii) based on a calculation of the maximum possible number of result nodes at each level of the nesting depth, or iii) based on some cost estimation or complexity estimation for which users have to provide various cost factors for different elements of GraphQL schemas. Notice that all these approaches use heuristics and none of them takes into account the actual data being queried. As a consequence, these approaches fall short in providing a robust solution for the issue as they can fail in both directions: discarding requests for which the response may be of a manageable size, and allowing requests for which producing the complete response is too resource-intensive.

Our Solution to the Result Size Issue

Instead of relying on data-agnostic heuristics and estimates, we introduce an approach that is accurate. This approach comes in the form of an algorithm that returns the exact size of the result of any given non-redundant GraphQL query in ground-typed normal form (see above) over any possible data set, without actually producing this query result. We use this algorithm to prove that the result size can be computed in time that is linear in the product of the query size and the data set size, and our algorithm achieves this bound. The idea of the algorithm is to recursively consider every subquery and to add up the result sizes of the subquery for every part of the queried data at which the subquery has to be evaluated. During this recursive process, data objects are labeled with the result sizes of the subqueries evaluated at them. These labels are then used to avoid repeatedly calculating the same subquery result sizes.

As an example, consider some simple example data that may be illustrated as the following graph of objects (where the node named r in the graph corresponds to the query object that would have to be specified in a GraphQL schema for this data).
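
Sketched as a list of labeled edges (a reconstruction based on the walkthrough below), this graph of objects looks roughly as follows:

r --start--> u
u --advisor--> v
u --friend--> v
v --univ--> w
w: { name: UCh }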

Moreover, consider the following GraphQL query.

query {
  start {
    advisor {
      univ { name }
    }
    friend {
      univ { name }
    }
  }
}

The result of this query over our example data would be the following JSON object.

start: {
  advisor: {
    univ: { name: UCh }
  }
  friend: {
    univ: { name: UCh }
  }
}

Now, instead of immediately starting to produce this result, suppose we first want to calculate its size. We measure the size of GraphQL query results in terms of symbols, where every field name counts as a symbol, and so does every scalar value and every special character (such as colons and curly braces). For instance, the aforementioned result of our example query happens to be of size 26. However, we want to calculate this size without already having produced the query result. To this end, we may first look at the root field start of the query object, and we observe that the number of result symbols that would be produced from this field is 4+x where the number 4 covers the symbols in the first and the last line of the example result and x is the sum of the result symbols obtained from going to the data object u and evaluating each of the two subqueries, q1 = advisor{univ{name}} and q2 = friend{univ{name}}. To make the following explanation more concise, let us denote the size of these two subquery results by writing size(q1,u) and size(q2,u), respectively. Thus, we know that the size of the complete query result is equivalent to

4 + size(q1,u) + size(q2,u)

What remains is to calculate the sizes of the two subquery results, which we can do recursively using reasoning similar to the above: First, let’s focus on size(q1,u). By looking at the data object u, we observe that the root field advisor of subquery q1 is present in u. Therefore, we know that the subquery result will contain 4 result symbols from the advisor field plus the symbols obtained from going to data object v and evaluating the subquery q3 = univ{name}. Hence, we have that

size(q1,u) = 4 + size(q3,v)

and we need to enter the recursion again to calculate size(q3,v). In this case, we have that

size(q3,v) = 4 + size(q4,w)

with subquery q4 = name, and another recursion step is needed for size(q4,w). This step now brings us to a base case of the recursion because, by looking at data object w, we see that the result of subquery q4 is name:UCh and, thus, contains exactly 3 symbols. At this point, our algorithm adds a label to data object w to indicate that the result size of subquery q4 at this object is 3. Next, the recursion comes back to calculating the value of size(q3,v), for which it now becomes clear that this value is 7. So, data object v is labeled to indicate that the result size of subquery q3 at this object is 7. Returning now to the calculation of size(q1,u), it becomes clear that size(q1,u)=11 and the data object u is labeled accordingly.

Being back at the initial step of the recursion now, we still have to calculate the value of size(q2,u). Observe that q2 contains the subquery q3 that we have seen before and it also has to be evaluated at the same data object as before (that is, object v). Therefore, to calculate

size(q2,u) = 4 + size(q3,v)

it is now possible to obtain the value of size(q3,v) by reading the corresponding label at object v instead of going into the recursion again. The remaining steps are then straightforward: size(q2,u) = 4 + 7 = 11 and, thus, the size of the complete query result is 4 + 11 + 11 = 26, which matches the size determined above.

As can be observed from this example, labeling the data objects during the execution of our algorithm enables the algorithm to avoid repeating the same calculations and, thus, to achieve the polynomial runtime (even in cases in which actually producing the query result would have an exponential runtime). For a pseudo-code representation of the complete algorithm refer to our paper. We also have a prototypical JavaScript implementation of the algorithm.
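
To make the idea more tangible, here is a minimal TypeScript sketch of the algorithm (this is neither the pseudo code from the paper nor the JavaScript prototype; it assumes a heavily simplified GraphQL graph in which every field is either scalar-valued or points to exactly one other object, and it counts symbols exactly as in the example above):

// All names and types in this sketch are illustrative, not taken from the paper.
type NodeId = string;

// A simplified GraphQL graph: every field of a data object either holds a
// scalar value or points to exactly one other object (lists, arguments,
// fragments, and type information are omitted).
interface DataObject {
  scalars: Record<string, string>;   // e.g., { name: "UCh" }
  edges: Record<string, NodeId>;     // e.g., { univ: "w" }
}
type Graph = Record<NodeId, DataObject>;

// A (sub)query maps field names to subqueries; a scalar field maps to null.
type Query = { [field: string]: Query | null };

function computeResultSize(graph: Graph, rootId: NodeId, query: Query): number {
  // Labels: for every data object, the sizes of subqueries already evaluated
  // at it, keyed by a canonical (stringified) form of the subquery.
  const labels = new Map<NodeId, Map<string, number>>();

  function size(nodeId: NodeId, q: Query): number {
    const key = JSON.stringify(q);
    let nodeLabels = labels.get(nodeId);
    if (!nodeLabels) {
      nodeLabels = new Map<string, number>();
      labels.set(nodeId, nodeLabels);
    }
    const cached = nodeLabels.get(key);
    if (cached !== undefined) return cached;        // reuse an earlier label

    const obj = graph[nodeId];
    let total = 0;
    for (const [field, sub] of Object.entries(q)) {
      if (sub === null) {
        if (field in obj.scalars) total += 3;       // field name, colon, value
      } else if (field in obj.edges) {
        // field name, colon, opening and closing brace, plus the size of the
        // nested result at the referenced object
        total += 4 + size(obj.edges[field], sub);
      }
    }
    nodeLabels.set(key, total);                     // label the object for reuse
    return total;
  }

  return size(rootId, query);
}

// The worked example from above: r --start--> u, u --advisor/friend--> v,
// v --univ--> w, and w has name = "UCh".
const exampleGraph: Graph = {
  r: { scalars: {}, edges: { start: "u" } },
  u: { scalars: {}, edges: { advisor: "v", friend: "v" } },
  v: { scalars: {}, edges: { univ: "w" } },
  w: { scalars: { name: "UCh" }, edges: {} },
};
const exampleQuery: Query = {
  start: {
    advisor: { univ: { name: null } },
    friend: { univ: { name: null } },
  },
};
console.log(computeResultSize(exampleGraph, "r", exampleQuery));   // prints 26

Run on the worked example, the sketch returns 26; the labels map is what allows size(q3,v) to be read off at object v when the friend branch is processed, instead of being recomputed.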

Given this algorithm, we propose that GraphQL servers should perform the following steps for any given query (a rough code sketch follows the list):

  • First, compute the size of the response, which can be done efficiently by using our algorithm.
  • If the size is too big, reject the query.
  • Otherwise, report the size to the client, and
  • Produce and send the response byte by byte.
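
As a rough sketch of this workflow (reusing computeResultSize and the types from the sketch above; the size threshold and the returned messages are made-up placeholders rather than anything prescribed by the paper):

const MAX_RESULT_SIZE = 1_000_000;   // symbols; an arbitrary, assumed server-side policy

function handleRequest(graph: Graph, rootId: NodeId, query: Query): string {
  // Step 1: compute the exact response size without producing the response.
  const responseSize = computeResultSize(graph, rootId, query);
  // Step 2: reject the query if the response would be too large.
  if (responseSize > MAX_RESULT_SIZE) {
    return `rejected: the response would contain ${responseSize} symbols`;
  }
  // Steps 3 and 4: a real server would report the size to the client first and
  // then produce and stream the response incrementally.
  return `ok: response size ${responseSize}`;
}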

Conceptual Contributions

Before wrapping up, I would also like to highlight some of the conceptual contributions that we have made in addition to all the aforementioned technical contributions. As a foundation of our work, we have developed a formally more precise definition of the GraphQL language. The main reason why this was necessary is that the semantics of GraphQL queries (i.e., the definition of what the expected result of any given query is) is given in the GraphQL specification by means of a recursive program specified by pseudo code. This recursion is based on an operation to resolve any field in a query. Surprisingly, this operation is not fully specified and, instead, simply assumes access to an “internal function […] for determining the […] value of [the] field.” While the lack of a more precise definition of this internal function is likely intentional (to allow for implementations of GraphQL on top of arbitrary database back-ends), it makes a systematic analysis of the GraphQL language unworkable. As a basis for providing a well-defined formalization of the language, we had to introduce a formal model that defines the notion of a so-called GraphQL graph. This notion arises from the realization that GraphQL enables users to query an underlying data set in terms of a virtual, graph-like view of this data set. The form of this view depends on the GraphQL schema being used, and the resolver functions that implement the schema can be understood to establish the view. Our notion of a GraphQL graph is an abstraction to conceptualize such a view independent of its relationship to the underlying data set. As a side note for readers who are familiar with the notion of Property Graphs as supported by popular graph database systems like Neo4j and by the Apache TinkerPop graph processing framework, a GraphQL graph is very similar to a Property Graph (with a few additional, GraphQL-specific features). The technical details of all these definitions can be found in our paper.

Thanks

As a final remark, I want to thank the GraphQL community (in particular, the initial creators of the original GraphQL spec) for the excellent documentation of GraphQL which has made our work of formalizing the language a pleasant journey.

A Foundation for Comparing Linked Data Fragment Interfaces

The Linked Data Fragments (LDF) conceptual framework is an attempt to address the realization in the Semantic Web community that SPARQL endpoints (i.e., Web services that execute any given SPARQL query over an internal RDF dataset) are perhaps not the ultimate solution for providing public query access to datasets on the Web. The trouble with these endpoints is that they may easily be overloaded when trying to answer arbitrary SPARQL queries, in particular when these queries are complex or multiple clients issue queries concurrently. The main idea of the LDF framework is to consider other types of query-based data access interfaces that are more restricted in the types of queries they support and, thus, shift the effort for executing more complex queries to the clients. The initial example of such an interface has been the Triple Pattern Fragment (TPF) interface, which limits the server-side effort to the evaluation of triple patterns only (i.e., the simplest type of patterns that SPARQL queries are built from); any other operation needed for a given query has to be performed by a client-side query execution algorithm that obtains triple pattern results from a TPF server. Several such algorithms have been proposed in the literature, and so have a number of other types of LDF interfaces. Each such proposal aims to hit a sweet spot of trade-offs along multiple dimensions such as server-side load, query throughput, and network load.

While the experimental evaluations in the various LDF-related research papers have provided us with a comprehensive elementary understanding of the existing proposals and their respective trade-offs, I strongly believe there is much more interesting work to be done regarding LDFs.

However, you know what I always thought would be great to have in this context? Since the beginning of the LDF work, I have been looking for a way to achieve a more fundamental understanding of possible LDF interfaces, including interfaces that have not yet been implemented! In particular, I was after a formal framework that allows us to organize LDF interfaces into some kind of a lattice, or perhaps multiple lattices, based on the fundamental trade-offs that the interfaces entail. Such lattices would not only provide us with a more complete picture of how different interfaces compare to each other, they would also be a basis for making more informed decisions about whether it is worth spending the time to implement and study a possible interface experimentally.

As you have likely guessed by now, such a formal framework is not just an idea anymore. Together with Jorge Pérez and Ian Letter at the Universidad de Chile, we have developed an abstract machine model which we have shown to be a suitable foundation for the type of formal framework described above. From a computer science point of view, the most exciting part of this work is that our abstract machine model presents a basis for defining new complexity measures that capture many more aspects of computation in a client-server setting than the classical measure of computational complexity does. We will present this work next week at the 16th International Semantic Web Conference (ISWC). If you are interested in reading about our machine model and how we applied it to study various existing types of LDF interfaces, refer to our research paper about it (and, yes, we have actual lattices in that paper 😉).

Dagstuhl seminar on Federated Semantic Data Management

I have started a blog that I will use to write about events, ideas, and results related to my research. For this first post, I copy below the report on a Dagstuhl seminar that I recently wrote for the blog of our Semantic Web research group at LiU. –Olaf

During the last week of June, I co-organized a Dagstuhl seminar on Federated Semantic Data Management together with Maria-Esther Vidal and Johann-Christoph Freytag. It was a very intense week with a packed schedule and almost no time to catch a breath (exactly how a Dagstuhl seminar should be, I guess 😉).

To start with, we had scheduled a few short, survey-style talks covering a number of topics related to the seminar.

While these talks were meant to establish a common understanding of key concepts and terminology, the major focus of the seminar was on discussions and working groups. To this end, we had invited a good mix of participants from the Semantic Web field, from Databases, as well as from application areas. Due to this mix, we ended up, on several occasions and in different groupings, discussing and reflecting in depth on the fundamental assumptions and the core ideas of federated semantic data management. These general discussions and reflections kept re-emerging not only during the sessions, but also during the meals, the coffee breaks, and the evenings in Dagstuhl’s wine cellar. In my opinion, clearly articulating and repeatedly arguing about these assumptions and ideas is a discussion that the community has long needed to have. After this week, I would guess that many of the participants have a much clearer understanding of what federated semantic data management can and should be, and I am certain that this understanding will be reflected in the reports that the working groups are preparing.

Speaking of working groups, the seminar was structured around four topics addressed by four separate working groups who came together occasionally to report on their progress and obtain feedback from the other groups. The topics were:

  • RDF and graph data models
  • Federated query processing
  • Access control and privacy
  • Use cases and applications

Each of the working groups is currently preparing a summary of their discussions and results. These summaries will become part of our Dagstuhl report (to be published some time in August if all goes well). In addition to this report, we are planning to document the discussions and the results of the seminar in a collection of more detailed publications.

What’s next? We have some ideas to keep the momentum and to advance the discussions around the seminar topics in a more continuous community process. Stay tuned.