RDF Guide

GCHQDeveloper81 edited this page Jul 16, 2024 · 3 revisions

RDF is a W3C standard for describing and exchanging directed graph data. RDF revolves around making statements in the form of "triples" which represent the subject, predicate and object of a particular relationship:

:steve       :marriedTo     :bob
(subject)    (predicate)    (object)

The items in the above triple all represent resources. RDF identifies specific resources using IRIs (Internationalized Resource Identifiers) - these are the elements preceded by colons in the above example. Writing IRIs with just a colon is a shorthand that would not work in the real world - each IRI needs to be fully qualified so that it uniquely identifies that particular resource. As well as IRIs, some parts of the triple can also be represented as literals (such as strings) or "blank nodes" (these are discussed below).

An RDF Graph is a set of such triples (using the mathematical definition of the term "Set" - i.e. a graph cannot contain duplicate triples). Triples are the atomic data item in an RDF graph - a subject or a predicate cannot exist in the graph by itself.
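As a rough illustration of the "set" behaviour, here is a minimal Python sketch that models triples as plain tuples. The ex: names are placeholders rather than real IRIs, and this is not how any particular RDF library stores a graph.

```python
# Triples modelled as plain (subject, predicate, object) tuples.
# The "ex:" names are placeholders, not fully qualified IRIs.
graph = set()
graph.add(("ex:steve", "ex:marriedTo", "ex:bob"))
graph.add(("ex:steve", "ex:marriedTo", "ex:bob"))  # duplicate: the set ignores it
graph.add(("ex:bob", "ex:marriedTo", "ex:steve"))  # different triple: kept

triple_count = len(graph)
```

Adding the same triple twice leaves the graph unchanged, so the graph ends up holding two triples rather than three.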

That's pretty much all there is to it - RDF alone does not really do any more than what has already been described, and so is similar to formats such as CSV and SVG in that it is just a way to serialise a particular type of data. Several different serialisation formats have been standardised, with the most popular being "Turtle".

Whilst RDF has been around since the late 1990s and has found widespread adoption in certain industries (government, historic curation...), it is not widely used within general information system development and is still the subject of active research. Even so, it remains one of the two main models for representing directed graphs - the other being the Labelled Property Graph (LPG). The key difference in capability between the two is that properties can be assigned to relationships within an LPG (you can achieve this effect within RDF too, but it requires you to create new triples). Labelled Property Graphs are, however, not based on any kind of agreed standard and therefore do not carry many of the out-of-the-box benefits that you get with RDF.

Triple format

Triples take the form of a subject, predicate and object - in that order and with the following constraints:

  • Subject must be an IRI or a Blank Node.
  • Predicate must be an IRI.
  • Object must be an IRI, a Literal or a Blank Node.
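The constraints above can be sketched as a small check in Python. The IRI, BlankNode and Literal classes and the is_valid_triple function are hypothetical names made up for this illustration - they are not taken from any RDF library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


Term = Union[IRI, BlankNode, Literal]


def is_valid_triple(subject: Term, predicate: Term, obj: Term) -> bool:
    """Check the positional constraints on subject, predicate and object."""
    return (
        isinstance(subject, (IRI, BlankNode))              # no literal subjects
        and isinstance(predicate, IRI)                     # predicates are always IRIs
        and isinstance(obj, (IRI, BlankNode, Literal))     # anything goes for objects
    )


married_to = IRI("http://www.example.org#marriedTo")
ok = is_valid_triple(IRI("http://www.example.org#bob"), married_to, IRI("http://www.example.org#steve"))
bad = is_valid_triple(Literal("Bob"), married_to, IRI("http://www.example.org#steve"))
```

Here ok is True, while bad is False because a literal is not allowed in the subject position.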

Several popular serialisation syntaxes exist including RDF/XML, Turtle, N-Triples and JSON-LD. Consider the following statement:

Bob is a Person who is married to Steve.

This statement can be represented in RDF using various serialisation formats. This first one is called Turtle:

@prefix ex: <http://www.example.org#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:bob 
  a ex:Person ;
  ex:marriedTo ex:steve .

And this one is called JSON-LD:

[
  { "@id": "http://www.example.org#Person" },
  { "@id": "http://www.example.org#steve" },
  {
    "@id": "http://www.example.org#bob",
    "@type": ["http://www.example.org#Person"],
    "http://www.example.org#marriedTo": [
      { "@id": "http://www.example.org#steve" }
    ]
  }
]
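The prefixed names in the Turtle example (such as ex:bob) are shorthand for the full IRIs that appear in the JSON-LD example. As a minimal sketch of that expansion in Python - the expand function is a made-up helper for illustration, and real Turtle parsers handle many more cases (for instance, the bare a keyword is shorthand for rdf:type):

```python
# Prefix table mirroring the @prefix line in the Turtle example.
PREFIXES = {"ex": "http://www.example.org#"}


def expand(name: str, prefixes: dict) -> str:
    """Expand a prefixed name such as 'ex:bob' into a full IRI."""
    prefix, _, local = name.partition(":")
    return prefixes[prefix] + local


full_iri = expand("ex:bob", PREFIXES)
```

With the prefix table above, full_iri becomes http://www.example.org#bob, matching the "@id" used in the JSON-LD.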

The most popular and widely used syntax currently is Turtle due to its readability and conciseness. As well as existing as independent data, RDF can also be embedded directly into HTML pages using a syntax known as RDFa to "mark up" a page with additional semantic detail. Many popular search engines use the RDFa format to allow developers to provide machine-readable meaning to their content.

The RDF standard allows for high interoperability - simply by choosing RDF, developers are able to combine data sources with anyone else who chose RDF, vastly simplifying the process of integrating with other systems or combining distributed data sources. The format is also very flexible, negating many of the data modelling issues that appear when creating brittle schemas within relational databases (such as the difficulty in extending or altering them).

The RDF format also allows data to be very easily streamed, as each triple represents a piece of knowledge that can be ingested and used independently of any of the other triples in the graph.
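To give a feel for this, here is a hedged Python sketch that reads a line-based serialisation (in the style of N-Triples) one triple at a time. The stream_triples function is hypothetical, the ex-style name IRI is made up for the example, and the pattern only covers a simplified subset of the real syntax (no escapes, language tags or datatypes).

```python
import io
import re
from typing import Iterator, TextIO, Tuple

# Simplified terms: IRIs in <...>, blank nodes as _:label, plain "string" literals.
TERM = r'(<[^>]*>|_:\w+|"[^"]*")'
LINE = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$')


def stream_triples(source: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield one (subject, predicate, object) tuple per line, as it is read."""
    for line in source:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"Unsupported line: {line!r}")
        yield match.groups()


data = io.StringIO(
    '<http://www.example.org#bob> <http://www.example.org#marriedTo> <http://www.example.org#steve> .\n'
    '<http://www.example.org#bob> <http://www.example.org#name> "Bob" .\n'
)
triples = stream_triples(data)
first = next(triples)  # usable before the rest of the input has been read
```

Because the function is a generator, each triple can be processed as soon as its line arrives, without loading the whole graph first.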

Quads

A newer addition to RDF adds a fourth facet to a triple - the "graph label" - turning it into a "quad". This allows graphs to be combined together into a single data source whilst preserving the origin of a particular statement.
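A minimal Python sketch of the idea, assuming two made-up source graphs (registry and census) and a hypothetical label helper that attaches the graph label as a fourth tuple element:

```python
# Two source graphs as sets of (s, p, o) tuples; all names are placeholders.
registry = {("ex:steve", "ex:marriedTo", "ex:bob")}
census = {
    ("ex:steve", "ex:marriedTo", "ex:bob"),
    ("ex:bob", "ex:livesIn", "ex:london"),
}


def label(graph, graph_label):
    """Turn each triple into a quad by appending the graph label."""
    return {(s, p, o, graph_label) for (s, p, o) in graph}


# Combine both graphs into one data source without losing where each statement came from.
dataset = label(registry, "ex:registry") | label(census, "ex:census")

# Which graphs assert that Steve is married to Bob?
sources = sorted(
    g for (s, p, o, g) in dataset
    if (s, p, o) == ("ex:steve", "ex:marriedTo", "ex:bob")
)
```

The same triple appears in both source graphs, but as quads the two copies stay distinct, so sources lists both ex:census and ex:registry.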

IRIs

IRIs are IDs used to refer to specific resources. They are largely synonymous with web page URLs, but differ in that they do not necessarily need to refer to a web page (an IRI can, for example, represent a real life person such as :steve or some kind of abstract concept, like :marriedTo). IRIs also allow a wider range of characters than URLs.

IRIs should be dereferenceable - which is just a fancy way of saying "I should be able to use the IRI to look up the thing it represents". A really common and simple way of doing this is to use HTTP, which allows either the user or a computer system to send off a request to the web in order to find out more information about a particular IRI. Often when people are working with RDF they conventionally use www.example.com to refer to an imaginary future web location from which this information might eventually be available.

Sometimes an IRI is just a link to a webpage that talks directly about a particular resource, but it might be that you want an IRI to serve a "dual purpose" of being readable by both computers and humans (and in fact, this is a core stipulation of any system trying to implement linked data). There are a couple of conventions that have risen up around this.

You could vary the response using the Accept request header - maybe the IRI itself links directly to a web page (intended to be consumed by humans) but if you send a request stating that you only accept the text/turtle MIME type, it responds with a redirect to some other HTTP location. (Interestingly, the appropriate redirect status code when using this convention is 303: See Other).
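The server-side decision could be sketched roughly as follows in Python. The respond function and the http://www.example.com/data/bob.ttl location are assumptions made up for this example, and the substring check is deliberately naive - real Accept header handling also has to deal with wildcards and quality values.

```python
def respond(accept_header: str):
    """Return (status, headers) for a request to a hypothetical resource IRI."""
    if "text/turtle" in accept_header:
        # Machine client: redirect to a separate location holding the RDF data.
        return 303, {"Location": "http://www.example.com/data/bob.ttl"}
    # Default: serve the human-readable web page directly.
    return 200, {"Content-Type": "text/html"}


machine_status, machine_headers = respond("text/turtle")
human_status, human_headers = respond("text/html,application/xhtml+xml")
```

A client asking for text/turtle gets a 303 pointing elsewhere, whilst a browser simply gets the page.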

Alternatively, the IRI could instead be a hash IRI - consider these triples (note that foaf is a vocabulary for expressing details about people and their relationships with one another):

http://www.example.com/bob#me rdf:type foaf:Person
http://www.example.com/bob#me foaf:name http://www.example.com/bob#name
http://www.example.com/bob#me foaf:nick http://www.example.com/bob#nickname

Visiting these URLs in a web browser would direct you to both a page and an element within that page. If that element happened to also be decorated with an RDFa attribute, it could be parsed and interpreted by both a human and a computer. The result of visiting http://www.example.com/bob might look like this:

<html prefix="foaf: http://xmlns.com/foaf/0.1/" lang="en">
    <body id="me" typeof="foaf:Person" about="http://www.example.com/bob">
        <dl>
            <dt>Full Name</dt>
            <dd id="name" property="foaf:name">Robert Smith</dd>
            <dt>Nickname</dt>
            <dd id="nickname" property="foaf:nick">Bob</dd>
        </dl>
    </body>
</html>
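The split between "page" and "element within that page" can be seen with Python's standard library. HTTP clients strip the fragment (the part after the #) before sending a request, so a hash IRI always resolves to its containing document. This is just a small sketch using the example IRI from above:

```python
from urllib.parse import urldefrag

# Separate the document URL (what is actually requested) from the fragment
# (which identifies something within the returned document).
document_url, fragment = urldefrag("http://www.example.com/bob#me")
```

Here document_url is http://www.example.com/bob and fragment is me, which lines up with the id="me" attribute in the HTML above.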

Adjacent standards

By choosing RDF as a format, you unlock the potential of using several adjacent standards which can be bolted onto systems as needed:

  • SPARQL - for querying RDF data
  • RDFS - for describing set logic and class hierarchies
  • SKOS - for describing vocabularies
  • OWL - for describing ontologies, therefore unlocking inference
  • SHACL - for describing graph shapes, therefore unlocking validation

Note that this is a non-exhaustive list and that some of the above are still the topic of active research.

It is not necessary to adopt all (or in fact any) of these adjacent standards when using RDF - for example, depending on what you are trying to achieve, you might even be able to get away with using only a subset of RDFS. You can use these standards as and when you need to without needing to think about data structures or access patterns of your model ahead of time, which is something you would have to do with other database standards. Choosing RDF as a format therefore somewhat future-proofs your data model and allows it to cope with potential future requirements. However, this comes at a cost:

  • Scalability can be more difficult with RDF as it is not a format that is optimised for speed (it's optimised for expressiveness and flexibility).
  • RDF models can become very complicated, with the limitations of the format forcing you to create overly-complex representations that differ greatly from the model you originally conceived of.

It certainly carries a high risk of making things more complicated for application developers, but potentially much less complicated for data consumers (your customers) - this is the choice you need to make: where do you want this complexity to exist?

Blank Nodes

A blank node (or b-node) in RDF represents a resource which does not have an IRI because either creating one is not appropriate/useful or because the true identity of the resource is not fully known. It might be that you still want to include such a resource in the graph in order to say something about it.

Here is a set of statements that represent the fact that the imaginary village in the Van Gogh painting "Starry Night" is based on the real village of Saint-Rémy. The blank node is written using the _: prefix.

:Starry_Night :paintedBy :Van_Gogh .
:Starry_Night :depicts _:village .
_:village :basedOn :Saint_Rémy .
_:village :isFictional true .

Using blank nodes in this way allows us to talk about the imaginary village without going to the lengths of creating an IRI. There are a number of other syntaxes available within Turtle to represent blank nodes, and support also exists within the other serialisation formats.
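As a small Python sketch of how the blank node still works as a connecting point in the graph, the statements above can be modelled as tuples and traversed. The objects helper is a made-up function for this illustration.

```python
# The Starry Night statements as (subject, predicate, object) tuples.
graph = {
    (":Starry_Night", ":paintedBy", ":Van_Gogh"),
    (":Starry_Night", ":depicts", "_:village"),
    ("_:village", ":basedOn", ":Saint_Rémy"),
    ("_:village", ":isFictional", "true"),
}


def objects(graph, subject, predicate):
    """Return every object found for the given subject and predicate."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


# Hop through the blank node: what real places is the depicted thing based on?
based_on = {
    place
    for node in objects(graph, ":Starry_Night", ":depicts")
    for place in objects(graph, node, ":basedOn")
}
```

The traversal never needs an IRI for the village itself - it only passes through it to reach :Saint_Rémy.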

Reification

"Reification" is just a way of saying that you want to make a statement about a statement. Consider the following triples about the artist Van Gogh and his painting "Starry Night":

:Van_Gogh :painted :Starry_Night .
:Van_Gogh :notProudOf :Starry_Night .

It is suspected that Van Gogh wasn't too big of a fan of Starry Night - despite its cultural relevance, he wrote about it in letters very little and eventually referred to it as a failure. We might want to capture the justification for this statement somehow. The pattern of reification takes this statement and splits it out into a separate resource, allowing you to then add as many statements about the statement as you want:

_:x rdf:type rdf:Statement .
_:x rdf:subject :Van_Gogh .
_:x rdf:predicate :notProudOf .
_:x rdf:object :Starry_Night .
_:x :justifiedBy "Lack of letters written" .
_:x :justifiedBy "His referring to it as a failure" .

Reification does make a model more complicated and is therefore not a step to be taken lightly if a simple change to the ontology/vocabulary could provide an alternative way to make statements like this easier to express.
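The reification pattern is mechanical enough to sketch as a function. This is a minimal Python illustration only - the reify function and the _:stmt labelling scheme are made up for the example, not part of any RDF library.

```python
import itertools

_counter = itertools.count()  # used to mint fresh blank node labels


def reify(triple, annotations):
    """Return the reification triples for `triple`, plus (predicate, value) annotations."""
    s, p, o = triple
    node = f"_:stmt{next(_counter)}"
    result = [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    ]
    # Statements about the statement hang off the same blank node.
    result.extend((node, pred, value) for pred, value in annotations)
    return result


statements = reify(
    (":Van_Gogh", ":notProudOf", ":Starry_Night"),
    [
        (":justifiedBy", '"Lack of letters written"'),
        (":justifiedBy", '"His referring to it as a failure"'),
    ],
)
```

One original triple turns into six, which gives a feel for how quickly reification grows a model.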

Software

It's important not to get too hung up on the unavailability of RDF software. There currently does not exist any production-ready "jack of all trades" semantic web software that allows developers to take full advantage of RDF and each of its adjacent standards, however there is plenty of software out there that can be "bolted together" to create the required capabilities. It might also be that you can use the RDF format directly, without actually needing any specific software library (a data provider, for example, doesn't need to necessarily use anything but the RDF standard directly).

Whilst RDF-specific databases do exist (called triple-stores) you can just use any relational or NoSQL database to store RDF data. A SQL database with three columns (S, P, O) will store RDF data just fine and this pattern is very commonly used. It is also very common for RDF data simply to exist as files (in much the same way as JSON or CSV files).
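A minimal sketch of the three-column pattern using Python's built-in sqlite3 module. The table name, the choice of a composite primary key to mimic set semantics and the ex: names are all assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The composite primary key stops the same triple being stored twice.
conn.execute(
    "CREATE TABLE triples ("
    "s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL, "
    "PRIMARY KEY (s, p, o))"
)

rows = [
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:marriedTo", "ex:steve"),
    ("ex:bob", "ex:marriedTo", "ex:steve"),  # duplicate, skipped by INSERT OR IGNORE
]
conn.executemany("INSERT OR IGNORE INTO triples VALUES (?, ?, ?)", rows)

# Who is Bob married to?
spouses = [
    o for (o,) in conn.execute(
        "SELECT o FROM triples WHERE s = ? AND p = ?", ("ex:bob", "ex:marriedTo")
    )
]
total = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
```

Only two rows end up stored, and the lookup returns ex:steve.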

One reason you might choose a triple store over a more popular database for RDF is if you need triple-based optimisations (there are some clever/efficient access patterns that triple stores implement out of the box). If you are using several of the RDF-adjacent standards such as SPARQL, some triple stores also come with support for those pre-packaged.