Getting Started

If you are new to publishing schema.org markup, here are some general tips for getting started.

Goals

To provide a place for the scientific data community to work out how best to implement schema.org and other external vocabularies on web pages by publishing guidance documents. Pull requests and GitHub issues are welcome!

Approach

  1. To be pragmatic with our use of schema.org and external vocabulary adoption.
  2. To consider schema.org classes and properties first before considering external vocabularies.
  3. Use JSON-LD in our guidance documents for simplicity and terseness as compared to Microdata and RDFa. For more, see Why JSON-LD? from the Conventions document.
  4. Presently, the Google Structured Data Testing Tool enforces use of schema.org classes and properties by displaying an error whenever external vocabularies are used. schema.org proposes linking to external vocabularies using the schema:additionalType property. While this property is defined as a subproperty of rdf:type, its data type is a literal. We encourage the use of JSON-LD '@type' for typing classes to external vocabularies. For more, see Typing to External Vocabularies from the Conventions document.
  5. See Governance for how we will govern the project.
  6. See Conventions for guidance on creating/editing guidance documents.

Prerequisites

  1. We assume a general understanding of JSON.
  2. We assume a basic knowledge about JSON-LD.

JSON-LD is valid JSON, so standard developer tools that support JSON can be used. For help specific to JSON-LD and schema.org, though, there are other resources.

JSON-LD resources https://json-ld.org

Generating the JSON-LD is best done via libraries like those you can find at https://json-ld.org.
There are libraries for JavaScript, Python, PHP, Ruby, Java, C# and Go. While JSON-LD is just JSON and can be generated in many ways, these libraries can generate output that conforms to the JSON-LD specification.
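
Because JSON-LD is just JSON, a document can also be assembled by hand. The following is a minimal sketch that uses only Python's standard json module, not one of the JSON-LD libraries mentioned above. The helper name build_dataset_jsonld and the sample values are hypothetical, and the json module only guarantees well-formed JSON, not conformance to the JSON-LD specification.

```python
import json


def build_dataset_jsonld(name, description):
    """Hypothetical helper: describe a Dataset as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }


doc = build_dataset_jsonld("My Dataset", "An example dataset description.")

# json.dumps produces well-formed JSON; it does not check JSON-LD semantics.
serialized = json.dumps(doc, indent=2)
print(serialized)
```

For anything beyond simple documents (expansion, compaction, flattening), a dedicated JSON-LD library is likely the better choice.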

The playground is hosted at the very useful JSON-LD web site. You can explore examples of JSON-LD and view how they convert to RDF, flatten, etc. Note that JSON-LD is not associated with schema.org. It can be used for much more, so most examples there don't use schema.org, and the site will NOT check whether you are using schema.org types and properties correctly, only that your JSON-LD is well formed.

  3. We assume that you've heard about schema.org and have already decided that it's useful to you.
  4. We assume that you have a general understanding of what may describe a scientific dataset.

Let's go!

Introduction

There is an emerging practice to leverage structured metadata to aid in the discovery of web-based resources. Much of this work is taking place in the context (no pun intended) of schema.org. This approach has extended to the resource type Dataset. This page presents approaches, tools and references that aid in understanding and developing schema.org markup in JSON-LD and its connection to external vocabularies. For a more thorough presentation, visit the Google AI Blog entry of January 24, 2017 at https://ai.googleblog.com/2017/01/facilitating-discovery-of-public.html.

Using schema.org

Modifying web pages to include schema.org as JSON-LD

JSON-LD should be incorporated into the landing page HTML inside the <head></head> as a <script> element of type application/ld+json.

<html>
  <head>
    ...
    <script id="schemaorg" type="application/ld+json">
    {
      "@context": {
        "@vocab": "https://schema.org/"
      },
      "@id": "http://opencoredata.org/id/dataset/bcd15975-680c-47db-a062-ac0bb6e66816",
      "@type": "Dataset",
      "description": "Janus Thermal Conductivity for ocean drilling ...",
      ...
    }
    </script>
    ...
  </head>
  ...
</html>
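
When landing pages are generated by a script or template, the <script> element can be produced programmatically. The sketch below assumes Python and its standard json module. The function name jsonld_script_tag is hypothetical, and the "</" escaping is a defensive choice of this sketch rather than something the guidance above requires.

```python
import json


def jsonld_script_tag(doc, element_id="schemaorg"):
    """Return a <script> element that carries `doc` as JSON-LD."""
    payload = json.dumps(doc, indent=2)
    # Escape "</" so a string value containing "</script>" cannot end the
    # element early. "<\/" is still valid JSON and decodes back to "</".
    payload = payload.replace("</", "<\\/")
    return (
        f'<script id="{element_id}" type="application/ld+json">\n'
        f"{payload}\n"
        "</script>"
    )


dataset = {
    "@context": {"@vocab": "https://schema.org/"},
    "@id": "http://opencoredata.org/id/dataset/bcd15975-680c-47db-a062-ac0bb6e66816",
    "@type": "Dataset",
    "description": "Janus Thermal Conductivity for ocean drilling ...",
}

tag = jsonld_script_tag(dataset)
print(tag)
```

The returned string can then be placed inside the page's <head></head> by whatever templating mechanism the site already uses.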

Data Types

For each schema.org type, such as Person or Event, there are fields that let you specify more information about that type. Each of these fields has an expected data type that is defined in the documentation, as shown in Figure 1.

Figure 1. schema.org field data types
The left column is the name of the field, the middle column is the expected data type, and the right column is the field's description.

Every data type is either a resource or a literal. Resources refer to other schema.org types. For example, a Dataset type has a field called author, whose data type can be either a Person or an Organization. Because Person and Organization are other schema.org types that have their own fields, they are called resources. In JSON-LD, you specify resources by using curly brackets {}:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "author": {
    "@type": "Person",
    "name": "Jane Goodall"
  }
}

In the JSON-LD above, the 'author' is a resource of type 'Person'. Fields that simply have a value are called literal data types. For example, the 'Person' type above has a 'name' of "Jane Goodall", a literal text value.
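
The distinction between resources and literals can be checked mechanically once the JSON-LD is parsed. The sketch below is a simplified illustration in Python with a hypothetical function name, classify_value. It treats any JSON object as a resource unless it carries @value, which is how JSON-LD marks a typed literal.

```python
def classify_value(value):
    """Label a parsed JSON-LD property value as 'resource' or 'literal'."""
    # A JSON object (curly brackets) is a resource, unless it wraps a
    # literal with "@value", in which case it is still a literal.
    if isinstance(value, dict) and "@value" not in value:
        return "resource"
    return "literal"


dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "author": {"@type": "Person", "name": "Jane Goodall"},
}

# Keywords such as @context and @type are skipped; only fields are labeled.
for field, value in dataset.items():
    if not field.startswith("@"):
        print(field, "->", classify_value(value))

print("name ->", classify_value(dataset["author"]["name"]))
```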

Schema.org defines six literal, or primitive, data types: Text, Number, Boolean, Date, DateTime, and Time. Text has two special variations: URL, and HTML for text that should be interpreted as HTML.

When using schema.org, literal data types are generally not specified using curly brackets {}, as these are reserved for specifying 'objects' or 'resources' such as other schema.org types like Person, Organization, etc. First, let's see how to use a primitive data type by using fields of CreativeWork, the superclass of Dataset.

Text

Imagine we want to say the name of our Creative Work is "Passenger Manifest for H.M.S. Titanic". The name field of CreativeWork specifies that it expects Text as the data type. We would use it in this way:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic"
}

Number

Let's say we want to specify the version number of our manifest using the version field of CreativeWork, which expects a Number. To specify numbers in JSON-LD, we omit the quotation marks surrounding the value:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1
}

URL

Now, let's specify the URL of our manifest using the url field of CreativeWork, a field inherited from Thing. This field expects a valid URL represented as Text:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv"
}
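
Since a URL is written as ordinary text, nothing in JSON itself stops a relative path or a typo from ending up in the url field. One possible pre-publication check, sketched here in Python with the standard urllib.parse module, is shown below. The function name is hypothetical and the test is only a heuristic for "absolute http(s) URL", not a full URL validator.

```python
from urllib.parse import urlparse


def is_absolute_http_url(text):
    """Heuristic: True if `text` has an http(s) scheme and a host part."""
    parts = urlparse(text)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


manifest_url = (
    "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/"
    "master/titanic_original.csv"
)
print(is_absolute_http_url(manifest_url))
print(is_absolute_http_url("titanic_original.csv"))
```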

Boolean

Using the Boolean data type, we can specify that our manifest is accessible for free using the field isAccessibleForFree by writing true or false without quotes:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true
}
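
When the JSON-LD is generated from code, the quoting rules for Text, Number and Boolean follow from the native types used. The sketch below assumes Python's standard json module: a Python int is written without quotes and True becomes the JSON literal true.

```python
import json

manifest = {
    "@context": "https://schema.org/",
    "@type": "CreativeWork",
    "name": "Passenger Manifest for H.M.S. Titanic",  # str  -> quoted Text
    "version": 1,                                     # int  -> unquoted Number
    "isAccessibleForFree": True,                      # bool -> unquoted true
}

serialized = json.dumps(manifest, indent=2)
print(serialized)
```

Passing "1" or "true" as strings would instead produce quoted values, which consumers may then read as Text rather than as a Number or Boolean.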

Date

To specify datePublished, which allows either a Date or a DateTime, as a Date, we can use any ISO 8601 date format, wrapping the date in double quotes:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29"
}
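
If the publication date is already held as a date object, an ISO 8601 string can be produced directly instead of being typed by hand. A minimal sketch, assuming Python's standard datetime module:

```python
import datetime

published = datetime.date(2018, 7, 29)

# date.isoformat() yields the ISO 8601 calendar form YYYY-MM-DD.
date_published = published.isoformat()
print(date_published)  # 2018-07-29

manifest = {
    "@context": "https://schema.org/",
    "@type": "CreativeWork",
    "datePublished": date_published,
}
```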

DateTime

To specify the dateModified as a DateTime, we must follow the ISO 8601 format for combining date and time representations using the form [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm]:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z"
}
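
A combined date and time can be generated the same way. The sketch below assumes Python's standard datetime module and a timestamp known to be in UTC. Python writes the UTC offset as +00:00, which matches the (+|-)hh:mm branch of the form above; replacing it with Z is an optional step shown for illustration.

```python
import datetime

modified = datetime.datetime(
    2018, 7, 30, 14, 30, tzinfo=datetime.timezone.utc
)

# datetime.isoformat() always includes seconds and, for aware values,
# the UTC offset.
date_modified = modified.isoformat()
print(date_modified)  # 2018-07-30T14:30:00+00:00

# Optional: use the shorter "Z" designator for UTC.
date_modified_z = date_modified.replace("+00:00", "Z")
print(date_modified_z)  # 2018-07-30T14:30:00Z
```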

Time

Time is a rarely used data type because it represents a point in time that recurs on multiple days. It follows the XML Schema definition, using the form hh:mm:ss[Z|(+|-)hh:mm] (see XML Schema for details). None of the fields used so far expects a Time, so the running example is unchanged:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z"
}
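
To make the Time form concrete, the sketch below checks whether a string has the shape hh:mm:ss[Z|(+|-)hh:mm]. It assumes Python's standard re module and a hypothetical function name, looks_like_time. It only tests the shape of the string, not whether the hours, minutes and seconds are in range.

```python
import re

# hh:mm:ss followed by an optional "Z" or a (+|-)hh:mm offset.
TIME_PATTERN = re.compile(r"\d{2}:\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})?")


def looks_like_time(text):
    """Shape check only: True if `text` matches hh:mm:ss[Z|(+|-)hh:mm]."""
    return TIME_PATTERN.fullmatch(text) is not None


for candidate in ("14:30:00Z", "14:30:00-05:00", "2018-07-30T14:30Z"):
    print(candidate, looks_like_time(candidate))
```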

HTML

The HTML data type is a special variation of the Text data type. In some cases where Text is the expected data type, our actual data may be HTML (because we are dealing with web pages). In this case, the schema.org JSON-LD context defines HTML to mean rdf:HTML, the data type for specifying that a string of text should be interpreted as HTML. Let's say that we have a description of our manifest and want to use the description field, but we have HTML inside that text. Treating the value as plain Text, as we did above for the name field, we would specify the description as:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z",
  "description": "<h3>Acquisition</h3><p>The data was acquired from an office outside of <a href=\"https://en.wikipedia.org/wiki/New_York_City\">New York City</a>.</p>"
}

However, to specify that the description field should be interpreted as HTML, you specify description as an object, setting the @type of that object to "HTML" and placing the HTML string in the JSON-LD @value property:

{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "Passenger Manifest for H.M.S. Titanic",
  "version": 1,
  "url": "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv",
  "isAccessibleForFree": true,
  "datePublished": "2018-07-29",
  "dateModified": "2018-07-30T14:30Z",
  "description": {
    "@type": "HTML",
    "@value": "<h3>Acquisition</h3><p>The data was acquired from an office outside of <a href=\"https://en.wikipedia.org/wiki/New_York_City\">New York City</a>.</p>"
  }
}
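
HTML strings contain double quotes that must be escaped inside JSON, which is easy to get wrong by hand. The sketch below, assuming Python's standard json module and a hypothetical helper named html_literal, wraps the markup in the @type/@value form shown above and leaves the escaping to the serializer.

```python
import json


def html_literal(markup):
    """Wrap an HTML string in the @type/@value form shown above."""
    return {"@type": "HTML", "@value": markup}


markup = (
    "<h3>Acquisition</h3><p>The data was acquired from an office outside of "
    '<a href="https://en.wikipedia.org/wiki/New_York_City">New York City</a>.</p>'
)

doc = {
    "@context": "https://schema.org/",
    "@type": "CreativeWork",
    "name": "Passenger Manifest for H.M.S. Titanic",
    "description": html_literal(markup),
}

# json.dumps escapes the double quotes around the href value as \" for us.
serialized = json.dumps(doc, indent=2)
print(serialized)
```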

NOTE: As of 7/28/2018, the Google Structured Data Testing Tool understands the value of description to be rdf:HTML, but the tool reports this type as unknown. However, you can see from the schema.org GitHub repository that this method was discussed and implemented in pull #1634: alias HTML to rdf:HTML.

Resource Types

All schema.org resources should make use of the @type property, which 'classifies' the resource as a specific type. For example, an un-typed resource would look like:

{
  "@context": "https://schema.org/",
  "name": "My Dataset"
}

Even though the above resource has a name of 'My Dataset', harvesters are unaware that your intent was to classify it as a Dataset. Un-typed resources are not valid schema.org resources, so they require the @type property:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "My Dataset"
}

In some cases, it is useful to multi-type a resource. One example of this may be a data repository. A data repository typically functions both as an 'Organization' that employs people and has an address and as a 'Service' to its user community. To multi-type a resource, we use JSON arrays:

{
  "@context": "https://schema.org/",
  "@type": ["Organization", "Service"],
  "name": "My Data Repository"
}
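
Because @type may be a single string, an array, or missing altogether, code that reads these documents usually needs to normalize it first. The sketch below is one way to do that in Python. The function name declared_types is hypothetical, and it does not check that the names are real schema.org types.

```python
def declared_types(resource):
    """Return the @type values of a resource as a list (empty if un-typed)."""
    declared = resource.get("@type", [])
    if isinstance(declared, str):
        return [declared]
    return list(declared)


untyped = {"@context": "https://schema.org/", "name": "My Dataset"}
single = {"@context": "https://schema.org/", "@type": "Dataset", "name": "My Dataset"}
multi = {
    "@context": "https://schema.org/",
    "@type": ["Organization", "Service"],
    "name": "My Data Repository",
}

for resource in (untyped, single, multi):
    types = declared_types(resource)
    print(resource["name"], "->", types if types else "un-typed")
```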

All schema.org types are listed at https://schema.org/docs/full.html.