This part of the pipeline:
- reconciles the names in our dataset with wikidata IDs
- runs the same 4 sparql requests on all IDs
- stores the output of the sparql requests in a json file (out/wikidata/wikidata_enrichments.json)
- updates the tei files with the wikidata IDs
The aim is to produce normalised data to connect to catalogue entries, in order to understand our dataset better and to isolate the factors determining a price.
With the proper python virtual environment sourced, and without running the tests, just type:
python main.py -n # build the input data table
python main.py -i # align tei:names with wikidata entities
python main.py -s # run sparql queries on those entities
python main.py -w # reinject the wikidata ids into the tei catalogues
Even simpler, you can just run the script below:
bash pipeline.sh
This works on macOS and Linux (Ubuntu and Debian-based distributions).
git clone https://github.com/katabase/3_WikidataEnrichment # clone the repo
cd 3_WikidataEnrichment # move to the directory
python3 -m venv env # create a python virtualenv
source env/bin/activate # source python from the virtualenv
pip install -r requirements.txt # install the necessary libraries
All scripts are run through main.py with a specific argument. 4 GB of RAM are recommended to run the scripts.
As a reminder, here is the tei structure of the catalogue entries (a short reading sketch follows the example):
<item n="80" xml:id="CAT_000146_e80">
<num>80</num>
<name type="author">Cherubini (L.),</name>
<trait>
<p>l'illustre compositeur</p>
</trait>
<desc>
<term>L. a. s.</term>;<date>1836</date>,
<measure type="length" unit="p" n="1">1 p.</measure>
<measure unit="f" type="format" n="8">in-8</measure>.
<measure commodity="currency" unit="FRF" quantity="12">12</measure>
</desc>
</item>
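Since the stated aim includes isolating the factors determining a price, it may help to see how such an entry can be read programmatically. Below is a minimal sketch using lxml (assumed to be among the installed libraries); the file path is hypothetical and the element names come from the example above.

# Minimal sketch: reading the name and price of the example entry with lxml.
# The catalogue file name is hypothetical; the structure is the one shown above.
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # standard TEI namespace

tree = etree.parse("CAT_000146.xml")  # hypothetical path to the catalogue
item = tree.xpath('//tei:item[@xml:id="CAT_000146_e80"]', namespaces=NS)[0]

name = item.xpath("normalize-space(tei:name)", namespaces=NS)
price = item.xpath('tei:desc/tei:measure[@commodity="currency"]', namespaces=NS)[0]
print(name, price.get("quantity"), price.get("unit"))  # Cherubini (L.), 12 FRF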
Step 1: create an input TSV - python main.py -n
The first step is to create a tsv file that will be used to retrieve the wikidata IDs:
- the tsv is made of the following columns (see example below; a sketch of this building pass is given after this list):
  - xml id: the item's xml:id
  - wikidata id: the wikidata ID (to be retrieved in the next step)
  - name: the tei:name of that item
  - trait: the tei:trait of that item

  xml id,wikidata id,name,trait
  CAT_000362_e27086,,ADAM (Ad.),célèbre compositeur de musique.

- running this step:
  python main.py -n
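As a rough illustration, the pass sketched below walks the catalogues with lxml and writes one row per tei:item. This is a simplified sketch, not the repository's actual script: the input directory is hypothetical and the column handling is reduced to the bare minimum.

# Minimal sketch of building nametable_in.tsv from the tei catalogues.
# "catalogues/" is a hypothetical input directory; the real extraction is richer.
import csv
import glob
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

with open("script/tables/nametable_in.tsv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["xml id", "wikidata id", "name", "trait"])
    for path in glob.glob("catalogues/*.xml"):
        tree = etree.parse(path)
        for item in tree.xpath("//tei:item", namespaces=NS):
            xml_id = item.get("{http://www.w3.org/XML/1998/namespace}id")
            name = item.xpath("normalize-space(tei:name)", namespaces=NS)
            trait = item.xpath("normalize-space(tei:trait)", namespaces=NS)
            writer.writerow([xml_id, "", name, trait])  # wikidata id is filled in step 2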
Step 2: retrieve the wikidata IDs - python main.py -i
The wikidata IDs are retrieved by running a full text search using the wikidata API.
- the algorithm functions as follows:
  - the input is the file created at the previous step (script/tables/nametable_in.tsv). The name and trait columns are processed to prepare the data for the API search:
    - from the name, we determine what kind of name we are working with (the name of a person, of a nobility, of an event, of a place...). This determines different behaviours.
    - the name is normalised: nobility titles, locations... are extracted and translated. First and last names are extracted. If the first name is abbreviated, we try to rebuild the full name from its abbreviated version.
    - the trait is processed to extract and translate occupations, dates...
    - the output is stored in a dictionary
  - this dict is passed to a second algorithm that runs text searches on the API (a minimal sketch of such a search is given at the end of this step). Depending on the data stored in the dict, different queries are run; a series of queries is run until a result is obtained
  - finally, the result is written to a TSV file (out/wikidata/nametable_out.tsv). Its structure is the same as that of nametable_in, with some changes. Here are the column names:
    - tei:xml_id: the @xml:id from the tei files
    - wd:id: the wikidata ID
    - tei:name: the tei:name
    - wd:name: the name corresponding to the wikidata ID (to ease the data verification process)
    - wd:snippet: a short summary of the wikidata page (to ease the data verification process)
    - tei:trait: the tei:trait
    - wd:certitude: an evaluation of the degree of certitude (whether we are certain that the proper ID has been retrieved)
  - once this script has completed, a deduplicated list of wikidata IDs is written to script/tables/id_wikidata.txt. This file will be used as input for the next step.
  - the F1 score for this step (evaluating the number of good wikidata IDs retrieved) is 0.674, based on tests run on 200 items.
  - this step takes a lot of time to complete but, thanks to log files, the script can be interrupted and restarted at any point.
- running this step:
  python main.py -i
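For reference, a single full-text search against the wikidata API can be as simple as the sketch below (using requests). The query string is a toy example built by hand; the project's query-building logic (name typing, normalisation, fallback queries, logging) is considerably more involved.

# Minimal sketch: one full-text search against the wikidata API.
# The query string is a toy example; the script builds its queries from the
# normalised name and trait, and falls back on other queries if nothing is found.
import requests

def wikidata_search(query):
    """Return the list of search hits for `query` (possibly empty)."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    }
    r = requests.get("https://www.wikidata.org/w/api.php", params=params)
    r.raise_for_status()
    return r.json()["query"]["search"]

hits = wikidata_search("Adolphe Adam compositeur")
if hits:
    print(hits[0]["title"], "-", hits[0]["snippet"])  # a QID and a short snippet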
Step 3: running sparql queries - python main.py -s
- the algorithm is much simpler: for each wikidata ID, 4 sparql queries are run (a minimal sketch of a single query is given at the end of this step). The results are returned in json or, if an error occurred, in xml. The results are translated into a simpler json and stored in out/wikidata/wikidata_enrichments.json. This step takes a lot of time, but the script can be stopped and resumed at any point.
- the output structure is as follows (each key is mapped to a list of results; the list can be empty):
out = {
'instance': [], 'instanceL': [], # what "category" an id belongs to (person, literary work...)
'gender': [], 'genderL': [], # the gender of a person
'citizenship': [], 'citizenshipL': [], # citizenship
'lang': [], 'langL': [], # languages spoken
'deathmanner': [], 'deathmannerL': [], # the way a person died
'birthplace': [], 'birthplaceL': [], # the place a person is born
'deathplace': [], 'deathplaceL': [], # the place a person died
'residplace': [], 'residplaceL': [], # the place a person lived
'burialplace': [], 'burialplaceL': [], # where a person is buried
'educ': [], 'educL': [], # where a person studied
'religion': [], 'religionL': [], # a person's religion
'occupation': [], 'occupationL': [], # general description of a person's occupation
'award': [], 'awardL': [], # awards gained
'position': [], 'positionL': [], # precise positions held by a person
'member': [], 'memberL': [], # institution a person is member of
'nobility': [], 'nobilityL': [], # nobility titles
'workcount': [], # number of works (books...) documented on wikidata
'conflictcount': [], # number of conflicts (wars...) a person has participated in
'image': [], # url to the portrait of a person
'signature': [], # url to the signature of a person
'birth': [], 'death': [], # birth and death dates
'title': [], # title of a work of art / book...
'inception': [], # date a work was created or published
'author': [], 'authorL': [], # author of a book
'pub': [], 'pubL': [], # publisher of a work
'pubplace': [], 'pubplaceL': [], # place a work was published
'pubdate': [], # date a work was published
'creator': [], 'creatorL': [], # creator of a work of art
'material': [], 'materialL': [], # material in which a work of art is made
'height': [], # height of a work of art
'genre': [], 'genreL': [], # genre of a work or genre of works created by a person
'movement': [], 'movementL': [], # movement in which a person or an artwork are inscribed
'creaplace': [], 'creaplaceL': [], # place where a work was created
'viafID': [], # viaf identifier
'bnfID': [], # bibliothèque nationale de france ID
'isniID': [], # isni id
'congressID': [], # library of congress identifier
'idrefID': [] # idref identifier
}
- running this step:
python main.py -s
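As an illustration, one property lookup against the Wikidata Query Service could look like the sketch below, using SPARQLWrapper (assuming it is available; requirements.txt lists the actual dependencies). The project's four queries retrieve far more properties than this single one.

# Minimal sketch: one sparql query against the Wikidata Query Service.
# Q42 (Douglas Adams) is a toy example; the real queries cover the keys listed above.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="wikidata-enrichment-sketch/0.1")  # polite user agent
endpoint.setQuery("""
    SELECT ?occupation ?occupationLabel WHERE {
      wd:Q42 wdt:P106 ?occupation .                     # P106 = occupation
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["occupation"]["value"], row["occupationLabel"]["value"])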
Step 4: reinject the wikidata IDs into the TEI catalogues - python main.py -w
- all tei:items are linked with a wikidata ID retrieved during the process.
- the wikidata IDs are included in a @key attribute inside the tei:name and prefixed with the token wd: (a minimal sketch is given after this list).
- a pattern to handle this prefix is provided in the tei:teiHeader, in the tei:editorialDecl//tei:listPrefixDef. This makes it possible to automatically rebuild a URL to the proper wikidata page.
- the output is written to out/catalogues.
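To make the result concrete, the sketch below shows the kind of modification the reinjection performs on a single entry with lxml. The QID and the input file path are hypothetical placeholders; the actual script processes whole catalogues and relies on the wd: prefix declared in the tei:listPrefixDef.

# Minimal sketch: adding a wikidata ID to the tei:name of one item.
# "wd:Q0000000" and the input path are placeholders, not real values.
from lxml import etree

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = etree.parse("catalogues/CAT_000146.xml")  # hypothetical input catalogue
item = tree.xpath('//tei:item[@xml:id="CAT_000146_e80"]', namespaces=NS)[0]
name = item.xpath("tei:name", namespaces=NS)[0]
name.set("key", "wd:Q0000000")  # the wd: prefix is resolved via the tei:listPrefixDef
tree.write("out/catalogues/CAT_000146.xml", encoding="utf-8", xml_declaration=True)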
Running tests - python main.py -t
- the tests are only run on step 2 (for the other steps, we are certain of the result).
- they are based on 200 catalogue entries. The test dataset resembles the full dataset (about as many different kinds of entries, from different catalogues, with as many tei:traits as in the main dataset).
- several tests are run. Two of them test isolated parameters of the dictionary built in step 2 and the efficiency of the function that rebuilds a first name from its abbreviation. The other tests target the final algorithm: they build statistics about it and calculate its execution time using different parameters.
- running the tests:
  python main.py -t
Other options:
- counting the most used words in the tei:traits of the input dataset (to tweak the way the dictionary is built in step 2): python main.py -c
- python main.py -x: a throwaway option, mapped to a function in order to run a script that is not accessible from the above arguments
Summarizing, the options are (a sketch of the corresponding command line interface is given after this list):
* -c --traitcounter : count most used terms in the tei:trait (to tweak the matching tables)
* -t --test : run tests (takes ~20 minutes)
* -i --wikidataids : retrieve wikidata ids (takes 10 to 20 hours!)
* -s --runsparql : run sparql queries (takes about 5 hours)
* -n --buildnametable: build the input table for -i --wikidataids (a table from which to retrieve wikidata ids)
* -w : reinject the wikidata ids into the tei catalogues
* -x --throwaway : run the current throwaway script (to test a function or whatnot)
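For orientation, the flags above map naturally onto an argparse interface. The sketch below is only a plausible shape for main.py's argument handling, not the repository's actual code; the dispatch bodies are placeholders.

# Minimal sketch of the command line interface documented above.
# The dispatch only prints placeholders; the real main.py calls the step modules.
import argparse

def main():
    parser = argparse.ArgumentParser(description="wikidata enrichment pipeline")
    parser.add_argument("-c", "--traitcounter", action="store_true",
                        help="count most used terms in the tei:trait")
    parser.add_argument("-t", "--test", action="store_true", help="run tests")
    parser.add_argument("-i", "--wikidataids", action="store_true",
                        help="retrieve wikidata ids")
    parser.add_argument("-s", "--runsparql", action="store_true",
                        help="run sparql queries")
    parser.add_argument("-n", "--buildnametable", action="store_true",
                        help="build the input table for -i --wikidataids")
    parser.add_argument("-w", action="store_true",
                        help="reinject the wikidata ids into the tei catalogues")
    parser.add_argument("-x", "--throwaway", action="store_true",
                        help="run the current throwaway script")
    args = parser.parse_args()
    if args.buildnametable:
        print("would build script/tables/nametable_in.tsv")   # step 1 code goes here
    elif args.wikidataids:
        print("would retrieve the wikidata ids")               # step 2 code goes here
    # ... and so on for the other flags

if __name__ == "__main__":
    main()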
Scripts developed by Paul Kervegan in spring-summer 2022 and available under the GNU GPL-3.0 license.
The catalogues are licensed under Creative Commons Attribution 4.0 International Licence and the code is licensed under GNU GPL-3.0.