Automatic XML TEI encoding of catalogues using GROBID technologies

gabays/grobid

Automatic encoding of manuscript catalogues with GROBID

GROBID-Dictionaries was originally designed for dictionaries; we are trying to apply it to manuscript catalogues.

Credits

GROBID-Dictionaries is developed by Mohamed Khemakhem (GitHub).

More info on GROBID technologies can be found here.

Research on catalogues and training is carried out by Simon Gabay.

Corpus

Tests are carried out on scans of the Revue des autographes, directed by Gabriel Charavay (data.bnf).

Methodology

PDFs are OCRed with Transkribus. Our Transkribus model is available on request.

The GROBID model is trained on four excerpts (three pages each) of the corpus (toyData/dataset/dictionary-segmentation/corpus>PDF).

Files

Training data are available in ToyData

Sample PDFs and tools to manipulate them (cpdf) are in TrainingTools.
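
As an illustration of how three-page excerpts like those described under Methodology could be cut from a scanned issue with cpdf, here is a minimal Python sketch. It only builds the command lines and does not run them. The file name `issue.pdf`, the starting pages, and the output prefix are hypothetical placeholders, not values taken from this repository. The sketch assumes the usual cpdf invocation of the form `cpdf in.pdf <range> -o out.pdf`.

```python
import shlex
from typing import List


def excerpt_ranges(start_pages: List[int], length: int = 3) -> List[str]:
    """Return inclusive page ranges such as '10-12', one per excerpt."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return [f"{start}-{start + length - 1}" for start in start_pages]


def cpdf_commands(
    source_pdf: str,
    start_pages: List[int],
    length: int = 3,
    prefix: str = "excerpt",  # hypothetical output naming scheme
) -> List[List[str]]:
    """Build one cpdf argument list per excerpt, without executing anything."""
    commands = []
    for index, page_range in enumerate(excerpt_ranges(start_pages, length), start=1):
        commands.append(
            ["cpdf", source_pdf, page_range, "-o", f"{prefix}_{index}.pdf"]
        )
    return commands


if __name__ == "__main__":
    # Four excerpts of three pages each; the starting pages are made up.
    for command in cpdf_commands("issue.pdf", [1, 10, 20, 30]):
        print(shlex.join(command))
```

Each argument list can be passed to `subprocess.run` once cpdf is installed. Printing the commands first makes it easy to check the page ranges before any file is written.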

Paper

A first paper was presented at TEI 2018 in Tokyo:

@inproceedings{khemakhem:hal-01819505,
  TITLE = {{Automatically Encoding Encyclopedic-like Resources in TEI}},
  AUTHOR = {Mohamed Khemakhem and Laurent Romary and Simon Gabay and Hervé Bohbot and Francesca Frontini and Giancarlo Luxardo},
  URL = {https://hal.archives-ouvertes.fr/hal-01819505},
  BOOKTITLE = {{TEI 2018}},
  ADDRESS = {Tokyo, Japan},
  YEAR = {2018},
  MONTH = sep,
  KEYWORDS = {Manuscripts auction catalogues, GROBID-Dictionaries, TEI, Dictionaries},
  PDF = {https://hal.inria.fr/hal-01819505/document},
  HAL_ID = {hal-01819505},
}

Licence

Regarding GROBID, cf. here.

Regarding the corpus: extracted data is CC-BY.
