Automatic encoding of manuscripts catalogues with GROBID

GROBID-Dictionaries was originally designed for dictionaries; we are trying to use it with manuscript catalogues.

Credits

GROBID-Dictionaries is developed by Mohamed Khemakhem (GitHub).

More info on GROBID technologies can be found here.

Research on catalogues and training is carried out by Simon Gabay.

Corpus

Tests are carried out on scans of the Revue des autographes, directed by Gabriel Charavay (data.bnf).

Methodology

PDFs are OCRed with Transkribus. Our model is available on request.

The GROBID model is trained on four excerpts (three pages each) of the corpus (toyData/dataset/dictionary-segmentation/corpus>PDF).

Files

Training data are available in toyData.

Sample PDFs and tools to manipulate them (cpdf) are in TrainingTools.
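As an illustration of how three-page excerpts like those used for training could be cut from a scanned PDF with cpdf, here is a minimal sketch. It only builds the command lines. The input file name, the output names and the starting pages are hypothetical, and this is not necessarily how the excerpts in this repository were produced.

```python
def excerpt_command(source, first_page, output, length=3):
    """Build a cpdf command that copies `length` consecutive pages."""
    last_page = first_page + length - 1
    # cpdf takes the input file, a page range, then -o and the output file.
    return ["cpdf", source, f"{first_page}-{last_page}", "-o", output]


# Hypothetical input file and starting pages: four excerpts of three pages each.
commands = [
    excerpt_command("catalogue.pdf", start, f"excerpt_{i}.pdf")
    for i, start in enumerate([1, 11, 21, 31], start=1)
]

for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run` once cpdf is installed and the input file exists.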

Paper

A first paper was presented at TEI 2018 in Tokyo:

@inproceedings{khemakhem:hal-01819505,
  TITLE = {{Automatically Encoding Encyclopedic-like Resources in TEI}},
  AUTHOR = {Mohamed Khemakhem and Laurent Romary and Simon Gabay and Hervé Bohbot and Francesca Frontini and Giancarlo Luxardo},
  URL = {https://hal.archives-ouvertes.fr/hal-01819505},
  BOOKTITLE = {{TEI 2018}},
  ADDRESS = {Tokyo, Japan},
  YEAR = {2018},
  MONTH = sep,
  KEYWORDS = {Manuscripts auction catalogues, GROBID-Dictionaries, TEI, Dictionaries},
  PDF = {https://hal.inria.fr/hal-01819505/document},
  HAL_ID = {hal-01819505},
}

Licence

Regarding GROBID, cf. here.

Regarding the corpus: extracted data is CC-BY.
