GROBID-Dictionaries was originally designed for dictionaries; we are trying to use it with manuscript catalogues.
GROBID dictionaries is developed by Mohamed Khemakhem (GitHub).
More info on GROBID technologies can be found here.
Research on catalogues and training is carried out by Simon Gabay.
Tests are carried out on scans of the Revue des autographes, directed by Gabriel Charavay (data.bnf).
The PDFs are OCRed with Transkribus. Our model is available on request.
The GROBID model is trained on four excerpts (three pages each) of the corpus (toyData/dataset/dictionary-segmentation/corpus>PDF).
Training data are available in ToyData.
Sample PDFs and tools to manipulate them (cpdf) are in TrainingTools.
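As an illustration of how three-page excerpts might be cut from a scanned catalogue with cpdf, here is a minimal sketch. It assumes cpdf is installed and on the PATH, and relies on its `cpdf in.pdf <range> -o out.pdf` form. The file name `catalogue.pdf`, the start pages, and the `excerpt_N.pdf` output names are hypothetical and are not taken from this repository.

```python
import subprocess
from pathlib import Path


def excerpt_command(source, first_page, out, pages=3):
    """Build the cpdf argument list that copies `pages` consecutive pages."""
    last_page = first_page + pages - 1
    return ["cpdf", str(source), f"{first_page}-{last_page}", "-o", str(out)]


def cut_excerpts(source, first_pages, out_dir, pages=3, run=False):
    """Build one cpdf command per start page; execute them only if run=True."""
    out_dir = Path(out_dir)
    commands = []
    for i, first in enumerate(first_pages, start=1):
        # Hypothetical naming scheme: excerpt_1.pdf, excerpt_2.pdf, ...
        out = out_dir / f"excerpt_{i}.pdf"
        commands.append(excerpt_command(source, first, out, pages))
    if run:
        # Requires cpdf on the PATH and an existing source PDF.
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands for four excerpts at hypothetical start pages.
    for cmd in cut_excerpts("catalogue.pdf", [1, 10, 20, 30], "excerpts"):
        print(" ".join(cmd))
```

With `run=False` (the default) nothing is written, so the page ranges can be checked before any PDF is produced.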
A first paper was presented at TEI 2018 in Tokyo:
@inproceedings{khemakhem:hal-01819505,
TITLE = {{Automatically Encoding Encyclopedic-like Resources in TEI}},
AUTHOR = {Mohamed Khemakhem and Laurent Romary and Simon Gabay and Hervé Bohbot and Francesca Frontini and Giancarlo Luxardo},
URL = {https://hal.archives-ouvertes.fr/hal-01819505},
BOOKTITLE = {{TEI 2018}},
ADDRESS = {Tokyo, Japan},
YEAR = {2018},
MONTH = September,
KEYWORDS = {Manuscripts auction catalogues, GROBID-Dictionaries, TEI, Dictionaries},
PDF = {https://hal.inria.fr/hal-01819505/document},
HAL_ID = {hal-01819505},
}
Regarding GROBID, cf. here.
Regarding the corpus: extracted data is CC-BY.