Matching the groundtruth with the hypothesis baselines for CER / WER #119

icarl-ad · 2024-07-18T11:11:23Z

Hi,

I was wondering how the baselines (or the transcriptions of the baselines) of the ground truth and the hypothesis are matched. This is obviously important for the calculation of KPIs.

If, e.g., the Levenshtein distance is applied, how does this work with an extremely bad transcription, or with baselines that occur only in the ground truth or only in the hypothesis?

Thank you for your help!

mikegerber · 2024-07-18T14:02:36Z

Maintainer

(Baselines is not the right term here, so I am using the term "textlines")

dinglehopper matches the extracted text in order (e.g. ReadingOrder for PAGE documents, or in document order in ALTO, text files etc.). It uses the shortest alignment as per the Levenshtein algorithm.

If you have lines that only appear on one side, there will be no match (or a bad match). An example (from the README):

https://raw.githubusercontent.com/qurator-spk/dinglehopper/master/.screenshots/dinglehopper.png

If you look at the second line, there's no match on the left (GT) side and the extra text on the right side (OCR) is counted as an error.
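To make the effect of an unmatched line concrete, here is a minimal sketch in plain Python. It is not dinglehopper's code: the `levenshtein` helper and the sample lines are made up for illustration, and dinglehopper's own text extraction and normalization are not reproduced. It only shows that a line present on one side ends up as one edit per character (plus its newline) once both texts are joined.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # keep x (cost 0) or substitute it (cost 1)
            ))
        prev = curr
    return prev[-1]


# Hypothetical sample data: the OCR side has one line the GT side lacks.
gt_lines = ["first line", "third line"]
ocr_lines = ["first line", "extra line", "third line"]

# Join the lines in order; from here on there are only characters.
gt_text = "\n".join(gt_lines)
ocr_text = "\n".join(ocr_lines)

extra_cost = levenshtein(gt_text, ocr_text)
print(extra_cost)
```

The extra line costs `len("extra line") + 1` edits here: every character of the line, plus the newline that separates it from its neighbour.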

icarl-ad · 2024-07-19T09:41:39Z

Author

Does that mean that the Levenshtein distance is calculated from every textline of the ground truth to every textline of the hypothesis, and that the alignment is then made based on the smallest distance?

mikegerber · 2024-07-19T10:27:57Z

Maintainer

No, the alignment/calculating the distance works on the full concatenated text, not on single textlines.

1 reply

mikegerber Jul 19, 2024
Maintainer

Note that calculating the distance requires alignment, and so is essentially the same thing.

icarl-ad · 2024-07-19T14:02:25Z

Author

Can you elaborate on that? I don't really understand how calculating the Levenshtein distance on the whole transcription would be helpful, especially since it sounds like the alignment of the textlines has been set beforehand. Thank you so much again!

mikegerber · 2024-07-22T14:27:44Z

Maintainer

Both texts are extracted, in the order defined in the files. At this point you do not have any textlines anymore, but a series of characters (including newline characters).
Then the Levenshtein distance is calculated.

Calculating the Levenshtein distance involves aligning the two texts, so you don't get "a distance", but "the distance" = the shortest series of operations to get from one text to the other.
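As a rough illustration of why distance and alignment come out of the same computation, the sketch below fills the usual dynamic-programming table and then walks back through it to recover one minimal series of operations. The function name `align` and the operation labels are chosen for this example only; this is plain Python, not the RapidFuzz implementation that dinglehopper uses.

```python
def align(a, b):
    """Return (distance, ops) for one minimal Levenshtein alignment of a and b."""
    n, m = len(a), len(b)
    # dist[i][j] is the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                           # delete a[i-1]
                dist[i][j - 1] + 1,                           # insert b[j-1]
                dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # equal or replace
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            op = "equal" if a[i - 1] == b[j - 1] else "replace"
            ops.append((op, a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", a[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, b[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


# Hypothetical two-line texts; the newline is just another character.
distance, ops = align("one\ntwo", "one\nextra\ntwo")
print(distance)
for op in ops:
    if op[0] != "equal":
        print(op)
```

The number of non-`equal` operations always equals the returned distance. Several different alignments can share that minimal cost; this walk-back picks one of them.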

mikegerber · 2024-08-20T09:00:56Z

Maintainer

@icarl-ad mailed me about this.

There seems to be a fundamental misunderstanding: There are no textlines matched. dinglehopper does not work with lines.

The text is extracted, in reading order, as defined by the input data.
The resulting sequences of characters are compared using the Levenshtein algorithm as implemented by the RapidFuzz library. This yields the differences and the distance, from which the CER is derived. (Words are extracted from this text, too, and then sequences of words are compared.)
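The character-level and word-level comparison can be sketched as follows. This is an illustration under stated assumptions, not dinglehopper's code: it uses a plain-Python `levenshtein` helper instead of RapidFuzz, splits words with `str.split()` rather than dinglehopper's own word segmentation, skips any Unicode normalization, and divides the distance by the length of the ground truth, which is the common definition of an error rate. The names `error_rate`, `gt` and `ocr` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # keep x (cost 0) or substitute it (cost 1)
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference sequence."""
    if len(reference) == 0:
        # Assumption for this sketch: empty vs. empty is 0, otherwise infinite.
        return 0.0 if len(hypothesis) == 0 else float("inf")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical sample texts, already extracted in reading order.
gt = "the quick brown fox"
ocr = "the quick brovvn fox"

cer = error_rate(gt, ocr)                  # sequence of characters
wer = error_rate(gt.split(), ocr.split())  # sequence of words
print(f"CER = {cer:.3f}, WER = {wer:.3f}")
```

The same function serves both metrics because the algorithm only needs two sequences of comparable items; only the unit (character or word) changes.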
