Update README.rst #128

16 changes: 7 additions & 9 deletions README.rst
@@ -4,13 +4,11 @@ CorporaCreator

This is a command line tool to create Common Voice corpora.

.. contents:: Table of Contents


Installation
===========
=============

After checking this repo out one installs the corresponding python package as follows
After checking this repo out one installs the corresponding Python package as follows

``CorporaCreator$ python3 setup.py install``

@@ -25,7 +23,7 @@ Given the ``clips.tsv`` file dumped from the Common Voice database, you can crea

This will create the corpora in the directory ``corpora`` from the ``clips.tsv`` file.

If you would like to just create corpora for a some language(s), you can pass the ``--langs`` flag as follows:
If you would like to just create corpora for some language(s), you can pass the ``--langs`` flag as follows:

``CorporaCreator$ create-corpora -d corpora -f clips.tsv --langs en fr``

@@ -97,7 +95,7 @@ The purpose of the ``create-corpora`` command line tool is to provide a jumping-
Cleaning Sentences
------------------

The ``clips.tsv`` file is a `tab separated file`_ containing a dump of the raw data from Common Voice with the following columns:
The ``clips.tsv`` file is a `tab-separated file`_ containing a dump of the raw data from Common Voice with the following columns:

1) ``client_id`` - A unique identifier for the contributor that was randomly generated when the contributor joined
2) ``path`` - The path to the audio file containing the contribution
@@ -118,7 +116,7 @@ Our problem is that data in the column ``sentence`` needs to be cleaned, as ther
What Needs to be Cleaned?
`````````````````````````

To actually see what needs to be cleaned first hand, the best thing to do is to run ``create-corpora`` as suggested above:
To actually see what needs to be cleaned firsthand, the best thing to do is to run ``create-corpora`` as suggested above:

``CorporaCreator$ create-corpora -d corpora -f clips.tsv``

@@ -149,7 +147,7 @@ This method is input the sentence to clean, cleans the sentence in a language in

If the sentence is not able to be cleaned, e.g. it consisted only of HTML fragments, this method can return is_valid set to False.

Currently `common.py`_ decodes any URL encoded elements of sentence, removes any HTML tags in a sentence, removes any non-printable characters in a sentence, and marks as invalid any sentence containing digits, in that order. (For the details refer to `common.py`_ .) This seems to catch most language independent problems, but if you see more, please open an issue or make a pull request.
Currently, `common.py`_ decodes any URL encoded elements of a sentence, removes any HTML tags in a sentence, removes any non-printable characters in a sentence, and marks as invalid any sentence containing digits, in that order. (For the details refer to `common.py`_ .) This seems to catch most language independent problems, but if you see more, please open an issue or make a pull request.
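
The four steps listed above can be sketched roughly as follows. This is only an illustration of the described order of operations, not the actual contents of `common.py`_: the function name ``clean_common``, the regular expression used for HTML tags, and the ``(is_valid, sentence)`` return order are all assumptions made for this example.

```python
import re
from urllib.parse import unquote


def clean_common(sentence):
    """Hypothetical language-independent cleaner.

    Returns a tuple (is_valid, cleaned_sentence); this name and the
    return order are assumptions, not the project's real API.
    """
    # 1. Decode URL-encoded elements, e.g. "%20" becomes a space.
    sentence = unquote(sentence)
    # 2. Remove HTML tags (a simple regex, assumed here for brevity).
    sentence = re.sub(r"<[^>]*>", "", sentence)
    # 3. Remove non-printable characters.
    sentence = "".join(ch for ch in sentence if ch.isprintable())
    # 4. Mark as invalid any sentence containing digits. Treating an
    #    empty result (e.g. input that was only HTML) as invalid is
    #    also an assumption of this sketch.
    has_digits = any(ch.isdigit() for ch in sentence)
    is_valid = bool(sentence.strip()) and not has_digits
    return is_valid, sentence
```

With this sketch, a sentence such as ``Hello%20<b>world</b>`` would come back cleaned and valid, while ``I am in room 4025.`` would come back unchanged and marked invalid.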


Language Dependent Cleaning
@@ -182,7 +180,7 @@ Language Independent vs Dependent Cleaning

Of note is that in the language dependent case the method that does the cleaning takes not only the sentence but also the client_id of the contributor who read the sentence. In the language independent case this client_id was not present. However, for the language dependent case it's unfortunately required.

A sentence may contain text which is able to be read in many different, but valid, ways. For example, the sentence "I am in room 4025." can be validly read as "I am in room four oh two five". Equivalently, a valid reading is: "I am in room four zero two five". There are also other valid readings: "I am in room forty twenty five.", "I am in room four thousand twenty five."... To actually determine which of these readings a particular contributor gave, you have to listen to the audio, determine what they said, then replace the digits with text reflecting the contributor's reading, returning this cleaned sentence.
A sentence may contain text which is able to be read in many different but valid ways. For example, the sentence "I am in room 4025." can be validly read as "I am in room four oh two five". Equivalently, a valid reading is: "I am in room four zero two five". There are also other valid readings: "I am in room forty twenty five.", "I am in room four thousand twenty five."... To actually determine which of these readings a particular contributor gave, you have to listen to the audio, determine what they said, then replace the digits with text reflecting the contributor's reading, returning this cleaned sentence.
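
To make the role of ``client_id`` concrete, here is a minimal sketch of a language-dependent cleaner for the "room 4025" example. Everything in it is hypothetical: the function name ``clean_en``, the ``READINGS`` table, the client identifiers, and the ``(is_valid, sentence)`` return order are assumptions for illustration, and the table stands in for the manual step of listening to each contributor's audio.

```python
# Hypothetical table built by listening to each contributor's audio:
# (client_id, raw sentence) -> text reflecting what was actually read.
READINGS = {
    ("client-a", "I am in room 4025."): "I am in room four oh two five.",
    ("client-b", "I am in room 4025."): "I am in room forty twenty five.",
}


def clean_en(client_id, sentence):
    """Hypothetical language-dependent cleaner.

    Returns (is_valid, cleaned_sentence). The same raw sentence can map
    to different cleaned sentences depending on who read it.
    """
    reading = READINGS.get((client_id, sentence))
    if reading is not None:
        return True, reading
    # Without a known reading, digits remain ambiguous, so the sentence
    # is left unchanged and flagged as invalid if it contains any.
    is_valid = not any(ch.isdigit() for ch in sentence)
    return is_valid, sentence
```

The point of the sketch is only that the sentence alone is not enough: two contributors reading the same digits may need two different replacements.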


Contributing Code