This project aims to digitize scanned documents retrieved from the Cini Foundation. The goal is to segment the scans, find the paintings and text areas inside them, and later extract the text.
Downloading C libraries (apt-get install or equivalent):
- libzbar-dev, libzbar0, tesseract-ocr
apt-get install libzbar-dev libzbar0 tesseract-ocr
Downloading Anaconda (Miniconda installer):
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
- Say yes to the PATH export in .bashrc at the end of the installation.
- Reload .bashrc with source .bashrc, or close and reopen the terminal.
Cloning the project and installing the dependencies:
git clone https://github.com/GrimReaperSam/Cini-OCR.git
cd Cini-OCR
conda config --add channels https://conda.anaconda.org/menpo
conda config --add channels https://conda.anaconda.org/Atanahel
conda env create -f environment.yml
The last command creates a virtual environment called OCR with all the necessary dependencies.
Every time you open a new shell, you have to activate the virtual environment:
source activate OCR
(Note: you can add this line to .bashrc instead of running it in every new shell.)
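As a quick sanity check after the setup steps above, the following Python sketch reports whether the tesseract binary is on the PATH, whether a zbar shared library can be located, and whether the OCR environment is active. The function name check_setup is hypothetical and not part of the project. It uses only the standard library and assumes that conda exports the CONDA_DEFAULT_ENV variable when an environment is activated.

```python
import ctypes.util
import os
import shutil


def check_setup(expected_env="OCR"):
    """Return booleans describing what the setup steps should have installed.

    check_setup is a hypothetical helper, not a function of this project.
    """
    return {
        # tesseract-ocr installs a "tesseract" executable
        "tesseract_on_path": shutil.which("tesseract") is not None,
        # libzbar0 provides the zbar shared library
        "zbar_library_found": ctypes.util.find_library("zbar") is not None,
        # conda sets CONDA_DEFAULT_ENV to the name of the active environment
        "conda_env_active": os.environ.get("CONDA_DEFAULT_ENV") == expected_env,
    }
```

Printing check_setup() in a fresh shell shows which step, if any, still needs attention.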
To run the program, use the following command:
python pipeline.py
The arguments are:
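- '-r' Raws directory (folder containing the RAW input images)
- '-d' Destination directory (folder where the results are saved)
- '-s' Skip processed (True or False; skip inputs that were already processed)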
Example: python pipeline.py -r 'samples' -d 'destination' -s True
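For illustration, here is a minimal sketch of how these three flags could be parsed with argparse. It is not the actual code of pipeline.py: the names build_parser, str_to_bool, raws, destination and skip_processed are assumptions, and the real defaults are not documented here. The explicit boolean converter is used because argparse's type=bool would turn the string 'False' into True.

```python
import argparse


def str_to_bool(value):
    """Convert strings such as 'True' or 'False' to a real boolean."""
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    """Build a parser for the three documented flags (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Process a folder of raw scans.")
    # Defaults are placeholders; the project's real defaults are not known here.
    parser.add_argument("-r", dest="raws", default=None, help="Raws directory")
    parser.add_argument("-d", dest="destination", default=None,
                        help="Destination directory")
    parser.add_argument("-s", dest="skip_processed", type=str_to_bool,
                        default=False, help="Skip processed (True or False)")
    return parser
```

Calling build_parser().parse_args() with the arguments of the example above would yield raws set to 'samples', destination set to 'destination' and skip_processed set to True.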
The main files in the project are the following:
- shared.py: Contains some shared constants
- utils.py: Contains some shared utility functions
- raw_converter.py: Converts a RAW CR2 file into a numpy array
- document.py: Detects the cardboard inside the image and crops it out
- cardboard.py: Detects the painting and the text section inside the cardboard and crops them out
- barcode.py: Detects the barcode area in a verso and reads it
- extractor.py: Given a text section, finds the different boxes inside it and the location of the text. It then reads the text with OCR and creates a bounds+text structure
- pipeline.py: Groups all the previous classes in a pipeline. Takes a folder of raw images and processes them one by one. Saves each recto/verso pair into a folder with all their extracted information.
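The overall flow of such a pipeline can be sketched as follows. This is an illustration only, not the code of pipeline.py. The functions pair_raws and run_pipeline are hypothetical, and so is the process_pair callback, which stands in for the conversion, cropping, barcode and OCR steps. Two further assumptions are made: consecutive CR2 files in sorted order form a recto/verso pair, and each output folder is named after the recto file.

```python
import os


def pair_raws(raws_dir):
    """Group sorted .CR2 files two by two as (recto, verso).

    The pairing rule is an assumption; a trailing unpaired file is ignored.
    """
    names = sorted(n for n in os.listdir(raws_dir) if n.lower().endswith(".cr2"))
    return [(names[i], names[i + 1]) for i in range(0, len(names) - 1, 2)]


def run_pipeline(raws_dir, destination_dir, process_pair, skip_processed=False):
    """Call process_pair(recto_path, verso_path, out_dir) for every pair.

    Returns the list of output folders handled in this run.
    """
    written = []
    for recto, verso in pair_raws(raws_dir):
        # One output folder per pair, named after the recto file (assumption).
        out_dir = os.path.join(destination_dir, os.path.splitext(recto)[0])
        if skip_processed and os.path.isdir(out_dir):
            continue  # folder already exists, so a previous run handled this pair
        os.makedirs(out_dir, exist_ok=True)
        process_pair(os.path.join(raws_dir, recto),
                     os.path.join(raws_dir, verso),
                     out_dir)
        written.append(out_dir)
    return written
```

Passing the processing step as a callback keeps the folder walking and the skip logic testable without any RAW files or OCR installed.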