CINI OCR

This project aims to digitize scanned documents retrieved from the Cini Foundation. The goal is to segment the scans, find the paintings and text areas inside them, and later extract the text.

Installation (tested on Ubuntu 14.04; should work on other Ubuntu versions)

Downloading C libraries (apt-get install or equivalent):

libzbar-dev, libzbar0, tesseract-ocr (apt-get install libzbar-dev libzbar0 tesseract-ocr)

Downloading and installing Miniconda (the minimal Anaconda installer):

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Say yes to the PATH export in .bashrc at the end of the installation.
Reload .bashrc with source ~/.bashrc, or close and reopen the terminal.

Cloning the project and installing the dependencies

git clone https://github.com/GrimReaperSam/Cini-OCR.git
cd Cini-OCR
conda config --add channels https://conda.anaconda.org/menpo
conda config --add channels https://conda.anaconda.org/Atanahel
conda env create -f environment.yml
The last command creates a virtual environment called OCR with all the necessary dependencies.

Running

Every time you open a new shell, you have to activate the virtual environment:

source activate OCR
(Note: you can add this line to your .bashrc instead of running it in every new shell.)
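
Once the environment is active, a quick check can confirm that the C libraries installed earlier are visible. The snippet below is only a minimal sketch using the Python standard library. The function name check_native_dependencies is hypothetical and not part of this project, and a "found" result only means the binary or library could be located, not that it works end to end.

```python
import ctypes.util
import shutil


def check_native_dependencies():
    """Report whether the tesseract binary and the zbar shared library can be located.

    Hypothetical helper for illustration; it is not part of the Cini-OCR code base.
    """
    return {
        # tesseract-ocr installs a command-line binary named "tesseract"
        "tesseract": shutil.which("tesseract") is not None,
        # libzbar0 installs a shared library that the linker knows as "zbar"
        "zbar": ctypes.util.find_library("zbar") is not None,
    }


status = check_native_dependencies()
for name, found in status.items():
    print(name, "found" if found else "missing")
```

If either entry is reported as missing, rerunning the apt-get install step is a reasonable first thing to try.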

To run the program, use the following command:

python pipeline.py

The arguments are:

'-r' Directory containing the raw images
'-d' Destination directory for the results
'-s' Skip raws that were already processed

Example: python pipeline.py -r 'samples' -d 'destination' -s True
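
For readers who want to see how such options are typically declared, here is a minimal argparse sketch. It is not the actual code of pipeline.py: the long option names, the None/False defaults, and the str_to_bool helper are assumptions made for illustration. The only details taken from this README are the three short flags and the fact that '-s' is given a value such as True.

```python
import argparse


def str_to_bool(value):
    """Interpret common spellings of true/false given on the command line."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    # Hypothetical parser: long names and defaults are assumptions, not the project's.
    parser = argparse.ArgumentParser(description="Process a folder of raw scans")
    parser.add_argument("-r", "--raws", default=None, help="Raws directory")
    parser.add_argument("-d", "--destination", default=None, help="Destination directory")
    # type=bool would treat any non-empty string (even "False") as True,
    # so an explicit converter is used instead.
    parser.add_argument("-s", "--skip-processed", type=str_to_bool, default=False,
                        help="Skip processed")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(["-r", "samples", "-d", "destination", "-s", "True"])
print(args.raws, args.destination, args.skip_processed)
```

The explicit converter matters because the example passes True as a string. With a plain bool type, passing False on the command line would still evaluate to True.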

Project Structure

The main files in the project are the following:

shared.py: Contains some shared constants
utils.py: Contains some shared utility functions
raw_converter.py: Converts a RAW CR2 file into a numpy array
document.py: Detects the cardboard inside the image and crops it out
cardboard.py: Detects the painting and the text section inside the cardboard and crops them out
barcode.py: Detects the barcode area in a verso and reads it
extractor.py: Given a text section, finds the different boxes inside it and the location of the text, then reads the text with an OCR engine and creates a bounds+text structure
pipeline.py: Groups all the previous classes in a pipeline. Takes a folder of raw images and processes them one by one. Saves each recto/verso pair into a folder with all their extracted information.
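
To make the last two entries more concrete, the sketch below shows one possible shape for a bounds+text record and one way the skip option could interact with the per-pair output folders. Everything here is hypothetical: the names TextRegion, already_processed and pairs_to_process, the field layout, and the rule "a non-empty folder means processed" are assumptions for illustration, not the project's actual implementation.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical record for one OCR result: a bounding box plus the text read inside it.
TextRegion = namedtuple("TextRegion", ["x", "y", "width", "height", "text"])


def already_processed(destination, pair_name):
    """Return True when the output folder for this recto/verso pair exists and is non-empty."""
    pair_dir = os.path.join(destination, pair_name)
    return os.path.isdir(pair_dir) and len(os.listdir(pair_dir)) > 0


def pairs_to_process(pair_names, destination, skip_processed):
    """Drop pairs that already have output when skipping is enabled; keep all otherwise."""
    if not skip_processed:
        return list(pair_names)
    return [name for name in pair_names if not already_processed(destination, name)]


# Usage: pair_001 already has output, so only pair_002 remains to be processed.
with tempfile.TemporaryDirectory() as destination:
    os.makedirs(os.path.join(destination, "pair_001"))
    open(os.path.join(destination, "pair_001", "text.json"), "w").close()
    todo = pairs_to_process(["pair_001", "pair_002"], destination, skip_processed=True)
    everything = pairs_to_process(["pair_001", "pair_002"], destination, skip_processed=False)

region = TextRegion(x=10, y=20, width=200, height=40, text="example")
print(todo, region.text)
```

Treating an empty folder as unprocessed is a deliberate choice in this sketch: a run interrupted right after creating the folder would be retried rather than silently skipped.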
