GrimReaperSam/Cini-OCR

CINI OCR

This project aims to digitize scanned documents retrieved from the Cini Foundation. The goal is to segment the scans, find the paintings and text areas inside them, and later extract the text.

Installation (tested on Ubuntu 14.04; should work on other Ubuntu versions)

Downloading C libraries (apt-get install or equivalent):

  • libzbar-dev, libzbar0, tesseract-ocr (apt-get install libzbar-dev libzbar0 tesseract-ocr)

Downloading Anaconda (Miniconda installer):

  • wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
  • bash Miniconda3-latest-Linux-x86_64.sh (say yes to the PATH export in .bashrc at the end of the installation)
  • Reload .bashrc with source .bashrc, or close and reopen the terminal.

Cloning the project and installing the dependencies

  • git clone https://github.com/GrimReaperSam/Cini-OCR.git
  • cd Cini-OCR
  • conda config --add channels https://conda.anaconda.org/menpo
  • conda config --add channels https://conda.anaconda.org/Atanahel
  • conda env create -f environment.yml (creates the virtual environment called OCR with all the necessary dependencies)

Running

Every time you open a new shell, you have to activate the virtual environment:

  • source activate OCR (Note: you can add this line to .bashrc instead of running it in every new shell)

To run the program, you can use the following command:

  • python pipeline.py

The arguments are:

  • '-r' Raws directory
  • '-d' Destination directory
  • '-s' Skip processed

Example: python pipeline.py -r 'samples' -d 'destination' -s True
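
As an illustration of how the three flags above could be parsed, here is a minimal sketch using argparse from the Python standard library. It is not taken from pipeline.py: the long option names (--raws, --destination, --skip-processed), the str_to_bool helper and the default values are assumptions made for this example, and the real script may handle its arguments differently.

```python
import argparse


def str_to_bool(value):
    """Interpret common textual spellings of a boolean, e.g. the 'True' in '-s True'."""
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    # Only the short flags -r, -d and -s come from the README.
    # The long names and the defaults below are hypothetical.
    parser = argparse.ArgumentParser(description="Process a folder of raw scans.")
    parser.add_argument("-r", "--raws", default=None,
                        help="directory containing the raw images")
    parser.add_argument("-d", "--destination", default=None,
                        help="directory where the results are written")
    parser.add_argument("-s", "--skip-processed", type=str_to_bool, default=False,
                        help="skip inputs that were already processed")
    return parser


# Parse the same arguments as the README example.
args = build_parser().parse_args(["-r", "samples", "-d", "destination", "-s", "True"])
print(args.raws, args.destination, args.skip_processed)
```

Converting the value with a small helper avoids a common pitfall: passing type=bool to argparse would treat any non-empty string, including 'False', as True.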

Project Structure

The main files in the project are the following:

  • shared.py: Contains some shared constants
  • utils.py: Contains some shared utility functions
  • raw_converter.py: Converts a RAW CR2 file into a numpy array
  • document.py: Detects the cardboard inside the image and crops it out
  • cardboard.py: Detects the painting and the text section inside the cardboard and crops them out
  • barcode.py: Detects the barcode area in a verso and reads it
  • extractor.py: Given a text section, finds the different boxes inside it and the location of the text. It then reads the text with an OCR engine and creates a bounds+text structure
  • pipeline.py: Groups all the previous classes in a pipeline. Takes a folder of raw images and processes them one by one. Saves each recto/verso pair into a folder with all their extracted information.
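
The bookkeeping that pipeline.py is described as doing (walking a folder of raw files, grouping them into recto/verso pairs, and optionally skipping pairs that already have an output folder) could look roughly like the sketch below. Everything here is an assumption made for illustration: the README does not say how pairs are identified or how output folders are named, so the sketch simply pairs consecutive files in sorted order and names each folder after the recto file. The function names pair_recto_verso and plan_outputs are hypothetical and do not come from the project.

```python
import os


def pair_recto_verso(filenames):
    """Group file names two by two as (recto, verso) pairs.

    Assumption: scans alternate recto, verso when sorted by name.
    """
    ordered = sorted(filenames)
    if len(ordered) % 2 != 0:
        raise ValueError("expected an even number of scans")
    return list(zip(ordered[0::2], ordered[1::2]))


def plan_outputs(raws_dir, dest_dir, skip_processed=False):
    """Return (recto_path, verso_path, output_folder) triples still to be processed."""
    raws = [f for f in os.listdir(raws_dir) if f.lower().endswith(".cr2")]
    plan = []
    for recto, verso in pair_recto_verso(raws):
        # Hypothetical naming scheme: one output folder per pair, named after the recto.
        out_dir = os.path.join(dest_dir, os.path.splitext(recto)[0])
        if skip_processed and os.path.isdir(out_dir):
            continue  # this pair already has an output folder
        plan.append((os.path.join(raws_dir, recto),
                     os.path.join(raws_dir, verso),
                     out_dir))
    return plan
```

Each triple returned by plan_outputs would then be handed to the actual processing stages (raw conversion, cardboard detection, barcode reading, text extraction), which are left out of this sketch.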
