cs-pub-ro/textit

TextIT

Prerequisites

Install the system and Python dependencies, then download the fastText language identification model (lid.176.bin).

sudo apt install libreoffice
conda install conda-forge::tesseract
conda install conda-forge::ghostscript
pip3 install -r requirements.txt
cd src/textit/processors && mkdir -p lang_id && cd lang_id && touch __init__.py && wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
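
Before running the extractor, it can help to confirm that the prerequisites above are actually visible on the machine. The following is a minimal sketch and not part of the repository: the function name `check_prerequisites` is hypothetical, the executable names (`soffice`, `tesseract`, `gs`) are the usual Linux names for LibreOffice, Tesseract, and Ghostscript and may differ on your system, and the model path assumes the download command above was run from the repository root.

```python
import shutil
from pathlib import Path

# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ("soffice", "tesseract", "gs")

# Location implied by the download command above, relative to the repository root.
MODEL_PATH = Path("src/textit/processors/lang_id/lid.176.bin")


def check_prerequisites(tools=REQUIRED_TOOLS, model_path=MODEL_PATH):
    """Return a dict mapping each prerequisite to True if it was found."""
    # shutil.which returns None when the executable is not on PATH.
    status = {tool: shutil.which(tool) is not None for tool in tools}
    status[str(model_path)] = Path(model_path).is_file()
    return status


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{'ok' if found else 'MISSING':8} {name}")
```

Run it from the repository root so that the relative model path resolves.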

Usage

The following command converts all the files in tests/fixtures into JSON files in extracted_text/.

python use_extractor.py tests/fixtures extracted_text/
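
Once the command has finished, the resulting JSON files can be inspected from Python. This is a minimal sketch under stated assumptions: `load_extracted` is a hypothetical helper, the files are assumed to be UTF-8 encoded, and no particular JSON schema is assumed because the keys written by the extractor are not documented here.

```python
import json
from pathlib import Path


def load_extracted(output_dir):
    """Yield (path, parsed_json) for every .json file under output_dir."""
    # rglob walks subdirectories too, so nested output layouts are also covered.
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as fh:  # UTF-8 is an assumption
            yield path, json.load(fh)


if __name__ == "__main__":
    for path, record in load_extracted("extracted_text"):
        print(path, type(record).__name__)
```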

To write the output files into a two-level directory structure based on the hash of each file, add this flag:

--use_hash_directories
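
To illustrate what a two-level, hash-based layout can look like, here is a minimal sketch. It is not the repository's implementation: the helper name `hashed_output_path`, the choice of SHA-256 over the file contents, the use of the first two pairs of hex characters as directory names, and the output file name are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def hashed_output_path(input_file, output_root):
    """Build output_root/<aa>/<bb>/<digest>.json from the file's SHA-256 digest."""
    # Hashing the file contents is an assumption; the tool may hash something else.
    digest = hashlib.sha256(Path(input_file).read_bytes()).hexdigest()
    # First two hex characters pick the top-level directory, the next two the second level.
    return Path(output_root) / digest[:2] / digest[2:4] / f"{digest}.json"
```

Spreading files over hash-derived subdirectories keeps any single directory from accumulating a very large number of entries, which is the usual motivation for this kind of layout.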

About

Document processing pipeline
