prevertical2text

Extracts plain text, language identification and more metadata from Spiderling prevertical files

Download

Clone this repo along with submodules:

git clone --recurse-submodules https://github.com/bitextor/prevertical2text.git

Or:

git clone https://github.com/bitextor/prevertical2text.git
git submodule update --init

Compile

mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/your/prefix/path ..
# cmake .. -DCMAKE_BUILD_TYPE=Debug # for debug
make -j
make install

Usage

prevertical2text -o <output_folder> [ -f <output_files> ] <prevertical_file>...
  • --output/-o: output folder
  • --files/-f: comma-separated list of output files to write, named without the .gz extension; text and url are always written, while mime and html are optional
  • --verbose/-v: print progress and filtering information
  • --silent/-s: print only warnings and errors
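
As an illustration of the options above, here is a minimal sketch of a Python wrapper that assembles a prevertical2text command line. The helper names (build_command, run) and the paths (out, crawl.prevert) are hypothetical and not part of this repository; only the flags come from the usage summary. It assumes the binary has been installed and is on PATH, and it only prints the command when it is not.

```python
import shutil
import subprocess


def build_command(output_folder, prevertical_files, output_files=None, verbose=False):
    """Assemble the argument list for one prevertical2text run (hypothetical helper)."""
    if not prevertical_files:
        raise ValueError("at least one prevertical file is required")
    cmd = ["prevertical2text", "-o", str(output_folder)]
    if output_files:
        for name in output_files:
            # The usage text says names are given without the .gz extension.
            if name.endswith(".gz"):
                raise ValueError(f"pass {name!r} without the .gz suffix")
        cmd += ["-f", ",".join(output_files)]
    if verbose:
        cmd.append("-v")
    # Input files go last, matching the synopsis in the usage section.
    cmd += [str(p) for p in prevertical_files]
    return cmd


def run(cmd):
    """Run the command if the binary is on PATH; otherwise only report it."""
    if shutil.which(cmd[0]) is None:
        print("prevertical2text not found on PATH; would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


# Hypothetical paths, used only for illustration.
command = build_command("out", ["crawl.prevert"],
                        output_files=["text", "url", "mime"], verbose=True)
print(" ".join(command))
# prints: prevertical2text -o out -f text,url,mime -v crawl.prevert
```

Calling run(command) would then launch the tool. Keeping command construction separate from execution makes it easy to check the arguments before processing a large crawl.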

Included dependencies

HTML Tokenizer by c-smile

HTML entities decoder by Christoph Gärtner


Connecting Europe Facility

All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.
