Extracts plain text, language identification, and other metadata from Spiderling prevertical files.
Clone this repo along with submodules:
git clone --recurse-submodules https://github.com/bitextor/prevertical2text.git
Or:
git clone https://github.com/bitextor/prevertical2text.git
git submodule update --init
Then build and install:
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/your/prefix/path ..
# cmake .. -DCMAKE_BUILD_TYPE=Debug # for debug
make -j
make install
Usage:
prevertical2text -o <output_folder> [ -f <output_files> ] <prevertical_file>...
--output / -o: output folder
--files / -f: comma-separated list of output files, without the .gz extension; text and url are always written, while mime and html are optional
--verbose / -v: print progress and filtering information
--silent / -s: print only warnings and errors
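As an illustration of how these options combine, here is a minimal sketch of one invocation. The output folder `out` and the input file `corpus.prevert` are hypothetical names chosen for this example, not paths defined by the project. The script only runs the tool if it is installed and the input file exists.

```shell
#!/bin/sh
# Hypothetical paths for illustration; replace them with your own.
output_dir="out"
input="corpus.prevert"

# Request the optional mime and html outputs as well.
# Names are given without the .gz extension, separated by commas.
files="text,url,mime,html"

# Assemble the command line following the usage synopsis.
cmd="prevertical2text -o $output_dir -f $files $input"
echo "$cmd"

# Run it only when the binary is on PATH and the input file exists.
if command -v prevertical2text >/dev/null 2>&1 && [ -f "$input" ]; then
    $cmd
else
    echo "prevertical2text or $input not found; command not executed" >&2
fi
```

Since `text` and `url` are always written, listing them in `-f` is redundant here; they are included only to make the full set of outputs explicit.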
HTML Tokenizer by c-smile
HTML entities decoder by Christoph Gärtner
All documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.