clevertagger - morphologically informed POS tagging for German

ABOUT

clevertagger is a German part-of-speech tagger based on a CRF tool and SMOR. Its main component is a module that extracts features from SMOR's morphological analysis. Combining machine learning with FST-based morphological features promises robust performance even for words that were not observed during training, in particular morphologically complex (and rare) adjectives, verbs and nouns, which tend to have high error rates with conventional taggers.

smor_getpos.py can also be used as a stand-alone script to convert the SMOR output into a list of possible part-of-speech tags in the STTS tagset.

AUTHOR

Rico Sennrich, Institute of Computational Linguistics, University of Zurich (http://www.cl.uzh.ch).

LICENSE

clevertagger is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (see LICENSE).

tokenizer.perl and nonbreaking_prefix.de are from the Moses toolkit and are licensed under the LGPL (http://www.statmt.org/moses/).

preprocess/sentence_splitter is from NLTK and is licensed under the Apache License 2.0 (https://github.com/nltk/nltk).

REQUIREMENTS

Linux (currently SFST is Unix/Linux only)
Python >= 2.6
one of these CRF tools:
- Wapiti http://wapiti.limsi.fr/
- CRF++ http://crfpp.googlecode.com/svn/trunk/doc/index.html (no trained models available)
SFST >= 1.3 http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html

Optional dependencies:

Perl (for tokenizer)

INSTALLATION INSTRUCTIONS

Install the dependencies listed above.
Obtain an SMOR transducer and a corresponding CRF model. Both are available at http://kitt.ifi.uzh.ch/kitt/zmorge/ .
Set the options SMOR_MODEL and CRF_MODEL in config.py (and adjust other options if necessary).
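
As a minimal sketch, the two options might look like this in config.py. The paths and file names are placeholders chosen for illustration (assumptions), not the names of the distributed files:

```python
# config.py (excerpt); both paths are hypothetical placeholders
SMOR_MODEL = '/path/to/smor_transducer'  # the SMOR transducer obtained in the previous step
CRF_MODEL = '/path/to/crf_model'         # the CRF model that corresponds to that transducer
```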

USAGE

Assuming that you have a CRF++/Wapiti model (downloaded or trained yourself), you can call clevertagger like this:

./clevertagger < input_file

Further options are displayed through

./clevertagger -h

By default, clevertagger expects tokenized input (one word per line, with an empty line marking each sentence boundary); for untokenized input, use the --tokenize option. A sentence splitter is included in the preprocess directory. To process raw text, call:

preprocess/sentence_splitter < input_file | ./clevertagger --tokenize
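
If the text is already tokenized in your own pipeline, the default input format (one word per line, empty line after each sentence) can be produced with a few lines of Python. This is only a sketch; the helper name to_tagger_input is hypothetical and not part of clevertagger:

```python
import sys


def to_tagger_input(sentences):
    """Render tokenized sentences in the one-word-per-line format.

    `sentences` is a list of token lists. Every sentence is followed by
    an empty line, which marks the sentence boundary.
    """
    lines = []
    for tokens in sentences:
        lines.extend(tokens)
        lines.append('')  # empty line = sentence boundary
    return ''.join(line + '\n' for line in lines)


# two example sentences, already split into tokens
example = to_tagger_input([['Das', 'ist', 'ein', 'Test', '.'],
                           ['Das', 'auch', '.']])
sys.stdout.write(example)
```

Saved to a file, this output can be passed to ./clevertagger on standard input without the --tokenize option.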

clevertagger also supports the n-best-tagging features of CRF++/Wapiti. Use the option -n to get multiple analyses for each sentence, and -t to get multiple analyses for each token.

You can also use clevertagger as a Python module with a persistent tagger class; it expects a list of tokenized sentences as input:

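import clevertagger

# persistent tagger instance: create it once and reuse it
tagger = clevertagger.Clevertagger()

# each input sentence is a string of space-separated tokens
for sentence in tagger.tag(['Das ist ein Test .', 'Das auch .']):
    print(sentence + '\n')  # parenthesized form works in Python 2 and 3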

TRAINING INSTRUCTIONS

A new CRF model can be trained with a training text in the format illustrated by sample_training_file.txt, i.e. one word per line, with token and tag separated by spaces or a tab, and empty lines for sentence boundaries.
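
Before starting a long training run, it can be worth checking that a training text follows this format. The following is a minimal sketch, assuming that neither tokens nor tags contain whitespace; the function name read_training_sentences and the sample lines are made up for illustration and are not taken from sample_training_file.txt:

```python
def read_training_sentences(lines):
    """Parse 'token<whitespace>tag' lines into a list of sentences.

    Each sentence is a list of (token, tag) pairs; an empty line ends a
    sentence. Raises ValueError for lines that do not have two fields.
    """
    sentences, current = [], []
    for number, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            if current:  # close the current sentence
                sentences.append(current)
                current = []
            continue
        fields = line.split()  # splits on spaces and tabs alike
        if len(fields) != 2:
            raise ValueError('line {0}: expected token and tag, got {1!r}'
                             .format(number, line))
        current.append((fields[0], fields[1]))
    if current:  # last sentence without a trailing empty line
        sentences.append(current)
    return sentences


# illustrative sample with STTS tags (not from the distributed sample file)
sample = ['Das\tPDS', 'ist\tVAFIN', 'ein\tART', 'Test\tNN', '.\t$.', '',
          'Das PDS', 'auch ADV', '. $.', '']
parsed = read_training_sentences(sample)
print(len(parsed))
```

To check a real file, pass an open file object instead of the sample list, since the function only iterates over lines.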

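Then, execute the following two commands: the first converts the training text into a CRF training file, and the second trains the model. Training may take several days, depending on corpus size and the number of cores; set the number of threads accordingly (--nthread for Wapiti, -p for CRF++).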

./clevertagger -e < training_file > crf_training_file

For Wapiti, a typical training command is:

wapiti train --compact -p crf_config --nthread 10 crf_training_file crfmodel

For CRF++, a typical command is:

crf_learn -f 3 -c 1.5 -p 10 crf_config crf_training_file crfmodel

Finally, change the option CRF_MODEL in config.py to point to the trained model, or move the trained model into this directory.

PERFORMANCE

Some evaluation results from Sennrich, Volk and Schneider (2013), with TnT/clevertagger models trained on TüBa-D/Z (and the standard TreeTagger model), and using Morphisto for morphological analysis:

Tagging accuracy (in %)

Tagger	TüBa-D/Z	Sofies Welt
TreeTagger	94.9	95.0
TnT	97.0	94.7
clevertagger	97.6	96.6

Tagging performance depends on the quality of the morphological analysis, and is slightly better with the SMOR lexicon.
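
The accuracy table above can also be read in terms of error rates. The sketch below computes the relative error reduction of clevertagger over TnT, the two taggers that were trained on the same corpus; the figures are copied from the table, and the function name relative_error_reduction is chosen here for illustration:

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline's errors that the new system removes.

    Both arguments are accuracies in percent; the error rate is
    100 minus the accuracy.
    """
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# accuracies in % from the table above, in column order:
# (TueBa-D/Z, Sofies Welt); "TueBa" is the ASCII spelling of TüBa
accuracy = {
    'TnT': (97.0, 94.7),
    'clevertagger': (97.6, 96.6),
}

for index, corpus in enumerate(('TueBa-D/Z', 'Sofies Welt')):
    reduction = relative_error_reduction(accuracy['TnT'][index],
                                         accuracy['clevertagger'][index])
    print('{0}: {1:.1f}% fewer errors than TnT'.format(corpus, 100 * reduction))
```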

A more indirect evaluation measuring parsing performance of ParZu on a 3000-sentence test set using different taggers:

Tagger	precision	recall	f-measure
TreeTagger	85.6	83.7	84.6
clevertagger	87.9	86.7	87.3
clevertagger (50-best)	88.0	87.7	87.8
gold tags	89.8	89.3	89.5
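
The f-measure column is consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall. The README does not state the formula, so this is an assumption; the sketch below recomputes the column from the rounded precision and recall values in the table, which is why small deviations in the last digit are to be expected:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (tagger, precision, recall, reported f-measure), copied from the table above
rows = [
    ('TreeTagger', 85.6, 83.7, 84.6),
    ('clevertagger', 87.9, 86.7, 87.3),
    ('clevertagger (50-best)', 88.0, 87.7, 87.8),
    ('gold tags', 89.8, 89.3, 89.5),
]

for name, precision, recall, reported in rows:
    computed = f_measure(precision, recall)
    print('{0}: computed {1:.2f}, reported {2}'.format(name, computed, reported))
```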

PUBLICATIONS

The tagger is described in:

Rico Sennrich, Martin Volk and Gerold Schneider (2013): Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2013, Hissar, Bulgaria.
