Using a simple language, tokens of a `.conllu` file can be edited if a condition is met:

`bin/replace.sh rules.txt input.conllu [--nostrict] > output.conllu`

The rule file has one line per rule:

`condition > new_values`

`condition` is a logical expression which is evaluated for each word. If it is true, the new values are assigned to the token which satisfies the condition.
The condition is a set of `key:value` pairs, operators like `and`, `or` or `not`, and parentheses. The condition may contain whitespace.

Example:

- `Upos:ADP and !Deprel:case`: true if the current token has `ADP` as UPOS and its deprel is not `case`.

Available keys:

- `Upos:` (values: `[A-Z]+`)
- `Xpos:` (values: string of any characters except whitespace, `)` and `&`)
- `Lemma:` (values: string of any characters except whitespace, `)` and `&`)
- `Form:` (values: string of any characters except whitespace, `)` and `&`)
- `Deprel:` (values: string of any characters except whitespace, `)` and `&`, optionally followed by `:` and a string of any characters except whitespace, `)` and `&`)
- `HeadId:` (values: `[+-][0-9]+` (relative from current head) or `[0-9]+` (absolute head id); true if the head of the current token matches)
- `EUD:` (values: `[+-][0-9]+`, `:` and a deprel; if the EUD head id is `*`, any head position is accepted; without `-` or `+`, the EUD head is interpreted as an absolute value)
- `Feat:` (values: `FeatureName=Value` or `FeatureName:Value`; the feature name must match `[A-Za-z_\[\]]+`, the value `[A-Za-z0-9]+`)
- `Misc:` (values: `MiscName=Value` or `MiscName:Value`; the MiscName must match `[A-Za-z_]+`, the value can be any string without whitespace, `)` and `&`)
- `Id:` (values: integer)
- `MWT:` (values: length of the multi-word token, `[2-9]`)
- `IsEmpty` (no value; true if the current node is empty)
- `IsMWT` (no value; true if the current node is a MWT)
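
The value patterns in the key list can be tried out in isolation. The following Python sketch is only an illustration and not part of the tool: the names `VALUE_PATTERNS` and `is_valid_value` are hypothetical, and the patterns are copied from the list above on the assumption that a value must match its pattern completely.

```python
import re

# Patterns copied from the key list; the dictionary and its keys are
# hypothetical names chosen for this illustration only.
VALUE_PATTERNS = {
    "Upos": r"[A-Z]+",
    "FeatName": r"[A-Za-z_\[\]]+",
    "FeatValue": r"[A-Za-z0-9]+",
    "HeadIdRelative": r"[+-][0-9]+",
    "HeadIdAbsolute": r"[0-9]+",
}


def is_valid_value(kind: str, value: str) -> bool:
    """Return True if the whole value matches the pattern for this kind."""
    return re.fullmatch(VALUE_PATTERNS[kind], value) is not None
```

For instance, `is_valid_value("HeadIdRelative", "+2")` holds, while `"2"` only matches the absolute pattern.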
`Form:`, `Lemma:` and `Xpos:` can contain simple regular expressions (only the character `)` cannot be used).
To check for any Feat or Misc value, leave the value empty:

- `Feat:Gender:` true if the current word has the feature `Gender` with any value

To check for the absence of a given feature name in the Feature or Misc column, use the following:

- `not Feat:Gender:` true if the current word has no feature `Gender`

`EUD` cannot (yet) deal with empty word ids (`n.m`).
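
The presence and absence checks for features can be pictured as a lookup in the parsed column. This is a minimal Python sketch, not the tool's implementation: `parse_feats` and `has_feature` are hypothetical names, and the only fact relied on is the CoNLL-U convention that the column holds `Name=Value` pairs separated by `|`, with `_` for an empty column.

```python
def parse_feats(column: str) -> dict:
    """Parse a column such as 'Gender=Fem|Number=Sing' into a dict."""
    if not column or column == "_":
        return {}
    feats = {}
    for item in column.split("|"):
        name, _, value = item.partition("=")
        feats[name] = value
    return feats


def has_feature(column: str, name: str) -> bool:
    """Models 'Feat:Name:' (any value); negate it for 'not Feat:Name:'."""
    return name in parse_feats(column)
```

With this model, `has_feature("Gender=Fem|Number=Sing", "Gender")` is true, and `not has_feature("Number=Sing", "Gender")` corresponds to `not Feat:Gender:`.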
`Lemma` and `Form` can take either a regex as argument or the name of a file which contains a list of forms or lemmas:

- `Lemma:sing.* > misc:"Value=Sing"`
- `Lemma:#mylemmas.txt > misc:"Value=Sing"` (if the file `mylemmas.txt` does not exist, the condition is false)
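
The file-based variant can be sketched in Python as follows. This is an illustration under assumptions, not the tool's code: the function name `in_word_list` is hypothetical, the file is assumed to hold one form or lemma per line, and entries are compared as literal strings. The only documented behaviour reproduced here is that a missing file makes the condition false.

```python
from pathlib import Path


def in_word_list(value: str, filename: str) -> bool:
    """Return True if value is listed in the file (one entry per line assumed)."""
    path = Path(filename)
    if not path.is_file():
        # Documented behaviour: a missing file makes the condition false.
        return False
    lines = path.read_text(encoding="utf-8").splitlines()
    entries = {line.strip() for line in lines if line.strip()}
    return value in entries
```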
In addition to the keys listed above, four functions are available to take the context of the token into account:

- `child()` child of current token
- `head()` head of current token
- `prec()` preceding token
- `next()` following token

For example:

- `head(head(Upos:VERB and Feat:Tense=Past))`: true if the current token has a head which has a head with UPOS `VERB` and the feature `Tense=Past`
- `child(Upos:VERB && Feat:VerbForm=Part) and child(Upos:DET)`: true if the current token has a dependent with UPOS `VERB` and a feature `VerbForm=Part`, and another child with UPOS `DET`
- `head(next(Upos:NOUN))`: true if the current token has a head which is followed by a token with UPOS `NOUN`

Functions can be nested (even though `child(head())` does not make sense, does it :-)
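
To make the nesting concrete, here is a small Python model of a sentence in which `head()` and `child()` are plain lookups. It is a sketch only: the `Token` class and all function names are hypothetical, and a head id of 0 is taken to mean the root, which has no head token.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    id: int
    upos: str
    head: int  # 0 means the token is the root
    feats: dict = field(default_factory=dict)


def head_of(sentence, token):
    """Return the head token, or None if the token is the root."""
    return next((t for t in sentence if t.id == token.head), None)


def children_of(sentence, token):
    """Return all tokens whose head is the given token."""
    return [t for t in sentence if t.head == token.id]


def head_head_is_past_verb(sentence, token):
    """Models head(head(Upos:VERB and Feat:Tense=Past))."""
    h = head_of(sentence, token)
    hh = head_of(sentence, h) if h is not None else None
    return hh is not None and hh.upos == "VERB" and hh.feats.get("Tense") == "Past"


def any_child(sentence, token, predicate):
    """Models child(...): true if at least one child satisfies the predicate."""
    return any(predicate(c) for c in children_of(sentence, token))
```

A condition such as `child(Upos:DET)` then corresponds to `any_child(sentence, token, lambda c: c.upos == "DET")`.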
In order to compare values (for instance to check whether subject-verb agreement is OK), value comparison is possible using the access operator `@`: e.g. `@Upos` or `@Feat:Number` gives access to column values, and `=` is used to compare. If any of the accessed columns is empty (`_`), the comparison is evaluated as false.

For example:

- `@Feat:Number=head(@Feat:Number)` true if the current word and its head both have a feature `Number` with the same value
- `@Upos=@Xpos` true if the current word has the same value for `UPOS` and `XPOS`
- `@Deprel=prec(@Deprel)` true if the current word and the preceding word have the same `deprel` value
- `@Xpos=head(head(@Feat:Featname))` true if the `XPOS` of the current word has the same value as the feature `Featname` of the head of its head
- `@Feat:Gender=head(@Feat:Gender) and not Upos:DET` true if the head and the current word have the same value for the feature `Gender` and the current word is not a `DET`. If either of the two words has no feature `Gender`, the whole expression is evaluated as false.
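
The rule that an empty value makes a comparison false can be written down in a few lines. The Python sketch below is an illustration only: `same_value` is a hypothetical name, and a missing feature is represented here as `None`.

```python
def same_value(a, b) -> bool:
    """Models '@X=...': false if either side is empty ('_', '' or missing)."""
    empty = (None, "", "_")
    if a in empty or b in empty:
        return False
    return a == b
```

With features held in dictionaries, `@Feat:Number=head(@Feat:Number)` would read `same_value(word_feats.get("Number"), head_feats.get("Number"))`, where `word_feats` and `head_feats` are assumed names for the two parsed feature columns.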
The same search language is used for complex search and replace.
For more information check the formal grammar for conditions.
`new_values` is a whitespace-separated list of `targeted_column:value` pairs which modify the tokens matching the condition.

The `targeted_column` indicates which column of the word a new value is assigned to. Possible keys:

- `Form`
- `Lemma`
- `Upos`
- `Xpos`
- `Deprel`
- `HeadId`
- `Feat`
- `Eud`
- `Misc`

(the `Id` column cannot be changed).
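
The overall shape of a rule line can be illustrated with a small Python splitter. This is a sketch and not the tool's parser, which follows a formal grammar: `parse_rule` is a hypothetical name, the line is assumed to be split at the first `>`, and each new value is assumed to be split at its first `:`. A condition that itself contains `>` would not be handled by this sketch.

```python
def parse_rule(line: str):
    """Split a rule line into (condition, [(column, value_expression), ...])."""
    condition, _, new_values = line.partition(">")
    assignments = []
    # Value expressions contain no whitespace, so a plain split is enough.
    for item in new_values.split():
        column, _, expression = item.partition(":")
        assignments.append((column, expression))
    return condition.strip(), assignments
```

For the rule `Lemma:sing.* > misc:"Value=Sing"` this yields the condition `Lemma:sing.*` and a single assignment of `"Value=Sing"` to `misc`.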
`value` is a combination (using `+`) of strings or functions which give access to other columns of the current word or its head. Strings must be enclosed in double quotes: `"NOUN"`.

The `column_name` to retrieve a value from can be:

- `Form`
- `Lemma`
- `Upos`
- `Xpos`
- `Feat_<FeatureName>`
- `Deprel`
- `Misc_<KeyName>`
- `HeadId`
Available functions are:

- `this(<column_name>)` value of the given column of the current token
- `head(<column_name>)` value of the given column of the head of the current token
- `head(head(<column_name>))` value of the given column of the head's head of the current token
- `substring(this()/head(), start, end)` take the substring of the result of the this/head expression from `start` to `end`
- `substring(this()/head(), start)` take the substring of the result of the this/head expression from `start` until the end of the string
- `upper(this()/head())` uppercase the result of the this/head expression
- `lower(this()/head())` lowercase the result of the this/head expression
- `cap(this()/head())` capitalize (first character uppercase, rest lowercase) the result of the this/head expression
- `replace(this()/head(), regex, newstring)` replaces the matches of `regex` in the result of the this/head expression by `newstring`
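
The string functions behave much like their counterparts in most programming languages. The following Python functions are rough equivalents given for illustration only; they are not the tool's implementation, and the tool's regex dialect may differ from Python's `re` module. `substring` is left out on purpose, because the text does not state whether `start` and `end` count from 0 or 1, or whether `end` is inclusive.

```python
import re


def upper(value: str) -> str:
    """Uppercase the whole value."""
    return value.upper()


def lower(value: str) -> str:
    """Lowercase the whole value."""
    return value.lower()


def cap(value: str) -> str:
    """First character uppercase, rest lowercase."""
    return value[:1].upper() + value[1:].lower()


def replace(value: str, regex: str, newstring: str) -> str:
    """Replace every match of regex in value by newstring."""
    return re.sub(regex, newstring, value)
```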
If a token has head 0, its deprel will always be `root` unless the option `--nostrict` is used with `replace.sh`.
Examples:

- `Upos:"NOUN"` set Upos to `NOUN`
- `Eud:"+2:dep"` add an enhanced UD relation "dep" using the current id + 2 (must be a negative or positive integer other than 0; if the resulting head id is outside the sentence, the head id is not modified)
- `Eud:head(HeadId)+":"+head(Deprel)` set EUD to the head and deprel of the head word
- `HeadId:"+2"` set head to current id + 2 (must be a negative or positive integer other than 0; if the resulting head id is outside the sentence, the head id is not modified)
- `HeadId:"-1"` set head to current id - 1
- `HeadId:"5"` set head to 5 (the value must be 0 or a positive integer)
- `HeadId:head(HeadId)` set head to the head id of the head node
- `Feat:"Number=Sing"` adds a feature `Number=Sing` (`Number:` deletes the feature)
- `Lemma:this(Form)` set lemma to the form of the current token
- `Lemma:this(Misc_Translit)` set lemma to the value of the key `Translit` of the `Misc` column
- `Lemma:this(Form)+"er"` set lemma to the form + "er"
- `Lemma:"de"+token(Form)` set lemma to "de" + form
- `Feat:"Featname"+this(Lemma)` set the feature Featname to the value of Lemma
- `Feat:"Gender"+this(Misc_Special)` set the feature Gender to the value of the Misc key Special
- `Misc:"Keyname"+head(head(Upos))` set the key "Keyname" of the `Misc` column to the Upos of the head of the head
- `Lemma:substring(this(Form),1,3)` set lemma to the substring (1 - 3) of the form
- `Lemma:substring(this(Form),1)` set lemma to the substring (1 - end) of the form
- `Form:replace(this(Form),"é","e")` replace all occurrences of `é` in the form by `e`
N.B. No whitespace is allowed in a value expression! Therefore `Lemma:substring(this(Form), 1, 3)` or `Lemma:this(Form) + "er"` are invalid; use `Lemma:substring(this(Form),1,3)` or `Lemma:this(Form)+"er"` instead.
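
The difference between relative and absolute `HeadId` values can be sketched as follows. This Python function is an illustration, not the tool's code: `resolve_head` and its parameters are hypothetical, and "out of the sentence" is assumed to mean a resulting id below 1 or above the number of words.

```python
def resolve_head(current_id: int, spec: str, sentence_length: int, old_head: int) -> int:
    """Return the new head id for a HeadId value such as '+2', '-1' or '5'."""
    if spec[0] in "+-":
        # Relative value: offset from the id of the current token.
        new_head = current_id + int(spec)
        if new_head < 1 or new_head > sentence_length:
            return old_head  # out of the sentence: head is left unchanged
        return new_head
    # Absolute value: 0 (root) or a positive integer.
    return int(spec)
```

For a token with id 3 in a six-word sentence, `"+2"` gives head 5 and `"-1"` gives head 2, while `"+5"` would point outside the sentence and leaves the old head in place.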
In order to empty a column, just set it to `"_"`: `Feat:"_"`, `Xpos:"_"`, `Eud:"_"` etc.

For more information check the formal grammar for replacements (the part after the first `:`).