A tool for extracting structural and positional attributes from a corpus vertical file

czcorpus/vert-tagextract

Vert-tagextract

Vert-tagextract (vte) is a program and a library for extracting structural attribute metadata and n-gram frequency information (ipm, ARF) from a corpus vertical file to an SQL database.

It can be used either as a library or through its command-line tools:

vte2

Vte2 is a tool for extracting structural metadata and n-grams.

udex

Udex is a UD tag data extractor. It is a drop-in replacement for the prepare_ud.py script in KonText and runs almost twice as fast as the original script.

The metadata database is used by KonText for its liveattrs plug-in. The complete word frequency database is used by Word at a Glance, but it can be used by anyone interested in n-gram analysis.

Preparing the process

To prepare data extraction from a specific corpus, a configuration file must be defined. You can start by generating a config template:

vte template > syn_v4.json

Example config

An example configuration file written for corpus syn_v4 looks like this:

{
    "corpus": "syn_v4",
    "verticalFile": "/path/to/vertical/file",
    "dbFile": "/var/opt/kontext/metadata/syn_v4.db",
    "encoding": "utf-8",
    "atomStructure": "text",
    "stackStructEval": true,
    "selfJoin": {
        "argColumns": ["doc_id", "text_id"],
        "generatorFn": "identity"
    },
    "structures": {
        "doc" : [
            "id",
            "title",
            "subtitle",
            "author",
            "issue",
            "publisher"
        ],
        "text": [
            "id",
            "section",
            "section_orig",
            "author"
        ]
    },
    "indexedCols": ["doc_title"],
    "bibView" : {
        "cols" : [
            "doc_id",
            "doc_title",
            "doc_author"
        ],
        "idAttr" : "doc_id"
    },
    "countColumns": [0, 1, 3],
    "countColMod": ["toLower", "toLower", "firstChar"],
    "calcARF": true
}

Configuration items

verticalFile

type: string

a path to a vertical file (plain text or gz)

db

type: object

attributes:

  • type: 'sqlite'|'mysql'
  • name: string
  • host: string
  • user: string
  • password: string
  • preconfSettings: Array<string>

atomStructure

type: string

This setting specifies a structure understood as a row in the exported metadata database. It means that any nested structures (e.g. p within text) will be ignored. On the other hand, all the ancestor structures (e.g. doc in case of text) will be processed as long as there are some configured structural attributes to be exported (see the example above).

stackStructEval

type: boolean

When true, structures within a vertical file are evaluated by a stack-based processor, which requires the structures to be properly nested (just as in XML). If false, the vertical file may contain overlapping structures:

<foo>
token1
<bar>
token2
</foo>
token3
</bar>

In case you are not sure about your vertical file structure, use false.
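If you would rather check than guess, the nesting can be tested before choosing a value. The following is a minimal sketch and not part of vte. It assumes that each structure tag sits on a line of its own and that self-closing tags can be ignored. The function name `is_properly_nested` is hypothetical.

```python
import re

# Assumption: a structure tag occupies a whole line, e.g. <doc id="1"> or </doc>.
TAG = re.compile(r"^<(/?)([A-Za-z_][\w-]*)(\s[^>]*)?>$")


def is_properly_nested(lines):
    """Return True if every closing tag matches the most recently opened structure."""
    stack = []
    for line in lines:
        line = line.strip()
        if line.endswith("/>"):
            continue  # self-closing tag, does not affect nesting
        match = TAG.match(line)
        if match is None:
            continue  # an ordinary token line
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # closing tag does not match the innermost open structure
    return not stack  # anything left open also counts as improper nesting


overlapping = ["<foo>", "token1", "<bar>", "token2", "</foo>", "token3", "</bar>"]
print(is_properly_nested(overlapping))
```

For the overlapping example above the function returns False, which suggests setting stackStructEval to false.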

structures

type: {[key:string]:Array<string>}

An object containing structures and their respective attributes to be exported. Generally, this should be a superset of the values found under the SUBCORPATTRS key in the corpus registry file.

indexedCols

type: Array<string>

By default, vte creates indices for primary keys and for the item_id column (see selfJoin), if defined. For a large database, it may be a good idea to create additional indices for frequently accessed columns (e.g. a document title, genre, etc.).

Please note that the format of structural attribute name matches the metadata column name format (e.g. doc_title instead of doc.title).

selfJoin

type: {argColumns: Array<string>; generatorFn: string}

This setting defines a column used to join rows belonging to different corpora (this is used mainly with InterCorp). The generatorFn argument contains an identifier of an internal function vte uses to generate column names (current options are: empty, identity and intercorp). The argColumns argument contains a list of attributes passed as arguments to the generatorFn.

E.g. to create a compound item_id identifier from doc.id, text.id and p.id, define "generatorFn" = "identity" and "argColumns" = ["doc_id", "text_id", "p_id"]. The column format is a purely internal matter of KonText. The important thing is to match columns properly and to make the (corpus_id, item_id) pair unique.

bibView

type: {idAttr: string; cols: Array<string>}

This setting defines a database view used to fetch details about a single "bibliographic unit" (e.g. a book). It is optional, as it may not apply in some cases (e.g. spoken corpora).

  • idAttr specifies a unique column used to access the "bibliographic unit"
  • cols specifies the columns displayed in the bibliographic unit detail

Please note (again) the format of column names (doc_title, not doc.title).
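Because indexedCols, selfJoin and bibView all use the underscore form of column names, a quick sanity check of a config can catch a stray doc.title before a long export starts. The sketch below is not part of vte. It assumes that every referenced column should be derivable from the configured structures, which the documentation implies but does not state as an enforced rule. The function names are hypothetical.

```python
def exported_columns(structures):
    """Build metadata column names (e.g. doc_title) from the 'structures' config item."""
    return {f"{struct}_{attr}" for struct, attrs in structures.items() for attr in attrs}


def unknown_columns(config):
    """Return referenced column names that no configured structure attribute produces."""
    known = exported_columns(config.get("structures", {}))
    referenced = list(config.get("indexedCols", []))
    bib_view = config.get("bibView") or {}
    referenced += bib_view.get("cols", [])
    if "idAttr" in bib_view:
        referenced.append(bib_view["idAttr"])
    self_join = config.get("selfJoin") or {}
    referenced += self_join.get("argColumns", [])
    return sorted(set(referenced) - known)


config = {
    "structures": {"doc": ["id", "title"], "text": ["id"]},
    "indexedCols": ["doc.title"],  # wrong format on purpose
    "bibView": {"cols": ["doc_id", "doc_title"], "idAttr": "doc_id"},
    "selfJoin": {"argColumns": ["doc_id", "text_id"], "generatorFn": "identity"},
}
print(unknown_columns(config))
```

An empty list means every referenced column maps to a configured structural attribute.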

countColumns

type: Array<number>

If a non-empty array is provided, vte will also extract the listed token columns (referred to by their position, counted from the left and starting at zero) and count the occurrences of each variant. A variant is a unique combination of values in the listed columns, e.g. "word"+"lemma"+"pos", and the count is its absolute frequency.

The data are stored in a separate table, colcounts.

This can be used e.g. to generate lists of unique PoS tags for KonText's taghelper plug-in. For this purpose, the script scripts/postag2file.py is available:

python scripts/postag2file.py path/to/generated/database
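To make the meaning of countColumns concrete, here is a minimal sketch of the counting it describes. It is an illustration, not vte's implementation. It assumes a tab-separated vertical file in which structure lines start with "<", which is a simplification because a token that is itself "<" would be skipped. The function name `count_variants` is hypothetical.

```python
from collections import Counter


def count_variants(lines, count_columns):
    """Count unique combinations of the selected (zero-based) token columns."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue  # skip empty lines and structure tags (simplified test)
        columns = line.split("\t")
        counts[tuple(columns[i] for i in count_columns)] += 1
    return counts


vertical = [
    "<doc id=\"1\">",
    "The\tthe\tx\tDT",
    "cat\tcat\tx\tNN",
    "The\tthe\tx\tDT",
    "</doc>",
]
for variant, freq in count_variants(vertical, [0, 1, 3]).items():
    print(variant, freq)
```

Each key is one variant and each value its absolute frequency, which corresponds to the kind of rows the colcounts table holds.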

countColMod

type: Array<string|null>

It is also possible to define a value modification function for each extracted token column. The array must have the same length as countColumns; columns that need no modification should contain null.

Available functions: toLower, firstChar, null (= identity is used)
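The effect of the modifiers can be sketched as follows. This is an illustration under the assumption that toLower lowercases the value and firstChar keeps only its first character, as the names suggest. It is not vte's code, and `apply_modifiers` is a hypothetical name.

```python
# Assumed semantics of the documented modifier names; None stands for JSON null.
MODIFIERS = {
    "toLower": str.lower,
    "firstChar": lambda value: value[:1],
    None: lambda value: value,  # identity
}


def apply_modifiers(values, col_mods):
    """Apply one modifier per extracted column; both sequences must have equal length."""
    if len(values) != len(col_mods):
        raise ValueError("countColMod must have the same length as countColumns")
    return tuple(MODIFIERS[mod](value) for mod, value in zip(col_mods, values))


print(apply_modifiers(("The", "The", "DT"), ["toLower", None, "firstChar"]))
```

Applied before counting, such modifiers merge variants that differ only in case or in the detailed part of a tag, so the resulting frequency table is smaller.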

calcARF

type: boolean

If true and countColumns is also defined, vte will also calculate ARF (average reduced frequency). The calculation requires a second pass over the vertical file, so the whole process takes roughly twice as long as non-ARF processing.
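For readers unfamiliar with the measure, the sketch below follows the commonly cited definition of ARF: with f occurrences in a corpus of N tokens and v = N / f, ARF is the sum of min(d, v) over the cyclic distances d between consecutive occurrences, divided by v. This is a reference illustration, not vte's implementation, and the function names are hypothetical. The ipm helper shows the usual instances-per-million normalisation.

```python
def arf(positions, corpus_size):
    """Average reduced frequency of an item occurring at the given token positions."""
    freq = len(positions)
    if freq == 0:
        return 0.0
    v = corpus_size / freq  # average distance between occurrences
    pos = sorted(positions)
    # The first distance wraps around the end of the corpus (cyclic view).
    distances = [pos[0] + corpus_size - pos[-1]]
    distances += [b - a for a, b in zip(pos, pos[1:])]
    return sum(min(d, v) for d in distances) / v


def ipm(freq, corpus_size):
    """Instances per million tokens."""
    return freq / corpus_size * 1_000_000


print(arf([0, 25, 50, 75], 100))  # evenly spread occurrences
print(arf([0, 1, 2, 3], 100))     # occurrences clustered in one place
print(ipm(4, 100))
```

Evenly spread occurrences give an ARF equal to the plain frequency, while clustered ones are reduced towards 1. Since v depends on the total corpus size and on each item's final frequency, the positions can only be evaluated once those totals are known, which is consistent with the second pass mentioned above.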

filter

type: {lib:string; fn:string}

Specifies a path to a compiled plug-in library (lib) along with an exported variable implementing the LineFilter interface (fn). The filter is applied to each token; its input is the set of current structural attributes and their respective values. This can be used to process just a predefined subcorpus of the original corpus.

Running the export process

To create a new database or replace an existing one, use:

vte create path/to/config.json

To add multiple corpora to a single database (e.g. in the case of InterCorp), create the database from the first config and append the others:

vte create path/to/config1.json
vte append path/to/config2.json
vte append path/to/config3.json
...
vte append path/to/configN.json

In this case, a proper selfJoin must be configured for KonText to be able to match rows from different corpora as aligned ones.
