Klogproc is a service for processing and archiving logs generated by applications run by the Institute of the Czech National Corpus (CNC).
In general, Klogproc continuously reads application-specific log records from a file, parses the individual lines and converts them into a target format which is then stored in an ElasticSearch database.
In the CNC, Klogproc replaced LogStash as a less resource-hungry alternative. All the processing (reading, writing, handling multiple files) is performed concurrently, which makes it quite fast.
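To make the data flow more concrete, here is a minimal sketch in Go (klogproc's implementation language) of the general idea: read a log file line by line, parse each line with an application-specific parser, convert it into a normalized record and collect chunks for bulk insertion. All type and function names, as well as the assumed `TIME<TAB>MESSAGE` source format, are made up for illustration and are not taken from klogproc's source.

```go
// Illustration of the general data flow (hypothetical names, not klogproc's
// actual code): read a log line by line, parse it, convert it to a normalized
// record and collect chunks that would be bulk-inserted into ElasticSearch.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// OutRecord stands in for the normalized target format stored in ElasticSearch.
type OutRecord struct {
	AppType string `json:"appType"`
	Time    string `json:"time"`
	Message string `json:"message"`
}

// parseLine mimics an application-specific parser; the assumed source format
// here is simply "TIME<TAB>MESSAGE".
func parseLine(appType, line string) (OutRecord, error) {
	parts := strings.SplitN(line, "\t", 2)
	if len(parts) != 2 {
		return OutRecord{}, fmt.Errorf("malformed line: %s", line)
	}
	return OutRecord{AppType: appType, Time: parts[0], Message: parts[1]}, nil
}

func main() {
	f, err := os.Open("syd.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	var chunk []OutRecord
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		rec, err := parseLine("syd", sc.Text())
		if err != nil {
			continue // skip lines that cannot be parsed
		}
		chunk = append(chunk, rec)
		if len(chunk) == 500 { // cf. pushChunkSize in the configuration below
			out, _ := json.Marshal(chunk)
			fmt.Println(string(out)) // a real pipeline would push the chunk to ElasticSearch here
			chunk = chunk[:0]
		}
	}
	if len(chunk) > 0 {
		out, _ := json.Marshal(chunk)
		fmt.Println(string(out))
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```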
Name | config code | versions | scripting | note |
---|---|---|---|---|
Akalex | akalex | ❌ | ❌ | a Shiny app with a custom log (:asterisk:) |
APIGuard | apiguard | ❌ | ❌ | CNC's internal API proxy and watchdog |
Calc | calc | ❌ | ❌ | a Shiny app with a custom log (:asterisk:) |
CNC-VLO | vlo | ❌ | ❌ | a custom CNC node for the Clarin VLO (JSONL log) |
Gramatikat | gramatikat | ❌ | ❌ | a Shiny app with a custom log (:asterisk:) |
KonText | kontext | 0.13, 0.14, 0.15, 0.16, 0.17, 0.18 | ✅ | |
KorpusDB | korpus-db | ❌ | ❌ | |
Kwords | kwords | 1, 2 | ✅ | |
Lists | lists | ❌ | ❌ | a Shiny app with a custom log (:asterisk:) |
Mapka | mapka | 1, 2, 3 | ✅ (v3) | using Nginx/Apache access log |
Morfio | morfio | ❌ | ❌ | |
MQuery-SRU | mquery-sru | ❌ | ❌ | a Clarin FCS endpoint (JSONL log) |
QuitaUP | quita-up | ❌ | ❌ | a Shiny app with a custom log (:asterisk:) |
SkE | ske | ❌ | ❌ | using Nginx/Apache access log |
SyD | syd | ❌ | ❌ | a custom app log |
Treq | treq | current, v1-api | ✅ | a custom app log |
WaG | wag | 0.6, 0.7 | ✅ | web access log, currently without user credentials |
(:asterisk:) All the Shiny apps use the same log format.
The program can work in two modes: `batch` and `tail`.

For non-regular imports, e.g. when migrating older data or when debugging log processing routines, the `batch` mode allows importing multiple files from a single directory. The contents of the directory can even change over time by adding newer log records, and Klogproc will import only the new items, as it keeps a worklog with the newest record processed so far.
The `tail` mode is the one that replaces the CNC's LogStash solution and it is the typical mode of use. One or more log file listeners can be configured to read newly added lines. The log files are checked at regular intervals (i.e. a change is not detected immediately). Klogproc remembers the current inode and seek position of each watched file, so it should be able to continue after outages etc. (as long as the log files are not overwritten in the meantime due to log rotation).
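The following is a hedged illustration (not klogproc's actual code) of this resume mechanism: a worklog entry stores the inode and the byte offset reached for a watched file, so that after a restart the reader can continue where it stopped, or start over if the file was replaced by log rotation. The `position` type and function names are hypothetical.

```go
// Illustration only: a simplified, Linux-specific way to remember where a log
// reader stopped so it can resume later; klogproc's real worklog format and
// logic are not shown in this README.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"syscall"
)

// position is what a worklog entry could look like: file identity + offset.
type position struct {
	Inode  uint64
	Offset int64
}

// readNewLines processes lines added after pos.Offset and returns the updated position.
func readNewLines(path string, pos position) (position, error) {
	f, err := os.Open(path)
	if err != nil {
		return pos, err
	}
	defer f.Close()

	st, err := f.Stat()
	if err != nil {
		return pos, err
	}
	inode := st.Sys().(*syscall.Stat_t).Ino
	if inode != pos.Inode {
		pos = position{Inode: inode, Offset: 0} // the file was rotated/replaced: start over
	}
	if _, err := f.Seek(pos.Offset, io.SeekStart); err != nil {
		return pos, err
	}
	rd := bufio.NewReader(f)
	for {
		line, err := rd.ReadString('\n')
		if err == io.EOF {
			break // an incomplete last line is left for the next check
		}
		if err != nil {
			return pos, err
		}
		pos.Offset += int64(len(line))
		fmt.Print("new line: ", line) // here klogproc would parse and convert the line
	}
	return pos, nil
}

func main() {
	pos, err := readNewLines("/var/log/ucnk/syd.log", position{})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("next run starts at inode=%d offset=%d\n", pos.Inode, pos.Offset)
}
```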
Install the Go language if it is not already available on your system.
Clone the `klogproc` project:
git clone https://klogproc.git
Build the project:
make
Copy the binary somewhere:
sudo cp klogproc /opt/klogproc/bin
Create a config file (e.g. in `/opt/klogproc/etc/klogproc.json`):
{
  "logging": {
    "path": "/opt/klogproc/var/log/klogproc.log"
  },
  "logTail": {
    "intervalSecs": 15,
    "worklogDir": "/opt/klogproc/var/worklog-tail",
    "files": [
      {"path": "/var/log/ucnk/syd.log", "appType": "syd"},
      {"path": "/var/log/treq/treq.log", "appType": "treq"},
      {"path": "/var/log/ucnk/morfio.log", "appType": "morfio"},
      {"path": "/var/log/ucnk/kwords.log", "appType": "kwords", "tzShift": -120},
      {"path": "/var/log/wag/current.log", "appType": "wag", "version": "0.7"}
    ]
  },
  "elasticSearch": {
    "majorVersion": 6,
    "server": "http://elastic:9200",
    "index": "app",
    "pushChunkSize": 500,
    "scrollTtl": "3m",
    "reqTimeoutSecs": 10
  },
  "geoIPDbPath": "/opt/klogproc/var/GeoLite2-City.mmdb",
  "anonymousUsers": [0, 1, 2]
}
Notes:
- Do not forget to create the directories for logging and the worklog, and to download and save the GeoLite2-City database.
- The `tzShift` applied to the kwords app is just an example; it should be used only if the stored datetime values carry an incorrect time zone (e.g. they look like UTC times but actually represent local time) - see the Time-zone notes section for more info.
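For reference, the sample configuration above maps onto a Go structure roughly like the one below. This is only a sketch derived from the JSON keys shown in this README; the actual configuration types in klogproc may be named and organized differently.

```go
// A hedged sketch of Go structs matching the sample configuration above.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type FileConf struct {
	Path    string `json:"path"`
	AppType string `json:"appType"`
	Version string `json:"version,omitempty"`
	TZShift int    `json:"tzShift,omitempty"` // minutes to add to the logged times
}

type Conf struct {
	Logging struct {
		Path string `json:"path"`
	} `json:"logging"`
	LogTail struct {
		IntervalSecs int        `json:"intervalSecs"`
		WorklogDir   string     `json:"worklogDir"`
		Files        []FileConf `json:"files"`
	} `json:"logTail"`
	ElasticSearch struct {
		MajorVersion   int    `json:"majorVersion"`
		Server         string `json:"server"`
		Index          string `json:"index"`
		PushChunkSize  int    `json:"pushChunkSize"`
		ScrollTTL      string `json:"scrollTtl"`
		ReqTimeoutSecs int    `json:"reqTimeoutSecs"`
	} `json:"elasticSearch"`
	GeoIPDbPath    string `json:"geoIPDbPath"`
	AnonymousUsers []int  `json:"anonymousUsers"`
}

func loadConf(path string) (*Conf, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var c Conf
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	conf, err := loadConf("/opt/klogproc/etc/klogproc.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("watched files:", len(conf.LogTail.Files))
}
```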
Configure systemd (`/etc/systemd/system/klogproc.service`):
[Unit]
Description=A custom agent for collecting UCNK apps logs
After=network.target
[Service]
Type=simple
ExecStart=/opt/klogproc/bin/klogproc tail /opt/klogproc/etc/klogproc.json
User=klogproc
Group=klogproc
[Install]
WantedBy=multi-user.target
Reload systemd config:
systemctl daemon-reload
Start the service:
systemctl start klogproc
Klogproc treats each log type individually when parsing, but it converts all the timestamps to UTC. In case an application stores incorrect values (e.g. missing time-zone info even though the time values are actually non-UTC), it is possible to use the `tzShift` setting, which defines the number of minutes Klogproc should add to (or subtract from) the logged values.
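As a concrete example of the arithmetic (the values are illustrative): a record logged as 12:00 that looks like UTC but actually represents local UTC+2 time needs a `tzShift` of -120 minutes to obtain the true UTC value of 10:00. A minimal sketch, assuming the shift is a plain addition of minutes:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The record claims 12:00 UTC, but the application actually wrote local
	// (UTC+2) wall-clock time, so the real UTC time is 10:00.
	logged, err := time.Parse(time.RFC3339, "2024-06-01T12:00:00Z")
	if err != nil {
		panic(err)
	}
	tzShift := -120 // minutes, as in the kwords example above
	corrected := logged.Add(time.Duration(tzShift) * time.Minute)
	fmt.Println(corrected.Format(time.RFC3339)) // 2024-06-01T10:00:00Z
}
```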
For the tail action, the config is as follows:
{
  "logTail": {
    "intervalSecs": 5,
    "worklogDir": "/path/to/tail-worklog",
    "numErrorsAlarm": 0,
    "errCountTimeRangeSecs": 15,
    "files": [
      {
        "path": "/path/to/application.log",
        "appType": "korpus-db",
        "tzShift": 120
      }
    ]
  }
}
For the batch mode, the config is like this:
{
  "logFiles": {
    "appType": "korpus-db",
    "worklogDir": "/path/to/batch-worklog",
    "srcPath": "/path/to/log/files/dir",
    "tzShift": 120,
    "partiallyMatchingFiles": false
  }
}
Note: setting `partiallyMatchingFiles` to `true` allows processing of files which are partially older than the requested minimum datetime (but still, only the matching records will be accepted).
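The option can be understood as sketched below; the `Record` type, the function and the exact skipping rule are assumptions for illustration, not klogproc's actual logic.

```go
package main

import (
	"fmt"
	"time"
)

// Record stands in for one parsed log record.
type Record struct {
	Time time.Time
}

// filterRecords illustrates the described behaviour: if the file starts before
// minDatetime and partiallyMatchingFiles is false, the whole file is skipped;
// otherwise the file is processed, but only records at or after minDatetime
// are accepted.
func filterRecords(records []Record, minDatetime time.Time, partiallyMatchingFiles bool) []Record {
	if len(records) > 0 && records[0].Time.Before(minDatetime) && !partiallyMatchingFiles {
		return nil
	}
	var out []Record
	for _, rec := range records {
		if !rec.Time.Before(minDatetime) {
			out = append(out, rec)
		}
	}
	return out
}

func main() {
	minDT := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	recs := []Record{
		{Time: minDT.Add(-time.Hour)}, // older than the minimum
		{Time: minDT.Add(time.Hour)},  // newer than the minimum
	}
	fmt.Println(len(filterRecords(recs, minDT, false))) // 0 - the whole file is skipped
	fmt.Println(len(filterRecords(recs, minDT, true)))  // 1 - only the matching record
}
```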
Because ElasticSearch underwent some backward incompatible changes between versions 5 and 6, the configuration contains the `majorVersion` key which specifies how Klogproc stores the data.

ElasticSearch 5 supports multiple data types ("mappings") per index, which was also the default way CNC application logs were stored - a single index with multiple document types (one per application). In this case, the configuration directive `elasticSearch.index` directly specifies the index name Klogproc works with. Individual document types can be distinguished either via the ES-internal `_type` property or via the normal `type` property created by Klogproc.
In ElasticSearch 6, support for multiple data mappings per index was removed. In this case, Klogproc uses its `elasticSearch.index` key as a prefix for the index name created for each individual application. E.g. the index `log_archive` with the `treq` and `morfio` apps configured expects you to have two indices: `log_archive_treq` and `log_archive_morfio`. Please note that Klogproc does not create the indices for you. The `type` property is still present in documents for backward compatibility.
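The naming rule can be summarized by the following sketch (an assumption based on the description above, not code taken from klogproc):

```go
package main

import "fmt"

// targetIndex shows where a converted record would be stored: with ES 5 all
// applications share the configured index and records carry a "type" property;
// with ES 6 the configured index works as a prefix and each application has
// its own index.
func targetIndex(majorVersion int, confIndex, appType string) string {
	if majorVersion >= 6 {
		return confIndex + "_" + appType
	}
	return confIndex
}

func main() {
	fmt.Println(targetIndex(6, "log_archive", "treq"))   // log_archive_treq
	fmt.Println(targetIndex(6, "log_archive", "morfio")) // log_archive_morfio
	fmt.Println(targetIndex(5, "log_archive", "treq"))   // log_archive (single shared index)
}
```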
See the docs/scripting.md page.