
Query and error logs #12

Open
RunDevelopment opened this issue Sep 23, 2020 · 15 comments
Labels
enhancement New feature or request

Comments

@RunDevelopment
Member

Previously, a Tomcat Java server handled the logging, but after its removal, this application has to do it.

This issue should be used to discuss the log format, how logging will be enabled, and where logs are saved (= the responsibilities of this application).

@RunDevelopment RunDevelopment added the enhancement New feature or request label Sep 23, 2020
@johanneskiesel

johanneskiesel commented Sep 23, 2020

Current format for args:

queryTime[TAB]queryText[TAB]ipAddress[TAB]url[NEWLINE]

(https://git.webis.de/code-research/arguana/args/args-framework/-/blob/master/src/main/java/me/args/server/QueryParser.java#L130)

But if we want to extend this to several services, we should rather use some key-value-based format to allow for special cases that some services have and others do not. We then need a common vocabulary for the usual fields.

One JSON object per line would be plausible (as suggested by Michael). We can first discuss here, but should then find a place for the result. Probably in the web services notes: https://webis.de/facilities.html?q=service

What could we use as a starting point?

Also pinging @mam10eks in case he wants to chime in.

@RunDevelopment
Member Author

One JSON object per line would be plausible (as suggested by Michael)

The basic idea is that each log file uses the JSON Lines format. This gives us the ability to log structured data and maximum flexibility, but it also means that we should standardize the JSON values to some extent (e.g. requiring timestamps).
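To make this concrete, here is a rough Python sketch of appending one such record (file name and field names are just placeholders, nothing is decided yet):

import json
from datetime import datetime, timezone

def log_entry(path, **fields):
    # JSON Lines: one self-contained JSON object per line, appended to the log file
    record = {"timestamp": datetime.now(timezone.utc).isoformat(), **fields}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_entry("queries.jsonl", user="192.0.2.1", query="climate change")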

As for starting points: Log4J JSON (Java), Bunyan (JS)

@mam10eks
Member

I am indeed interested in that topic ;)
I think logging the important data (query logs) using JSONL to CephFS makes sense.

But I guess that for error logs or similar we could use services that we already have running in our infrastructure: I think ChatNoir (maybe others too?) uses Loki for that.

Maybe @phoerious also has an opinion on that?

@johanneskiesel

So which fields should we standardize?

  • Some date-time field (various names are plausible; as for the format, one could use either the HTTP ones or "2020-09-24T06:29:42Z"... I have a preference for the latter as it is a bit easier to work with). "timestamp" seems to be a widespread name for this.
  • Some identifier so that we can match queries of the same user and remove our own test queries in later analyses. I think the IP address is the best thing we have for this, even if we then have to anonymize the logs before processing. Args will likely also get a "user name" in the future (a random string stored in the browser) that could be placed in this field instead of the IP address (we will get a few session features in args, and I want to be able to transfer my session across devices; unsure whether this option should be available to the end user). Maybe just "user"?
  • The full request URL (important for me to distinguish usage of the GUI from API calls and for all the different parameters that args supports)... Maybe "url"?

It would also be good to have a "query" field with the query text. This allows us to also log other requests into the same log; the queries can then be identified easily, and one does not need to parse the URL to work with them.

@phoerious

phoerious commented Sep 24, 2020

I have had the same thoughts for ChatNoir already. My conclusion was that logging to Elasticsearch would be the best thing. For long-term storage, the Elasticsearch bulk format can be used. I would not log directly to CephFS, since that only creates problems when logging from multiple instances at the same time and for any meaningful analytics you'd need to index it to Elasticsearch anyway.

If you don't want to log to Elasticsearch directly, you can also write to a local log and then use Logstash to write batches to Elasticsearch.

the HTTP ones or "2020-09-24T06:29:42Z"... I have a preference for the latter as it is a bit easier to work with). "timestamp" seems to be a widespread name for this.

That's an ISO datetime and the standard format in both Python and Elasticsearch. HTTP dates are notoriously hard to work with.
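For illustration, parsing that format in Python needs only the standard library (datetime.fromisoformat accepts the trailing "Z" only from Python 3.11 on, so strptime is the portable variant):

from datetime import datetime, timezone

# Parse the proposed ISO-style timestamp; the trailing "Z" marks UTC
ts = datetime.strptime("2020-09-24T06:29:42Z", "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)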

@johanneskiesel

How easy is it to export the Elasticsearch logs to a text file? It is a requirement for args that we can easily get access to the logs in text format. But if there is a button somewhere that you can click (accessible over VPN), I think this would be fine.

We solved logging from multiple instances for our purposes by using the hostname in the filename.

Logging to the filesystem is nice when I start a server locally for testing, but it would be ok-ish for me to have different possibilities here (for testing/production).

@RunDevelopment
Member Author

RunDevelopment commented Sep 24, 2020

the HTTP ones or "2020-09-24T06:29:42Z"... I have a preference for the latter as it is a bit easier to work with). "timestamp" seems to be a widespread name for this.

Let's please use the ISO format. Virtually every language has native support for this and I don't want to use a third-party library just for parsing timestamps.

The full request URL

gRPC/gRPC-web services don't use URL params. They use Protobuf messages in the request body, so we have to log that as well (Protobuf has native support for converting messages to JSON). So gRPC services need to log a "url" (?) and "message" (?) field.
(I don't really like "message", it's way too general. Does anyone have better ideas?)
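For illustration, with the Python Protobuf runtime the conversion is a one-liner; the helper and the field name below are only a suggestion:

from google.protobuf.json_format import MessageToDict
from google.protobuf.message import Message

def grpc_log_fields(request: Message) -> dict:
    # Convert the incoming request message into a plain dict so that it can be
    # embedded in the JSON log entry (the field name is still up for discussion).
    return {"message": MessageToDict(request)}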


To summarize: every log entry should look like this so far:

{
  "timestamp": string, // ISO date time
  "user": string, // identifier for the current user
  "url"?: string, // (optional) the full request URL
  "message"?: unknown, // the request Protobuf message (only for gRPC services)
  "query"?: string, // (optional) the query text of the request
}

Services are allowed to add more custom fields.
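For illustration, a single entry following this schema could look like the line produced below (all values are made up):

import json

entry = {
    "timestamp": "2020-09-24T06:29:42Z",
    "user": "192.0.2.1",
    "url": "https://www.args.me/?query=climate+change",
    "query": "climate change",
    "service": "args",  # example of a service-specific custom field
}
print(json.dumps(entry))  # one line in the log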

Edit: Made the "url" field optional

@johanneskiesel

Looks good! Not sure if we should then make the "url" mandatory.

Should we create a dedicated Java repo for creating/writing/parsing log records? Or just have one document somewhere that describes the fields? Like in the FAQ or the services howto?

@RunDevelopment
Member Author

Not sure if we should then make the "url" mandatory.

I don't see why one wouldn't want to log it. Could you give an example?

Should we create a dedicated Java repo for creating/writing/parsing log records? Or just have one document somewhere that describes the fields? Like in the FAQ or the services howto?

We should most definitely write this down instead of having just a reference implementation. Maybe in the FAQ under "How to do logging"?

I don't know if we need a dedicated Java project for that for now. How many services will use this kind of logging?
(This is a C++ project, so a Java library isn't going to help me here)

@johanneskiesel

If a service has just one URL, it does not make much sense to log it.

At the moment I'd indeed then favor a FAQ entry, but I have to think a bit more about this. However, in order to get things done, it might be good to just start the FAQ entry and then move it somewhere else if necessary (but keep a link). I agree that this has priority over a reference implementation.

@mam10eks
Member

mam10eks commented Sep 25, 2020

I also agree that Elasticsearch makes sense, and from the application's point of view it should not add any overhead, since probably all popular logging libraries should have a Logstash plugin.

@johanneskiesel: I think the best way to transform the Elasticsearch logs into plain text files like JSONL is by using the scroll API. (Janek recommended this in the wstud-stustu-kolleg channel.)
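A rough sketch of such an export with the Python client's scan helper, which pages through the index via the scroll API (host and index name are made up):

import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://elasticsearch.example:9200")  # placeholder host

with open("query-log-export.jsonl", "w", encoding="utf-8") as out:
    # scan() wraps the scroll API and yields every matching document
    for hit in scan(es, index="query-logs", query={"query": {"match_all": {}}}):
        out.write(json.dumps(hit["_source"], ensure_ascii=False) + "\n")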

@RunDevelopment Since this is a C/C++ project, do you know good logging libraries that may have out-of-the-box support for Logstash? I just googled a bit and found that log4cplus, for example, does not seem to have out-of-the-box support (e.g. here).

@johanneskiesel

Ok, I will have a look at Logstash. But Janek suggested that it is possible to first write to disk and then use Logstash to write batches to Elasticsearch. I think I like this option a lot. Then the log files are already accessible as text, but we still get the uniform place and analytical powers of Elasticsearch.

I started the entry here: https://git.webis.de/code-generic/code-webis-faq/-/blob/master/README.md#how-to-code-logging

Feel free to extend/improve

@RunDevelopment
Member Author

Since this is a C/C++ project, do you know good logging libraries that may have out-of-the-box support for Logstash?

I thought about using boost::log since I'm using Boost anyway, but it's not a hard preference. It seems pretty easy to add a Logstash sink for boost::log, but I haven't found any C++ logging lib that supports Logstash out of the box (given that one only needs a few lines of code, I can understand why).

@mam10eks
Member

Nice!

But even if it is only a few lines, this is (maybe) complex:

  • What happens when the remote service is (temporarily) unavailable or does not accept writes?
    • Our Elasticsearch cluster shares its infrastructure with our Hadoop cluster. E.g., I have several times encountered the problem that Elasticsearch was not able to write a single new record (solving this required some manual steps; Elasticsearch stays in this state until someone fixes it manually).
  • Still, logging should have no impact on the application's performance, even when the remote service is currently unavailable and there is a queue of pending log messages.

Of course logging does not have to be perfect, but I would hope that available implementations/plugins apply some reasonable best practices for such things.

@mam10eks
Member

mam10eks commented Sep 25, 2020

Still, logging to the file system can also fail (e.g., temporary problems with CephFS). So I guess it would still be ok to log directly to Elasticsearch, but I would feel safer if we added a Prometheus alert (e.g., no submitted queries within the last 48 hours) so that we recognize potential problems within a few days.
