
Add Event ID #36

Open · wernerb wants to merge 2 commits into master
Conversation

@wernerb (Contributor) commented Dec 17, 2020

This changes the API from GetLogEvents to FilterLogEventsPages. The problematic part is that NextToken is not present in every response.

To fix this, I reworked the code to use the pagination helper from the AWS SDK, which handles each page automatically. We then "move the needle" forward to the timestamp of the last event and use that as the next start time instead.
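To make the intent concrete, here is a minimal illustrative sketch of that pagination approach (not the exact code in this PR), assuming aws-sdk-go v1; the log group name and variable names are hypothetical:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

func main() {
	svc := cloudwatchlogs.New(session.Must(session.NewSession()))

	// "The needle": the start time for the next poll, in milliseconds since epoch.
	var lastEventTime int64

	input := &cloudwatchlogs.FilterLogEventsInput{
		LogGroupName: aws.String("/my/log/group"), // hypothetical group name
		StartTime:    aws.Int64(lastEventTime),
	}

	err := svc.FilterLogEventsPages(input,
		func(page *cloudwatchlogs.FilterLogEventsOutput, lastPage bool) bool {
			for _, ev := range page.Events {
				// ev.EventId is what we want to carry through for deduplication.
				fmt.Println(aws.StringValue(ev.EventId), aws.StringValue(ev.Message))
				if ts := aws.Int64Value(ev.Timestamp); ts > lastEventTime {
					lastEventTime = ts
				}
			}
			return true // let the SDK keep walking pages until the last one
		})
	if err != nil {
		fmt.Println("FilterLogEvents failed:", err)
		return
	}

	// The next poll would use lastEventTime as its StartTime, so the beat only
	// moves forward; re-reading a few events at the boundary is acceptable
	// because the EventId-based _id deduplicates them downstream.
}
```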

I want to understand what other purpose the state serves besides deduplication. If deduplication is the concern, then the new Meta._id field will probably address it: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html
I did attempt to change NextToken to LastEventMessage instead.
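For illustration, a minimal sketch of how the CloudWatch EventId could be attached to the event's metadata, assuming a libbeat version (e.g. beats v7.x) in which beat.Event exposes a Meta map; the buildEvent helper and its values are hypothetical, not code from this PR:

```go
package main

import (
	"fmt"
	"time"

	"github.com/elastic/beats/v7/libbeat/beat"
	"github.com/elastic/beats/v7/libbeat/common"
)

// buildEvent is a hypothetical helper: it carries the CloudWatch EventId in
// the event's Meta map so the Elasticsearch output (or Logstash, via
// @metadata) can use it as the document _id.
func buildEvent(eventID, message string, timestamp time.Time) beat.Event {
	return beat.Event{
		Timestamp: timestamp,
		Meta: common.MapStr{
			"_id": eventID, // deduplication key for the downstream sink
		},
		Fields: common.MapStr{
			"message": message,
		},
	}
}

func main() {
	ev := buildEvent("example-event-id", "example log line", time.Now())
	fmt.Printf("%+v\n", ev.Meta)
}
```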

Could you advise on what to do with regard to the multiline buffer? I am unsure whether this works correctly now that LastEventMessage is used as the start time when recovering the registry. But then again, I would rather have the beat reprocess the logs it might have missed and send them again. I am curious what you think.

Closes #35

@kkentzo (Collaborator) commented Dec 26, 2020

Thanks @wernerb! Some thoughts:

  • first of all, let me say that it'd be great if we're able to get rid of the state saved in the S3 bucket (one dependency less!). It seems to me that this is doable with the approach that you propose (the Meta._id field). As I understand it, when the beat starts it will process all messages within its horizon and send them to the sink (Logstash or ES). The messages that have already been ingested (based on the Meta._id field) will be discarded by ES. Will this work out of the box on the ES side?

  • if the above is true, then I would say that the S3 file is redundant! Indeed, its only purpose is to make sure that we don't process the same message twice. If this deduplication can be performed on the ES side, then the only impact would be a little more CPU and network traffic when the beat starts up. The magnitude of this impact depends on the values of stream_event_horizon and hot_stream_event_horizon in the config yml (but it shouldn't be that bad for most use cases, I suppose).

  • it's not clear to me what we gain by switching to FilterLogEvents from GetLogEvents. You mention that "the problematic part of it is that NextToken is not in every return message". Why is this problematic? I suppose this is the case where there is currently no more data to be retrieved - wouldn't the same be true with FilterLogEvents (which also uses tokens)? Wouldn't this be the case where lastPage is true in the callback function?

  • as far as multiline buffers are concerned, I am pretty sure that there would be many edge cases where the event would break :) Without thinking that much, I'd dare say that getting rid of the s3 state would be beneficial to multiline events because the re-processing of the messages would result in rescuing multiline events that would not have been properly assembled in other cases.

To sum it up:

  • can we confirm that the Meta._id attribute in the event's payload will work out of the box on all reasonable ES installations (i.e. avoid insertion of duplicate messages)?
  • if yes, then I would propose to kill the s3 registry functionality and reprocess all messages within the configured horizons when the beat starts up (since we have established that we avoid duplicated documents in ES)
  • given the above, would there still be a reason to change our current use of the cloudwatchlogs API (i.e. GetLogEvents)? My argument here is: why change something that works fine and risk introducing new bugs into the codebase without getting some benefit back?

Happy to hear feedback on the above!

@wernerb (Contributor, Author) commented Dec 29, 2020

can we confirm that the Meta._id attribute in the event's payload will work out of the box on all reasonable ES installations (i.e. avoid insertion of duplicate messages)?

For anyone using an Elasticsearch output in Filebeat this works out of the box. The documentation isn't entirely clear, since the Filebeat Elasticsearch output doesn't support updating documents anyway, but here an Elasticsearch team member confirms what I see in my setup: documents are not overwritten, because the output uses the POST /<target>/_create/<_id> API, which refuses to overwrite an existing document. I have tested this numerous times and it works out of the box when using the Elasticsearch output in Filebeat!

Now, where things become interesting is for other outputs such as Logstash, Kafka, and Redis (source):

When publishing to elasticsearch, the id value will be used for _id. When publishing to logstash/redis/kafka, beats add a @metadata field to the event. This field will contain the id. You can configure the elasticsearch output in Logstash to use [@metadata][id].
This means there is some extra configuration work involved in Logstash (because Redis/Kafka are always fed into Logstash when you want to get events into Elasticsearch). But this is described here: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html#ls-doc-id, and my proposal would be that we document this in the README.

Elasticsearch can only deduplicate events going to the same index, because a new daily index obviously will not contain that event. This is fine because, by default, daily-index setups use the original timestamp to decide which index a document goes to, and you can set as large a horizon as you see fit.
For setups like mine where we use ILM, it is more difficult, as indices are rolled over depending on index age or size. See the issue here: elastic/elasticsearch#44794. This basically means that when a rollover happens, Elasticsearch will just index the documents again because it is a new index. This is not really an issue for our use cases, and even then the same timestamp/event would be duplicated only for a small number of events. We really only expect cloudwatchlogsbeat to be down for a few hours at most at any time, so the horizon is rather short; the probability of a rollover happening in that window is also rather small with large enough indices (50 GB) and an index age of a few days.

To summarise the above:

  1. The Elasticsearch output in Filebeat supports this out of the box.
  2. Any other output will carry the _id in @metadata, and Logstash needs to be configured to use it as the document_id, as per https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-deduplication.html#ls-doc-id
  3. I would add the Logstash caveat and the interplay between the horizon setting and ILM to the README to explain the trade-offs.

if yes, then I would propose to kill the s3 registry functionality and reprocess all messages within the configured horizons when the beat starts up (since we have established that we avoid duplicated documents in ES)

It really depends on what you think. The rollover behaviour with ILM can be an issue, but it is an upstream issue that I expect to be fixed. Alternatively, we could remove the last-message tracking entirely to keep things simpler and keep writing NextToken (if available) to the state.

given the above, would there still be a reason to change our current use of the cloudwatchlogs API (i.e. GetLogEvents)? My argument here is why change something that works fine and face the risk of introducing new bugs in the codebase without getting some benefit back?

This one is easy: GetLogEvents simply doesn't include event IDs in its responses, otherwise I would not have changed it. See http://docs.amazonaws.cn/AmazonCloudWatchLogs/latest/APIReference/API_GetLogEvents.html and https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_FilterLogEvents.html

What is also nice about FilterLogEvents is that a filter pattern can be supplied to filter events server-side (if desired), and it is possible to query multiple streams in one call instead of issuing lots of GetLogEvents calls, one per stream (see logStreamNamePrefix). It also returns ingestionTime, which could be useful for people. A sketch of such a request follows.
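As a rough illustration of what such a request could look like (assuming aws-sdk-go v1; the group name, stream prefix, and filter pattern are hypothetical):

```go
// Package example sketches a FilterLogEvents request; names are illustrative only.
package example

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

// buildFilterInput shows one call that spans several streams via a name prefix
// and applies an optional server-side filter pattern.
func buildFilterInput(startMillis, endMillis int64) *cloudwatchlogs.FilterLogEventsInput {
	return &cloudwatchlogs.FilterLogEventsInput{
		LogGroupName:        aws.String("/my/log/group"), // hypothetical group
		LogStreamNamePrefix: aws.String("app-"),          // covers every "app-*" stream in one call
		FilterPattern:       aws.String("ERROR"),         // optional CloudWatch filter pattern
		StartTime:           aws.Int64(startMillis),
		EndTime:             aws.Int64(endMillis),
	}
}
```

Each returned FilteredLogEvent then exposes IngestionTime alongside Timestamp and EventId.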

I am assuming that when results from GetLogEvents come in, the NextToken is written out only after all the events in that response have been handled; if we have the events in memory, they should be handled, and the beat's buffers will take care of processing everything correctly.
The API states that GetLogEvents always returns a NextForwardToken, even on the last page, and that on the last page it returns the same token you passed in. How is this case handled in the code? If we are at the last page, the state is written out again, but I guess this would cause the last page's events to be repeated in case of failure?
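To spell out the ordering I have in mind, a rough sketch (not the actual beat code), assuming aws-sdk-go v1; handleEvent and saveState are hypothetical stand-ins for the publisher and the registry write:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

// handleEvent and saveState are hypothetical stand-ins for the beat's
// publisher and registry write; they are not the actual project code.
func handleEvent(ev *cloudwatchlogs.OutputLogEvent) { fmt.Println(aws.StringValue(ev.Message)) }
func saveState(token string)                        { /* persist token to the registry */ }

func main() {
	svc := cloudwatchlogs.New(session.Must(session.NewSession()))
	var token *string

	for {
		out, err := svc.GetLogEvents(&cloudwatchlogs.GetLogEventsInput{
			LogGroupName:  aws.String("/my/log/group"), // hypothetical
			LogStreamName: aws.String("my-stream"),     // hypothetical
			StartFromHead: aws.Bool(true),
			NextToken:     token,
		})
		if err != nil {
			fmt.Println("GetLogEvents failed:", err)
			return
		}

		// Handle the events of this page first ...
		for _, ev := range out.Events {
			handleEvent(ev)
		}
		// ... and only then persist the token, so a crash replays rather than skips.
		saveState(aws.StringValue(out.NextForwardToken))

		// GetLogEvents signals "no more data" by returning the token you passed in.
		if token != nil && aws.StringValue(out.NextForwardToken) == aws.StringValue(token) {
			break
		}
		token = out.NextForwardToken
	}
}
```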
