Investigate client behaviour in a case of target pod/node restart #252

Open
mtrunkat opened this issue May 30, 2022 · 6 comments
Labels
backend Issues related to the platform backend. bug Something isn't working. t-platform Issues with this label are in the ownership of the platform team.

Comments

@mtrunkat
Member

From this discussion https://apifier.slack.com/archives/C013WC26144/p1653552365035479, it seems that sometimes there is a series of network errors, which leads to a suspicion that the client might be retrying requests to the same pod even though it's dead.

2022-05-16T00:38:56.894Z WARN  ApifyClient: API request failed 4 times. Max attempts: 9.
2022-05-16T00:38:56.897Z Cause:Error: aborted
2022-05-16T00:38:56.899Z     at connResetException (node:internal/errors:692:14)
2022-05-16T00:38:56.901Z     at Socket.socketCloseListener (node:_http_client:414:19)
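
For context, a minimal sketch of the client-side retry configuration behind a log like the one above, assuming the JavaScript apify-client; the token and run ID are placeholders and the option values are illustrative, not necessarily what the failing actor used.

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: process.env.APIFY_TOKEN,
    maxRetries: 8, // 8 retries => up to 9 attempts, matching "Max attempts: 9" above
    minDelayBetweenRetriesMillis: 500,
});

(async () => {
    // Network errors such as the "aborted" / connResetException above are retried
    // with backoff, but a retry by itself does not guarantee a fresh TCP connection.
    const run = await client.run('RUN_ID_PLACEHOLDER').get();
    console.log(run.status);
})();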
@mtrunkat mtrunkat added the bug Something isn't working. label May 30, 2022
@mnmkng
Member

mnmkng commented May 30, 2022

I think it might be because of the keep-alive connections and HTTPS tunneling. How does the client learn that the pod is down and that it should retry elsewhere?
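
A plain Node.js sketch of the suspected mechanism (not the client's actual internals): with a keep-alive agent, a pooled socket is reused for subsequent requests, and if the peer behind that socket has died, the first request on the reused socket fails (typically ECONNRESET / "aborted") instead of being transparently routed elsewhere. Host and path are placeholders.

const https = require('https');

// Keep-alive agent with a single pooled socket, so reuse is easy to observe.
const agent = new https.Agent({ keepAlive: true, maxSockets: 1 });

function getOnce(path) {
    return new Promise((resolve, reject) => {
        const req = https.get({ host: 'api.apify.com', path, agent }, (res) => {
            res.resume(); // drain the body so the socket goes back to the pool
            resolve({ status: res.statusCode, reusedSocket: req.reusedSocket });
        });
        // If the pooled socket's peer is gone, the error surfaces here (e.g. ECONNRESET);
        // the caller must retry on a new socket, there is no automatic re-routing.
        req.on('error', reject);
    });
}

(async () => {
    console.log(await getOnce('/v2/acts')); // reusedSocket: false (new connection)
    console.log(await getOnce('/v2/acts')); // reusedSocket: true (pooled socket reused)
})();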

@fnesveda
Member

fnesveda commented Jun 1, 2022

Note: We could test this on multistaging by starting two API pods, starting an actor which uses the API in a loop, and then killing one of the two pods. We could also make a testing version of the client with some more debug logging to help us figure it out.
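
A possible shape for that test actor (a sketch only, not the actual harness; the token, run ID, and 500 ms interval are placeholders):

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

(async () => {
    // Hit the API in a loop from a single client instance and log every failure
    // with its timing, so we can see what happens when one of the two pods is killed.
    for (let i = 0; ; i++) {
        const started = Date.now();
        try {
            await client.run('RUN_ID_PLACEHOLDER').get();
            console.log(`#${i} ok in ${Date.now() - started} ms`);
        } catch (err) {
            console.log(`#${i} failed after ${Date.now() - started} ms: ${err.message}`);
        }
        await new Promise((resolve) => setTimeout(resolve, 500));
    }
})();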

@fnesveda fnesveda added medium priority Medium priority issues to be done in a couple of sprints. next sprint Check this out when planning next sprint. labels Jun 1, 2022
@fnesveda fnesveda added this to the 40th sprint - Platform team milestone Jun 6, 2022
@jirimoravcik
Member

The 2-pod multistaging is here: https://github.com/apify/apify-core/pull/6934

@drobnikj
Member

It looks like keep-alive doesn't work: it does not propagate through the application load balancer, and the requests are distributed between pods.
Below is the list of pods that served each API call; I was making a get-run API call from the same Apify client instance every 0.5 s.
Because I have just 2 pods and the ALB uses a round-robin scheme, the pods alternated on every request.

0: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
1: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
2: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
3: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
4: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
5: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
6: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
7: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
8: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
9: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
10: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
11: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
12: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
13: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
14: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
15: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
16: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
17: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
18: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
19: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
20: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
21: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
22: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
23: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
24: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
25: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
26: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
27: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
28: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
29: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
30: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
31: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
32: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"
33: "apify-api-dummymultistagingbranchfortesting-67745785c5-gf6m6"
34: "apify-api-dummymultistagingbranchfortesting-67745785c5-4btl4"

If you restart one node, it simply switches to the new one.
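
A sketch of how a per-request pod list like the one above could be collected; it assumes the API exposes the serving pod in a response header (the 'x-served-by' name is hypothetical — the comment above does not say how the list was actually obtained), and the run ID is a placeholder.

const https = require('https');

const agent = new https.Agent({ keepAlive: true }); // single client, keep-alive enabled
const pods = [];

function probe(i) {
    const req = https.get({ host: 'api.apify.com', path: '/v2/actor-runs/RUN_ID_PLACEHOLDER', agent }, (res) => {
        res.resume();
        pods.push(res.headers['x-served-by']); // hypothetical pod-identifying header
        console.log(`${i}: "${pods[i]}"`); // with 2 pods behind a round-robin ALB the names alternate
    });
    req.on('error', (err) => console.log(`${i}: error ${err.code || err.message}`));
}

let i = 0;
setInterval(() => probe(i++), 500); // one get-run call every 0.5 s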

@drobnikj
Member

If we want to support keep-alive headers, we probably need some changes on the ALB or elsewhere in the platform networking. I'm not sure whether the fact that it is not working right now affects users, but it probably hasn't worked since we started using the ALB. cc @dragonraid @mnmkng

@drobnikj
Member

drobnikj commented Jul 11, 2022

I'm moving this to the icebox; we can follow up once the issue appears again. It looks like a network or some other transient error, but it's hard to say two months later: we do not have any logs, and the issue hasn't appeared in the same actor again since this report.

@drobnikj drobnikj removed next sprint Check this out when planning next sprint. medium priority Medium priority issues to be done in a couple of sprints. labels Jul 11, 2022
@fnesveda fnesveda added the t-platform Issues with this label are in the ownership of the platform team. label Jul 19, 2022
@fnesveda fnesveda removed this from the 42nd sprint - Platform team milestone Aug 9, 2022
@fnesveda fnesveda added the backend Issues related to the platform backend. label Nov 5, 2024