Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG : Race condition when initializing SDK in EventRepository #492

Open
ZacBeemer-JaneApp opened this issue Dec 9, 2024 · 4 comments
Open
Assignees
Labels
bug Something isn't working

Comments

@ZacBeemer-JaneApp
Copy link

ZacBeemer-JaneApp commented Dec 9, 2024

Describe the bug
There appears to be a race condition causing silent failures when the EventRepository object is first initialized. This prevents the app from sending events to Rudderstack.

We are working on a React Native project using the Rudderstack React Native SDK. We noticed that we were not receiving any messages from Android devices and, through a long and tedious process, uncovered that this Android SDK has an uncaught race condition which is silently causing the SDK Initialization process to hang and never complete.

The issue stems from how the RudderServerConfigManager class is using ReentrantLock.

  • In the following pseudo code explanation, the issue should become clear:
    •    final ExecutorService executor;
         final ReentrantLock lock;
         
         Constructor() {
             brokenThread();
             normalThread();
         }
      
         void normalThread() {
             executor.submit(() -> {
                 lock.lock();
                 System.out.println("Hello World");
                 lock.unlock();
             });
         }
      
         void brokenThread() {
             executor.submit(() -> {
                 lock.lock();
                 throw new NullPointerException();
                 lock.unlock();
             });
         }
    • In the example, if the ExecutorService performs the brokenThread() logic before normalThread() then the call to lock.unlock() is never performed and normalThread() cannot proceed as it can never obtain the lock.

More specifically, the scenario being played out in our testing is as follows:

  • EventRepository is created
  • The constructor initializes a new RudderServerConfigManager
    • The RudderServerConfigManager constructor calls on RudderServerConfigManager.fetchConfig() which submits a task to be performed asynchronously
    • The fetchConfig() method is exited regardless of whether or not the async logic is performed
  • The EventRepository constructor then eventually calls EventRepository.initiateSDK()
    • InitiateSDK spawns another Thread to perform its logic asynchronously
    • InitiateSDK does a lot, but the important piece is where it confirms that the Server Config exists
      • The call to RudderServerConfigManager.getConfig() is implicitly dependent on the logic from RudderServerConfigManager.fetchConfig() having been completed, which is not guaranteed
    • If the logic submitted by RudderServerConfigManager.fetchConfig() has not completed yet (ie. Slow download, device thread prioritization), then initiateSDK() will continue and enter its retry-loop
  • Meanwhile, since initiateSDK() spawned a new Thread, it returns almost immediately and EventRepository works its way down to EventRepository.initiateRudderReporterFromPrefetchedConfig()
    • This method submits some callback logic to RudderServerConfigManager.getFetchedConfig()
      • This logic is also implicitly dependent on RudderServerConfigManager.fetchConfig() having completed
    • The callback logic makes a call to ReportManager.enableStatsCollection(), passing in the Data Plane URL
      • In the case where the Data Plane URL is not available, ReportManager.enableStatsCollection() throws an uncaught NullPointerException
        • Keep in mind that the following example is from manually surrounding logic with a Try-Catch and injecting custom log messages so the line references will not match the mainline branch
        • 2024-12-09 13:26:53.315  5514-5596  System.err              W  java.lang.NullPointerException: Attempt to invoke virtual method 'com.rudderstack.android.sdk.core.SourceConfiguration$Metrics com.rudderstack.android.sdk.core.SourceConfiguration$StatsCollection.getMetrics()' on a null object reference
          2024-12-09 13:26:53.315  5514-5596  System.err              W  	at com.rudderstack.android.sdk.core.ReportManager.enableStatsCollection(ReportManager.java:166)
          2024-12-09 13:26:53.315  5514-5596  System.err              W  	at com.rudderstack.android.sdk.core.EventRepository.$r8$lambda$YGTjZjmH_we7SltuiZkqwISvAUw(Unknown Source:0)
          2024-12-09 13:26:53.315  5514-5596  System.err              W  	at com.rudderstack.android.sdk.core.EventRepository.lambda$initiateRudderReporterFromPrefetchedConfig$0(EventRepository.java:169)
          2024-12-09 13:26:53.315  5514-5596  System.err              W  	at com.rudderstack.android.sdk.core.EventRepository$$ExternalSyntheticLambda2.onConfigFetched(D8$$SyntheticClass:0)
          2024-12-09 13:26:53.316  5514-5596  System.err              W  	at com.rudderstack.android.sdk.core.RudderServerConfigManager.lambda$getFetchedConfig$1(RudderServerConfigManager.java:98)
          2024-12-09 13:26:53.316  5514-5596  System.err              W  	at com.rudderstack.android.sdk.core.RudderServerConfigManager.$r8$lambda$D1rwBTLDndjjqf3X7BU51T5tnRk(Unknown Source:0)
          2024-12-09 13:26:53.316  5514-5596  System.err              W  	at com.rudderstack.android.sdk.core.RudderServerConfigManager$$ExternalSyntheticLambda1.run(D8$$SyntheticClass:0)
          2024-12-09 13:26:53.316  5514-5596  System.err              W  	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:487)
          2024-12-09 13:26:53.316  5514-5596  System.err              W  	at java.util.concurrent.FutureTask.run(FutureTask.java:264)
          2024-12-09 13:26:53.316  5514-5596  System.err              W  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          2024-12-09 13:26:53.316  5514-5596  System.err              W  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:644)
          2024-12-09 13:26:53.316  5514-5596  System.err              W  	at java.lang.Thread.run(Thread.java:1012)
          
  • When the uncaught NPE is thrown, it kills the Thread it's running within.
    • Since the use of the ReentrantLock in RudderServerConfigManager is not using the following suggested structure, the Thread that previously claimed the lock will never release it and all other Threads waiting for it will be stuck.
      • // The "Correct" way according to the Javadocs
        lock.lock();
        try {
            // Critical Section
        } finally {
            lock.unlock();
        }
      • What interested us the most is that this does not appear to be breaking the app that Rudderstack is being used within, but simply prevents Rudderstack from doing anything
        • For example, calls to flush() will throw an error (if log output is explicitly configured), but otherwise not kill the app
          • image
      • Also worth noting; The act of catching the NPE prevents this issue from occurring as the Thread is not killed, but rather carries on following the printing of the Stacktrace

To Reproduce
Steps to reproduce the behaviour:

  1. We have a simple recreation in a running React Native project if you need to see it in action.
  1. Download the example app and launch it (virtual or real device; doesn't matter)
  • The example app comes with a README which describes the process of launching and recreating the issue

Expected Behavior
The event should be fired off and no errors should be observed

Screenshots

  • image
  • Custom log capture painting a picture of the timeline of events in the app:
    • ---------------------------- PROCESS STARTED (6097) for package [REDACTED] ----------------------------
      2024-12-09 13:44:00.332  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) EventRepository constructor entered
      2024-12-09 13:44:00.373  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) RudderServerConfigManager constructor entered
      2024-12-09 13:44:00.373  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) fetchConfig() entered
      2024-12-09 13:44:00.373  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) fetchConfig() exited
      2024-12-09 13:44:00.374  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) RudderServerConfigManager constructor exited
      2024-12-09 13:44:00.379  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) initiateSDK() entered
      2024-12-09 13:44:00.380  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) SDK Not Initialized. Retry 0
      2024-12-09 13:44:00.380  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) Calling configManager.getConfig()
      2024-12-09 13:44:00.380  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getConfig() entered
      2024-12-09 13:44:00.380  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) claiming lock
      2024-12-09 13:44:00.380  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) successfully claimed lock
      2024-12-09 13:44:00.380  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) releasing lock
      2024-12-09 13:44:00.380  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) successfully released lock
      2024-12-09 13:44:00.380  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getConfig() exited
      2024-12-09 13:44:00.380  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) initiateRudderReporterFromPrefetchedConfig() entered
      2024-12-09 13:44:00.381  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) getFetchedConfig() entered
      2024-12-09 13:44:00.381  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) getFetchedConfig() exited
      2024-12-09 13:44:00.381  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) initiateRudderReporterFromPrefetchedConfig() exited
      2024-12-09 13:44:00.381  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) DataPlaneUrl at end of EventRepository constructor: null
      2024-12-09 13:44:00.381  6097-6152  RudderSDK                  D  Debug: [TEST] (Thread ID 204) Exiting EventRepository constructor: null
      2024-12-09 13:44:00.381  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) Executing fetchConfig() logic
      2024-12-09 13:44:00.381  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) downloadConfig() entered
      2024-12-09 13:44:00.386  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) RudderServerConfigManager: downloadConfig: configUrl: [REDACTED]
      2024-12-09 13:44:00.391  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getFetchedConfig() logic entered
      2024-12-09 13:44:00.391  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) claiming lock
      2024-12-09 13:44:00.391  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) successfully claimed lock
      2024-12-09 13:44:00.391  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getFetchedConfig() logic critical section entered
      2024-12-09 13:44:00.391  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getFetchedRudderServerConfig() entered
      2024-12-09 13:44:00.396  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getFetchedRudderServerConfig() exited
      2024-12-09 13:44:00.396  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) EventRepository: constructor: Prefetched source serverConfig is available
      2024-12-09 13:44:00.396  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getDataPlaneUrl() entered
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getDataResidencyUrl() entered
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) Data Residency URL is using Default Region
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getDataResidencyUrl() exited
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) Data Residency URL was empty. Returning Config Value: [REDACTED]
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getDataPlaneUrl() exited
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) enableStatsCollection() entered
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) isStatsReporterAvailable() entered
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) isStatsReporterAvailable() exited
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) releasing lock
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) successfully released lock
      2024-12-09 13:44:00.397  6097-6176  RudderSDK                  D  Debug: [TEST] (Thread ID 220) getFetchedConfig() logic exited
      2024-12-09 13:44:00.667  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) RudderServerConfigManager: downloadConfig: configJson: {
                                                                            "source": {
                                                                              "config": {},
                                                                              "id": "[REDACTED]",
                                                                              "name": "sandbox_mobile_reactnative",
                                                                              "writeKey": "[REDACTED]",
                                                                              "enabled": true,
                                                                              "sourceDefinitionId": "[REDACTED]",
                                                                              "deleted": false,
                                                                              "createdAt": "Thu May 04 2023 15:47:00 GMT-0400 (Eastern Daylight Time)",
                                                                              "updatedAt": "Thu May 04 2023 15:47:00 GMT-0400 (Eastern Daylight Time)",
                                                                              "sourceDefinition": {
                                                                                "id": "[REDACTED]",
                                                                                "name": "ReactNative",
                                                                                "options": null,
                                                                                "displayName": "React Native",
                                                                                "category": null,
                                                                                "createdAt": "2020-04-16T11:22:37.660Z",
                                                                                "updatedAt": "2020-06-18T11:54:09.089Z"
                                                                              },
                                                                              "destinations": []
                                                                            },
                                                                            "metadata": {
                                                                              "version": "1.0.2"
                                                                            }
                                                                          }
      2024-12-09 13:44:00.668  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) saveRudderServerConfig() entered
      2024-12-09 13:44:00.670  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) saveRudderServerConfig() exited
      2024-12-09 13:44:00.670  6097-6174  RudderSDK                  I  Info: [TEST] (Thread ID 218) RudderServerConfigManager: downloadConfig: server config download successful
      2024-12-09 13:44:00.670  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) downloadConfig() exited
      2024-12-09 13:44:00.670  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) claiming lock
      2024-12-09 13:44:00.670  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) successfully claimed lock
      2024-12-09 13:44:00.670  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) getFetchedRudderServerConfig() entered
      2024-12-09 13:44:00.671  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) getFetchedRudderServerConfig() exited
      2024-12-09 13:44:00.672  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) releasing lock
      2024-12-09 13:44:00.672  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) successfully released lock
      2024-12-09 13:44:00.672  6097-6174  RudderSDK                  D  Debug: [TEST] (Thread ID 218) Finished fetchConfig() logic
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) SDK Not Initialized. Retry 1
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) Calling configManager.getConfig()
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getConfig() entered
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) claiming lock
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) successfully claimed lock
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) releasing lock
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) successfully released lock
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getConfig() exited
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) serverConfig is not null
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) SDK is enabled
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) DataPlaneUrl is: null
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getDataResidencyUrl() entered
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) Data Residency URL is using Default Region
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getDataResidencyUrl() exited
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) DataResidencyUrls is set to: null
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getDataPlaneUrl() entered
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getDataResidencyUrl() entered
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) Data Residency URL is using Default Region
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getDataResidencyUrl() exited
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) Data Residency URL was empty. Returning Config Value: [REDACTED]
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) getDataPlaneUrl() exited
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) enableStatsCollection() entered
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) isStatsReporterAvailable() entered
      2024-12-09 13:44:02.381  6097-6175  RudderSDK                  D  Debug: [TEST] (Thread ID 219) isStatsReporterAvailable() exited
      {{No further [TEST] logs | App continues to "operate"}}
      

Version of the Android SDK

  • Android 14 (API Level 34)

Version of the Rudderstack Android SDK

  • 1.25.1

Mobile device mode integration

  1. Rudderstack React Native SDK 1.14.0 (Issue present up to 1.15.0)
  2. The example app is relying on the Rudderstack React Native SDK to perform it's tasks so I don't believe that qualifies as calling on the "native SDK" directly

SDK initialisation snippet

  1. See Rudderstack React Native SDK source code for this initialization logic

Check for Correct Usage of writeKey and dataPlaneUrl
Confirm that the correct writeKey and dataPlaneUrl are utilized during SDK initialization.

  • This issue prevents dataPlaneUrl from being correctly initialized
@ZacBeemer-JaneApp ZacBeemer-JaneApp added the bug Something isn't working label Dec 9, 2024
@contributor-support
Copy link

Thanks for opening this issue! We'll get back to you shortly. If it is a bug, please make sure to add steps to reproduce the issue.

@1abhishekpandey
Copy link
Contributor

1abhishekpandey commented Dec 10, 2024

Hey @ZacBeemer-JaneApp,

Thanks for raising this issue. It seems you're using the self-hosted control plane URL and the metrics config is missing. I would request you to add the following config:

"source": {
    "config": {
        "statsCollection": {
            "errors": {
                "enabled": false
            },
            "metrics": {
                "enabled": false
            }
        }
    },
}

Or, try using the default control plane and try again:

await rudderClient.setup(rudderConfig.writeKey, {
  dataPlaneUrl: rudderConfig.dataPlaneUrl,
  // Just remove the controlPlaneUrl field and SDK will automatically use the default controlPlaneUrl.
  trackAppLifecycleEvents: true,
  recordScreenViews: false,
  logLevel: RUDDER_LOG_LEVEL.VERBOSE,
});

Let us know if you are still facing any issues after performing the above steps.

@1abhishekpandey 1abhishekpandey self-assigned this Dec 10, 2024
@ZacBeemer-JaneApp
Copy link
Author

@1abhishekpandey Thank you for the suggestion! We'll look at implementing it and get back to you with whether it made a difference or not.

Do you believe it's worth enhancing the logic to be more resilient / communicative regarding this issue in case other groups run into it as well?

@1abhishekpandey
Copy link
Contributor

Sure, we will definitely work on improving that in the future.

Currently, we are developing a new Android SDK built from scratch in Kotlin, which will address many improvements, including the issue you mentioned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants