AWS Glue Start-Blueprint-Run running into timeout issues with increased number of jobs #9

namman2 · 2022-10-26T22:06:19Z

Describe the bug
Using the AWS CLI (version: aws-cli/2.8.6 Python/3.9.11 Windows/10 exe/AMD64 prompt/off), starting a Glue Blueprint Run works fine when the total number of objects generated inside the workflow (triggers and Glue jobs) is under roughly 30-40. When more objects are generated inside the Glue workflow, the Blueprint Run seems to time out and gets stuck in the RUNNING state.

Expected Behavior
We expect the Blueprint Run to create the workflow with all the jobs it needs. In our sample workflow, each row is one task consisting of 2 jobs with a trigger between them, and there can be N such tasks.

Current Behavior
When the workflow has fewer than about 8-10 rows of tasks, the Blueprint Run succeeds and doesn't time out. With more rows, the Blueprint Run gets stuck in the RUNNING state and the workflow is never generated.

Reproduction Steps
This is the AWS CLI command we're using right now:
aws glue start-blueprint-run --blueprint-name BLUEPRINT_NAME --role-arn IAMRoleARN --parameters "file://FILE_PATH.json" --region us-east-1 --profile test-naga --cli-connect-timeout 900 --cli-read-timeout 900

The JSON parameter file contains a collection of table names. The layout file loops over them and creates the workflow in the layout described above. It's not easy to reproduce this with our exact example, but one of the samples at https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/samples could probably be adapted to create a larger number of jobs/objects through the Blueprint Run.
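
To make the size of the run easy to vary while reproducing, the parameter file could be generated by a small script. This is only a sketch: the parameter key `TableNames` and the `table_NNN` names are hypothetical placeholders, and the real key has to match whatever the blueprint's own parameter spec defines.

```python
import json


def build_parameters(num_tasks, key="TableNames", prefix="table_"):
    """Return a parameter dict with `num_tasks` placeholder table names.

    `key` and `prefix` are hypothetical; replace them with the names your
    blueprint actually expects.
    """
    return {key: [f"{prefix}{i:03d}" for i in range(1, num_tasks + 1)]}


def write_parameters(path, num_tasks):
    """Write the parameter dict to `path` as JSON and return it."""
    params = build_parameters(num_tasks)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(params, fh, indent=2)
    return params


# Example (not executed here): write a file for 12 tasks, then pass it to
# the CLI with --parameters "file://params_12.json"
# write_parameters("params_12.json", 12)
```

Increasing `num_tasks` step by step should help narrow down the point at which the run stops completing.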

Possible Solution
I'm wondering whether this is related to --cli-connect-timeout and --cli-read-timeout. Their default value is 60 seconds, and it seems like the Blueprint Run tries to create all the resources within that time. If there are more objects to create and it exceeds this time, the whole process appears to time out and gets stuck in the RUNNING state without doing anything further.

We also tried setting both values to 0, and the issue is the same. The number of objects created before the run stalls seems to vary randomly from run to run.
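
One way to check whether the CLI timeouts are really the cause is to watch the run state separately from the start call. As far as I understand, start-blueprint-run returns a RunId and the workflow is created afterwards, so the state can be polled with the Glue GetBlueprintRun API. Below is a minimal sketch using boto3's `get_blueprint_run`, assuming boto3 is installed and credentials are configured. The function name, poll interval and maximum wait are my own choices, not anything prescribed by Glue.

```python
import time

# States after which the run will not change any more. RUNNING and
# ROLLING_BACK are treated as "still in progress".
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_blueprint_run(glue, blueprint_name, run_id,
                           poll_seconds=15, max_wait_seconds=1800,
                           sleep=time.sleep):
    """Poll a blueprint run until it finishes or `max_wait_seconds` passes.

    `glue` is a boto3 Glue client (or any object with a compatible
    get_blueprint_run method). Returns the final BlueprintRun dict.
    """
    waited = 0
    while True:
        response = glue.get_blueprint_run(
            BlueprintName=blueprint_name, RunId=run_id
        )
        run = response["BlueprintRun"]
        if run["State"] in TERMINAL_STATES:
            return run
        if waited >= max_wait_seconds:
            raise TimeoutError(
                f"Blueprint run {run_id} still in state {run['State']} "
                f"after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (not executed here; BLUEPRINT_NAME and RUN_ID are placeholders):
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# run = wait_for_blueprint_run(glue, "BLUEPRINT_NAME", "RUN_ID")
# print(run["State"], run.get("ErrorMessage"))
```

If the run ends in FAILED, the ErrorMessage field of the returned run may say more about what went wrong than the CLI output does.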

CLI version used
2.8.6

Environment details (OS name and version, etc.)
Windows 10
