AWS Glue Start-Blueprint-Run running into timeout issues with increased number of jobs #9

Open
namman2 opened this issue Oct 26, 2022 · 0 comments

Describe the bug
Using the AWS CLI (version: aws-cli/2.8.6 Python/3.9.11 Windows/10 exe/AMD64 prompt/off), starting a Glue Blueprint Run works fine when the workflow generates fewer than roughly 30-40 objects in total (triggers and Glue jobs). When the workflow generates more objects than that, the Blueprint Run appears to time out and gets stuck in the RUNNING state.

Expected Behavior
We expect the Blueprint Run to create the workflow with as many jobs as needed. This is a sample workflow in which each row is one task made up of 2 jobs with a trigger between them, and there can be N such tasks:

[image: sample workflow layout]

Current Behavior
With fewer than about 8-10 rows of tasks, the Blueprint Run succeeds and does not time out. With more rows, the Blueprint Run stays stuck in the RUNNING state and the workflow is never generated.

Reproduction Steps
This is the AWS CLI command we're using right now:
aws glue start-blueprint-run --blueprint-name BLUEPRINT_NAME --role-arn IAMRoleARN --parameters "file://FILE_PATH.json" --region us-east-1 --profile test-naga --cli-connect-timeout 900 --cli-read-timeout 900

The JSON file contains a collection of table names; the layout file loops over them and creates the workflow shown in the image above. This is not easy to reproduce with the exact same example, but one of the samples at https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/samples could perhaps be adapted to create a larger number of jobs/objects through the Blueprint Run.
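To make the reproduction easier to scale, the parameters file can be generated for any number of tasks. The following is a minimal sketch only: the parameter key `TableNames` and the `table_NNN` naming are hypothetical placeholders, since the real key names are whatever the blueprint's own configuration defines.

```python
import json


def build_parameters(num_tasks, key="TableNames", prefix="table_"):
    """Return a parameters dict holding `num_tasks` table names.

    `key` and `prefix` are hypothetical; use the names your blueprint expects.
    """
    return {key: [f"{prefix}{i:03d}" for i in range(1, num_tasks + 1)]}


def write_parameters(path, num_tasks):
    """Write the parameters as JSON to `path` and return the dict written."""
    params = build_parameters(num_tasks)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(params, fh, indent=2)
    return params
```

Calling `write_parameters("FILE_PATH.json", 12)` would produce a file for 12 tasks, which can then be passed with `--parameters "file://FILE_PATH.json"` to step the task count up or down around the point where the run starts hanging.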

Possible Solution
I'm wondering whether this is related to --cli-connect-timeout and --cli-read-timeout, whose default value is 60 seconds. It seems that the Blueprint Run tries to create all the resources within that time, and if there are more objects to create than fit in that window, the whole process times out and gets stuck in the RUNNING state without doing anything further.

We also tried setting these values to 0 and saw the same issue. The number of objects created before the timeout seems to vary randomly from run to run.
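One way to separate a client-side timeout from a server-side stall is to poll the run state directly. As far as we understand, start-blueprint-run returns a RunId and the workflow is created asynchronously afterwards, so the CLI timeouts would only bound the initial HTTP request. That reading is an assumption worth verifying. Below is a minimal sketch that polls `get_blueprint_run` until the run leaves RUNNING; the helper name, polling interval, and wait limit are illustrative choices, and `glue` is expected to be a boto3 Glue client (for example `boto3.client("glue", region_name="us-east-1")`).

```python
import time

# States after which the run will not change again (ROLLING_BACK is transitional).
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_blueprint_run(glue, blueprint_name, run_id,
                           poll_seconds=15, max_wait_seconds=1800,
                           sleep=time.sleep):
    """Poll a blueprint run until it succeeds, fails, or the wait limit passes.

    Returns the final BlueprintRun dict, or raises TimeoutError if the run is
    still non-terminal after roughly `max_wait_seconds`.
    """
    waited = 0
    while True:
        run = glue.get_blueprint_run(
            BlueprintName=blueprint_name, RunId=run_id
        )["BlueprintRun"]
        if run["State"] in TERMINAL_STATES:
            return run
        if waited >= max_wait_seconds:
            raise TimeoutError(
                f"run {run_id} still in state {run['State']} after {waited}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

If the run ends in FAILED, the returned dict's `ErrorMessage` and `RollbackErrorMessage` fields (when present) may say which object could not be created. That would be more informative than a run that only ever shows RUNNING.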

CLI version used
2.8.6

Environment details (OS name and version, etc.)
Windows 10
