Firemind Data Toolkit Framework

Table of Contents

About The Project
Initial Setup

Prerequisites
Deploy IAM Role and OIDC Identity Provider
GitHub Workflows
Infrastructure Setup

Sample Workflow Execution Details
TL;DR Quick Setup

TL;DR: This framework helps you get started with a data pipeline on AWS using native services and ETL tools. The example in this framework uses AWS Glue and Amazon Athena for schema generation and data querying.

Data Toolkit is part of Firemind's Modern Data Strategy tools.

Key AWS Services

AWS Step Functions
AWS Glue
Amazon Athena
Amazon EventBridge
AWS Identity and Access Management (IAM) Roles
Amazon Simple Storage Service (Amazon S3) Buckets
AWS Systems Manager Parameter Store (SSM) Parameters
Amazon Simple Notification Service (Amazon SNS)

Infrastructure Diagram

Initial Setup

Prerequisites

Ensure your CLI has correct credentials to access the AWS account you want this framework deployed to.

To use this framework, create an empty remote repository in your GitHub organisation, clone a copy of this repo, and push it to your new remote.

Open the github-oidc-federation-template-infra.yml file and add a default value for each of the following parameters:

GitHubOrg: This should be the name of the organisation where your repo exists.
FullRepoName: The name of the repo which has a copy of this infrastructure.

Add the following to your remote repository secrets:

AWS_REGION: <e.g. eu-west-1>.
S3_TERRAFORM_STATE_REGION: <e.g. eu-west-1>.
S3_TERRAFORM_STATE_BUCKET: ml-core-<account_id>-state-bucket.
ACTION_IAM_ROLE: arn:aws:iam::<account_id>:role/GithubActionsDeployInfra.
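
The secret values above follow a fixed pattern built from the account ID and region. As a minimal sketch, the hypothetical helper `build_repo_secrets` below (not part of this repo) assembles them so they can be reviewed before being added to the repository settings. It assumes the Terraform state bucket lives in the same region as the deployment, which may not hold for your setup.

```python
def build_repo_secrets(account_id: str, region: str = "eu-west-1") -> dict[str, str]:
    """Return the four repository secrets listed above.

    Hypothetical helper for illustration only. The naming patterns are
    copied from the list above; using one region for both AWS_REGION and
    S3_TERRAFORM_STATE_REGION is an assumption.
    """
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return {
        "AWS_REGION": region,
        "S3_TERRAFORM_STATE_REGION": region,
        "S3_TERRAFORM_STATE_BUCKET": f"ml-core-{account_id}-state-bucket",
        "ACTION_IAM_ROLE": f"arn:aws:iam::{account_id}:role/GithubActionsDeployInfra",
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID.
    for name, value in build_repo_secrets("123456789012").items():
        print(f"{name}={value}")
```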

Further details can be found

Deploy IAM Role and OIDC Identity Provider

The first step is to deploy a GitHub Actions role and a GitHub OIDC identity provider in the AWS account. Together, these allow GitHub Actions to deploy the infrastructure.

Note: This only needs to be run once per AWS account. Details on this can be found here: https://github.com/marketplace/actions/configure-aws-credentials-action-for-github-actions

Important Note: An identity provider may already exist for your project. Always check in the AWS IAM console whether one exists before deploying.

Run the following command in the terminal. You can change the stack name and region:

aws cloudformation deploy --template-file github-oidc-federation-template-infra.yml --stack-name app-authorisation-infra-github-authentication --region eu-west-1 --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
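
If you deploy to several accounts or regions, it can help to build the command programmatically. The sketch below uses a hypothetical helper, `build_deploy_command`, that assembles the same AWS CLI invocation with the stack name and region as parameters. It passes both capabilities to a single `--capabilities` flag, which the AWS CLI accepts as a list. It only prints the command; running it is left to you.

```python
import shlex


def build_deploy_command(
    stack_name: str = "app-authorisation-infra-github-authentication",
    region: str = "eu-west-1",
    template_file: str = "github-oidc-federation-template-infra.yml",
) -> list[str]:
    """Build the argument list for `aws cloudformation deploy`.

    Hypothetical helper for illustration; the defaults mirror the
    command shown above.
    """
    return [
        "aws", "cloudformation", "deploy",
        "--template-file", template_file,
        "--stack-name", stack_name,
        "--region", region,
        "--capabilities", "CAPABILITY_IAM", "CAPABILITY_NAMED_IAM",
    ]


if __name__ == "__main__":
    command = build_deploy_command(region="eu-west-2")
    # Print a shell-safe version so it can be reviewed before running.
    print(shlex.join(command))
    # To run it (requires the AWS CLI and valid credentials):
    # import subprocess
    # subprocess.run(command, check=True)
```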

GitHub Workflows

GitHub Actions is used to deploy the infrastructure. The configuration for this can be found in the .github/workflows directory.

The workflows pass the following environment variables:

BUILD_STAGE - Derived from the branch name.
S3_TERRAFORM_STATE_BUCKET - Read from GitHub secrets.
S3_TERRAFORM_STATE_REGION - Read from GitHub secrets.
AWS_REGION - Read from GitHub secrets.
SERVICE - Has a default value, but can be set by the user in the .github/workflows files.
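
To make the origin of each variable concrete, here is a rough sketch of how they could be resolved. The function `resolve_workflow_env` is hypothetical and is not how the workflows in this repo are implemented. It assumes the stage is simply the branch name, as exposed by GitHub Actions in `GITHUB_REF_NAME`, and that the secrets have already been mapped into the environment. The default value for SERVICE is not stated here, so the caller must supply one.

```python
from collections.abc import Mapping


def resolve_workflow_env(environ: Mapping[str, str], default_service: str) -> dict[str, str]:
    """Collect the variables described above from an environment mapping.

    Hypothetical illustration. Assumes BUILD_STAGE equals the branch name
    and that secrets were already exported as environment variables.
    """
    return {
        "BUILD_STAGE": environ["GITHUB_REF_NAME"],
        "S3_TERRAFORM_STATE_BUCKET": environ["S3_TERRAFORM_STATE_BUCKET"],
        "S3_TERRAFORM_STATE_REGION": environ["S3_TERRAFORM_STATE_REGION"],
        "AWS_REGION": environ["AWS_REGION"],
        # Fall back to the default when SERVICE is not set explicitly.
        "SERVICE": environ.get("SERVICE", default_service),
    }


if __name__ == "__main__":
    # Sample values only; "example-service" is a placeholder default.
    sample = {
        "GITHUB_REF_NAME": "development",
        "S3_TERRAFORM_STATE_BUCKET": "ml-core-123456789012-state-bucket",
        "S3_TERRAFORM_STATE_REGION": "eu-west-1",
        "AWS_REGION": "eu-west-1",
    }
    for name, value in resolve_workflow_env(sample, "example-service").items():
        print(f"{name}={value}")
```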

Infrastructure Setup

For quick setup follow these instructions:

Create an empty repo within your GitHub account.
Check out the development branch of this repository to your local drive and push it to your remote repo.
Assuming the GitHub actions have been set up correctly, the deployment will begin.

If you run into any issues, please report a bug via the repo.

Sample Workflow Execution Details

Once the infrastructure has been deployed, navigate to S3 and find the bucket created by the framework, named data-core-[stage]-[account_id]-asset-bucket.
Navigate to the input_data/ folder and upload the sample data found in sample-data/AC2021_AnnualisedEntryExit.csv.
The upload emits an S3 object creation event, which matches an Amazon EventBridge rule that targets the Data Pipeline Step Functions state machine.
Navigate to the AWS Step Functions service and watch the workflow run.
- The first state starts a Glue Crawler that generates a data schema based on the uploaded data.
- This schema is stored in a Glue Data Catalog.
- Once the Glue Crawler has finished running, a map of SQL queries is executed in parallel through Amazon Athena.
- The results of the queries are saved back to S3 under the query_results/ prefix.
- Finally, an SNS message is sent to the configured SNS Topic. **Note**: There are no subscribers to this topic but this can be configured.
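
The trigger described above can be sketched locally. The code below builds the asset bucket name from the documented pattern, along with an EventBridge event pattern for S3 "Object Created" events. The function names are hypothetical, and the prefix filter on input_data/ is an assumption: the actual rule in eventbridge.tf may be defined differently. `triggers_pipeline` is a simplified stand-in for EventBridge matching, intended only to show which uploads would start the workflow.

```python
def asset_bucket_name(stage: str, account_id: str) -> str:
    """Bucket name following data-core-[stage]-[account_id]-asset-bucket."""
    return f"data-core-{stage}-{account_id}-asset-bucket"


def object_created_pattern(bucket: str, prefix: str = "input_data/") -> dict:
    """EventBridge pattern for S3 'Object Created' events under a key prefix.

    The prefix filter is an assumption made for this sketch.
    """
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": prefix}]},
        },
    }


def triggers_pipeline(event: dict, pattern: dict) -> bool:
    """Simplified local check of an event against the pattern above."""
    detail = event.get("detail", {})
    bucket = detail.get("bucket", {}).get("name")
    key = detail.get("object", {}).get("key", "")
    prefixes = [rule["prefix"] for rule in pattern["detail"]["object"]["key"]]
    return (
        event.get("source") in pattern["source"]
        and event.get("detail-type") in pattern["detail-type"]
        and bucket in pattern["detail"]["bucket"]["name"]
        and any(key.startswith(prefix) for prefix in prefixes)
    )


if __name__ == "__main__":
    bucket = asset_bucket_name("development", "123456789012")
    pattern = object_created_pattern(bucket)
    upload = {
        "source": "aws.s3",
        "detail-type": "Object Created",
        "detail": {
            "bucket": {"name": bucket},
            "object": {"key": "input_data/AC2021_AnnualisedEntryExit.csv"},
        },
    }
    print(triggers_pipeline(upload, pattern))
```

Under this assumption, objects written under query_results/ in the same bucket would not match the pattern, so saving the Athena results would not start the pipeline again.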

TL;DR

Configure your AWS credentials in the CLI with permissions to deploy to your account.

Deploy

bash deployment-scripts/quick-deploy.sh

Destroy

bash deployment-scripts/quick-destroy.sh
