hellofiremind/data-toolkit

Firemind Data Toolkit Framework

Report issue . Submit a feature

Table of Contents
  1. About The Project
  2. Initial Setup
    1. Prerequisites
    2. Deploy IAM Role and OIDC Identity Provider
    3. GitHub workflows
    4. Infrastructure Setup
  3. Sample Workflow Execution Details
  4. TL;DR Quick Setup

TL;DR: This framework helps you get started with a data pipeline on AWS using native services and ETL tools. The example in this framework uses AWS Glue and Amazon Athena for schema generation and data querying.

Data Toolkit is part of Firemind's Modern Data Strategy tools

Firemind's Modern Data Strategy

Key AWS Services

  • AWS Step Functions
  • AWS Glue
  • Amazon Athena
  • Amazon EventBridge
  • AWS Identity and Access Management (IAM) Roles
  • Amazon Simple Storage Service (Amazon S3) Buckets
  • AWS Systems Manager Parameter Store (SSM) Parameters
  • Amazon Simple Notification Service (Amazon SNS)

Infrastructure Diagram

Architecture

Initial Setup

Prerequisites

Ensure your CLI has correct credentials to access the AWS account you want this framework deployed to.
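
One way to check this before deploying is to ask AWS STS which account the active credentials resolve to. The sketch below is illustrative only: the helper names current_account_id and confirm_target_account are hypothetical and not part of this repository, and it assumes the AWS CLI is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the repository.

# Print the account ID that the active CLI credentials resolve to.
current_account_id() {
  aws sts get-caller-identity --query Account --output text
}

# Succeed only when the active credentials belong to the expected account.
# $1 = the account ID you intend to deploy to
confirm_target_account() {
  local actual
  if ! actual="$(current_account_id)"; then
    echo "No valid AWS credentials found" >&2
    return 1
  fi
  if [ "$actual" != "$1" ]; then
    echo "Credentials resolve to account $actual, expected $1" >&2
    return 1
  fi
  echo "Credentials resolve to account $actual"
}
```

Calling confirm_target_account with your own account ID before any deployment step stops early if the CLI is pointed at a different account.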

To use this framework, create an empty remote repo in your GitHub organisation, clone a copy of this repo, and push it to your remote.

Open the github-oidc-federation-template-infra.yml file and add a default value for:

  • GitHubOrg: This should be the name of the organisation where your repo exists.
  • FullRepoName: The name of the repo which has a copy of this infrastructure.

Add the following to your remote repository secrets:

  • AWS_REGION: <e.g. eu-west-1>.
  • S3_TERRAFORM_STATE_REGION: <e.g. eu-west-1>.
  • S3_TERRAFORM_STATE_BUCKET: ml-core-<account_id>-state-bucket.
  • ACTION_IAM_ROLE: arn:aws:iam::<account_id>:role/GithubActionsDeployInfra.
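
The secrets above can be added through the GitHub web UI. As an alternative, the sketch below assumes the GitHub CLI (gh) is installed and authenticated. The function names are hypothetical, and the values simply follow the patterns listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the repository.

# Build the state bucket name from the pattern ml-core-<account_id>-state-bucket.
state_bucket_name() {  # $1 = AWS account ID
  printf 'ml-core-%s-state-bucket\n' "$1"
}

# Build the ARN of the role assumed by GitHub Actions.
action_role_arn() {  # $1 = AWS account ID
  printf 'arn:aws:iam::%s:role/GithubActionsDeployInfra\n' "$1"
}

# Store the four secrets on the repository using the GitHub CLI.
# $1 = owner/repo, $2 = AWS account ID, $3 = AWS region,
# $4 = Terraform state region (assumed to default to $3)
set_repo_secrets() {
  gh secret set AWS_REGION --repo "$1" --body "$3" &&
    gh secret set S3_TERRAFORM_STATE_REGION --repo "$1" --body "${4:-$3}" &&
    gh secret set S3_TERRAFORM_STATE_BUCKET --repo "$1" --body "$(state_bucket_name "$2")" &&
    gh secret set ACTION_IAM_ROLE --repo "$1" --body "$(action_role_arn "$2")"
}
```

For example, set_repo_secrets my-org/my-repo 123456789012 eu-west-1 would set all four secrets, with my-org/my-repo and the account ID being placeholders.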

Further details can be found

Deploy IAM Role and OIDC Identity Provider

The first step is to deploy a GitHub Actions role and a GitHub OIDC identity provider in the account. Together, these allow GitHub Actions to deploy the infrastructure.

Note: This only needs to be run once per AWS account. Details on this can be found here: https://github.com/marketplace/actions/configure-aws-credentials-action-for-github-actions

  • Important Note: Before deploying, always check whether an identity provider already exists for your project. Existing identity providers are listed in the AWS IAM console.
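
The same check can be made from the CLI instead of the console. The sketch below assumes the template registers GitHub's standard OIDC issuer, token.actions.githubusercontent.com. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the repository.

# Read provider ARNs on stdin; succeed if GitHub's OIDC provider is among them.
has_github_oidc_provider() {
  grep -q 'oidc-provider/token\.actions\.githubusercontent\.com'
}

# List the OIDC providers in the current account and look for GitHub's.
github_oidc_provider_exists() {
  aws iam list-open-id-connect-providers \
    --query 'OpenIDConnectProviderList[].Arn' --output text |
    has_github_oidc_provider
}
```

If github_oidc_provider_exists succeeds, a provider for GitHub is already present in the account.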

Run the following command in the terminal. You can change the stack name and region:

aws cloudformation deploy --template-file github-oidc-federation-template-infra.yml --stack-name app-authorisation-infra-github-authentication --region eu-west-1 --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
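
As an alternative to editing default values in the template, aws cloudformation deploy accepts --parameter-overrides. The sketch below wraps the command above in a hypothetical function, deploy_oidc_stack. It assumes GitHubOrg and FullRepoName are declared as parameters in the template, as the Prerequisites section suggests.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of the repository.
# $1 = value for GitHubOrg, $2 = value for FullRepoName,
# $3 = region (falls back to eu-west-1 when omitted)
deploy_oidc_stack() {
  aws cloudformation deploy \
    --template-file github-oidc-federation-template-infra.yml \
    --stack-name app-authorisation-infra-github-authentication \
    --region "${3:-eu-west-1}" \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
    --parameter-overrides "GitHubOrg=$1" "FullRepoName=$2"
}
```

Passing the values at deploy time keeps the template file unchanged, which may be convenient if several accounts share one copy of the repo.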

GitHub Workflows

GitHub Actions is used to deploy the infrastructure. The configuration can be found in the .github/workflows directory.

The workflows pass the following environment variables:

  • BUILD_STAGE - Derived from the branch name.
  • S3_TERRAFORM_STATE_BUCKET - Read from GitHub secrets.
  • S3_TERRAFORM_STATE_REGION - Read from GitHub secrets.
  • AWS_REGION - Read from GitHub secrets.
  • SERVICE - Has a default value, but can be set in the .github/workflows files.
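
How a branch maps to BUILD_STAGE is defined in the workflow files and is not reproduced here. As a rough illustration only, the hypothetical function below takes a Git ref such as refs/heads/development and uses the branch name as the stage. It lowercases the name and replaces any character outside a-z, 0-9 and - with a hyphen. The repository's actual rule may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the real mapping lives in the workflow files.
# $1 = Git ref or branch name, e.g. refs/heads/development
build_stage_from_ref() {
  local branch="${1#refs/heads/}"   # drop the refs/heads/ prefix if present
  printf '%s\n' "$branch" |
    tr '[:upper:]' '[:lower:]' |    # lowercase the branch name
    tr -c 'a-z0-9\n-' '-'           # replace every other character with a hyphen
}
```

Normalising the name this way is an assumption, made because the stage also appears in resource names such as the asset bucket.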

Infrastructure Setup

For quick setup follow these instructions:

  • Create an empty repo within your GitHub account.
  • Check out this repository on the development branch to your local drive and push it to your remote repo.
  • Assuming the GitHub actions have been set up correctly, the deployment will begin.
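
The steps above can be sketched with plain git commands. The function name copy_repo_to_remote is hypothetical, and the sketch assumes your empty remote repo already exists and that you have push access to it.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# $1 = URL of this repository, $2 = URL of your empty remote repo,
# $3 = local directory to clone into
copy_repo_to_remote() {
  git clone --branch development "$1" "$3" &&           # check out the development branch
    git -C "$3" remote set-url origin "$2" &&           # point origin at your own repo
    git -C "$3" push -u origin development              # the push should start the deployment
}
```

For example: copy_repo_to_remote https://github.com/hellofiremind/data-toolkit.git <your-remote-url> data-toolkit, where <your-remote-url> is a placeholder for your own repo.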

If you have any issues, please report a bug via the repo.

Sample Workflow Execution Details

  1. Once the infrastructure has been deployed, navigate to S3 and find the bucket created by the framework, named data-core-[stage]-[account_id]-asset-bucket.
  2. Navigate to the input_data/ folder and upload the sample data found in sample-data/AC2021_AnnualisedEntryExit.csv.
  3. This triggers an Amazon EventBridge rule that targets the Data Pipeline Step Function when an object is created in S3.
  4. Navigate to the AWS Step Functions service and notice the workflow running.
    • The first state starts a Glue Crawler that generates a data schema based on the uploaded data.
    • This schema is stored in a Glue Data Catalog.
    • Once the Glue Crawler has finished running, a map of SQL queries is executed in parallel through Amazon Athena.
    • The results of the queries are saved back to S3 under the query_results/ prefix.
    • Finally, an SNS message is sent to the configured SNS Topic. Note: There are no subscribers to this topic, but subscriptions can be configured.
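
Steps 1 and 2 can also be done from the CLI. The sketch below builds the bucket name from the pattern above and uploads the sample file. The function names are hypothetical, the stage value passed in is whatever your deployment used, and the command is assumed to run from the repository root.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the repository.

# Build the asset bucket name: data-core-[stage]-[account_id]-asset-bucket.
asset_bucket_name() {  # $1 = stage, $2 = AWS account ID
  printf 'data-core-%s-%s-asset-bucket\n' "$1" "$2"
}

# Upload the sample CSV to input_data/, which should start the pipeline.
upload_sample_data() {  # $1 = stage
  local account_id
  account_id="$(aws sts get-caller-identity --query Account --output text)" || return 1
  aws s3 cp sample-data/AC2021_AnnualisedEntryExit.csv \
    "s3://$(asset_bucket_name "$1" "$account_id")/input_data/"
}
```

After the upload, the running execution should appear in the Step Functions console as described in step 4.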

TL;DR Quick Setup

Configure your AWS credentials in the CLI with permissions to deploy to your account.

Deploy

bash deployment-scripts/quick-deploy.sh

Destroy

bash deployment-scripts/quick-destroy.sh
