AirQ-Forecaster: An ETL-to-ML Pipeline for Predicting Air Quality Index

Hacktoberfest 2023 - Ploomber Mentorship Program

Description

This project builds an ETL (Extract, Transform, Load) pipeline for air quality data with Ploomber. The pipeline fetches and processes air quality measurements from the OpenAQ API and uses DuckDB and MotherDuck for data storage and management.

We implemented an ARIMA (AutoRegressive Integrated Moving Average) model for time series forecasting to predict the Air Quality Index. ARIMA is a widely used statistical method that works well for short-term forecasting of series with trends and autocorrelation. Strongly seasonal or cyclic data generally calls for its seasonal extension, SARIMA.

Key Features of ARIMA:

AutoRegressive (AR): Leverages the relationship between an observation and a number of lagged observations (autoregression).
Integrated (I): Involves differencing the time series to make it stationary, i.e., to stabilize the mean of the time series by removing changes in the level.
Moving Average (MA): Models the error term as a linear combination of error terms at various times in the past.
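To make the AR and I components concrete, here is a minimal pure-Python sketch of an ARIMA(1,1,0) forecast: one differencing pass, then a single autoregressive coefficient estimated by least squares. It is an illustration under simplifying assumptions (no MA term, no intercept), not the model code used in this project, and the function names are hypothetical.

```python
def difference(series):
    """First-order differencing: the 'I' step with d = 1."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + e[t] (no intercept)."""
    num = sum(prev * cur for prev, cur in zip(values, values[1:]))
    den = sum(prev * prev for prev in values[:-1])
    return num / den if den else 0.0


def forecast_arima_110(series, steps):
    """Forecast `steps` values ahead with an ARIMA(1,1,0) model (no MA term)."""
    if len(series) < 3:
        raise ValueError("need at least 3 observations")
    diffs = difference(series)
    phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = phi * last_diff   # AR(1) forecast of the next difference
        level += last_diff            # undo the differencing to get back to levels
        predictions.append(level)
    return predictions
```

In practice a library such as statsmodels would estimate all three components jointly. The sketch only shows how differencing and autoregression fit together.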

Watch a video explaining our project!

Data sources

OpenAQ - API designed for aggregating and sharing open air quality data from around the world.

We used an air quality sensor with location ID 380422 (49.208733, -122.9118), located in the city of New Westminster in British Columbia, Canada.
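As a rough sketch of how a request for this sensor might be assembled, the snippet below builds a query URL with the standard library. The endpoint path and the query parameter names (`location_id`, `parameter`, `limit`) are assumptions modelled on the OpenAQ v2 API and may not match the version used in extract_duckdb.py or the API currently in service. The helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint, modelled on OpenAQ v2; check the current OpenAQ docs.
BASE_URL = "https://api.openaq.org/v2/measurements"


def build_measurements_url(location_id, parameter, limit=100):
    """Return a measurements query URL for one location and one parameter."""
    query = urlencode(
        {"location_id": location_id, "parameter": parameter, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


# Location ID taken from this README; "pm25" is one of the listed parameters.
url = build_measurements_url(380422, "pm25")
```

No request is sent here. The resulting URL can be passed to any HTTP client.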

Parameters

pm1 - PM1 ➡️ Mass concentration of particulate matter smaller than 1 micrometer in diameter, µg/m³
pm10 - PM10 ➡️ Mass concentration of particulate matter smaller than 10 micrometers in diameter, µg/m³
pm25 - PM2.5 ➡️ Mass concentration of particulate matter smaller than 2.5 micrometers in diameter, µg/m³
um003 - PM0.3 ➡️ count, particles/cm³
um005 - PM0.5 ➡️ count, particles/cm³
um010 - PM1 ➡️ count, particles/cm³
um025 - PM2.5 ➡️ count, particles/cm³
um050 - PM5.0 ➡️ count, particles/cm³
um100 - PM10 ➡️ count, particles/cm³
pressure ➡️ Atmospheric or barometric pressure, hPa
temperature ➡️ °C
humidity ➡️ %
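One way to work with these parameters in a transform step is to keep their units in a lookup table and pivot long-format readings into one row per timestamp. The sketch below assumes each raw record is a dict with `timestamp`, `parameter`, and `value` keys. That record shape and the function name are hypothetical, not taken from this project's code.

```python
from collections import defaultdict

# Units for the parameters listed above.
UNITS = {
    "pm1": "µg/m³", "pm10": "µg/m³", "pm25": "µg/m³",
    "um003": "particles/cm³", "um005": "particles/cm³",
    "um010": "particles/cm³", "um025": "particles/cm³",
    "um050": "particles/cm³", "um100": "particles/cm³",
    "pressure": "hPa", "temperature": "°C", "humidity": "%",
}


def pivot_measurements(records):
    """Turn long-format readings into one wide row per timestamp.

    Readings whose parameter is not in UNITS are skipped.
    Rows are returned sorted by timestamp string.
    """
    rows = defaultdict(dict)
    for rec in records:
        if rec["parameter"] in UNITS:
            rows[rec["timestamp"]][rec["parameter"]] = rec["value"]
    return [{"timestamp": ts, **rows[ts]} for ts in sorted(rows)]
```

A wide layout like this is convenient for time series modelling, since each parameter becomes a column aligned on the same timestamps.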

Methods

GitHub Actions: The ETL process is automatically executed every hour.
Ploomber Pipeline: The ETL process is managed using Ploomber, a workflow management tool. The pipeline configuration can be found in pipeline.yaml.
Data Extraction and Cloud Storage (MotherDuck): The extraction step fetches air quality measurements from the OpenAQ API. The extraction logic is implemented in extract_duckdb.py. The extracted data is stored in the cloud using MotherDuck.
Jupyter Notebooks: The project includes Jupyter notebooks for data exploration and analysis (see extract.ipynb).
Docker Integration: The project is containerized using Docker, allowing for easy setup and deployment. The Dockerfile provides the necessary instructions to build the Docker image.

Dependencies

See pyproject.toml for all package requirements. Dependencies are managed with Poetry.

User Interface

The Streamlit app is deployed on Ploomber Cloud. You can access it at https://blue-bird-7594.ploomberapp.io/

Authors

Alejandro Leiva - aleivaar94

Oscar Beltrán - beltran-oscar

Acknowledgments

We want to thank the Ploomber Team for their time and dedicated mentorship during the development of this project. Special mention to Laura Funderburk, Developer Advocate at Ploomber, for her patience and dedication in guiding all mentees.

We also want to thank Eduardo Blancas, Co-founder/CEO at Ploomber, for this mentorship opportunity.

License

The project is licensed under the Apache 2.0 License.
