This project focuses on creating an ETL (Extract, Transform, Load) pipeline with Ploomberfor air quality data. The pipeline fetches and processes air quality measurements from the OpenAQ API and uses DuckDB and MotherDuck for data storage and management.
We implemented the ARIMA (AutoRegressive Integrated Moving Average) model for time series forecasting to predict the Air Quality Index. ARIMA is a powerful and widely-used statistical method effective for short-term forecasting with data having seasonality, or cyclic patterns.
Key Features of ARIMA:
-
AutoRegressive (AR): Leverages the relationship between an observation and a number of lagged observations (autoregression).
-
Integrated (I): Involves differencing the time series to make it stationary, i.e., to stabilize the mean of the time series by removing changes in the level.
-
Moving Average (MA): Models the error term as a linear combination of error terms at various times in the past.
Watch a video explaining our project!
OpenAQ - API designed for aggregating and sharing open air quality data from around the world.
We used an air quality sensor with location ID 380422 (49.208733, -122.9118), located in the city of New Westminster in British Columbia, Canada.
pm1 - PM1
➡️ Particulate matter less than 1 micrometer in diameter mass concentration, µg/m³pm10 - PM10
➡️ Particulate matter less than 10 micrometers in diameter mass concentration, µg/m³pm25 - PM2.5
➡️ Particulate matter less than 2.5 micrometers in diameter mass concentration, µg/m³um003 - PM0.3
➡️ count, particles/cm³um005 - PM0.5
➡️ count, particles/cm³um010 - PM1
➡️ count, particles/cm³um025 - PM2.5
➡️ count, particles/cm³um050 - PM5.0
➡️ count, particles/cm³um100 - PM10
➡️ count, particles/cm³pressure
➡️ Atmospheric or barometric pressure, hPatemperature
➡️ °Chumidity
➡️ %
-
GitHub Actions: The ETL process is automatically executed every hour.
-
Ploomber Pipeline: The ETL process is managed using Ploomber, a workflow management tool. The pipeline configuration can be found in
pipeline.yaml
. -
Data Extraction and Cloud Data Storage MotherDuck: The data extraction process fetches air quality measurements from the OpenAQ API. The extraction logic is implemented in
extract_duckdb.py
. The extracted data is stored in the cloud using MotherDuck. -
Jupyter Notebooks: The project includes Jupyter notebooks for data exploration and analysis (see
extract.ipynb
). -
Docker Integration: The project is containerized using Docker, allowing for easy setup and deployment. The Dockerfile provides the necessary instructions to build the Docker image.
See pyproject.toml
for all package requirements. Dependencies are managed using poetry
.
The Streamlit App is deployed in Ploomber Cloud. You can access the app here or by following this URL: https://blue-bird-7594.ploomberapp.io/
Alejandro Leiva - aleivaar94
Oscar Beltrán - beltran-oscar
We want to thank the Ploomber Team for their time and dedicated mentorship during the development of this project. Special mention to Laura Funderburk - Developer Advocate at Ploomber, for her patience and dedication to guide all mentees.
We also want to thank Eduardo Blancas - Co-founder/CEO at Ploomber for this mentorship opportunity.
The project is licensed under the Apache 2.0 License.