Watch it here:
Read about it here:
The main technologies are:
- Python 3
- Docker
- Spark
- Airflow
- MinIO
- Delta Lake
- Hive
- MariaDB
- Presto
- Superset
In this section, each part of the ETL pipeline is illustrated:
Spark is used to read and write data in a distributed and scalable manner.
make spark
will run the Spark master and one worker instance.
make scale-spark
will scale the Spark workers horizontally.
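As a minimal sketch of a job running against this cluster (the master URL `spark://spark-master:7077` and the file paths are assumptions, not the repo's exact values):

```python
from pyspark.sql import SparkSession

# Minimal sketch: the master URL and paths below are placeholders.
spark = (
    SparkSession.builder
    .appName("example-etl")
    .master("spark://spark-master:7077")  # assumed Spark master service name
    .getOrCreate()
)

# Read, transform, and write data in a distributed way.
df = spark.read.csv("/data/input.csv", header=True, inferSchema=True)
df.groupBy("some_column").count().write.mode("overwrite").parquet("/data/output/")

spark.stop()
```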
Airflow is one of the best workflow management tools for orchestrating Spark jobs.
make airflow
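A hypothetical DAG that submits a Spark job could look like the sketch below; it assumes the Apache Spark provider is installed in the Airflow image, and the connection id, script path, and schedule are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical DAG: connection id, application path, and schedule are assumptions.
with DAG(
    dag_id="spark_etl_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_etl_job",
        application="/opt/airflow/dags/jobs/etl_job.py",  # assumed path to the Spark job
        conn_id="spark_default",                          # assumed Spark connection in Airflow
    )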
MinIO is an open-source, distributed, high-performance object store for data lake files and Hive tables.
make minio
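A sketch of how Spark can be pointed at MinIO through the S3A filesystem follows; the endpoint, bucket name, and credentials are placeholders, not the project's real values:

```python
from pyspark.sql import SparkSession

# Sketch only: endpoint, bucket, and credentials are placeholders.
spark = (
    SparkSession.builder
    .appName("minio-example")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")     # assumed MinIO service
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")    # placeholder credentials
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Objects in a MinIO bucket are then addressed with the s3a:// scheme.
df = spark.read.parquet("s3a://datalake/raw/events/")
df.write.mode("append").parquet("s3a://datalake/staging/events/")
```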
Delta Lake stores data as open-source, columnar Parquet files with Snappy compression. Delta supports updates and deletes, which is very convenient. All JAR files required to support Delta and S3 objects are added to the Hive and Spark Docker images.
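A small sketch of writing a Delta table and then updating and deleting rows, assuming the delta-spark Python package and the JARs mentioned above are available; the table path is a placeholder:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Sketch assuming the Delta JARs are on the classpath; the path is a placeholder.
spark = (
    SparkSession.builder
    .appName("delta-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://datalake/delta/orders"
df = spark.range(10).withColumnRenamed("id", "order_id")
df.write.format("delta").mode("overwrite").save(path)

# Updates and deletes go through the DeltaTable API.
orders = DeltaTable.forPath(spark, path)
orders.update(condition="order_id = 3", set={"order_id": "300"})
orders.delete(condition="order_id = 5")
```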
In order to create tables and run Spark SQL on Delta tables, Spark needs the Hive metastore, and Hive needs MariaDB as its metastore database. MariaDB is also used as a data warehouse to run queries faster and to build dashboards.
make hive
will create the Hive and MariaDB instances.
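With the metastore running, a Delta table can be registered so that other engines can find it. The sketch below assumes a metastore reachable at `thrift://hive-metastore:9083` and reuses the placeholder table path from above:

```python
from pyspark.sql import SparkSession

# Sketch only: the metastore URI is an assumed service name from the compose setup.
spark = (
    SparkSession.builder
    .appName("hive-example")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Register the Delta table in the Hive metastore, then query it with Spark SQL.
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.orders
    USING DELTA
    LOCATION 's3a://datalake/delta/orders'
""")
spark.sql("SELECT COUNT(*) FROM default.orders").show()
```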
In order to access Delta tables without Spark, Presto is employed as a distributed query engine. It works with Superset and Hive tables. Presto is open source, scalable, and can connect to many different databases.
make presto-cluster
This command will create a Presto coordinator and one worker; the workers can scale horizontally. In order to query Delta tables using Presto:
make presto-cli
In the presto-cli, just like in Spark SQL, any query can be run.
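The same kind of query can also be issued from Python. The sketch below uses the pyhive client as an assumption; the coordinator host, port, catalog, and table name are placeholders:

```python
from pyhive import presto

# Assumed coordinator host/port and an example Hive-catalog table; adjust to the actual setup.
conn = presto.connect(host="presto-coordinator", port=8080, catalog="hive", schema="default")
cursor = conn.cursor()
cursor.execute("SELECT order_id, COUNT(*) AS cnt FROM orders GROUP BY order_id LIMIT 10")
for row in cursor.fetchall():
    print(row)
```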
Superset is open source, supports many databases, offers many dashboard styles, and is widely used in the industry for building dashboards and exploring databases.