Search Results

Found 142 repositories (showing 30)

airflow-etl-learn

asatrya

❤️35

This is a simple ETL using Airflow. First, we fetch data from an API (extract). Then, we drop unused columns, convert to CSV, and validate (transform). Finally, we load the transformed data into a database (load).

Python

Updated 5 months ago

airflow, cloud-composer-environment, etl-pipeline

APACHE_AIRFLOW_DATA_PIPELINES

ultranet1

❤️45

Project Description: A music streaming company wants to introduce more automation and monitoring to its data warehouse ETL pipelines and has concluded that the best tool for this is Apache Airflow. As their Data Engineer, I was tasked with creating a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills. Several analysts and data scientists rely on the output of this pipeline, which is expected to run daily on a schedule, pulling new data from the source and storing the results in the destination.

Data Description: The source data resides in S3 and needs to be processed in a data warehouse in Amazon Redshift. The source datasets consist of JSON logs of user activity in the application and JSON metadata about the songs the users listen to.

Data Pipeline design: At a high level, the pipeline does the following tasks:
- Extract data from multiple S3 locations.
- Load the data into the Redshift cluster.
- Transform the data into a star schema.
- Perform data validation and data quality checks.
- Calculate the most played songs for the specified time interval.
- Load the result back into S3.

[Figure: Structure of the Airflow DAG]

Design Goals: Based on the requirements of our data consumers, the pipeline must adhere to the following guidelines:
- The DAG should not have any dependencies on past runs.
- On failure, a task is retried 3 times.
- Retries happen every 5 minutes.
- Catchup is turned off.
- Do not email on retry.

Pipeline Implementation: Apache Airflow is a Python framework for programmatically creating workflows as DAGs, e.g. ETL processes, report generation, and daily model retraining. The Airflow UI automatically parses our DAG and creates a natural representation of the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
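The idea of a DAG as tasks plus dependencies can be sketched without Airflow, using only the Python standard library (graphlib, Python 3.9+). The task names and the exact dependency edges below are assumptions made for illustration; they loosely follow the high-level pipeline tasks listed above and are not taken from the project's code.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "stage_events": set(),
    "stage_songs": set(),
    "load_fact_table": {"stage_events", "stage_songs"},
    "load_dimension_tables": {"load_fact_table"},
    "data_quality_checks": {"load_dimension_tables"},
    "song_popularity": {"data_quality_checks"},
    "unload_to_s3": {"song_popularity"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A scheduler such as Airflow does considerably more (retries, scheduling, parallelism), but the ordering constraint it respects is the same: no task starts before the tasks it depends on have finished.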
A DAG describes how you want to carry out your workflow, and Operators determine what actually gets done. By default, Airflow comes with some simple built-in operators such as PythonOperator, BashOperator, and DummyOperator; it also lets you extend BaseOperator to create custom operators. For this project, I developed several custom operators.

[Figure: operators]

The description of each of these operators follows:
- StageToRedshiftOperator: Stages data to a specific Redshift cluster from a specified S3 location. The operator uses templated fields to handle partitioned S3 locations.
- LoadFactOperator: Loads data to the given fact table by running the provided SQL statement. Supports delete-insert and append style loads.
- LoadDimensionOperator: Loads data to the given dimension table by running the provided SQL statement. Supports delete-insert and append style loads.
- SubDagOperator: Two or more operators can be grouped into one task using the SubDagOperator. Here, I group the tasks of checking that the given table has rows and then running a series of data quality SQL commands.
- HasRowsOperator: Data quality check to ensure that the specified table has rows.
- DataQualityOperator: Performs data quality checks by running SQL statements to validate the data.
- SongPopularityOperator: Calculates the top ten most popular songs for a given interval. The interval is dictated by the DAG schedule.
- UnloadToS3Operator: Stores the analysis result back to the given S3 location.

Code for each of these operators is located in the plugins/operators directory.

Pipeline Schedule and Data Partitioning: The events data residing on S3 is partitioned by year (2018) and month (11). Our task is to incrementally load the event JSON files and run them through the entire pipeline to calculate song popularity and store the result back in S3. In this manner, we can obtain the top songs per day in an automated fashion using the pipeline.
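The delete-insert and append load styles mentioned for LoadFactOperator and LoadDimensionOperator, and the kind of check HasRowsOperator performs, can be illustrated with a minimal sketch. It uses an in-memory SQLite database in place of Redshift, and the function names, table names, and mode strings are hypothetical; the project's actual operators are not shown in this description and may differ.

```python
import sqlite3


def load_table(conn, table, select_sql, mode="append"):
    """Load the rows produced by select_sql into table.

    mode="delete-insert" empties the table first; mode="append" keeps existing rows.
    Table names are interpolated directly, so they must come from trusted code.
    """
    if mode not in ("append", "delete-insert"):
        raise ValueError(f"unknown mode: {mode}")
    with conn:  # commit on success, roll back on error
        if mode == "delete-insert":
            conn.execute(f"DELETE FROM {table}")
        conn.execute(f"INSERT INTO {table} {select_sql}")


def has_rows(conn, table):
    """Data quality check: True if the table contains at least one row."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count > 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_events (song TEXT)")
conn.execute("CREATE TABLE songplays (song TEXT)")
conn.executemany("INSERT INTO staging_events VALUES (?)", [("a",), ("b",)])

# Appending twice duplicates the staged rows.
load_table(conn, "songplays", "SELECT song FROM staging_events", mode="append")
load_table(conn, "songplays", "SELECT song FROM staging_events", mode="append")
appended = conn.execute("SELECT COUNT(*) FROM songplays").fetchone()[0]

# Delete-insert replaces the table contents with a fresh copy.
load_table(conn, "songplays", "SELECT song FROM staging_events", mode="delete-insert")
reloaded = conn.execute("SELECT COUNT(*) FROM songplays").fetchone()[0]

print(appended, reloaded, has_rows(conn, "songplays"))
```

Delete-insert makes a load safe to re-run, which suits backfills; append is cheaper when each run only adds new rows.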
Please note, this is a trivial analysis, but you can imagine other complex queries that follow a similar structure.

S3 input events data:
s3://<bucket>/log_data/2018/11/
    2018-11-01-events.json
    2018-11-02-events.json
    2018-11-03-events.json
    ..
    2018-11-28-events.json
    2018-11-29-events.json
    2018-11-30-events.json

S3 output song popularity data:
s3://skuchkula-topsongs/
    songpopularity_2018-11-01
    songpopularity_2018-11-02
    songpopularity_2018-11-03
    ...
    songpopularity_2018-11-28
    songpopularity_2018-11-29
    songpopularity_2018-11-30

The DAG can be configured by giving it some default_args, which specify the start_date, end_date, and the other design choices mentioned above.

    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'catchup_by_default': False,
        'provide_context': True,
    }

How to run this project?

Step 1: Create an AWS Redshift cluster, using either the console or the notebook provided in create-redshift-cluster. Run the notebook to create the cluster and make a note of:
    DWN_ENDPOINT :: dwhcluster.c4m4dhrmsdov.us-west-2.redshift.amazonaws.com
    DWH_ROLE_ARN :: arn:aws:iam::506140549518:role/dwhRole

Step 2: Start Apache Airflow. Run docker-compose up from the directory containing docker-compose.yml. Ensure that you have mapped the volume to point to the location where you have your DAGs. NOTE: You can find details of how to manage Apache Airflow on Mac here: https://gist.github.com/shravan-kuchkula/a3f357ff34cf5e3b862f3132fb599cf3

[Figure: start_airflow]

Step 3: Configure Apache Airflow Hooks. On the left is the S3 connection. The login and password are the IAM user's access key and secret key that you created. With these credentials, we are able to read data from S3. On the right is the Redshift connection.
These values can be easily gathered from your Redshift cluster connections.

Step 4: Execute the create-tables-dag. This DAG creates the staging, fact, and dimension tables. We trigger it manually because we want to keep table creation out of the main DAG. Normally, table creation can be handled by simply running a script, but for the sake of illustration I created a DAG for it and had Airflow trigger it. You can turn off the DAG once it has completed. After running this DAG, you should see all the tables created in AWS Redshift.

Step 5: Turn on the load_and_transform_data_in_redshift DAG. As the execution start date is 2018-11-01 with a schedule interval of @daily and the execution end date is 2018-11-30, Airflow will automatically schedule and trigger one DAG run per day, 30 in total. Shown below are the 30 DAG runs, ranging from start_date to end_date, that Airflow triggers once per day.

[Figure: schedule]
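The relationship between the daily schedule and the date-partitioned S3 names can be sketched in plain Python. The function below is hypothetical and only mirrors the naming pattern listed in the description (log_data/<year>/<month>/<date>-events.json for input, songpopularity_<date> for output); it is not the project's code, and in Airflow the date would typically come from the run's templated execution date instead.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Yield (run_date, input_key, output_name) for each day from start to end inclusive."""
    day = start
    while day <= end:
        stamp = day.isoformat()  # e.g. "2018-11-01"
        input_key = f"log_data/{day:%Y}/{day:%m}/{stamp}-events.json"
        output_name = f"songpopularity_{stamp}"
        yield day, input_key, output_name
        day += timedelta(days=1)


runs = list(daily_partitions(date(2018, 11, 1), date(2018, 11, 30)))
print(len(runs))
print(runs[0][1], "->", runs[0][2])
print(runs[-1][1], "->", runs[-1][2])
```

Because each run reads and writes only its own day's partition, a run does not depend on any previous run, which is what makes individual days easy to re-run or backfill.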

Python

Updated 1 month ago

airflow-etl

sevkw

❤️35

A project that runs a simple ETL orchestrated by Airflow on a local machine and AWS EC2.

Python

Updated 1 month ago

postgres-pipeline

stemitom

❤️40

A simple pipeline infrastructure: an ETL pipeline contained in a Docker environment, with Apache Airflow for orchestration and Postgres for data warehousing.

MIT

Python

Updated 8 months ago

airflow, data-engineering, data-orchestration, +8

Data-pipeline-Airflow-on-AWSRedshift

nhntran

❤️35

A demo for creating a simple data warehouse ETL pipeline on AWS with Airflow

Jupyter Notebook

Updated 2 years ago

pet_project_simple_etl_airflow

k0rsakov

🧡50

https://www.notion.so/korsak0v/Data-Engineer-185c62fdf79345eb9da9928356884ea0

MIT

Python

Updated 2 months ago

simple-etl-airflow-covid-to-bigquery

anekpattanakij

❤️25

No description available

Python

Updated 2 years ago

Simple_PySpark_ETL_with_Airflow

mkmasudrana806

❤️45

Building a simple data pipeline using PySpark and Apache Airflow with PostgreSQL database. ETL with Airflow Orchestration

Jupyter Notebook

Updated 1 month ago

data_replication_ETL_with_Airflow_Pyspark_rdbms

tobyweb3x

❤️35

This is a simple project that automates and showcases one of the most common jobs of a Data Engineer, data replication, using an ETL/EtLT approach with Python, orchestrated with Airflow.

Jupyter Notebook

Updated 3 years ago

simple-etl-airflow

patharanordev

❤️40

Simple Extract/Transform/Load (ETL) for COVID-19 data using Airflow

MIT

Python

Updated 4 years ago

airflow, covid-19, dag, +4

freddie-mercury

rendybjunior

❤️25

Simple ETL scheduler using Airflow as its engine

Apache-2.0

Python

Updated 3 years ago

gurgo22

❤️45

Simple ETL pipeline orchestrated with Airflow for loading crypto coin prices from an API into a BigQuery data warehouse

Python

Updated 2 months ago

apache-airflow, bigquery, data-pipeline, +4

databricks-airflow-dbt-pipeline

raphaph

❤️30

Use a simple Apache Airflow image on Docker with this repository to create an environment for Databricks with orchestration and ETL.

Python

Updated 1 year ago

batch-data-processing

quangliz

❤️35

A simple project that implements a scalable batch ETL (Extract, Transform, Load) pipeline using Apache Airflow for orchestration, Apache Spark for data processing, PostgreSQL for structured data storage, and Apache Cassandra for scalable raw data storage.

Python

Updated 10 months ago

dibimbing-case-study-etl

caesarmario

❤️35

End-to-end Weather ETL built with Apache Airflow, MinIO, and PostgreSQL. Extract Open-Meteo hourly data → normalize to Parquet → load to L1/L2 tables with idempotent, backfillable DAGs, simple data checks, and SQL-only transformations.

Python

Updated 4 months ago

airflow, airflow-dags, apache-airflow, +17

Introduction-to-Workflow-Management-Platform-Airflow

Develop-Packt

❤️40

In this module, you will look at creating a pipeline by breaking down a job into multiple executable stages. You will implement a simple linear pipeline, then a multi-stage data pipeline, and then automate the multi-stage pipeline using Bash. You will then improve efficiency by running the pipeline as an asynchronous process using the ETL workflow, and finally create a DAG for the pipeline and implement it using Airflow.

MIT

Jupyter Notebook

Updated 5 years ago

artificial-intelligence, bash, data-pipeline, +1

OpenWeather-Data-Pipeline

Syedhashirayub

❤️35

The goal of this project is to create a simple ETL pipeline using Airflow and OpenWeather API. In this pipeline, we will extract data from OpenWeather API, transform it, and load it into an AWS S3 bucket using s3fs.

Python

Updated 2 years ago

Airflow_dolar

Nicckks

❤️45

This project implements a daily automation using Apache Airflow to collect the US dollar exchange rate (USD → BRL), save the data to CSV, store it in a SQLite database, and log a final notification. It is a simple but complete pipeline, ideal for studying orchestration and ETL.

Python

Updated 1 month ago

Simple-ETL-Pipeline-Automation-With-Airflow

chaba-victor

❤️40

This project uses Apache Airflow to extract weather data from an API, transform the data, and load it into an S3 bucket.

MIT

Python

Updated 2 years ago

BD_HW_2

lmnindzja

❤️35

Airflow simple ETL

Python

Updated 1 year ago

Airflow_1

AzlieJohari57

❤️35

simple airflow ETL

Updated 7 months ago

SimpleETLProject_ApacheAirflow

wandaarma

❤️25

No description available

Python

Updated 5 months ago

simple-etl-airflow

CaoKha

❤️25

No description available

HTML

Updated 1 year ago

Simple_ETL_Airflow

truongthanhvu2337

❤️35

No description available

Python

Updated 2 months ago

GitHub Explorer

Search Results

airflow-etl-learn

APACHE_AIRFLOW_DATA_PIPELINES

airflow-etl

postgres-pipeline

Data-pipeline-Airflow-on-AWSRedshift

pet_project_simple_etl_airflow

simple-etl-airflow-covid-to-bigquery

Simple_PySpark_ETL_with_Airflow

data_replication_ETL_with_Airflow_Pyspark_rdbms

simple-etl-airflow

freddie-mercury

talend_customer_example

ETL_airflow_MSSQL_postgres

Course_Rating_ETL_Pipeline

meow_fact_pipeline

ETL_with_Airflow

Air-Quality-Dashboard-ETL-Pipeline

crypto-etl-pipeline

databricks-airflow-dbt-pipeline

batch-data-processing

dibimbing-case-study-etl

Introduction-to-Workflow-Management-Platform-Airflow

OpenWeather-Data-Pipeline

Airflow_dolar

Simple-ETL-Pipeline-Automation-With-Airflow

BD_HW_2

Airflow_1

SimpleETLProject_ApacheAirflow

simple-etl-airflow

Simple_ETL_Airflow
