Found 1,109 repositories (showing 30)
Code for Data Pipelines with Apache Airflow
airscholar
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
alanchn31
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
iam-mhaseeb
A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase to serve data visualization needs such as analytical dashboards.
axel-sirota
Productionalizing Data Pipelines with Apache Airflow
Automating Your Data Pipeline with Apache Airflow
airscholar
An end-to-end data engineering pipeline that fetches data from Wikipedia, cleans and transforms it with Apache Airflow and saves it on Azure Data Lake. Other processing takes place on Azure Data Factory, Azure Synapse and Tableau.
Joshua-omolewa
Built a real-time streaming pipeline to extract stock data using Apache NiFi, Debezium, Kafka, and Spark Streaming. The transformed data is loaded into an AWS Glue database, and real-time dashboards are created with Power BI and Tableau via Athena. The pipeline is orchestrated using Airflow.
akarce
End-to-end data pipeline that ingests, processes, and stores data. It uses Apache Airflow to schedule scripts that fetch data from an API, send the data to Kafka, and process it with Spark before writing to Cassandra. The pipeline, built with Python and Apache Zookeeper, is containerized with Docker for easy deployment and scalability.
ultranet1
Project Description: A music streaming company wants to introduce more automation and monitoring to their data warehouse ETL pipelines, and they have concluded that the best tool to achieve this is Apache Airflow. As their Data Engineer, I was tasked with creating a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills. Several analysts and data scientists rely on the output generated by this pipeline, and it is expected to run daily on a schedule, pulling new data from the source and storing the results in the destination.

Data Description: The source data resides in S3 and needs to be processed in a data warehouse in Amazon Redshift. The source datasets consist of JSON logs describing user activity in the application and JSON metadata about the songs the users listen to.

Data Pipeline design: At a high level the pipeline performs the following tasks:
- Extract data from multiple S3 locations.
- Load the data into a Redshift cluster.
- Transform the data into a star schema.
- Perform data validation and data quality checks.
- Calculate the most played songs for the specified time interval.
- Load the result back into S3.

[Figure: structure of the Airflow DAG]

Design Goals: Based on the requirements of our data consumers, the pipeline is required to adhere to the following guidelines:
- The DAG should not have any dependencies on past runs.
- On failure, a task is retried 3 times.
- Retries happen every 5 minutes.
- Catchup is turned off.
- Do not email on retry.

Pipeline Implementation: Apache Airflow is a Python framework for programmatically creating workflows as DAGs, e.g. ETL processes, report generation, and retraining models on a daily basis. The Airflow UI automatically parses our DAG and creates a natural representation of the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG describes how you want to carry out your workflow, and Operators determine what actually gets done. By default, Airflow comes with simple built-in operators such as PythonOperator, BashOperator, and DummyOperator; however, Airflow also lets you extend BaseOperator to create custom operators. For this project, I developed several custom operators (an illustrative sketch of one follows the list below):
- StageToRedshiftOperator: Stages data to a specified Redshift cluster from a specified S3 location. Uses templated fields to handle partitioned S3 locations.
- LoadFactOperator: Loads data into the given fact table by running the provided SQL statement. Supports delete-insert and append-style loads.
- LoadDimensionOperator: Loads data into the given dimension table by running the provided SQL statement. Supports delete-insert and append-style loads.
- SubDagOperator: Groups two or more operators into one task. Here, it groups the check that a given table has rows with a series of data quality SQL commands.
- HasRowsOperator: Data quality check that ensures the specified table has rows.
- DataQualityOperator: Performs data quality checks by running SQL statements that validate the data.
- SongPopularityOperator: Calculates the top ten most popular songs for a given interval; the interval is dictated by the DAG schedule.
- UnloadToS3Operator: Stores the analysis result back to the given S3 location.
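As a hedged illustration of the custom-operator pattern described above (not the project's actual code), a minimal operator in the spirit of HasRowsOperator might look like the sketch below. It assumes Airflow 2.x with the Postgres provider installed and a Redshift connection id named 'redshift'; the class body, argument names, and messages are hypothetical.

    from airflow.models import BaseOperator
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    class HasRowsOperator(BaseOperator):
        """Illustrative data quality check: fail the task if a table is empty."""

        def __init__(self, redshift_conn_id="redshift", table="", **kwargs):
            super().__init__(**kwargs)
            self.redshift_conn_id = redshift_conn_id  # assumed Airflow connection id
            self.table = table                        # table to validate

        def execute(self, context):
            # Connect to Redshift through the Postgres hook and count rows.
            redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
            records = redshift.get_records(f"SELECT COUNT(*) FROM {self.table}")
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {self.table} returned no rows")
            self.log.info("Data quality check passed: %s has %s rows", self.table, records[0][0])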
Code for each of these operators is located in the plugins/operators directory.

Pipeline Schedule and Data Partitioning: The events data residing on S3 is partitioned by year (2018) and month (11). Our task is to incrementally load the event JSON files and run them through the entire pipeline to calculate song popularity and store the result back into S3. In this manner, we can obtain the top songs per day in an automated fashion using the pipeline. Please note, this is a trivial analysis, but you can imagine other complex queries that follow a similar structure.

S3 input events data:
    s3://<bucket>/log_data/2018/11/
        2018-11-01-events.json
        2018-11-02-events.json
        2018-11-03-events.json
        ..
        2018-11-28-events.json
        2018-11-29-events.json
        2018-11-30-events.json

S3 output song popularity data:
    s3://skuchkula-topsongs/
        songpopularity_2018-11-01
        songpopularity_2018-11-02
        songpopularity_2018-11-03
        ...
        songpopularity_2018-11-28
        songpopularity_2018-11-29
        songpopularity_2018-11-30

The DAG can be configured by giving it default_args, which specify the start_date, end_date, and the other design choices mentioned above:

    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'catchup_by_default': False,
        'provide_context': True,
    }

How to run this project?

Step 1: Create an AWS Redshift cluster using either the console or the notebook provided in create-redshift-cluster. Run the notebook to create the cluster and make a note of:
    DWN_ENDPOINT :: dwhcluster.c4m4dhrmsdov.us-west-2.redshift.amazonaws.com
    DWH_ROLE_ARN :: arn:aws:iam::506140549518:role/dwhRole

Step 2: Start Apache Airflow. Run docker-compose up from the directory containing docker-compose.yml. Ensure that you have mapped the volume to point to the location of your DAGs. NOTE: details of how to manage Apache Airflow on a Mac are here: https://gist.github.com/shravan-kuchkula/a3f357ff34cf5e3b862f3132fb599cf3

Step 3: Configure Apache Airflow connections. On the left is the S3 connection; the Login and Password are the access key and secret key of the IAM user you created, which allow the pipeline to read data from S3. On the right is the Redshift connection; these values can be gathered from your Redshift cluster.

Step 4: Execute the create-tables-dag. This DAG creates the staging, fact, and dimension tables. It is triggered manually because we want to keep table creation out of the main DAG; normally this could be handled by a simple script, but for the sake of illustration I created a DAG for it and had Airflow trigger it. You can turn off the DAG once it has completed. After running this DAG, you should see all the tables created in AWS Redshift.

Step 5: Turn on the load_and_transform_data_in_redshift DAG. As the execution start date is 2018-11-01 with a schedule interval of @daily and the execution end date is 2018-11-30, Airflow will automatically trigger and schedule the DAG runs once per day, 30 times in total, from start_date through end_date. A sketch of how a DAG could be wired with these defaults follows.
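For orientation only, here is a minimal sketch of how a DAG using the default_args and @daily schedule above could be declared. Task names are hypothetical, the real project uses its custom operators rather than DummyOperator, and the import paths assume Airflow 2.x; the non-standard 'catchup_by_default' and 'provide_context' keys from the original dictionary are omitted because Airflow 2.x does not read them from default_args.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy import DummyOperator  # placeholder tasks only

    # Defaults copied from the README; non-standard keys omitted (see note above).
    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
    }

    with DAG(
        dag_id='load_and_transform_data_in_redshift',
        default_args=default_args,
        schedule_interval='@daily',  # one scheduled run per day between start_date and end_date
    ) as dag:
        # Hypothetical task skeleton; the real DAG uses StageToRedshiftOperator,
        # LoadFactOperator, LoadDimensionOperator, the data quality subDAG, etc.
        begin = DummyOperator(task_id='begin_execution')
        stage_events = DummyOperator(task_id='stage_events_to_redshift')
        load_fact = DummyOperator(task_id='load_fact_table')
        quality_checks = DummyOperator(task_id='run_data_quality_checks')
        stop = DummyOperator(task_id='stop_execution')

        begin >> stage_events >> load_fact >> quality_checks >> stop

With catchup enabled (Airflow's default), Airflow backfills one run per day across the configured date range, which is how the repository ends up with the 30 daily runs described in Step 5.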
aws-samples
Integrate SageMaker Data Wrangler into your MLOps workflows with Amazon SageMaker Pipelines, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow (MWAA)
patelatharva
Creating Data Pipelines with Apache Airflow to manage ETL from Amazon S3 into Amazon Redshift
Li-Airflow-Backfill-Plugin is a plugin that works with Apache Airflow to provide a data backfill feature, i.e. to rerun pipelines for a certain date range.
End to end data pipeline with Apache Airflow, Postgres and Apache Superset
A real-time data pipeline that collects and processes retail data using Apache Kafka and Spark. It organizes the data through a medallion architecture to prepare clean and useful insights. The pipeline is managed with Airflow and dbt and visualized using Power BI.
AnthonyByansi
Automate your data pipelines using Apache Airflow with this ready-to-use DAG for data integration, ETL and workflow automation.
write4alive
Data Engineering Nanodegree Program of Udacity - Project 5 - Data Pipelines with Apache Airflow
stemitom
A simple pipeline infrastructure with ETL pipeline contained in a Docker environment on Apache Airflow for orchestration and Postgres for data warehousing
sweetkobem
This repository uses the Medallion architecture to build a data lakehouse, with Apache Airflow as the orchestrator and pipeline engine and DuckDB as the data warehouse and transformation engine.
This project creates a robust data pipeline for efficient ingestion, processing, and storage. Using Apache Airflow for orchestration, it integrates Python, Kafka, Zookeeper, and Spark for real-time data processing, with Cassandra for storage. Docker containerization ensures smooth deployment and scalability of all components.
DSandovalFlavio
Data-Pipelines-with-Apache-AirFlow-Notes
This project demonstrates an end-to-end healthcare data pipeline using Apache Airflow for orchestration, dbt for transformations, and Google BigQuery/GCS for data storage and querying. It automates data workflows, including raw data generation, testing, and transformation, with support for dev/prod environments.
Data Pipelines with Apache Airflow
awaisajaz1
A comprehensive data engineering pipeline that coordinates the ingestion, processing, and storage of data using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and FastDBs. All components are containerized with Docker for straightforward deployment and scalability.
KimGeunUk
Weather & Weather Forecast Data Pipeline With Apache Airflow
Elsayed91
Automated data & MLOps pipeline leveraging Kubernetes and Apache Airflow. Integrates Spark, Kafka, and DBT with a focus on data quality. Tailors solutions for diverse user needs.
phuonganh-38
A production-ready ELT data pipeline using Apache Airflow and dbt Cloud, with the primary goal of processing and transforming Airbnb and Census data for Sydney
anandsuraj
A fully automated ML pipeline for customer churn prediction in telecom, orchestrated with Apache Airflow. Covers data ingestion, validation, feature engineering, model training, deployment, and monitoring with DVC-based versioning for complete reproducibility.
ifyjakande
🏦 Modern financial data pipeline built with Apache Airflow, Google Cloud Platform (GCP), and dbt. Implements dimensional modeling for transaction analytics with automated data quality checks and robust transformations. Features synthetic data generation, BigQuery integration, and optimized warehouse design for analytical queries.