Found 363 repositories (showing 30)
hoangsonww
📈 A scalable, production-ready data pipeline for real-time streaming & batch processing, integrating Kafka, Spark, Airflow, AWS, Kubernetes, and MLflow. Supports end-to-end data ingestion, transformation, storage, monitoring, and AI/ML serving with CI/CD automation using Terraform & GitHub Actions.
LoveNui
Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extract data from S3, transform it using Spark, and load the transformed data back to S3.
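The transform step in S3 → EMR → S3 pipelines like this one usually amounts to dropping malformed rows and casting fields. A minimal plain-Python sketch of that step (the field names "order_id" and "amount" are illustrative, not taken from the repository; in the real job this logic would run inside a Spark DataFrame transformation):

```python
# Stand-in for the Spark transform stage of an S3 -> EMR -> S3 batch ETL job.
def transform(records):
    """Drop malformed rows and cast the amount field to float."""
    out = []
    for row in records:
        if not row.get("order_id"):
            continue  # skip rows missing the key
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # skip rows with a missing or non-numeric amount
        out.append({"order_id": row["order_id"], "amount": amount})
    return out

raw = [
    {"order_id": "A1", "amount": "19.99"},
    {"order_id": "", "amount": "5"},       # missing key -> dropped
    {"order_id": "A2", "amount": "oops"},  # bad amount -> dropped
]
print(transform(raw))  # [{'order_id': 'A1', 'amount': 19.99}]
```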
siddharth271101
The goal of this project is to analyse the impact of Covid-19 on the aviation industry through data engineering processes, using technologies such as Apache Airflow, Apache Spark, Tableau, and a couple of AWS services
aws-solutions-library-samples
This solution sets up an automated migration process for moving tables (Apache Iceberg and Hive tables) registered in AWS Glue Table Catalog and stored in Amazon S3 general-purpose buckets to Amazon S3 Table buckets using AWS Step Functions, and Amazon EMR with Apache Spark.
Joshua-omolewa
Built a data pipeline for a retail store using AWS services that collects data from its transactional (OLTP) database in Snowflake and transforms the raw data (ETL) using Apache Spark to meet business requirements; it also enables data analysts to create visualizations using Superset. Airflow is used to orchestrate the pipeline.
devindatt
Building an ETL process using Spark EMR in AWS
anthonywong611
Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extract data from S3, transform it using Spark, and load the transformed data back to S3.
Create a data lake on AWS S3 to store dimensional tables after processing data using Spark on an AWS EMR cluster
airscholar
This project demonstrates the use of Amazon Elastic MapReduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.
This repository provides examples and code snippets for streaming data processing using Apache Spark on Databricks with AWS services. It demonstrates how to leverage the power of Spark streaming to process and analyze real-time data in a distributed and scalable manner, utilizing various AWS services for data ingestion and storage
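Streaming jobs like the Databricks examples above typically maintain counts over tumbling event-time windows. A plain-Python sketch of that bucketing (the 60-second window and the timestamps are invented for illustration; a real Spark Structured Streaming job would express this with `groupBy(window(...))`):

```python
from collections import defaultdict

WINDOW = 60  # tumbling window size in seconds (illustrative)
events = [{"ts": 5}, {"ts": 42}, {"ts": 61}, {"ts": 130}]

# Bucket each event into the window that contains its timestamp.
counts = defaultdict(int)
for e in events:
    counts[e["ts"] // WINDOW * WINDOW] += 1
print(dict(counts))  # {0: 2, 60: 1, 120: 1}
```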
christopherkindl
Data pipeline to process and analyse Twitter data in a distributed fashion using Apache Spark and Airflow in an AWS environment
Ren294
This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.
saurabh-smk
An automated system for Twitter data analysis using AWS services. In this architecture, a Python script pulls data continuously using the Twitter API; a Lambda function along with Kinesis Firehose then pre-processes the data and writes it to S3 in JSON format. An EMR activity (Spark job) is scheduled on an hourly basis to create reports using Spark SQL, which are then written to S3.
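The hourly report step reads the JSON lines Firehose landed in S3 and aggregates them. A toy stand-in for that Spark SQL aggregation, counting hashtags (the record layout with an "hashtags" field is assumed, not taken from the repository):

```python
import json
from collections import Counter

# Two JSON lines standing in for the objects Firehose wrote to S3.
lines = [
    '{"id": 1, "hashtags": ["aws", "spark"]}',
    '{"id": 2, "hashtags": ["aws"]}',
]

# Count hashtag occurrences across all tweets.
counts = Counter(tag for line in lines for tag in json.loads(line)["hashtags"])
print(counts.most_common())  # [('aws', 2), ('spark', 1)]
```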
trgtanhh04
A big data processing pipeline using PostgreSQL, Apache Kafka, Apache Spark, Apache Airflow, and AWS S3 distributed storage to build a phone recommendation system for users.
yashvisharma1204
A scalable fraud detection system leveraging Apache Spark on AWS EMR for large-scale financial transaction processing. This project enables distributed inference, efficient ML predictions, and AI-generated fraud analysis reports stored in AWS S3
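Distributed inference in a pipeline like this boils down to applying a scoring function to each transaction partition. A minimal sketch of such a per-record scorer (the features, weights, and 0.8 threshold are invented for illustration; the repository's actual model is not described here):

```python
# Toy fraud scorer standing in for the per-partition ML inference step.
def score(txn):
    """Assign risk to transactions that are large or far from home."""
    risk = 0.0
    if txn["amount"] > 1000:
        risk += 0.6  # unusually large amount
    if txn["country"] != txn["home_country"]:
        risk += 0.5  # cross-border transaction
    return risk

txns = [
    {"amount": 2500, "country": "BR", "home_country": "US"},
    {"amount": 40, "country": "US", "home_country": "US"},
]
flags = [score(t) >= 0.8 for t in txns]
print(flags)  # [True, False]
```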
devangkale10
Explore a variety of PySpark applications in this repository, from movie recommendations to social network analysis, all designed to demonstrate Spark’s capabilities on AWS Elastic MapReduce. This collection highlights efficient data processing, advanced analytics, and machine learning techniques suitable for handling large-scale data.
Naseer5196
Job posting: Data Engineer, immediate joiner (work from office), Vedhas Technology Solutions Pvt Ltd, Himayatnagar, Hyderabad; 2-4 yrs experience. Job description: experience in Big Data components such as Spark, Kafka, Scala/PySpark, SQL, DataFrames, Airflow, etc., implemented using Databricks preferred; Databricks integration with other cloud services such as Azure (Data Lake, Data Factory, Synapse, Azure DevOps, etc.) or AWS (S3, Glue, Athena, Redshift, Lambda, CloudWatch, etc.); reading, processing, and writing data in various file formats using Spark and Databricks; knowledge of Databricks job optimization processes and standards. Good to have: Databricks Delta Table and MLflow knowledge; AWS/Azure/Databricks certifications; strong data warehousing experience; good understanding of database schema, design, optimization, and scalability; ability to learn new technologies quickly; great communication skills and strong work ethic. Role: Data Engineer. Industry: IT Services & Consulting. Functional area: IT Software - Application Programming, Maintenance. Employment type: full time, permanent. Education: UG B.Tech/B.E. in any specialization. Key skills: Databricks, Data Lake, Kafka, Azure DevOps, SQL. Remuneration: no bar for the right candidate.
Work shift: day. Working days: 5 per week. Location: Vedhas Technology Solutions Pvt Ltd, 1st Floor, City Centre, Himayatnagar, Hyderabad 500029. Email: HR@TECHVEDHAS.COM. Contact HR: 040-23224181.
richjdowney
This project demonstrates skills in data engineering: specifically, it contains an efficient ETL process utilizing AWS EC2, EMR, and S3 with Python and Spark, orchestrating the data pipeline with Airflow
saurabhsoni5893
Udacity Data Engineering Nanodegree capstone project that covers almost all aspects of data engineering - data exploration, data cleaning, data modeling, ELT (Extract, Load & Transform), data processing on AWS Cloud using Apache Spark, and automating data pipelines using Apache Airflow.
Smart City End-to-End Realtime Data Pipeline: Simulates IoT data streaming from ingestion (Kafka, Zookeeper, Docker) through processing (Spark), storage (AWS Glue, Athena, Redshift), and visualization (PowerBI), using Python and AWS Cloud services.
mouadja02
End-to-end data engineering pipeline with real-time streaming, cloud processing, and analytics. Built with Apache Kafka, Spark, AWS Glue, and Snowflake using Apache Iceberg tables.
windi-wulandari
This project implements an end-to-end data pipeline designed to manage and analyze large-scale credit scoring data. Using AWS S3 as a scalable storage solution and Databricks for processing, the pipeline leverages the power of Apache Spark through PySpark and SQL Spark to handle data transformation and analysis efficiently.
Using Spark and data lakes to build an ETL pipeline for a data lake hosted on AWS S3. First we load JSON data from an S3 bucket and process it using a Spark cluster on AWS into analytical tables organized in a star schema; then we load the tables into a new S3 bucket.
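The star-schema step splits raw events into a fact table plus dimension tables. A plain-Python sketch of that split (the event layout with "ts", "user_id", "song_id", and "level" fields is hypothetical; the real pipeline would do this with Spark DataFrame selects and de-duplication):

```python
# Raw events standing in for the JSON loaded from S3.
events = [
    {"ts": 1, "user_id": "u1", "song_id": "s1", "level": "free"},
    {"ts": 2, "user_id": "u1", "song_id": "s2", "level": "free"},
]

# Fact table: one row per event, keyed to the dimensions.
fact_plays = [{"ts": e["ts"], "user_id": e["user_id"], "song_id": e["song_id"]}
              for e in events]

# Dimension table: one row per distinct user (last event wins).
dim_users = {e["user_id"]: {"user_id": e["user_id"], "level": e["level"]}
             for e in events}

print(len(fact_plays), list(dim_users))  # 2 ['u1']
```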
This project is an ETL pipeline processing structured financial data and unstructured social media data related to cryptocurrencies (a dataset with millions of records), preparing it for exploring the relationship between the price trend of cryptocurrency assets and the sentiment of their social media platforms. It uses Python, Spark, the Binance API, etc. to extract trade data from the cryptocurrency exchange platform, transform it to market data on AWS EMR, and store it in an AWS S3 bucket; Python, Spark, the Twitter API, etc. extract tweets from the Twitter platform, which are transformed and stored in an AWS S3 bucket as well. Data quality checks are performed on the tweets and market data before persisting them to AWS S3. Utilized: Python, PySpark, Spark, SQL, AWS, Amazon S3, AWS EMR, Binance API, Twitter API, data quality, structured data, unstructured data, data lake, ETL, big data, Hadoop.
ChahiriAbderrahmane
This project simulates a real-world enterprise data migration and modernization strategy. It extracts transactional data from a simulated "on-premise" environment (hosted on AWS EC2), performs heavy distributed processing using a Hadoop/Spark cluster, and ultimately serves the data via a cloud-native, serverless architecture to optimize costs.
PrathameshLakawade
Pipeline-Genie is an intelligent data pipeline that processes CSV datasets, identifies their schema, and leverages LLaMA 2.0 to extract business insights. Users can select relevant business needs, triggering automated ETL transformations using Apache Spark. The final transformed dataset is stored in AWS S3 and made available for download.
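The schema-identification step a pipeline like Pipeline-Genie performs can be sketched with the standard library: sample each column and guess the narrowest type that parses. The type names and inference rules below are illustrative, not the repository's actual logic:

```python
import csv
import io

def infer_type(values):
    """Guess a column type from sample values: int, then float, else string."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

# In-memory CSV standing in for an uploaded dataset.
sample = io.StringIO("name,age,score\nana,31,9.5\nbob,28,7.0\n")
rows = list(csv.DictReader(sample))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'name': 'string', 'age': 'int', 'score': 'float'}
```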
longNguyen010203
👷🌇 Set up and build a big data processing pipeline with Apache Spark, 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), Terraform to set up the infrastructure, and Airflow integration to automate workflows 🥊
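Airflow's role in pipelines like this is to run tasks in dependency order. A dependency-free sketch of that ordering via depth-first traversal (the task names are invented; a real Airflow DAG would declare the same edges with `>>` operators):

```python
# Task dependency graph standing in for an Airflow DAG (names illustrative).
deps = {
    "provision_emr": [],
    "submit_spark_job": ["provision_emr"],
    "load_redshift": ["submit_spark_job"],
}

order, done = [], set()

def visit(task):
    """Schedule a task only after all of its upstream tasks."""
    for upstream in deps[task]:
        if upstream not in done:
            visit(upstream)
    if task not in done:
        done.add(task)
        order.append(task)

for t in deps:
    visit(t)
print(order)  # ['provision_emr', 'submit_spark_job', 'load_redshift']
```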
jyotidabass
Large-Scale Data Processing Pipeline for Streaming Platform Analytics | Apache Spark, Kafka, Dask, AWS/GCP Integration
luke202001
A reference implementation to analyze events using various stream data collection & processing technologies (Kafka, Flink, Spark, MongoDB, RabbitMQ, Spring Boot, AWS, Azure, Docker, Kubernetes)
Smart City end-to-end realtime data streaming pipeline covering each phase from data ingestion through processing to storage. We'll utilize tools like IoT devices, Apache Zookeeper, Apache Kafka, Apache Spark, Docker, Python, AWS Cloud, AWS Glue, AWS Athena, AWS IAM, AWS Redshift, and finally PowerBI to visualize data on Redshift.