Found 363 repositories (showing 30)
hoangsonww
📈 A scalable, production-ready data pipeline for real-time streaming & batch processing, integrating Kafka, Spark, Airflow, AWS, Kubernetes, and MLflow. Supports end-to-end data ingestion, transformation, storage, monitoring, and AI/ML serving with CI/CD automation using Terraform & GitHub Actions.
LoveNui
Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extract data from S3, transform it using Spark, and load the transformed data back to S3.
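The transform step in S3 → EMR → S3 pipelines like this one usually amounts to dropping malformed rows and casting fields. A minimal plain-Python sketch of that step (the field names "order_id" and "amount" are illustrative, not taken from the repository; in the real job this logic would run inside a Spark DataFrame transformation):

```python
# Stand-in for the Spark transform stage of an S3 -> EMR -> S3 batch ETL job.
def transform(records):
    """Drop malformed rows and cast the amount field to float."""
    out = []
    for row in records:
        if not row.get("order_id"):
            continue  # skip rows missing the key
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # skip rows with a missing or non-numeric amount
        out.append({"order_id": row["order_id"], "amount": amount})
    return out

raw = [
    {"order_id": "A1", "amount": "19.99"},
    {"order_id": "", "amount": "5"},       # missing key -> dropped
    {"order_id": "A2", "amount": "oops"},  # bad amount -> dropped
]
print(transform(raw))  # [{'order_id': 'A1', 'amount': 19.99}]
```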
siddharth271101
The goal of this project is to analyse the impact of Covid-19 on the aviation industry through data engineering processes, using technologies such as Apache Airflow, Apache Spark, Tableau, and a couple of AWS services
aws-solutions-library-samples
This solution sets up an automated migration process for moving tables (Apache Iceberg and Hive tables) registered in AWS Glue Table Catalog and stored in Amazon S3 general-purpose buckets to Amazon S3 Table buckets using AWS Step Functions, and Amazon EMR with Apache Spark.
Joshua-omolewa
Built a data pipeline for a retail store using AWS services that collects data from its transactional (OLTP) database in Snowflake and transforms the raw data (ETL) using Apache Spark to meet business requirements; it also enables data analysts to create visualizations using Superset. Airflow is used to orchestrate the pipeline.
devindatt
Building an ETL process using Spark EMR in AWS
anthonywong611
Create a data pipeline on AWS to execute batch processing in a Spark cluster provisioned by Amazon EMR. ETL using managed Airflow: extract data from S3, transform it using Spark, and load the transformed data back to S3.
Create a data lake on AWS S3 to store dimensional tables after processing data using Spark on an AWS EMR cluster
airscholar
This project demonstrates the use of Amazon Elastic MapReduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.
This repository provides examples and code snippets for streaming data processing using Apache Spark on Databricks with AWS services. It demonstrates how to leverage the power of Spark streaming to process and analyze real-time data in a distributed and scalable manner, utilizing various AWS services for data ingestion and storage
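Streaming jobs like the Databricks examples above typically maintain counts over tumbling event-time windows. A plain-Python sketch of that bucketing (the 60-second window and the timestamps are invented for illustration; a real Spark Structured Streaming job would express this with `groupBy(window(...))`):

```python
from collections import defaultdict

WINDOW = 60  # tumbling window size in seconds (illustrative)
events = [{"ts": 5}, {"ts": 42}, {"ts": 61}, {"ts": 130}]

# Bucket each event into the window that contains its timestamp.
counts = defaultdict(int)
for e in events:
    counts[e["ts"] // WINDOW * WINDOW] += 1
print(dict(counts))  # {0: 2, 60: 1, 120: 1}
```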
christopherkindl
Data pipeline to process and analyse Twitter data in a distributed fashion using Apache Spark and Airflow in an AWS environment
Ren294
This project integrates real-time data processing and analytics using Apache NiFi, Kafka, Spark, Hive, and AWS services for comprehensive COVID-19 data insights.
saurabh-smk
An automated system for Twitter data analysis using AWS services. In this architecture, a Python script pulls data continuously using the Twitter API; a Lambda function along with Kinesis Firehose then pre-processes the data and writes it to S3 in JSON format. An EMR activity (Spark job) is scheduled on an hourly basis to create reports using Spark SQL, which are then written to S3.
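The hourly report step reads the JSON lines Firehose landed in S3 and aggregates them. A toy stand-in for that Spark SQL aggregation, counting hashtags (the record layout with an "hashtags" field is assumed, not taken from the repository):

```python
import json
from collections import Counter

# Two JSON lines standing in for the objects Firehose wrote to S3.
lines = [
    '{"id": 1, "hashtags": ["aws", "spark"]}',
    '{"id": 2, "hashtags": ["aws"]}',
]

# Count hashtag occurrences across all tweets.
counts = Counter(tag for line in lines for tag in json.loads(line)["hashtags"])
print(counts.most_common())  # [('aws', 2), ('spark', 1)]
```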
trgtanhh04
A big data processing pipeline using PostgreSQL, Apache Kafka, Apache Spark, Apache Airflow, and AWS S3 distributed storage to build a phone recommendation system for users.
yashvisharma1204
A scalable fraud detection system leveraging Apache Spark on AWS EMR for large-scale financial transaction processing. This project enables distributed inference, efficient ML predictions, and AI-generated fraud analysis reports stored in AWS S3
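Distributed inference in a pipeline like this boils down to applying a scoring function to each transaction partition. A minimal sketch of such a per-record scorer (the features, weights, and 0.8 threshold are invented for illustration; the repository's actual model is not described here):

```python
# Toy fraud scorer standing in for the per-partition ML inference step.
def score(txn):
    """Assign risk to transactions that are large or far from home."""
    risk = 0.0
    if txn["amount"] > 1000:
        risk += 0.6  # unusually large amount
    if txn["country"] != txn["home_country"]:
        risk += 0.5  # cross-border transaction
    return risk

txns = [
    {"amount": 2500, "country": "BR", "home_country": "US"},
    {"amount": 40, "country": "US", "home_country": "US"},
]
flags = [score(t) >= 0.8 for t in txns]
print(flags)  # [True, False]
```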
devangkale10
Explore a variety of PySpark applications in this repository, from movie recommendations to social network analysis, all designed to demonstrate Spark’s capabilities on AWS Elastic MapReduce. This collection highlights efficient data processing, advanced analytics, and machine learning techniques suitable for handling large-scale data.
Naseer5196
Job posting: Data Engineer, immediate joiner (work from office), Vedhas Technology Solutions Pvt Ltd, Himayatnagar, Hyderabad; 2-4 yrs experience. Job description: experience in Big Data components such as Spark, Kafka, Scala/PySpark, SQL, DataFrames, Airflow, etc., implemented using Databricks preferred; Databricks integration with other cloud services such as Azure (Data Lake, Data Factory, Synapse, Azure DevOps, etc.) or AWS (S3, Glue, Athena, Redshift, Lambda, CloudWatch, etc.); reading, processing, and writing data in various file formats using Spark and Databricks; knowledge of Databricks job optimization processes and standards. Good to have: Databricks Delta Table and MLflow knowledge; AWS/Azure/Databricks certifications; strong data warehousing experience; good understanding of database schema, design, optimization, and scalability; ability to learn new technologies quickly; great communication skills and strong work ethic. Role: Data Engineer. Industry: IT Services & Consulting. Functional area: IT Software - Application Programming, Maintenance. Employment type: full time, permanent. Education: UG B.Tech/B.E. in any specialization. Key skills: Databricks, Data Lake, Kafka, Azure DevOps, SQL. Remuneration: no bar for the right candidate.
Work shift: day. Working days: 5 per week. Location: Vedhas Technology Solutions Pvt Ltd, 1st Floor, City Centre, Himayatnagar, Hyderabad 500029. Email: HR@TECHVEDHAS.COM. Contact HR: 040-23224181.
richjdowney
This project demonstrates skills in data engineering: specifically, it contains an efficient ETL process utilizing AWS EC2, EMR, and S3 with Python and Spark, orchestrating the data pipeline with Airflow
saurabhsoni5893
Udacity Data Engineering Nanodegree capstone project that covers almost all aspects of data engineering - data exploration, data cleaning, data modeling, ELT (Extract, Load & Transform), data processing on AWS Cloud using Apache Spark, and automating data pipelines using Apache Airflow.
Smart City End-to-End Realtime Data Pipeline: Simulates IoT data streaming from ingestion (Kafka, Zookeeper, Docker) through processing (Spark), storage (AWS Glue, Athena, Redshift), and visualization (PowerBI), using Python and AWS Cloud services.
mouadja02
End-to-end data engineering pipeline with real-time streaming, cloud processing, and analytics. Built with Apache Kafka, Spark, AWS Glue, and Snowflake using Apache Iceberg tables.
windi-wulandari
This project implements an end-to-end data pipeline designed to manage and analyze large-scale credit scoring data. Using AWS S3 as a scalable storage solution and Databricks for processing, the pipeline leverages the power of Apache Spark through PySpark and SQL Spark to handle data transformation and analysis efficiently.
Using Spark and data lakes to build an ETL pipeline for a data lake hosted on AWS S3. First we load JSON data from an S3 bucket and process it using a Spark cluster on AWS into analytical tables organized in a star schema; then we load the tables into a new S3 bucket.
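The star-schema step splits raw events into a fact table plus dimension tables. A plain-Python sketch of that split (the event layout with "ts", "user_id", "song_id", and "level" fields is hypothetical; the real pipeline would do this with Spark DataFrame selects and de-duplication):

```python
# Raw events standing in for the JSON loaded from S3.
events = [
    {"ts": 1, "user_id": "u1", "song_id": "s1", "level": "free"},
    {"ts": 2, "user_id": "u1", "song_id": "s2", "level": "free"},
]

# Fact table: one row per event, keyed to the dimensions.
fact_plays = [{"ts": e["ts"], "user_id": e["user_id"], "song_id": e["song_id"]}
              for e in events]

# Dimension table: one row per distinct user (last event wins).
dim_users = {e["user_id"]: {"user_id": e["user_id"], "level": e["level"]}
             for e in events}

print(len(fact_plays), list(dim_users))  # 2 ['u1']
```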
This project is an ETL pipeline processing structured financial data and unstructured social media data related to cryptocurrencies (a dataset with millions of records), preparing it for exploring the relationship between the price trend of cryptocurrency assets and the sentiment of their social media platforms. It uses Python, Spark, the Binance API, etc. to extract trade data from the cryptocurrency exchange platform, transform it to market data on AWS EMR, and store it in an AWS S3 bucket; Python, Spark, the Twitter API, etc. extract tweets from the Twitter platform, which are transformed and stored in an AWS S3 bucket as well. Data quality checks are performed on the tweets and market data before persisting them to AWS S3. Utilized: Python, PySpark, Spark, SQL, AWS, Amazon S3, AWS EMR, Binance API, Twitter API, data quality, structured data, unstructured data, data lake, ETL, big data, Hadoop.
ChahiriAbderrahmane
This project simulates a real-world enterprise data migration and modernization strategy. It extracts transactional data from a simulated "on-premise" environment (hosted on AWS EC2), performs heavy distributed processing using a Hadoop/Spark cluster, and ultimately serves the data via a cloud-native, serverless architecture to optimize costs.
PrathameshLakawade
Pipeline-Genie is an intelligent data pipeline that processes CSV datasets, identifies their schema, and leverages LLaMA 2.0 to extract business insights. Users can select relevant business needs, triggering automated ETL transformations using Apache Spark. The final transformed dataset is stored in AWS S3 and made available for download.
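The schema-identification step a pipeline like Pipeline-Genie performs can be sketched with the standard library: sample each column and guess the narrowest type that parses. The type names and inference rules below are illustrative, not the repository's actual logic:

```python
import csv
import io

def infer_type(values):
    """Guess a column type from sample values: int, then float, else string."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

# In-memory CSV standing in for an uploaded dataset.
sample = io.StringIO("name,age,score\nana,31,9.5\nbob,28,7.0\n")
rows = list(csv.DictReader(sample))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'name': 'string', 'age': 'int', 'score': 'float'}
```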
longNguyen010203
👷🌇 Set up and build a big data processing pipeline with Apache Spark, 📦 AWS services (S3, EMR, EC2, IAM, VPC, Redshift), Terraform to set up the infrastructure, and Airflow integration to automate workflows 🥊
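Airflow's role in pipelines like this is to run tasks in dependency order. A dependency-free sketch of that ordering via depth-first traversal (the task names are invented; a real Airflow DAG would declare the same edges with `>>` operators):

```python
# Task dependency graph standing in for an Airflow DAG (names illustrative).
deps = {
    "provision_emr": [],
    "submit_spark_job": ["provision_emr"],
    "load_redshift": ["submit_spark_job"],
}

order, done = [], set()

def visit(task):
    """Schedule a task only after all of its upstream tasks."""
    for upstream in deps[task]:
        if upstream not in done:
            visit(upstream)
    if task not in done:
        done.add(task)
        order.append(task)

for t in deps:
    visit(t)
print(order)  # ['provision_emr', 'submit_spark_job', 'load_redshift']
```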
jyotidabass
Large-Scale Data Processing Pipeline for Streaming Platform Analytics | Apache Spark, Kafka, Dask, AWS/GCP Integration
luke202001
A reference implementation to analyze events using various stream data collection & processing technologies (Kafka, Flink, Spark, MongoDB, RabbitMQ, Spring Boot, AWS, Azure, Docker, Kubernetes)
Smart City end-to-end realtime data streaming pipeline covering each phase from data ingestion through processing to storage. We'll utilize tools like IoT devices, Apache Zookeeper, Apache Kafka, Apache Spark, Docker, Python, AWS Cloud, AWS Glue, AWS Athena, AWS IAM, AWS Redshift, and finally PowerBI to visualize data on Redshift.