Search Results

Found 187 repositories (showing 30)

beginner_de_project

josephmachado

💛73

Beginner data engineering project - batch edition

576

199

MIT

HTML

Updated 13 hours ago

airflow database docker +7

HashtagCashtag

My Insight Data Engineering Fellowship project. I implemented a big data processing pipeline based on the lambda architecture that aggregates Twitter and US stock market data for user sentiment analysis using open source tools: Apache Kafka for data ingestion, Apache Spark & Spark Streaming for batch and real-time processing, Apache Cassandra for storage, and Flask, Bootstrap and HighCharts for the frontend.

507

131

Scala

Updated 4 days ago

batch-data-engineering-project

alero-awani

❤️30

A batch data pipeline that retrieves data from a user purchase table and a movie review table and transforms it into a user behaviour metric table.

HCL

Updated 5 months ago

airflow aws-s3 data-engineering-pipeline +5

Real-Time-E-commerce-Recommendation-App

beratturan

❤️35

This is a software engineering project for an e-commerce platform to build batch and real-time data pipelines together with REST APIs to create a real-time recommendation engine. The REST API will be the source of two recommendation lists on the main page: browsing history and best seller products.

Python

Updated 2 months ago

github-data-pipeline

oktavianidewi

❤️35

Batch Final Project for Data Engineering Zoomcamp Course

Jupyter Notebook

Updated 1 year ago

data dezoomcamp engineering

DataEngineering_SFO_Eviction_Data_ETL_Pipeline

sanyassyed

❤️35

Data Engineering & Analysis Project: San Francisco Eviction Data ETL Pipeline. An end-to-end batch data pipeline for performing ETL on San Francisco City Eviction Data from DataSF. The project aims to analyze eviction trends and patterns from historical to current data.

Python

Updated 5 months ago

bigquery data-analysis data-engineering +6

biketheft_berlin

lisallreiber

❤️35

Batch pipeline to process and visualise daily published data on bike thefts in Berlin. Data Engineering Zoomcamp 2023 project.

Jupyter Notebook

Updated 11 months ago

bigquery data-engineering dbt +2

NYC-Taxi-Trip-Data-Pipeline

BrightOsas

❤️30

DataTalksClub Data Engineering Bootcamp Project: Building an Efficient Batch Data Pipeline for Analytical Insights

Python

Updated 11 months ago

airflow aws boto3 +6

Data-Engineering-Arlington-Property

bin43256

❤️35

End-to-end data engineering project that builds a data pipeline to batch process property data from the Arlington County public API.

Python

Updated 4 months ago

data-engineering-zoomcamp

yelzha

🧡55

Data Engineering pet-project covering GCP, Docker, workflow orchestration with Mage, data transforming with dbt, batch processing via Spark and data streaming using Kafka

Jupyter Notebook

Updated 3 weeks ago

dbt docker gcp +5

Data-Engineering-Project---Automatic-Batch-Data-Processing

95xin

❤️30

Data Engineering Project - Automated Batch Data Processing

Python

Updated 11 months ago

airflow bigquery data-engineering +6

elt-pipeline-on-gcp

irfan-fadhlurrahman

❤️45

This repo contains a Data Engineering Zoomcamp final project. Its purpose is to apply an Extract, Load and Transform (ELT) approach on GCP to batch process the Capital Bike Share dataset from January 2021 to January 2023.

MIT

Python

Updated 1 month ago

data-engineering dbt docker +4

Twitter_Streaming_Kafka_Spark

Abdokacimi

❤️35

My Insight Data Engineering project. I implemented parallel big data processing pipelines that stream real-time user data and load it into a PostgreSQL database, using Apache Kafka for data ingestion, and that aggregate the most popular hashtags worldwide over a specific time window, using Apache Spark & Spark Streaming for batch and real-time processing.

Roff

Updated 2 years ago

Predicting-whether-a-given-tweet-is-about-a-real-disaster-or-not

Tanima704

❤️35

The purpose of the project was to determine whether tweets (microblogs with a limit of 280 characters) referred to an actual disaster or not. As part of exploratory data analysis, the author identified missing values and unique values, examined tweet distribution and cardinality with respect to keywords, handled meta-features, showed the target class distribution, and extracted unigrams, bigrams and trigrams. Feature engineering covered tweet cleansing, a check of vocabulary and text coverage before and after cleaning, and detection and correction of mislabeled tweets. A random train-test split and cross-validation were done before model training. BERT was used because it is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context. BERT, which stands for Bidirectional Encoder Representations from Transformers, is based on Transformers, a deep learning model in which every output element is connected to every input element, and the weightings between them are dynamically calculated based upon their connection. Parameters such as lr (learning rate), epochs (a hyperparameter of gradient descent that controls the number of complete passes through the training dataset) and batch_size (a hyperparameter of gradient descent that controls the number of training samples to work through before the model’s internal parameters are updated) were used to control the learning process. The BERT authors’ recommended values were used for these parameters. The data was randomly split 10 times and the model was run on each split; the average accuracy and F1 score were then calculated over the 10 runs. The average accuracy and F1 score were 86.59% and 0.83 respectively.

Jupyter Notebook

Updated 2 years ago

end_to_end_batch_sql_data_engineering_project

DhanushSriniv

🧡50

Building a data warehouse project as a data engineer for a BI use case.

MIT

TSQL

Updated 1 month ago

healthcare-DataEngineer

TioSptra

❤️35

Final-Project batch generator: JC-Data Engineering Purwadhika

Python

Updated 5 months ago

Sparkify-data-warehousing

ayanhussain81

❤️35

This repository contains the project assignment for the SMIT Cloud Data Engineering (Batch 2).

Updated 5 months ago

GoodJob

rubywu0604

❤️35

This repository is for a personal project of "AppWorks School Batch #21 Data-Engineering Class".

Python

Updated 1 year ago

data-fellowship-fireflow

ShreyaJaiswal1604

❤️40

This project implements a batch data pipeline for NYC's Citibike data. It extracts raw data, stores it in Google Cloud Storage and BigQuery, transforms it using DBT, and visualizes insights with Google Looker Data Studio. The pipeline showcases the end-to-end data engineering process.

MIT

HTML

Updated 1 year ago

ML-Credit-Card-Fraud-Detection-System

SaniyaSheikh-15

❤️35

End-to-end machine learning system for real-time credit card fraud detection. Includes data preprocessing, feature engineering, model training, FastAPI inference service, batch scoring, Docker containerization, and full project documentation.

Python

Updated 4 months ago

GitHub Explorer

Search Results

beginner_de_project

HashtagCashtag

batch-data-engineering-project

Real-Time-E-commerce-Recommendation-App

github-data-pipeline

DataEngineering_SFO_Eviction_Data_ETL_Pipeline

biketheft_berlin

NYC-Taxi-Trip-Data-Pipeline

Data-Engineering-Arlington-Property

data-engineering-zoomcamp

Data-Engineering-Project---Automatic-Batch-Data-Processing

elt-pipeline-on-gcp

Twitter_Streaming_Kafka_Spark

Predicting-whether-a-given-tweet-is-about-a-real-disaster-or-not

end_to_end_batch_sql_data_engineering_project

healthcare-DataEngineer

Sparkify-data-warehousing

GoodJob

data-fellowship-fireflow

beginner_de_project

tengrinews-open-project

E-Commerce-Data-Pipeline

Uber-Data-Engineering-Pipeline

Data_Engineering

2023.09_Feature_Engineering_PostgreSQL_Property_Sales

dwh-project

data-engineer-projects

Uber---EventHubs-and-SDP-Spark-Declarative-Pipelines-

NYC-Citibike-Data-Pipeline

ML-Credit-Card-Fraud-Detection-System
