Found 187 repositories (showing 30)
josephmachado
Beginner data engineering project - batch edition
shafiab
My Insight Data Engineering Fellowship project. I implemented a big data processing pipeline based on the lambda architecture that aggregates Twitter and US stock market data for user sentiment analysis using open source tools: Apache Kafka for data ingestion, Apache Spark & Spark Streaming for batch & real-time processing, Apache Cassandra for storage, and Flask, Bootstrap and HighCharts for the frontend.
alero-awani
A batch data pipeline that retrieves data from a user purchase table and a movie review table and transforms it into a user behaviour metric table.
beratturan
This is a software engineering project for an e-commerce platform to build batch and real-time data pipelines together with REST APIs to create a real-time recommendation engine. The REST API will be the source of two recommendation lists on the main page: browsing history and best seller products.
oktavianidewi
Batch Final Project for Data Engineering Zoomcamp Course
Data Engineering & Analysis Project: San Francisco Eviction Data ETL Pipeline. An end-to-end batch data pipeline for performing ETL on San Francisco City Eviction Data from DataSF. The project aims to analyze eviction trends and patterns from historical to current data.
lisallreiber
Batch pipeline to process and visualise daily published data on bike thefts in Berlin. Data Engineering Zoomcamp 2023 project
BrightOsas
DataTalkClub Data engineering Bootcamp Project: Building an Efficient Batch Data Pipeline for Analytical Insights
End-to-end data engineering project that builds a data pipeline to batch process property data from the Arlington County public API
Data Engineering pet-project covering GCP, Docker, workflow orchestration with Mage, data transforming with dbt, batch processing via Spark and data streaming using Kafka
Data Engineering Project - Automated Batch Data Processing
irfan-fadhlurrahman
This repo contains my Data Engineering Zoomcamp final project. Its purpose is to apply an Extract, Load and Transform (ELT) approach on GCP to batch process the Capital Bike Share dataset from January 2021 to January 2023.
Abdokacimi
My Insight Data Engineering project. I implemented a parallel big data processing pipeline that streams real-time user data and loads it into a PostgreSQL database, using Apache Kafka for data ingestion, and aggregates the most popular hashtags worldwide over a specific time window, using Apache Spark & Spark Streaming for batch & real-time processing.
The purpose of the project was to determine whether a tweet (a microblog post limited to 280 characters) refers to an actual disaster or not. Exploratory data analysis covered missing and unique values, tweet distribution and cardinality with respect to keywords, meta-features, the target class distribution, and unigrams, bigrams and trigrams. Feature engineering covered tweet cleansing, checks of vocabulary and text coverage before and after cleaning, and detection and correction of mislabeled tweets. A random train-test split and cross-validation were done before model training. BERT was used because it is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context. BERT, which stands for Bidirectional Encoder Representations from Transformers, is based on Transformers, a deep learning model in which every output element is connected to every input element, and the weightings between them are dynamically calculated based upon their connection. The learning process was controlled by the parameters lr (the learning rate), epochs (a hyperparameter of gradient descent that controls the number of complete passes through the training dataset) and batch_size (a hyperparameter of gradient descent that controls the number of training samples to work through before the model’s internal parameters are updated). The BERT authors’ recommended values were used for these parameters. The data was randomly split 10 times, the model was run on each split, and the average accuracy and F1-score were calculated over the 10 runs. The average accuracy and F1-score were 86.59% and 0.83 respectively.
DhanushSriniv
Building a data warehouse project as a data engineer for a BI use case.
TioSptra
Final-Project batch generator: JC-Data Engineering Purwadhika
ayanhussain81
This repository contains the project assignment for the SMIT Cloud Data Engineering (Batch 2).
rubywu0604
This repository is for my personal project in the "AppWorks School Batch #21 Data-Engineering Class".
billiechristian
Hi! This is the repository for the FIREFLOW🔥 team's final project for IYKRA Data Fellowship Data Engineering Batch 8
gokhanalmis94
An adaptation of this training project: https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/
yelzha
Data Engineering pet-project covering GCP, Docker, workflow orchestration with Mage, data transforming with dbt, batch processing via Spark
MohammedHameds
This project demonstrates key concepts in modern data engineering: batch + streaming processing, data lake architecture, orchestration, data quality checks, and business analytics
NaveenNavzT
This project demonstrates an end-to-end data engineering pipeline that processes both real-time and batch ride data using Azure and Databricks.
Lohithgk
A collection of data engineering resources, projects, and best practices, covering the full pipeline from data ingestion and transformation to data warehousing, batch/stream processing, and visualization.
The final project in "SP701 SQL for Data Engineering" by Project SPARTA focuses on using SQL to clean and transform data, move it between database models (for example, turning operational data into reports), and manage large data batches.
december-man
One of the key projects that I've made during the Data Analytics Engineering lab @ EPAM Systems — the "Supremestores" data warehouse with batch incremental load and complete ETL pipeline.
ozaairrr
This repository showcases my expertise in Azure Data Engineering, featuring batch & real-time pipelines, data warehousing, and ETL automation. It includes project documentation, Python & SQL scripts, and Azure Data Factory (ADF) pipelines for scalable data solutions.
End-to-end Data Engineering project using Databricks, Azure Data Factory, Event Hubs, PySpark, and Spark Structured Streaming. Covers batch and real-time ingestion, transformations, and pipelines with Spark Declarative Pipelines.
ShreyaJaiswal1604
This project implements a batch data pipeline for NYC's Citibike data. It extracts raw data, stores it in Google Cloud Storage and BigQuery, transforms it using DBT, and visualizes insights with Google Looker Data Studio. The pipeline showcases the end-to-end data engineering process.
SaniyaSheikh-15
End-to-end machine learning system for real-time credit card fraud detection. Includes data preprocessing, feature engineering, model training, FastAPI inference service, batch scoring, Docker containerization, and full project documentation.