Found 696 repositories (showing 30)
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.
shafiab
My Insight Data Engineering Fellowship project. I implemented a big data processing pipeline based on the lambda architecture that aggregates Twitter and US stock market data for user sentiment analysis, using open-source tools: Apache Kafka for data ingestion, Apache Spark and Spark Streaming for batch and real-time processing, Apache Cassandra for storage, and Flask, Bootstrap and Highcharts for the frontend.
alibaba
SparkCube is an open-source project for extremely fast OLAP data analysis. SparkCube is an extension of Apache Spark.
weltond
:rocket:Some projects on Big Data Analysis like Spark, Hive, Presto and Data Visualization like Superset
airscholar
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP sockets, Apache Spark, an OpenAI LLM, Kafka and Elasticsearch. It covers each stage from data acquisition and processing, through sentiment analysis with ChatGPT, to producing to a Kafka topic and connecting to Elasticsearch.
drisskhattabi6
This repo contains a Big Data project: real-time Twitter sentiment analysis via Kafka, Spark Streaming, MongoDB and a Django dashboard.
Personal project where I perform analytics (including sentiment analysis) over a Twitter stream using Big Data technologies from the Hadoop ecosystem such as Flume, Kafka, and Spark Streaming.
Performed aspect-based sentiment analysis using topic modeling (LDA), sentiment analysis, and regression analysis with Python and Spark on Yelp restaurant reviews. The objective was to extract quantifiable information from reviews in order to understand which aspects matter for different cuisines and how they impact overall ratings. This project was done in collaboration with my peers at the Carlson School of Management as part of a Big Data Analytics project.
This project uses the Hadoop framework to analyze unstructured data obtained from Twitter, performing sentiment and trend analysis on the keyword "COVID19" with both Hive on MapReduce and Spark. We then compare the Hive and Spark approaches to determine which performs better.
Monish-Nallagondalla
This project uses PySpark and Python to analyze a Google Play Store dataset. It covers data cleaning, duplicate removal, and visual analysis, performed in Jupyter Notebook with Spark's distributed computing.
GroupAYECS765P
BDP 05: CLUSTERING OF LARGE UNLABELED DATASETS

OVERVIEW
Real-world data is frequently unlabeled and can seem completely random. In these situations, unsupervised learning techniques are a great way to find underlying patterns. This project looks at one such algorithm, k-means clustering, which searches for boundaries separating groups of points based on differences in their features. The goal of the project is to implement an unsupervised clustering algorithm on a distributed computing platform. You will apply this algorithm to the Stack Overflow user base to find different ways the community can be divided, and investigate what causes these groupings. The clustering algorithm must be designed in a way that is appropriate for data-intensive parallel computing frameworks. Spark is the primary choice for this project, but it could also be implemented in Hadoop MapReduce. Algorithm implementations from external libraries such as Spark MLlib may not be used; the code must be the students' own. However, once the algorithm is complete, a comparison between your own results and those generated by MLlib could be interesting and aid your investigation. Stack Overflow is the main dataset for this project, but alternative datasets can be adopted after consultation with the module organiser. Different clustering algorithms may also be used, but this must be discussed and approved by the module organiser.

DATASET
The project will use the Stack Overflow dataset, located in HDFS at /data/stackoverflow. The dataset is a set of files containing Posts, Users, Votes, Comments, PostHistory and PostLinks; each file contains one XML record per line. Full schema information is available with the dataset. To define the clustering use case, you must decide which features of each post will be used to cluster the data; look at the different fields to define your use case.

ALGORITHM
The project will implement the k-means algorithm for clustering. This algorithm iteratively recomputes the locations of k centroids (k is the number of clusters, defined beforehand) that aim to classify the data. Points are labelled with the closest centroid, and each iteration updates each centroid's location based on all the points labelled with that value. Spark and MapReduce can both be used to implement this; Spark is recommended due to its performance benefits. Note, however, that Spark's MLlib library may not be used as the primary implementation: the group must code its own original implementation of the algorithm. It is, however, possible to also use the MLlib implementation in order to evaluate the results of each clustering implementation.

Report Contents
- Brief literature survey on clustering algorithms, including the challenges of implementing them at scale on parallel frameworks. The report should then justify the chosen algorithm (if changed) and the implementation.
- Definition of the project use case, of which the implemented project will form part of the solution.
- Implementation in MapReduce or Spark of a clustering algorithm (k-means). It must take into account the potentially enormous size of the dataset: write sensible code that will scale and efficiently use additional computing nodes. The code will also need to convert the dataset from its storage format to an in-memory representation. Source code should not be included in the report, but the algorithms should be explained there.
- Results section. Adequate figures and tables should be used to present the results. The effectiveness of the algorithm should also be shown, including performance indications (it is not certain this can be fully done for clustering). A critical evaluation of the results should be provided, with experiments demonstrating that the technique can successfully group users in the dataset, and a discussion of the findings in a critical manner.

ASSESSMENT
The project according to the specification has a base difficulty of 85/100: a perfect implementation and report would earn an 85. Additional technical features and experimentation raise the difficulty in order to opt for a full 100/100 mark.
Report presentation: 20% — appropriate motivation for the work; lack of typos/grammar errors; adequate format; clear flow and style; related-work section with adequate referencing.
Technical merit: 50% — completeness of the implementation [25%]; provided, documented source code [10%]; design rationale of the code [10%]; efficient, appropriate implementation for the chosen platform [5%].
Results/Analysis: 30% — experiments carried out on the full dataset [10%]; adequate plots/tables with captions [10%]; results not only presented but discussed appropriately [10%].
Additional project goals: implementing functions beyond the base specification can raise the base mark up to 100. A non-exhaustive list of expansion ideas:
- Exploration and discussion of hyperparameter tuning (e.g. the number k of groups to cluster the data into) [up to 10 marks]
- Comparative evaluation of the clustering technique against existing implementations (e.g. MLlib) [up to 10 marks]
- Bringing in additional Stack Overflow data, such as user badges, to aid clustering [up to 5 marks]
- Clustering additional datasets (such as posts) [up to 10 marks]

LEAD DEMONSTRATOR
For specific queries related to this coursework topic, please liaise with Mr/Ms TBD, the lead demonstrator for this project, as well as with the module organiser.

SUBMISSION GUIDELINES
The report has a maximum length of 8 pages, not counting the cover page and table of contents. It must include motivation of the problem, a brief literature survey, an explanation of the selected technique, implementation details, a discussion of the obtained results, and the references used in the work. Additionally, the source code must be included as a separate compressed file in the submission.
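The ALGORITHM section above describes one k-means iteration: label each point with its closest centroid, then recompute each centroid as the mean of its assigned points. A minimal sketch of that iteration in plain Python, assuming Euclidean distance and 2-D points; the per-cluster (sum, count) accumulator mirrors what a Spark map/reduceByKey implementation would compute, and all names and data here are illustrative, not from the coursework:

```python
import math

def closest_centroid(point, centroids):
    """Index of the nearest centroid by Euclidean distance (the 'map' step)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_step(points, centroids):
    """One k-means iteration as a map/reduce pass.

    On Spark this would be: map each point to (cluster, (point, 1)),
    reduceByKey to sum vectors and counts, then divide to get new means.
    """
    sums = {}  # cluster index -> (component-wise sum, count)
    for p in points:  # the 'map' + 'reduceByKey' phases, done locally here
        c = closest_centroid(p, centroids)
        acc, n = sums.get(c, ([0.0] * len(p), 0))
        sums[c] = ([a + x for a, x in zip(acc, p)], n + 1)
    new_centroids = []
    for i, old in enumerate(centroids):  # recompute each centroid as a mean
        if i in sums:
            acc, n = sums[i]
            new_centroids.append(tuple(a / n for a in acc))
        else:
            new_centroids.append(tuple(old))  # empty cluster keeps its centroid
    return new_centroids

# Tiny demonstration: two obvious clusters around (0,0) and (10,10).
pts = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 10), (10, 11)]
cents = kmeans_step(pts, [(0.0, 0.0), (10.0, 10.0)])
```

In a real Spark implementation the points would live in an RDD and the centroids would be broadcast to the workers each iteration, which is what makes the approach scale to the full dataset.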
DIYBigData
A collection of data analysis projects done using PySpark via Jupyter notebooks.
venkateshkannayiram
The project develops a real-time analysis system using Apache Kafka and Apache Spark. The system collects real-time data and streams it into Kafka; Apache Spark then processes and analyzes the data in real time. The processed data is visualized using appropriate charts and graphs.
surajsrivathsa
A project involving analysis of authorship graph data from the Microsoft Academic Graph. We calculated different graph features using temporal parameters of the authors and tried different classifiers. The final aim is to predict the likelihood of a link (co-authorship) between two authors based on topological graph features, and to assess the feasibility of performing this task on Neo4j and Spark.
naikshrihari
The project analyzes energy consumption patterns and their relationship with socio-demographic and weather indicators for European cities such as London, following the installation of smart energy meters over a period of 30 months. Exploratory data analysis was performed in Tableau and Python, data preprocessing was done with Spark, and a variety of machine learning algorithms were used for time series forecasting.
MHassaanButt
In this project I stream data and perform crime classification using Spark. The dataset contains incidents from the SFPD Crime Incident Reporting system, ranging from 1/1/2003 to 5/13/2015. I also analyze crime scenes across different areas and with respect to other parameters.
j-adamczyk
Projects made for Big Data Analysis (Apache Spark) course at AGH
MatteoBiviano
University project for the Distributed Data Analysis and Mining exam, made using Spark
phuonganh-38
A comprehensive data processing and analysis project on NYC taxi trips using Databricks and Apache Spark
yassinessadi
EC-JANE Entertainment launches an innovative film recommendation project utilizing MovieLens data. This project aims to incorporate Big Data analysis and machine learning to enhance movie suggestions, leveraging Apache Spark, Elasticsearch, and a Flask API to provide a personalized and dynamic user experience.
jgrove90
🛸 This project showcases an Extract, Load, Transform (ELT) pipeline built with Python, Apache Spark, Delta Lake, and Docker. The objective of the project is to scrape UFO sighting data from NUFORC and process it through the Medallion architecture to create a star schema in the Gold layer that is ready for analysis.
aymanboufarhi
Big Data Project : "Real-Time Twitter Sentiment Analysis using Kafka, Spark, MongoDB, Django and Docker"
leorickli
Data engineering and analysis project using Apache Cassandra as the database and Apache Spark for processing, with infrastructure deployed on Google Cloud Platform (GCP).
Discover ML projects with Scala & Python. Explore data analysis, MLflow integration, regression, decision tree classification, Spark DataFrame manipulation, flight & retail sales analysis, and statistical utilities. Includes datasets like forestfires and online shoppers intention for practical learning.
BharatPanera
A project focused on real-time sales data analysis for RetailCorp Inc., utilizing Apache Spark Streaming and Kafka. This project computes key performance indicators (KPIs) from real-time sales data, facilitating business insights through efficient data processing and storage.
windi-wulandari
This project implements an end-to-end data pipeline designed to manage and analyze large-scale credit scoring data. Using AWS S3 as a scalable storage solution and Databricks for processing, the pipeline leverages the power of Apache Spark through PySpark and SQL Spark to handle data transformation and analysis efficiently.
SaiprakashShetty
Predicting US airline delays using Spark (PySpark) and Apache Arrow. The objective of this project is to analyze historical flight data to gain valuable insights and build a predictive model that predicts whether a flight will be delayed for a given set of flight characteristics.
Vismaya-Murali
This project tackles real-time Reddit post analysis by streaming data with Apache Kafka, processing it with Apache Spark, and storing results in PostgreSQL. It compares real-time and batch processing, focusing on sentiment analysis. Insights are visualized through Power BI.
longNguyen010203
🌈📊📈 The Zillow Home Value Prediction project employs linear regression models on Kaggle datasets to forecast house prices. 📉💰Using Apache Spark (PySpark) within a Docker setup enables efficient data preprocessing, exploration, analysis, visualization, and model building with distributed computing for parallel computation.
iftekharalamfahim
A large-scale data analysis project built on Apache Hadoop and Apache Spark, analyzing 7M+ Yelp reviews, 150K businesses, and 2M users. Covers business intelligence, user behavior, rating patterns, and review trends using PySpark and Hive on a multi-node cluster. Visualized through Apache Zeppelin notebooks.