Found 54 repositories (showing 30)
jadianes
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
LucaCanali
Includes notes on using Apache Spark, with drill down on Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark. Also tools for stress testing, measuring CPUs' performance, and I/O latency heat maps. Jupyter notebooks examples for using various DB systems.
swan-cern
An extension for Jupyter Lab & Jupyter Notebook to monitor Apache Spark (pyspark) from notebooks
marcgarnica13
Understanding gender differences in professional European football through Machine Learning interpretability and match actions data. This repository contains the full data pipeline implemented for the study *Understanding gender differences in professional European football through Machine Learning interpretability and match actions data*.

We evaluated the main differential features in the in-match actions data of European male and female football players, under the assumption of finding significant differences and established patterns between genders. A methodology for unbiased feature extraction and objective analysis is presented, based on data integration and machine learning explainability algorithms. Female (1,511) and male (2,700) data points were collected from event data, categorized by game period and player position. Each data point included the main tactical variables supported by research and industry to evaluate and classify football styles and performance.

We set up a supervised classification pipeline to predict the gender of each player from their in-game actions. The comparison methodology did not include any qualitative enrichment or subjective analysis, to prevent biased data enhancement or gender-related processing. The pipeline comprised three representative binary classification models: a logic-based Decision Tree, a probabilistic Logistic Regression, and a Multilayer Perceptron Neural Network. Each model tried to draw out the differences between male and female data points, and we extracted the results using machine learning explainability methods to understand the underlying mechanics of the implemented models. Good predictive accuracy was consistent across the different models deployed.

## Installation

Install the required Python packages:

```
pip install -r requirements.txt
```

To handle heterogeneity and performance efficiently, we use PySpark from [Apache Spark](https://spark.apache.org/). PySpark provides an end-user Python API for Spark jobs.
You might want to check how to set up a local or remote Spark cluster in [their documentation](https://spark.apache.org/docs/latest/api/python/index.html).

## Repository structure

This repository is organized as follows:

- Preprocessed data from the two data streams is collected in [the data folder](data/). For the Opta files, it contains the event-based metrics computed from each match of the 2017 Women's Championship, plus a single file computing the event-based metrics from the 2016 Men's Championship published [here](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5). Although we cannot publish the original data source, the two Python scripts implemented to homogenize and integrate both data streams into event-based metrics are included in [the data gathering folder](data_gathering/).
- A separate folder contains the graphical images and media used for the report.
- The [data cleaning folder](data_cleaning/) contains descriptor scripts for both data streams and [the final integration](data_cleaning/merger.py).
- [Classification](classification/) contains all the Jupyter notebooks for each model in the experiment, as well as some persisted models for testing.
astrolabsoftware
PySpark notebooks to learn Apache Spark (WIP)
Big Data streaming project with Apache Spark in PySpark; please see the Python file and the notebook.
sumantrapatnaik
This repository contains code for performing data preprocessing, EDA, visualization and feature engineering using Apache Spark Dataframes and Pandas Dataframes in PySpark and Jupyter Notebook
This is the final project I had to do to finish my Big Data Expert Program in U-TAD in September 2017. It uses the following technologies: Apache Spark v2.2.0, Python v2.7.3, Jupyter Notebook (PySpark), HDFS, Hive, Cloudera Impala, Cloudera HUE and Tableau.
nghoanglongde
The implementation of Apache Spark (combine with PySpark, Jupyter Notebook) on top of Hadoop cluster using Docker
iftekharalamfahim
A large-scale data analysis project built on Apache Hadoop and Apache Spark, analyzing 7M+ Yelp reviews, 150K businesses, and 2M users. Covers business intelligence, user behavior, rating patterns, and review trends using PySpark and Hive on a multi-node cluster. Visualized through Apache Zeppelin notebooks.
jvujjini
This repo contains PySpark notebooks for the Spark programming guides available at http://spark.apache.org/docs/latest/
szhang118
Notebooks from edX course on Apache Spark. Data cleaning and construction of machine learning pipelines with Pyspark on Databricks platform.
SatyamSingh1299
Predictive modeling project using PySpark, Apache Spark on an Azure VM, Jupyter Notebooks, Parquet storage, and MongoDB to build and evaluate a Decision Tree classifier that forecasts Portuguese bank term deposit subscriptions from marketing campaign data.
JoseJuan98
The software technology discussed in this report is Apache Spark, a unified tool for analysing and processing data. In addition, PySpark is used to write the code in an IPython notebook.
kevkillion
Apache Spark capabilities are showcased in this project, which runs batch and real-time ETL processes for invoice system data. Pysql and PySpark are used within a Jupyter notebook to build and test the invoice ETL process.
Educational notebook demonstrating a Big Data and Machine Learning workflow with Apache Spark, PySpark, and MLflow on Databricks Free Edition. Includes data preparation, feature engineering, logistic regression, and ML experiment tracking for a tutorial project.
abhiram540
The Colab notebook in this repository does a great job explaining why you would use Spark and PySpark, walking you through the technical setup, illustrating how to use the DataFrame API, and distinguishing these DataFrames from RDDs (Resilient Distributed Datasets, essentially big lists stored across different locations), a cornerstone of the PySpark toolkit. (Credits: LinkedIn Learning - Apache PySpark by Example)
hamedrazavi
(Ongoing notebook) This is a Kaggle competition. With log data from more than 37 million users, we want to predict the hotel cluster given features such as hotel country, user country, and check-in/check-out dates. The data is big enough that classic Pandas and scikit-learn will not work for analysis and building ML models; instead, I will be using Apache Spark (PySpark). For running the notebook I will be using the Amazon Elastic MapReduce (EMR) service.
No description available
EdViV
Jupyter notebooks from Apache Spark & Pyspark course
xuandt1289
Apache Spark & Python (PySpark) tutorials and Machine Learning applications as Jupyter notebooks
sgboateng
Apache Spark setup using Docker; contains PySpark code and notebooks for various ETL operations
sidharth15
Experiments with Apache Spark using PySpark. The repo contains everything from setting up PySpark in AWS EC2 to Jupyter notebooks with Spark code.
javierquesadapajares
Lab notebooks using Apache Spark (PySpark) for cross-validation experiments, model evaluation, and training parallelization
eric-erki
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
zgzwelldone
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
farfaness
Data parallelism using the open-source cluster-computing framework Apache Spark. Notebooks in PySpark and Scala.
moizsiddiqui7893
A collection of my notebooks for learning Python, OpenCV, Pandas, NumPy, Apache Spark (PySpark), Hadoop, TensorFlow, etc.
Bhanusri-kanumilli
A collection of PySpark projects and notebooks for distributed data processing, ETL, and big data analytics using Apache Spark.
neeleshgangrade
Worked on Spark sessions, transformations, actions, and optimization techniques using PySpark. Tools used: Apache Spark, shell scripting, and Jupyter Notebook