Found 403 repositories (showing 30)
jadianes
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
LucaCanali
Includes notes on using Apache Spark, with drill down on Spark for Physics, how to run TPCDS on PySpark, how to create histograms with Spark. Also tools for stress testing, measuring CPUs' performance, and I/O latency heat maps. Jupyter notebooks examples for using various DB systems.
andfanilo
Jupyter notebooks for pyspark tutorials given at University
swan-cern
An extension for Jupyter Lab & Jupyter Notebook to monitor Apache Spark (pyspark) from notebooks
dimajix
Docker image for Jupyter notebooks with PySpark
ayushsubedi
Notebooks/materials on Big Data with PySpark skill track from datacamp (primarily). Also, contains books/cheat-sheets.
telia-oss
Birgitta is a Python ETL test and schema framework, providing automated tests for pyspark notebooks/recipes.
Azure-Samples
Instructions and examples for installing CNTK on an HDInsight cluster and running CNTK-Pyspark applications from Jupyter notebooks.
RealKinetic
An example CI/CD pipeline using GitHub Actions for doing continuous deployment of AWS Glue jobs built on PySpark and Jupyter Notebooks.
bennyaustin
Reusable Python classes that extend open-source PySpark capabilities. Examples of implementation are available under the notebooks of the repo https://github.com/bennyaustin/synapse-dataplatform
marcgarnica13
Understanding gender differences in professional European football through Machine Learning interpretability and match actions data.

This repository contains the full data pipeline implemented for the study *Understanding gender differences in professional European football through Machine Learning interpretability and match actions data*. We evaluated the main differential features of European male and female football players in match-actions data, under the assumption of finding significant differences and established patterns between genders. A methodology for unbiased feature extraction and objective analysis is presented, based on data integration and machine learning explainability algorithms. Female (1511) and male (2700) data points were collected from event data categorized by game period and player position. Each data point included the main tactical variables supported by research and industry to evaluate and classify football styles and performance.

We set up a supervised classification pipeline to predict the gender of each player from their in-game actions. The comparison methodology did not include any qualitative enrichment or subjective analysis, to prevent biased data enhancement or gender-related processing. The pipeline comprised three representative binary classification models: a logic-based Decision Tree, a probabilistic Logistic Regression, and a multilayer perceptron Neural Network. Each model tried to draw out the differences between male and female data points, and we extracted the results using machine learning explainability methods to understand the underlying mechanics of the models implemented. Good predictive accuracy was consistent across the different models deployed.

## Installation

Install the required Python packages:

```
pip install -r requirements.txt
```

To handle heterogeneity and performance efficiently, we use PySpark from [Apache Spark](https://spark.apache.org/). PySpark provides an end-user Python API for Spark jobs. You might want to check how to set up a local or remote Spark cluster in [their documentation](https://spark.apache.org/docs/latest/api/python/index.html).

## Repository structure

This repository is organized as follows:

- Preprocessed data from the two data streams is collected in [the data folder](data/). For the Opta files, it contains the event-based metrics computed from each match of the 2017 Women's Championship, plus a single file with the event-based metrics from the 2016 Men's Championship published [here](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5). Although we cannot publish the original data source, the two Python scripts implemented to homogenize and integrate both data streams into event-based metrics are included in [the data gathering folder](data_gathering/).
- A separate folder contains the graphical images and media used for the report.
- The [data cleaning folder](data_cleaning/) contains descriptor scripts for both data streams and [the final integration](data_cleaning/merger.py).
- [Classification](classification/) contains all the Jupyter notebooks for each model present in the experiment, as well as some persisted models for testing.
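The three-model setup described above can be sketched in a few lines. This is a minimal illustration using synthetic stand-in data and scikit-learn rather than the repository's actual PySpark code and Opta-derived features; the feature counts, class shift, and hyperparameters here are assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for event-based tactical metrics: 1511 "female"
# and 2700 "male" data points, each with 8 illustrative features.
n_f, n_m, n_feat = 1511, 2700, 8
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_f, n_feat)),
    rng.normal(0.5, 1.0, size=(n_m, n_feat)),  # shifted mean so classes are separable
])
y = np.array([0] * n_f + [1] * n_m)  # 0 = female, 1 = male

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The three representative binary classifiers named in the study.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")
```

For the explainability step, the simplest analogues are `models["decision_tree"].feature_importances_` and `models["logistic_regression"].coef_`, which rank the input features by their contribution to the prediction.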
Muhammadatef
This repo contains Jupyter notebooks recapping RDD, DataFrame, Spark Streaming, and ML operations using PySpark
astrolabsoftware
PySpark notebooks to learn Apache Spark (WIP)
DIYBigData
A collection of data analysis projects done using PySpark via Jupyter notebooks.
cu-csci-4253-datacenter
Python notebooks providing a tutorial for Pyspark for CSCI 4253 / 5253
rcuevass
Repo containing IPython notebooks using Spark/PySpark for Machine Learning and Data Analysis
abhisheksaurabh1985
Jupyter notebooks for learning PySpark
nahuel-lopez-dev
Theory and practice in notebooks on Spark, using PySpark and Spark SQL
amanjeetsahu
This repo contains my learnings and practice notebooks on Spark using PySpark (the Python API for Spark). All the notebooks in the repo can be used as template code for most ML algorithms and built upon for more complex problems.
mohanab89
AI-assisted SQL migration for Databricks: convert Snowflake, T-SQL, Oracle, Teradata, Redshift, MySQL, PostgreSQL and more into Databricks SQL or PySpark notebooks. Includes validation and reconciliation features.
dimajix
Jupyter Notebooks for PySpark Advanced Workshop
beingdatum
PySpark Notebooks
MrPowers
PySpark example notebooks
shreyashji
Adding my Python, Spark, PySpark, and Scala notebook logic that I solve/see on a daily basis; contains optimization techniques for big data processing and real-time scenarios
mar1boroman
Common ETL patterns and utilities for PySpark. Notebooks tested on Databricks Community Edition
VidhyaVinayagam
Pyspark Demo Notebooks
devesshhh
Hands-on notebooks built while preparing for the Databricks Machine Learning Associate Certification. Covers PySpark, MLflow, AutoML, Model Deployment, and MLOps.
AndreasRenghart
PySpark Batch and Stream Processing with Spark Dataframe API and Spark SQL. Basics on Spark ML.
Data Engineering with Databricks Study Materials
dimajix
Jupyter Notebooks for PySpark Workshop using NYC Taxi Trip data