Found 915 repositories (showing 30)
jadianes
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
LucaCanali
Includes notes on using Apache Spark, with drill-downs on Spark for physics, how to run TPC-DS on PySpark, and how to create histograms with Spark. Also tools for stress testing, measuring CPU performance, and I/O latency heat maps. Jupyter notebook examples for using various DB systems.
ThreatHuntingProject
A threat hunting / data analysis environment based on Python, Pandas, PySpark and Jupyter Notebook.
coder2j
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformations and Actions, Spark DataFrame, Spark SQL, and more. It is completely free on YouTube and is beginner-friendly without any prerequisites.
andfanilo
Jupyter notebooks for pyspark tutorials given at University
swan-cern
An extension for Jupyter Lab & Jupyter Notebook to monitor Apache Spark (pyspark) from notebooks
garystafford
Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
IBMDataScience
PySpark Notebook and Shiny App for Demo
dimajix
Docker image for Jupyter notebooks with PySpark
nanlabs
A complete example of an AWS Glue application that uses the Serverless Framework to deploy the infrastructure and DevContainers and/or Docker Compose to run the application locally with AWS Glue Libs, Spark, Jupyter Notebook, AWS CLI, among other tools. It provides jobs using Python Shell and PySpark.
vkocaman
Complete PySpark guide for beginners... I prepared this notebook for my students.
A3Data
This repo provides the Kubernetes Helm chart for deploying Pyspark Notebook.
cvilla87
Jupyter Notebook showing how to process Telecom datasets using PySpark (SparkSQL and DataFrames) and plotting the results using Matplotlib.
nicodv
A short tutorial notebook on PySpark
ayushsubedi
Notebooks/materials on Big Data with PySpark skill track from datacamp (primarily). Also, contains books/cheat-sheets.
telia-oss
Birgitta is a Python ETL test and schema framework, providing automated tests for pyspark notebooks/recipes.
RealKinetic
An example CI/CD pipeline using GitHub Actions for doing continuous deployment of AWS Glue jobs built on PySpark and Jupyter Notebooks.
Azure-Samples
Instructions and examples for installing CNTK on an HDInsight cluster and running CNTK-Pyspark applications from Jupyter notebooks.
bennyaustin
Reusable Python classes that extend open source PySpark capabilities. Examples of implementation are available under the notebooks of the repo https://github.com/bennyaustin/synapse-dataplatform
Monish-Nallagondalla
This project uses PySpark and Python to analyze a Google Play Store dataset. It covers data cleaning, duplicate removal, and visual analysis, performed in Jupyter Notebook with Spark's distributed computing.
marcgarnica13
Understanding gender differences in professional European football through Machine Learning interpretability and match actions data.

This repository contains the full data pipeline implemented for the study *Understanding gender differences in professional European football through Machine Learning interpretability and match actions data*. We evaluated the main differential features of European male and female football players in match-action data, under the assumption of finding significant differences and established patterns between genders. A methodology for unbiased feature extraction and objective analysis is presented, based on data integration and machine learning explainability algorithms. Female (1511) and male (2700) data points were collected from event data categorized by game period and player position. Each data point included the main tactical variables supported by research and industry to evaluate and classify football styles and performance.

We set up a supervised classification pipeline to predict the gender of each player from their in-game actions. The comparison methodology did not include any qualitative enrichment or subjective analysis, to prevent biased data enhancement or gender-related processing. The pipeline had three representative binary classification models: a logic-based Decision Tree, a probabilistic Logistic Regression, and a multilayer perceptron Neural Network. Each model tried to draw out the differences between male and female data points, and we extracted the results using machine learning explainability methods to understand the underlying mechanics of the models implemented. Good predictive accuracy was consistent across the different models deployed.

## Installation

Install the required Python packages:

```
pip install -r requirements.txt
```

To handle heterogeneity and performance efficiently, we use PySpark from [Apache Spark](https://spark.apache.org/). PySpark provides an end-user Python API for Spark jobs. You might want to check how to set up a local or remote Spark cluster in [their documentation](https://spark.apache.org/docs/latest/api/python/index.html).

## Repository structure

This repository is organized as follows:

- Preprocessed data from the two different data streams is collected in [the data folder](data/). For the Opta files, it contains the event-based metrics computed from each match of the 2017 Women's Championship, plus a single file with the event-based metrics computed from the 2016 Men's Championship published [here](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5). Even though we cannot publish the original data source, the two Python scripts implemented to homogenize and integrate both data streams into event-based metrics are included in [the data gathering folder](data_gathering/).
- A separate folder contains the graphical images and media used for the report.
- The [data cleaning folder](data_cleaning/) contains descriptor scripts for both data streams and [the final integration](data_cleaning/merger.py).
- [Classification](classification/) contains all the Jupyter notebooks for each model in the experiment, as well as some persisted models for testing.
prabeesh
Pyspark Notebook With Docker
Dirkster99
My notebook on using Python with Jupyter Notebook, PySpark, etc.
willb
The notebook component of a PySpark application to calculate value-at-risk for a portfolio of securities
Muhammadatef
This repo contains Jupyter notebooks recapping RDD, DataFrame, Spark Streaming, and ML operations using PySpark
astrolabsoftware
PySpark notebooks to learn Apache Spark (WIP)
DIYBigData
A collection of data analysis projects done using PySpark via Jupyter notebooks.
sethiaarun
Convert Azure Mapping dataflow to Microsoft Fabric PySpark Notebook using OpenAI
cu-csci-4253-datacenter
Python notebooks providing a tutorial for Pyspark for CSCI 4253 / 5253
rcuevass
Repo containing IPython notebooks using Spark/PySpark for Machine Learning and Data Analysis