Search Results

Found 1,350 repositories (showing 30)

spark_study

shijinkui

🧡51

Spark source code study

299

Apache-2.0

Updated 1 month ago

StudySpark

yangtong123

❤️36

A small project for learning Spark, along with notes on various tuning techniques

177

Java

Updated 8 months ago

RoadOfStudySpark

yangtong123

🧡56

The road to learning Spark, with study notes on Spark Core, Spark SQL, Spark Streaming, and Spark MLlib

146

Scala

Updated 3 weeks ago

Data-Science-with-Spark

anindya-saha

❤️36

Machine Learning and Data Analysis Case Studies using Spark.

Jupyter Notebook

Updated 1 year ago

Cloudera_Material

san089

🧡50

Cloudera_Material: Study Material to help people preparing for Cloudera CCA Spark and Hadoop Developer Exam (CCA175). Feel free to collaborate.

MIT

Updated 1 month ago

big-data, bigdata, cca, +13

Spark-StudyClub

DataEngineering-LATAM

❤️35

Apache Spark study group organized by the Data Engineering Latam community

Jupyter Notebook

Updated 3 months ago

data-engineering, data-science, pyspark, +1

databricks-certification

mdrakiburrahman

❤️35

My Study guide used to pass the CRT020 Spark Certification exam

HTML

Updated 9 months ago

Rain-Fall_Data_Analysis_Using_Data_Science

shishirdas

🧡55

Context: Rainfall is crucial for every type of agricultural work. Climate data is important for analysing agriculture and crop seeding, where it can be used to predict rainfall in different seasons and for different types of crops. The developed application can be found at http://ml.bigalogy.com/ Paper: http://dspace.uiu.ac.bd/handle/52243/178 Abstract: Mankind has been attempting to predict the weather since prehistory, and for good reason: to know when to plant crops, when to build, and when to prepare for drought and flood. In a nation such as Bangladesh, being able to predict the weather, especially rainfall, has never been so vitally important. The proposed research work seeks to produce a rainfall prediction model using machine learning algorithms. The base data for this work was collected from the Bangladesh Meteorological Department. The work focuses mainly on developing models for long-term rainfall prediction for Bangladesh's divisions and districts (weather stations). Rainfall prediction is very important for the Bangladesh economy and for day-to-day life. Whether scarce or heavy, rainfall affects rural and urban life to a great extent as the climate pattern changes. Unusual rainfall and a long-lasting rainy season are major factors to take into account. We want to see whether enough unusual behavior is taking place to form another pattern, resulting in a new climatological description. Agriculture depends on rain, and heavy rainfall frequently causes floods that lead to great crop losses. Rainfall is a very complex phenomenon that depends on various atmospheric, oceanic, and geographical parameters, and the relationship between these parameters and rainfall is unstable. In addition, this changing climatological behavior makes existing meteorological forecasting less usable.
Initially, linear regression models were developed for monthly rainfall prediction at station and national level by day, month, and year. Here humidity, temperature, and wind parameters are used as predictors. The study is further extended with another popular regression algorithm, Random Forest Regression. After that, a few classification algorithms were used for model building, training, and prediction: Naive Bayes Classification, Decision Tree Classification (Entropy and Gini), and Random Forest Classification. In all model building and training, the predictor parameters were Station, Year, Month, and Day. As the effect of the parameters that influence rainfall is embedded in the rainfall itself, rainfall was the label (dependent variable) in these models. The developed and trained model can predict rainfall in advance for a month of a given year for a given area (the areas used here are the stations whose weather parameter values are measured by the Bangladesh Meteorological Department). The accuracy of rainfall estimation is above 65% and varies from algorithm to algorithm. Two regression models and three classification models have been developed for rainfall prediction at 33 Bangladeshi weather stations. The Apache Spark library was used for machine learning, in the Scala programming language. The main idea behind using both classification and regression analysis is to compare the prediction output of the different types of algorithms, along with their predictability and usability. This thesis is a contribution to the effort of rainfall prediction within Bangladesh. It takes the strategy of applying machine learning models to historical weather data gathered in Bangladesh. As part of this work, a web-based software application was written using Apache Spark, Scala and HighCharts to demonstrate rainfall prediction using multiple machine learning models.
The models are successively improved in rainfall prediction accuracy. Content: The given data contains monthly rainfall data for Bangladesh by weather station and year. The data comes in two formats, covering 46 years (33 weather stations), from 1970 to 2016: Daily Rainfall Data and Monthly Rainfall Data. Columns: Station (weather station, along with station index), Year, Month, Day [for the daily data file]

Jupyter Notebook

Updated 1 week ago

SparkStudy

Quincy1994

❤️30

some code for spark

Java

Updated 2 years ago

Axa-Insurance-Telematics-Kaggle

AnilSener

❤️30

I developed this case study in only 7 days with PySpark (Spark 1.6.0) SQL & MLlib. I used a Databricks cluster and AWS. 90% AUC was achieved (without involving the Trip Matching-Repeated Trips feature) with Random Forest. Ensembles of RF, GBT and Logistic Regression, together with outlier elimination, could be used to improve this result. There are two versions of my code (test and full execution). Since AWS costs exceeded my budget, I stopped training my model(s) on the full dataset for the full execution. There is also a PPT that presents my outputs from the test execution. The full data execution code is more production-ready and slightly different. I had to use Databricks Table Caching on the TRAIN and TEST data tables to obtain acceptable performance in the production-ready version.

Jupyter Notebook

Updated 1 year ago

bigdata-tutorial

micmiu

❤️35

Study demos for Hadoop, HBase, Hive, Spark, Storm, ...

Apache-2.0

Java

Updated 1 year ago

hpdc-scalding-spark

anilmuppalla

❤️35

Code for Springer Book: High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark

Scala

Updated 3 years ago

coursera-spark-notes

hongchangwu

🧡50

Study notes for "Big Data Analysis with Scala and Spark" on Coursera

MIT

Updated 2 months ago

big-data, coursera, scala, +1

ml-interpretability-european-football

marcgarnica13

❤️40

Understanding gender differences in professional European football through Machine Learning interpretability and match actions data. This repository contains the full data pipeline implemented for the study *Understanding gender differences in professional European football through Machine Learning interpretability and match actions data*. We evaluated the main differential features of European male and female football players in match actions data, under the assumption of finding significant differences and established patterns between genders. A methodology for unbiased feature extraction and objective analysis is presented, based on data integration and machine learning explainability algorithms. Female (1511) and male (2700) data points were collected from event data categorized by game period and player position. Each data point included the main tactical variables supported by research and industry to evaluate and classify football styles and performance. We set up a supervised classification pipeline to predict the gender of each player by looking at their actions in the game. The comparison methodology did not include any qualitative enrichment or subjective analysis, to prevent biased data enhancement or gender-related processing. The pipeline had three representative binary classification models: a logic-based Decision Tree, a probabilistic Logistic Regression, and a multilayer perceptron Neural Network. Each model tried to draw the differences between male and female data points, and we extracted the results using machine learning explainability methods to understand the underlying mechanics of the models implemented. Good predictive accuracy was consistent across the different models deployed. ## Installation Install the required Python packages ``` pip install -r requirements.txt ``` To handle heterogeneity and performance efficiently, we use PySpark from [Apache Spark](https://spark.apache.org/). PySpark provides an end-user API for Spark jobs.
You might want to check how to set up a local or remote Spark cluster in [their documentation](https://spark.apache.org/docs/latest/api/python/index.html). ## Repository structure This repository is organized as follows: - Preprocessed data from the two different data streams is collected in [the data folder](data/). For the Opta files, it contains the event-based metrics computed from each match of the 2017 Women's Championship and a single file calculating the event-based metrics from the 2016 Men's Championship published [here](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5). Even though we cannot publish the original data source, the two Python scripts implemented to homogenize and integrate both data streams into event-based metrics are included in [the data gathering folder](data_gathering/) folder contains the graphical images and media used for the report. - The [data cleaning folder](data_cleaning/) contains descriptor scripts for both data streams and [the final integration](data_cleaning/merger.py) - [Classification](classification/) contains all the Jupyter notebooks for each model present in the experiment as well as some persistent models for testing.

MIT

Jupyter Notebook

Updated 1 year ago

SparkStudy

ustbly

❤️35

No description available

Scala

Updated 1 month ago

sparkar-scripting-workshop

tomaspietravallo

❤️40

Part of the first JavaScript workshop for Spark AR (study group 1)

GPL-3.0

Updated 3 years ago

bigdata

giovannigarifo

❤️45

Code samples, summaries, cheatsheets and other study material for Hadoop MapReduce and Apache Spark

Java

Updated 2 months ago

big-data, bigdata, hadoop, +12

spark-study

dhinojosa

❤️35

Content for Spark training and study

Scala

Updated 3 years ago

Spark-In-Action

53SWTP

❤️35

Study <Spark In Action>

Updated 1 year ago

SparkNotes

leotse90

❤️35

Spark & Hadoop Study Notes

Updated 3 years ago

hadoop-spark

artxgj

❤️20

Hadoop, Spark, Python and Scala Study/Experiment

Python

Updated 1 year ago

databricks_spark_cert_study_guide

bclipp

❤️25

No description available

Updated 2 years ago

big_data_study_notes

nancyyanyu

❤️35

Study notes on Coursera's Big Data Essentials: HDFS, MapReduce and Spark RDD

Jupyter Notebook

Updated 7 months ago

coursera, hadoop, hdfs, +3

SparkStudy

YBIGTA

❤️35

Git Repository for Spark Study

Updated 7 years ago

hadoop-spark

woshidandan

❤️35

The study of hadoop and spark.

Scala

Updated 3 years ago

sparkStudy

km1994

❤️35

Spark study

Jupyter Notebook

Updated 1 year ago

spark-ar-blinking-game

jiyeonseo

❤️35

🤾‍♂️ A simple Game for studying Spark AR

JavaScript

Updated 3 years ago

OpenSpark-13B-Chat

xiangking

🧡50

A personally converted, Hugging Face compatible implementation of the early 2024 iFlytek Spark 13B Chat model. Intended for educational and research purposes, so the community can study the evolution of early LLM architectures.

Apache-2.0

Python

Updated 2 months ago

HadoopAndSparkDataStudy

qlycool

❤️36

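This is a handbook of study notes on big data, aimed mainly at beginners. As a veteran IT worker, I find that learning is hard work. I hope this handbook helps everyone learn and understand big data (specifically Hadoop and Spark) quickly. To avoid hitting beginners with an explosion of new concepts all at once, the course puts experiments first and follows up with the concepts. This helps everyone get up to speed quickly instead of staying stuck in abstract concepts. Everyone learns differently, though, so opinions on this approach will vary. If you have feedback, please email me at chu888chu888@qq.com. 楚广明 (Chu Guangming)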

144

Python

Updated 1 year ago

Spark-Funds-Investment-CaseStudy

santhoshpkumar

❤️35

Spark Funds - asset management

HTML

Updated 3 years ago
