Search Results

Found 114 repositories (showing 30)

Stackoverflow-Analysis

recodehive

🧡51

Stack Overflow is a professional community for developers. This repo analyzes three years of the Stack Overflow Developer Survey, visualizes the results, and predicts the future salary of data scientists.

243

121

MIT

Jupyter Notebook

Updated 1 month ago

canva collaborate data-analysis +10

BDP-05-Large-scale-Clustering

GroupAYECS765P

❤️45

BDP 05: CLUSTERING OF LARGE UNLABELED DATASETS

OVERVIEW
Real-world data is frequently unlabeled and can seem completely random. In these sorts of situations, unsupervised learning techniques are a great way to find underlying patterns. This project looks at one such algorithm, KMeans clustering, which searches for boundaries separating groups of points based on their differences in some features.

The goal of the project is to implement an unsupervised clustering algorithm using a distributed computing platform. You will implement this algorithm on the Stack Overflow user base to find different ways the community can be divided, and investigate what causes these groupings. The clustering algorithm must be designed in a way that is appropriate for data-intensive parallel computing frameworks. Spark would be the primary choice for this project, but it could also be implemented in Hadoop MapReduce. Algorithm implementations from external libraries such as Spark MLlib may not be utilised; the code must be original from the students. However, once the algorithm is completed, a comparison between your own results and those generated by MLlib could be interesting and aid your investigation.

Stack Overflow is the main dataset for this project, but alternative datasets can be adopted after consultation with the module organiser. Additionally, different clustering algorithms may be utilised, but this must be discussed with and approved by the module organiser.

DATASET
The project will use the Stack Overflow dataset. This dataset is located in HDFS at /data/stackoverflow. The dataset for StackOverflow is a set of files containing Posts, Users, Votes, Comments, PostHistory and PostLinks. Each file contains one XML record per line. For complete schema information: Click here

In order to define the clustering use case, you must define which features of each post will be used to cluster the data. Have a look at the different fields to define your use case.

ALGORITHM
The project will implement the k-means algorithm for clustering. This algorithm iteratively recomputes the locations of k centroids (k is the number of clusters, defined beforehand) that aim to classify the data. Each point is labelled with its closest centroid, and each iteration updates every centroid's location based on all the points labelled with that centroid.

Spark and Map/Reduce can be utilised for implementing this problem. Spark is recommended for this task, due to its performance benefits in . However, note that the MLlib extension of Spark is not allowed to be used as the primary implementation. The group must code its own original implementation of the algorithm. However, it is possible to also use the MLlib implementation in order to evaluate the results from each clustering implementation.

Report Contents
Brief literature survey on clustering algorithms, including the challenges of implementing them at scale for parallel frameworks. The report should then justify the chosen algorithm (if changed) and the implementation.
Definition of the project use case, where the implemented project will be part of the solution.
Implementation in MapReduce or Spark of a clustering algorithm (KMeans). It must take into account the potentially enormous size of the dataset, and develop sensible code that will scale and efficiently use additional computing nodes. The code will also need to potentially convert the dataset from its storage format to an in-memory representation. Source code should not be included in the report. However, the algorithms should be explained in the report.
Results section. Adequate figures and tables should be used to present the results. The effectiveness of the algorithm should also be shown, including performance indications. Not really sure if this can be done for clustering. Critical evaluation of the results should be provided.
Experiments demonstrating the technique can successfully group users in the dataset.
Representation of the results, and discussion of the findings in a critical manner.

ASSESSMENT
The project according to the specification has a base difficulty of 85/100. This means that a perfect implementation and report would get an 85. Additional technical features and experimentation would raise the difficulty in order to opt for a full 100/100 mark.

Report presentation: 20%
Appropriate motivation for the work. Lack of typos/grammar errors, adequate format. Clear flow and style. Related work section including adequate referencing.

Technical merit: 50%
Completeness of the implementation. [25%]
Provided source code. Code is documented. [10%]
Design rationale of the code is provided. [10%]
Efficient and appropriate implementation for the chosen platform. [5%]

Results/Analysis: 30%
Experiments have been carried out on the full dataset. [10%]
Adequate plots/tables are provided, with captions. [10%]
Results are not only presented but discussed appropriately. [10%]

Additional project goals:
Implementation of additional functions beyond the base specification can raise the base mark up to 100. A non-exhaustive list of expansion ideas includes:
Exploration and discussion of hyperparameter tuning (e.g. the number of k groups to cluster the data into) [up to 10 marks]
Comparative evaluation of clustering technique with existing implementations (e.g. MLlib) [up to 10 marks]
Bringing in additional datasets from Stack Overflow, such as user badges, to aid in clustering [up to 5 marks]
Cluster additional datasets (such as posts) [up to 10 marks]

LEAD DEMONSTRATOR
For specific queries related to this coursework topic, please liaise with Mr/Ms TBD, who will be the lead demonstrator for this project, as well as with the module organiser.

SUBMISSION GUIDELINES
The report will have a maximum length of 8 pages, not counting cover page and table of contents. The report must include motivation of the problem, brief literature survey, explanation of the selected technique, implementation details and discussion of the obtained results, and references used in the work. Additionally, the source code must be included as a separate compressed file in the submission.
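The k-means loop described in the ALGORITHM part of this description (label each point with its closest centroid, then move each centroid to the mean of its points) can be sketched on a single machine as below. This is only an illustrative sketch, not the repository's code: the function names, the tuple representation of points, the random initialisation and the fixed iteration cap are all assumptions. The per-point labelling and the per-centroid averaging are written as separate steps because they roughly correspond to the map and reduce phases a distributed version would use.

```python
import math
import random
from collections import defaultdict


def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    dims = len(points[0])

    for _ in range(iterations):
        # "Map"-like step: label every point with its closest centroid and
        # accumulate per-centroid coordinate sums and counts.
        sums = defaultdict(lambda: [0.0] * dims)
        counts = defaultdict(int)
        for p in points:
            idx = closest_centroid(p, centroids)
            counts[idx] += 1
            sums[idx] = [s + x for s, x in zip(sums[idx], p)]

        # "Reduce"-like step: move each centroid to the mean of its points.
        # A centroid that attracted no points keeps its previous location.
        new_centroids = [
            tuple(s / counts[i] for s in sums[i]) if counts[i] else centroids[i]
            for i in range(k)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids

    labels = [closest_centroid(p, centroids) for p in points]
    return centroids, labels
```

Calling `kmeans(points, k=2)` on a small list of 2-D tuples returns the final centroids together with one cluster label per input point. Choosing k and the features that make up each point is left to the use case, as the description notes.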

Python

Updated 2 months ago

StackOverflowSurveyDataAnalysis

mburakergenc

❤️35

This project draws insights about the field of software engineering from the survey data published by StackOverflow in 2019.

Jupyter Notebook

Updated 3 years ago

StackOverflow-Dev-Survey-Data-Analysis

sousablde

❤️20

Quick descriptive analysis of the Stack Overflow survey results dataset, with a focus on conditions conducive to higher job satisfaction

Jupyter Notebook

Updated 5 years ago

data-science machine-learning scikit-learn +6

-Stackoverflow-Developer-Survey-Data-Analysis-using-Pandas

Jashank17

❤️25

No description available

Jupyter Notebook

Updated 5 years ago

stack-overflow-developer-survey-analysis

akashsky1994

❤️40

Data Analysis for 2018 Stackoverflow Developer Survey Data

MIT

Jupyter Notebook

Updated 3 years ago

so_analysis

kpmatta

❤️35

stackoverflow survey data analysis

Python

Updated 6 years ago

stackoverflow_survey_analysis

slitayem

❤️20

Stackoverflow 2019 Survey Data Analysis

HTML

Updated 2 years ago

data-analysis data-science notebook +3

database-project-stackoverflow-survey-analysis

trantathung2004

❤️10

No description available

Python

Updated 10 months ago

StackOverflow-Developer-Survey-2024_Data-Analysis

Haidar-Dagham

❤️35

StackOverflow Developer Survey 2024 - Data Analysis

Jupyter Notebook

Updated 8 months ago

StackOverflowDeveloperSurveyDatasetAnalysis

Salah-Alhaidri

❤️35

This project is part of my IBM Data Analyst Capstone Project, aimed at analyzing and visualizing the developer survey dataset to gain insights into developer demographics, preferences, and job satisfaction.

Jupyter Notebook

Updated 1 year ago

StackOverflow2020survey-data-analysis

GAURAV19999

❤️35

A detailed analysis of the 2020 StackOverflow developer survey in Jupyter notebooks using Pandas, Matplotlib and Seaborn. Part of the course Zero to Pandas, hosted by Jovian.ml

Updated 2 years ago

Data-Analysis-StackOverFlow-2022-Survey-data

MurtazaSFakhry

❤️35

I used what I have learned from my '100 days of Code' to analyze Stackoverflow data. I used Matplotlib, Pandas, and a little Numpy.

Jupyter Notebook

Updated 2 years ago

Annual-developer-survey-analysis

YKarsten

❤️40

Data Analysis on Stackoverflow Annual Developer Survey

MIT

Jupyter Notebook

Updated 2 years ago

data-analysis python stackoverflow +1

Python-Exploratory-Data-Analysis-EDA-on-stackoverflow-survey

utkarshthakur24

❤️25

No description available

Jupyter Notebook

Updated 2 years ago

Analysis-on-the-annual-survey-data-set-of-stackoverflow

badrex76

❤️35

Analysis of the Stack Overflow annual user survey dataset

Jupyter Notebook

Updated 5 years ago

Write-A-DataScience-Blog

jaskaranbhatia

❤️20

A data analysis using Stackoverflow’s 2018 and 2019 Annual Developer Survey

HTML

Updated 5 years ago

Stack-Overflow-Developer-Survey-Analysis

AnkushGit14

💛70

Analyzed the StackOverflow Developer Survey 2020 dataset. The dataset contains responses to an annual survey conducted by StackOverflow. You can find the official analysis of the data here: https://insights.stackoverflow.com/survey.

GPL-2.0

Python

Updated 5 days ago

EDA_StackOverFlow_2020

beingabhi27

❤️35

Exploratory Data Analysis of StackOverFlow Survey 2020 data using Python and its libraries to get useful insights.

Updated 2 years ago

Write-a-Data-Science-Blog-Post

jamalmehr19

❤️35

Data analysis and visualization using data from surveys conducted from 2017 to 2019 among Stackoverflow's community

HTML

Updated 6 years ago

Write-A-Data-Science-Blog-Post

Faisal-AlDhuwayhi

❤️35

Writing a Data Science Blog Post using conclusions drawn from the analysis of Stackoverflow survey 2017 dataset

Jupyter Notebook

Updated 4 years ago

blog-post data data-science +4

stackoverflow-data-analysis

Juan-Pisco

❤️35

Data analysis according to CRISP-DM for answering development field questions with StackOverflow's survey results from 2020

Jupyter Notebook

Updated 3 years ago

StackOverflow-Developer-Survey-Analysis

TrueCodee

❤️40

This repository contains an analysis of programming language preferences among developers based on age groups and experience levels, utilizing data from the StackOverflow Developer Survey.

MIT

Jupyter Notebook

Updated 1 year ago

data-analysis developer-demographics programming-language-preferences +1

Stackoverflow-survey-analysis

pragyapandey870

❤️35

stackoverflow survey data analysis

Jupyter Notebook

Updated 3 years ago

Stackoverflow_Survey2022_Data_analysis

muratcanaydogdu21

❤️25

No description available

Jupyter Notebook

Updated 1 year ago

stackoverflow-survey-data-analysis

aldrinlambon

❤️35

Exploratory data analysis of Stack Overflow's developer survey from 2017-2019

Jupyter Notebook

Updated 6 years ago

stackoverflow-survey-data-analysis

Djotchuang

❤️35

Data analysis of Stackoverflow survey data for 2021

Jupyter Notebook

Updated 4 years ago

stackoverflow_survey_data_analysis

jackwills04

❤️20

No description available

Python

Updated 9 months ago

stackoverflow-survey-data-analysis

ommnnitald

❤️30

No description available

MIT

Jupyter Notebook

Updated 2 years ago

Stackoverflow_Survey_Data-Analysis

catherinecao123

❤️35

Use Stackoverflow’s 2017 Annual Developer Survey Data to Answer "tech job" Related Questions

Updated 4 years ago
