Found 237 repositories (showing 30)
dslp
The Data Science Lifecycle Process is a process for taking data science teams from Idea to Value repeatedly and sustainably. The process is documented in this repo.
Aastha2104
Introduction: Parkinson’s Disease is the second most prevalent neurodegenerative disorder after Alzheimer’s, affecting more than 10 million people worldwide. Parkinson’s is characterized primarily by the deterioration of motor and cognitive ability. There is no single test that can be administered for diagnosis. Instead, doctors must perform a careful clinical analysis of the patient’s medical history. Unfortunately, this method of diagnosis is highly inaccurate. A study from the National Institute of Neurological Disorders finds that early diagnosis (having symptoms for 5 years or less) is only 53% accurate. This is not much better than random guessing, but an early diagnosis is critical to effective treatment. Because of these difficulties, I investigate a machine learning approach to accurately diagnose Parkinson’s, using a dataset of various speech features (a non-invasive yet characteristic tool) from the University of Oxford. Why speech features? Speech is very predictive and characteristic of Parkinson’s disease; almost every Parkinson’s patient experiences severe vocal degradation (inability to produce sustained phonations, tremor, hoarseness), so it makes sense to use voice to diagnose the disease. Voice analysis gives the added benefit of being non-invasive, inexpensive, and very easy to extract clinically. Background: Parkinson's Disease: Parkinson’s is a progressive neurodegenerative condition resulting from the death of the dopamine-containing cells of the substantia nigra (which plays an important role in movement). Symptoms include: “frozen” facial features, bradykinesia (slowness of movement), akinesia (impairment of voluntary movement), tremor, and voice impairment. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated, and 80% of striatal dopamine has been depleted. 
Performance Metrics: TP = true positive, FP = false positive, TN = true negative, FN = false negative. Accuracy: (TP+TN)/(P+N), where P = TP+FN and N = TN+FP are the numbers of actual positives and actual negatives. Matthews Correlation Coefficient: 1 = perfect, 0 = random, -1 = completely inaccurate. Algorithms Employed: Logistic Regression (LR): Uses the sigmoid logistic equation with weights (coefficient values) and biases (constants) to model the probability of a certain class for binary classification. An output of 1 represents one class, and an output of 0 represents the other. Training the model will learn the optimal weights and biases. Linear Discriminant Analysis (LDA): Assumes that the data is Gaussian and each feature has the same variance. LDA estimates the mean and variance for each class from the training data, and then uses properties of statistics (Bayes' theorem, Gaussian distribution, etc.) to compute the probability of a particular instance belonging to a given class. The class with the largest probability is the prediction. k Nearest Neighbors (KNN): Makes predictions about the validation set using the entire training set. KNN makes a prediction about a new instance by searching through the entire training set to find the k “closest” instances. “Closeness” is determined using a proximity measurement (Euclidean distance) across all features. The class that the majority of the k closest instances belong to is the class that the model predicts for the new instance. Decision Tree (DT): Represented by a binary tree, where each internal node represents an input variable and a split point, and each leaf node contains an output used to make a prediction. Neural Network (NN): Models the way the human brain makes decisions. Each neuron takes in one or more inputs, and then uses an activation function to process the input with weights and biases to produce an output. Neurons can be arranged into layers, and multiple layers can form a network to model complex decisions. Training the network involves using the training instances to optimize the weights and biases. 
Naive Bayes (NB): Simplifies the calculation of probabilities by assuming that all features are independent of one another (a strong but effective assumption). Employs Bayes Theorem to calculate the probabilities that the instance to be predicted is in each class, then finds the class with the highest probability. Gradient Boost (GB): Generally used when seeking a model with very high predictive performance. Used to reduce bias and variance (“error”) by combining multiple “weak learners” (not very good models) to create a “strong learner” (high performance model). Involves 3 elements: a loss function (error function) to be optimized, a weak learner (decision tree) to make predictions, and an additive model to add trees to minimize the loss function. Gradient descent is used to minimize error after adding each tree (one by one). Engineering Goal Produce a machine learning model to diagnose Parkinson’s disease given various features of a patient’s speech with at least 90% accuracy and/or a Matthews Correlation Coefficient of at least 0.9. Compare various algorithms and parameters to determine the best model for predicting Parkinson’s. 
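As an illustration of the two performance metrics named above, the following minimal sketch computes accuracy and the Matthews Correlation Coefficient from confusion-matrix counts. The function names and the example counts are hypothetical; they are not taken from the project's code or results.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct: (TP+TN)/(P+N)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here, for cases where a whole row or column is empty).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical validation counts, for illustration only.
    tp, tn, fp, fn = 36, 11, 1, 1
    print("accuracy:", accuracy(tp, tn, fp, fn))
    print("MCC:", mcc(tp, tn, fp, fn))
```

Because the dataset is imbalanced (many more Parkinson’s instances than healthy ones), MCC is a useful companion to accuracy: it uses all four counts, so a model that always predicts the majority class scores 0 rather than a misleadingly high value.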
Dataset Description: Source: the University of Oxford. 195 instances (147 from subjects with Parkinson’s, 48 from subjects without Parkinson’s). 22 features (elements that are possibly characteristic of Parkinson’s, such as frequency, pitch, amplitude / period of the sound wave). 1 label (1 for Parkinson’s, 0 for no Parkinson’s). Project Pipeline: [figure: pipeline] Summary of Procedure: (1) Split the Oxford Parkinson’s Dataset into two parts: one for training, one for validation (to evaluate how well the model performs). (2) Train each of the following algorithms with the training set: Logistic Regression, Linear Discriminant Analysis, k Nearest Neighbors, Decision Tree, Neural Network, Naive Bayes, Gradient Boost. (3) Evaluate results using the validation set. (4) Repeat for the following training set to validation set splits: 80% training / 20% validation, 75% / 25%, and 70% / 30%. (5) Repeat for a rescaled version of the dataset (all numbers scaled to a range from 0 to 1, which puts every feature on a common scale). (6) Conduct 5 trials and average the results. Data: [figures: a_o, a_r, m_o, m_r] Data Analysis: In general, the models tended to perform the best (both in terms of accuracy and Matthews Correlation Coefficient) on the rescaled dataset with a 75-25 train-validation split. The two highest performing algorithms, k Nearest Neighbors and the Neural Network, both achieved an accuracy of 98%. The NN achieved an MCC of 0.96, while KNN achieved an MCC of 0.94. These figures outperform most existing literature and significantly outperform current methods of diagnosis. Conclusion and Significance: These robust results suggest that a machine learning approach can indeed be implemented to significantly improve diagnosis methods of Parkinson’s disease. Given the necessity of early diagnosis for effective treatment, my machine learning models provide a very promising alternative to the current, rather ineffective method of diagnosis. 
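The procedure summarized above (rescale every feature to the range 0 to 1, split into training and validation sets, classify with k Nearest Neighbors using Euclidean distance) could be sketched roughly as follows. This is a standard-library illustration under stated assumptions, not the project's actual code: the function names, the seed, and the toy feature values are all hypothetical.

```python
import math
import random
from collections import Counter


def min_max_rescale(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def train_validation_split(rows, labels, train_fraction, seed=0):
    """Shuffle indices and split, e.g. train_fraction=0.75 for a 75/25 split."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(train_fraction * len(rows))
    train, val = indices[:cut], indices[cut:]
    return (
        [rows[i] for i in train], [labels[i] for i in train],
        [rows[i] for i in val], [labels[i] for i in val],
    )


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-feature toy data; 1 = Parkinson's, 0 = healthy.
    features = [[120.0, 0.004], [118.0, 0.005], [125.0, 0.003], [122.0, 0.004],
                [180.0, 0.020], [175.0, 0.022], [185.0, 0.019], [178.0, 0.021]]
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    scaled = min_max_rescale(features)
    tx, ty, vx, vy = train_validation_split(scaled, labels, 0.75)
    predictions = [knn_predict(tx, ty, q) for q in vx]
    print(list(zip(predictions, vy)))
```

The sketch rescales the whole dataset before splitting, mirroring the description; fitting the scaling on the training portion only is a common alternative that avoids leaking validation-set ranges into training.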
Current methods of early diagnosis are only 53% accurate, while my machine learning model produces 98% accuracy. This 45-percentage-point increase is critical because an accurate, early diagnosis is needed to effectively treat the disease. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated, and 80% of striatal dopamine has been depleted. With an earlier diagnosis, much of this degradation could have been slowed or treated. My results are very significant because Parkinson’s affects over 10 million people worldwide who could benefit greatly from an early, accurate diagnosis. Not only is my machine learning approach more accurate, it is also more scalable, less expensive, and therefore more accessible to people who might not have access to established medical facilities and professionals. The diagnosis is also much simpler, requiring only a 10-15 second voice recording and producing an immediate diagnosis. Future Research: Given more time and resources, I would investigate the following: Create a mobile application which would allow the user to record his/her voice, extract the necessary vocal features, and feed them into my machine learning model to diagnose Parkinson’s. Use larger datasets in conjunction with the University of Oxford dataset. Tune and improve my models even further to achieve even better results. Investigate different structures and types of neural networks. Construct a novel algorithm specifically suited for the prediction of Parkinson’s. Generalize my findings and algorithms for all types of dementia disorders, such as Alzheimer’s. References: Bind, Shubham. "A Survey of Machine Learning Based Approaches for Parkinson Disease Prediction." International Journal of Computer Science and Information Technologies 6 (2015): n. pag. International Journal of Computer Science and Information Technologies. 2015. Web. 8 Mar. 2017. Brooks, Megan. "Diagnosing Parkinson's Disease Still Challenging." 
Medscape Medical News. National Institute of Neurological Disorders, 31 July 2014. Web. 20 Mar. 2017. Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. "Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection." BioMedical Engineering OnLine 2007, 6:23 (26 June 2007). Hashmi, Sumaiya F. "A Machine Learning Approach to Diagnosis of Parkinson’s Disease." Claremont Colleges Scholarship. Claremont College, 2013. Web. 10 Mar. 2017. Karplus, Abraham. "Machine Learning Algorithms for Cancer Diagnosis." Machine Learning Algorithms for Cancer Diagnosis (n.d.): n. pag. Mar. 2012. Web. 20 Mar. 2017. Little, Max. "Parkinsons Data Set." UCI Machine Learning Repository. University of Oxford, 26 June 2008. Web. 20 Feb. 2017. Ozcift, Akin, and Arif Gulten. "Classifier Ensemble Construction with Rotation Forest to Improve Medical Diagnosis Performance of Machine Learning Algorithms." Computer Methods and Programs in Biomedicine 104.3 (2011): 443-51. Semantic Scholar. 2011. Web. 15 Mar. 2017. "Parkinson’s Disease Dementia." UCI MIND. N.p., 19 Oct. 2015. Web. 17 Feb. 2017. Salvatore, C., A. Cerasa, I. Castiglioni, F. Gallivanone, A. Augimeri, M. Lopez, G. Arabia, M. Morelli, M.C. Gilardi, and A. Quattrone. "Machine Learning on Brain MRI Data for Differential Diagnosis of Parkinson's Disease and Progressive Supranuclear Palsy." Journal of Neuroscience Methods 222 (2014): 230-37. 2014. Web. 18 Mar. 2017. Shahbakhi, Mohammad, Danial Taheri Far, and Ehsan Tahami. "Speech Analysis for Diagnosis of Parkinson’s Disease Using Genetic Algorithm and Support Vector Machine." Journal of Biomedical Science and Engineering 07.04 (2014): 147-56. Scientific Research. July 2014. Web. 2 Mar. 2017. "Speech and Communication." Speech and Communication. Parkinson's Disease Foundation, n.d. Web. 22 Mar. 2017. Sriram, Tarigoppula V. S., M. Venkateswara Rao, G. V. Satya Narayana, and D. S. V. G. K. Kaladhar. 
"Diagnosis of Parkinson Disease Using Machine Learning and Data Mining Systems from Voice Dataset." SpringerLink. Springer, Cham, 01 Jan. 1970. Web. 17 Mar. 2017.
AaltoGIS
Site for "Spatial Data Science for Sustainable Development" course at the Dept. Built Environment, Aalto University
alan-turing-institute
Repository for mini-projects in the Data science for Sustainable development project
Dansk-Data-Science-Community
Danish Data Science Community's guide to sustainable data science
paschalugwu
Enhancing Farming in Maji Ndogo: Leveraging data science and AI to boost crop yield, sustainability, and resource efficiency.
Marine data mobilization workshop for Biology and Ecosystem Essential Ocean Variables (Bio-Eco EOV) as a Contribution to the UN Decade on Ocean Science for Sustainable Development
georgeroshankujur
Project to develop data-driven solutions for optimizing food supply chains. Aims to use predictive modeling and inventory analytics to forecast surplus/shortages, reduce waste, and improve sustainability in food operations. Showcases data science skills in environmental and logistical problem-solving.
joelcthomas
Data science and engineering project template for productive enterprise teams to build scalable and sustainable solutions
UniteIdeas
Bring together data from a curated list of reputable sources worldwide to match innovators, entrepreneurs, and everyday users with the science and technology they need to build a sustainable world.
Developed a robust machine learning model (XGBoost) to demonstrate how data science can directly support energy optimization, sustainability goals, and cost-effective production planning.
Design data streaming architecture and API for a real-life application called the Step Trending Electronic Data Interface (STEDI). It is a working application used to assess fall risk for seniors. When a senior takes a test, they are scored using an index which reflects the likelihood of falling, and potentially sustaining an injury in the course of walking. STEDI uses a Redis datastore for risk score and other data. The Data Science team has completed a working graph for population risk at a STEDI clinic. The problem is the data is not populated yet. You will work with Kafka Connect Redis Source events and Business Events to create a Kafka topic containing anonymized risk scores of seniors in the clinic.
ZH-pku
Accurate mapping of vegetation is a prerequisite for conserving, managing, and sustainably using vegetation resources, especially under conditions of intensive human activity and accelerating global change. However, it remains challenging to produce high-resolution multiclass vegetation maps with high accuracy, because traditional mapping technology cannot distinguish mosaic vegetation classes with subtle differences and fieldwork data are scarce. This study, using extensive features and abundant vegetation survey data, created a workflow built on a promising classifier, eXtreme Gradient Boosting (XGBoost), to produce accurate vegetation maps for two strikingly different cases: the Dzungarian Basin in China and New Zealand. For the Dzungarian Basin, a vegetation map with 7 vegetation types, 17 subtypes, and 43 associations was produced, with overall accuracies of 0.907, 0.801, and 0.748, respectively. For New Zealand, a map of 10 habitats and a map of 41 vegetation classes were produced, with overall accuracies of 0.946 and 0.703, respectively. The workflow, which incorporates simplified field survey procedures, outperformed conventional field-survey and remote-sensing-based methods in both accuracy and efficiency. It also opens the possibility of building large-scale, high-resolution, and timely vegetation monitoring platforms for most terrestrial ecosystems worldwide with the aid of Google Earth Engine and citizen science programs.
antarctica
Tools and Resources collected from "Open Science and Sustainable Software for Data-driven Discovery"
No description available
ThusharaN
The project, undertaken as part of the Post Graduate Dissertation, aims to identify, analyze, and propose innovative solutions for sustainable textile recycling, leveraging data science techniques.
Nelvinebi
A data science project analyzing synthetic environmental and fishing activity data to model fish stock decline in the Niger Delta, using exploratory analysis, correlation mapping, and regression modeling to understand key drivers and support sustainable fisheries management.
Vidhi1290
"Predicting a Greener Future 🌾📊 Delve into the world of agriculture and data science with our Yield Prediction project. We harness machine learning and weather data to forecast crop yields accurately. Join us in cultivating smarter farming practices for a sustainable tomorrow."
ManhHoDinh
Discover the future of urban mobility with City Sense, a UIT Data Science traffic application for smart cities. Our cutting-edge solution revolutionizes the way cities manage traffic, enhancing the quality of life for residents and fostering sustainable urban development.
instabaines
Context: From World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people. So daily level information on the affected people can give some interesting insights when it is made available to the broader data science community. Johns Hopkins University has made an excellent dashboard using the affected cases data. Data is extracted from the associated Google Sheets and made available here. Edited: Now data is available as CSV files in the Johns Hopkins GitHub repository. Please refer to the GitHub repository for the Terms of Use details. Uploading it here for use in Kaggle kernels and for getting insights from the broader DS community. Content: 2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC. This dataset has daily level information on the number of affected cases, deaths and recoveries from the 2019 novel coronavirus. Please note that this is time series data, so the number of cases on any given day is the cumulative number. The data is available from 22 Jan, 2020. Column Description: The main file in this dataset is covid_19_data.csv and the detailed descriptions are below. 
covid_19_data.csv: Sno - Serial number. ObservationDate - Date of the observation in MM/DD/YYYY. Province/State - Province or state of the observation (could be empty when missing). Country/Region - Country of observation. Last Update - Time in UTC at which the row is updated for the given province or country (not standardised, so please clean before using it). Confirmed - Cumulative number of confirmed cases till that date. Deaths - Cumulative number of deaths till that date. Recovered - Cumulative number of recovered cases till that date. 2019_ncov_data.csv: This is an older file and is not being updated now. Please use the covid_19_data.csv file. Added two new files with individual level information: COVID_open_line_list_data.csv (this file is obtained from this link) and COVID19_line_list_data.csv (this file is obtained from this link). Country level datasets: If you are interested in knowing country level data, please refer to the following Kaggle datasets: India - https://www.kaggle.com/sudalairajkumar/covid19-in-india South Korea - https://www.kaggle.com/kimjihoo/coronavirusdataset Italy - https://www.kaggle.com/sudalairajkumar/covid19-in-italy Brazil - https://www.kaggle.com/unanimad/corona-virus-brazil USA - https://www.kaggle.com/sudalairajkumar/covid19-in-usa Switzerland - https://www.kaggle.com/daenuprobst/covid19-cases-switzerland Indonesia - https://www.kaggle.com/ardisragen/indonesia-coronavirus-cases Acknowledgements: Johns Hopkins University for making the data available for educational and academic research purposes. MoBS lab - https://www.mobs-lab.org/2019ncov.html World Health Organization (WHO): https://www.who.int/ DXY.cn. Pneumonia. 2020. http://3g.dxy.cn/newh5/view/pneumonia. 
BNO News: https://bnonews.com/index.php/2020/02/the-latest-coronavirus-cases/ National Health Commission of the People’s Republic of China (NHC): http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml China CDC (CCDC): http://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm Hong Kong Department of Health: https://www.chp.gov.hk/en/features/102465.html Macau Government: https://www.ssm.gov.mo/portal/ Taiwan CDC: https://sites.google.com/cdc.gov.tw/2019ncov/taiwan?authuser=0 US CDC: https://www.cdc.gov/coronavirus/2019-ncov/index.html Government of Canada: https://www.canada.ca/en/public-health/services/diseases/coronavirus.html Australia Government Department of Health: https://www.health.gov.au/news/coronavirus-update-at-a-glance European Centre for Disease Prevention and Control (ECDC): https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases Ministry of Health Singapore (MOH): https://www.moh.gov.sg/covid-19 Italy Ministry of Health: http://www.salute.gov.it/nuovocoronavirus Picture courtesy : Johns Hopkins University dashboard Inspiration Some insights could be Changes in number of affected cases over time Change in cases over time at country level Latest number of affected cases
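Since the description above notes that Confirmed, Deaths and Recovered are cumulative and that rows are reported per province, a first step toward "changes in number of affected cases over time" is to sum provinces per country and difference consecutive days. The sketch below is one way to do that with the standard library; it assumes the column names listed in the column description, and the sample rows (country "Exampleland", provinces "A" and "B", all counts) are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Invented sample in the column layout described for covid_19_data.csv.
SAMPLE = """Sno,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020,A,Exampleland,1/22/2020 17:00,10,0,0
2,01/22/2020,B,Exampleland,1/22/2020 17:00,5,0,0
3,01/23/2020,A,Exampleland,1/23/20 17:00,12,0,1
4,01/23/2020,B,Exampleland,1/23/20 17:00,9,0,0
5,01/24/2020,A,Exampleland,1/24/20 17:00,20,1,2
6,01/24/2020,B,Exampleland,1/24/20 17:00,9,0,1
"""


def daily_new_confirmed(csv_file):
    """Return {country: [(date, new_confirmed), ...]} from cumulative rows."""
    # Sum the cumulative counts of all provinces for each country and day.
    totals = defaultdict(lambda: defaultdict(float))
    for row in csv.DictReader(csv_file):
        day = datetime.strptime(row["ObservationDate"], "%m/%d/%Y").date()
        totals[row["Country/Region"]][day] += float(row["Confirmed"])

    # Difference consecutive days to turn cumulative totals into daily counts.
    result = {}
    for country, by_day in totals.items():
        previous = 0.0
        series = []
        for day in sorted(by_day):
            series.append((day, by_day[day] - previous))
            previous = by_day[day]
        result[country] = series
    return result


if __name__ == "__main__":
    for country, series in daily_new_confirmed(io.StringIO(SAMPLE)).items():
        for day, new_cases in series:
            print(country, day.isoformat(), new_cases)
```

The sketch deliberately ignores the Last Update column, which the description flags as not standardised, and keys everything on ObservationDate instead. To run it on the real file, the `io.StringIO(SAMPLE)` argument could be replaced with an opened file handle.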
kienlef
Udemy Online Course Material: Data Science on Sustainable Development Goals
No description available
We analyze marine life sustainability and the water quality, using data science, analysis and machine learning. Evaluation is done on various physical and chemical properties and by calculating WQI.
seydoudia
Welcome to Seydou's portfolio where you will find various personal research papers and data science projects in relation to machine learning, energy and sustainable development.
delnouty
This is some additional information for the article "Data science and environmental sustainability: mastering the information and knowledge"
uzaif-lab
GeoWindNet: AI-powered offshore wind farm suitability predictor. A machine learning solution that uses a CNN to analyze geospatial data and predict optimal seafloor locations for wind farm installation. Combines deep learning, data science, and renewable energy technology to accelerate sustainable infrastructure planning.
Vishnu252005
This repository contains a collection of applied machine learning and data science projects focused on sustainability, smart systems, and social impact. Each notebook demonstrates a complete workflow, from data preprocessing to model evaluation and ethical reflection.
illumincrotty
Our team has a diverse background in multiple fields including Computer Science, Data Analytics, Remote Data Collection, User Experience, Environmental Philosophy, Mathematics, and Physics. Our team consists of mainly seniors completing their Bachelor of Science in Computer Science and includes a physics major, two math minors, and two philosophy minors. We provide years of experience working with not only data analysis, but particularly environmental data and sustainability analysis. Our plan has three parts. The first is to develop a simple to understand, interactive dashboard to provide easily accessible sustainability analysis. The second is to create a database backend for centralized sustainability and consumption data for the various resources. The third is to create remote data monitoring systems using Raspberry Pis (a low energy, low cost computer) to provide a cost effective way to collect the data at utility meters around Denison. Our team has the capability and drive to complete this project. Each of our fields includes the study of resource efficiency and sustainability practices, especially physics and philosophy. We believe that our particular skill sets and qualifications make us the optimal choice for a project involving data analysis and representation of Denison’s resource usage and sustainability. If given this project, we will be equipped to utilize our experiences with sustainability analysis, remote data collection, and software design to create an elegant and interactive dashboard for Denison’s resource usage. We look forward to working with Denison to create a more sustainable Denison.
This project introduces a unique solution using advanced data science and machine learning to address the critical need for accurate energy usage modelling. Unlike traditional approaches, our solution provides actionable insights, empowering stakeholders to optimise resource allocation and enhance sustainability through real-time data analytics.
Kriti-Data-Business
Harness AI and data science to revolutionize energy operations! This project optimizes efficiency, detects anomalies, and reduces emissions using advanced machine learning. Inspired by McKinsey's Vistra Corp. case, it showcases a scalable framework for cost savings, sustainability, and operational excellence in the energy sector.