Found 52 repositories (showing 30)
Aastha2104
Introduction

Parkinson’s Disease is the second most prevalent neurodegenerative disorder after Alzheimer’s, affecting more than 10 million people worldwide. Parkinson’s is characterized primarily by the deterioration of motor and cognitive ability. There is no single test that can be administered for diagnosis; instead, doctors must perform a careful clinical analysis of the patient’s medical history. Unfortunately, this method of diagnosis is highly inaccurate: a study from the National Institute of Neurological Disorders finds that early diagnosis (having symptoms for 5 years or less) is only 53% accurate. This is not much better than random guessing, yet an early diagnosis is critical to effective treatment. Because of these difficulties, I investigate a machine learning approach to accurately diagnose Parkinson’s, using a dataset of various speech features (a non-invasive yet characteristic tool) from the University of Oxford.

Why speech features? Speech is highly predictive and characteristic of Parkinson’s disease: almost every Parkinson’s patient experiences severe vocal degradation (inability to produce sustained phonations, tremor, hoarseness), so it makes sense to use voice to diagnose the disease. Voice analysis has the added benefit of being non-invasive, inexpensive, and very easy to collect clinically.

Background

Parkinson's Disease: Parkinson’s is a progressive neurodegenerative condition resulting from the death of the dopamine-containing cells of the substantia nigra (a region that plays an important role in movement). Symptoms include “frozen” facial features, bradykinesia (slowness of movement), akinesia (impairment of voluntary movement), tremor, and voice impairment. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated and 80% of striatal dopamine has been depleted.
Performance Metrics

TP = true positive, FP = false positive, TN = true negative, FN = false negative
Accuracy: (TP + TN) / (P + N), i.e., correct predictions over all instances
Matthews Correlation Coefficient (MCC): 1 = perfect prediction, 0 = no better than random, -1 = total disagreement between prediction and observation

Algorithms Employed

Logistic Regression (LR): Uses the sigmoid (logistic) function with weights (coefficient values) and biases (constants) to model the probability of a class for binary classification. An output near 1 represents one class, and an output near 0 represents the other. Training learns the optimal weights and biases.

Linear Discriminant Analysis (LDA): Assumes the data is Gaussian and each feature has the same variance. LDA estimates the mean and variance for each class from the training data, then uses Bayes’ theorem with the Gaussian distribution to compute the probability that a particular instance belongs to each class. The class with the largest probability is the prediction.

k Nearest Neighbors (KNN): Makes predictions using the entire training set. For a new instance, KNN searches the training set for the k “closest” instances, where closeness is measured with a proximity metric (here, Euclidean distance) across all features. The class held by the majority of those k neighbors is the predicted class.

Decision Tree (DT): Represented by a binary tree, where each internal node represents an input variable and a split point, and each leaf node contains an output value used to make a prediction.

Neural Network (NN): Loosely models the way the human brain makes decisions. Each neuron takes one or more inputs and applies an activation function to a weighted sum (weights plus a bias) to produce an output. Neurons are arranged into layers, and multiple layers form a network that can model complex decisions. Training uses the training instances to optimize the weights and biases.
Naive Bayes (NB): Simplifies probability calculations by assuming that all features are independent of one another (a strong but often effective assumption). Applies Bayes’ theorem to compute the probability that the instance belongs to each class, then predicts the class with the highest probability.

Gradient Boost (GB): Generally used when seeking very high predictive performance. Reduces bias and variance (“error”) by combining multiple “weak learners” (poorly performing models) into a “strong learner” (a high-performance model). It involves three elements: a loss function to be optimized, a weak learner (a decision tree) to make predictions, and an additive model that adds trees one by one to minimize the loss function, applying gradient descent after each tree is added.

Engineering Goal

Produce a machine learning model that diagnoses Parkinson’s disease from features of a patient’s speech with at least 90% accuracy and/or a Matthews Correlation Coefficient of at least 0.9. Compare various algorithms and parameters to determine the best model for predicting Parkinson’s.
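The two performance metrics defined above can be written out directly. This is a small illustrative sketch (the MCC formula is the standard one; the example confusion counts are made up, not the study's results):

```python
import math

def accuracy(tp, fp, tn, fn):
    """Accuracy: correct predictions over all instances."""
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient: 1 = perfect, 0 = random, -1 = inverted."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative counts: 45 TP, 2 FP, 10 TN, 1 FN
print(accuracy(45, 2, 10, 1))
print(mcc(45, 2, 10, 1))
```

Note that MCC is more informative than accuracy on this dataset, since the classes are imbalanced (147 Parkinson's vs. 48 healthy): a model that always predicts "Parkinson's" would score 75% accuracy but an MCC of 0.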
Dataset Description

Source: the University of Oxford
195 instances (147 subjects with Parkinson’s, 48 without)
22 features (elements possibly characteristic of Parkinson’s, such as frequency, pitch, and amplitude/period of the sound wave)
1 label (1 for Parkinson’s, 0 for no Parkinson’s)

Project Pipeline

(Pipeline diagram not reproduced here.)

Summary of Procedure

Split the Oxford Parkinson’s dataset into two parts: one for training, one for validation (to evaluate how well the model performs).
Train each of the following algorithms on the training set: Logistic Regression, Linear Discriminant Analysis, k Nearest Neighbors, Decision Tree, Neural Network, Naive Bayes, Gradient Boost.
Evaluate results using the validation set.
Repeat for the following training/validation splits: 80%/20%, 75%/25%, and 70%/30%.
Repeat for a rescaled version of the dataset (all values scaled to the range 0 to 1, which helps reduce the effect of outliers).
Conduct 5 trials and average the results.

Data

(Result tables not reproduced here.)

Data Analysis

In general, the models performed best (in both accuracy and Matthews Correlation Coefficient) on the rescaled dataset with a 75/25 train-test split. The two highest-performing algorithms, k Nearest Neighbors and the Neural Network, both achieved an accuracy of 98%. The NN achieved an MCC of 0.96, while KNN achieved an MCC of 0.94. These figures outperform most existing literature and significantly outperform current methods of diagnosis.

Conclusion and Significance

These robust results suggest that a machine learning approach can indeed be implemented to significantly improve diagnosis of Parkinson’s disease. Given the necessity of early diagnosis for effective treatment, my machine learning models provide a very promising alternative to the current, rather ineffective method of diagnosis.
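The procedure above can be sketched with scikit-learn. This is a hedged reconstruction, not the author's actual code: a synthetic 195×22 dataset stands in for the real Oxford data (available on the UCI Machine Learning Repository as the "Parkinsons Data Set"), and only two of the seven algorithms are shown:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Synthetic stand-in for the 195-instance, 22-feature Oxford dataset.
X, y = make_classification(n_samples=195, n_features=22, random_state=0)

# Rescale every feature to [0, 1] to reduce the effect of outliers.
X = MinMaxScaler().fit_transform(X)

# 75% training / 25% validation (the best-performing split in the study).
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

for name, model in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                    ("LR", LogisticRegression(max_iter=1000))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    print(name, accuracy_score(y_va, pred), matthews_corrcoef(y_va, pred))
```

On the synthetic data the exact scores are meaningless; the point is the shape of the pipeline: scale, split, fit each algorithm, then score both metrics on the held-out validation set.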
Current methods of early diagnosis are only 53% accurate, while my machine learning model produces 98% accuracy. This 45-percentage-point increase is critical because an accurate, early diagnosis is needed to treat the disease effectively. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated and 80% of striatal dopamine has been depleted; with an earlier diagnosis, much of this degradation could be slowed or treated. These results are very significant because Parkinson’s affects over 10 million people worldwide who could benefit greatly from an early, accurate diagnosis. Not only is my machine learning approach more accurate diagnostically, it is also more scalable, less expensive, and therefore more accessible to people who might not have access to established medical facilities and professionals. The diagnosis is also much simpler, requiring only a 10-15 second voice recording and producing an immediate result.

Future Research

Given more time and resources, I would investigate the following:
Create a mobile application that lets the user record his/her voice, extracts the necessary vocal features, and feeds them into my machine learning model to diagnose Parkinson’s.
Use larger datasets in conjunction with the University of Oxford dataset.
Tune and improve my models further to achieve even better results.
Investigate different structures and types of neural networks.
Construct a novel algorithm specifically suited to the prediction of Parkinson’s.
Generalize my findings and algorithms to all types of dementia disorders, such as Alzheimer’s.

References

Bind, Shubham. "A Survey of Machine Learning Based Approaches for Parkinson Disease Prediction." International Journal of Computer Science and Information Technologies 6 (2015). Web. 8 Mar. 2017.
Brooks, Megan. "Diagnosing Parkinson's Disease Still Challenging."
Medscape Medical News. National Institute of Neurological Disorders, 31 July 2014. Web. 20 Mar. 2017.
Little, M. A., P. E. McSharry, S. J. Roberts, D. A. E. Costello, and I. M. Moroz. "Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection." BioMedical Engineering OnLine 6:23 (26 June 2007).
Hashmi, Sumaiya F. "A Machine Learning Approach to Diagnosis of Parkinson’s Disease." Claremont Colleges Scholarship. Claremont College, 2013. Web. 10 Mar. 2017.
Karplus, Abraham. "Machine Learning Algorithms for Cancer Diagnosis." Mar. 2012. Web. 20 Mar. 2017.
Little, Max. "Parkinsons Data Set." UCI Machine Learning Repository. University of Oxford, 26 June 2008. Web. 20 Feb. 2017.
Ozcift, Akin, and Arif Gulten. "Classifier Ensemble Construction with Rotation Forest to Improve Medical Diagnosis Performance of Machine Learning Algorithms." Computer Methods and Programs in Biomedicine 104.3 (2011): 443-51. Web. 15 Mar. 2017.
"Parkinson’s Disease Dementia." UCI MIND, 19 Oct. 2015. Web. 17 Feb. 2017.
Salvatore, C., A. Cerasa, I. Castiglioni, F. Gallivanone, A. Augimeri, M. Lopez, G. Arabia, M. Morelli, M. C. Gilardi, and A. Quattrone. "Machine Learning on Brain MRI Data for Differential Diagnosis of Parkinson's Disease and Progressive Supranuclear Palsy." Journal of Neuroscience Methods 222 (2014): 230-37. Web. 18 Mar. 2017.
Shahbakhi, Mohammad, Danial Taheri Far, and Ehsan Tahami. "Speech Analysis for Diagnosis of Parkinson’s Disease Using Genetic Algorithm and Support Vector Machine." Journal of Biomedical Science and Engineering 7.4 (2014): 147-56. Web. 2 Mar. 2017.
"Speech and Communication." Parkinson's Disease Foundation, n.d. Web. 22 Mar. 2017.
Sriram, Tarigoppula V. S., M. Venkateswara Rao, G. V. Satya Narayana, and D. S. V. G. K. Kaladhar. "Diagnosis of Parkinson Disease Using Machine Learning and Data Mining Systems from Voice Dataset." SpringerLink. Springer, Cham. Web. 17 Mar. 2017.
Smart India Hackathon 2018: Identification of Meritorious Students in Primary Education

Problem Statement: The Gujarat government has nearly 90 lakh (9 million) students in primary education across the state, spread across cities and villages. There is no mechanism to identify bright students who are performing well in study, sports, or other activities. A web portal can be designed to acquire data about such students and analyze it on different parameters.

What exact problem is being solved? Identified students can be provided with extra resources, or special attention can be given to their upbringing.

Abstract

To identify meritorious students, all educational institutions first need to upload students' results, along with points for extracurricular activities (activity name and a performance score out of 10), to the database, keyed to each student's current class of study. Every student's Aadhar number will always be recorded (student details will be verified against it). A parent or any non-government institute can also upload a scanned copy of a result or certificate for any student, along with the student's Aadhar number and their own details; an admin will cross-check and verify it before the database is updated. Upon first login or registration, each school or institution receives a unique token (user ID and password) for the portal, and that login is further verified, so every institution has a unique user ID and password. Students' details will be uploaded yearly, with updates twice a year. The second part of the solution is to sort the data according to the students' merit: the designed application will process the provided data and present a shortlist of students' details (sized according to requirements).
A fast, optimal algorithm will be adopted to sort the data by marks and activity score from the database (e.g., a tree-type, level-based representation), in order to extract the records of meritorious students from the records of all students. The third and final part is providing the list of meritorious students to the education department and universities. Each official and university will also have a login section, and the list of meritorious students can be retrieved by year and by required field. The education department or a university can also post the facilities offered to shortlisted students as a notice.

In summary, we solve the stated problem with a web-based application comprising a web portal and a secured database to identify meritorious students in primary education, based on data uploaded and retrieved from institutions (100% of students), with a shortlist (roughly 20-30%, depending on the facilities specified) provided to the Education Department and universities.

Keywords: Aadhar number as the primary key of the student table; online web portal; records updated every year to track improvement; standardization and sorting of data based on the z-statistic to filter meritorious students by academics and extracurricular activities; tree-type, level-based representation of the database (Admin - Institute - Student).

Use Case

Choice-based selection of meritorious students from the dataset. For instance, if the requirement is limited to academics, users can fetch a list of top scorers from the website, say the top 100 or top 200 students. If the requirement is limited to a particular extracurricular activity, such as singing, painting, or dancing, they can fetch the list of students with expertise in that field only.
Identification of underprivileged meritorious students, and funding-based support from NGOs, organizations, and donors who wish to provide it.

Supervision of the data entered each year (region-based) to track the individual growth of each student. For instance, suppose a diligent student X has been receiving a scholarship every year, and then X's data is not registered in the database the following year; such a drop-off in the sample can be flagged for follow-up.

Highlighting social issues such as child labour and child trafficking through year-wise monitoring of the data, and helping prevent child marriage of girls: if a girl is found not to have registered in a consecutive year, an investigation team can take action accordingly.

Special Features:
• Schools should submit their data to gain recognition and to be visible to fund-providing parties (governmental or non-governmental).
• Students benefit from direct communication between officials and students, with no middleman in between.
• Data analysis is the key to identification, using assignment of z-marks via the standard normal distribution.

Technology Stack: We will build a web-based app in a modular, microservice-style structure, where the application is broken into fragments that each do a different job. One part takes in data from a web portal built with CSS, JavaScript, PHP, and servlets. Computation over the sorted data and the various mathematical calculations (arranging the sorted data according to the given criteria, etc.) run on a platform powered by Java. Another part integrates with the APIs of various education departments and universities to supply them with shortlisted meritorious students, matched to their own choices and cut-offs, and to notify shortlisted students via posted notices.
This design keeps in mind the ease of obtaining marks and details, which has increased throughout the years. In the web app, after first login or registration, each education department, university, and institution has a unique token (user ID and password) for the database. For the database itself, MySQL, Oracle, or MongoDB can be used, with a dashboard powered by Python or JavaScript on a network frame. Since the app will contain the academic details of a huge number of students, a strong encryption algorithm is needed for data integrity and security: AES-256 is a good choice for encrypting data in the database. (MD5, by contrast, is a cryptographically broken hash function and should not be relied on; a modern hash such as SHA-256 would be safer for integrity checks.) Biometric data will also be preserved for authentication.
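The z-score ("Z-stat") merit ranking described in the keywords could be sketched as follows. This is a hypothetical illustration: the student records, field names, and the choice to weight academics and activities equally are all assumptions, not the portal's actual schema or formula:

```python
from statistics import mean, stdev

# Hypothetical student records (Aadhar IDs masked, values illustrative).
students = [
    {"aadhar": "XXXX-1", "marks": 92, "activity": 7},
    {"aadhar": "XXXX-2", "marks": 81, "activity": 9},
    {"aadhar": "XXXX-3", "marks": 88, "activity": 5},
]

def z_scores(values):
    """Standardize a list of values: (x - mean) / standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Standardize marks and activity scores separately so the two
# differently-scaled measures can be combined fairly.
z_marks = z_scores([s["marks"] for s in students])
z_act = z_scores([s["activity"] for s in students])

# Composite merit = sum of the two z-scores; sort descending.
for s, zm, za in zip(students, z_marks, z_act):
    s["merit"] = zm + za
ranked = sorted(students, key=lambda s: s["merit"], reverse=True)
print([s["aadhar"] for s in ranked])
```

Standardizing before combining is the key idea: a raw sum would let the 0-100 marks scale swamp the 0-10 activity scale.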
neerajchowdary889
This tool makes it easy to gather information from Reddit and its subreddits. With just a few clicks, you can create datasets for analysis, research, or any other project you have in mind. Leveraging Reddit's PRAW library, our tool offers a user-friendly API to extract data from multiple subreddits: posts, comments, user details, and more.
HunterKane
Project 1: Exploratory Data Analysis & Data Presentation (Movies Dataset)

Project Brief for Self-Coders: Here you'll have the opportunity to code major parts of Project 1 on your own. If you need any help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. Keep in mind that it's all about getting the right results and conclusions, not about writing identical code: things can be coded in many different ways, and even if you come to the same conclusions, it's very unlikely that we have the very same code.

Data Import and First Inspection: Import the movies dataset from the CSV file "movies_complete.csv" and inspect the data.
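A minimal first-inspection sketch with pandas. A tiny inline CSV stands in here so the snippet is self-contained; in the actual project you would pass the path to "movies_complete.csv", whose real columns differ from these illustrative ones:

```python
import io
import pandas as pd

# Illustrative stand-in for movies_complete.csv (real file has many more columns).
csv_text = """title,budget,revenue,vote_average
Movie A,1000000,5000000,7.1
Movie B,2000000,1500000,5.9
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())      # first rows
print(df.shape)       # (rows, columns)
df.info()             # dtypes and non-null counts
print(df.describe())  # summary statistics for numeric columns
```

`head()`, `info()`, and `describe()` are the usual first three calls for any new dataset: they reveal the column types, missing values, and value ranges before any real analysis starts.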
Anyra20
This project is divided into two major parts. In the first part, you will conduct an exploratory data analysis on a PISA 2012 survey dataset. You will use Python data science and data visualization libraries to explore the dataset’s variables and understand the data’s structure, oddities, patterns, and relationships. The analysis in this part should be structured, going from simple univariate relationships up through multivariate relationships, but it does not need to be clean or perfect. There is no one single answer that needs to come out of a given dataset. This part of the project is your opportunity to ask questions of the data and make your own discoveries. It’s important to keep in mind that sometimes exploration can lead to dead ends, and that it can take multiple steps to dig down to what you’re truly looking for. Be patient with your steps, document your work carefully, and be thorough in the perspective that you choose to take with your dataset. In the second part, you will take your main findings from your exploration and convey them to others through an explanatory analysis. To this end, you will create a slide deck that leverages polished, explanatory visualizations to communicate your results. This part of the project should make heavy use of the first part of the project. Select one or two major paths in your exploration, choose relevant visualizations along that path, and then polish them to construct a story for your readers to understand what you found.
Rishika0812
The MIND Dataset (Microsoft News Dataset) is a large-scale dataset for news recommendation research. It consists of news articles and user interactions, collected to facilitate the study of personalized news recommendation systems. The dataset contains detailed information such as news titles, abstracts, categories, and user click histories.
nusnlp
This repository contains the datasets, code, and scripts to conduct the analysis in paper Mind the Biases: Quantifying Cognitive Biases in Language Model Prompting
Exploratory Data Analysis (EDA), preprocessing the MIND dataset, implementing the GNN model, training, evaluating, visualizing results, and frontend implementation
This project focuses on conducting an Exploratory Data Analysis (EDA) of the AMCAT dataset, which contains employment outcomes of engineering graduates. Released by Aspiring Minds, the dataset comprises approximately 4,000 candidates and 40 features, including demographic information, academic performance, and AMCAT test scores.
kirollos2001
Data Mind – AI Data Analysis Assistant (Python, Streamlit, LLMs) Built an AI-powered assistant that enables natural-language querying of datasets and automatic generation of interactive visualizations and dashboards. Utilized LangChain and RAG with FAISS to analyze data and deliver insights through a Streamlit interface.
This project is divided into two major parts. In the first part, you will conduct an exploratory data analysis on a dataset of your choosing. You will use Python data science and data visualization libraries to explore the dataset’s variables and understand the data’s structure, oddities, patterns and relationships. The analysis in this part should be structured, going from simple univariate relationships up through multivariate relationships, but it does not need to be clean or perfect. There is no one single answer that needs to come out of a given dataset. This part of the project is your opportunity to ask questions of the data and make your own discoveries. It’s important to keep in mind that sometimes exploration can lead to dead ends, and that it can take multiple steps to dig down to what you’re truly looking for. Be patient with your steps, document your work carefully, and be thorough in the perspective that you choose to take with your dataset. In the second part, you will take your main findings from your exploration and convey them to others through an explanatory analysis. To this end, you will create a slide deck that leverages polished, explanatory visualizations to communicate your results. This part of the project should make heavy use of the first part of the project. Select one or two major paths in your exploration, choose relevant visualizations along that path, and then polish them to construct a story for your readers to understand what you found.
aceyourgrace
As a reader, I often struggle when it comes to determining my next read. It takes more than just a Google search or a friend's suggestion to find a book that meets all our preferences (especially when you are a bibliophile). So wouldn't it be much easier if we could simply look at different visual graphs and make an objective decision based on our likings? Keeping these things in mind, among many others, I decided to perform an exploratory data analysis on a dataset containing over 52,000 books. You can think of it as "A Book Recommender System Driven By Data".
vaishnavi2207
FINGERPRINT TEMPLATE PROTECTION USING PATTERN TRANSFORMATION: The fingerprint template protection technique using pattern transformation has been proposed keeping in mind the challenges prevailing in the field of biometric template security, particularly preserving privacy. Analysis of user behavior and traits is important in the field of security systems: the better we secure systems, the more it helps privacy protection. This project takes a holistic approach to the critical issues in the domain. The proposed framework for fingerprint template protection has been developed to be as easy as possible to implement. The results are fairly consistent when tested with different datasets, and experimental results show that the proposed work is highly accurate and secure, with a very low EER of 0.1%.
Introduction - Brain tumor detection project. This project comprises a program that takes a brain Magnetic Resonance Image (MRI) and outputs a diagnosis: the presence or absence of a tumor in that brain. Why this task? In clinical analysis, checking for brain tumors among a large number of MRI images usually takes specialists a long time. For instance, in the tests for this project, a patient has around 200 MRI images, but tumor tissue appears in only 15 of them. This project therefore aims to automatically detect tumor tissue in a large amount of MRI image data. Our objectives: 1) automatically detect whether tumor tissue appears in an MRI image; 2) automatically segment brain tumors in MRI images. When we give an image to the program, the output is the probability that the brain contains a tumor, so we could prioritize the patients whose magnetic resonance scans show higher probabilities of a tumor and treat them first. Another objective could be to shift the responsibility of reviewing these images from the specialists to the machine, which in the long run could have greater detection capacity, having learned by viewing an enormous number of images with known true diagnoses. This would be a form of collaboration between people and machines. The dataset: The dataset used in the project is a set of images with and without tumors, for which we know the true diagnosis. You can find it here: https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection Data Preprocessing: 1. First, we do some image preprocessing, and then pass 80% of the images to a neural network so that it learns and becomes capable of making an accurate diagnosis of a new image. 2. The other 20% of the images are used to test the model: we compare their true diagnoses with the ones the model gives, to see how it performs.
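The 80/20 train/test split described in the preprocessing steps could be sketched as below. The file names are illustrative placeholders, not the actual Kaggle dataset's files:

```python
import random

# Illustrative stand-in for the MRI image files (e.g., 200 images per patient).
images = [f"mri_{i:03d}.png" for i in range(200)]

random.seed(0)        # fixed seed so the split is reproducible
random.shuffle(images)  # shuffle before splitting to avoid ordering bias

split = int(0.8 * len(images))
train_set, test_set = images[:split], images[split:]
print(len(train_set), len(test_set))  # 160 40
```

Shuffling before the split matters: medical image files are often ordered by patient or slice position, and splitting without shuffling would give the network a biased sample.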
senagulhan
No description available
AndersonOliveiraDaRocha
Oil analysis with the Gapminder dataset
No description available
This repository includes the relevant files for the Gapminder dataset investigation, which was submitted and accepted as the 2nd project of Udacity's Professional Data Analysis Nanodegree Program.
om-okg
A data analysis project on the AMCAT dataset. The dataset was released by Aspiring Minds from the Aspiring Mind Employment Outcome 2015 (AMEO) survey.
everymind
Instructions and code for re-creating analysis done on the Surprising Minds at Sea Life Brighton dataset
JureSindi
News recommendation system using GRU-enhanced DKN models with fairness and bias analysis on the MIND dataset.
NSafarian
This repo contains files associated with weighted gene correlation network analysis (WGCNA) on "Common Mind Consortium (CMC)" datasets.
anniechiennn
Performing analysis of the MIND dataset and building NLP model to create a news recommendation system and deliver content personalization.
KonankiSaiCharan
Performed behavioral analysis on the Young Minds dataset to understand the impact of spending patterns and place of birth on how happy a person is.
sachoumdh
A Jupyter Notebook and a CSV dataset used for practicing data analysis with Python and Pandas. It includes exercises on dataset exploration, handling missing values, removing duplicates, and filtering information from the Open Minds Club members dataset.
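The exercises listed above could be sketched like this; the tiny frame and its column names are illustrative stand-ins for the Open Minds Club members CSV, not its real schema:

```python
import pandas as pd

# Illustrative stand-in for the members dataset.
df = pd.DataFrame({
    "name": ["Ada", "Ben", "Ben", None],
    "year": [2021, 2022, 2022, 2023],
})

df = df.drop_duplicates()        # remove exact duplicate rows
df = df.dropna(subset=["name"])  # drop rows with a missing name
recent = df[df["year"] >= 2022]  # filter: members who joined since 2022
print(recent)
```

These three calls, `drop_duplicates`, `dropna`, and boolean-mask filtering, cover the cleaning steps the notebook practices.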
Tulsani
Principal Component Analysis of a dataset to reduce its dimensionality and make trends easier to observe. Principal components are chosen to retain the highest variance.
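A minimal sketch of choosing principal components by highest variance, using scikit-learn on synthetic data (the data and component count are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 5 features; one feature is given
# much higher variance so the first component should capture it.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] *= 10

pca = PCA(n_components=2)          # keep the 2 highest-variance directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance per component
```

`explained_variance_ratio_` is the quantity behind "keeping highest variance in mind": components are sorted by it, so truncating to the first k keeps the most informative directions.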
Welcome to the GitHub repository for my exploratory data analysis (EDA) project focused on the Aspiring Minds Employability Outcomes dataset from 2015. In this repository, you will find a comprehensive analysis of this unique dataset, along with the code, visualizations, and insights gained from the exploration.
EvgenieTiv
Exploratory and statistical analysis of the *Healthy Minds Study* dataset, focused on depression predictors among U.S. students. Includes preprocessing, visualization, hypothesis testing, and cross-validation of statistical findings.
akashmoses97
A structural and modeling study of large-scale news recommendation systems using the MIND dataset. This is a course-based project of CSCE-676 : Data Mining & analysis @ Texas A&M University
HLovisiEnnes
Here are some useful tools I have written for working on topological data analysis. Some of the code was developed with specific datasets in mind, but most should generalize well.