Aastha2104
Introduction

Parkinson’s Disease is the second most prevalent neurodegenerative disorder after Alzheimer’s, affecting more than 10 million people worldwide. Parkinson’s is characterized primarily by the deterioration of motor and cognitive ability. There is no single test that can be administered for diagnosis; instead, doctors must perform a careful clinical analysis of the patient’s medical history. Unfortunately, this method of diagnosis is highly inaccurate: a study from the National Institute of Neurological Disorders finds that early diagnosis (having symptoms for 5 years or less) is only 53% accurate. This is not much better than random guessing, yet an early diagnosis is critical to effective treatment. Because of these difficulties, I investigate a machine learning approach to accurately diagnosing Parkinson’s, using a dataset of various speech features (a non-invasive yet characteristic tool) from the University of Oxford.

Why speech features? Speech is highly predictive and characteristic of Parkinson’s disease; almost every Parkinson’s patient experiences severe vocal degradation (inability to produce sustained phonations, tremor, hoarseness), so it makes sense to use voice to diagnose the disease. Voice analysis has the added benefit of being non-invasive, inexpensive, and easy to perform clinically.

Background: Parkinson’s Disease

Parkinson’s is a progressive neurodegenerative condition resulting from the death of the dopamine-containing cells of the substantia nigra (which plays an important role in movement). Symptoms include “frozen” facial features, bradykinesia (slowness of movement), akinesia (impairment of voluntary movement), tremor, and voice impairment. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated and 80% of striatal dopamine has been depleted.
Performance Metrics

TP = true positive, FP = false positive, TN = true negative, FN = false negative
Accuracy: (TP+TN)/(P+N)
Matthews Correlation Coefficient (MCC): 1 = perfect, 0 = random, -1 = completely inaccurate

Algorithms Employed

Logistic Regression (LR): Uses the sigmoid logistic function with weights (coefficient values) and biases (constants) to model the probability of a class for binary classification. An output near 1 represents one class, and an output near 0 represents the other. Training learns the optimal weights and biases.

Linear Discriminant Analysis (LDA): Assumes that the data is Gaussian and each feature has the same variance. LDA estimates the mean and variance for each class from the training data, then uses Bayes’ theorem and the properties of the Gaussian distribution to compute the probability that a particular instance belongs to each class. The class with the largest probability is the prediction.

k Nearest Neighbors (KNN): Makes predictions about the validation set using the entire training set. KNN predicts a new instance by searching the training set for the k “closest” instances, where closeness is measured with a proximity metric (e.g., Euclidean distance) across all features. The class held by the majority of those k instances is the predicted class.

Decision Tree (DT): Represented by a binary tree, where each internal node represents an input variable and a split point, and each leaf node contains an output used to make a prediction.

Neural Network (NN): Models the way the human brain makes decisions. Each neuron takes in one or more inputs, then applies an activation function, with weights and biases, to produce an output. Neurons are arranged into layers, and multiple layers form a network that can model complex decisions. Training the network uses the training instances to optimize the weights and biases.
Naive Bayes (NB): Simplifies the calculation of probabilities by assuming that all features are independent of one another (a strong but effective assumption). Applies Bayes’ theorem to compute the probability that the instance belongs to each class, then predicts the class with the highest probability.

Gradient Boost (GB): Generally used when seeking a model with very high predictive performance. Reduces bias and variance (“error”) by combining multiple “weak learners” (poorly performing models) into a “strong learner” (a high-performance model). Involves three elements: a loss function to be optimized, a weak learner (a decision tree) to make predictions, and an additive model that adds trees one by one to minimize the loss function, with gradient descent used to minimize the error after each addition.

Engineering Goal

Produce a machine learning model that diagnoses Parkinson’s disease from features of a patient’s speech with at least 90% accuracy and/or a Matthews Correlation Coefficient of at least 0.9. Compare various algorithms and parameters to determine the best model for predicting Parkinson’s.
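The accuracy and Matthews Correlation Coefficient defined under Performance Metrics can be computed directly from confusion-matrix counts. A minimal Python sketch (the counts are hypothetical, not results from this project):

```python
# Hypothetical confusion-matrix counts from a validation set.
import math

tp, fp, tn, fn = 45, 2, 40, 3

# Accuracy: correct predictions over all predictions, (TP+TN)/(P+N).
accuracy = (tp + tn) / (tp + fp + tn + fn)

# Matthews Correlation Coefficient: 1 = perfect, 0 = random, -1 = inverse.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"accuracy = {accuracy:.3f}, MCC = {mcc:.3f}")
```

MCC is worth reporting alongside accuracy here because the dataset is imbalanced (147 Parkinson's instances vs. 48 without), and accuracy alone can look good for a classifier that favors the majority class.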
Dataset Description

Source: the University of Oxford. 195 instances (147 with Parkinson’s, 48 without), 22 features (elements possibly characteristic of Parkinson’s, such as frequency, pitch, and the amplitude/period of the sound wave), and 1 label (1 for Parkinson’s, 0 for no Parkinson’s).

Project Pipeline
(Pipeline diagram omitted.)

Summary of Procedure

Split the Oxford Parkinson’s dataset into two parts: one for training and one for validation (to evaluate how well the model performs). Train each of the following algorithms on the training set: Logistic Regression, Linear Discriminant Analysis, k Nearest Neighbors, Decision Tree, Neural Network, Naive Bayes, Gradient Boost. Evaluate the results on the validation set. Repeat for the following train/validation splits: 80%/20%, 75%/25%, and 70%/30%. Repeat for a rescaled version of the dataset (scaling every number to the range 0 to 1, which helps reduce the effect of outliers). Conduct 5 trials and average the results.

Data
(Results figures omitted.)

Data Analysis

In general, the models performed best (in both accuracy and Matthews Correlation Coefficient) on the rescaled dataset with a 75/25 train/test split. The two highest-performing algorithms, k Nearest Neighbors and the Neural Network, both achieved an accuracy of 98%. The NN achieved an MCC of 0.96, while KNN achieved an MCC of 0.94. These figures outperform most existing literature and significantly outperform current methods of diagnosis.

Conclusion and Significance

These robust results suggest that a machine learning approach can indeed be implemented to significantly improve diagnosis of Parkinson’s disease. Given the necessity of early diagnosis for effective treatment, my machine learning models provide a very promising alternative to the current, rather ineffective method of diagnosis.
Current methods of early diagnosis are only 53% accurate, while my machine learning model achieves 98% accuracy. This 45-percentage-point increase is critical because an accurate, early diagnosis is needed to treat the disease effectively. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated and 80% of striatal dopamine has been depleted; with an earlier diagnosis, much of this degradation could be slowed or treated. These results are significant because Parkinson’s affects over 10 million people worldwide who could benefit greatly from an early, accurate diagnosis. Not only is the machine learning approach more accurate, it is also more scalable, less expensive, and therefore more accessible to people who lack access to established medical facilities and professionals. The diagnosis is also much simpler, requiring only a 10-15 second voice recording and producing an immediate result.

Future Research

Given more time and resources, I would investigate the following: create a mobile application that lets the user record his/her voice, extracts the necessary vocal features, and feeds them into my machine learning model to diagnose Parkinson’s; use larger datasets in conjunction with the University of Oxford dataset; tune and improve my models even further; investigate different structures and types of neural networks; construct a novel algorithm specifically suited to the prediction of Parkinson’s; and generalize my findings and algorithms to other neurodegenerative disorders, such as Alzheimer’s.

References

Bind, Shubham. "A Survey of Machine Learning Based Approaches for Parkinson Disease Prediction." International Journal of Computer Science and Information Technologies 6 (2015). Web. 8 Mar. 2017.
Brooks, Megan. "Diagnosing Parkinson's Disease Still Challenging." Medscape Medical News. National Institute of Neurological Disorders, 31 July 2014. Web. 20 Mar. 2017.
Little, M. A., P. E. McSharry, S. J. Roberts, D. A. E. Costello, and I. M. Moroz. "Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection." BioMedical Engineering OnLine 6:23 (26 June 2007).
Hashmi, Sumaiya F. "A Machine Learning Approach to Diagnosis of Parkinson’s Disease." Claremont Colleges Scholarship. Claremont College, 2013. Web. 10 Mar. 2017.
Karplus, Abraham. "Machine Learning Algorithms for Cancer Diagnosis." Mar. 2012. Web. 20 Mar. 2017.
Little, Max. "Parkinsons Data Set." UCI Machine Learning Repository. University of Oxford, 26 June 2008. Web. 20 Feb. 2017.
Ozcift, Akin, and Arif Gulten. "Classifier Ensemble Construction with Rotation Forest to Improve Medical Diagnosis Performance of Machine Learning Algorithms." Computer Methods and Programs in Biomedicine 104.3 (2011): 443-51. Web. 15 Mar. 2017.
"Parkinson’s Disease Dementia." UCI MIND. 19 Oct. 2015. Web. 17 Feb. 2017.
Salvatore, C., A. Cerasa, I. Castiglioni, F. Gallivanone, A. Augimeri, M. Lopez, G. Arabia, M. Morelli, M. C. Gilardi, and A. Quattrone. "Machine Learning on Brain MRI Data for Differential Diagnosis of Parkinson's Disease and Progressive Supranuclear Palsy." Journal of Neuroscience Methods 222 (2014): 230-37. Web. 18 Mar. 2017.
Shahbakhi, Mohammad, Danial Taheri Far, and Ehsan Tahami. "Speech Analysis for Diagnosis of Parkinson’s Disease Using Genetic Algorithm and Support Vector Machine." Journal of Biomedical Science and Engineering 7.4 (2014): 147-56. Web. 2 Mar. 2017.
"Speech and Communication." Parkinson's Disease Foundation, n.d. Web. 22 Mar. 2017.
Sriram, Tarigoppula V. S., M. Venkateswara Rao, G. V. Satya Narayana, and D. S. V. G. K. Kaladhar. "Diagnosis of Parkinson Disease Using Machine Learning and Data Mining Systems from Voice Dataset." SpringerLink. Springer, Cham. Web. 17 Mar. 2017.
Aryia-Behroziuan
An ANN is a model based on a collection of connected units or nodes called "artificial neurons", which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit information, a "signal", from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called "edges". Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold.

Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times.

The original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis.

Deep learning consists of multiple hidden layers in an artificial neural network. This approach tries to model the way the human brain processes light and sound into vision and hearing.
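The weighted-sum-plus-activation behaviour of a single artificial neuron described above can be sketched in a few lines; the particular weights, bias, and sigmoid activation are illustrative choices:

```python
# Minimal sketch of one artificial neuron: a weighted sum of the inputs
# plus a bias, passed through a non-linear activation (logistic sigmoid).
import math

def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

out = neuron([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=-0.05)
print(out)
```

Stacking many such neurons into layers, and feeding each layer's outputs to the next, gives the feedforward networks the text describes.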
Some successful applications of deep learning are computer vision and speech recognition.[68]

Decision trees
Main article: Decision tree learning

Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making.

Support vector machines
Main article: Support vector machines

Support vector machines (SVMs), also known as support vector networks, are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.[69] An SVM training algorithm is a non-probabilistic, binary, linear classifier, although methods such as Platt scaling exist to use SVM in a probabilistic classification setting. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
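The kernel trick can be seen on a toy dataset of concentric circles, which no straight line separates: a linear SVM does little better than chance, while an RBF-kernel SVM fits the same data almost perfectly. scikit-learn is assumed here as the implementation:

```python
# Linear vs. RBF-kernel SVM on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: inner ring is one class, outer the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # kernel trick: implicit feature mapping

print(linear.score(X, y), rbf.score(X, y))
```

The RBF kernel implicitly maps each point into a higher-dimensional space where the two rings become separable by a hyperplane, which is exactly the mechanism the paragraph above describes.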
Regression analysis
Main article: Regression analysis

Regression analysis encompasses a large variety of statistical methods to estimate the relationship between input variables and their associated features. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is often extended by regularization methods to mitigate overfitting and bias, as in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression (for example, used for trendline fitting in Microsoft Excel[70]), logistic regression (often used in statistical classification) or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to a higher-dimensional space.

Bayesian networks
Main article: Bayesian network

A simple Bayesian network: rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet. A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independence with a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning. Bayesian networks that model sequences of variables, like speech signals or protein sequences, are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.
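The rain/sprinkler/wet-grass example can be made concrete by enumerating the joint distribution; the conditional probability tables below are hypothetical numbers chosen for illustration:

```python
# Inference in the rain -> sprinkler -> wet-grass Bayesian network by
# enumeration. All probabilities are illustrative, not from any dataset.
P_rain = 0.2
P_sprinkler = {True: 0.01, False: 0.4}             # P(sprinkler | rain)
P_wet = {(True, True): 0.99, (True, False): 0.8,
         (False, True): 0.9, (False, False): 0.0}  # P(wet | rain, sprinkler)

def joint(rain, sprinkler, wet):
    """Joint probability, factored along the DAG's edges."""
    p = P_rain if rain else 1 - P_rain
    p *= P_sprinkler[rain] if sprinkler else 1 - P_sprinkler[rain]
    p *= P_wet[(rain, sprinkler)] if wet else 1 - P_wet[(rain, sprinkler)]
    return p

# P(rain | grass is wet): sum out the hidden sprinkler variable.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(f"P(rain | wet) = {num / den:.3f}")
```

This brute-force enumeration is exponential in the number of variables; the efficient inference algorithms mentioned above (e.g., variable elimination) exploit the DAG's conditional independences to avoid that cost.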
Genetic algorithms
Main article: Genetic algorithm

A genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process of natural selection, using methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms were used in the 1980s and 1990s.[71][72] Conversely, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[73]

Training models

Machine learning models usually require a lot of data in order to perform well. When training a model, one needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, or data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model.

Federated learning
Main article: Federated learning

Federated learning is an adapted form of distributed artificial intelligence for training machine learning models that decentralizes the training process, allowing users' privacy to be maintained by not needing to send their data to a centralized server. This also increases efficiency by decentralizing the training process to many devices.
For example, Gboard uses federated machine learning to train search query prediction models on users' mobile phones without having to send individual searches back to Google.[74]

Applications

There are many applications for machine learning, including: agriculture, anatomy, adaptive websites, affective computing, banking, bioinformatics, brain–machine interfaces, cheminformatics, citizen science, computer networks, computer vision, credit-card fraud detection, data quality, DNA sequence classification, economics, financial market analysis,[75] general game playing, handwriting recognition, information retrieval, insurance, internet fraud detection, linguistics, machine learning control, machine perception, machine translation, marketing, medical diagnosis, natural language processing, natural language understanding, online advertising, optimization, recommender systems, robot locomotion, search engines, sentiment analysis, sequence mining, software engineering, speech recognition, structural health monitoring, syntactic pattern recognition, telecommunication, theorem proving, time series forecasting, and user behavior analytics.

In 2006, the media-services provider Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy of its existing Cinematch movie recommendation algorithm by at least 10%.
A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[76] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and they changed their recommendation engine accordingly.[77] In 2010 The Wall Street Journal wrote about the firm Rebellion Research and their use of machine learning to predict the financial crisis.[78] In 2012, co-founder of Sun Microsystems Vinod Khosla predicted that 80% of medical doctors' jobs would be lost in the next two decades to automated machine learning medical diagnostic software.[79] In 2014, it was reported that a machine learning algorithm had been applied in the field of art history to study fine art paintings and that it may have revealed previously unrecognized influences among artists.[80] In 2019 Springer Nature published the first research book created using machine learning.[81]

Limitations

Although machine learning has been transformative in some fields, machine-learning programs often fail to deliver expected results.[82][83][84] Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.[85] In 2018, a self-driving car from Uber failed to detect a pedestrian, who was killed after a collision.[86] Attempts to use machine learning in healthcare with the IBM Watson system failed to deliver even after years of time and billions of dollars invested.[87][88]

Bias
Main article: Algorithmic bias

Machine learning approaches in particular can suffer from different data biases.
A machine learning system trained on current customers only may not be able to predict the needs of new customer groups that are not represented in the training data. When trained on man-made data, machine learning is likely to pick up the same constitutional and unconscious biases already present in society.[89] Language models learned from data have been shown to contain human-like biases.[90][91] Machine learning systems used for criminal risk assessment have been found to be biased against black people.[92][93] In 2015, Google Photos would often tag black people as gorillas,[94] and in 2018 this still was not well resolved; Google reportedly was still using the workaround of removing all gorillas from the training data, and thus was not able to recognize real gorillas at all.[95] Similar issues with recognizing non-white people have been found in many other systems.[96] In 2016, Microsoft tested a chatbot that learned from Twitter, and it quickly picked up racist and sexist language.[97] Because of such challenges, the effective use of machine learning may take longer to be adopted in other domains.[98] Concern for fairness in machine learning, that is, reducing bias in machine learning and propelling its use for human good, is increasingly expressed by artificial intelligence scientists, including Fei-Fei Li, who reminds engineers that "There’s nothing artificial about AI... It’s inspired by people, it’s created by people, and—most importantly—it impacts people. It is a powerful tool we are only just beginning to understand, and that is a profound responsibility."[99]

Model assessments

Classification of machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data into a training and a test set (conventionally 2/3 training and 1/3 test) and evaluates the performance of the trained model on the test set.
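The holdout and cross-validation assessments can be sketched with scikit-learn (an assumption here; the bundled breast-cancer dataset stands in for any labeled data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Holdout: the conventional 2/3 training, 1/3 test designation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
holdout = model.fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold cross-validation: each fold serves once as the evaluation set
# while the remaining 4 folds train the model.
kfold = cross_val_score(model, X, y, cv=5)

print(f"holdout = {holdout:.3f}, 5-fold mean = {kfold.mean():.3f}")
```

The K-fold estimate uses every instance for both training and evaluation (in different folds), so it typically has lower variance than a single holdout split at the cost of K model fits.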
In comparison, the K-fold cross-validation method randomly partitions the data into K subsets, and then K experiments are performed, each using one subset for evaluation and the remaining K-1 subsets for training. In addition to the holdout and cross-validation methods, the bootstrap, which samples n instances with replacement from the dataset, can be used to assess model accuracy.[100] In addition to overall accuracy, investigators frequently report sensitivity and specificity, meaning the true positive rate (TPR) and true negative rate (TNR) respectively. Similarly, investigators sometimes report the false positive rate (FPR) as well as the false negative rate (FNR). However, these rates are ratios that fail to reveal their numerators and denominators. The total operating characteristic (TOC) is an effective method to express a model's diagnostic ability: TOC shows the numerators and denominators of the previously mentioned rates, and thus provides more information than the commonly used receiver operating characteristic (ROC) and ROC's associated area under the curve (AUC).[101]

Ethics

Machine learning poses a host of ethical questions. Systems trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices.[102] For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants by similarity to previous successful applicants.[103][104] Responsible collection of data and documentation of the algorithmic rules used by a system is thus a critical part of machine learning. Because human languages contain biases, machines trained on language corpora will necessarily also learn these biases.[105][106] Other forms of ethical challenges, not related to personal biases, are more prominent in health care.
There are concerns among health care professionals that these systems might not be designed in the public's interest but as income-generating machines. This is especially true in the United States, where there is a long-standing ethical dilemma of improving health care while also increasing profits. For example, the algorithms could be designed to provide patients with unnecessary tests or medication in which the algorithm's proprietary owners hold stakes. There is huge potential for machine learning in health care to give professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these "greed" biases, are addressed.[107]

Hardware

Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks (a particular narrow subdomain of machine learning) that contain many layers of non-linear hidden units.[108] By 2019, graphics processing units (GPUs), often with AI-specific enhancements, had displaced CPUs as the dominant method of training large-scale commercial cloud AI.[109] OpenAI estimated the hardware compute used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017), and found a 300,000-fold increase in the amount of compute required, with a doubling-time trendline of 3.4 months.[110][111]

Software

Software suites containing a variety of machine learning algorithms include the following: Free and open-source so
reddyprasade
Technical Skills to Prepare

Here are the essential skills that a Machine Learning Engineer needs, as mentioned in the Read Me files. Within each group are topics that you should be familiar with.

Study Tip: Copy and paste this list into a document and save it to your computer for easy referral.

Computer Science Fundamentals and Programming
Data structures: lists, stacks, queues, strings, hash maps, vectors, matrices, classes & objects, trees, graphs, etc.
Algorithms: recursion, searching, sorting, optimization, dynamic programming, etc.
Computability and complexity: P vs. NP, NP-complete problems, big-O notation, approximate algorithms, etc.
Computer architecture: memory, cache, bandwidth, threads & processes, deadlocks, etc.

Probability and Statistics
Basic probability: conditional probability, Bayes rule, likelihood, independence, etc.
Probabilistic models: Bayes nets, Markov decision processes, hidden Markov models, etc.
Statistical measures: mean, median, mode, variance, population parameters vs. sample statistics, etc.
Proximity and error metrics: cosine similarity, mean-squared error, Manhattan and Euclidean distance, log-loss, etc.
Distributions and random sampling: uniform, normal, binomial, Poisson, etc.
Analysis methods: ANOVA, hypothesis testing, factor analysis, etc.

Data Modeling and Evaluation
Data preprocessing: munging/wrangling, transforming, aggregating, etc.
Pattern recognition: correlations, clusters, trends, outliers & anomalies, etc.
Dimensionality reduction: eigenvectors, Principal Component Analysis, etc.
Prediction: classification, regression, sequence prediction, etc.; suitable error/accuracy metrics.
Evaluation: training-testing split, sequential vs. randomized cross-validation, etc.

Applying Machine Learning Algorithms and Libraries
Models: parametric vs. nonparametric, decision tree, nearest neighbor, neural net, support vector machine, ensemble of multiple models, etc.
Learning procedure: linear regression, gradient descent, genetic algorithms, bagging, boosting, and other model-specific methods; regularization, hyperparameter tuning, etc.
Tradeoffs and gotchas: relative advantages and disadvantages, bias and variance, overfitting and underfitting, vanishing/exploding gradients, missing data, data leakage, etc.

Software Engineering and System Design
Software interface: library calls, REST APIs, data collection endpoints, database queries, etc.
User interface: capturing user inputs & application events, displaying results & visualization, etc.
Scalability: map-reduce, distributed processing, etc.
Deployment: cloud hosting, containers & instances, microservices, etc.

Move on to the final lesson of this course to find lots of sample practice questions for each topic!
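Several of the proximity and error metrics listed under Probability and Statistics can be computed in a few lines; the two vectors here are arbitrary examples:

```python
# Euclidean and Manhattan distance, cosine similarity, and mean-squared
# error between two example vectors.
import math

a, b = [1.0, 2.0, 3.0], [2.0, 2.0, 4.0]

euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
manhattan = sum(abs(x - y) for x, y in zip(a, b))
cosine = (sum(x * y for x, y in zip(a, b))
          / (math.sqrt(sum(x * x for x in a))
             * math.sqrt(sum(y * y for y in b))))
mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

print(euclidean, manhattan, cosine, mse)
```

Note the different behaviours: the distances grow with vector magnitude, while cosine similarity depends only on the angle between the vectors, which is why it is popular for comparing documents of different lengths.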
Lekshmi2003-glitch
This repository contains R implementations of classification models including KNN, Logistic Regression, Decision Trees, and Naive Bayes. It explores the impact of Principal Component Analysis (PCA) on model performance using simulated data, comparing accuracy, sensitivity, specificity, and kappa statistics.
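Although the repository is in R, the same kind of comparison can be sketched in Python with scikit-learn; the simulated dataset, the number of components, and the choice of logistic regression as the classifier are all illustrative assumptions:

```python
# Accuracy of a classifier with and without PCA on simulated data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
with_pca = make_pipeline(StandardScaler(), PCA(n_components=5),
                         LogisticRegression(max_iter=1000))

acc_plain = plain.fit(X_tr, y_tr).score(X_te, y_te)
acc_pca = with_pca.fit(X_tr, y_tr).score(X_te, y_te)
print(f"no PCA: {acc_plain:.2f}, PCA(5): {acc_pca:.2f}")
```

Standardizing before PCA matters: principal components are driven by variance, so unscaled features with large ranges would otherwise dominate the projection.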
Stock market index prediction has always been under the radar of stalwarts from the domains of econometrics, statistics, and mathematics. It is a fascinating challenge: a major portion of the research community, which promotes the efficient market hypothesis (EMH), believes that no predictive model can accurately predict the fluctuations of the ever-changing market, while recent works in this field show that more advanced techniques like statistical modelling and machine learning can capture the movements of a time series with exceptional levels of accuracy. The stock market of a given country can be divided into its constituent sectors, each of which represents the behaviour of an entire domain rather than the performance of individual companies. This dissertation proposes how machine learning and a deep neural network technique, Long Short-Term Memory (LSTM), can be used to obtain strong results for predicting stock values. The IT sector of India is taken into account to analyse its rises and falls according to the trend of the market. Regression techniques are used to predict the probable closing values of the index, and classification methods to identify their pattern of movement. At first, a detailed machine learning approach is adopted using established methods: ensemble techniques such as bagging and boosting, random forest, multivariate regression, decision tree, support vector machines, MARS, logistic regression, and artificial neural networks. A univariate time series deep learning model (with 5 inputs) for regression was also implemented, which outperformed all the machine learning techniques, as expected.
valentin-schwind
Decision Tree for Inferential Statistics
huomarc
Due to the intricacies of the body of a fetus, as well as the rapid rate at which fetuses develop, fetal health care is one of the most difficult medical fields to practice effectively. Given these challenges, reducing child and maternal mortality rates is of paramount importance to every modern country. The reduction of child mortality is reflected in several of the United Nations' Sustainable Development Goals. The UN expects that by 2030, member countries will effectively end preventable deaths of newborns and children under 5 years of age (Child Survival and the SDGs, 2021). All member countries are aiming to reduce neonatal mortality to at least as low as 12 deaths per 1,000 live births and under-5 mortality to at least as low as 25 deaths per 1,000 live births (Child Survival and the SDGs, 2021). Maternal mortality is intrinsically tied to child mortality and continues to show wide racial/ethnic gaps, even today. Black, American Indian, and Alaska Native (AI/AN) women are two to three times more likely to die from pregnancy-related causes than white women, and that disparity only increases with age (National Center for Health Statistics). Thus, the necessity of mitigating child mortality as much as possible cannot be overstated. Electronic fetal monitoring, or cardiotocography, is a "visual representation" of uterine contractions and fetal heart rate, which has been recognized as a prominent indicator of fetal health since the 19th century (Petker, 2018). Certain fetal heart rate patterns are linked to non-reassuring fetal status, and therefore, fetal heart rate monitoring can help prevent poor fetal outcomes (Petker, 2018). The fetal heart rate monitor identifies the normal baseline rate and tracks variability, accelerations, and decelerations to provide insight into a baby's level of stress, oxygenation, acidemia (increase in hydrogen ion blood concentration), and other vital signs (Petker, 2018).
A host of models, including XGBoost, LightGBM, Gradient Boosting, Random Forest, Multi-layer Perceptron, Support Vector, Decision Tree, K-Nearest Neighbors, Linear Support Vector, and Logistic Regression, were applied to the cardiotocography data to predict fetal health outcomes and level of care.
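As a sketch of this kind of side-by-side model comparison (hypothetical: the actual cardiotocography features and labels are replaced here by a synthetic stand-in dataset), several scikit-learn classifiers can be fit and scored in a loop:

```python
# Synthetic stand-in for the cardiotocography data: 3 classes, as in the
# fetal-health labels (normal / suspect / pathological). Illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=21, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
# Fit each model on the training split and score it on the held-out split.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

The same dictionary-of-models pattern extends naturally to the boosted libraries (XGBoost, LightGBM) when they are installed.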
Data Science course taught by Juan Gabriel Gomila. Part 1 - Installing Python and packages needed for data science, machine learning, and data visualization Part 2 - Historical evolution of predictive analytics and machine learning Part 3 - Pre-processing and data cleaning Part 4 - Data handling and data wrangling, operations with datasets and most famous probability distributions Part 5 - Review of basic statistics, confidence intervals, hypothesis tests, correlation,... Part 6 - Simple linear regression, multiple linear regression and polynomial regression, categorical variables and treatment of outliers. Part 7 - Classification with logistic regression, maximum likelihood estimation, cross validation, K-fold cross validation, ROC curves Part 8 - Clustering, K-means, K-medoids, dendrograms and hierarchical clustering, elbow technique and silhouette analysis Part 9 - Classification with trees, random forests, pruning techniques, entropy, information maximization Part 10 - Support Vector Machines for classification and regression problems, nonlinear kernels, face recognition (how CSI works) Part 11 - K Nearest Neighbors, majority decision, programming machine learning algorithms vs Python libraries Part 12 - Principal Component Analysis, dimension reduction, LDA Part 13 - Deep learning, reinforcement learning, artificial and convolutional neural networks and TensorFlow
This project is about analyzing the Ionosphere dataset and measuring classification accuracy on the electromagnetic signal data. The radar data were gathered by a system in Goose Bay, Labrador, consisting of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system, and two attributes per pulse number describe each instance in this database. The dataset describes high-frequency antenna returns from high-energy particles in the atmosphere, and whether the return shows structure or not. The problem is a binary classification over 351 instances with 34 numerical attributes plus a class label. The majority of the data are continuous values ranging between -1 and 1, with one binomial variable that defines the type of the electromagnetic signal. The objective of the project is to measure the accuracy of classifying 'good' and 'bad' instances by feeding the dataset to the machine learning models mentioned below, and to report some measures that improve the overall performance of the models. Predicting good and bad signals is important because these signals propagate over long distances and contribute to better communication and navigation. We predict the good and bad signals using three methods, KNN, GLM, and decision tree, and then use an ensemble technique to improve the accuracy of the model. For the ensemble we use the stacking method. We observed that the generalized linear model had the best classification rate, and after applying the stacking technique we were able to improve the overall performance of the stacked models. 
Introduction Source Information: -- Donor: Vince Sigillito (vgs@aplcen.apl.jhu.edu) -- Date: 1989 -- Source: Space Physics Group, Applied Physics Laboratory, Johns Hopkins University, MD 20723 The first 34 columns are continuous numerical data representing 17 pulse numbers of received electromagnetic signals, with two attributes per pulse number: the time of the pulse and the pulse number. The 35th column is categorical data, "good" or "bad". "Good" returns are those showing evidence of some type of structure in the ionosphere; "bad" returns are those whose signals pass through the ionosphere without showing structure. Implementation of the Project First, we install the necessary packages and load the required libraries as mentioned below, and then read the dataset in R. We convert the last column, the label feature, from character to factor. Next, to identify the important features, we fit a Boruta model to the data and found that column two (V2) is not important; we therefore removed V2 from the dataset and created a reduced dataset with the important variables only. We then split the dataset into training and test sets. Once we had the training and test datasets, we used knn() from the class library to implement the KNN algorithm, glm() to implement logistic regression, and rpart() to implement the decision tree method. We chose these methods for our prediction and data analysis because we have binomial variable data with a binomial output, and the above-mentioned algorithms perform well when dealing with categorical data points. After completing our modelling, we improved the resulting accuracies by applying an ensemble technique; we chose stacking because it is designed to combine the outputs of models of different types.
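The repository implements this workflow in R (knn(), glm(), rpart() plus stacking). Purely as an illustration of the same stacking idea, here is a minimal Python sketch with scikit-learn, using a synthetic stand-in with the ionosphere dataset's shape (351 instances, 34 features) and logistic regression playing the role of glm():

```python
# Illustrative only: synthetic data standing in for the ionosphere dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=351, n_features=34, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stacking: base learners' predictions feed a final meta-learner.
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("tree", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```

The key property of stacking, as the text notes, is that the base models can be of entirely different types; the meta-learner learns how to weight their outputs.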
aiok03
Descriptive statistics and exploratory data analysis In order to get an idea of the received data, we look through our two tables, transactions and train. The shape of train is 6000 rows and 2 columns (client_id and target, i.e. gender). We also inspected the info of transactions and noticed that there are no empty values: all column counts equal 130,039. After that, we merged the two tables and called the result data. To display unique codes and types we used the ‘unique’ function and found 173 unique codes and 61 unique types. Using the ‘describe’ function we can see the minimum and maximum of code, type, and sum. The first hypothesis was to find which gender makes more requests. For convenience we used a for loop to express the values as percentages, and according to the barplot the biggest number of transactions is made by females. The second hypothesis was to find the code with the biggest sum. For that we grouped by code and computed the mean of the sums, converting the result from a Series to a DataFrame for further processing. The problem was that pandas treated the code as the index, so we fixed it with the ‘reset_index’ function. After that we plotted the graph and noticed that the highest mean sum belongs to code 4722, and confirmed it with another check below the graph. The third hypothesis was to find the distribution of sums relative to gender. The first graph didn’t reveal this because the spread of the data is too high: the variable is not normally distributed and not symmetrical, which makes it hard to assess. That’s why we grouped by gender and computed the mean of the sum; according to this, males spend more money than females. We repeated the process with the median and reached the same conclusion. And since the mean and median values are not equal, our assumption about non-normal data was confirmed. 
The last hypothesis was to find the number of clients for each type and code, i.e. the most popular request among clients. For that we applied ‘str’ to each parameter for correct visualization on the graph, counted the number of requests for each type and code, and reflected this in the graphs. According to them, the most popular are type 1010 and code 6011. Lastly, for further processing we converted type and code back to int. Feature engineering Client’s balance condition We took every sum from the dataframe data, grouped by client, and found the total for each of them, which gives the income and expenses per client. Clients with a negative total spent more than they earned; the rest had more income than expenses. A negative balance is encoded as 0, a positive one as 1. RFM In the RFM section we started with Recency. For each client we grouped their records and found the latest date on which a transaction was made. The datetime column consisted of two values, date and time, which we split into separate columns for the later feature engineering steps. We set the most recent day to 457 and, from this value, computed the recency of the last transaction for each client by subtraction. The next step is Frequency: we used the ‘groupby’ function and counted the appearances of each client in our database. The last step is Monetary (counting expenses): using ‘groupby’ and the condition that the sum is less than 0 (expenses are negative values), we computed the total expenses of each client and noticed one thing: some clients didn’t spend any money at all. Segmentation based on RFM We merged all the tables into one and ranked clients by percentile within each segment. Using this formula we scored clients on a 5-point scale; applying the elbow method to this data, we plotted the graph and found that 3 clusters were the optimal solution. 
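The RFM computation described above can be sketched in pandas. The toy table and the column names (client_id, day, sum) are assumptions, not the project's actual schema:

```python
# Toy illustration of Recency / Frequency / Monetary; column names assumed.
import pandas as pd

data = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "day":       [450, 455, 440, 452, 457],
    "sum":       [-100.0, 200.0, -50.0, -30.0, 80.0],
})

most_recent = data["day"].max()                       # 457, as in the text
# Recency: days since each client's latest transaction.
recency = most_recent - data.groupby("client_id")["day"].max()
# Frequency: number of transactions per client.
frequency = data.groupby("client_id").size()
# Monetary: total expenses (expenses are the negative sums).
monetary = (data[data["sum"] < 0]
            .groupby("client_id")["sum"].sum().abs())

rfm = pd.DataFrame({"recency": recency, "frequency": frequency,
                    "monetary": monetary}).fillna(0)  # 0 for clients with no expenses
```

The fillna(0) handles the point the text raises: some clients never spent any money at all, so they have no rows with a negative sum.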
With the KMeans library we plotted the k-means view of clients according to their distance from the randomly chosen centroids and showed the distribution of clients across clusters. After that, we gathered a basic table with the clusters, adding a prefix to each of them. Clustering for codes Now we'll work with codes to create code clusters, using TF-IDF and k-means. We will also employ lemmatization, tokenization, and stop-word elimination. We import the pymorphy2 library for lemmatization, which reduces words to their base form. Tokenization by sentences is the process of dividing a written text into its component sentences. We also need to delete stop words; a stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We would not want these words to take up space in our database or valuable processing time. We also use the re library for regular expression operations. In this section we also use MorphAnalyzer(): morphological analysis is the identification of a word's features based on how it is spelt, without using information about nearby words; pymorphy2 provides the MorphAnalyzer class for this. If we apply clustering directly to the TF-IDF matrix, we will have issues: the matrix is very sparse and the distance computations will be unreliable. What we can do is reduce the data to a dense matrix of dimension 156 by applying SVD. Singular Value Decomposition (SVD) is one of the widely used methods for dimensionality reduction, and we determined that 156 is the right number of components in our case. We used the Silhouette score to evaluate the quality of the clusters created using K-Means. 
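A minimal sketch of this pipeline (TF-IDF, SVD down to a dense matrix, k-means scored by silhouette). The toy corpus and the small component count stand in for the project's transaction-code descriptions and its 156 dimensions:

```python
# Illustrative only: a six-document toy corpus instead of the real descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

corpus = ["grocery store food", "food market grocery",
          "atm cash withdrawal", "cash atm machine",
          "hotel booking travel", "travel flight hotel"]

tfidf = TfidfVectorizer().fit_transform(corpus)        # sparse TF-IDF matrix
# SVD densifies the sparse matrix so distance computations behave well.
dense = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(dense)
score = silhouette_score(dense, labels)                # in [-1, 1]
```

In the project itself, lemmatization, tokenization, and stop-word removal (pymorphy2, re) would run before the TfidfVectorizer step.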
Using the Silhouette score we chose the number of clusters and performed k-means clustering on our TF-IDF matrix. We then visualized our clusters with t-SNE, a tool for visualizing high-dimensional data, added the cluster labels to the data and df dataframes, and finally created word clouds for our clusters. Clustering for types Data cleaning for types Firstly, we noticed that there were 155 types in the dictionary, while the data contains 61 types; after merging the data with those types, the total number of described types becomes 58. This means that 3 types have no description, so we replaced them with the mode value. We also found that some types have the description ‘н.д’ (meaning ‘no data’), with 26 such rows in the data. We also noticed that some descriptions repeat across several types, so we dropped the duplicates and replaced them with the first occurrence of the type in the data. Creating clusters for types We manually divided the types into 5 categories according to some key words in the descriptions and merged them with our dataframe. Then we noticed outliers in recency and frequency. We took the 0.999 and 0.001 quantiles as the high and low boundaries: everything above 0.999 and below 0.001 is considered an outlier, and we removed these for both recency and frequency. After that we checked the dataframe with describe and concluded that everything became normal. Supervised learning The time for prediction came. We divided our dataframe into train and test sets and used KNN, Decision Tree Classifier, Random Forest, and Logistic Regression for the predictions. For KNN we investigated the accuracy for k from 1 to 20 in steps of 2 on both train and test, and built the plot; the best result is 58% accuracy with 19 neighbors. The decision tree gave us 54% on the test set, and the random forest's accuracy was 64%. We investigated feature importance for both of them and noticed that monetary had the most influence on the predictions. 
For grid search we manually set the hyperparameters and used four-fold cross-validation. The best estimator for the random forest classifier was found by grid search; after that, good estimators were chosen for the random forest, and the same accuracy occurred, with the best accuracy achieved by the random forest with default hyperparameters. We built the confusion matrix and calculated recall, precision, and F1 score. We also built a logistic regression, but its accuracy was too low, so we plotted the ROC-AUC and precision-recall curves. Conclusion All the models showed that the available data were not sufficient, and in fact not well suited, for gender prediction. Actions to increase the accuracy were taken, such as adding more features and removing outliers. According to this investigation, the best choice was the random forest.
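A generic sketch of such a grid search with four-fold cross-validation; the parameter grid and the synthetic data are assumptions, not the project's actual settings:

```python
# Illustrative grid search over a small, assumed hyperparameter grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=4,                       # four folds, as in the text
)
grid.fit(X, y)                  # exhaustively tries every grid combination
best = grid.best_params_        # best-scoring hyperparameter combination
```

grid.best_estimator_ then gives the refit model, the "best estimator" the text refers to.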
Client scoring (propensity to buy) using Decision Tree and XGBoost, together with the following techniques: Isolation Forest, dummy coding, Spearman correlation, and the forward feature selection method. Moreover, the following methods were used to evaluate the models: confusion matrix, classification report, Accuracy, Precision, Recall, F1, AUC, Gini, ROC, PROFIT, LIFT, and a comparison of statistics on the train and test datasets.
Gradient statistics for implementing a generalized hybrid loss function in gradient-boosted decision trees (GBDT) for binary classification tasks.
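For illustration only (the repository's hybrid loss itself is not reproduced here), these are the per-example gradient statistics of the plain logistic loss, the kind of first- and second-order terms a custom GBDT binary-classification objective returns:

```python
# Gradient statistics of the standard log loss w.r.t. the raw (pre-sigmoid)
# score; a GBDT custom objective returns exactly such (grad, hess) arrays.
import numpy as np

def logistic_grad_hess(raw_score, label):
    """First and second derivatives of log loss at each raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))   # predicted probability
    grad = p - label                        # first derivative (gradient)
    hess = p * (1.0 - p)                    # second derivative (Hessian)
    return grad, hess

g, h = logistic_grad_hess(np.array([0.0, 2.0]), np.array([1.0, 0.0]))
```

A hybrid loss would blend such statistics with those of another loss term before handing them to the booster.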
RajputJay41
Julia provides powerful tools for deep learning (Flux.jl and Knet.jl), machine learning and AI. Julia’s mathematical syntax makes it an ideal way to express algorithms just as they are written in papers, build trainable models with automatic differentiation, GPU acceleration and support for terabytes of data with JuliaDB. Julia's rich machine learning and statistics ecosystem includes capabilities for generalized linear models, decision trees, and clustering. You can also find packages for Bayesian Networks and Markov Chain Monte Carlo.
nehasha001
Regression and Classification: using the R programming language, analysed decision trees, a predictive model that maps observations about an item to conclusions about the item's target value. It is one of the predictive modelling approaches used in statistics, data mining, and machine learning. Used the "ISLR" package and its "Carseats" dataset for classification. Basically we have three data sets: training, testing, and validation. We split the data into testing and training sets, then cross-validated to check where to stop pruning. For regression we used the "MASS" package and the "Boston" dataset.
shraddhaghadage
Breast cancer is one of the most common cancers among women worldwide, representing the majority of new cancer cases and cancer-related deaths according to global statistics, making it a significant public health problem in today’s society. Early diagnosis can significantly improve the prognosis and chance of survival, as it can promote timely clinical treatment of patients. Further, accurate classification of benign tumors can prevent patients from undergoing unnecessary treatments. Thus, the correct diagnosis of breast cancer and the classification of patients into malignant or benign groups is the subject of much research. Because of its unique advantages in detecting critical features in complex breast cancer datasets, machine learning (ML) is widely recognized as the methodology of choice for breast cancer pattern classification and forecast modelling. Classification and data mining methods are an effective way to classify data, especially in the medical field, where they are widely used in diagnosis and analysis to support decisions. Because we are categorizing whether the tissue is malignant or benign, we will train multiple tree-based models for this task and experiment with hyper-parameters to see if we can improve the accuracy, solving the problem with the approach outlined below. For further information on each feature, consult the data dictionary. Decision trees (DTs) form the basis of ensemble algorithms in machine learning. These are powerful algorithms that can fit complex data. In this project, our focus is on understanding the core concepts of the decision tree for healthcare analysis, followed by understanding the different ensemble techniques.
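A minimal sketch of this tree-based approach, using scikit-learn's built-in breast cancer dataset as a stand-in (not necessarily the dataset this project uses):

```python
# Single decision tree vs. an ensemble (random forest) on the built-in
# Wisconsin breast cancer dataset: malignant vs. benign classification.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# max_depth is one of the hyper-parameters worth experimenting with.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

tree_acc, forest_acc = tree.score(X_te, y_te), forest.score(X_te, y_te)
```

The single tree is the interpretable baseline; the ensemble then shows what the different ensemble techniques add on top of it.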
nalinimacharla9
In this assignment, we chose the "Portuguese Bank Marketing" dataset, retrieved from the UCI Machine Learning Repository. Here, we perform Exploratory Data Analysis, Inferential Statistics, and Regularization Techniques on the bank dataset, and build machine learning models, such as Logistic Regression (to understand the relationships between the variables), Support Vector Machines, and Decision Tree Classifiers, with Cross Validation, in order to predict a Deposit Subscription by the clients of the bank. It is a supervised ML model solving a classification problem: whether the client will subscribe to the term deposit or not (a simple yes or no). We also run a one-sample t-test, a two-sample t-test, a paired t-test, the test of equal or given proportions, and F-tests. A one-sample t-test determines whether an unknown population mean differs from a specified value. The two-sample t-test, also known as the independent-samples t-test, tests whether the unknown population means of 2 groups are identical or not. The paired t-test, also called the dependent-samples t-test, determines whether the mean change between 2 sets is 0; the same subjects are measured twice, resulting in pairs of observations. The test of equal or given proportions tests whether or not a sample from a population represents the true proportion of the entire population. The last test, the F-test, indicates whether linearity gives an improved fit.
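The t-tests named above can be illustrated with scipy on made-up samples (the data here are random draws, not the bank dataset):

```python
# Toy demonstration of one-sample, two-sample, and paired t-tests, plus an
# F statistic as a ratio of sample variances. Samples are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10, 2, size=50)   # e.g. a "before" measurement
b = rng.normal(11, 2, size=50)   # e.g. an "after" measurement

t1, p1 = stats.ttest_1samp(a, popmean=10)   # mean of a vs. a specified value
t2, p2 = stats.ttest_ind(a, b)              # independent two-sample t-test
t3, p3 = stats.ttest_rel(a, b)              # paired (dependent-samples) t-test

# F statistic: ratio of the two sample variances.
f = np.var(a, ddof=1) / np.var(b, ddof=1)
```

Each test returns a statistic and a p-value; a small p-value is evidence against the corresponding null hypothesis (mean equals the value, means equal, mean difference zero).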
Bet4Arrio
Simple code of a decision tree using only descriptive statistics.
molnareszter
K-Means Clustering and Decision Tree algorithms on network statistics (Hungarian literary ecosystem in Transylvania)
bharatgw
I use descriptive statistics and machine learning algorithms (linear regression, decision trees and random forests) to plan bid prices for modules.
Aryia-Behroziuan
Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making.
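A short illustration of the two tree families described above: a classification tree whose leaves hold class labels and a regression tree whose leaves hold real-valued predictions (scikit-learn, toy data):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the target takes a discrete set of values,
# so leaves represent class labels.
X_c = [[0], [1], [2], [3]]
y_c = ["a", "a", "b", "b"]
clf = DecisionTreeClassifier().fit(X_c, y_c)
pred_class = clf.predict([[0.5]])[0]

# Regression tree: the target takes continuous values, so leaves
# represent real-valued predictions (here, the mean of each side).
X_r = [[0.0], [1.0], [2.0], [3.0]]
y_r = [0.0, 0.1, 1.9, 2.0]
reg = DecisionTreeRegressor(max_depth=1).fit(X_r, y_r)
pred_value = reg.predict([[2.5]])[0]
```

The branches encode the learned split conditions; following them from root to leaf reproduces the "conjunctions of features" the description refers to.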
jjlee740
An implementation of a decision tree that uses training data to build a predictive model that can predict statistics of entered testing data.
Took the data from Kaggle and used statistical methods such as decision trees (Random Forest, C5.0, rpart), Negative Binomial Regression, and nonparametric methods to observe the influence of different types of weather on sales.
Predict the development level of countries based on corruption index, cost of living, tourism statistics, and unemployment rates using Gradient Boosting Classifier, Decision Tree Classifier, SVM, Extra Trees, Random Forest Classifier, and Logistic Regression, and compare their evaluation metrics (Accuracy, Precision, Recall, F1-Score) in a summary.
Imswappy
Interactive loan prediction system built with Streamlit and Decision Tree Classifier. Implements mathematical models like entropy, Gini impurity, and information gain with real-time visualization. Combines ML, statistics, and feature engineering for interpretable financial analytics.
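The split criteria mentioned (entropy, Gini impurity, information gain) can be computed from scratch; this is a minimal sketch, not the app's actual implementation:

```python
# Entropy, Gini impurity, and information gain for a candidate split.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: probability of mislabeling a random sample."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left/right."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

parent = ["yes", "yes", "no", "no"]
gain = information_gain(parent, ["yes", "yes"], ["no", "no"])
```

A perfect split of a balanced binary node, as here, yields the maximum gain of 1.0 bit; the tree learner picks the feature and threshold maximizing this quantity.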
YasmineJedidi
This is set one of my data analytics projects with Python (beginners set). It covers basics like data cleaning, correlation, and descriptive statistics, as well as beginner projects for KNN classification and regression, decision trees, and random forests.
surrealsun
An advanced Tic-Tac-Toe game in C featuring adaptive AI with Minimax, multiplayer access, undo functionality, and player statistics. Leverages arrays, stacks, linked lists, hash tables, and decision trees for efficient gameplay and dynamic session management.
Compared and clustered Major League Baseball players using process-based statistics from the 2002-2014 seasons. Created seven player archetypes through principal component analysis and attempted to recover a method of classification through cluster analysis and decision trees.
AmisiJospin
These projects cover the following topics: Optimization and the Knapsack Problem, Decision Trees and Dynamic Programming, Graph Problems, Plotting, Stochastic Thinking, Random Walks, Inferential Statistics, Monte Carlo Simulations, Sampling and Standard Error, Experimental Data, Machine Learning, and Statistical Fallacies.
janvikale2005
📊 Hands-on Data Science & Statistics practicals in Python using Jupyter Notebook. Covers data acquisition, cleaning, visualization, and ML algorithms like Linear & Logistic Regression, KNN, SVM, Decision Tree, and Random Forest. Perfect for learning and building strong data analysis skills!
Diego-HernSua
This project applies various supervised learning tools to a dataset containing comprehensive statistics of FIFA 23 players, employing classification techniques (e.g., Logistic Regression, Decision Trees, Random Forest) and regression methods (e.g., Linear Regression, Ridge, Lasso).