Found 123 repositories (showing 30)
Aastha2104
Introduction

Parkinson's Disease is the second most prevalent neurodegenerative disorder after Alzheimer's, affecting more than 10 million people worldwide. Parkinson's is characterized primarily by the deterioration of motor and cognitive ability. There is no single test that can be administered for diagnosis; instead, doctors must perform a careful clinical analysis of the patient's medical history. Unfortunately, this method of diagnosis is highly inaccurate: a study from the National Institute of Neurological Disorders finds that early diagnosis (having symptoms for 5 years or less) is only 53% accurate. This is not much better than random guessing, yet an early diagnosis is critical to effective treatment. Because of these difficulties, I investigate a machine learning approach to accurately diagnose Parkinson's, using a dataset of various speech features (a non-invasive yet characteristic tool) from the University of Oxford.

Why speech features? Speech is very predictive and characteristic of Parkinson's disease: almost every Parkinson's patient experiences severe vocal degradation (inability to produce sustained phonations, tremor, hoarseness), so it makes sense to use voice to diagnose the disease. Voice analysis has the added benefit of being non-invasive, inexpensive, and very easy to perform clinically.

Background: Parkinson's Disease

Parkinson's is a progressive neurodegenerative condition resulting from the death of the dopamine-containing cells of the substantia nigra (a region that plays an important role in movement). Symptoms include "frozen" facial features, bradykinesia (slowness of movement), akinesia (impairment of voluntary movement), tremor, and voice impairment. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated and 80% of striatal dopamine has been depleted.
Performance Metrics

TP = true positive, FP = false positive, TN = true negative, FN = false negative

Accuracy: (TP+TN)/(P+N)

Matthews Correlation Coefficient: MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 1 = perfect, 0 = random, −1 = completely inaccurate

Algorithms Employed

Logistic Regression (LR): Uses the sigmoid logistic function with weights (coefficient values) and biases (constants) to model the probability of a class for binary classification. An output near 1 represents one class, and an output near 0 represents the other. Training the model learns the optimal weights and biases.

Linear Discriminant Analysis (LDA): Assumes that the data is Gaussian and each feature has the same variance. LDA estimates the mean and variance for each class from the training data, then uses Bayes' theorem and the Gaussian distribution to compute the probability of a particular instance belonging to a given class. The class with the largest probability is the prediction.

k Nearest Neighbors (KNN): Makes predictions about the validation set using the entire training set. KNN predicts a new instance by searching the training set for the k "closest" instances, where closeness is measured with a proximity metric (here, Euclidean distance) across all features. The majority class among the k closest instances is the predicted class.

Decision Tree (DT): Represented by a binary tree, where each internal node represents an input variable and a split point, and each leaf node contains an output used to make a prediction.

Neural Network (NN): Models the way the human brain makes decisions. Each neuron takes one or more inputs and applies an activation function with weights and biases to produce an output. Neurons are arranged into layers, and multiple layers form a network that can model complex decisions. Training the network uses the training instances to optimize the weights and biases.
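The two metrics above are both available in scikit-learn; a minimal sketch with hypothetical labels (not the project's real predictions):

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical labels: 1 = Parkinson's, 0 = healthy (illustration only).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / (P + N)
mcc = matthews_corrcoef(y_true, y_pred)  # robust to class imbalance
print(f"accuracy={acc:.3f}  MCC={mcc:.3f}")
```

Here TP=4, TN=3, FP=0, FN=1, so accuracy is 7/8 = 0.875 and MCC is 12/√240 ≈ 0.775.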
Naive Bayes (NB): Simplifies the calculation of probabilities by assuming that all features are independent of one another (a strong but effective assumption). Employs Bayes' theorem to calculate the probability that the instance belongs to each class, then predicts the class with the highest probability.

Gradient Boost (GB): Generally used when seeking a model with very high predictive performance; reduces bias and variance ("error") by combining multiple "weak learners" (poor models) into a "strong learner" (a high-performance model). Involves three elements: a loss function to be optimized, a weak learner (a decision tree) to make predictions, and an additive model that adds trees one by one to minimize the loss function, using gradient descent after each addition.

Engineering Goal

Produce a machine learning model to diagnose Parkinson's disease from various features of a patient's speech with at least 90% accuracy and/or a Matthews Correlation Coefficient of at least 0.9. Compare various algorithms and parameters to determine the best model for predicting Parkinson's.
Dataset Description

Source: the University of Oxford. 195 instances (147 with Parkinson's, 48 without), 22 features (elements that are possibly characteristic of Parkinson's, such as frequency, pitch, and amplitude/period of the sound wave), and 1 label (1 for Parkinson's, 0 for no Parkinson's).

Project Pipeline

[pipeline diagram]

Summary of Procedure

1. Split the Oxford Parkinson's dataset into two parts: one for training, one for validation (to evaluate how well the model performs).
2. Train each of the following algorithms on the training set: Logistic Regression, Linear Discriminant Analysis, k Nearest Neighbors, Decision Tree, Neural Network, Naive Bayes, Gradient Boost.
3. Evaluate results on the validation set.
4. Repeat for the following training/validation splits: 80%/20%, 75%/25%, and 70%/30%.
5. Repeat for a rescaled version of the dataset (scale all values to the range 0 to 1, which helps reduce the effect of outliers).
6. Conduct 5 trials and average the results.

Data

[result tables: a_o, a_r, m_o, m_r]

Data Analysis

In general, the models performed best (in terms of both accuracy and Matthews Correlation Coefficient) on the rescaled dataset with a 75/25 train-test split. The two highest-performing algorithms, k Nearest Neighbors and the Neural Network, both achieved an accuracy of 98%. The NN achieved an MCC of 0.96, while KNN achieved an MCC of 0.94. These figures outperform most existing literature and significantly outperform current methods of diagnosis.

Conclusion and Significance

These robust results suggest that a machine learning approach can indeed be implemented to significantly improve diagnosis of Parkinson's disease. Given the necessity of early diagnosis for effective treatment, my machine learning models provide a very promising alternative to the current, rather ineffective method of diagnosis.
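The procedure above can be sketched with scikit-learn. This is a minimal illustration with synthetic stand-in data (the real Oxford CSV is not bundled here) and a subset of the algorithms:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the Oxford dataset (195 instances, 22 features).
X, y = make_classification(n_samples=195, n_features=22, random_state=0)

results = {}
for name, model in [("LR", LogisticRegression(max_iter=1000)),
                    ("KNN", KNeighborsClassifier()),
                    ("GB", GradientBoostingClassifier(random_state=0))]:
    accs = []
    for trial in range(5):  # 5 trials, averaged
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.25, random_state=trial)  # 75/25 split
        scaler = MinMaxScaler()  # rescale all features to [0, 1]
        X_tr_s = scaler.fit_transform(X_tr)
        X_va_s = scaler.transform(X_va)
        model.fit(X_tr_s, y_tr)
        accs.append(accuracy_score(y_va, model.predict(X_va_s)))
    results[name] = float(np.mean(accs))
print(results)
```

Note that the scaler is fit on the training split only and merely applied to the validation split, which avoids leaking validation information into training.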
Current methods of early diagnosis are only 53% accurate, while my machine learning model achieves 98% accuracy. This 45-percentage-point increase is critical because an accurate, early diagnosis is needed to effectively treat the disease. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated and 80% of striatal dopamine has been depleted. With an earlier diagnosis, much of this degradation could be slowed or treated. My results are significant because Parkinson's affects over 10 million people worldwide who could benefit greatly from an early, accurate diagnosis. Not only is my machine learning approach more accurate diagnostically, it is also more scalable, less expensive, and therefore more accessible to people who might not have access to established medical facilities and professionals. The diagnosis is also much simpler, requiring only a 10-15 second voice recording and producing an immediate result.

Future Research

Given more time and resources, I would investigate the following:
- Create a mobile application that would allow the user to record his/her voice, extract the necessary vocal features, and feed them into my machine learning model to diagnose Parkinson's.
- Use larger datasets in conjunction with the University of Oxford dataset.
- Tune and improve my models further to achieve even better results.
- Investigate different structures and types of neural networks.
- Construct a novel algorithm specifically suited to the prediction of Parkinson's.
- Generalize my findings and algorithms to all types of dementia disorders, such as Alzheimer's.

References

Bind, Shubham. "A Survey of Machine Learning Based Approaches for Parkinson Disease Prediction." International Journal of Computer Science and Information Technologies 6 (2015). Web. 8 Mar. 2017.

Brooks, Megan. "Diagnosing Parkinson's Disease Still Challenging."
Medscape Medical News. National Institute of Neurological Disorders, 31 July 2014. Web. 20 Mar. 2017.

Little, M. A., P. E. McSharry, S. J. Roberts, D. A. E. Costello, and I. M. Moroz. "Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection." BioMedical Engineering OnLine 6:23 (26 June 2007).

Hashmi, Sumaiya F. "A Machine Learning Approach to Diagnosis of Parkinson's Disease." Claremont Colleges Scholarship. Claremont College, 2013. Web. 10 Mar. 2017.

Karplus, Abraham. "Machine Learning Algorithms for Cancer Diagnosis." Mar. 2012. Web. 20 Mar. 2017.

Little, Max. "Parkinsons Data Set." UCI Machine Learning Repository. University of Oxford, 26 June 2008. Web. 20 Feb. 2017.

Ozcift, Akin, and Arif Gulten. "Classifier Ensemble Construction with Rotation Forest to Improve Medical Diagnosis Performance of Machine Learning Algorithms." Computer Methods and Programs in Biomedicine 104.3 (2011): 443-51. Web. 15 Mar. 2017.

"Parkinson's Disease Dementia." UCI MIND. N.p., 19 Oct. 2015. Web. 17 Feb. 2017.

Salvatore, C., A. Cerasa, I. Castiglioni, F. Gallivanone, A. Augimeri, M. Lopez, G. Arabia, M. Morelli, M. C. Gilardi, and A. Quattrone. "Machine Learning on Brain MRI Data for Differential Diagnosis of Parkinson's Disease and Progressive Supranuclear Palsy." Journal of Neuroscience Methods 222 (2014): 230-37. Web. 18 Mar. 2017.

Shahbakhi, Mohammad, Danial Taheri Far, and Ehsan Tahami. "Speech Analysis for Diagnosis of Parkinson's Disease Using Genetic Algorithm and Support Vector Machine." Journal of Biomedical Science and Engineering 7.4 (2014): 147-56. Web. 2 Mar. 2017.

"Speech and Communication." Parkinson's Disease Foundation, n.d. Web. 22 Mar. 2017.

Sriram, Tarigoppula V. S., M. Venkateswara Rao, G. V. Satya Narayana, and D. S. V. G. K. Kaladhar. "Diagnosis of Parkinson Disease Using Machine Learning and Data Mining Systems from Voice Dataset." SpringerLink. Springer, Cham. Web. 17 Mar. 2017.
mGalarnyk
Repo for my graduate data science machine learning class at UCSD (UC San Diego). This course provides a broad introduction to the practical side of machine learning and data analysis. Topics covered include supervised learning (k-nearest neighbor classifiers, decision trees, boosting, and perceptrons) and unsupervised learning (k-means, PCA, and Gaussian mixture models).
Failure prediction is an important step toward improving the reliability of cloud computing systems: it makes it possible to avoid failure incidents and reduce the system's cost overhead. Breakthroughs in machine learning and cloud storage, together with the huge volumes of data these systems generate, create an opportunity to predict when a system or piece of hardware will malfunction or fail, and statistical analysis of workload data from cloud providers can yield insights that improve system reliability. This research studies job usage data from the large "Google Cluster Workload Traces 2019" dataset, using multiple resampling techniques (Random Under-Sampling, Random Oversampling, and the Synthetic Minority Oversampling Technique) to handle the imbalanced dataset. For job failure prediction on both the imbalanced and balanced data, it applies traditional machine learning algorithms (Logistic Regression, Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, and Extreme Gradient Boosting Classifier) as well as deep learning algorithms (Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)), and compares the imbalanced and balanced settings in terms of model accuracy, error rate, sensitivity, F-measure, and precision. The results show that the Extreme Gradient Boosting and Gradient Boosting classifiers are the best-performing algorithms both with and without imbalance-handling techniques, and that SMOTE is the best of the methods for handling imbalanced data. The LSTM and GRU deep learning models may not be the best in terms of accuracy, but based on the ROC curve they perform better than the XGBoost and Gradient Boosting classifiers.
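The imbalance-handling step described above can be sketched as follows. This sketch uses synthetic stand-in data and plain random oversampling via scikit-learn's `resample` (SMOTE, which synthesizes new minority samples rather than duplicating them, lives in the third-party imbalanced-learn package), with `GradientBoostingClassifier` standing in for the repo's boosting models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced stand-in for the job-failure data (~10% failures).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random oversampling: duplicate minority-class rows (with replacement)
# until the training classes balance. Only the training set is resampled.
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_up, y_up = resample(X_min, y_min, n_samples=int((y_tr == 0).sum()),
                      random_state=0)
X_bal = np.vstack([X_tr[y_tr == 0], X_up])
y_bal = np.concatenate([y_tr[y_tr == 0], y_up])

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
f1 = f1_score(y_te, clf.predict(X_te))  # minority-class F-measure
print("minority F1:", round(f1, 3))
```

Evaluating on the untouched test split is what keeps the F-measure honest: resampling the test set would inflate the score.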
*****PROJECT SPECIFICATION: Machine Learning Capstone Analysis Project***** This capstone project involves machine learning modeling and analysis of clinical, demographic, and brain-related derived anatomic measures from human MRI (magnetic resonance imaging) tests (http://www.oasis-brains.org/). The objectives of these measurements are to diagnose the level of dementia in the individuals and the probability that these individuals may have Alzheimer's Disease (AD). Machine learning has been applied to Alzheimer's/dementia identification from MRI scans and related data in the academic papers/theses in References 10 and 11 listed in the References section below. Recently, a close relative of mine had to undergo a sequence of MRI tests for cognition difficulties. The motivation for choosing this topic for the capstone project arose from the desire to understand and analyze the potential for dementia and AD from MRI-related data. Cognitive testing, clinical assessments, and demographic data related to these MRI tests are used in this project. This capstone project does not use the MRI "imaging" data and does not focus on AD; it focuses only on dementia. *****Conclusions, Justification, and Reflections***** The formulation of the OASIS data (Refs 1 and 2) as a dementia classification problem based on demographic and clinical data only (without directly using the MRI image data) is a simplification that has major advantages and appeal. It means the trained model can classify whether an individual has dementia with about 87% accuracy, without having to wait for radiological interpretation of MRI scans. This can provide an early alert for intervention and initiation of treatment for those with onset of dementia.
The assumption that the combined cross-sectional and longitudinal datasets would lead to dementia label classification of acceptable accuracy turned out to be true. The method required careful data cleaning and data preparation, converting the task to a binary classification problem, as outlined in this notebook. At the outset it was not clear which algorithm(s) would be most appropriate for the binary and multi-label classification problems. Spot-checking the algorithms early for accuracy led to a smaller set of higher-accuracy algorithms (e.g. Gradient Boosting and Random Forest) for a deeper examination, e.g. using k-fold cross-validation to classify the CDR label. The neural network benchmark model's accuracy of 78% for binary classification was exceeded by the classification accuracy of the main output of this study, the trained Gradient Boosting and Random Forest classification models. This builds confidence in the latter models for further training with new data and for classifying new patients.
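The spot-checking-with-cross-validation approach described above can be sketched as follows; a bundled scikit-learn binary-classification dataset stands in for the OASIS tables, which are not included here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in binary-classification data for spot-checking.
X, y = load_breast_cancer(return_X_y=True)

# Stratified k-fold keeps the class balance the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("LogReg", LogisticRegression(max_iter=5000)),
                    ("RandomForest", RandomForestClassifier(random_state=0)),
                    ("GradBoost", GradientBoostingClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Ranking the mean fold accuracies like this is how a shortlist (e.g. Gradient Boosting, Random Forest) falls out for deeper tuning.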
Shailja12326646
This project predicts the type of anaemia in patients using machine learning techniques, specifically the XGBoost algorithm and a Stacking Ensemble Classifier. After preprocessing the data, we apply XGBoost, a powerful gradient boosting framework known for its accuracy and efficiency, as a base model.
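A stacking ensemble of the kind described can be sketched with scikit-learn's `StackingClassifier`. Synthetic data stands in for the anaemia dataset, and `GradientBoostingClassifier` stands in for XGBoost (which is a separate third-party package):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data; GradientBoostingClassifier stands in for XGBoost here.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base models make out-of-fold predictions; a meta-learner combines them.
stack = StackingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print("stacked accuracy:", round(acc, 3))
```

`StackingClassifier` fits the meta-learner on cross-validated base predictions internally, which avoids the base models leaking their training labels into the second stage.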
Aayushi-2808
# Cervical_cancer_detection_using_ML

# Introduction
According to the World Health Organisation (WHO), when detected at an early stage, cervical cancer is one of the most curable cancers. Hence, the main motive behind this project is to detect the cancer in its early stages so that it can be treated and managed effectively.

# Flow of project is as explained below:
This project is divided into 5 parts:
1. Data Cleaning
2. Exploratory Data Analysis
3. Baseline model: Logistic Regression
4. Ensemble Models: Bagging with Decision Trees, Random Forest and Boosting
5. Model Comparison and results

# Refer below for References:
Link to basic information regarding cervical cancer: https://www.cdc.gov/cancer/cervical/basic_info/index.htm
The dataset for tackling the problem is supplied by the UCI repository for Machine Learning. Link to dataset: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29
The dataset contains a list of risk factors that lead up to the Biopsy examination. The generation of the predictor variable is taken care of in part 2 (Exploratory Data Analysis) of this report. We will try to predict the 'biopsy' variable from the dataset using Logistic Regression, Random Forest, Bagging with Decision Trees and Boosting with the XGBoost Classifier.

# Results:
Based on our base model and the ensemble models we used, we observed:
1. After the entire process of training, hyperparameter tuning and tackling class imbalance was complete, we obtained the results depicted in the graphics.
2. Bagging and Random Forest give the highest accuracy and precision of 97.09% and 80%, respectively.
3. Plotting the confusion matrix showed that Random Forest using upsampling and class weights gives 2 false positives and 3 false negatives with an AUC of 0.87.

# Why is random forest the best model?
1. Comparing all of our models, RF has the maximum f1_score and accuracy along with Bagging, i.e. 76.2% and 97.09% respectively.
2. It also produces the same number of false negatives, with a recall of 72.73%, just like all the other models.
3. But we still consider RF better because of its added advantage: its decision trees are decorrelated compared to bagging, leading to lower variance and a greater ability to generalize.

# Conclusion:
Observing the feature importance of the best model, i.e. random forest, we see that the most important features are Schiller, Hinselmann, HPV, Citology, etc. This makes sense because Schiller and Hinselmann are actually tests used to detect cervical cancer.

# Problems Faced:
A major problem encountered while training the model was that there was too little data to train on. By collaborating with hospitals in India, we could gather enough data points to train a model with higher recall, making the model better.

# Scope of Improvement
As next steps I would want to do exactly that: deploy the model and refine it. We may also modify the number of predictor variables, as it may well turn out that there are other predictors not present in our current dataset. This can only be found by practical implementation of our predictions.
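The "upsampling and class weights" approach mentioned for Random Forest can be sketched with scikit-learn's `class_weight` option. Synthetic imbalanced data stands in for the UCI biopsy table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced placeholder for the rare positive biopsy label (~6% positive).
X, y = make_classification(n_samples=800, weights=[0.94, 0.06], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# class_weight="balanced" reweights errors inversely to class frequency,
# pushing the forest to pay more attention to the rare positive class.
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
rf.fit(X_tr, y_tr)
rec = recall_score(y_te, rf.predict(X_te))
print("recall on positives:", round(rec, 3))
```

For a screening task like this, recall on the positive class matters more than raw accuracy: a majority-class classifier would score ~94% accuracy while missing every cancer case.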
Bribak
This repository constitutes SURFY2 and corresponds to the bioRxiv preprint 'Updating the in silico human surfaceome with meta-ensemble learning and feature engineering' by Daniel Bojar. SURFY2 is a machine learning classifier that predicts whether a human transmembrane protein is located at the surface of a cell (the plasma membrane) or in one of the intracellular membranes, based on the sequence characteristics of the protein. Making use of the data described in the recent publication from Bausch-Fluck et al. (https://doi.org/10.1073/pnas.1808790115), SURFY2 considerably improves on their reported classifier SURFY in terms of accuracy (95.5%), precision (94.3%), recall (97.6%) and area under the ROC curve (0.954) when using a test set never seen by the classifier before. SURFY2 consists of a layer of 12 base estimators generating 24 new engineered features (class probabilities for both classes), which are appended to the original 253 features. Then a soft voting classifier with three optimized base estimators (Random Forest, Gradient Boosting and Logistic Regression) and optimized voting weights is trained on this expanded dataset, yielding the final prediction. The motivation of SURFY2 is to provide an updated and better version of the in silico human surfaceome to facilitate research and drug development on human surface-exposed transmembrane proteins. Additionally, SURFY2 enabled insights into biological properties of these proteins and generated several new hypotheses and ideas for experiments.

The workflow is as follows:
1) dataPrep: Gets training data from data.xlsx, labels it according to surface class and outputs 'train_data.csv'.
2) split: Gets train_data.csv, splits it into training, validation and test data and outputs 'train.csv', 'val.csv', 'test.csv'.
3) main_val: Was used for optimizing hyperparameters of the base estimators and the estimators & weights of the voting classifier. Stores all estimators. Evaluates the meta-ensemble classifier SURFY2 on the validation set.
4) classifier_selection: All base estimators and meta-ensemble approaches are tested on the initial dataset as well as the expanded dataset including the engineered features, and compared in terms of their cross-validation score.
5) main_test: Evaluates SURFY2 on the separate test set (trained on training + validation set).
6) testing_SURFY: Evaluates the original SURFY through cross-validation and on the validation as well as the test set.
7) pred_unlabeled: Uses SURFY2 to predict the surface label (+ prediction score) for unlabeled proteins in data.xlsx. Also gets the feature importances of the voting classifier estimators.
8) getting_discrepancies: Compares predictions with those made by SURFY ('surfy.xlsx') and stores mismatches. Also stores the 10 most confident mismatches (by SURFY2 classification score) from each class.
9) feature_importances: Plots the 10 most important features for the voting classifier estimators (Random Forest, Gradient Boosting, Logistic Regression) to interpret predictions.
10) base_estimator_importances: Plots the 10 most important features for the two most important base estimators (XGBClassifier and Gradient Boosting).
11) comparing_mismatches: Separates datasets into shared & discrepant predictions (between SURFY and SURFY2). Compares feature means and selects features with the highest class feature mean differences between prediction datasets. Statistically analyzes differences in feature means between classes in both prediction datasets. Plots 9 representative features with their means grouped according to class and prediction dataset to rationalize discrepant predictions.
12) tSNE_surfy2: Performs nonlinear dimensionality reduction using t-SNE on proteins with predictions from both SURFY and SURFY2. Plots the two t-SNE dimensions and labels the proteins according to their prediction class to show where discrepant predictions reside in the landscape. Plots surface proteins with the most prevalent annotated functional subclasses, labeled according to subclass, to enable comparison to class predictions. Functional annotations came from 'surfy.xlsx'.
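The final soft-voting stage described above can be sketched with scikit-learn's `VotingClassifier`. Placeholder data and hypothetical voting weights are used here, not the tuned SURFY2 estimators or the real 253 + 24 protein features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder features standing in for the expanded protein feature table.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",      # average predicted class probabilities
    weights=[2, 2, 1],  # hypothetical optimized voting weights
)
score = cross_val_score(vote, X, y, cv=3).mean()
print("CV accuracy:", round(score, 3))
```

Soft voting requires every estimator to expose `predict_proba`; the weights scale each estimator's probability vector before averaging, which is the knob SURFY2 reportedly optimizes.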
saugatapaul1010
Ensemble Learning: Bagging, Boosting, Stacking and Cascading Classifiers in Machine Learning using the scikit-learn and MLxtend libraries.
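The bagging-versus-boosting contrast at the heart of this repo can be sketched with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Bagging trains trees in parallel on bootstrap samples; boosting trains
# them sequentially, reweighting the points the previous trees got wrong.
models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                                 random_state=0),
    "boosting": AdaBoostClassifier(n_estimators=25, random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=3).mean()
          for name, m in models.items()}
print(scores)
```

Stacking and cascading, the other two techniques named, instead feed base-model predictions into a second-level model (see MLxtend's `StackingClassifier` for one implementation).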
Sid-darthvader
Transition metal oxides are attractive materials for high temperature thermoelectric applications due to their thermal stability, low cost bulk processing and natural abundance. Notwithstanding the high power factor, their high thermal conductivity is a roadblock to achieving higher efficiency. The search space for new thermoelectric oxides has been limited to the alloys of a few previously explored systems, such as ZnO, SrTiO3 and CaMnO3. The phenomenon of thermal conduction in crystalline alloys and its dependence on crystal properties is also poorly understood, which limits the ability to design new alloys. In this paper, we apply machine-learning models for discovering novel transition metal oxides with low lattice thermal conductivity (kL). A two-step process is proposed to address the problem of small datasets frequently encountered in materials informatics. First, a gradient boosted tree classifier is learnt to categorize unknown compounds into three categories of thermal conductivity: Low, Medium, and High. In the second step, we fit regression models on the targeted class (i.e. low kL) to estimate kL with an R2 value of 0.96. The gradient boosted tree model was also used to identify key material properties influencing the classification of kL, namely lattice energy per atom, atom density, electronic energy band gap, mass density, and the ratio of oxygen to transition metal atoms. Only fundamental materials properties describing the crystal symmetry, compound chemistry and interatomic bonding were used in the classification process, and these can be readily used as selection parameters. The proposed two-step process addresses the problem of small datasets and improves the predictive accuracy.
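The two-step classify-then-regress scheme can be sketched as follows, with randomly generated stand-in descriptors and conductivity values (not the paper's real features, data, or tuned models):

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              GradientBoostingRegressor)

rng = np.random.default_rng(0)
# Hypothetical material descriptors and lattice thermal conductivity values.
X = rng.normal(size=(300, 6))
kL = np.exp(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300))

# Step 1: classify compounds into Low / Medium / High kL bands
# (bands defined here by the 33rd and 66th percentiles).
bands = np.digitize(kL, np.quantile(kL, [0.33, 0.66]))
clf = GradientBoostingClassifier(random_state=0).fit(X, bands)

# Step 2: fit a regressor only on the targeted low-kL class.
low = bands == 0
reg = GradientBoostingRegressor(random_state=0).fit(X[low], kL[low])
print("low-kL training R^2:", round(reg.score(X[low], kL[low]), 3))
```

Restricting the regressor to one band narrows the target range it must model, which is the mechanism the paper leverages to get usable accuracy from a small dataset.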
Epilepsy is a neurological disorder of the human brain, characterized by chronic seizures that occur at random and interrupt the normal function of the brain. The diagnosis and analysis of epileptic seizures is made with the help of electroencephalography (EEG). Detecting seizures involves the interpretation of long EEG records by expert physicians, which is time-consuming and requires considerable human effort. Thus, this study aims to construct an automatic seizure detection system to analyze epileptic EEG signals. The CHB-MIT Scalp EEG recordings of patients are used in this work for experimental purposes. The Welch Fast Fourier Transform is used to convert time domain features to the frequency domain, and statistical features are extracted in the time domain and frequency domain respectively. ANOVA-based feature selection is used to reduce the number of variables. The Random Under-sampling (RUS) and Synthetic Minority Oversampling Technique (SMOTE) methods are used to solve the data imbalance problem. Eight machine learning algorithms, namely decision tree classifier (DTC), extra-decision tree classifier (EDTC), Linear Discriminant Analysis Classifier (LDAC), Quadratic Discriminant Classifier (QDC), Random Forest Classifier (RFC), Gradient Boosting Classifier (GBC), Multi-layer Perceptron Classifier (MLPC), and Stochastic Gradient Descent Classifier (SGDC), are used to classify the data. As a result, the performance of the proposed classifier is 99.48% accuracy, 99.79% sensitivity, and 99.17% specificity. The system might be a helpful tool for doctors to make a more reliable and objective analysis of patient EEG records.
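Welch-based frequency-domain feature extraction of the kind described can be sketched with SciPy; a synthetic sinusoid-plus-noise signal stands in for real CHB-MIT EEG:

```python
import numpy as np
from scipy.signal import welch

fs = 256  # Hz; the CHB-MIT recordings are sampled at 256 Hz
t = np.arange(0, 4, 1 / fs)
# Synthetic single-channel "EEG": a 10 Hz rhythm plus noise (illustration only).
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.default_rng(0).normal(size=t.size)

# Welch's method averages periodograms over overlapping windows to
# estimate the power spectral density (time domain -> frequency domain).
f, psd = welch(x, fs=fs, nperseg=256)

# Simple frequency-domain features of the kind fed to the classifiers.
features = {"peak_freq": float(f[np.argmax(psd)]),
            "total_power": float(psd.sum())}
print(features)
```

With `nperseg=256` at 256 Hz the frequency grid has 1 Hz resolution, so the dominant 10 Hz component shows up as the PSD peak; real pipelines compute many such statistics per channel and window before classification.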
Buzz Prediction on Twitter

Project Description: There are two different datasets, one for the regression task and one for the classification task. The right-most column in both datasets is the dependent variable, i.e. buzz. Data description files are also provided for both datasets; deciding which dataset is for which task is part of the project. Read the data into a Jupyter notebook, using pandas to import it into a data frame.

Preprocess data: Explore the data, check for missing data and apply data scaling. Justify the type of scaling used.

Regression Task: Apply all the regression models you've learned so far. If your model has scaling parameter(s), use grid search to find the best values. Use plots and graphs to get a better glimpse of the results, then use cross-validation to find the average training and testing score. Your submission should have at least the following regression models: KNN regressor, linear regression, Ridge, Lasso, polynomial regression, and SVM (both simple and with kernels). Finally, find the best regressor for this dataset, train your model on the entire dataset using the best parameters, and predict buzz for the test set.

Classification Task: Decide on a good evaluation strategy and justify your choice. Find the best parameters for the following classification models: KNN classification, Logistic Regression, Linear Support Vector Machine, Kernelized Support Vector Machine, Decision Tree. Which model gives the best results?

Buzz Prediction on Twitter

Project Description: Use the same datasets as Project 2, running all the models on only 10% of the data (use the sampling code given in Project 2).

Preprocess data: Explore the data and apply data scaling.

Regression Task: Apply any two models with bagging and any two models with pasting. Apply any two models with AdaBoost boosting. Apply one model with gradient boosting. Apply PCA to the data and then apply all the models from Project 2 again on the PCA-transformed data. Compare your results with the results in Project 2: you don't need to apply all the models twice; just copy the result table from Project 2, prepare a similar table for all the models after PCA and compare both tables. Does PCA help in getting better results? Apply the deep learning models covered in class.

Classification Task: Apply four voting classifiers: two with hard voting and two with soft voting. Apply any two models with bagging and any two models with pasting. Apply any two models with AdaBoost boosting. Apply one model with gradient boosting. Apply PCA to the data and then apply all the models from Project 2 again on the PCA-transformed data. Compare your results with the results in Project 2 as above. Does PCA help in getting better results? Apply the deep learning models covered in class.
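The "PCA, then re-run the models" step in the assignment can be sketched as a scikit-learn pipeline; synthetic regression data stands in for the buzz dataset, and Ridge is one of the assignment's listed regressors:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder for the buzz regression dataset.
X, y = make_regression(n_samples=300, n_features=12, noise=5.0, random_state=0)

# Scale, project onto the top principal components, then fit the regressor;
# wrapping the steps in a pipeline keeps the scaler and PCA fit inside each
# cross-validation fold, so no test-fold information leaks into training.
model = make_pipeline(StandardScaler(), PCA(n_components=8), Ridge())
score = cross_val_score(model, X, y, cv=5).mean()
print("CV R^2 after PCA:", round(score, 3))
```

Running the same pipeline with and without the `PCA` step and comparing the two cross-validation tables is exactly the comparison the assignment asks for.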
In this research, machine learning algorithms such as Long Short-Term Memory (LSTM), Decision Tree, Random Forest and XGBoost were used as classifiers to improve the accuracy of snowfall prediction for the region of Boston. Geographical parameters such as humidity, temperature, wind speed, precipitation, sea level, dew point and visibility were used as independent variables. Before the modeling phase, data lagging was performed for 2 steps, followed by exploratory data analysis using techniques such as multiple linear regression, correlation plots and variable importance plots. Feature selection was also performed using logistic regression and the Boruta algorithm. Experimental evaluations showed the highest accuracy for LSTM, at 89.98%. In terms of sensitivity, Random Forest outperformed the other classifier models, whereas Decision Tree and XGBoost performed well overall with respect to the other evaluation metrics. The results of this research contribute to the knowledge of weather prediction in the snowfall domain for the machine learning industry.
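The two-step data lagging mentioned above can be sketched with pandas: each variable's previous two readings become predictors for the current day. The values below are hypothetical, not the Boston weather data:

```python
import pandas as pd

# Hypothetical daily readings; the real work used humidity, temperature,
# wind speed, precipitation, sea level, dew point and visibility.
df = pd.DataFrame({"temp": [30, 28, 25, 27, 31, 33],
                   "humidity": [80, 82, 85, 83, 78, 75]})

# Two-step lagging: shift each column by 1 and 2 days so yesterday's and
# the day before's values become features for predicting today.
for col in ["temp", "humidity"]:
    for lag in (1, 2):
        df[f"{col}_lag{lag}"] = df[col].shift(lag)
df = df.dropna().reset_index(drop=True)  # first two rows lack full lags
print(df)
```

Lagging like this is what lets non-sequential models (Decision Tree, Random Forest, XGBoost) see short-term temporal context that an LSTM gets natively from the sequence.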
Niranjan41288
I did my individual project (dissertation) on ensemble methods. I first did a background study of different ensemble methods and then implemented Boosting, AdaBoost, Bagging and Random Forest techniques on underlying machine learning algorithms. I used boosting to boost the performance of weak learners such as decision stumps, implemented bagging for decision trees (both regression and classification problems) and for a KNN classifier, and used random forests for classification trees. I implemented a special boosting algorithm called "AdaBoost" on the logistic regression algorithm using different threshold values, then plotted graphs such as error rate as a function of boosting, bagging and random forest iterations. I compared the results of bagging with boosting, and analysed the performance of each classifier before and after applying ensemble methods. I used different model evaluation techniques, including cross-validation, MSE, PRSS, ROC curves, confusion matrices, and out-of-bag error estimation, to estimate the performance of the ensemble techniques.
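Boosting decision stumps, as described above, can be sketched with scikit-learn's AdaBoost on synthetic data (not the dissertation's datasets or code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A decision stump (depth-1 tree) is the classic weak learner; AdaBoost
# reweights misclassified points before fitting each successive stump.
stump_acc = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr).score(X_te, y_te)
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=100, random_state=0)
boost_acc = boost.fit(X_tr, y_tr).score(X_te, y_te)
print(f"single stump: {stump_acc:.3f}  boosted stumps: {boost_acc:.3f}")
```

Tracking test error as `n_estimators` grows reproduces the error-rate-versus-iterations plots the dissertation describes.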
Viru9029
Practical Machine Learning: Machine learning in a nutshell, supervised learning, unsupervised learning, ML applications in the real world.
Introduction to Feature Engineering and Data Pre-processing: Data preparation, feature creation, data cleaning & transformation, data validation & modelling, feature selection techniques, dimensionality reduction, recommendation systems and anomaly detection, PCA.
ML Algorithms: Decision trees, oblique trees, random forest, Bayesian analysis and Naïve Bayes classifier, support vector machines, KNN, gradient boosting, ensemble methods, bagging & boosting, association rule learning, Apriori and FP-Growth algorithms, linear and nonlinear classification, regression techniques, clustering, k-means, overview of factor analysis, ARIMA, ML in real time, algorithm performance metrics (ROC, AUC, confusion matrix, F1 score, MSE, MAE), DBSCAN clustering, anomaly detection, recommender systems.
Self-Study: • Usage of ML algorithms and performance metrics (confusion matrix, sensitivity, specificity, ROC, AUC, F1 score, precision, recall, MSE, MAE) • Credit card fraud analysis, intrusion detection systems
kpasagada
Machine learning mini projects implementing Perceptron Learning, SVM (primal and dual), KNN, Decision Trees, Boosting with AdaBoost, AdaBoost with coordinate descent, Bagging, PCA, Gaussian Naive Bayes classifier, Spectral Clustering, L1- and L2-regularized Logistic Regression, and Gaussian Mixture Models via the Expectation-Maximization (EM) algorithm, all from scratch in Python, on UCI data sets such as the Leaf, Sonar, SPECT Heart, Parkinsons, and Mushroom data sets.
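For the flavor of the from-scratch work described above, a perceptron fits in a few lines of NumPy. This is an illustrative sketch on a toy separable dataset, not the repository's code:

```python
# The classic perceptron update rule, from scratch (toy data).
import numpy as np

def perceptron(X, y, epochs=20, lr=1.0):
    """Train weights w and bias b on labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified point: update
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Linearly separable toy set: class given by the sign of the first feature.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
preds = np.sign(X @ w + b)
print(preds)  # [ 1.  1. -1. -1.] -- matches y on this separable toy set
```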
darshanhs11
The problem of epilepsy has grown exponentially, and it is now considered one of the most prevalent neurological disorders, affecting around 50 million people around the globe. Epilepsy is identified by analyzing the interictal activity present in the EEG signal, but visual analysis of EEG is tedious and subject to human error. This work proposes a robust method to ease the burden of intractable seizures through automatic recognition of ictal epileptiform activity in the EEG of epileptic patients. Classification between EEG with an epileptic seizure and without one is performed using various machine learning algorithms: weighted KNN, boosted trees, bagged trees, subspace discriminant, subspace KNN, and RUSBoosted trees. Based on classifier accuracy, one method is selected and the model is exported as a function for validation purposes.
Brain-Computer Interfaces have impacted the lives of many, especially those whose mobility and ability to speak are affected, by bridging the gap between thoughts and devices. One of the most popular applications, the P300 speller, is a powerful aid that allows patients to regain a degree of autonomy. Detection of a P300 peak and character identification are the two major components of a P300 speller; this study covers the first. Conventional learning algorithms such as Support Vector Machines, Discriminant Analysis, Neural Networks, and their variants have been used in previous studies. These methods have limitations: some are prone to overfitting, others require a large amount of training data, and some demand complicated computation, making them less favorable for real-time analysis. Boosting algorithms remain comparatively unexplored in the field of electroencephalography (EEG) and are less prone to most of the limitations of these conventional models. This paper evaluates the performance of LightGBM and CatBoost on the dataset used in the BCI NER 2015 competition on Kaggle. These algorithms have recently gained popularity and have proven to be powerful. They are further compared with XGBoost and AdaBoost, and a maximum F1 score of 0.84 was achieved using LightGBM as the classifier.
agrawal-priyank
Built classifiers in Python using logistic regression and decision trees to classify product reviews, applying boosting to improve accuracy, stochastic gradient descent for optimization, and precision and recall for evaluation
SamarthSajwan
The main aim of every academic institution is the placement of its students in reputed MNCs; an institute's reputation and yearly admissions depend on the placements it provides. A system that predicts student placements therefore has a positive impact on an institute, increases its strength, and reduces the workload of its training and placement office (TPO). With machine learning techniques, knowledge can be extracted from previously placed students and the placement of upcoming students can be predicted. The data used for training is taken from the same institute for which the placement prediction is done. Suitable data pre-processing methods are applied along with feature selection, and some domain expertise is used for pre-processing and for handling outliers in the dataset. Placement plays an important role in a world full of unemployment, and the rankings and ratings of institutes depend on the average package and number of placements they provide, so the main objective of this model is to predict whether a student will get a placement or not. Different classifiers were applied: Logistic Regression, SVM, KNN, Decision Tree, Random Forest, and advanced techniques such as Bagging, AdaBoost, Gradient Boosting, XGBoost, and a Voting Classifier. A student's overall academics are taken into consideration; since placement activity takes place in the last year of academics, the last year's semesters are excluded.
This project aims to detect fake news articles using various machine learning algorithms, including Logistic Regression, Decision Tree, Gradient Boosting Classifier (GBC), and Random Forest Classifier (RFC). Fake news detection is a critical task in today's information age, as it helps in identifying and mitigating the spread of false information.
nasehacho
A basic machine learning pipeline with an ensemble classifier (random forest or boosted model) that assesses whether fraudulent activity has occurred for a transaction in a pseudo-dataset.
AvaAvarai
Cross-platform tool for computational interactive visual learning, using lossless General Line Coordinate data visualizations and human-in-the-loop guided classification with eight classifier algorithms to find, test, and boost robust machine learning models, with the goal of a high case-to-parameter ratio.
SunandaBiswas
The dataset is a real dataset on Pima Indian diabetes that consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, BMI, insulin level, glucose level, blood pressure, skin thickness, age, and so on. To predict a diabetes diagnosis, I used several models: Logistic Regression, KNN Classifier, Decision Tree Classifier, Random Forest Classifier, Support Vector Classifier, and Gradient Boosting Classifier. Moreover, an important part of building any machine learning model is exploratory data analysis (EDA), which is also briefly described in this project.
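Fitting the model family listed above can be sketched as a single loop over sklearn estimators. A synthetic eight-feature matrix stands in for the Pima data here, since the real CSV is not bundled with this description:

```python
# Fit and score the listed classifiers side by side (synthetic 8-feature
# data standing in for the Pima Indian diabetes dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name:20s} {acc:.3f}")
```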
Object detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their 2001 paper, "Rapid Object Detection using a Boosted Cascade of Simple Features". It is a machine learning based approach where a cascade function is trained from many positive and negative images and then used to detect objects in other images. Here we work with face detection. Initially, the algorithm needs many positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from them; for this, Haar features are used. They are just like a convolutional kernel: each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle. All possible sizes and locations of each kernel are used to calculate lots of features (even a 24x24 window results in over 160,000 features), and each feature calculation requires the sum of the pixels under the white and black rectangles. To solve this, Viola and Jones introduced the integral image: however large the image, it reduces the sum calculation for any rectangle to an operation involving just four pixels, which makes things very fast. But among all the features calculated, most are irrelevant. For example, a good feature may exploit the property that the region of the eyes is often darker than the region of the nose and cheeks, or that the eyes are darker than the bridge of the nose; the same windows applied to the cheeks or anywhere else are irrelevant. So how do we select the best features out of 160,000+? This is achieved by AdaBoost.
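The integral-image trick described above can be sketched with NumPy: after one cumulative-sum pass, the sum of any rectangle needs only four lookups, regardless of the rectangle's size.

```python
# Integral image: rectangle sums in four lookups (toy 4x4 image).
import numpy as np

img = np.arange(16, dtype=np.int64).reshape(4, 4)

# Pad the integral image with a zero row/column so indexing stays simple.
ii = np.zeros((5, 5), dtype=np.int64)
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using four integral-image lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# A Haar-like two-rectangle feature: difference between adjacent halves.
left  = rect_sum(ii, 0, 0, 4, 2)   # columns 0-1
right = rect_sum(ii, 0, 2, 4, 4)   # columns 2-3
print(left, right, right - left)   # 52 68 16
```

This is exactly why feature evaluation stays cheap even for the 160,000+ candidate windows.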
Sudip-Pandit
Description of the Project: + The "Breast Cancer Dataset" is used in this project; its df.shape is (569, 31), i.e. 569 rows and 31 columns. + The dataset used in this project is at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data + I import the important Python packages (sklearn, pandas, numpy, seaborn, and matplotlib) to complete the project. + Machine learning models such as Logistic Regression, Decision Tree, Random Forest, XGBoost, AdaBoost, and Gradient Boosting classifiers have been used. + The performance of the machine learning models has been tested on the basis of accuracy score, confusion matrix, classification report, F1 score, and ROC AUC score. + I tuned hyperparameters to improve the performance of the XGBoost model. + Good visualization is as important as accuracy score in model building; the performance of the models is visualized in this project. Problem statement: The full form of XGBoost is eXtreme Gradient Boosting, a model that has won several Kaggle machine learning competitions. Much of the machine learning literature describes this model as highly accurate, efficient, and feasible. It is a decision-tree-based ensemble ML algorithm built on the gradient boosting framework, and it provides a convenient way of doing cross-validation. Cross-validation is applied to test for overfitting during the training phase: if a model gives good accuracy on the training dataset but performs poorly on unseen test data, it is overfitting (a model of low bias and high variance). I calculate the model's training and testing errors at different learning rates. Since the learning rate is best chosen between 0 and 1, I start the test with a learning rate of 0.01; the results are easy to see through good visualization. I also visualize the training and testing errors and accuracies in a graph, and finally tune the hyperparameters that help predict the testing dataset, x_test.
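The learning-rate experiment described above can be sketched as follows. sklearn's GradientBoostingClassifier stands in for XGBoost here so the snippet has no extra dependencies, and the data is synthetic rather than the breast-cancer CSV:

```python
# Training vs. testing error at several learning rates, using sklearn's
# GradientBoostingClassifier as a stand-in for XGBoost (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=569, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for lr in (0.01, 0.1, 0.5, 1.0):   # learning rates between 0 and 1
    model = GradientBoostingClassifier(learning_rate=lr, random_state=0)
    model.fit(X_tr, y_tr)
    results[lr] = (1 - model.score(X_tr, y_tr),   # training error
                   1 - model.score(X_te, y_te))   # testing error

for lr, (tr_err, te_err) in results.items():
    print(f"lr={lr:<5} train_err={tr_err:.3f} test_err={te_err:.3f}")
```

A growing gap between training and testing error as the learning rate rises is the overfitting signature the project looks for.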
edubu2
Using a gradient-boosting (XGBoost) classifier to predict whether an InstaCart customer will purchase an item again in their next order.
In this paper, I applied different pre-processing techniques, sampling methods, and machine learning techniques such as Random Forest, Logistic Regression, LightGBM Classifier, and Extreme Gradient Boosting (XGBoost) Classifier to predict customer churn. To evaluate the performance of the machine learning techniques, I considered measures such as recall, AUC (area under the curve) score, and F1 score.
nawaz-kmr
• Did an in-depth exploratory data analysis on the churn dataset and obtained valuable insights for the machine learning model. • Created machine learning models using several algorithms (LR, KNN, SVC, Random Forest, Gradient Boosting) to predict customer churn based on historical data; the Gradient Boosting classifier achieved an 8% decrease in the overall churn rate.
This project focuses on predicting the likelihood of diabetes in individuals using ensemble machine learning models. It combines various ensemble techniques, including Random Forest, AdaBoost, Gradient Boosting, Bagging, Extra Trees, XGBoost, and a Voting Classifier, among others, to produce predictions.
The main goal of this project is to build several models to predict customers' default behavior on credit card payments in a dataset with more than 30,000 customer transaction records. Python and the visualization package seaborn were used to explore the data and do basic analysis, such as visualizing the data and calculating the correlation matrix. sklearn was used to build machine learning models such as Logistic Regression, Random Forest, Gradient Boosting, AdaBoost, and a Voting Classifier (i.e., the ensemble of Random Forest, Gradient Boosting, and AdaBoost), while Keras and TensorFlow were used to build deep learning models such as a feed-forward network, with seaborn visualizing the accuracy of these models. Grid search and cross-validation were used to optimize each algorithm, and the Voting Classifier was finally determined to be the best model. Utilized: Python, Keras, TensorFlow, Seaborn, Grid Search, AdaBoost, Machine Learning, Deep Learning, Logistic Regression, Random Forest, Gradient Boosting, Voting Classifier, Cross-Validation.
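The voting ensemble plus grid search described above can be sketched with sklearn. Synthetic data stands in for the 30,000-record credit-card dataset, and the parameter grid here is a small illustrative one:

```python
# Voting ensemble of Random Forest, Gradient Boosting, and AdaBoost,
# tuned with a cross-validated grid search (synthetic data).
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

voting = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("ada", AdaBoostClassifier(random_state=0)),
])

# One hyperparameter per base model, addressed with the "name__param" syntax.
grid = GridSearchCV(voting, {"rf__n_estimators": [50, 100],
                             "gb__learning_rate": [0.05, 0.1]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```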