Found 1,429 repositories (showing 30)
Code to compute permutation and drop-column importances in Python scikit-learn models
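The two importance schemes named above can be sketched with scikit-learn. This is not this repository's code, only a minimal illustration on synthetic data: permutation importance shuffles one column at a time and measures the score drop, while drop-column importance retrains without each column.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 3 informative.
X, y = make_regression(n_samples=500, n_features=5, n_informative=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Permutation importance: drop in validation score when one column is shuffled.
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print("permutation:", perm.importances_mean.round(3))

# Drop-column importance: retrain without each column, compare to the baseline.
baseline = rf.score(X_val, y_val)
drop = []
for j in range(X.shape[1]):
    X_tr = np.delete(X_train, j, axis=1)
    X_va = np.delete(X_val, j, axis=1)
    m = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_train)
    drop.append(baseline - m.score(X_va, y_val))
print("drop-column:", np.round(drop, 3))
```

Drop-column importance is more expensive (one refit per feature) but accounts for the model's ability to compensate with correlated features.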
This project demonstrates an AI-driven workflow for synthetic carbon sequestration potential mapping. It generates geospatial eco-variables, builds a Random Forest model, evaluates performance, interprets feature importances, and produces prediction maps—designed for teaching, research, and demonstration, not real-world policy or operational use.
elena-roff
Data Analysis and Machine Learning with Python: EDA with ECDF and Correlation analysis, Preprocessing and Feature engineering, L1 (Lasso) Regression and Random Forest Regressor with scikit-learn backed up by cross-validation, grid search and plots of feature importance.
tsyoshihara
Alzheimer’s Disease (AD) is the most common neurodegenerative disease. It is typically late onset and can develop substantially before diagnosable symptoms appear. Electroencephalography (EEG) could potentially serve as a noninvasive diagnostic tool for AD. Machine learning can help make inferences about changes in frequency bands in EEG data and how these changes relate to neural function. The EEG data was sourced from the 2014 paper "Alzheimer’s disease patients classification through EEG signals processing" by Fiscon et al., covering patients with AD, mild cognitive impairment (MCI), and healthy controls. The data had already been preprocessed with a fast Fourier transform (FFT) to move it from the time domain to the frequency domain. Classification effectiveness varied, but Fisher’s discriminant analysis (FDA), relevance vector machine, and random forest approaches were generally the most successful. Because feature importances were inconsistent across models, no conclusions could be drawn at this time about which frequency bands matter for classification, and different frequencies could not be localized to different regions of the brain. Further research is needed to develop more interpretable models for classification.
danielchristopher513
Stroke is a disease that affects the arteries leading to and within the brain. A stroke occurs when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot or ruptures. According to the WHO, stroke is the 2nd leading cause of death worldwide. Globally, 3% of strokes are subarachnoid hemorrhages, 10% are intracerebral hemorrhages, and the majority, 87%, are ischemic strokes. 80% of these strokes can be prevented, so proper education on the signs of stroke is very important. Early detection of stroke is a crucial step for efficient treatment, and machine learning (ML) can be of great value in this process, helping health professionals make clinical decisions and predictions. Over the past few decades, several studies have sought to improve stroke diagnosis with ML in terms of accuracy and speed, yet existing research remains limited in predicting whether a stroke will occur and which risk factors pertain to various types of strokes. In this work, ML techniques including Random Forest, KNN, XGBoost, CatBoost and Naive Bayes have been used for prediction. Our work also determines the importance of the characteristics available in the dataset. Our contribution can help predict early signs of this deadly disease and aid its prevention.
SWFSC
Estimate Permutation p-Values for Random Forest Importance Metrics
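The idea behind permutation p-values for importance metrics (this repository provides an R implementation; the code below is only an illustrative Python sketch, not the package) is to build a null distribution of importances by refitting the forest on label-permuted data and asking how often a null importance is at least as large as the observed one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
observed = rf.feature_importances_

# Null distribution: refit on permuted labels, where no real signal exists.
rng = np.random.default_rng(0)
n_perm = 30  # small for the sketch; real analyses use many more permutations
null = np.empty((n_perm, X.shape[1]))
for i in range(n_perm):
    y_perm = rng.permutation(y)
    null[i] = RandomForestClassifier(
        n_estimators=100, random_state=i).fit(X, y_perm).feature_importances_

# p-value: fraction of null importances at least as large as the observed one,
# with the +1 correction so p is never exactly zero.
pvals = (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)
print(pvals.round(3))
```

Features whose importance is not clearly above the no-signal baseline get large p-values and can be discarded.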
gxf1027
C++ implementation of random forests classification, regression, proximity, variable importance and imputation.
Vansh-Arora09
🩺 PulsePredict: An end-to-end ML platform for public health analysis. Predicts mortality risk categories and "Diagnostic Verdicts" (Causes of Death) using Random Forest & PCA. Features a custom Streamlit dashboard with real-time risk gauges and feature importance audits. 🚀
We used different machine learning approaches to build models for detecting and visualizing important prognostic indicators of breast cancer survival. This repository contains R source code for 5 steps: model evaluation, further Random Forest modelling, variable importance, decision trees, and survival analysis. These can serve as a pipeline for researchers interested in conducting survival-prediction studies on any type of cancer using multi-model data.
ZhengzeZhou
Implementation of unbiased measurement of feature importance in Random Forests
dwpsutton
Feature selection tool using random forest variable importance measures.
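A minimal sketch of this kind of importance-based feature selection, using scikit-learn's `SelectFromModel` rather than this particular tool (synthetic data, illustrative thresholds):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                           random_state=0)

# Keep only the features whose random forest importance exceeds the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold="mean",
).fit(X, y)

X_sel = selector.transform(X)
print(X_sel.shape)  # fewer columns than the original 20
```

The threshold can also be given numerically or as a multiple of the mean/median importance, depending on how aggressive the selection should be.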
NhanPhamThanh-IT
🍾 A comprehensive machine learning project using Random Forest algorithm to predict wine quality based on physicochemical properties. Features EDA, model training, hyperparameter tuning, feature importance analysis, and detailed documentation.
This project applies machine learning to predict crop yield using synthetic remote sensing and weather data, showcasing how vegetation indices and climatic factors can influence agricultural productivity. It demonstrates yield forecasting with Random Forest and visual insights into feature importance.
Various machine learning approaches are widely applied for short-term solar power forecasting, which is highly demanded for renewable energy integration and power system planning. However, appropriate selection of machine learning models and data features is a significant challenge. In this study, a framework is developed to quantitatively evaluate various models and feature selection methods, and the best combination for short-term solar power forecasting is discovered. More specifically, the machine learning methods include the random forest, artificial neural network and extreme gradient boosting (XGBoost), and the feature selection techniques include feature importance and principal component analysis (PCA). All possible combinations of these machine learning and feature selection methods are developed and evaluated for solar power forecasting. The best ensemble of machine learning methods and feature selection techniques is identified for solar power forecasting in Hawaii, US. Simulation results show that the XGBoost method with features selected by the PCA method outperforms the other approaches. In addition, the random forest and XGBoost models have rarely been used for short-term solar forecasting. This framework can be used to select appropriate machine learning approaches for short-term solar power forecasting, and the simulation results can be used as a baseline for comparison.
Aayushi-2808
# Cervical_cancer_detection_using_ML
# Introduction
According to the World Health Organisation (WHO), when detected at an early stage, cervical cancer is one of the most curable cancers. Hence, the main motive behind this project is to detect the cancer in its early stages so that it can be treated and managed effectively.
# Flow of the project
This project is divided into 5 parts: 1. Data Cleaning 2. Exploratory Data Analysis 3. Baseline model: Logistic Regression 4. Ensemble Models: Bagging with Decision Trees, Random Forest and Boosting 5. Model Comparison and results
# References
Basic information regarding cervical cancer: https://www.cdc.gov/cancer/cervical/basic_info/index.htm The dataset is supplied by the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29 The dataset contains a list of risk factors that lead up to the biopsy examination. The generation of the predictor variable is handled in part 2 (Exploratory Data Analysis) of this report. We try to predict the 'Biopsy' variable from the dataset using Logistic Regression, Random Forest, Bagging with Decision Trees and Boosting with the XGBoost classifier.
# Results
Based on our base model and the ensemble models we used, we observed: 1. After training, hyperparameter tuning and handling class imbalance, we obtained the results depicted in the graphics. 2. Bagging and Random Forest give the highest accuracy and precision, 97.09% and 80% respectively. 3. The confusion matrix showed that Random Forest with upsampling and class weights gives 2 false positives and 3 false negatives, with an AUC of 0.87.
# Why is Random Forest the best model?
1. Comparing all of our models, RF has the maximum F1 score and accuracy along with Bagging, i.e. 76.2% and 97.09% respectively. 2. It also produces the same number of false negatives, with a recall of 72.73%, just like the other models. 3. We still consider RF better because of its added advantage: its decision trees are decorrelated compared to bagging, leading to lower variance and a greater ability to generalize.
# Conclusion
Looking at the feature importances of the best model, Random Forest, the most important features are Schiller, Hinselmann, HPV, Citology, etc. This makes sense because Schiller and Hinselmann are tests actually used to detect cervical cancer.
# Problems Faced
A major problem encountered while training the model was that there was too little data to train on. By collaborating with hospitals across India, we could gather enough data points to train a model with higher recall, making the model better.
# Scope of Improvement
As next steps we would want to do exactly that: deploy the model and refine it. We may also revise the set of predictor variables, as there may well be useful predictors not present in our current dataset. This can only be found through practical implementation of our predictions.
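The class-weighting mentioned in the results above (one way to handle the class imbalance, alongside upsampling) can be sketched with scikit-learn; this uses synthetic imbalanced data as a stand-in for the UCI biopsy target, not the project's own code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0.01,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights samples inversely to class frequency,
# an alternative (or complement) to upsampling the minority class.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=0).fit(X_tr, y_tr)
rec = recall_score(y_te, rf.predict(X_te))
print("minority-class recall:", round(rec, 3))
```

On imbalanced medical data, recall on the positive class is usually the metric to watch, since false negatives are the costly errors.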
Bribak
This repository constitutes SURFY2 and corresponds to the bioRxiv preprint 'Updating the in silico human surfaceome with meta-ensemble learning and feature engineering' by Daniel Bojar. SURFY2 is a machine learning classifier that predicts whether a human transmembrane protein is located at the surface of a cell (the plasma membrane) or in one of the intracellular membranes, based on the sequence characteristics of the protein. Making use of the data described in the recent publication by Bausch-Fluck et al. (https://doi.org/10.1073/pnas.1808790115), SURFY2 considerably improves on their reported classifier SURFY in terms of accuracy (95.5%), precision (94.3%), recall (97.6%) and area under the ROC curve (0.954) on a test set never seen by the classifier before. SURFY2 consists of a layer of 12 base estimators generating 24 new engineered features (class probabilities for both classes), which are appended to the original 253 features. Then, a soft voting classifier with three optimized base estimators (Random Forest, Gradient Boosting and Logistic Regression) and optimized voting weights is trained on this expanded dataset, yielding the final prediction. The motivation of SURFY2 is to provide an updated and better version of the in silico human surfaceome to facilitate research and drug development on human surface-exposed transmembrane proteins. Additionally, SURFY2 enabled insights into biological properties of these proteins and generated several new hypotheses and ideas for experiments. The workflow is as follows:
1) dataPrep: gets training data from data.xlsx, labels it according to surface class and outputs 'train_data.csv'.
2) split: gets train_data.csv, splits it into training, validation and test data and outputs 'train.csv', 'val.csv', 'test.csv'.
3) main_val: was used for optimizing hyperparameters of the base estimators and the estimators & weights of the voting classifier. Stores all estimators. Evaluates the meta-ensemble classifier SURFY2 on the validation set.
4) classifier_selection: all base estimators and meta-ensemble approaches are tested on the initial dataset as well as the expanded dataset including the engineered features, and compared in terms of their cross-validation score.
5) main_test: evaluates SURFY2 on the separate test set (trained on training + validation set).
6) testing_SURFY: evaluates the original SURFY through cross-validation and on the validation as well as the test set.
7) pred_unlabeled: uses SURFY2 to predict the surface label (+ prediction score) for unlabeled proteins in data.xlsx. Also gets the feature importances of the voting classifier estimators.
8) getting_discrepancies: compares predictions with those made by SURFY ('surfy.xlsx') and stores mismatches, as well as the 10 most confident mismatches (by SURFY2 classification score) from each class.
9) feature_importances: plots the 10 most important features for the voting classifier estimators (Random Forest, Gradient Boosting, Logistic Regression) to interpret predictions.
10) base_estimator_importances: plots the 10 most important features for the two most important base estimators (XGBClassifier and Gradient Boosting).
11) comparing_mismatches: separates datasets into shared & discrepant predictions (between SURFY and SURFY2), compares feature means and selects the features with the highest class feature-mean differences between prediction datasets. Statistically analyzes differences in feature means between classes in both prediction datasets. Plots 9 representative features with their means grouped by class and prediction dataset to rationalize discrepant predictions.
12) tSNE_surfy2: performs nonlinear dimensionality reduction using t-SNE on proteins with predictions from both SURFY and SURFY2. Plots the two t-SNE dimensions and labels the proteins according to their prediction class in order to see where discrepant predictions reside in the landscape. Also plots surface proteins with the most prevalent annotated functional subclasses, labeled by subclass, to enable comparison with class predictions. Functional annotations came from 'surfy.xlsx'.
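The meta-ensemble pattern described above (base-estimator class probabilities appended to the raw features, then a soft voting classifier on the expanded matrix) can be sketched with scikit-learn. This is not the SURFY2 code: it uses only two base estimators, synthetic data, and, for simplicity, fits the base estimators on the same split the voter trains on, which SURFY2's actual protocol may handle differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=25, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Layer 1: base estimators whose class probabilities become engineered features.
bases = [RandomForestClassifier(n_estimators=100, random_state=0),
         GradientBoostingClassifier(random_state=0)]
for b in bases:
    b.fit(X_tr, y_tr)

def expand(X_in):
    """Append each base estimator's class probabilities to the raw features."""
    probs = [b.predict_proba(X_in) for b in bases]
    return np.hstack([X_in] + probs)

# Layer 2: soft-voting ensemble trained on the expanded feature matrix.
vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
                ("gb", GradientBoostingClassifier(random_state=1)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
).fit(expand(X_tr), y_tr)

acc = vote.score(expand(X_te), y_te)
print("test accuracy:", round(acc, 3))
```

Soft voting averages the estimators' predicted probabilities, so well-calibrated base models matter more than in hard (majority) voting.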
waltergenchi
Explore Importance of Features in Random Forests
Xion6cc
LightGBM, XGboost, Random Forest, Stacking, Feature Selection (PCA, correlation, feature importance)
ABSTRACT: Student dropout is of the utmost concern in higher education, and machine learning techniques have become a powerful tool for proactively identifying students at risk of dropping out. Data from more than 17,000 National University students were used to train nine machine learning algorithms to predict first-year dropout under four conditions, thus resulting in thirty-six models. The algorithms included were Logistic Regression, Naïve Bayes, Neural Networks, k-Nearest Neighbor, Support Vector Machine with linear and polynomial kernels, Decision Tree, Random Forest, and XGBoost. Modeling conditions varied with regard to class balancing and feature reduction. Models were evaluated based on ROC area and accuracy. The ensemble tree methods XGBoost and Random Forest were superior across all modeling conditions. Overall, class balancing and feature reduction did not improve model performance. Feature importance was examined and many novel features proved to be useful for dropout prediction. Recommendations for
MadhavJee
The Restaurant Success Prediction project uses a Random Forest Classifier to predict restaurant success (rating ≥ 4.0) from a Zomato dataset. It cleans data, encodes features like location and cost, and evaluates performance with metrics and visuals like feature importance plots. Key drivers include votes and location.
In this research, machine learning algorithms such as Long Short-Term Memory (LSTM), Decision Tree, Random Forest and XGBoost were used as classifiers to improve the accuracy of snowfall prediction for the region of Boston. Geographical parameters such as humidity, temperature, wind speed, precipitation, sea level, dew point and visibility were used as independent variables. Before the modeling phase, the data was lagged by 2 steps, followed by exploratory data analysis using techniques such as multiple linear regression, correlation plots and variable importance plots. Feature selection was also executed using Logistic Regression and the Boruta algorithm. Experimental evaluations showed the highest accuracy for LSTM, at 89.98%. In terms of sensitivity, Random Forest outperformed the other classifier models, whereas Decision Tree and XGBoost performed well on the other evaluation metrics. The results of this research add to the knowledge of weather prediction in the domain of snowfall for the machine learning industry.
huomarc
Due to the intricacies of the body of a fetus, as well as the rapid rate at which fetuses develop, fetal health care is one of the most difficult medical fields to practice effectively. Due to the challenges around fetal health care, reducing child and maternal mortality rates is of paramount importance to every modern country. The reduction of child mortality is reflected in several of the United Nations' Sustainable Development Goals. The UN expects that by 2030, member countries will effectively end preventable deaths of newborns and children under 5 years of age (Child Survival and the SDGs, 2021). All member countries are aiming to reduce neonatal mortality to at least as low as 12 deaths per 1,000 live births and under-5 mortality to at least as low as 25 deaths per 1,000 live births (Child Survival and the SDGs, 2021). Maternal mortality is intrinsically tied with child mortality and has continued to have wide racial/ethnic gaps, even today. Black, American Indian, and Alaska Native (AI/AN) women are two to three times more likely to die from pregnancy-related causes than white women, and that disparity only increases with age (National Center for Health Statistics). Thus, the necessity of mitigating child mortality as much as possible cannot be overstated. Electronic fetal monitoring or cardiotocography is a “visual representation” of uterine contractions and fetal heart rate, which has been recognized as a prominent indicator of fetal health since the 19th century (Petker, 2018). Certain fetal heart rate patterns are linked to non-reassuring fetal status, and therefore, fetal heart rate monitoring can help prevent poor fetal outcomes (Petker, 2018). The fetal heart rate monitor identifies the normal baseline rate and tracks variability, accelerations, and decelerations to provide insight into a baby’s level of stress, oxygenation, acidemia (increase in hydrogen ion blood concentration), and other vital signs (Petker, 2018). 
A host of models were applied, including XGBoost, LightGBM, Gradient Boosting, Random Forest, Multi-layer Perceptron, Support Vector, Decision Tree, K-Nearest Neighbors, Linear Support Vector, and Logistic Regression, to the cardiotocography data to predict fetal health outcomes and level of care.
jkkooiju
In this notebook, we explored the dataset, performed necessary preprocessing steps, and developed a Random Forest model to predict diabetes. Feature importance analysis provides insights into the crucial factors influencing the predictions.
janeminmin
1> Background information Bluebikes is Metro Boston's public bike share program, with more than 1,800 bikes at over 200 stations across Boston and nearby areas. The bike sharing program launched in 2011 and is aimed at short-term use for a price: individuals borrow a bike from a dock station and return it to another, which makes it ideal for one-way trips. The City of Boston is committed to providing bike share as part of the public transportation system. However, to build a transport system that encourages bicycling, it is important to build knowledge about current bicycle flows and the factors involved in the decision-making of potential bicyclists when choosing whether to use a bicycle. It is reasonable to hypothesize that age, gender, bicycle infrastructure and safety perception are possible determinants of bicycling. In the short term, weather has been shown to play an important role in whether the bicycle is chosen. 2> Data collection Bluebikes collects and provides system data to the public. The datasets used in the project can be downloaded through this link (https://www.bluebikes.com/system-data). This time series dataset (from 2017-01-01 00:00:00 to 2019-03-31 23:00:00) includes: trip duration, start time and date, stop time and date, start station name and id, end station name and id, bike id, user type (casual or subscriber), birth year, and gender. Any trip below 60 seconds in length is considered a potentially false start and has already been removed from the datasets. The number of bicycles used during a particular time period varies based on several factors, including the current weather conditions, time of day, time of year and the current interest of the rider in using the bicycle as a transport mode.
This interest differs between subscribers and casual users, so we analyze them separately. Factors such as season, day of the week, month, hour and holiday status can be extracted from the date and time column in the datasets. Since we analyze the hourly bicycle rental flow, we need hourly weather data from 2017-01-01 00:00:00 to 2019-03-31 23:00:00 to complete our regression model of prediction. The weather data used in the project was scraped with Python Selenium from the Logan airport station (42.38 °N, 71.04 °W) page (https://www.wunderground.com/history/daily/us/ma/boston/KBOS/date/2019-7-15) maintained by the Weather Underground website. The hourly weather observations include time, temperature, dew point, humidity, wind, wind speed, wind gust, pressure, precipitation, accumulated precipitation and condition. 3> The problem The aim of the project is to gain insight into the factors that give a short-term perspective on bicycle flows in Boston. It also investigates how busy each station is, the distribution of trip direction and duration at a busy station, and how mean flows vary within a day or over that period. In addition to the factors included in the regression model, other factors influence how bicycle flows vary over longer periods of time, for example the general tendency to use the bicycle. Therefore, there is potential to improve the regression model's accuracy by incorporating a long-term trend estimate taken over the time series of bicycle usage. The results from the machine-learning-based regression model should then be compared with time-series-forecasting-based models.
4> Possible solutions Data preprocessing/exploration and variable selection: date approximation, correlation analysis among variables, merging data, scrubbing duplicate data, verifying errors, interpolating missing values, handling outliers and skewness, binning low-frequency levels, encoding categorical variables. Data visualization: split the number of bike uses by subscriber/casual to build time series; build a heatmap showing how busy each station is and locate the busiest station in the busiest period of a busy day; use boxplots and histograms to check outliers and determine appropriate data transformations; use the weather condition text to build a word cloud. Time series trend curve estimates: two possible approaches we considered are fitting polynomials of various degrees to the data points in the time series, or using time series decomposition and forecast functions to extract and forecast the trend. We emphasize the importance of generating trend curve estimates that do not follow the seasonal variations: the seasonal variations should be captured explicitly by the weather-related input variables in the regression model. Prediction/regression/time series forecasting: it is possible to build a multilayer perceptron neural network regressor on all variables of date, time and weather. However, considering the interpretability of the model, we prefer to build regression models based on machine learning algorithms (such as random forest or SVM) separately for subscribers and casual users. The regressor would then be combined with the trend curve extracted and forecast by ARIMA, and compared with the results of time series forecasting by STL (Seasonal and Trend decomposition using Loess) with multiple seasonal periods and by TBATS (Trigonometric Seasonality, Box-Cox Transformation, ARMA residuals, Trend and Seasonality).
Repository for the paper "Understanding variable importances in forests of randomized trees"
mananshukla6
AutoML_Partial_Dependence_Plot, Decision Tree Surrogate Model, K-Lime, LOCO, Neural Network Visualisation, Partial dependence plots, Random forest feature importance
RohithM191
Amazon-Food-Reviews-Analysis-and-Modelling Using Various Machine Learning Models. Performed Exploratory Data Analysis, Data Cleaning, Data Visualization and Text Featurization (BOW, tf-idf, Word2Vec). Built several ML models like KNN, Naive Bayes, Logistic Regression, SVM, Random Forest, GBDT, LSTM (RNNs) etc. Objective: given a text review, determine whether the sentiment of the review is positive or negative. Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews About the dataset: the Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon. Number of reviews: 568,454. Number of users: 256,059. Number of products: 74,258. Timespan: Oct 1999 - Oct 2012. Number of attributes/columns in the data: 10. Attribute information: Id; ProductId - unique identifier for the product; UserId - unique identifier for the user; ProfileName; HelpfulnessNumerator - number of users who found the review helpful; HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not; Score - rating between 1 and 5; Time - timestamp for the review; Summary - brief summary of the review; Text - text of the review. 1 Amazon Food Reviews EDA, NLP, Text Preprocessing and Visualization using t-SNE: defined the problem statement; performed Exploratory Data Analysis (EDA) on the Amazon Fine Food Reviews dataset and plotted word clouds, distplots, histograms, etc.; performed data cleaning & preprocessing by removing unnecessary and duplicate rows, and for text reviews removed HTML tags, punctuation and stopwords and stemmed the words using the Porter stemmer; documented the concepts clearly; plotted t-SNE plots for different featurizations of the data, viz. BOW (uni-gram), tf-idf, Avg-Word2Vec and tf-idf-Word2Vec. 2 KNN: applied K-Nearest Neighbours on different featurizations of the data, viz.
BOW (uni-gram), tf-idf, Avg-Word2Vec and tf-idf-Word2Vec; used both brute-force & kd-tree implementations of KNN; evaluated the test data on various performance metrics like accuracy and also plotted the confusion matrix using seaborn. Conclusions: KNN is a very slow algorithm and takes a very long time to train; the best accuracy, 89.38%, is achieved by the Avg-Word2Vec featurization; the kd-tree and brute-force KNN implementations give comparable results; overall, KNN was not that good for this dataset. 3 Naive Bayes: applied Naive Bayes using Bernoulli NB and Multinomial NB on different featurizations of the data, viz. BOW (uni-gram) and tf-idf; evaluated the test data on various performance metrics like accuracy, F1 score, precision, recall, etc. and also plotted the confusion matrix using seaborn; printed the top 25 important features for both negative and positive reviews. Conclusions: Naive Bayes is a much faster algorithm than KNN; the performance of Bernoulli Naive Bayes is much better than Multinomial Naive Bayes; the best F1 score, 0.9342, is achieved by the BOW featurization. 4 Logistic Regression: applied Logistic Regression on different featurizations of the data, viz. BOW (uni-gram), tf-idf, Avg-Word2Vec and tf-idf-Word2Vec; used both grid search & randomized search cross-validation; evaluated the test data on various performance metrics like accuracy, F1 score, precision, recall, etc. and also plotted the confusion matrix using seaborn; showed how sparsity increases as we increase lambda (decrease C) when the L1 regularizer is used, for each featurization; did a perturbation test to check whether the features are multicollinear. Conclusions: sparsity increases as we decrease C (increase lambda) when using the L1 regularizer; tf-idf featurization performs best, with an F1 score of 0.967 and accuracy of 91.39%; the features are multicollinear across featurizations; Logistic Regression is a fast algorithm.
5 SVM: applied SVM with the RBF (radial basis function) kernel on different featurizations of the data, viz. BOW (uni-gram), tf-idf, Avg-Word2Vec and tf-idf-Word2Vec; used both grid search & randomized search cross-validation; evaluated the test data on various performance metrics like accuracy, F1 score, precision, recall, etc. and also plotted the confusion matrix using seaborn; evaluated SGDClassifier on the best-performing featurization. Conclusions: BOW featurization with a linear kernel and grid search gave the best results, with an F1 score of 0.9201; SGDClassifier takes very little time to train. 6 Decision Trees: applied Decision Trees on different featurizations of the data, viz. BOW (uni-gram), tf-idf, Avg-Word2Vec and tf-idf-Word2Vec; used grid search with 30 random points to find the best max_depth; evaluated the test data on various performance metrics like accuracy, F1 score, precision, recall, etc. and also plotted the confusion matrix using seaborn; plotted the feature importances received from the decision tree classifier. Conclusions: BOW featurization (max_depth=8) gave the best results, with an accuracy of 85.8% and F1 score of 0.858; Decision Trees on BOW and tf-idf would have taken forever with all dimensions, as the data has huge dimensionality, hence max_depth was capped at 8. 7 Ensembles (RF & GBDT): applied Random Forest and GBDT on different featurizations of the data, viz. BOW (uni-gram), tf-idf, Avg-Word2Vec and tf-idf-Word2Vec; used grid search with 30 random points to find the best max_depth, learning rate and n_estimators; evaluated the test data on various performance metrics like accuracy, F1 score, precision, recall, etc. and also plotted the confusion matrix using seaborn; plotted a word cloud of the feature importances received from the RF and GBDT classifiers. Conclusions: tf-idf featurization in Random Forest (base learners = 10) with random search gave the best results, with an F1 score of 0.857; tf-idf featurization in GBDT (base learners = 275, depth = 10) gave the best results, with an F1 score of 0.8708.
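The sparsity-versus-C observation from the Logistic Regression section above can be reproduced on synthetic data (a sketch, not the project's code): with an L1 penalty, decreasing C, i.e. increasing lambda = 1/C, drives more coefficients to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# Count non-zero coefficients as the L1 penalty strengthens (C shrinks).
nonzeros = []
for C in (1.0, 0.1, 0.01):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    nonzeros.append(int(np.count_nonzero(clf.coef_)))
    print(f"C={C}: {nonzeros[-1]} non-zero coefficients")
```

The surviving non-zero coefficients at small C act as an implicit feature selection, which is why L1-regularized models pair naturally with high-dimensional text featurizations.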
NikolaosProusalis
It is a fact that the aviation industry is expected to grow remarkably in the near future. Consequently, delays and environmental problems are caused by aircraft congestion at airports. Thus, it is necessary for airports to use their existing infrastructure efficiently and make changes to the system and the way airports are operated. For this reason, predicting aircraft taxi time is essential to help airports understand what needs to change in order to optimise their efficiency and reduce aircraft taxi time. This project concerns Manchester airport; data about the aircrafts' features and external factors were given in order to predict taxi time. This machine learning project followed the CRISP-DM process for data mining. All processing was done in Python, as its pandas library provides a useful data frame that makes it easy to handle and modify the data. Moreover, the scikit-learn library was used for the machine learning procedure, providing all the algorithms necessary for this problem. The machine learning algorithms applied were Linear Regression, Polynomial Regression, Random Forests and Multilayer Perceptrons. Examination of the applied algorithms showed that the most suitable for the project is Polynomial Regression, because it provides the most precise and accurate prediction of taxi time, with accuracy equal to 79.94%. Furthermore, the importance of each variable was noted, as two datasets were used (one with two extra variables) and the variables were ranked according to the variable selection technique used.
aiok03
Descriptive statistics and exploratory data analysis To get an idea of the received data, we looked through the transactions and train tables. The shape of train is 6000 rows by 2 columns (client_id and target, i.e. gender). We also inspected the info of transactions and noticed that there are no empty values: every column has 130,039 entries. We then merged the two tables and called the result data. To display unique codes and types we used the 'unique' function and found 173 unique codes and 61 unique types. Using the 'describe' function we can see the minimum and maximum code, type and sum. The first hypothesis was to find which gender makes more requests. For convenience we used a for loop to convert the values to percentages. According to the bar plot, the largest number of transactions is made by females. The second hypothesis was to find the code with the largest sum. For that we grouped by code and computed the mean of the sums. We converted the resulting Series to a DataFrame for further work. The problem was that the code ended up as the index, so we fixed that with the 'reset_index' function. After plotting the graph we noticed that the highest sum belongs to code 4722, and confirmed this with additional code below the graph. The third hypothesis was to examine the distribution of sums by gender. The first graph did not reveal this information because the spread of the data is too high: the variable is not normally distributed and not symmetrical. It is hard to assess, so we grouped the data by gender and computed the mean of the sum. This showed that males spend more money than females. We repeated the process with the median and reached the same conclusion. Since the mean and the median are not equal, our assumption that the data is not normally distributed was confirmed.
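The gender/spend comparison above reduces to a groupby aggregation. The column names (client_id, gender, sum) and values below are hypothetical, mirroring the description.

```python
# Minimal sketch of the mean/median spend comparison by gender,
# on a tiny hypothetical transactions table.
import pandas as pd

data = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 3, 3],
    "gender":    ["M", "M", "F", "F", "M", "F"],
    "sum":       [120.0, 80.0, 40.0, 60.0, 200.0, 30.0],
})

# Mean and median of transaction sums per gender; comparing the two
# also hints at skew (mean != median suggests a non-symmetric distribution).
stats = data.groupby("gender")["sum"].agg(["mean", "median"])
print(stats)
```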
The last hypothesis was to find the number of clients for each type and code, i.e. the most popular request among clients. For that we applied 'str' to each parameter for correct visualization on the graph, counted the number of requests per type and per code, and plotted them. According to the graphs, the most popular are type 1010 and code 6011. Lastly, for further work we converted type and code back to int. Feature engineering Client's balance condition. We took every sum from the data frame data, grouped by client and computed the total per client, giving the income and expenses of each client. Clients with a negative total made more expenses than income; the others earned more than they spent. A negative total is encoded as 0, a positive total as 1. RFM In the RFM section we started with Recency. For each client we grouped their records and found the latest date on which a transaction was made. The datetime column consisted of two values, date and time; for the feature-engineering work we split them into separate columns. We set the most recent day to 457 and, starting from this value, computed the recency of each client's last transaction by subtraction. The next step is Frequency: we used the 'groupby' function and counted the number of appearances of each client in the database. The last step is Monetary (counting expenses). Using groupby with the condition that the sum is less than 0 (expenses are negative values), we computed the total expenses of each client and noticed that some clients did not spend any money at all. Segmentation based on RFM We merged all the tables into one and ranked the clients by the best values in each segment using percentiles. Using the formula, we mapped clients onto a 5-point score scale; with this table and the elbow method we plotted the graph, where 3 clusters was the optimal solution.
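The RFM construction above can be sketched in pandas. Column names (client_id, day, sum) and values are hypothetical stand-ins for the real transaction table; the reference day plays the role of the 457 mentioned in the text.

```python
# Sketch of the RFM computation: recency from the latest transaction day,
# frequency as transaction count, monetary as the total of negative sums.
import pandas as pd

data = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "day":       [450, 457, 300, 410, 455],
    "sum":       [-50.0, 120.0, -30.0, -20.0, 80.0],
})

ref_day = data["day"].max()               # the most recent day in the data
rfm = data.groupby("client_id").agg(
    recency=("day", lambda d: ref_day - d.max()),
    frequency=("day", "count"),
)
# Monetary: total expenses (negative sums only) per client; clients with
# no expenses at all get 0.
rfm["monetary"] = data[data["sum"] < 0].groupby("client_id")["sum"].sum()
rfm["monetary"] = rfm["monetary"].fillna(0)
print(rfm)
```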
With the KMeans library we plotted the k-means illustration of clients according to their distance from randomly chosen centroids and showed the distribution of clients across clusters. After that we assembled a basic table of clusters, using a prefix for each of them. Clustering for codes Now we work with the codes to create code clusters, using TF-IDF and k-means. We also employ lemmatization, tokenization and stop-word removal. We import the pymorphy2 library for lemmatization; lemmatization reduces words to their original (dictionary) form. Sentence tokenization is the process of dividing written language into its component sentences. We also need to delete stop words: a stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We do not want these words to take up space in our database or waste valuable processing time. We also make use of re, the regular-expression operations library. In this section we use MorphAnalyzer(): morphological analysis is the identification of a word's features based on how it is spelt, without using information about nearby words. For morphological analysis of words, pymorphy2 provides the MorphAnalyzer class. If we apply clustering directly to these matrices, we run into problems: the matrices are very sparse and the computation of distances becomes a mess. What we can do is reduce the data to a dense matrix of dimension 156 by applying SVD. Singular Value Decomposition (SVD) is one of the widely used methods for dimensionality reduction; we determined that 156 is the right number in our case. We used the Silhouette score to evaluate the quality of the clusters created with k-means.
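The TF-IDF → SVD → k-means pipeline above can be sketched as follows. The documents are toy stand-ins for the real code descriptions, and n_components is kept tiny for the example (the write-up used 156 on the real data).

```python
# Sketch of the code-clustering pipeline: TF-IDF, SVD to densify,
# then k-means evaluated with the silhouette score.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline

docs = ["grocery store food", "food supermarket grocery",
        "airline ticket flight", "flight airline travel",
        "cash withdrawal atm", "atm cash machine"]

# Sparse TF-IDF matrix reduced to a small dense representation via SVD.
reducer = make_pipeline(TfidfVectorizer(),
                        TruncatedSVD(n_components=3, random_state=0))
X = reducer.fit_transform(docs)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_, round(silhouette_score(X, km.labels_), 3))
```

In practice the silhouette score would be computed over a range of cluster counts to pick the best one, as the text describes.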
Using the Silhouette score we chose the number of clusters and performed k-means clustering on our TF-IDF matrix. We then visualized the clusters with t-SNE, a tool for visualizing high-dimensional data, and added the cluster labels to the data and df data frames. Finally, we created a word cloud for each cluster. Clustering for types Data cleaning for types. First, we noticed that there were 155 types, whereas the data contains 61 types. When we merge the data with those types, the total number of types becomes 58: 3 types have no description, so we replaced them with the mode value. We also found that some types have the description 'н.д' ('no data'); there are 26 of them in the data. We also noticed that some type descriptions repeat across several types, so we dropped the duplicates and kept the first occurrence in the data. Creating clusters for types We manually divided the types into 5 categories according to some key words in their descriptions and merged them with our data frame. We then noticed outliers in recency and frequency. We computed the 0.999 and 0.001 quantiles, taking the first as the upper and the second as the lower boundary: everything above the 0.999 quantile or below the 0.001 quantile is considered an outlier. We removed the outliers for both recency and frequency, checked the data frame with describe, and concluded that everything looked normal. Supervised learning The time for prediction came. We split our data frame into train and test sets and used KNN, Decision Tree Classifier, Random Forest and Logistic Regression for the predictions. For KNN we examined the accuracy for each number of neighbors from 1 to 20 in steps of 2, on both train and test, and plotted the results. The best result was an accuracy of 58% with 19 neighbors. The Decision Tree gave 54% on the test set, and the Random Forest's accuracy was 64%. We investigated the feature importances of both and noticed that monetary had the most influence on the predictions.
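The KNN accuracy sweep described above (k = 1, 3, ..., 19) can be sketched like this, on a synthetic stand-in for the client feature table.

```python
# Sketch of the KNN neighbor sweep: fit for each odd k from 1 to 19
# and record held-out accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

In the write-up the same sweep was plotted for both train and test accuracy to check for overfitting at small k.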
For Grid Search we set the hyperparameters manually and used four-fold cross-validation. The best estimator for the random forest classifier was found via grid search; good estimators were then chosen for the random forest, and the accuracy remained the same. The best accuracy for the random forest was achieved with the default hyperparameters. We built a confusion matrix and calculated recall, precision and F1-score. We also built a logistic regression, but its accuracy was too low, so we plotted the ROC-AUC and precision-recall curves. Conclusion All the models showed that the available data was not sufficient, and in fact not well suited, for gender prediction. Actions to increase the accuracy were taken, such as adding more features and removing outliers. According to this investigation, the best choice was the random forest.
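The grid search over the random forest with four-fold cross-validation can be sketched as follows; the parameter grid is an assumed, minimal example, and the data is a synthetic stand-in for the RFM feature table.

```python
# Sketch of a four-fold grid search over a random forest classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=4, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`grid.best_estimator_` can then be used for the confusion matrix and the precision/recall/F1 computations described above.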
dssg
Speaks for the trees by providing individual feature importances from random forests.