Found 27 repositories (showing 27)
janeminmin
1> Background information Bluebikes is Metro Boston's public bike share program, with more than 1,800 bikes at over 200 stations across Boston and nearby areas. The bike sharing program launched in 2011. It offers short-term bike rentals for a fee: riders borrow a bike from one dock station and can return it at any other, which makes it ideal for one-way trips. The City of Boston is committed to providing bike share as part of the public transportation system. However, to build a transport system that encourages bicycling, it is important to understand current bicycle flows and the factors involved in the decision-making of potential bicyclists when choosing whether to ride. It is reasonable to hypothesize that age, gender, bicycle infrastructure, and safety perception are possible determinants of bicycling. In the short term, weather has been shown to play an important role in whether people choose the bicycle. 2> Data collection Bluebikes collects and provides system data to the public. The datasets used in this project can be downloaded from https://www.bluebikes.com/system-data. This time series dataset (from 2017-01-01 00:00:00 to 2019-03-31 23:00:00) includes: trip duration, start time and date, stop time and date, start station name and id, end station name and id, bike id, user type (casual or subscriber), birth year, and gender. Any trip shorter than 60 seconds is considered a potential false start and has already been removed from the datasets. The number of bicycles used during a particular period varies over time based on several factors, including current weather conditions, time of day, time of year, and riders' current interest in using the bicycle as a transport mode.
Because current interest differs between subscribers and casual users, we analyze the two groups separately. Factors such as season, day of week, month, hour, and whether the day is a holiday can be extracted from the date and time column in the datasets. Since we analyze hourly bicycle rental flow, we need hourly weather data for the same period (2017-01-01 00:00:00 to 2019-03-31 23:00:00) to complete our regression model. The weather data used in this project were scraped with Python Selenium from the Logan Airport station (42.38 °N, 71.04 °W) history pages (https://www.wunderground.com/history/daily/us/ma/boston/KBOS/date/2019-7-15) on the Weather Underground website. The hourly weather observations include time, temperature, dew point, humidity, wind, wind speed, wind gust, pressure, precipitation, accumulated precipitation, and condition. 3> The problem The project aims to gain insight into the factors that shape short-term bicycle flows in Boston. It also investigates how busy each station is, the split of trip directions and durations at a busy station, and how mean flows vary within a day or over a given period. Beyond the factors included in the regression model, other factors influence how bicycle flows vary over longer periods, for example the general tendency to use the bicycle. There is therefore potential to improve regression accuracy by incorporating a long-term trend estimate taken over the time series of bicycle usage. The results from the machine learning regression models should then be compared with those of the time series forecasting models.
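The calendar features mentioned above (season, day of week, month, hour, holiday) can be derived directly from the trip timestamp. A minimal pandas sketch, assuming a `starttime` column and using US federal holidays as a stand-in for the holiday flag:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Hypothetical trip rows; the column name `starttime` mirrors the schema above
trips = pd.DataFrame({"starttime": pd.to_datetime([
    "2017-01-01 00:15:00", "2017-07-04 08:30:00", "2018-12-25 17:45:00"])})

trips["hour"] = trips["starttime"].dt.hour
trips["day_of_week"] = trips["starttime"].dt.dayofweek      # 0 = Monday
trips["month"] = trips["starttime"].dt.month
# Meteorological seasons: Dec-Feb winter, Mar-May spring, etc.
trips["season"] = trips["month"].map(
    {12: "winter", 1: "winter", 2: "winter", 3: "spring", 4: "spring",
     5: "spring", 6: "summer", 7: "summer", 8: "summer",
     9: "fall", 10: "fall", 11: "fall"})
# US federal holidays as a proxy for the "is it a holiday" flag
holidays = USFederalHolidayCalendar().holidays(start="2017-01-01",
                                               end="2019-03-31")
trips["is_holiday"] = trips["starttime"].dt.normalize().isin(holidays)
```

These columns can then be fed, alongside the weather variables, into the per-user-type regression models.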
4> Possible solutions Data preprocessing/exploration and variable selection: date approximation, correlation analysis among variables, merging data, scrubbing duplicate data, verifying errors, interpolating missing values, handling outliers and skewness, binning low-frequency levels, and encoding categorical variables. Data visualization: split usage counts by subscriber/casual to build time series; build a heatmap showing how busy each station is and locate the busiest station in the busiest period of a busy day; use boxplots and histograms to check outliers and choose appropriate data transformations; use the weather condition text to build a word cloud. Time series trend curve estimates: two possible approaches we considered are fitting polynomials of various degrees to the data points in the time series, or using time series decomposition and forecast functions to extract and project the trend. We emphasize that the trend curve estimates should not follow the seasonal variations: the seasonal variations should be captured explicitly by the weather-related input variables in the regression model. Prediction/regression/time series forecasting: a multilayer perceptron neural network regressor could give predictions based on all date, time, and weather variables. However, for interpretability, we prefer regression models based on machine learning algorithms (such as random forest or SVM), fit separately for subscribers and casual users. The regressor is then combined with the trend curve extracted and forecast by ARIMA, and compared with the results of time series forecasting by STL (Seasonal and Trend decomposition using Loess) with multiple seasonal periods and by TBATS (Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components).
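The polynomial approach to the trend curve can be sketched with NumPy; the daily ride counts below are synthetic stand-ins, and a low polynomial degree is deliberately chosen so the fit tracks the long-term trend rather than the seasonal swings:

```python
import numpy as np

# Synthetic daily ride counts: slow growth plus a seasonal cycle and noise
rng = np.random.default_rng(0)
t = np.arange(365.0)
rides = 200 + 0.5 * t + 50 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 10, 365)

# A low-degree polynomial captures only the long-term trend; the seasonal
# variation is left for the weather covariates in the regression model
coeffs = np.polyfit(t, rides, deg=2)
trend = np.polyval(coeffs, t)
detrended = rides - trend          # input to the regression stage
```

The same `detrended` series is what the per-user-type regressors would model, with the ARIMA-forecast trend added back for final predictions.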
daksh26022002
Fake News Detection Fake News Detection in Python In this project, we have used various natural language processing techniques and machine learning algorithms to classify fake news articles, using scikit-learn libraries in Python. Getting Started These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system. Prerequisites What you need to install and how to install it: Python 3.6 This setup requires that your machine has Python 3.6 installed. You can refer to https://www.python.org/downloads/ to download Python. Once you have Python downloaded and installed, you will need to set up the PATH variable (if you want to run Python programs directly; detailed instructions are below in the how-to-run section). To do that, check this: https://www.pythoncentral.io/add-python-to-path-python-is-not-recognized-as-an-internal-or-external-command/. Setting up the PATH variable is optional, as you can also run the program without it; more instructions on this topic are given below. The second and easier option is to download Anaconda and use its Anaconda Prompt to run the commands. To install Anaconda, check this url: https://www.anaconda.com/download/ You will also need to download and install the three packages below after you install either Python or Anaconda from the steps above: Sklearn (scikit-learn) numpy scipy If you have chosen to install Python 3.6, run the commands below in a command prompt/terminal to install these packages: pip install -U scikit-learn pip install numpy pip install scipy If you have chosen to install Anaconda, run the commands below in the Anaconda Prompt: conda install -c anaconda scikit-learn conda install -c anaconda numpy conda install -c anaconda scipy Dataset used The data source used for this project is the LIAR dataset, which contains three .tsv files for test, train, and validation.
Below is some description of the data files used in this project. LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL. The original dataset contains 14 variables/columns for the train, test, and validation sets, as follows: Column 1: the ID of the statement ([ID].json). Column 2: the label (label classes: True, Mostly-true, Half-true, Barely-true, False, Pants-fire). Column 3: the statement. Column 4: the subject(s). Column 5: the speaker. Column 6: the speaker's job title. Column 7: the state info. Column 8: the party affiliation. Columns 9-13: the total credit history count, including the current statement: 9: barely true counts, 10: false counts, 11: half true counts, 12: mostly true counts, 13: pants on fire counts. Column 14: the context (venue/location of the speech or statement). To keep things simple, we have chosen only two variables from the original dataset for this classification; the other variables can be added later to increase complexity and enhance the features. Below are the columns used to create the three datasets used in this project: Column 1: Statement (news headline or text). Column 2: Label (label classes: True, False). The newly created dataset has only two classes, compared to six in the original. Below is the method used to reduce the number of classes: True -- True, Mostly-true -- True, Half-true -- True, Barely-true -- False, False -- False, Pants-fire -- False. The datasets used for this project are in CSV format, named train.csv, test.csv, and valid.csv, and can be found in the repo. The original datasets are in the "liar" folder in tsv format.
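The six-to-two label collapse described above is a simple dictionary mapping; a minimal sketch with pandas (label strings lower-cased for robustness, toy labels rather than the real files):

```python
import pandas as pd

# Mapping from the six LIAR labels to the binary scheme described above
six_to_two = {
    "true": "True", "mostly-true": "True", "half-true": "True",
    "barely-true": "False", "false": "False", "pants-fire": "False",
}

labels = pd.Series(["True", "Pants-fire", "Half-true", "Barely-true"])
binary = labels.str.lower().map(six_to_two)
```

Applied to the label column of each .tsv file, this produces the two-class train/test/valid CSVs used in the project.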
File descriptions DataPrep.py This file contains all the preprocessing functions needed to process the input documents and texts. First we read the train, test, and validation data files, then perform preprocessing such as tokenizing and stemming. Some exploratory data analysis is also performed, such as checking the response variable distribution and data quality (null or missing values, etc.). FeatureSelection.py In this file we perform feature extraction and selection using the scikit-learn Python libraries. For feature selection, we use methods such as simple bag-of-words and n-grams, and then term-frequency weighting such as tf-idf. We have also used word2vec and POS tagging to extract the features, though POS
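The bag-of-words, n-gram, and tf-idf features mentioned above come straight from scikit-learn; a minimal sketch on toy documents (not the LIAR data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the senator voted for the bill", "the bill was vetoed"]

# Bag-of-words counts over unigrams and bigrams
bow = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)
# Tf-idf weighting over the same vocabulary settings
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
```

Either sparse matrix can be passed directly to a scikit-learn classifier.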
cran
:exclamation: This is a read-only mirror of the CRAN R package repository. VarSelLCM — Variable Selection for Model-Based Clustering of Mixed-Type Data Set with Missing Values. Homepage: http://varsellcm.r-forge.r-project.org/
Given the Ames Housing dataset, the project started with an exploratory data analysis (EDA) to identify missing values, suspicious data, and redundant variables. I then performed mixed stepwise selection to reduce the set of variables and select the best model based on AIC, BIC, and adjusted R-squared. With the best model selected, the model assumptions were checked for normality, homoscedasticity, collinearity, and linearity between response and predictors. Several solutions were proposed to address the assumption violations. The model was then tested on unseen data and scored on Root Mean Squared Error (RMSE).
K-Nearest-Neighbors-with-Python This repository contains projects related to the KNN algorithm using Python. Introduction: K Nearest Neighbors - Classification K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the early 1970s. KNN uses the following distance functions: 1) Euclidean distance: the square root of the sum of the squared differences between a new point (x) and an existing point (xi) across all input attributes j. EuclideanDistance(x, xi) = sqrt( sum_j( (x_j - xi_j)^2 ) ) 2) Manhattan distance: the sum of the absolute differences between real vectors. Also called City Block distance. 3) Minkowski distance: a generalization of Euclidean and Manhattan distance. Different disciplines in KNN: 1) Instance-Based Learning: the raw training instances are used to make predictions. As such, KNN is often referred to as instance-based or case-based learning (where each training instance is a case from the problem domain). 2) Lazy Learning: no model fitting is required and all of the work happens at the time a prediction is requested. As such, KNN is often referred to as a lazy learning algorithm. 3) Non-Parametric: KNN makes no assumptions about the functional form of the problem being solved. As such, KNN is referred to as a non-parametric machine learning algorithm. How to Prepare Data for KNN: 1) Rescale Data: KNN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution. 2) Address Missing Data: missing data means the distance between samples cannot be calculated. Such samples can be excluded or the missing values imputed.
3) Lower Dimensionality: KNN is suited to lower-dimensional data. You can try it on high-dimensional data (hundreds or thousands of input variables), but be aware that it may not perform as well as other techniques. KNN can benefit from feature selection that reduces the dimensionality of the input feature space. Source: https://tinyurl.com/y8fh9fgn
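The three distance functions above can be written in a few lines of NumPy; note that Minkowski reduces to Manhattan at p = 1 and to Euclidean at p = 2:

```python
import numpy as np

def euclidean(x, xi):
    # Square root of the sum of squared differences across attributes
    return np.sqrt(np.sum((x - xi) ** 2))

def manhattan(x, xi):
    # Sum of absolute differences ("city block" distance)
    return np.sum(np.abs(x - xi))

def minkowski(x, xi, p):
    # p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(x - xi) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
```

For the 3-4-5 example above, `euclidean(a, b)` gives 5.0 and `manhattan(a, b)` gives 7.0, which is why rescaling matters: an attribute on a larger scale dominates all three measures.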
gohminghui88
JAETL - Just Another ETL tool is a tiny and fast ETL tool for developing data warehouses. JAETL lets you Extract data from ARFF (Weka), CSV, and SQL sources; Transform the data with joins, missing-value replacement, duplicate removal, mapping, filtering, and variable selection; and Load the data into a SQL server or export it to CSV and ARFF.
sujaikarunakaran96
Used a dataset produced by VMware to create a propensity-to-respond model. The dataset was initially imbalanced, with 707 features and 50,006 rows, so the Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the data. Preprocessing was required because the data contained a lot of missing values. To address this, variables with more than 60% missing values were removed; for the remaining variables, mean and mode imputation was used. This brought the number of features down from 707 to 653. With 653 variables remaining, feature selection was needed to reduce the dimensionality. L1-regularized logistic regression (LASSO) was used to select important variables, reducing the count to 90. After variable selection, Random Forest, XGBoost, LASSO regression, Ridge regression, and stacking were applied to the dataset to find a suitable model. With the imbalance addressed by SMOTE, Random Forest yielded 98.9990% test accuracy, XGBoost 99.9987%, and LASSO 99.9845%. Because accuracy can be misleading when any imbalance remains, a macro-averaged metric was used instead. On macro-averaged F-score, Random Forest and LASSO regression both yielded 0.87, whereas XGBoost yielded 1. It was therefore concluded that XGBoost was the best model for this dataset, and the resulting propensity-to-respond model can now be used for B2B marketing.
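L1-regularized logistic regression for feature selection, as described above, can be sketched with scikit-learn; the data here are a small synthetic stand-in for the 653-feature matrix:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 653-feature matrix described above
X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives coefficients of uninformative features to exactly zero
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
keep = np.abs(lasso.coef_[0]) > 1e-8        # features that survived
X_selected = X[:, keep]
```

Tightening `C` (stronger regularization) shrinks more coefficients to zero and keeps fewer features, analogous to the 653-to-90 reduction described above.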
Mohsenselseleh
Admissions.csv simulates administrative data where each row represents a unique admission to a hospital. Lab.csv simulates results for patients who had laboratory testing (e.g. blood counts) during their admission. Transfusions.csv simulates information on patients who underwent a blood transfusion during their admission.
1. Impute the missing charlson_comorbidity_index values in any way you see fit, with the intention that this variable will be used as a predictor in a statistical model.
2. Determine whether there is a significant difference in sex between patients who had an rbc_transfusion and patients who did not.
3. Fit a linear regression model using the result_value of the "Platelet Count" lab tests as the dependent variable and age, sex, and hospital as the independent variables. Briefly interpret the results.
4. Create one or more plots that demonstrate the relationships between length_of_stay (discharge date and time minus admission date and time), charlson_comorbidity_index, and age.
5. You are interested in evaluating the effect of platelet transfusions on a disease. The patients with platelet_transfusion represent the selected treatment group. Select a control group in any way you see fit. How could you improve your selection if you had more data and access to any clinical variable you can think of?
6. Fit a first-iteration statistical model of your choosing to predict the result_value of the "Hemoglobin" lab tests and evaluate its performance. How could you improve the model if you had more data and access to any clinical variable you can think of?
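For task 3, a minimal regression sketch with scikit-learn on synthetic rows (the column names mirror the task description, but the values and the simulated effect of age are assumptions, not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical rows standing in for the merged Admissions + Lab tables
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 90, 200),
    "sex": rng.choice(["F", "M"], 200),
    "hospital": rng.choice(["A", "B", "C"], 200),
})
# Simulated outcome: platelet count declines with age (assumption for the demo)
df["platelet_count"] = 250 - 0.8 * df["age"] + rng.normal(0, 20, 200)

# One-hot encode categoricals, dropping one level each to avoid collinearity
X = pd.get_dummies(df[["age", "sex", "hospital"]], drop_first=True)
model = LinearRegression().fit(X, df["platelet_count"])
age_effect = model.coef_[list(X.columns).index("age")]
```

Each coefficient is then interpreted as the expected change in platelet count per unit change in that predictor, holding the others fixed.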
No description available
No description available
Yifan22Oct
Variable selection for high dimensional generalized linear model with block-missing data
GunnHJ
Code and data used for "How to Apply Variable Selection Machine Learning Algorithms with Multiply Imputed Data: A Missing Discussion"
ragu8
"DataPrep-Toolkit" streamlines ML data cleaning/preprocessing with Python. Covers common issues, missing data, outliers, scaling, feature selection, categorical variables, and imbalanced data.
juanda3005
Complete analysis of a prediction model using regression in Python ✓ Data wrangling (dealing with missing data, binning, groupby, dummy variables) ✓ Exploratory data analysis ✓ Selection of variables ✓ Creation and evaluation of the best prediction model using regression (linear, polynomial, and nonlinear regression)
KUSHAGRARAJTIWARI
Data Science Project - a regression problem under supervised machine learning, using RMSE for model evaluation. The training data consist of 55M observations with four location variables, the number of passengers, and fare_amount (the target variable), with missing values. Exploratory data analysis includes missing value analysis, outlier analysis, and visualization of predictor variables against the target variable. Feature engineering includes feature addition and selection. Finally, LightGBM reaches an RMSE of $3.92.
BehlimAmaan
This repository focuses on feature engineering techniques used in machine learning. It covers data cleaning, handling missing values, encoding categorical variables, feature scaling, transformation, and feature selection with practical examples and implementations.
harsh9104
Predicting car prices using Linear Regression and Lasso Regression models to provide accurate, data-driven valuations with optimized feature selection, minimal overfitting, and enhanced predictive performance. Data preprocessing includes handling missing values and encoding categorical variables.
amiTanmayNath
• Carried out exploratory data analysis and data preprocessing to handle missing values and redundant variables • Performed forward feature selection, optimizing regression with key features for an effective and precise model fit • Attained adjusted R2 value of 0.9109 with Linear Regression and 0.7091 with K-Nearest Neighbors Algorithm
Advanced predictive analytics repository featuring two projects: one optimizing disease spread control through strategic variable selection and minimized false negatives, and another reconstructing missing sensor data via time-series analysis and feature engineering. Robust modeling with Python & scikit-learn.
Wizard-hash2
This repo contains a preprocessing and feature-engineering pipeline using pandas and scikit-learn: handling missing data, encoding categorical variables, partitioning datasets, scaling features, and analyzing regularization effects in logistic regression. It implements sequential backward selection with k-NN for subset optimization.
cran
:exclamation: This is a read-only mirror of the CRAN R package repository. flevr — Flexible, Ensemble-Based Variable Selection with Potentially Missing Data. Homepage: https://github.com/bdwilliamson/flevr Report bugs for this package: https://github.com/bdwilliamson/flevr/issues
Shamsavvasher
This project analyzes the Titanic dataset to predict passenger survival using machine learning techniques. It involves data exploration, preprocessing (handling missing values, outlier treatment, and encoding categorical variables), feature selection, model building with logistic regression, and evaluation of model performance.
Developed a predictive model to classify movies as successful or unsuccessful using Kaggle’s movie dataset. Applied Random Forest and Logistic Regression, achieving 84% accuracy with hyperparameter tuning and feature selection. Preprocessed data by handling missing values, encoding categorical variables, and performing feature scaling.
Hush-Baby-Hush
Techniques of machine learning applied to various signal problems: regression, including linear regression, multiple regression, regression forests, and nearest neighbors regression; classification with various methods, including logistic regression, support vector machines, nearest neighbors, simple boosting, and decision forests; clustering with various methods, including basic agglomerative clustering and k-means; resampling methods, including cross-validation and the bootstrap; model selection methods, including AIC, stepwise selection, and the lasso; hidden Markov models; model estimation in the presence of missing variables; and neural networks, including deep networks. The course focuses on tool-oriented and problem-oriented exposition. Application areas include computer vision, natural language, interpreting accelerometer data, and understanding audio data.
13Akarsht
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses. The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill up a form for a course, or watch some videos. When they fill up a form providing their email address or phone number, they are classified as a lead. The company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process some of the leads convert, while most do not. The typical lead conversion rate at X Education is around 30%. Although X Education gets a lot of leads, its conversion rate is poor: if they acquire 100 leads in a day, only about 30 of them convert. To make this process more efficient, the company wishes to identify the most promising leads, also known as 'Hot Leads'. If they successfully identify this set of leads, the conversion rate should go up, as the sales team will focus on communicating with the promising leads rather than calling everyone. Many leads are generated at the initial stage (top of the funnel), but only a few come out as paying customers at the bottom. In the middle stage, the promising leads need to be nurtured well (i.e. educating the leads about the product, communicating constantly, etc.) in order to get a higher conversion rate. Our goal is to analyze the data and build a logistic regression model that predicts, using past data, which leads will convert.
Steps for the logistic regression model:
Collecting and reading the data.
Data treatment: filling missing values with NaN or an adequate value; dropping columns with a large number of null/missing values; dropping columns with only one value.
Exploratory data analysis: univariate and multivariate analysis.
Data preparation: forming adequate dummy variables for model building; splitting the data into train and test sets; scaling the numerical variables with a standard scaler.
Model building: summarizing the data; feature selection using RFE; assessing the model with statsmodels; obtaining an adequate model using the Variance Inflation Factor (VIF).
Creating predictions.
Model evaluation: checking precision and recall values; plotting the ROC curve; finding the optimal cutoff point.
Making predictions on the test data set.
Conclusion.
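The split-scale-RFE-fit core of the steps above can be sketched with scikit-learn; synthetic, mildly imbalanced data stand in for the leads dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the leads data (~70/30 class split)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.7],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
# Scale numeric features using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# RFE iteratively drops the weakest predictors, keeping the strongest 10
rfe = RFE(LogisticRegression(max_iter=1000),
          n_features_to_select=10).fit(X_train_s, y_train)
clf = LogisticRegression(max_iter=1000).fit(X_train_s[:, rfe.support_], y_train)
accuracy = clf.score(X_test_s[:, rfe.support_], y_test)
```

From here, `clf.predict_proba` gives the lead-conversion probabilities that the ROC-curve and optimal-cutoff steps operate on.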
The essence of this project surrounds a paper, "Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene" (Caspi et al. 2003), in which the authors use regression techniques to model depression based on certain factors. The paper discusses the development of depression and the role of a person's genetic makeup, along with the stressful situations a person has endured throughout their life, as possible factors. It focuses on how a person deals with stressful situations in life and relates this to their genetic makeup, to test the relationship between these two things and the onset of depression. The authors found interactions between genetic and environmental variables in their regression analysis. These gene-environment interactions have been scrutinized by other researchers and considered an error by Caspi et al. In the paper "Interaction Between the Serotonin Transporter Gene (5-HTTLPR), Stressful Life Events, and Risk of Depression: A Meta-analysis" (Risch et al. 2009), the researchers dispute the claim of gene-environment interactions and conclude that there is only a correlation between the stressful events in one's life and the development of depression. My conclusion is that, based on the given data, only environmental factors were significant in causing depression in individuals. My methods included data cleaning and merging, advanced imputation of missing values using the EM algorithm, multiple regression, and stepwise selection in both directions. The referenced papers can be found below: https://www.researchgate.net/publication/10654279_Influence_of_Life_Stress_on_Depression_Moderation_by_a_Polymorphism_in_the_5-HTT_Gene https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938776/.
Tirth8038
Expectation Maximization (EM) is an iterative algorithm for computing maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates of parameters. It estimates the missing values in the dataset, given the general form of the probability distribution associated with the latent variables, and then uses that data to update the parameter values in the Maximization step. In this task, we do not know whether Coin A or Coin B was flipped for each set of 30 flips. The coin is therefore not observed and can be treated as a hidden or latent variable. Initialization Step: we initialize random biases for the coins, which gives an initial estimate of which coin was chosen in each trial. Expectation Step: given the current estimates of the coin biases, we determine the probability of the observed heads and tails using conditional probability (Bayes' theorem). One approach would be to see which coin's bias better matches the flips and assign all flips in the trial to that coin. For example, if we see 11110011001110001111 and our current assumed biases for Coin A and Coin B are 0.4 and 0.7 respectively, we would simply assign the trial (13 H and 7 T) to Coin B. But this breaks down when the cases are not obvious, such as when the coins have nearly equal chances of heads and tails. Hence, we instead estimate the probability that each coin generated the flips in the trial, and use those probabilities to assign fractional H and T counts to each coin. The result is an expected count of H and T attributed to Coin A and Coin B for each trial. Maximization Step: given the values of the latent variables computed in the Expectation step, we estimate new theta values for both coins that maximize the expected likelihood. Each theta is the responsibility-weighted number of heads attributed to that coin, divided by the total number of flips (heads and tails) attributed to it.
Convergence Step: after obtaining the theta values for both coins, we repeat the Expectation and Maximization steps until the thetas converge (EM is guaranteed to reach a local maximum of the likelihood, not necessarily the global one). Here, we check whether the thetas are still changing: if not, we stop; otherwise we repeat the Expectation and Maximization steps. For this, I use a dynamic number of iterations, comparing the current theta values with those of the previous two iterations. If they remain constant, I break the loop and store the last theta values as the optimal values.
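The full E/M loop for the two-coin problem can be sketched in NumPy. The head counts below are the classic textbook example (5 trials of 10 flips), not this task's data, and the binomial coefficient is omitted because it cancels in the E-step ratio:

```python
import numpy as np

# Heads observed in each trial of n flips; which coin was used is hidden
heads = np.array([5, 9, 8, 4, 7])
n = 10

theta_a, theta_b = 0.6, 0.5   # initial guesses for P(heads) of each coin
for _ in range(200):
    # E-step: posterior responsibility that each trial came from Coin A
    # (binomial coefficient cancels in the ratio, so it is left out)
    like_a = theta_a ** heads * (1 - theta_a) ** (n - heads)
    like_b = theta_b ** heads * (1 - theta_b) ** (n - heads)
    resp_a = like_a / (like_a + like_b)
    # M-step: re-estimate each bias from responsibility-weighted head counts
    new_a = (resp_a * heads).sum() / (resp_a * n).sum()
    new_b = ((1 - resp_a) * heads).sum() / ((1 - resp_a) * n).sum()
    converged = abs(new_a - theta_a) < 1e-9 and abs(new_b - theta_b) < 1e-9
    theta_a, theta_b = new_a, new_b
    if converged:
        break
```

On this example data the loop settles near theta_a ≈ 0.80 and theta_b ≈ 0.52, a local maximum of the likelihood.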
All 27 repositories loaded