Found 814 repositories (showing 30)
PacktPublishing
Machine Learning for Imbalanced Data, published by Packt
solegalli
Code repository for the online course Machine Learning with Imbalanced Data
ulookme
Use machine learning to classify malware. Malware analysis 101. Set up a cybersecurity lab environment. Learn how to tackle data class imbalance. Unsupervised anomaly detection. End-to-end deep neural networks for malware classification. Create a machine learning Intrusion Detection System (IDS). Employ machine learning for offensive security. Learn how to address False Positive constraints. Break a CAPTCHA system using machine learning.
Learning to predict failures is an important step toward improving the reliability of cloud computing systems: it gives operators the ability to avoid failure incidents and the cost overheads they cause. Breakthroughs in machine learning, combined with the huge volumes of data generated by cloud storage, create an opportunity to predict when a system or piece of hardware will malfunction or fail, and statistical analysis of workload data from cloud providers can yield insights that improve system reliability. This research studies job usage data from the large “Google Cluster Workload Traces 2019” dataset, using multiple resampling techniques (Random Undersampling, Random Oversampling, and the Synthetic Minority Oversampling Technique) to handle the imbalanced dataset. Job failure is then predicted on both the imbalanced and the balanced data using multiple machine learning algorithms: traditional algorithms (Logistic Regression, Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, and Extreme Gradient Boosting Classifier) and deep learning algorithms (Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)). The imbalanced and balanced settings are compared in terms of model accuracy, error rate, sensitivity, F-measure, and precision. The results show that the Extreme Gradient Boosting Classifier and the Gradient Boosting Classifier are the best-performing algorithms both with and without imbalance-handling techniques, and that SMOTE is the best method for handling the imbalanced data. The deep learning models (LSTM and GRU) may not be the best in terms of accuracy, but based on the ROC curve they perform better than the XGBoost and Gradient Boosting Classifiers.
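The resampling techniques named above can be illustrated with a stdlib-only sketch of random oversampling, the simplest of the three; the study presumably used a library such as imbalanced-learn, so treat this as a minimal illustration of the idea, not the paper's code:

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority-class rows at random (with replacement) until
    both classes have the same number of samples.
    Stdlib-only sketch of Random Oversampling, for illustration only.
    """
    rng = random.Random(seed)
    minority = [(x, lbl) for x, lbl in zip(X, y) if lbl == minority_label]
    majority = [(x, lbl) for x, lbl in zip(X, y) if lbl != minority_label]
    # Sample extra minority rows up to the majority-class count.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    X_bal, y_bal = zip(*balanced)
    return list(X_bal), list(y_bal)

# Toy job data: 8 successful jobs vs 2 failed jobs.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))  # 8 8
```

Random undersampling is the mirror image (drop majority rows instead of duplicating minority ones), while SMOTE interpolates new synthetic minority samples rather than duplicating existing ones.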
dataprofessor
No description available
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Content: The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred over two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise. Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification. Update (03/05/2021): A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book.
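The recommendation to use AUPRC rather than accuracy can be motivated with a few lines of arithmetic: a degenerate classifier that never flags fraud is almost always "correct" on this dataset while catching nothing.

```python
# Why plain accuracy is misleading at a 0.172% fraud rate.
total, frauds = 284_807, 492

# A trivial "classifier" that labels every transaction as legitimate
# is right on all 284,315 legitimate transactions...
accuracy = (total - frauds) / total
# ...yet it detects none of the 492 frauds.
recall = 0 / frauds

print(f"accuracy={accuracy:.4f}, recall={recall:.1f}")  # accuracy=0.9983, recall=0.0
```

Precision-recall-based metrics expose this failure immediately, which is why AUPRC is the recommended yardstick here.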
Acknowledgements: The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the DefeatFraud project. Please cite the following works:
Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014, Pergamon.
Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018, IEEE.
Dal Pozzolo, Andrea. Adaptive machine learning for credit card fraud detection. ULB MLG PhD thesis (supervised by G. Bontempi).
Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. SCARFF: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018, Elsevier.
Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018, Springer International Publishing.
Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-learning domain adaptation techniques for credit card fraud detection. INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp. 78-88, 2019.
Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi. Combining unsupervised and supervised learning in credit card fraud detection. Information Sciences, 2019.
Yann-Aël Le Borgne, Gianluca Bontempi. Machine Learning for Credit Card Fraud Detection - Practical Handbook.
Aayushi-2808
# Cervical_cancer_detection_using_ML
# Introduction
According to the World Health Organisation (WHO), when detected at an early stage, cervical cancer is one of the most curable cancers. Hence, the main motive behind this project is to detect the cancer in its early stages so that it can be treated and managed effectively in patients.
# Flow of project is as explained below:
This project is divided into 5 parts: 1. Data Cleaning 2. Exploratory Data Analysis 3. Baseline model: Logistic Regression 4. Ensemble Models: Bagging with Decision Trees, Random Forest and Boosting 5. Model Comparison and Results
# Refer below for References:
Link to basic information regarding cervical cancer: https://www.cdc.gov/cancer/cervical/basic_info/index.htm
The dataset for tackling the problem is supplied by the UCI Machine Learning Repository. Link to dataset: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29
The dataset contains a list of risk factors that lead up to the biopsy examination. The generation of the predictor variable is handled in part 2 (Exploratory Data Analysis) of this report. We will try to predict the 'Biopsy' variable from the dataset using Logistic Regression, Random Forest, Bagging with Decision Trees and Boosting with the XGBoost Classifier.
# Results:
Based on our base model and the ensemble models we used, we observed: 1. After the entire process of training, hyperparameter tuning and tackling class imbalance was complete, we obtained the results depicted in the graphics. 2. Bagging and Random Forest give the highest accuracy and precision, at 97.09% and 80% respectively. 3. Plotting the confusion matrix showed that Random Forest using upsampling and class weights gives 2 false positives and 3 false negatives, with an AUC of 0.87.
# Why is Random Forest the best model?
1. Comparing all of our models, Random Forest has the maximum F1 score and accuracy (76.2% and 97.09% respectively), tied with Bagging. 2. It also produces the same number of false negatives, with a recall of 72.73%, just like all the other models. 3. We still consider Random Forest better because of its added advantage that its decision trees are decorrelated compared to Bagging, leading to lower variance and a greater ability to generalize.
# Conclusion:
Observing the feature importance of the best model (Random Forest), we can see that the most important features are Schiller, Hinselmann, HPV, Citology, etc. This makes sense because Schiller and Hinselmann are actually tests used to detect cervical cancer.
# Problems Faced:
A major problem encountered while training the model was that there was too little data to train on. By collaborating with hospitals across India, we could gather enough data points to train a model with a higher recall, making the model better.
# Scope of Improvement
As next steps I would want to do exactly that: deploy the model and refine it. We may also modify the set of predictor variables, as it may well turn out that there are other relevant predictors not present in our current dataset. This can only be found through practical implementation of our predictions.
This project explores UK road accident data with the goal of predicting accident severity using machine learning. Techniques include clustering (KMeans, DBSCAN), association rule mining (Apriori), and classification models (Random Forest, Decision Tree, Gradient Boosting) enhanced by SMOTE for class imbalance handling.
This project develops a machine learning model to predict cancer risk levels (High, Medium, Low) based on demographic, behavioral, and health data. It addresses class imbalance using techniques like SMOTE and optimizes model performance with hyperparameter tuning, providing crucial insights for early detection and intervention.
Epilepsy is a neurological disorder of the human brain, characterized by chronic, randomly occurring seizures that interrupt the normal function of the brain. The diagnosis and analysis of epileptic seizures is made with the help of electroencephalography (EEG). Detecting seizures involves the interpretation of long EEG records by expert physicians, which is time-consuming and requires substantial human effort. Thus, this study aims to construct an automatic seizure detection system to analyze epileptic EEG signals. The CHB-MIT Scalp EEG recordings of patients are used in this work for experimental purposes. The Welch Fast Fourier Transform is used to convert time-domain features to the frequency domain, and statistical features are extracted in both the time domain and the frequency domain. ANOVA-based feature selection is used to reduce the number of variables. The Random Under-sampling (RUS) and Synthetic Minority Oversampling Technique (SMOTE) methods are used to solve the data imbalance problem. Eight machine learning algorithms, namely the decision tree classifier (DTC), extra-decision tree classifier (EDTC), Linear Discriminant Analysis Classifier (LDAC), Quadratic Discriminant Classifier (QDC), Random Forest Classifier (RFC), Gradient Boosting Classifier (GBC), Multi-layer Perceptron Classifier (MLPC), and Stochastic Gradient Descent Classifier (SGDC), are used to classify the data. As a result, the performance of the proposed classifier is 99.48% accuracy, 99.79% sensitivity, and 99.17% specificity. The system could be a helpful tool for doctors to make a more reliable and objective analysis of patient EEG records.
ElahehJafarigol
Imbalanced Learning with Parametric Linear Programming Support Vector Machine for Weather Data Application
mserra0
A machine learning project addressing credit card fraud detection using imbalanced datasets. Utilizes techniques like cost-sensitive learning, SMOTE, and ensemble models for high precision and accuracy, emphasizing robust performance despite challenging data distributions.
maedemadani
Machine Learning pipeline for classifying network activities into security categories (allow/deny/drop/reset). Includes data cleaning, feature engineering, imbalance handling, and model deployment ready for production.
deypadma
Stroke is the second leading cause of death worldwide and remains an important health burden both for individuals and for national healthcare systems. Potentially modifiable risk factors for stroke include hypertension, cardiac disease, diabetes, dysregulation of glucose metabolism, atrial fibrillation, and lifestyle factors. Therefore, the goal of our project is to apply principles of machine learning to large existing datasets to effectively predict stroke based on potentially modifiable risk factors. We then intend to develop an application that provides a personalized warning based on each user's level of stroke risk, along with a lifestyle-correction message about the stroke risk factors. In this article, we discuss the symptoms and causes of a stroke, as well as a machine learning model that predicts the likelihood of a patient having a stroke based on age, BMI, and glucose level for a group of patients. To proceed with the implementation, different datasets from Kaggle were considered, and an appropriate dataset was selected for model building. After collecting the dataset, the next step is preparing the data so that it is cleaner and more easily understood by the machine. This step is called data pre-processing, and here it includes handling missing values, handling imbalanced data and performing label encoding specific to this particular dataset. Once the data is pre-processed, it is ready for model building, which requires the pre-processed dataset along with machine learning algorithms. Logistic Regression, the Decision Tree Classification algorithm, the Random Forest Classification algorithm, the K-Nearest Neighbour algorithm, Support Vector Classification, K-Means Clustering and the Naïve Bayes Classification algorithm are used. After building the seven models, they are compared using four metrics, namely Accuracy Score, Precision Score, Recall Score, and F1 Score.
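The four comparison metrics named above all derive from the confusion-matrix counts; a minimal sketch with hypothetical counts for a 100-patient test set:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four standard metrics from confusion-matrix counts:
    true/false positives (tp/fp) and false/true negatives (fn/tn).
    Stdlib-only sketch of the textbook definitions.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of flagged patients, how many had a stroke
    recall = tp / (tp + fn)      # of stroke patients, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a stroke classifier evaluated on 100 patients.
acc, prec, rec, f1 = classification_metrics(tp=8, fp=4, fn=2, tn=86)
```

On imbalanced data like stroke records, the accuracy here (0.94) looks strong while precision (about 0.67) reveals how often the model's positive calls are wrong, which is why all four metrics are compared rather than accuracy alone.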
hananahmed1
This repository contains the application data and code for an article about regression on imbalanced data in machine learning.
amrutdeshpande
1. Created an automated machine-failure prediction solution that monitors daily incoming data alongside historical data and predicts the chance of failure 30 days in advance, enabling the respective stakeholders to take the necessary action to minimize losses. 2. Extracted the time-series data from MS SQL Server by establishing a link using Python, cleaned the data for discrepancies (outliers, class imbalance, etc.), performed EDA and defined baseline metrics. 3. Built and trained machine learning models such as LSTM, RNN and Decision Trees, and optimized their parameters to increase their accuracy and execution efficiency. The prediction result is then pushed into a UI built on Visual Studio and a Tableau dashboard for easy consumption by the stakeholders.
Johnnywang1899
imbalanced-learn library; supervised learning; Scikit-learn machine learning library for Python (sklearn); supervised learning with linear models (linear regression & logistic regression); dataset split into training & testing sets; accuracy/precision/sensitivity (recall)/F1 score; confusion matrix; SVM (Support Vector Machine: support vectors, hyperplane); data preprocessing: labelling (encoding: convert all text columns to numeric labels), data scaling & normalization (StandardScaler: mean = 0, variance = 1); decision trees; ensemble learning: Random Forest (weak/moderate/strong learners), bootstrap aggregation, boosting (Adaptive Boosting (AdaBoost), Gradient Boosting); class imbalance (solution 1: oversampling (random oversampling, Synthetic Minority Oversampling Technique (SMOTE)); solution 2: undersampling (random undersampling, Cluster Centroid undersampling); solution 3: SMOTEENN).
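The core interpolation idea behind SMOTE, mentioned in the notes above, can be sketched in plain Python; this illustrates the idea only and is not imbalanced-learn's implementation:

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create synthetic minority samples by interpolating between a real
    sample and one of its k nearest minority-class neighbours.
    Stdlib-only sketch of SMOTE's core idea, for illustration.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point along the segment from x to nb
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Four minority samples at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=3)
# Every synthetic point lies between two real minority samples.
```

Because each synthetic point is a convex combination of two real minority samples, SMOTE fills the minority region with plausible new examples rather than duplicating existing ones as random oversampling does.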
Human comfort datasets are widely used in multiple smart-building scenarios. From thermal comfort prediction to personalized indoor environments, labelled subjective responses from participants in an experiment are required to feed different machine learning models. However, many of these datasets are small in samples per participant or number of participants, or suffer from class imbalance in their subjective responses. In this work we explore the use of Generative Adversarial Networks to generate synthetic samples to be used in combination with real ones for data-driven applications in the built environment.
Week 1 Report. Here is a quick summary of what I learned in my first week of training under ParrotAi.
Introduction to Machine Learning: I got a good introduction to machine learning, including the history of ML and its types, such as supervised, unsupervised and reinforcement learning, as well as answers to questions like "why machine learning?" and the challenges facing machine learning, which include insufficient data, irrelevant data, overfitting and underfitting, along with their general solutions.
Supervised machine learning algorithms: here I learnt the theory and intuition behind the commonly used supervised ML algorithms, including KNN, linear regression, logistic regression, and the Random Forest ensemble algorithm. Beyond the intuition, I also learnt their implementation in Python using the sklearn library and how to tune their parameters to achieve the best model (that is, how to regularize the model to avoid overfitting and underfitting), as well as where to apply each algorithm depending on the problem, i.e. classification or regression, and which models perform better or worse under which circumstances.
Data preprocessing and representation: here I learnt the importance of preprocessing the data and the techniques involved, such as scaling (including StandardScaler, RobustScaler and MinMaxScaler) and handling missing data, either by dropping it (which is not recommended, since one could lose important patterns in the data) or by imputing the mean or median of the data at the missing places. Data representation covered how categorical features can be represented so that they can be used by an algorithm; the method learnt here was One-Hot Encoding and its implementation in Python using both the Pandas and sklearn libraries.
Model evaluation and improvement: in this section I grasped how to evaluate whether a model is performing well or badly and the ways to improve it. Since a single train_test_split can produce an unrepresentative split, cross-validation techniques help, including K-fold, Stratified K-fold and other strategies such as LeaveOneOut, which improve your model by splitting the data in a way that helps it generalize well to unseen data. I also learnt the GridSearch technique for choosing the best parameters for the model to improve its performance, both simple grid search and GridSearch with cross-validation, and I was able to implement all of this in code using the sklearn library in Python.
Lastly, the week's challenge task was tremendous, since I got to apply what I had learned in theory to solve a real problem. It was good to apply the workflow of a machine learning task, starting from understanding the problem, getting to know the data, data preprocessing and visualising the data to get more insights, through model selection and training, to applying the model to make predictions.
In general, I was able to grasp and learn a lot this week, from the basic foundations of machine learning to the implementation of the algorithms in code. The greatest achievement so far is the intuition behind the algorithms, especially the supervised ones. Though much remains to be covered, what I have attained so far is a good start to this journey into machine learning. My expectation for the coming week is to build a solid foundation in deep learning.
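The stratified splitting idea mentioned in that report can be sketched with stdlib Python alone; in practice scikit-learn's StratifiedKFold does this, with shuffling and edge-case handling:

```python
from collections import defaultdict

def stratified_folds(y, k):
    """Assign sample indices to k folds so each fold keeps roughly the
    same class proportions; stdlib-only sketch of the idea behind
    scikit-learn's StratifiedKFold.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin across the folds.
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# 9 majority and 3 minority labels; every fold ends up with 3 + 1.
y = [0] * 9 + [1] * 3
folds = stratified_folds(y, k=3)
```

A plain (unstratified) split could easily leave a fold with no minority samples at all, which is exactly why stratification matters for imbalanced datasets.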
Shivangi1Raghav
Polycystic Ovary Syndrome (PCOS) is a widespread pathology that affects many aspects of women's health, with long-term consequences beyond the reproductive age. The wide variety of clinical presentations, as well as the lack of internationally accepted diagnostic procedures, has made it difficult to determine the exact etiology of the disease. The exact histology of PCOS is not yet clear; it is therefore considered multifactorial, involving both genetic and environmental factors. The aim of this project is to analyse the simple factors (height, weight, lifestyle changes, etc.) and complex factors (imbalances of bio-hormones and chemicals such as insulin, vitamin D, etc.) that contribute to the development of the disease. The data we used for our project was published on Kaggle in 2020 by Prasoon Kottarathil under the title Polycystic Ovary Syndrome (PCOS). This dataset contains records of 543 PCOS patients tested against 40 parameters. We used machine learning techniques such as Logistic Regression, Decision Trees, SVMs and Random Forests; a detailed analysis of all the items using graphs and programs, together with prediction using the machine learning models, helped us to identify the most important indicators of the condition.
Pseud0n1nja
Stroke_Prediction: Using different machine learning algorithms, from Random Forest to XGBoost, with SMOTE for imbalanced data
HPI-Information-Systems
DataGossip is an extension for asynchronous distributed data parallel machine learning that improves the training on imbalanced partitions.
RashmiRatnayake
This repository provides a benchmark dataset for using machine learning models for blockchain data trustworthiness. The data is categorized into three major dimensions of trust, each with 'balanced' and 'imbalanced' versions. Supporting Jupyter notebooks are included for data generation, labeling using weak supervision (Snorkel), and exporting.
splch
An effective and flexible Quantile-Based Balanced Sampling algorithm for addressing class imbalance in datasets while preserving the underlying data distribution, improving model performance across various machine learning applications.
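The repository's exact algorithm isn't described here, so the following is only a hypothetical, stdlib-only reading of the quantile-based idea: bin a majority-class feature by quantiles and take an equal share from each bin, so that undersampling preserves the shape of the distribution:

```python
import statistics

def quantile_balanced_undersample(values, n_keep, n_bins=4):
    """Undersample a 1-D majority-class feature by keeping an equal
    share from each quantile bin, preserving the spread of the data.
    Hypothetical sketch of the quantile-based idea, not the repo's code.
    """
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    bins = [[] for _ in range(n_bins)]
    for v in values:
        b = sum(v > c for c in cuts)  # index of v's quantile bin
        bins[min(b, n_bins - 1)].append(v)
    per_bin = n_keep // n_bins
    return [v for b in bins for v in sorted(b)[:per_bin]]

values = list(range(100))  # a uniform majority-class feature
kept = quantile_balanced_undersample(values, n_keep=20)
# kept draws 5 values from each quartile of the original distribution.
```

Unlike purely random undersampling, which can thin out the tails of the distribution by chance, per-quantile sampling guarantees every region of the feature's range stays represented.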
Jeremy-Cleland
This project aims to predict sepsis in patients using advanced machine learning models. The workflow encompasses data preprocessing, feature engineering, class imbalance handling, hyperparameter optimization, model training, evaluation, model card generation, and model registry management for reproducibility and scalability.
Ashutosh-2109
End-to-end customer churn prediction system using Machine Learning. Includes EDA, data preprocessing, class imbalance handling with SMOTE, model training using Random Forest and XGBoost, performance evaluation, feature importance analysis, and optional inference for new customers. Built as part of a CodeChef club recruitment task.
sethnlewis
An in-depth study of how attributes of political protests are an extremely valuable predictor of government change to come. The project uses Data Science techniques for detailed data wrangling, cleaning, and exploration before applying Machine Learning models. After applying techniques such as encoding, binning, log transformations, addressing class imbalance and hyperparameter tuning, the final model provides an F1 score of 0.80 and an accuracy of 0.96.
ANALYZING ROAD SAFETY & TRAFFIC DEMOGRAPHICS IN THE UK (Multi-class Classification) SUMMARY Here, I aim to analyze the Road Safety and Traffic Demographics dataset (UK), containing accidents reported by the police between 2004 and 2017. PROJECT GOALS: Identify the factors responsible for most of the reported accidents. Build a machine learning model capable of accurately predicting the severity of an accident. Provide recommendations to the Department for Transport (UK Government) to improve road safety policies and prevent recurrences of severe accidents where possible. PACKAGES USED: Scikit-learn, numpy, pandas, imblearn (imbalanced-learn), seaborn, Matplotlib. MOTIVATION The World Health Organization (WHO) reported that more than 1.25 million people die each year, and 50 million are injured, as a result of road accidents worldwide. Road accidents are the 10th leading cause of death globally; on current trends, road traffic accidents are projected to become the 7th leading cause of death by 2030, making them a major public health concern. Between 2005 and 2016, there were roughly 2 million road accidents reported in the United Kingdom (UK) alone, of which 16,000 were fatal. As a big data project, I wanted to explore the traffic demographics data in greater detail using machine learning! CONTEXT The UK government amassed traffic data from 2004 to 2017, recording over 2 million accidents in the process and making this one of the most comprehensive traffic datasets out there. It's a huge picture of a country undergoing change. Note that all the contained accident data comes from police reports, so this data does not include minor incidents.
For the steps undertaken to pre-process and clean the data, please view the "Data Cleansing & Descriptive Analysis_UK Traffic Demographics.ipynb" file. DESCRIPTIVE ANALYTICS (EDA) Tools used include Python, Tableau and MS PowerBI.
Percent (%) distribution of target classes (Accident Severity): as seen above, the data is highly imbalanced. For detailed steps undertaken to deal with the imbalanced data, please view the "Modelling_Predictive Analytics_UK Traffic Demographics.ipynb" file. This article provides some great tips on utilizing the correct performance metrics when analyzing the performance of a model trained on an imbalanced dataset. This article describes several strategies that can help combat a severely imbalanced dataset. Methods include: resampling strategies (undersampling: Tomek Links, Cluster Centroids; oversampling: SMOTE), using decision-tree-based models, and using cost-sensitive training (penalized algorithms).
Number of accidents by year and accident severity: the trend seems to be increasing over the years. In addition, the spike between 2008 and 2009 was due to an enhancement in the reporting system introduced in the UK in 2009, whereby all accidents, including minor ones, needed to be reported by the police so as to match the counts represented by hospitals, insurance claims, etc.
Accident density by location (geomap): most accidents took place in major cities such as Birmingham, London, Leeds and Newcastle.
Accidents by gender and age.
Accidents by day of the week and year: most accidents take place on a Friday.
Vehicle manoeuvre at time of accident: most accidents take place as a result of overtaking.
For more findings, please go to the "Images" folder.
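The cost-sensitive training strategy listed above typically relies on inverse-frequency class weights. A minimal sketch of the common "balanced" heuristic (the same formula scikit-learn uses for class_weight='balanced'), applied to hypothetical severity counts:

```python
def balanced_class_weights(y):
    """Inverse-frequency class weights: each class c gets
    n_samples / (n_classes * count(c)), so rarer classes carry
    proportionally larger misclassification penalties.
    """
    classes = sorted(set(y))
    n, k = len(y), len(classes)
    return {c: n / (k * y.count(c)) for c in classes}

# Hypothetical severity labels: 90 slight, 9 serious, 1 fatal accident.
y = [0] * 90 + [1] * 9 + [2] * 1
weights = balanced_class_weights(y)
# weights ≈ {0: 0.37, 1: 3.70, 2: 33.33}
```

Passing such weights to a classifier's loss function penalizes errors on the rare severe-accident classes far more heavily, which is the point of cost-sensitive training on imbalanced severity data.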
For the steps undertaken to carry out predictive modeling and hyper-parameter tuning, please view the "Modelling_Predictive Analytics_UK Traffic Demographics.ipynb" file. RECOMMENDATIONS TO THE DEPARTMENT FOR TRANSPORT (UK) Decrease emergency response times during afternoon rush hours (15:00-19:00), especially on Fridays. Allocate resources to investigate high-density traffic points and identify new infrastructure needs to divert traffic from dual carriageways. Explore the conditions of vehicles and casualties, such as vehicle type, age of registered vehicles and pedestrian movements, for policy makers. Adopt comprehensive distracted-driving laws that increase penalties for drivers who commit traffic violations such as aggressive overtaking. ACKNOWLEDGEMENTS The license for this dataset is the Open Government Licence used by all data on data.gov.uk. The raw datasets are available from the UK Department for Transport website. I had a lot of fun working on this dataset and learned a lot in the process. I plan to further my research in the area of predictive modeling using imbalanced data and how to effectively build a highly robust model for future projects.
pankaj614
Problem statement The problem statement chosen for this project is to predict fraudulent credit card transactions with the help of machine learning models. In this project, we will analyse customer-level data which was collected and analysed during a research collaboration between Worldline and the Machine Learning Group. The dataset is taken from the Kaggle website and contains a total of 284,807 transactions, of which 492 are fraudulent. Since the dataset is highly imbalanced, it needs to be handled before model building. Business Problem Overview For many banks, retaining highly profitable customers is the number one business goal. Banking fraud, however, poses a significant threat to this goal. In terms of substantial financial losses, trust and credibility, this is a concerning issue for banks and customers alike. The Nilson Report estimated that by 2020, banking fraud would amount to $30 billion worldwide. With the rise of digital payment channels, the number of fraudulent transactions is also increasing in new and different ways. In the banking industry, credit card fraud detection using machine learning is not just a trend but a necessity for putting proactive monitoring and fraud-prevention mechanisms in place. Machine learning is helping these institutions reduce time-consuming manual reviews, costly chargebacks and fees, and denials of legitimate transactions. Understanding and Defining Fraud Credit card fraud is any dishonest act or behaviour aimed at obtaining information, without proper authorization from the account holder, for financial gain. Among the different types of fraud, skimming is the most common: duplicating the information located on the magnetic strip of the card.
Apart from this, the other ways are: manipulation/alteration of genuine cards, creation of counterfeit cards, stolen/lost credit cards, and fraudulent telemarketing. Data Dictionary The dataset can be downloaded using this link. The data set includes credit card transactions made by European cardholders over a period of two days in September 2013. Out of a total of 284,807 transactions, 492 were fraudulent. This data set is highly unbalanced, with the positive class (frauds) accounting for 0.172% of the total transactions. The data set has also been transformed with Principal Component Analysis (PCA) to maintain confidentiality. Apart from 'time' and 'amount', all the other features (V1, V2, V3, up to V28) are the principal components obtained using PCA. The feature 'time' contains the seconds elapsed between the first transaction in the data set and each subsequent transaction. The feature 'amount' is the transaction amount. The feature 'class' represents the class label, taking the value 1 in cases of fraud and 0 otherwise. Project Pipeline The project pipeline can be briefly summarized in the following steps: Data Understanding: Here, we need to load the data and understand the features present in it. This will help us choose the features we need for our final model. Exploratory Data Analytics (EDA): Normally, in this step, we perform univariate and bivariate analyses of the data, followed by feature transformations if necessary. For the current data set, because the features are already Gaussian-like, we do not need to perform Z-scaling. However, we can check whether there is any skewness in the data and try to mitigate it, as it might cause problems during the model-building phase. Train/Test Split: We perform the train/test split so that we can check the performance of our models on unseen data. Here, for validation, we can use the k-fold cross-validation method.
We need to choose an appropriate k value so that the minority class is correctly represented in the test folds. Model-Building/Hyperparameter Tuning: This is the final step, at which we try different models and fine-tune their hyperparameters until we get the desired level of performance on the given dataset. We should also see whether we get a better model using the various sampling techniques. Model Evaluation: We need to evaluate the models using appropriate evaluation metrics. Note that since the data is imbalanced, it is more important to identify fraudulent transactions accurately than non-fraudulent ones, so we need to choose an evaluation metric that reflects this business goal.
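The choice of k can be sanity-checked with simple arithmetic: with only 492 fraud cases, the minority count per stratified test fold shrinks quickly as k grows. A tiny sketch (the k values shown are illustrative, not the project's chosen setting):

```python
def minority_per_test_fold(n_minority, k):
    """Approximate minority-class samples in each stratified test fold:
    the minority class is split roughly evenly across the k folds.
    """
    return n_minority // k

# With 492 fraudulent transactions in the dataset:
for k in (5, 10, 100):
    print(k, minority_per_test_fold(492, k))
```

k=5 leaves roughly 98 frauds per test fold, which is enough to estimate recall with some stability, while k=100 leaves only about 4, making per-fold metrics extremely noisy.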
ctcrahul
A Machine Learning project for detecting credit card fraud using imbalanced data handling and multiple classification models.