Search Results

Found 350 repositories (showing 30)

Online-Payments-Fraud-Detection-Dataset-Case-Study

valentineashio

❤️35

A Data Science/Machine Learning project. According to Bolster, the Global Fraud Index (as of June 2022) stood at 10,183 and was growing. This is a high risk to businesses and customers transacting online, and it indicates that traditional rules-based methods of detecting and combating fraud are fast becoming less effective. It is imperative for stakeholders to develop innovative means of making online transactions as safe as possible. Artificial intelligence provides viable and efficient solutions via machine learning models and algorithms. In this project, I trained a fraud detection model to predict online payment fraud, using Blossom Bank PLC as a case study. Blossom Bank (BB PLC) is a multinational financial services group headquartered in London, UK, that offers retail and investment banking, pension management, asset management and payment services. Blossom Bank wants to build a machine learning model to predict online payment fraud. Here is the dataset used for this task. With this model, BB PLC will: 1. Keep up with fast-evolving technological threats and better prevent the loss of funds (profit) to fraudsters. 2. Accurately detect and identify anomalies in online transactions on its platforms that may go undetected by traditional rules-based methods. 3. Improve quality assurance, thus retaining existing customers and acquiring new ones; this will increase its credit/profit base. 4. Improve its policy and decision making. Steps: 1. Loading the necessary Python libraries. 2. Loading the dataset. 3. Exploratory data analysis. 4. Highlighting relationships and insights. 5. Data transformation; using resampling techniques to address class imbalance. 6. Feature engineering. 7. Model training. 8. Model evaluation. Challenges: I ran into a number of errors while coding. These were due to improper documentation and syntax mistakes, especially during feature engineering (one-hot encoding: 'fit.transform').
This aspect consumed most of my time. I solved these challenges by researching extensively and paying close attention to syntax, and I solved the encoding problem by using 'pd.get_dummies()' with some additional arguments.
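The one-hot encoding step mentioned in the challenges could look roughly like the sketch below. It is only an illustration, not the notebook's actual code: the column names (`type`, `amount`) and the sample rows are hypothetical, and `prefix` and `dtype` are just one possible set of arguments to `pd.get_dummies()`.

```python
import pandas as pd

# Hypothetical transactions table; column names and values are illustrative only.
transactions = pd.DataFrame(
    {
        "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER"],
        "amount": [120.0, 75.5, 20.0, 310.0],
    }
)

# One-hot encode the categorical column; the numeric column passes through unchanged.
encoded = pd.get_dummies(transactions, columns=["type"], prefix="type", dtype=int)

print(encoded)
```

Each distinct value of `type` becomes its own 0/1 column, which is the form most scikit-learn estimators expect for categorical inputs.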

Jupyter Notebook

Updated 5 months ago

Task-3-Exploratory-Data-Analysis-Retail

abhisheks008

❤️40

Here I have created an analysis model of a supermarket dataset provided by the company and have deployed the parameters successfully.

Apache-2.0

Jupyter Notebook

Updated 1 year ago

Stock-prediction

rishabhathiya

❤️20

# Forecasting Stock Market Prices
It is a **Time Series** dataset. A time series is simply a series of data points ordered in time. In a time series, time is often the independent variable and the goal is usually to make a forecast for the future.
## PROBLEM STATEMENT:
Our aim is to create a model that can forecast the future stock price based on the model training and provided dataset.
### Data
We will be using a [Huge stock market dataset](https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs) from the Kaggle platform, which has a very good collection of datasets. The file we will be using is in the following directory of the dataset zip file: input\Data\Stocks\gs.us.txt
The data is presented in CSV format as follows: Date, Open, High, Low, Close, Volume, OpenInt.
Features:
- Date
- Open
- High
- Low
- Close
- Volume
- OpenInt
Note that prices have been adjusted for dividends and splits.
### LICENSE OF DATASET : [LICENSE](https://creativecommons.org/publicdomain/zero/1.0/)
### Requirements
This project requires **Python** and the following Python libraries installed:
- [NumPy](http://www.numpy.org/)
- [Pandas](http://pandas.pydata.org/)
- [matplotlib](http://matplotlib.org/)
- [scikit-learn](http://scikit-learn.org/stable/)
- [statsmodels](https://www.statsmodels.org/stable/)
You will also need to have software installed to run and execute a [Jupyter Notebook](http://ipython.org/notebook.html). If you do not have Python installed yet, it is highly recommended that you install the [Anaconda](http://continuum.io/downloads) distribution of Python, which already has the above packages and more included.
### Run
In a terminal or command window, navigate to the top-level project directory `STOCK MARKET FORECASTING/` (that contains this README) and run one of the following commands:
`ipython notebook Forecasting_Stock_Market_Prices_task.ipynb`
or
`jupyter notebook Forecasting_Stock_Market_Prices_task.ipynb`
This will open the Jupyter Notebook software and project file in your browser.
### Steps :
1. Importing Libraries
2. Exploring the Dataset
3. Exploratory Data Analysis
> * Univariate Analysis
4. Data Preprocessing
5. Model Building
> * AUTOREGRESSIVE MODEL
> * MOVING AVERAGE MODEL
6. Evaluation
> * MEAN SQUARE ERROR
> * MEAN ABSOLUTE ERROR
> * ROOT MEAN SQUARE ERROR
7. Conclusion
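The three error metrics named in the evaluation step can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code: the function names and the `actual`/`predicted` price lists are hypothetical.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same units as the prices."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical closing prices and forecasts, for illustration only.
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 100.0, 101.0, 103.0]

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mse, mae, rmse)  # 2.25 1.25 1.5
```

RMSE is often the easiest of the three to interpret for price forecasts because it is in the same units as the price itself.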

Jupyter Notebook

Updated 2 years ago

EDA-with-python-and-pandas

Mahima9861

❤️35

The goal is to perform Exploratory Data Analysis (EDA) on a supermarket sales dataset. It will be accomplished by completing each task in the project: Task 1: Initial Data Exploration; Task 2: Univariate Analysis; Task 3: Bivariate Analysis; Task 4: Dealing With Duplicate Rows and Missing Values; Task 5: Correlation Analysis.

Jupyter Notebook

Updated 2 years ago

Task-3-Exploratory-Data-Analysis---Retail

jr-pandit

❤️35

**Project done during the Data Science & Analytics Internship at The Sparks Foundation**

Python

Updated 4 years ago

TASK_3_EXPLORATORY-DATA-ANALYSIS-RETAIL-LEVEL-BEGINNER-

kaviyavp

❤️25

No description available

Jupyter Notebook

Updated 4 years ago

TheSparkFoundation-Internship-ML-

adityajamwal02

❤️35

Task-1 (Supervised ML) Task-2 (Unsupervised ML) Task-3 (Exploratory Data Analysis)

Jupyter Notebook

Updated 3 years ago

data-analysis machine-learning python +1

The-Sparks-Foundation-Internship

vaib321

❤️35

Data Science and Analytics Internship at The Sparks Foundation. This repository contains all the tasks for the Data Science and Analytics Intern at The Sparks Foundation. TASK-1: Improve our LinkedIn profile. TASK-2: To Explore Supervised Machine Learning. In this regression task we will predict the percentage of marks that a student is expected to score based upon the number of hours they studied. This is a simple linear regression task as it involves just two variables. TASK-3: To Explore Unsupervised Machine Learning. From the given ‘Iris’ dataset, predict the optimum number of clusters and represent it visually. TASK-4: To Explore Decision Tree Algorithm. For the given ‘Iris’ dataset, create the Decision Tree classifier and visualize it graphically. The purpose is that if we feed any new data to this classifier, it should be able to predict the right class accordingly. TASK-5: To Explore Business Analytics. Perform ‘Exploratory Data Analysis’ on the provided dataset ‘SampleSuperstore’. You are the business owner of the retail firm and want to see how your company is performing. You are interested in finding out the weak areas where you can work to make more profit. What business problems can you derive by looking into the data? You can use any tool of your choice (Python/R/Tableau/PowerBI/Excel).

Jupyter Notebook

Updated 3 years ago

Telecom-churn-case-study

ShrutiM1234

❤️35

Business problem overview
In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average annual churn rate of 15-25%. Given that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition. For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.
Understanding and defining churn
There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services). In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn. However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again). Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America. This project is based on the Indian and Southeast Asian market.
Definitions of churn
There are various ways to define churn, such as:
Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’. The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.
Usage-based churn: Customers who have not had any usage, either incoming or outgoing - in terms of calls, internet etc. - over a period of time. A potential shortcoming of this definition is that when the customer has stopped using the services for a while, it may be too late to take any corrective actions to retain them. For example, if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless since by that time the customer would have already switched to another operator.
In this project, you will use the usage-based definition to define churn.
High-value churn
In the Indian and the Southeast Asian market, approximately 80% of revenue comes from the top 20% of customers (called high-value customers). Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage. In this project, you will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers.
Understanding the business objective and the data
The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.
Understanding customer behaviour during churn
Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of the customer lifecycle:
The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.
The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer/improving the service quality etc.).
The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.
In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.
Data dictionary
The dataset can be downloaded using this link. The data dictionary is provided for download below. The data dictionary contains meanings of abbreviations. Some frequent ones are loc (local), IC (incoming), OG (outgoing), T2T (telecom operator to telecom operator), T2O (telecom operator to another operator), RECH (recharge) etc. The attributes containing 6, 7, 8, 9 as suffixes imply that those correspond to the months 6, 7, 8, 9 respectively.
Data Preparation
The following data preparation steps are crucial for this problem:
1. Derive new features. This is one of the most important parts of data preparation since good features are often the differentiators between good and bad models. Use your business understanding to derive features you think could be important indicators of churn.
2. Filter high-value customers. As mentioned above, you need to predict churn only for the high-value customers. Define high-value customers as follows: those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase). After filtering the high-value customers, you should get about 29.9k rows.
3. Tag churners and remove attributes of the churn phase. Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are: total_ic_mou_9, total_og_mou_9, vol_2g_mb_9, vol_3g_mb_9. After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘_9’, etc. in their names).
Modelling
Build models to predict churn. The predictive model that you’re going to build will serve two purposes:
- It will be used to predict whether a high-value customer will churn or not in the near future (i.e. the churn phase). By knowing this, the company can take action steps such as providing special plans, discounts on recharge etc.
- It will be used to identify important variables that are strong predictors of churn. These variables may also indicate why customers choose to switch to other networks.
In some cases, both of the above-stated goals can be achieved by a single machine learning model. But here, you have a large number of attributes, and thus you should try using a dimensionality reduction technique such as PCA and then build a predictive model. After PCA, you can use any classification model. Also, since the rate of churn is typically low (about 5-10%; this is called class imbalance), try using techniques to handle class imbalance.
You can take the following suggested steps to build the model:
- Preprocess data (convert columns to appropriate formats, handle missing values, etc.).
- Conduct appropriate exploratory analysis to extract useful insights (whether directly useful for business or for eventual modelling/feature engineering).
- Derive new features.
- Reduce the number of variables using PCA.
- Train a variety of models, tune model hyperparameters, etc. (handle class imbalance using appropriate techniques).
- Evaluate the models using appropriate evaluation metrics. Note that it is more important to identify churners than the non-churners accurately - choose an appropriate evaluation metric which reflects this business goal.
- Finally, choose a model based on some evaluation metric.
The above model will only be able to achieve one of the two goals - to predict customers who will churn. You can’t use the above model to identify the important features for churn. That’s because PCA usually creates components which are not easy to interpret. Therefore, build another model with the main objective of identifying important predictor attributes which help the business understand indicators of churn. A good choice to identify important variables is a logistic regression model or a model from the tree family. In case of logistic regression, make sure to handle multi-collinearity. After identifying important predictors, display them visually - you can use plots, summary tables etc. - whatever you think best conveys the importance of features. Finally, recommend strategies to manage customer churn based on your observations.
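The high-value filter and churn tagging described under data preparation could be sketched with pandas as below. This is only an illustration under stated assumptions: the toy rows are made up, and the recharge column names `total_rech_amt_6` and `total_rech_amt_7` are assumed (the description names only the four month-9 usage attributes).

```python
import pandas as pd

# Toy data for illustration. The recharge column names are an assumption;
# the four *_9 usage columns are the ones named in the description.
df = pd.DataFrame(
    {
        "total_rech_amt_6": [100, 500, 900, 50, 700],
        "total_rech_amt_7": [200, 700, 1100, 150, 900],
        "total_ic_mou_9": [10.0, 0.0, 35.0, 0.0, 0.0],
        "total_og_mou_9": [5.0, 0.0, 20.0, 0.0, 0.0],
        "vol_2g_mb_9": [0.0, 0.0, 12.0, 0.0, 0.0],
        "vol_3g_mb_9": [0.0, 0.0, 0.0, 0.0, 0.0],
    }
)

# Filter high-value customers: average recharge over the good phase (months 6 and 7)
# at or above the 70th percentile of that average.
avg_rech_good = df[["total_rech_amt_6", "total_rech_amt_7"]].mean(axis=1)
threshold = avg_rech_good.quantile(0.70)
high_value = df[avg_rech_good >= threshold].copy()

# Tag churners: no incoming or outgoing calls AND no mobile internet in month 9.
churn_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]
high_value["churn"] = (high_value[churn_cols] == 0).all(axis=1).astype(int)

# Remove all churn-phase attributes so they cannot leak into the features.
month9_cols = [c for c in high_value.columns if c.endswith("_9")]
high_value = high_value.drop(columns=month9_cols)

print(high_value)
```

Dropping the month-9 columns only after the label is computed keeps the target consistent with the usage-based definition while preventing leakage from the churn phase.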

Jupyter Notebook

Updated 6 months ago

Telecom-Churn-Case-Study

hebbarvn

❤️35

In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average annual churn rate of 15-25%. Given that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition. For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.
Understanding and Defining Churn
There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services). In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn. However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again). Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America. This project is based on the Indian and Southeast Asian market.
Definitions of Churn
There are various ways to define churn, such as:
Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’. The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.
Usage-based churn: Customers who have not had any usage, either incoming or outgoing - in terms of calls, internet etc. - over a period of time. A potential shortcoming of this definition is that when the customer has stopped using the services for a while, it may be too late to take any corrective actions to retain them. For example, if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless since by that time the customer would have already switched to another operator.
In this project, you will use the usage-based definition to define churn.
High-value Churn
In the Indian and the Southeast Asian market, approximately 80% of revenue comes from the top 20% of customers (called high-value customers). Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage. In this project, you will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers.
Understanding the Business Objective and the Data
The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.
Understanding Customer Behaviour During Churn
Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of the customer lifecycle:
The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.
The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer/improving the service quality etc.).
The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.
In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.
The data dictionary contains meanings of abbreviations. Some frequent ones are loc (local), IC (incoming), OG (outgoing), T2T (telecom operator to telecom operator), T2O (telecom operator to another operator), RECH (recharge) etc. The attributes containing 6, 7, 8, 9 as suffixes imply that those correspond to the months 6, 7, 8, 9 respectively.
Data Preparation
The following data preparation steps are crucial for this problem:
1. Derive new features. This is one of the most important parts of data preparation since good features are often the differentiators between good and bad models. Use your business understanding to derive features you think could be important indicators of churn.
2. Filter high-value customers. As mentioned above, you need to predict churn only for the high-value customers. Define high-value customers as follows: those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase). After filtering the high-value customers, you should get about 29.9k rows.
3. Tag churners and remove attributes of the churn phase. Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are: total_ic_mou_9, total_og_mou_9, vol_2g_mb_9, vol_3g_mb_9. After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘_9’, etc. in their names).
Modelling
Build models to predict churn. The predictive model that you’re going to build will serve two purposes:
- It will be used to predict whether a high-value customer will churn or not in the near future (i.e. the churn phase). By knowing this, the company can take action steps such as providing special plans, discounts on recharge etc.
- It will be used to identify important variables that are strong predictors of churn. These variables may also indicate why customers choose to switch to other networks.
In some cases, both of the above-stated goals can be achieved by a single machine learning model. But here, you have a large number of attributes, and thus you should try using a dimensionality reduction technique such as PCA and then build a predictive model. After PCA, you can use any classification model. Also, since the rate of churn is typically low (about 5-10%; this is called class imbalance), try using techniques to handle class imbalance.
You can take the following suggested steps to build the model:
- Preprocess data (convert columns to appropriate formats, handle missing values, etc.).
- Conduct appropriate exploratory analysis to extract useful insights (whether directly useful for business or for eventual modelling/feature engineering).
- Derive new features.
- Reduce the number of variables using PCA.
- Train a variety of models, tune model hyperparameters, etc. (handle class imbalance using appropriate techniques).
- Evaluate the models using appropriate evaluation metrics. Note that it is more important to identify churners than the non-churners accurately - choose an appropriate evaluation metric which reflects this business goal.
- Finally, choose a model based on some evaluation metric.
The above model will only be able to achieve one of the two goals - to predict customers who will churn. You can’t use the above model to identify the important features for churn. That’s because PCA usually creates components which are not easy to interpret. Therefore, build another model with the main objective of identifying important predictor attributes which help the business understand indicators of churn. A good choice to identify important variables is a logistic regression model or a model from the tree family. In case of logistic regression, make sure to handle multi-collinearity. After identifying important predictors, display them visually - you can use plots, summary tables etc. - whatever you think best conveys the importance of features. Finally, recommend strategies to manage customer churn based on your observations.
Note: Everything has to be submitted in one Jupyter notebook. The evaluation rubrics are mentioned on the next page.
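The point about choosing a metric that favours catching churners can be illustrated with a small plain-Python comparison of accuracy and recall. This is a sketch only: the labels and predictions are hypothetical, and recall is one possible choice of metric, not one prescribed by the assignment.

```python
def accuracy(y_true, y_pred):
    """Fraction of all customers whose label was predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual churners (positive class) that the model caught."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical labels: 2 churners among 10 customers; the model misses one of them.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(acc, rec)  # 0.9 0.5
```

With imbalanced classes, accuracy stays high even though half of the churners are missed, which is why a churner-focused metric reflects the business goal better.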

Jupyter Notebook

Updated 1 year ago

Task-3-Exploratory-Data-Analysis

Harikrishnaa3131

❤️30

No description available

MIT

Jupyter Notebook

Updated 4 years ago

AI-ML-Internship-Task-3

Vidit3859

❤️45

Task 3 – Exploratory Data Analysis (EDA)

Jupyter Notebook

Updated 2 months ago

Exploratory-Data-Analysis---Retail

hari-kalyan-2

❤️35

Task-3 Exploratory Data Analysis - Retail

Jupyter Notebook

Updated 3 years ago

Exploratory-Data-Analysis---Retail-Task-3

ria496

❤️25

No description available

Jupyter Notebook

Updated 4 years ago

Task-3---Exploratory-Data-Analysis---Retail

Prathyusha-L

❤️25

No description available

Jupyter Notebook

Updated 4 years ago

Task-3.-Exploratory-Data-Analysis---Retail

ritikumar2905

❤️25

No description available

Jupyter Notebook

Updated 4 years ago

Task-3-Exploratory-Data-Analysis---Retail-

chandan5569

❤️35

● Perform ‘Exploratory Data Analysis’ on dataset ‘SampleSuperstore’.
● As a business manager, try to find out the weak areas where you can work to make more profit.
● What business problems can you derive by exploring the data?

Jupyter Notebook

Updated 3 years ago

The-Sparks-Foundation_Task3_Exploratory-Data-Analysis

khushiisharmaa

❤️25

No description available

Jupyter Notebook

Updated 1 year ago

Spark_Foundation_Task3_EXploratory-Data-Analysis_Retail

sinumariam

❤️35

This is a project I performed during my Internship @The Spark Foundation

Updated 3 years ago

GRIP-FEBRUARY22-TASK3-EXPLORATORY-DATA-ANALYSIS

vinay976

❤️35

We will be trying to find out the weak areas where we can work to make more profit.

Jupyter Notebook

Updated 3 years ago

Task-3Exploratory-Data-Analysis---Retail-By-Akshay

Akshaypareek01

❤️25

No description available

Updated 9 months ago

TASK-3-Perform-Exploratory-Data-Analysis-on-dataset-SampleSuperstore-

mallikarjunyadav27

❤️35

As a business manager, try to find out the weak areas where you can work to make more profit. The approach is to ask: what business problems can you derive by exploring the data?

Jupyter Notebook

Updated 4 years ago

Task-3-Exploratory-Data-Analysis-Retail-on-dataset-SampleSuperstore

kingharshit

❤️35

Data Science And Business Analytics Internship GRIP The Spark Foundation GRIPNOV20

Jupyter Notebook

Updated 5 years ago

Task-3-Exploratory-Data-Analysis-Retail-on-dataset-SampleSuperstore-main

prithvidev

❤️25

No description available

Jupyter Notebook

Updated 5 years ago

Code-Alpha-Data-Analyst-Internship

Allaboutanshul

🧡55

CodeAlpha Data Analytics Tasks 📋 Overview: This repository showcases the completion of my Data Analytics internship tasks at CodeAlpha. It demonstrates a complete data pipeline from raw web extraction to advanced sentiment insights. Task 1: Web Scraping, Task 2: Exploratory Data Analysis (EDA), Task 3: Data Visualization, and Task 4: Sentiment Analysis.

Jupyter Notebook

Updated 2 weeks ago

Disney_Studio_Income_Analytics

ShripadJagtap

❤️35

Disney Studio Income Analytics. Task 3, CollegeRanker: Exploratory data analysis and regression models to predict box office revenue for Disney movies produced since the debut film Snow White and the Seven Dwarfs in 1937.

Jupyter Notebook

Updated 3 years ago

Exploratory-Data-Analysis-on-SampleSuperstore-Dataset

sneha24102000

❤️35

Author: Sneha M. Data Science & Business Analytics Internship, GRIP - The Spark Foundation. TASK 3 - Perform ‘Exploratory Data Analysis’ on dataset. Objective: 1. As a business manager, try to find out the weak areas where you can work to make more profit. 2. What business problems can you derive by exploring the data?

Jupyter Notebook

Updated 5 years ago

IBM-HR-Analytics-Employee-Attrition-Modeling-.

MohdShadab999

❤️35

DESCRIPTION
IBM is an American MNC operating in around 170 countries, with computing, software, and hardware as its major business verticals. Attrition is a major risk to service-providing organizations where trained and experienced people are the assets of the company. The organization would like to identify the factors which influence the attrition of employees.
Data Dictionary
Age: Age of employee
Attrition: Employee attrition status
Department: Department of work
DistanceFromHome
Education: 1-Below College; 2-College; 3-Bachelor; 4-Master; 5-Doctor
EducationField
EnvironmentSatisfaction: 1-Low; 2-Medium; 3-High; 4-Very High
JobSatisfaction: 1-Low; 2-Medium; 3-High; 4-Very High
MaritalStatus
MonthlyIncome
NumCompaniesWorked: Number of companies worked prior to IBM
WorkLifeBalance: 1-Bad; 2-Good; 3-Better; 4-Best
YearsAtCompany: Current years of service in IBM
Analysis Task:
- Import attrition dataset and import libraries such as pandas, matplotlib.pyplot, numpy, and seaborn.
- Exploratory data analysis
  - Find the age distribution of employees in IBM
  - Explore attrition by age
  - Explore data for Left employees
  - Find out the distribution of employees by the education field
  - Give a bar chart for the number of married and unmarried employees
- Build up a logistic regression model to predict which employees are likely to attrite.

Updated 3 years ago

Data-Analysis-Project-02

Mickwen

❤️35

Introduction
Fine particulate matter (PM2.5) is an ambient air pollutant for which there is strong evidence that it is harmful to human health. In the United States, the Environmental Protection Agency (EPA) is tasked with setting national ambient air quality standards for fine PM and for tracking the emissions of this pollutant into the atmosphere. Approximately every 3 years, the EPA releases its database on emissions of PM2.5. This database is known as the National Emissions Inventory (NEI). You can read more information about the NEI at the EPA National Emissions Inventory web site. For each year and for each type of PM source, the NEI records how many tons of PM2.5 were emitted from that source over the course of the entire year. The data that you will use for this assignment are for 1999, 2002, 2005, and 2008.
Data
The data for this assignment are available from the course web site as a single zip file: Data for Peer Assessment [29Mb]
The zip file contains two files:
PM2.5 Emissions Data (summarySCC_PM25.rds): This file contains a data frame with all of the PM2.5 emissions data for 1999, 2002, 2005, and 2008. For each year, the table contains the number of tons of PM2.5 emitted from a specific type of source for the entire year. Here are the first few rows.
##     fips      SCC Pollutant Emissions  type year
## 4  09001 10100401  PM25-PRI    15.714 POINT 1999
## 8  09001 10100404  PM25-PRI   234.178 POINT 1999
## 12 09001 10100501  PM25-PRI     0.128 POINT 1999
## 16 09001 10200401  PM25-PRI     2.036 POINT 1999
## 20 09001 10200504  PM25-PRI     0.388 POINT 1999
## 24 09001 10200602  PM25-PRI     1.490 POINT 1999
- fips: A five-digit number (represented as a string) indicating the U.S. county
- SCC: The name of the source as indicated by a digit string (see source code classification table)
- Pollutant: A string indicating the pollutant
- Emissions: Amount of PM2.5 emitted, in tons
- type: The type of source (point, non-point, on-road, or non-road)
- year: The year of emissions recorded
Source Classification Code Table (Source_Classification_Code.rds): This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source. The sources are categorized in a few different ways from more general to more specific and you may choose to explore whatever categories you think are most useful. For example, source “10100101” is known as “Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal”.
You can read each of the two files using the readRDS() function in R. For example, reading in each file can be done with the following code:
## This first line will likely take a few seconds. Be patient!
NEI <- readRDS("summarySCC_PM25.rds")
SCC <- readRDS("Source_Classification_Code.rds")
as long as each of those files is in your current working directory (check by calling dir() and see if those files are in the listing).
Assignment
The overall goal of this assignment is to explore the National Emissions Inventory database and see what it says about fine particulate matter pollution in the United States over the 10-year period 1999–2008. You may use any R package you want to support your analysis.
Questions
You must address the following questions and tasks in your exploratory analysis. For each question/task you will need to make a single plot. Unless specified, you can use any plotting system in R to make your plot.
1. Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the total PM2.5 emission from all sources for each of the years 1999, 2002, 2005, and 2008.
2. Have total emissions from PM2.5 decreased in Baltimore City, Maryland (fips == "24510") from 1999 to 2008? Use the base plotting system to make a plot answering this question.
3. Of the four types of sources indicated by the type (point, nonpoint, onroad, nonroad) variable, which of these four sources have seen decreases in emissions from 1999–2008 for Baltimore City? Which have seen increases in emissions from 1999–2008? Use the ggplot2 plotting system to make a plot answering this question.
4. Across the United States, how have emissions from coal combustion-related sources changed from 1999–2008?
5. How have emissions from motor vehicle sources changed from 1999–2008 in Baltimore City?
6. Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California (fips == "06037"). Which city has seen greater changes over time in motor vehicle emissions?
Making and Submitting Plots
For each plot you should
- Construct the plot and save it to a PNG file.
- Create a separate R code file (plot1.R, plot2.R, etc.) that constructs the corresponding plot, i.e. code in plot1.R constructs the plot1.png plot. Your code file should include code for reading the data so that the plot can be fully reproduced. You must also include the code that creates the PNG file. Only include the code for a single plot (i.e. plot1.R should only include code for producing plot1.png).
- Upload the PNG file on the Assignment submission page.
- Copy and paste the R code from the corresponding R file into the text box at the appropriate point in the peer assessment.
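The assignment itself asks for R, but the aggregation behind the total-emissions questions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six records are the sample rows quoted in the description, and the function name total_by_year and its optional fips filter are hypothetical helpers, not part of the course material.

```python
from collections import defaultdict

# The six sample rows quoted in the description:
# (fips, SCC, Pollutant, Emissions in tons, type, year)
SAMPLE_ROWS = [
    ("09001", "10100401", "PM25-PRI", 15.714, "POINT", 1999),
    ("09001", "10100404", "PM25-PRI", 234.178, "POINT", 1999),
    ("09001", "10100501", "PM25-PRI", 0.128, "POINT", 1999),
    ("09001", "10200401", "PM25-PRI", 2.036, "POINT", 1999),
    ("09001", "10200504", "PM25-PRI", 0.388, "POINT", 1999),
    ("09001", "10200602", "PM25-PRI", 1.490, "POINT", 1999),
]


def total_by_year(records, fips=None):
    """Sum emissions per year, optionally restricted to one county code.

    Hypothetical helper: pass fips="24510" to restrict to Baltimore City.
    """
    totals = defaultdict(float)
    for county, _scc, _pollutant, emissions, _source_type, year in records:
        if fips is not None and county != fips:
            continue
        totals[year] += emissions
    return dict(totals)


totals = total_by_year(SAMPLE_ROWS)
print({year: round(tons, 3) for year, tons in totals.items()})
```

With the full NEI table loaded, the same per-year totals are what the first two plots would display; the sample rows cover only a single county and year.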

Updated 10 years ago

Telecom-Churn-Prediction

sidf3ar

❤️35

Business Problem Overview
In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition. For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.
Understanding and Defining Churn
There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services). In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn. However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again). Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America. This project is based on the Indian and Southeast Asian market.
Definitions of Churn
There are various ways to define churn, such as:
- Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’. The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.
- Usage-based churn: Customers who have not done any usage, either incoming or outgoing - in terms of calls, internet etc. over a period of time. A potential shortcoming of this definition is that when the customer has stopped using the services for a while, it may be too late to take any corrective actions to retain them. For example, if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless since by that time the customer would have already switched to another operator.
In this project, you will use the usage-based definition to define churn.
High-value Churn
In the Indian and the Southeast Asian market, approximately 80% of revenue comes from the top 20% customers (called high-value customers). Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage. In this project, you will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers.
Understanding the Business Objective and the Data
The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.
Understanding Customer Behaviour During Churn
Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of customer lifecycle:
- The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.
- The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer/improving the service quality etc.).
- The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.
In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.
Data Dictionary
The dataset can be downloaded using this link. The data dictionary is provided for download below.
Data Dictionary - Telecom Churn
The data dictionary contains meanings of abbreviations. Some frequent ones are loc (local), IC (incoming), OG (outgoing), T2T (telecom operator to telecom operator), T2O (telecom operator to another operator), RECH (recharge) etc. The attributes containing 6, 7, 8, 9 as suffixes imply that those correspond to the months 6, 7, 8, 9 respectively.
Data Preparation
The following data preparation steps are crucial for this problem:
1. Derive new features. This is one of the most important parts of data preparation since good features are often the differentiators between good and bad models. Use your business understanding to derive features you think could be important indicators of churn.
2. Filter high-value customers. As mentioned above, you need to predict churn only for the high-value customers. Define high-value customers as follows: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase). After filtering the high-value customers, you should get about 29.9k rows.
3. Tag churners and remove attributes of the churn phase. Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are: total_ic_mou_9, total_og_mou_9, vol_2g_mb_9, vol_3g_mb_9. After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘_9’, etc. in their names).
Modelling
Build models to predict churn. The predictive model that you’re going to build will serve two purposes:
- It will be used to predict whether a high-value customer will churn or not in the near future (i.e. the churn phase). By knowing this, the company can take action steps such as providing special plans, discounts on recharge etc.
- It will be used to identify important variables that are strong predictors of churn. These variables may also indicate why customers choose to switch to other networks.
In some cases, both of the above-stated goals can be achieved by a single machine learning model. But here, you have a large number of attributes, and thus you should try using a dimensionality reduction technique such as PCA and then build a predictive model. After PCA, you can use any classification model. Also, since the rate of churn is typically low (about 5-10%; this is called class imbalance), try using techniques to handle class imbalance.
You can take the following suggestive steps to build the model:
- Preprocess data (convert columns to appropriate formats, handle missing values, etc.)
- Conduct appropriate exploratory analysis to extract useful insights (whether directly useful for business or for eventual modelling/feature engineering).
- Derive new features.
- Reduce the number of variables using PCA.
- Train a variety of models, tune model hyperparameters, etc. (handle class imbalance using appropriate techniques).
- Evaluate the models using appropriate evaluation metrics. Note that it is more important to identify churners than the non-churners accurately - choose an appropriate evaluation metric which reflects this business goal.
- Finally, choose a model based on some evaluation metric.
The above model will only be able to achieve one of the two goals - to predict customers who will churn. You can’t use the above model to identify the important features for churn. That’s because PCA usually creates components which are not easy to interpret. Therefore, build another model with the main objective of identifying important predictor attributes which help the business understand indicators of churn. A good choice to identify important variables is a logistic regression model or a model from the tree family. In case of logistic regression, make sure to handle multi-collinearity. After identifying important predictors, display them visually - you can use plots, summary tables etc. - whatever you think best conveys the importance of features. Finally, recommend strategies to manage customer churn based on your observations.
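The high-value filter and the churn tagging rule described under Data Preparation can be sketched as follows. This is a minimal illustration, not the project's actual code: the four churn-phase attribute names come from the description, while the id, rech_6 and rech_7 fields and the five toy customers are hypothetical. It also assumes that "average recharge amount in the first two months" means the per-customer mean of months 6 and 7, with the 70th percentile then taken across customers.

```python
from statistics import quantiles

# Churn-phase attributes named in the description.
CHURN_ATTRS = ("total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9")


def make_customer(cid, rech_6, rech_7, usage_9):
    """Build a toy record; id, rech_6 and rech_7 are hypothetical field names."""
    row = {"id": cid, "rech_6": rech_6, "rech_7": rech_7}
    row.update(dict.fromkeys(CHURN_ATTRS, usage_9))
    return row


customers = [
    make_customer(1, 100, 100, 12.5),
    make_customer(2, 150, 250, 0.0),
    make_customer(3, 300, 300, 40.0),
    make_customer(4, 350, 450, 0.0),   # high value, no usage in month 9
    make_customer(5, 500, 500, 75.0),  # high value, still active
]


def avg_recharge(row):
    """Mean recharge over the two 'good' months (6 and 7)."""
    return (row["rech_6"] + row["rech_7"]) / 2


def high_value(rows):
    """Keep customers at or above the 70th percentile of average recharge."""
    averages = [avg_recharge(r) for r in rows]
    # Deciles with linear interpolation; index 6 is the 70th percentile.
    cutoff = quantiles(averages, n=10, method="inclusive")[6]
    return [r for r in rows if avg_recharge(r) >= cutoff]


def tag_churn(row):
    """Set churn=1 if all month-9 usage is zero, then drop the '_9' attributes."""
    tagged = {k: v for k, v in row.items() if not k.endswith("_9")}
    tagged["churn"] = int(all(row[a] == 0 for a in CHURN_ATTRS))
    return tagged


result = [tag_churn(r) for r in high_value(customers)]
print(result)
```

Dropping the '_9' fields only after computing the label mirrors the requirement that churn-phase data is unavailable at prediction time.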

Jupyter Notebook

Updated 6 months ago

decision-trees, depth-map, logistic-regression, +2
