Found 350 repositories (showing 30)
valentineashio
A Data Science/Machine Learning Project. According to Bolster, the Global Fraud Index (as at June 2022) is at 10,183 and growing. This is a high risk to businesses and customers transacting online. It indicates that traditional rules-based methods of detecting and combating fraud are fast becoming less effective. It becomes imperative for stakeholders to develop innovative means to make transacting online as safe as possible. Artificial intelligence provides viable and efficient solutions via Machine Learning models/algorithms. In this project, I trained a fraud detection model to predict online payment fraud using Blossom Bank PLC as a case study. Blossom Bank (BB PLC) is a multinational financial services group, headquartered in London, UK, that offers retail and investment banking, pension management, asset management and payment services. Blossom Bank wants to build a machine learning model to predict online payment fraud. Here is the dataset used for this task. With this model, BB PLC will: 1. Keep up with fast-evolving technological threats and better prevent the loss of funds (profit) to fraudsters. 2. Accurately detect and identify anomalies in online transactions done on its platforms which may go undetected using traditional rules-based methods. 3. Improve quality assurance, thus retaining old customers and acquiring new ones. This will increase its credit/profit base. 4. Improve its policy and decision making. Steps: 1. Loading necessary Python libraries. 2. Loading Dataset. 3. Exploratory Data Analysis. 4. Highlighting Relationships and insights. 5. Data Transformation; using resampling techniques to address class imbalance. 6. Feature Engineering. 7. Model Training. 8. Model Evaluation. Challenges: I encountered a number of challenges during coding which caused errors. These were due to inadequate documentation and syntax mistakes, especially during feature engineering (one-hot encoding: 'fit.transform').
This aspect consumed most of my time. I was able to solve these challenges by doing extensive research and paying close attention to syntax. I was able to solve the encoding problem by using 'pd.get_dummies()' and specifying some of the method's parameters.
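The one-hot encoding step mentioned in the challenges above can be sketched roughly as follows. This is only an illustration: the column names (`type`, `amount`, `isFraud`) and the values are hypothetical and are not taken from the project's dataset, and the author's actual `pd.get_dummies()` arguments are not known.

```python
import pandas as pd

# Hypothetical transactions; column names and values are illustrative only.
df = pd.DataFrame({
    "type": ["CASH_OUT", "PAYMENT", "TRANSFER", "PAYMENT"],
    "amount": [181.0, 9839.64, 181.0, 1864.28],
    "isFraud": [1, 0, 1, 0],
})

# One-hot encode the categorical column. Numeric columns pass through
# unchanged; each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["type"], prefix="type", dtype=int)

print(encoded.columns.tolist())
```

Unlike an encoder that is fitted and then applied, `pd.get_dummies()` works directly on the DataFrame, which may be why it was the simpler route here.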
abhisheks008
Here I have created the analysis model of a Super market datasheet given by the company and have deployed the parameters successfully.
rishabhathiya
# Forecasting Stock Market Prices

It is a **Time Series** dataset. A time series is simply a series of data points ordered in time. In a time series, time is often the independent variable and the goal is usually to make a forecast for the future.

## PROBLEM STATEMENT:

Our aim is to create a model that can forecast the future stock price based on the model training and provided dataset.

### Data

We will be using a [Huge stock market dataset](https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs) from the Kaggle platform, which has a very good collection of datasets. The file we will be using is located at `input\Data\Stocks\gs.us.txt` in the dataset zip file. The data is presented in CSV format as follows: Date, Open, High, Low, Close, Volume, OpenInt.

Features:

- Date
- Open
- High
- Low
- Close
- Volume
- OpenInt

Note that prices have been adjusted for dividends and splits.

### LICENSE OF DATASET : [LICENSE](https://creativecommons.org/publicdomain/zero/1.0/)

### Requirements

This project requires **Python** and the following Python libraries installed:

- [NumPy](http://www.numpy.org/)
- [Pandas](http://pandas.pydata.org/)
- [matplotlib](http://matplotlib.org/)
- [scikit-learn](http://scikit-learn.org/stable/)
- [statsmodels](https://www.statsmodels.org/stable/)

You will also need to have software installed to run and execute a [Jupyter Notebook](http://ipython.org/notebook.html). If you do not have Python installed yet, it is highly recommended that you install the [Anaconda](http://continuum.io/downloads) distribution of Python, which already has the above packages and more included.

### Run

In a terminal or command window, navigate to the top-level project directory `STOCK MARKET FORECASTING/` (that contains this README) and run one of the following commands: `ipython notebook Forecasting_Stock_Market_Prices_task.ipynb` or `jupyter notebook Forecasting_Stock_Market_Prices_task.ipynb`. This will open the Jupyter Notebook software and project file in your browser.

### Steps :

1. Importing Libraries
2. Exploring the Dataset
3. Exploratory Data Analysis
   * Univariate Analysis
4. Data Preprocessing
5. Model Building
   * AUTOREGRESSIVE MODEL
   * MOVING AVERAGE MODEL
6. Evaluation
   * MEAN SQUARE ERROR
   * MEAN ABSOLUTE ERROR
   * ROOT MEAN SQUARE ERROR
7. Conclusion
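The three evaluation metrics listed in the stock-forecasting steps above can be computed along these lines. This is a minimal sketch using NumPy only; the function name `forecast_errors` and the sample prices are assumptions made for illustration, not code or data from the repository.

```python
import numpy as np

def forecast_errors(actual, predicted):
    """Return MSE, MAE and RMSE for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted
    mse = float(np.mean(residuals ** 2))      # mean square error
    mae = float(np.mean(np.abs(residuals)))   # mean absolute error
    rmse = float(np.sqrt(mse))                # root mean square error
    return {"mse": mse, "mae": mae, "rmse": rmse}

# Hypothetical closing prices and forecasts.
errors = forecast_errors([10.0, 12.0, 14.0], [11.0, 12.0, 12.0])
print(errors)
```

RMSE is in the same units as the price itself, which usually makes it easier to interpret than MSE when comparing the autoregressive and moving-average models.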
Mahima9861
To perform Exploratory Data Analysis (EDA) on a supermarket sales dataset. It will be accomplished by completing each task in the project: Task 1: Initial Data Exploration Task 2: Univariate Analysis Task 3: Bivariate Analysis Task 4: Dealing With Duplicate Rows and Missing Values Task 5: Correlation Analysis
** Project done during the Data Science & Analytics Internship at The Sparks Foundation **
No description available
adityajamwal02
Task-1 (Supervised ML) Task-2 (Unsupervised ML) Task-3 (Exploratory Data Analysis)
Data Science and Analytics Internship at The Sparks Foundation This repository contains all the tasks for the Data Science and Analytics Intern at The Sparks Foundation. TASK-1 Improve our LinkedIn profile. TASK-2 To Explore Supervised Machine Learning In this regression task we will predict the percentage of marks that a student is expected to score based upon the number of hours they studied. This is a simple linear regression task as it involves just two variables. TASK-3 To Explore Unsupervised Machine Learning From the given ‘Iris’ dataset, predict the optimum number of clusters and represent it visually. TASK-4 To Explore Decision Tree Algorithm For the given ‘Iris’ dataset, create the Decision Tree classifier and visualize it graphically. The purpose is if we feed any new data to this classifier, it would be able to predict the right class accordingly. TASK-5 To explore Business Analytics Perform ‘Exploratory Data Analysis’ on the provided dataset ‘SampleSuperstore’. You are the business owner of the retail firm and want to see how your company is performing. You are interested in finding out the weak areas where you can work to make more profit. What all business problems you can derive by looking into the data? You can choose any of the tool of your choice (Python/R/Tableau/PowerBI/Excel)
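The simple linear regression described in TASK-2 above (predicting a score from hours studied) might look roughly like this. The data points are made up for illustration and are not the internship dataset; `predict_score` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical study-hours and exam-score pairs (not the internship data).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Least-squares fit of: scores = slope * hours + intercept.
slope, intercept = np.polyfit(hours, scores, deg=1)

def predict_score(study_hours):
    """Predict the score for a given number of study hours."""
    return slope * study_hours + intercept

print(predict_score(6.0))
```

With only two variables, a degree-1 polynomial fit is equivalent to ordinary least-squares linear regression, so no machine learning library is strictly needed for this task.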
ShrutiM1234
Business problem overview In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition. For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn. Understanding and defining churn There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services). In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn. However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again). Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America.
This project is based on the Indian and Southeast Asian market. Definitions of churn There are various ways to define churn, such as: Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’. The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas. Usage-based churn: Customers who have not done any usage, either incoming or outgoing - in terms of calls, internet etc. over a period of time. A potential shortcoming of this definition is that when the customer has stopped using the services for a while, it may be too late to take any corrective actions to retain them. For e.g., if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless since by that time the customer would have already switched to another operator. In this project, you will use the usage-based definition to define churn. High-value churn In the Indian and the Southeast Asian market, approximately 80% of revenue comes from the top 20% customers (called high-value customers). Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage. In this project, you will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers. Understanding the business objective and the data The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. 
The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful. Understanding customer behaviour during churn Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of customer lifecycle : The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual. The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer/improving the service quality etc.) The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase. In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase. Data dictionary The dataset can be downloaded using this link. The data dictionary is provided for download. The data dictionary contains meanings of abbreviations.
Some frequent ones are loc (local), IC (incoming), OG (outgoing), T2T (telecom operator to telecom operator), T2O (telecom operator to another operator), RECH (recharge) etc. The attributes containing 6, 7, 8, 9 as suffixes imply that those correspond to the months 6, 7, 8, 9 respectively. Data Preparation The following data preparation steps are crucial for this problem: 1. Derive new features This is one of the most important parts of data preparation since good features are often the differentiators between good and bad models. Use your business understanding to derive features you think could be important indicators of churn. 2. Filter high-value customers As mentioned above, you need to predict churn only for the high-value customers. Define high-value customers as follows: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase). After filtering the high-value customers, you should get about 29.9k rows. 3. Tag churners and remove attributes of the churn phase Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are: total_ic_mou_9 total_og_mou_9 vol_2g_mb_9 vol_3g_mb_9 After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘ _9’, etc. in their names). Modelling Build models to predict churn. The predictive model that you’re going to build will serve two purposes: It will be used to predict whether a high-value customer will churn or not, in near future (i.e. churn phase). By knowing this, the company can take action steps such as providing special plans, discounts on recharge etc. It will be used to identify important variables that are strong predictors of churn. 
These variables may also indicate why customers choose to switch to other networks. In some cases, both of the above-stated goals can be achieved by a single machine learning model. But here, you have a large number of attributes, and thus you should try using a dimensionality reduction technique such as PCA and then build a predictive model. After PCA, you can use any classification model. Also, since the rate of churn is typically low (about 5-10%, this is called class-imbalance) - try using techniques to handle class imbalance. You can take the following suggestive steps to build the model: Preprocess data (convert columns to appropriate formats, handle missing values, etc.) Conduct appropriate exploratory analysis to extract useful insights (whether directly useful for business or for eventual modelling/feature engineering). Derive new features. Reduce the number of variables using PCA. Train a variety of models, tune model hyperparameters, etc. (handle class imbalance using appropriate techniques). Evaluate the models using appropriate evaluation metrics. Note that it is more important to identify churners than the non-churners accurately - choose an appropriate evaluation metric which reflects this business goal. Finally, choose a model based on some evaluation metric. The above model will only be able to achieve one of the two goals - to predict customers who will churn. You can’t use the above model to identify the important features for churn. That’s because PCA usually creates components which are not easy to interpret. Therefore, build another model with the main objective of identifying important predictor attributes which help the business understand indicators of churn. A good choice to identify important variables is a logistic regression model or a model from the tree family. In case of logistic regression, make sure to handle multi-collinearity. After identifying important predictors, display them visually - you can use plots, summary tables etc. 
- whatever you think best conveys the importance of features. Finally, recommend strategies to manage customer churn based on your observations.
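The high-value filter and churn tagging described in the data preparation steps above could be sketched as follows. The four `_9` usage columns are the ones named in the text; the recharge columns `total_rech_amt_6` and `total_rech_amt_7` and all values are assumptions made for illustration, since the text does not name the recharge attributes.

```python
import pandas as pd

# Tiny made-up sample; recharge column names are hypothetical.
df = pd.DataFrame({
    "total_rech_amt_6": [100, 500, 900, 50, 700],
    "total_rech_amt_7": [120, 450, 1000, 0, 650],
    "total_ic_mou_9": [10.0, 0.0, 35.2, 0.0, 0.0],
    "total_og_mou_9": [5.0, 0.0, 80.1, 0.0, 0.0],
    "vol_2g_mb_9": [0.0, 0.0, 12.0, 0.0, 0.0],
    "vol_3g_mb_9": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Step 2: keep customers whose average recharge in the good phase
# (months 6 and 7) is at or above the 70th percentile.
avg_rech = df[["total_rech_amt_6", "total_rech_amt_7"]].mean(axis=1)
threshold = avg_rech.quantile(0.70)
high_value = df[avg_rech >= threshold].copy()

# Step 3: churn = 1 when there were no calls and no mobile internet
# usage at all in the churn phase (month 9).
usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]
high_value["churn"] = (high_value[usage_cols] == 0).all(axis=1).astype(int)

# Discard every churn-phase attribute so it cannot leak into the model.
churn_phase_cols = [c for c in high_value.columns if c.endswith("_9")]
high_value = high_value.drop(columns=churn_phase_cols)

print(high_value)
```

Dropping the `_9` columns only after tagging matters: the label is derived from them, so leaving them in would let a model read the answer directly.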
hebbarvn
In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition. For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn. Understanding and Defining Churn There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services). In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn. However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again). Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America. This project is based on the Indian and Southeast Asian market.
Definitions of Churn There are various ways to define churn, such as: Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’. The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas. Usage-based churn: Customers who have not done any usage, either incoming or outgoing - in terms of calls, internet etc. over a period of time. A potential shortcoming of this definition is that when the customer has stopped using the services for a while, it may be too late to take any corrective actions to retain them. For e.g., if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless since by that time the customer would have already switched to another operator. In this project, you will use the usage-based definition to define churn. High-value Churn In the Indian and the southeast Asian market, approximately 80% of revenue comes from the top 20% customers (called high-value customers). Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage. In this project, you will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers. Understanding the Business Objective and the Data The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively. The business objective is to predict the churn in the last (i.e. 
the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful. Understanding Customer Behaviour During Churn Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of customer lifecycle : The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual. The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer/improving the service quality etc.) The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase. In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase. The data dictionary contains meanings of abbreviations. Some frequent ones are loc (local), IC (incoming), OG (outgoing), T2T (telecom operator to telecom operator), T2O (telecom operator to another operator), RECH (recharge) etc. The attributes containing 6, 7, 8, 9 as suffixes imply that those correspond to the months 6, 7, 8, 9 respectively.
Data Preparation The following data preparation steps are crucial for this problem: 1. Derive new features This is one of the most important parts of data preparation since good features are often the differentiators between good and bad models. Use your business understanding to derive features you think could be important indicators of churn. 2. Filter high-value customers As mentioned above, you need to predict churn only for the high-value customers. Define high-value customers as follows: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase). After filtering the high-value customers, you should get about 29.9k rows. 3. Tag churners and remove attributes of the churn phase Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are: total_ic_mou_9 total_og_mou_9 vol_2g_mb_9 vol_3g_mb_9 After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘ _9’, etc. in their names). Modelling Build models to predict churn. The predictive model that you’re going to build will serve two purposes: It will be used to predict whether a high-value customer will churn or not, in near future (i.e. churn phase). By knowing this, the company can take action steps such as providing special plans, discounts on recharge etc. It will be used to identify important variables that are strong predictors of churn. These variables may also indicate why customers choose to switch to other networks. In some cases, both of the above-stated goals can be achieved by a single machine learning model. 
But here, you have a large number of attributes, and thus you should try using a dimensionality reduction technique such as PCA and then build a predictive model. After PCA, you can use any classification model. Also, since the rate of churn is typically low (about 5-10%, this is called class-imbalance) - try using techniques to handle class imbalance. You can take the following suggested steps to build the model: Preprocess data (convert columns to appropriate formats, handle missing values, etc.) Conduct appropriate exploratory analysis to extract useful insights (whether directly useful for business or for eventual modelling/feature engineering). Derive new features. Reduce the number of variables using PCA. Train a variety of models, tune model hyperparameters, etc. (handle class imbalance using appropriate techniques). Evaluate the models using appropriate evaluation metrics. Note that it is more important to identify churners than the non-churners accurately - choose an appropriate evaluation metric which reflects this business goal. Finally, choose a model based on some evaluation metric. The above model will only be able to achieve one of the two goals - to predict customers who will churn. You can’t use the above model to identify the important features for churn. That’s because PCA usually creates components which are not easy to interpret. Therefore, build another model with the main objective of identifying important predictor attributes which help the business understand indicators of churn. A good choice to identify important variables is a logistic regression model or a model from the tree family. In case of logistic regression, make sure to handle multi-collinearity. After identifying important predictors, display them visually - you can use plots, summary tables etc. - whatever you think best conveys the importance of features. Finally, recommend strategies to manage customer churn based on your observations.
Note: Everything has to be submitted in one Jupyter notebook. The evaluation rubrics are mentioned on the next page.
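The point above about choosing a metric that favours identifying churners can be illustrated with a small sketch. The labels and predictions are hypothetical, and recall is used here only as one plausible choice of metric; the assignment text does not prescribe a specific one.

```python
def churn_metrics(y_true, y_pred):
    """Return accuracy, recall and precision for binary churn labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision}

# Hypothetical labels: 2 churners among 10 customers.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that never predicts churn looks accurate but finds no churners.
never_churn = churn_metrics(y_true, [0] * 10)

# A model that catches one churner at the cost of one false alarm.
some_churn = churn_metrics(y_true, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

print(never_churn)
print(some_churn)
```

Both models reach the same accuracy on this sample, yet only the second identifies any churners, which is why accuracy alone is a poor guide under class imbalance.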
Harikrishnaa3131
No description available
Vidit3859
Task 3 – Exploratory Data Analysis (EDA)
hari-kalyan-2
Task-3 Exploratory Data Analysis - Retail
No description available
Prathyusha-L
No description available
ritikumar2905
No description available
chandan5569
● Perform ‘Exploratory Data Analysis’ on dataset ‘SampleSuperstore’ ● As a business manager, try to find out the weak areas where you can work to make more profit. ● What all business problems you can derive by exploring the data?
khushiisharmaa
No description available
This is a project I performed during my internship at The Sparks Foundation
We will try to find the weak areas where we can work to make more profit.
Akshaypareek01
No description available
mallikarjunyadav27
As a business manager, try to find the weak areas where you can work to make more profit. The approach is to ask: what business problems can you derive by exploring the data?
Data Science And Business Analytics Internship GRIP The Spark Foundation GRIPNOV20
No description available
Allaboutanshul
CodeAlpha Data Analytics Tasks 📋 Overview This repository showcases the completion of my Data Analytics internship tasks at CodeAlpha. It demonstrates a complete data pipeline from raw web extraction to advanced sentiment insights. Task 1: Web Scraping, Task 2: Exploratory Data Analysis (EDA), Task 3: Data Visualization, and Task 4: Sentiment Analysis
ShripadJagtap
Disney Studio Income Analytics Task 3 CollegeRanker Exploratory data analysis and regression models to predict the box office revenue of Disney movies produced since the debut film Snow White and the Seven Dwarfs in 1937.
Author : Sneha M Data Science & Business Analytics Internship GRIP - The Spark Foundation TASK 3 - Perform ‘Exploratory Data Analysis’ on dataset Objective : 1. As a business manager, try to find out the weak areas where you can work to make more profit. 2. What all business problems you can derive by exploring the data?
MohdShadab999
DESCRIPTION IBM is an American MNC operating in around 170 countries with major business vertical as computing, software, and hardware. Attrition is a major risk to service-providing organizations where trained and experienced people are the assets of the company. The organization would like to identify the factors which influence the attrition of employees. Data Dictionary Age: Age of employee Attrition: Employee attrition status Department: Department of work DistanceFromHome Education: 1-Below College; 2- College; 3-Bachelor; 4-Master; 5-Doctor; EducationField EnvironmentSatisfaction: 1-Low; 2-Medium; 3-High; 4-Very High; JobSatisfaction: 1-Low; 2-Medium; 3-High; 4-Very High; MaritalStatus MonthlyIncome NumCompaniesWorked: Number of companies worked prior to IBM WorkLifeBalance: 1-Bad; 2-Good; 3-Better; 4-Best; YearsAtCompany: Current years of service in IBM Analysis Task: - Import attrition dataset and import libraries such as pandas, matplotlib.pyplot, numpy, and seaborn. - Exploratory data analysis Find the age distribution of employees in IBM Explore attrition by age Explore data for Left employees Find out the distribution of employees by the education field Give a bar chart for the number of married and unmarried employees - Build up a logistic regression model to predict which employees are likely to attrite.
Mickwen
Introduction Fine particulate matter (PM2.5) is an ambient air pollutant for which there is strong evidence that it is harmful to human health. In the United States, the Environmental Protection Agency (EPA) is tasked with setting national ambient air quality standards for fine PM and for tracking the emissions of this pollutant into the atmosphere. Approximately every 3 years, the EPA releases its database on emissions of PM2.5. This database is known as the National Emissions Inventory (NEI). You can read more information about the NEI at the EPA National Emissions Inventory web site. For each year and for each type of PM source, the NEI records how many tons of PM2.5 were emitted from that source over the course of the entire year. The data that you will use for this assignment are for 1999, 2002, 2005, and 2008. Data The data for this assignment are available from the course web site as a single zip file: Data for Peer Assessment [29Mb] The zip file contains two files: PM2.5 Emissions Data (summarySCC_PM25.rds): This file contains a data frame with all of the PM2.5 emissions data for 1999, 2002, 2005, and 2008. For each year, the table contains number of tons of PM2.5 emitted from a specific type of source for the entire year. Here are the first few rows. ## fips SCC Pollutant Emissions type year ## 4 09001 10100401 PM25-PRI 15.714 POINT 1999 ## 8 09001 10100404 PM25-PRI 234.178 POINT 1999 ## 12 09001 10100501 PM25-PRI 0.128 POINT 1999 ## 16 09001 10200401 PM25-PRI 2.036 POINT 1999 ## 20 09001 10200504 PM25-PRI 0.388 POINT 1999 ## 24 09001 10200602 PM25-PRI 1.490 POINT 1999 fips: A five-digit number (represented as a string) indicating the U.S.
county SCC: The name of the source as indicated by a digit string (see source code classification table) Pollutant: A string indicating the pollutant Emissions: Amount of PM2.5 emitted, in tons type: The type of source (point, non-point, on-road, or non-road) year: The year of emissions recorded Source Classification Code Table (Source_Classification_Code.rds): This table provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source. The sources are categorized in a few different ways from more general to more specific and you may choose to explore whatever categories you think are most useful. For example, source “10100101” is known as “Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal”. You can read each of the two files using the readRDS() function in R. For example, reading in each file can be done with the following code: ## This first line will likely take a few seconds. Be patient! NEI <- readRDS("summarySCC_PM25.rds") SCC <- readRDS("Source_Classification_Code.rds") as long as each of those files is in your current working directory (check by calling dir() and see if those files are in the listing). Assignment The overall goal of this assignment is to explore the National Emissions Inventory database and see what it says about fine particulate matter pollution in the United States over the 10-year period 1999–2008. You may use any R package you want to support your analysis. Questions You must address the following questions and tasks in your exploratory analysis. For each question/task you will need to make a single plot. Unless specified, you can use any plotting system in R to make your plot. Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the total PM2.5 emission from all sources for each of the years 1999, 2002, 2005, and 2008.
Have total emissions from PM2.5 decreased in Baltimore City, Maryland (fips == "24510") from 1999 to 2008? Use the base plotting system to make a plot answering this question.
Of the four types of sources indicated by the type (point, nonpoint, onroad, nonroad) variable, which of these four sources have seen decreases in emissions from 1999–2008 for Baltimore City? Which have seen increases in emissions from 1999–2008? Use the ggplot2 plotting system to make a plot answering this question.
Across the United States, how have emissions from coal combustion-related sources changed from 1999–2008?
How have emissions from motor vehicle sources changed from 1999–2008 in Baltimore City?
Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California (fips == "06037"). Which city has seen greater changes over time in motor vehicle emissions?

Making and Submitting Plots

For each plot you should:
Construct the plot and save it to a PNG file.
Create a separate R code file (plot1.R, plot2.R, etc.) that constructs the corresponding plot, i.e. code in plot1.R constructs the plot1.png plot. Your code file should include code for reading the data so that the plot can be fully reproduced. You must also include the code that creates the PNG file. Only include the code for a single plot (i.e. plot1.R should only include code for producing plot1.png).
Upload the PNG file on the Assignment submission page.
Copy and paste the R code from the corresponding R file into the text box at the appropriate point in the peer assessment.
sidf3ar
Business Problem Overview

In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average annual churn rate of 15-25%. Given that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition. For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

Understanding and Defining Churn

There are two main models of payment in the telecom industry - postpaid (customers pay a monthly/annual bill after using the services) and prepaid (customers pay/recharge with a certain amount in advance and then use the services). In the postpaid model, when customers want to switch to another operator, they usually inform the existing operator to terminate the services, and you directly know that this is an instance of churn. However, in the prepaid model, customers who want to switch to another network can simply stop using the services without any notice, and it is hard to know whether someone has actually churned or is simply not using the services temporarily (e.g. someone may be on a trip abroad for a month or two and then intend to resume using the services again). Thus, churn prediction is usually more critical (and non-trivial) for prepaid customers, and the term ‘churn’ should be defined carefully. Also, prepaid is the most common model in India and Southeast Asia, while postpaid is more common in Europe and North America.
This project is based on the Indian and Southeast Asian market.

Definitions of Churn

There are various ways to define churn, such as:

Revenue-based churn: Customers who have not utilised any revenue-generating facilities such as mobile internet, outgoing calls, SMS etc. over a given period of time. One could also use aggregate metrics such as ‘customers who have generated less than INR 4 per month in total/average/median revenue’. The main shortcoming of this definition is that there are customers who only receive calls/SMSes from their wage-earning counterparts, i.e. they don’t generate revenue but use the services. For example, many users in rural areas only receive calls from their wage-earning siblings in urban areas.

Usage-based churn: Customers who have had no usage, either incoming or outgoing, in terms of calls, internet etc. over a period of time. A potential shortcoming of this definition is that by the time the customer has stopped using the services for a while, it may be too late to take any corrective actions to retain them. For example, if you define churn based on a ‘two-months zero usage’ period, predicting churn could be useless since by that time the customer would have already switched to another operator.

In this project, you will use the usage-based definition to define churn.

High-value Churn

In the Indian and the Southeast Asian market, approximately 80% of revenue comes from the top 20% of customers (called high-value customers). Thus, if we can reduce churn of the high-value customers, we will be able to reduce significant revenue leakage. In this project, you will define high-value customers based on a certain metric (mentioned later below) and predict churn only on high-value customers.

Understanding the Business Objective and the Data

The dataset contains customer-level information for a span of four consecutive months - June, July, August and September. The months are encoded as 6, 7, 8 and 9, respectively.
The business objective is to predict the churn in the last (i.e. the ninth) month using the data (features) from the first three months. To do this task well, understanding the typical customer behaviour during churn will be helpful.

Understanding Customer Behaviour During Churn

Customers usually do not decide to switch to another competitor instantly, but rather over a period of time (this is especially applicable to high-value customers). In churn prediction, we assume that there are three phases of the customer lifecycle:

The ‘good’ phase: In this phase, the customer is happy with the service and behaves as usual.

The ‘action’ phase: The customer experience starts to sour in this phase, e.g. he/she gets a compelling offer from a competitor, faces unjust charges, becomes unhappy with service quality etc. In this phase, the customer usually shows different behaviour than in the ‘good’ months. Also, it is crucial to identify high-churn-risk customers in this phase, since some corrective actions can be taken at this point (such as matching the competitor’s offer, improving the service quality etc.).

The ‘churn’ phase: In this phase, the customer is said to have churned. You define churn based on this phase. Also, it is important to note that at the time of prediction (i.e. the action months), this data is not available to you for prediction. Thus, after tagging churn as 1/0 based on this phase, you discard all data corresponding to this phase.

In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month is the ‘churn’ phase.

Data Dictionary

The dataset can be downloaded using this link. The data dictionary is provided for download below.

Data Dictionary - Telecom Churn Download

The data dictionary contains meanings of abbreviations.
Some frequent ones are loc (local), IC (incoming), OG (outgoing), T2T (telecom operator to telecom operator), T2O (telecom operator to another operator), RECH (recharge) etc. The attributes containing 6, 7, 8, 9 as suffixes correspond to the months 6, 7, 8, 9 respectively.

Data Preparation

The following data preparation steps are crucial for this problem:

1. Derive new features
This is one of the most important parts of data preparation since good features are often the differentiators between good and bad models. Use your business understanding to derive features you think could be important indicators of churn.

2. Filter high-value customers
As mentioned above, you need to predict churn only for the high-value customers. Define high-value customers as follows: Those who have recharged with an amount more than or equal to X, where X is the 70th percentile of the average recharge amount in the first two months (the good phase). After filtering the high-value customers, you should get about 29.9k rows.

3. Tag churners and remove attributes of the churn phase
Now tag the churned customers (churn=1, else 0) based on the fourth month as follows: Those who have not made any calls (either incoming or outgoing) AND have not used mobile internet even once in the churn phase. The attributes you need to use to tag churners are:
total_ic_mou_9
total_og_mou_9
vol_2g_mb_9
vol_3g_mb_9
After tagging churners, remove all the attributes corresponding to the churn phase (all attributes having ‘_9’ etc. in their names).

Modelling

Build models to predict churn. The predictive model that you’re going to build will serve two purposes:

It will be used to predict whether a high-value customer will churn or not in the near future (i.e. the churn phase). By knowing this, the company can take action steps such as providing special plans, discounts on recharge etc.

It will be used to identify important variables that are strong predictors of churn.
These variables may also indicate why customers choose to switch to other networks.

In some cases, both of the above-stated goals can be achieved by a single machine learning model. But here, you have a large number of attributes, and thus you should try using a dimensionality reduction technique such as PCA and then build a predictive model. After PCA, you can use any classification model. Also, since the rate of churn is typically low (about 5-10%; this is called class imbalance), try using techniques to handle class imbalance.

You can take the following suggested steps to build the model:
Preprocess data (convert columns to appropriate formats, handle missing values, etc.).
Conduct appropriate exploratory analysis to extract useful insights (whether directly useful for business or for eventual modelling/feature engineering).
Derive new features.
Reduce the number of variables using PCA.
Train a variety of models, tune model hyperparameters, etc. (handle class imbalance using appropriate techniques).
Evaluate the models using appropriate evaluation metrics. Note that it is more important to identify churners than the non-churners accurately - choose an appropriate evaluation metric which reflects this business goal.
Finally, choose a model based on some evaluation metric.

The above model will only be able to achieve one of the two goals - to predict customers who will churn. You can’t use the above model to identify the important features for churn. That’s because PCA usually creates components which are not easy to interpret. Therefore, build another model with the main objective of identifying important predictor attributes which help the business understand indicators of churn. A good choice to identify important variables is a logistic regression model or a model from the tree family. In case of logistic regression, make sure to handle multi-collinearity.

After identifying important predictors, display them visually - you can use plots, summary tables etc.
- whatever you think best conveys the importance of features. Finally, recommend strategies to manage customer churn based on your observations.
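
The filtering and tagging rules from the Data Preparation steps (steps 2 and 3) can be sketched in a few lines. This is only an illustration under stated assumptions, not the project's reference solution: the recharge columns total_rech_amt_6 and total_rech_amt_7 are assumed names that the text above does not give, the 70th percentile is computed with linear interpolation (the text does not say which percentile method to use), and rows are plain dictionaries so that the sketch needs nothing beyond the standard library. In practice a data-frame library would do the same job more conveniently.

```python
from typing import Dict, List

# Churn-phase usage attributes named in the assignment text.
CHURN_COLS = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# Assumed (hypothetical) names of the good-phase recharge amount columns.
GOOD_PHASE_RECHARGE = ["total_rech_amt_6", "total_rech_amt_7"]


def percentile(values: List[float], q: float) -> float:
    """Return the q-th percentile (0-100), interpolating linearly between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def prepare(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Keep high-value customers, tag churn, and drop churn-phase attributes."""
    # Average recharge amount over the two good-phase months.
    avg = [sum(r[c] for c in GOOD_PHASE_RECHARGE) / len(GOOD_PHASE_RECHARGE)
           for r in rows]
    cutoff = percentile(avg, 70)

    prepared = []
    for row, amount in zip(rows, avg):
        if amount < cutoff:
            continue  # not a high-value customer
        # Churn = no incoming/outgoing calls AND no 2G/3G data in month 9.
        churn = int(all(row[c] == 0 for c in CHURN_COLS))
        # Drop every month-9 attribute so the label cannot leak into features.
        # Only the '_9' suffix is handled here; any other churn-phase
        # attributes would have to be removed separately.
        kept = {k: v for k, v in row.items() if not k.endswith("_9")}
        kept["churn"] = churn
        prepared.append(kept)
    return prepared
```

Calling prepare() on the full list of customer rows returns only the high-value customers, each with a churn flag and without any month-9 attribute. Tagging happens before the month-9 columns are dropped, which mirrors the order described in step 3.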