Found 25 repositories (showing 25)
sayantann11
Classification - Machine Learning

This is the 'Classification' tutorial, part of the Machine Learning course offered by Simplilearn. In this tutorial we cover classification algorithms, the types of classification algorithms, Support Vector Machines (SVM), Naive Bayes, Decision Trees, and the Random Forest classifier.

Objectives
Let us look at some of the objectives covered under this section of the Machine Learning tutorial.
Define Classification and list its algorithms
Describe Logistic Regression and Sigmoid Probability
Explain K-Nearest Neighbors and KNN classification
Understand Support Vector Machines, Polynomial Kernel, and Kernel Trick
Analyze Kernel Support Vector Machines with an example
Implement the Naïve Bayes Classifier
Demonstrate Decision Tree Classifier
Describe Random Forest Classifier

Classification: Meaning
Classification is a type of supervised learning. It assigns data elements to one of a finite set of discrete classes, and is best used when the output variable is categorical: it predicts a class for each input. There are two types of classification: binomial and multi-class.

Classification: Use Cases
Some of the key areas where classification is used:
To find whether a received email is spam or ham
To identify customer segments
To decide whether a bank loan should be granted
To predict whether a student will pass or fail an examination

Classification: Example
Social media sentiment analysis has two potential outcomes, positive or negative, as displayed in the chart below.
https://www.simplilearn.com/ice9/free_resources_article_thumb/classification-example-machine-learning.JPG
This chart shows the classification of the Iris flower dataset into its three sub-species, indicated by codes 0, 1, and 2.
https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-flower-dataset-graph.JPG
The test-set dots represent the assignment of new test data points to one class or the other, based on the trained classifier model.
Types of Classification Algorithms
Let's have a quick look at the types of classification algorithms.
Linear models: Logistic Regression, Support Vector Machines
Nonlinear models: K-nearest Neighbors (KNN), Kernel Support Vector Machines (SVM), Naïve Bayes, Decision Tree Classification, Random Forest Classification

Logistic Regression: Meaning
Logistic Regression is a regression model that is used for classification. It is widely used for binary classification problems and can also be extended to multi-class problems. Here, the dependent variable is categorical: y ∈ {0, 1}. A binary dependent variable can take only two values, like 0 or 1, win or lose, pass or fail, healthy or sick.
In this case, you model the probability that output y is 1 or 0. This is called the sigmoid probability (σ): if σ(θᵀx) > 0.5, set y = 1, else set y = 0.
Unlike Linear Regression (and its Normal Equation solution), there is no closed-form solution for the optimal weights of Logistic Regression. Instead, you solve it with maximum likelihood estimation, which finds the weights that maximize the probability of the observed data. It can be used to calculate the probability of a given outcome in a binary model, like the probability of being classified as sick or of passing an exam.
https://www.simplilearn.com/ice9/free_resources_article_thumb/logistic-regression-example-graph.JPG

Sigmoid Probability
The probability in logistic regression is represented by the sigmoid function (also called the logistic function or the S-curve):
https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-function-machine-learning.JPG
In this equation, t represents the input value (for example, the number of hours studied) and S(t) represents the probability of passing the exam.
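As a quick illustration (a minimal NumPy sketch, not part of the original tutorial), the sigmoid maps any real input into (0, 1), with 0.5 exactly at the decision boundary:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Decision rule: predict y = 1 when sigmoid(theta^T x) > 0.5,
# which happens exactly when theta^T x > 0.
print(sigmoid(0.0))   # 0.5 -- the decision boundary
print(sigmoid(4.0))   # ~0.98, confidently class 1
print(sigmoid(-4.0))  # ~0.02, confidently class 0
```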
Assume the sigmoid function:
https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-probability-machine-learning.JPG
g(z) tends toward 1 as z → +∞, and g(z) tends toward 0 as z → −∞.

K-nearest Neighbors (KNN)
The K-nearest Neighbors algorithm assigns a data point to a class based on a similarity measurement. It is a supervised method for classification. The steps of the KNN algorithm are:
https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-distribution-graph-machine-learning.JPG
Choose the number k and a distance metric (k = 5 is common).
Find the k nearest neighbors of the sample that you want to classify.
Assign the class label by majority vote.

KNN Classification
A new input point is assigned to the category to which most of its k nearest neighbors belong. For example:
https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-classification-machine-learning.JPG
Classify a patient as high risk or low risk.
Mark an email as spam or ham.

Support Vector Machine (SVM)
SVMs are classification algorithms used to assign data to various classes. They work by detecting hyperplanes that segregate the data into classes. SVMs are very versatile and are capable of linear or nonlinear classification, regression, and outlier detection. Once the ideal hyperplane is discovered, new data points can be easily classified.
https://www.simplilearn.com/ice9/free_resources_article_thumb/support-vector-machines-graph-machine-learning.JPG
The optimization objective is to find the "maximum margin hyperplane": the hyperplane farthest from the closest points of the two classes (these closest points are called support vectors). In the figure, the middle line represents this hyperplane.
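The three KNN steps listed above can be sketched with scikit-learn (a minimal example on the Iris data, not the tutorial's own code):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5 neighbors with the default Euclidean distance metric;
# each test point is labeled by majority vote of its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
```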
SVM Example
Hyperplanes with larger margins have lower generalization error. The positive and negative hyperplanes are represented by:
https://www.simplilearn.com/ice9/free_resources_article_thumb/positive-negative-hyperplanes-machine-learning.JPG
Classification of any new input sample x_test:
If w₀ + wᵀx_test > 1, the sample is in the class on the side of the positive hyperplane.
If w₀ + wᵀx_test < −1, the sample is in the class on the side of the negative hyperplane.
Subtracting the two hyperplane equations gives:
https://www.simplilearn.com/ice9/free_resources_article_thumb/equation-subtraction-machine-learning.JPG
The length of vector w (its L2 norm) is:
https://www.simplilearn.com/ice9/free_resources_article_thumb/length-of-vector-machine-learning.JPG
Normalizing by the length of w, you arrive at:
https://www.simplilearn.com/ice9/free_resources_article_thumb/normalize-equation-machine-learning.JPG

SVM: Hard Margin Classification
The left side of equation SVM-1 above can be interpreted as the distance between the positive and negative hyperplanes; in other words, it is the margin, which is what we want to maximize. Hence the objective is to maximize the margin subject to the constraint that all samples are classified correctly, which is represented as:
https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-machine-learning.JPG
This means that you are minimizing ‖w‖, with all positive samples on one side of the positive hyperplane and all negative samples on the other side of the negative hyperplane. This can be written concisely as:
https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-formula.JPG
Minimizing ‖w‖ is the same as minimizing ½‖w‖².
This form is preferable because it is differentiable even at w = 0. The approach above is called the "hard margin linear SVM classifier."

SVM: Soft Margin Classification
To relax the linear constraints for nonlinearly separable data, a slack variable ζ(i) is introduced; it measures how much the ith instance is allowed to violate the margin. The slack variable is simply added to the linear constraints:
https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-machine-learning.JPG
Subject to these constraints, the new objective to be minimized becomes:
https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-formula.JPG
You now have two conflicting objectives: keeping the slack variables small to reduce margin violations, and keeping ½‖w‖² small to increase the margin. The hyperparameter C defines this trade-off. Large values of C correspond to larger error penalties (and smaller margins), whereas smaller values of C tolerate more misclassification errors and give larger margins.

SVM: Regularization
C acts as the inverse of regularization strength: higher C means less regularization, which lowers bias and raises variance (risking overfitting).
https://www.simplilearn.com/ice9/free_resources_article_thumb/concept-of-c-graph-machine-learning.JPG

IRIS Data Set
The Iris dataset contains measurements of 150 iris flowers from three species:
Setosa
Versicolor
Virginica
Each row represents one sample, and the flower measurements in centimeters are stored as columns. These are called features.
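The C trade-off described above can be illustrated on the Iris data (a sketch with assumed C values, not the tutorial's own code): a small C tolerates margin violations, so more points end up inside the margin and become support vectors.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y != 0], y[y != 0]     # versicolor vs virginica: the overlapping pair

# Large C: heavy penalty on margin violations -> narrow margin, few
# support vectors. Small C: violations tolerated -> wide margin, many
# support vectors (stronger regularization).
strict = SVC(kernel="linear", C=100.0).fit(X, y)
soft = SVC(kernel="linear", C=0.01).fit(X, y)
print(len(strict.support_), len(soft.support_))
```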
IRIS Data Set: SVM
Let's train an SVM model using scikit-learn for the Iris dataset:
https://www.simplilearn.com/ice9/free_resources_article_thumb/svm-model-graph-machine-learning.JPG

Nonlinear SVM Classification
There are two ways to handle nonlinear data with SVMs:
by adding polynomial features
by adding similarity features
Adding polynomial features to a dataset can, in some cases, make it linearly separable.
https://www.simplilearn.com/ice9/free_resources_article_thumb/nonlinear-classification-svm-machine-learning.JPG
In the figure on the left, there is only one feature, x1, and the dataset is not linearly separable. If you add a second feature x2 = (x1)² (figure on the right), the data becomes linearly separable.

Polynomial Kernel
In scikit-learn, you can use a Pipeline for creating polynomial features. Classification results for the Moons dataset are shown in the figure.
https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-machine-learning.JPG

Polynomial Kernel with Kernel Trick
https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-with-kernel-trick.JPG
For high-dimensional datasets, adding many polynomial features can slow the model down. Instead, you can apply the kernel trick, which gives the effect of polynomial features without actually adding them. The code shown below (using the SVC class) trains an SVM classifier with a 3rd-degree polynomial kernel via the kernel trick.
https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-equation-machine-learning.JPG
The hyperparameter coef0 controls the influence of the high-degree polynomial terms.

Kernel SVM
Kernel SVMs are used for classification of nonlinear data. In the chart, nonlinear data is projected into a higher-dimensional space via a mapping function, where it becomes linearly separable.
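A sketch of the polynomial-kernel classifier just described, on the Moons dataset (the hyperparameter values here are assumptions for illustration):

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Kernel trick: a 3rd-degree polynomial kernel acts like polynomial
# features without materializing them; coef0 sets how much the
# high-degree terms influence the model.
poly_svm = make_pipeline(StandardScaler(),
                         SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_svm.fit(X, y)
acc = poly_svm.score(X, y)
```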
https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-machine-learning.JPG
In the higher dimension, a linear separating hyperplane can be derived and used for classification. A reverse projection back to the original feature space takes it back to its nonlinear shape.
As mentioned previously, SVMs can be kernelized to solve nonlinear classification problems. You can create a sample dataset for the XOR gate (a nonlinear problem) with NumPy: 100 samples are assigned the class label 1, and 100 samples the class label -1.
https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-graph-machine-learning.JPG
As you can see, this data is not linearly separable.
https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-non-separable.JPG
You can now use the kernel trick to classify the XOR dataset created earlier.
https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-xor-machine-learning.JPG

Naïve Bayes Classifier
Have you ever wondered how your mail provider implements spam filtering, how online news channels perform text classification, or how companies perform sentiment analysis of their audience on social media? All of this and more can be done with a machine learning algorithm called the Naive Bayes classifier.
Named after Thomas Bayes, the 18th-century statistician who first formulated the underlying theorem, the Naive Bayes classifier works on the principle of conditional probability as given by Bayes' theorem.

Advantages of the Naive Bayes Classifier
Listed below are six benefits of the Naive Bayes classifier.
Very simple and easy to implement
Needs less training data
Handles both continuous and discrete data
Highly scalable with the number of predictors and data points
Fast, so it can be used for real-time predictions
Not sensitive to irrelevant features

Bayes Theorem
According to Bayes' theorem, the conditional probability P(Y|X) can be calculated as:
P(Y|X) = P(X|Y)P(Y) / P(X)
Estimating P(X|Y) directly requires a very large number of probabilities. For example, for a Boolean Y and 30 Boolean attributes in the X vector, you would have to estimate more than 2 billion probabilities P(X|Y). To make this practical, the Naïve Bayes classifier assumes that the features of X are conditionally independent of each other, given Y. This reduces the number of probability estimates to 2 × 30 = 60 in the above example.

Naïve Bayes Classifier for SMS Spam Detection
Consider a labeled SMS database of 5,574 messages, with messages such as:
https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-machine-learning.JPG
Each message in the dataset is marked as spam or ham. Let's train a model with the Naïve Bayes algorithm to distinguish spam from ham. The message lengths and their frequencies in the training dataset are shown below:
https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-spam-detection.JPG
The logic used to train the spam detector:
Split each message into individual words/tokens (bag of words).
Lemmatize the data (each word is reduced to its base form, e.g. "walking" and "walked" become "walk").
Convert the data to vectors using the scikit-learn CountVectorizer.
Apply TF-IDF to down-weight common words like "is," "are," and "and."
Apply scikit-learn's MultinomialNB (multinomial Naïve Bayes) to get the spam detector.
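The bag-of-words → TF-IDF → MultinomialNB chain above can be sketched like this (a tiny made-up corpus stands in for the 5,574-message SMS dataset, and the lemmatization step is omitted for brevity):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in messages (the real tutorial uses the SMS corpus).
messages = [
    "win a free prize now", "free cash claim now", "urgent winner call now",
    "are we meeting for lunch", "see you at the office", "call me when home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag of words -> TF-IDF weighting -> multinomial Naive Bayes
spam_detector = make_pipeline(CountVectorizer(), TfidfTransformer(),
                              MultinomialNB())
spam_detector.fit(messages, labels)
print(spam_detector.predict(["free prize call now"])[0])  # spam
```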
This spam detector can then be used to classify a new message as spam or ham. Next, the accuracy of the spam detector is checked using the confusion matrix. For the SMS spam example above, the confusion matrix is shown on the right.
Accuracy Rate = Correct / Total = (4827 + 592) / 5574 = 97.21%
Error Rate = Wrong / Total = (155 + 0) / 5574 = 2.78%
https://www.simplilearn.com/ice9/free_resources_article_thumb/confusion-matrix-machine-learning.JPG
Although the confusion matrix is useful, more focused metrics are provided by precision and recall.
https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-recall-matrix-machine-learning.JPG
Precision is the accuracy of the positive predictions.
https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-formula-machine-learning.JPG
Recall is the ratio of positive instances that are correctly detected by the classifier (also known as the true positive rate, TPR).
https://www.simplilearn.com/ice9/free_resources_article_thumb/recall-formula-machine-learning.JPG

Precision/Recall Trade-off
To detect age-appropriate videos for kids, you need high precision (low recall is acceptable) to ensure that only safe videos make the cut, even though a few safe videos may be left out. High recall (with low precision acceptable) is needed for in-store surveillance to catch shoplifters: a few false alarms are acceptable, but all shoplifters must be caught.

Decision Tree Classifier
Decision Trees (DT) can be used both for classification and regression. Their advantage is that they require very little data preparation: no feature scaling or centering at all. They are also the fundamental components of Random Forests, one of the most powerful ML algorithms.
Unlike Random Forests and Neural Networks (which do black-box modeling), Decision Trees are white-box models, which means their inner workings can be clearly understood. In the case of classification, the data is segregated based on a series of questions, and any new data point is assigned to the leaf node it reaches.
https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-machine-learning.JPG
Start at the tree root and split the data on the feature that results in the largest information gain (IG). This splitting procedure is repeated at each child node until the leaves are pure, meaning that the samples at each node all belong to the same class. In practice, you can set a limit on the depth of the tree to prevent overfitting; purity is then compromised, since the final leaves may still contain some impurity. The figure shows the classification of the Iris dataset.
https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-graph.JPG

IRIS Decision Tree
Let's build a Decision Tree with scikit-learn for the Iris flower dataset and visualize it using the export_graphviz API.
https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-machine-learning.JPG
The output of export_graphviz can be converted into PNG format:
https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-output.JPG
The samples attribute is the number of training instances the node applies to. The value attribute gives the number of training instances of each class the node applies to. Gini measures the node's impurity: a node is "pure" (gini = 0) if all training instances it applies to belong to the same class.
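The tree construction and export step described above can be sketched as follows (a minimal version using only the two petal features, with the depth limited to 2):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X = iris.data[:, 2:]          # petal length and petal width, as in the figures
y = iris.target

tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# export_graphviz writes a .dot file; `dot -Tpng iris_tree.dot -o iris_tree.png`
# then converts it into a PNG image of the tree.
export_graphviz(tree, out_file="iris_tree.dot",
                feature_names=["petal length (cm)", "petal width (cm)"],
                class_names=iris.target_names, rounded=True, filled=True)
```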
https://www.simplilearn.com/ice9/free_resources_article_thumb/impurity-formula-machine-learning.JPG
For example, for the Versicolor node (green), the Gini impurity is 1 − (0/54)² − (49/54)² − (5/54)² ≈ 0.168.
https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-sample.JPG

Decision Boundaries
For the first node (depth 0), the solid line splits the data (Iris-Setosa on the left). Gini is 0 for the Setosa node, so no further split is possible. The second node (depth 1) splits the data into Versicolor and Virginica. If max_depth were set to 3, a third split would happen (vertical dotted line).
https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-boundaries.JPG
For a sample with petal length 5 cm and petal width 1.5 cm, the tree traverses to the depth-2 left node, so the probability predictions for this sample are 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54).

CART Training Algorithm
Scikit-learn uses the Classification and Regression Trees (CART) algorithm to train Decision Trees. The CART algorithm splits the data into two subsets using a single feature k and a threshold tk (for example, petal length ≤ 2.45 cm), recursively for each node. k and tk are chosen so that they produce the purest subsets (weighted by their size). The objective is to minimize the cost function given below:
https://www.simplilearn.com/ice9/free_resources_article_thumb/cart-training-algorithm-machine-learning.JPG
The algorithm stops when one of the following occurs:
max_depth is reached
no split that reduces impurity can be found
Other hyperparameters may be used to stop the tree: min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes.

Gini Impurity or Entropy
Entropy is another measure of impurity and can be used in place of Gini.
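Both impurity measures are easy to verify by hand. A short sketch that reproduces the numbers for the Versicolor node above (0 setosa, 49 versicolor, 5 virginica):

```python
from math import log2

def gini(counts):
    """Gini impurity of a node: 1 - sum(p_k^2) over the classes."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy of a node: -sum(p_k * log2(p_k)) over the non-empty classes."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(gini([0, 49, 5]), 3))     # 0.168, as computed above
print(round(entropy([0, 49, 5]), 3))  # 0.445
print(gini([54, 0, 0]))               # 0.0 -- a pure node
```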
https://www.simplilearn.com/ice9/free_resources_article_thumb/gini-impurity-entrophy.JPG
Entropy is a measure of uncertainty, and Information Gain is the reduction in entropy as one traverses down the tree. Entropy is zero for a node that contains instances of only one class. The entropy of the depth-2 left node in the example above is:
https://www.simplilearn.com/ice9/free_resources_article_thumb/entrophy-for-depth-2.JPG
Gini and entropy both lead to similar trees.

DT: Regularization
The following figure shows two decision trees trained on the Moons dataset.
https://www.simplilearn.com/ice9/free_resources_article_thumb/dt-regularization-machine-learning.JPG
The decision tree on the right is restricted by min_samples_leaf = 4. The model on the left is overfitting, while the model on the right generalizes better.

Random Forest Classifier
A random forest is an ensemble of decision trees (ensemble learning). The Random Forest algorithm:
1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set).
2. Grow a decision tree from the bootstrap sample. At each node, randomly select d features and split the node using the feature that provides the best split according to the objective function, for instance maximizing the information gain.
3. Repeat steps 1 and 2 k times (k is the number of trees to create, each from a subset of samples).
4. Aggregate the predictions of the trees for a new data point and assign the class label by majority vote (pick the class chosen by the largest number of trees).
Random Forests are opaque: it is difficult to visualize their inner workings.
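The bootstrap-and-vote procedure above is what scikit-learn's RandomForestClassifier implements; a minimal sketch on the Iris data (k = 100 trees is an assumed choice here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 100 trees, each grown on a bootstrap sample, with a random subset
# of features considered at every split; prediction is a majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
acc = forest.score(X_test, y_test)
```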
https://www.simplilearn.com/ice9/free_resources_article_thumb/random-forest-classifier-graph.JPG
However, the advantages outweigh this limitation, since you do not have to worry about hyperparameters except k, the number of decision trees to be created from subsets of the samples. Random Forests are quite robust to noise in the individual decision trees, so you need not prune them. The larger the number of decision trees, the more accurate the Random Forest prediction is (at a higher computational cost).

Key Takeaways
Let us quickly run through what we have learned in this Classification tutorial.
Classification algorithms are supervised learning methods that split data into classes. They can work on linear as well as nonlinear data.
Logistic Regression classifies data using weighted parameters and a sigmoid conversion to calculate the probability of each class.
K-nearest Neighbors (KNN) uses feature similarity to classify data.
Support Vector Machines (SVMs) classify data by finding the maximum margin hyperplane between the classes.
Naïve Bayes, a simplified Bayes model, classifies data using conditional probability.
Decision Trees are powerful classifiers that keep splitting until pure (or nearly pure) leaf-node classes are attained.
Random Forests apply ensemble learning to Decision Trees for more accurate classification predictions.

Conclusion
This completes the 'Classification' tutorial. In the next tutorial, we will learn 'Unsupervised Learning with Clustering.'
zeal-up
This repository includes some classical network architectures for video classification (action recognition). Because of the scale of Kinetics, most of the architectures in this repo have not been tested on Kinetics, but the training loss curves look normal during training.
rasensiotorres
We classify chest X-ray images as Covid, normal, or pneumonia using a CNN and transfer learning. Given the small dataset, we use k-fold cross-validation for training the model, reaching accuracies of ~97% on the test data. We also construct a confusion matrix and a P-R curve.
marcelprasetyo
This was done for a 2D (a term peculiar to my school, meaning: across subjects) project for my university course Systems & Control (PID, basically) at the Singapore University of Technology and Design. We were required to record and upload a video recording to YouTube. Basically it's a line-following robot with 4 IR sensors and motors (for wheels), where the wheel speed and direction are determined by a PD error function: subtracting the left sensor readings from the right sensor readings gives the error value, where 0 = white and 1000 = black for raw sensor values. This error is fed into a simple PD (no Integral) function to give a calibration number (negative or positive), which is subtracted from and added to the left and right wheels' base motor speeds respectively. An additional All White logic is added because the error function can't tell the difference between equal black values (black lines detected on both sides) and equal white values (white paper detected on both sides). This All White logic (activated when the sum of the sensor values falls below a certain threshold) saves the previous loop's left and right motor speeds and freezes the robot car in that state (e.g. keep turning right; keep pivoting left) until the sensors find enough black tape to surpass the white threshold, at which point PID behavior resumes. This All White logic is critical in helping the robot nail 90-degree turns, which often cause the sensors to read all white very quickly (and under normal conditions would make the robot shoot straight out or go haywire). The track is based on one of the Bowser's Castle tracks (unfortunately I couldn't find the original track again after searching the wiki), modified for our paper dimensions. The track is actually pretty challenging (I sketched it and laid out the black tape) and would be very difficult to do at a higher base car speed. The intersections are not difficult but are dangerous if the car doesn't stabilize by the time it reaches one.
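The PD-plus-All-White loop described above can be sketched in Python (the original is unpublished Arduino C, so every constant here is a hypothetical stand-in):

```python
# Hypothetical constants -- the real Arduino values were never published.
KP, KD = 0.8, 0.3            # proportional and derivative gains
BASE_SPEED = 150             # base motor speed
WHITE_THRESHOLD = 400        # "all white" when summed readings fall below this

prev_error = 0
frozen = (BASE_SPEED, BASE_SPEED)   # wheel speeds saved from the previous loop

def control_step(left_raw, right_raw):
    """One loop: raw readings (0 = white, 1000 = black) -> (left, right) speeds."""
    global prev_error, frozen
    if left_raw + right_raw < WHITE_THRESHOLD:
        return frozen               # All White logic: hold the last manoeuvre
    error = right_raw - left_raw    # right minus left, per the description
    calibration = KP * error + KD * (error - prev_error)
    prev_error = error
    # calibration is subtracted from the left wheel and added to the right
    frozen = (BASE_SPEED - calibration, BASE_SPEED + calibration)
    return frozen
```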
Some lines are intentionally put closer together for added challenge, while the Y-junction at Robo-Browser was taken from the original track; to my surprise it works well at giving the illusion of an autonomous choice (left or right) to the car, which I originally thought would only ever follow the left or right leg programmatically. The Y-junction also gives us problems because it can often cause our car to turn around through the other leg and go backwards *facepalm*. Bowser's Fire is unstable in general, but we got lucky this time, because usually if the oscillating motion sways the car to the right at the start of Bowser's Fire (notice the closest curve there) the car will take a shortcut to Boos Arena (I don't think this is allowed per the project rules, but heck yeah for actual racing games). You can see this shortcut happening in our fourth lap though (we got a cheater here..). Actually, there were only a few requirements for our custom-designed tracks: two 90-degree turns, 30 cm discontinuous lines, eight turns in total, and a track length of at least 8 meters (ours is about 10 m). No intersections or Y-junctions were required, nor the whole Mario Kart theme; we just wanted some challenge and some fun *wink* *wink*. To get an A grade for this project the car must run the track for at least 3 minutes and at least 3 laps. We didn't have to submit the Arduino code either, so technically it was possible to hardcode certain things. Disclaimer: Siti Nurbaya, my groupmate, already uploaded an earlier version that she edited, on top of which I inserted the place names and pictures to make it livelier. Of course, kudos to the Nintendo dudes for most of the creative assets. Please spare us any criticism about the project 🥺 (it's just a school project).
Raw footage recorded by my awesome groupmate Li Jiang Yan. Special thanks to Siti, who edited the earlier version; to Kelvin Ng Chao Yong for his Malaysian Arduino code and testing; and to Johannes Brian for the mechanical equipment and assembly of the robot car.
aiok03
Descriptive statistics and exploratory data analysis
To get an idea of the received data, we look through our tables, transactions and train. The shape of train is 6000 rows and 2 columns (client_id and target, the gender). We also inspected the info of transactions and noticed that there are no empty values: all counts equal 130,039. After that we merged the two tables and called the result data. Using the 'unique' function we found 173 unique codes and 61 unique types, and the 'describe' function shows the minimum and maximum code, type, and sum.
The first hypothesis was to find which gender makes more requests. For convenience we used a for loop to convert the values to percentages, and according to the barplot the biggest number of transactions is made by females.
The second hypothesis was to find the code with the biggest sum. We grouped by code and computed the mean of the sums, then converted this series to a dataframe for further work. Because the grouping turned the code into the index, we fixed that with the 'reset_index' function. We then plotted the graph and saw that the highest mean sum belongs to code 4722, and confirmed it with another piece of code below the graph.
The third hypothesis was to find the distribution of sums relative to gender. The first graph did not reveal this because the scatter of the data is too high: the variable is not normally distributed and is not symmetrical, so it is hard to assess. We therefore grouped by gender and computed the mean of the sum, which showed that males spend more money than females. The same process with the median gave the same conclusion, and since the mean and median are not equal, our assumption of a non-normal distribution was confirmed.
The last hypothesis was to find the number of clients for each type and code, i.e. the most popular request among clients. For correct visualization on the graph we applied 'str' to each parameter, counted the number of requests for each type and code, and plotted them. The most popular are type 1010 and code 6011. Lastly, for further work, we converted type and code back to int.

Feature engineering
Client's balance condition
We took every sum from the dataframe data, grouped by client, and computed the total for each of them, i.e. each client's income minus expenses. Clients with a negative total spent more than they received; the others received more than they spent. A negative balance is encoded as 0, a positive one as 1.

RFM
In the RFM section we started with Recency. For each client we grouped their records and found the latest date on which a transaction was made. The datetime column consisted of two values, date and time, which we split into separate columns for later feature engineering. The most recent day corresponds to day 457, and from this value we computed the recency of each client's last transaction by subtraction.
The next step is Frequency. We used the 'groupby' function and counted the number of appearances of each client in the database.
The last step is Monetary (counting expenses). Using groupby with the condition that the sum is less than 0 (expenses are negative values), we computed the total expenses of each client, and noticed that some clients didn't spend any money at all.

Segmentation based on RFM
We merged all the tables into one and ranked the clients on each dimension using percentiles. Using this formula we scored clients on a 5-point scale, and applying the elbow method to this data we plotted the graph, where 3 clusters were the optimal solution.
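The recency/frequency/monetary construction above can be sketched with pandas on a hypothetical toy transactions table (the column names are assumptions, not the project's actual ones):

```python
import pandas as pd

# Toy stand-in for the real dataset: client_id, day number, signed amount
# (negative amounts are expenses).
data = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 3],
    "day":       [450, 457, 300, 310, 455],
    "amount":    [-50.0, 120.0, -200.0, -30.0, 80.0],
})

most_recent = data["day"].max()                        # day 457 in the write-up
rfm = data.groupby("client_id").agg(
    recency=("day", lambda d: most_recent - d.max()),  # days since last transaction
    frequency=("client_id", "size"),                   # number of transactions
)
# Monetary: total of the negative (expense) amounts only
rfm["monetary"] = data[data["amount"] < 0].groupby("client_id")["amount"].sum()
rfm["monetary"] = rfm["monetary"].fillna(0)            # clients who spent nothing
```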
With the KMeans library we plotted the k-means clustering of clients by distance from randomly chosen centroids and showed the distribution of clients across the clusters. We then assembled a summary table of clusters, with a prefix for each of them. Clustering for codes. Now we work with the codes to create code clusters, using TF-IDF and k-means. We also employ lemmatization, tokenization, and stop-word removal. We import the pymorphy2 library for lemmatization, which reduces words to their dictionary form. Sentence tokenization is the process of dividing written text into its component sentences. We also need to delete stop words: a stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine is programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We do not want these words to take up space in our database or valuable processing time. We also make use of re, the regular-expression operations library. In this section we also use MorphAnalyzer(): morphological analysis identifies a word's features from how it is spelled, without using information about nearby words; pymorphy2 provides the MorphAnalyzer class for this. Applying clustering directly to the TF-IDF matrix would be problematic, as the matrix is very sparse and the computation of distances becomes unreliable. What we can do is reduce the data to a dense matrix of dimension 156 by applying SVD. Singular Value Decomposition (SVD) is one of the most widely used methods for dimensionality reduction, and we determined that 156 is the right number of components in our case. We used the silhouette score to evaluate the quality of the clusters created with K-Means.
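A minimal sketch of the TF-IDF → SVD → k-means pipeline with a silhouette check, assuming scikit-learn (the example documents and the 2 components here are stand-ins; the text uses 156 components on the real code descriptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical code descriptions standing in for the real ones.
docs = ["grocery store purchase", "grocery supermarket",
        "atm cash withdrawal", "cash machine withdrawal"] * 5

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix
# Reduce the sparse matrix to a dense low-dimensional one via SVD.
dense = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(dense)
score = silhouette_score(dense, labels)  # quality of the clustering
```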
Using the silhouette score we chose the number of clusters and performed k-means clustering on our TF-IDF matrix. To visualize the clusters we applied t-SNE, a tool for visualizing high-dimensional data, and then added the cluster labels to the data and df dataframes. Finally, we created a word cloud for each cluster. Clustering for types. Data cleaning for types. Firstly, we noticed that there are 155 types in the description table but only 61 types in data. After merging data with the descriptions, 58 types remain, which means 3 types have no description, so we replace them with the mode value. We also found 26 rows in data whose type description is 'н.д', meaning no data. We further noticed that some descriptions repeat across several types, so we dropped duplicates and kept the first occurrence of each type in data. Creating clusters for types. We manually divided the types into 5 categories based on key words in the description and merged them into our dataframe. We then noticed outliers in recency and frequency. Taking the 0.999 and 0.001 quantiles as the upper and lower boundaries, we treated everything above 0.999 or below 0.001 as an outlier and removed these points for both recency and frequency. Checking the dataframe with 'describe' confirmed that the distributions became reasonable. Supervised learning. The time for prediction came. We split our dataframe into train and test sets and used KNN, Decision Tree, Random Forest, and Logistic Regression for the predictions. For KNN we evaluated accuracy for 1 to 20 neighbours in steps of 2 on both train and test and plotted the results; the best result was 58% accuracy at 19 neighbours. The Decision Tree gave 54% on the test set and Random Forest's accuracy was 64%. We investigated feature importance for both of them and noticed that monetary had the most influence on predicting the data.
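The quantile-based outlier rule described above can be sketched as follows (the toy series is invented; the 0.001/0.999 boundaries are the ones from the text):

```python
import pandas as pd

# Hypothetical recency values with one extreme outlier.
s = pd.Series(list(range(1, 1000)) + [10_000])

# Everything below the 0.001 quantile or above the 0.999 quantile
# is treated as an outlier and removed.
lo, hi = s.quantile(0.001), s.quantile(0.999)
trimmed = s[(s >= lo) & (s <= hi)]
```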
For grid search we set the hyperparameters manually, with four-fold cross-validation. The best random forest estimator found by the grid search was selected; refitting the random forest with these parameters gave the same accuracy, so the best accuracy for the random forest was obtained with the default hyperparameters. We built the confusion matrix and calculated recall, precision, and F1 score. We also fitted a logistic regression, but its accuracy was too low, so we plotted the ROC-AUC and precision-recall curves for it. Conclusion. All the models showed that the available data was not sufficient, and in fact not well suited, for gender prediction. Steps to increase the accuracy were taken, such as adding more features and removing outliers. Based on this investigation, random forest was the best choice.
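A sketch of grid search with four-fold cross-validation for a random forest, assuming scikit-learn (the parameter grid and the synthetic data are invented; the text does not list the actual grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for the RFM features.
X, y = make_classification(n_samples=200, random_state=0)

# Manually chosen hyperparameter grid, evaluated with cv=4 folds.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=4,
)
grid.fit(X, y)
best = grid.best_estimator_  # refit on the full data with the best parameters
```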
jxs1996
Background: Budd-Chiari syndrome (BCS) is characterized by hepatic venous outflow obstruction and in severe cases is life-threatening. Over the past few decades, risk factors for BCS, including inherited and acquired hypercoagulable states and other predisposing factors, have been reported. However, a large number of patients have no identifiable etiological factor, and the causes of BCS differ between the West and the East. In China, segmental or membranous inferior vena cava obstruction is the main manifestation of BCS, and the prevalence of prothrombotic disorders appears to be relatively low. Methods: In this study, 500 BCS patients and 696 normal individuals were recruited for whole-exome sequencing, and we developed polygenic risk scoring (PRS) models based on the PLINK, LASSOSUM, BLUP, and BayesA methods. We further performed BCS risk prediction by intersecting the BCS PRS model with the venous thromboembolism and vascular malformation models. Results: BCS-related mutations, such as rs1042331, rs34370305, and rs73739662, were discovered by BCS genome-wide association studies. By comparing the different polygenic risk scoring algorithms, the model produced by the BayesA algorithm was determined to be optimal. Testing on additionally recruited experimental samples showed an area under the ROC curve above 0.9. Conclusion: Further interpretation of the model provides new insights into the difference in genetic risk for BCS between China and the West. In addition, we found that BCS, venous thromboembolism, and vascular malformations might share some common genetic risks, which might provide new insights into the pathogenesis of BCS.
evacamilla
Web app for users to fill in their answers to a questionnaire and compare their results to others on a normal distribution curve.
According to the problem statement of the coursework, we are provided with data for 5 users, each owning 10 smart home appliances. Each user is given 10 tasks to complete with these appliances, for which proper scheduling is necessary. We are also provided with 10,000 predictive guideline price curves as training data; each gives the price for every hour of the day together with a label indicating whether the curve is normal or abnormal. Using this data, a suitable model must be designed to classify price curves as normal or abnormal, which is done to effectively predict a pricing attack. A set of 100 pricing curves is provided as testing data, whose labels (normal or abnormal) must be determined using the model designed previously. For each price curve classified as abnormal, we must minimize the cost by developing a Linear Programming energy solution based on that abnormal predictive guideline price curve and plot the obtained scheduling results displaying the hourly energy usage of the 5 users. The minimized schedule obtained from our solution is the corresponding normal scheduling for each abnormal price curve.
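The cost-minimisation step can be sketched as a small linear program, e.g. with scipy.optimize.linprog (the prices, demand, and hourly limit below are invented; the real problem covers 24 hours, 5 users, and 10 tasks each):

```python
import numpy as np
from scipy.optimize import linprog

# Toy version of the energy-scheduling LP: minimise cost = price · energy,
# subject to meeting a total energy demand under an hourly power limit.
prices = np.array([0.10, 0.30, 0.20, 0.05])  # invented price per kWh for 4 hours
demand = 8.0                                  # total kWh one task must consume
max_per_hour = 3.0                            # invented hourly power limit

res = linprog(
    c=prices,                                    # minimise total cost
    A_eq=[np.ones_like(prices)], b_eq=[demand],  # energy used must equal demand
    bounds=[(0, max_per_hour)] * len(prices),    # per-hour limits
)
schedule = res.x  # the solver fills the cheapest hours first
```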
AvichalV
Red Wine Quality Prediction Problem Statement: The dataset is related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). This dataset can be viewed as a classification task. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Also, we are not sure that all input variables are relevant, so it could be interesting to test feature selection methods. Attribute Information Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10) An interesting thing to do is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. 7 or higher gets classified as 'good/1' and the remainder as 'not good/0'. This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value. You need to build a classification model. Inspiration Use machine learning to determine which physicochemical properties make a wine 'good'! Submission Details - Share the link of the repository as your submission. Download Files: https://github.com/dsrscientist/DSData/blob/master/winequality-red.csv
Deepak-Meher
A beginner-friendly data analytics portfolio featuring two Excel-based projects: Employee Income Distribution using a histogram, normal curve, Z-test and T-test; and a Cloud Kitchen Weekly Sales Dashboard built with Pivot Tables, charts, and KPIs to derive business insights.
johnstinson99
Python code to pull URLs from a Chrome logfile and simulate access times of online media with realistic distributions. Realistic distributions are created by summing data from a number of normal distribution curves. Logfiles are simulated for upload into a test database.
gujralsanyam22
Predicting the Quality of Red Wine using Machine Learning Algorithms for Regression Analysis, Data Visualizations and Data Analysis. Description Context The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones). This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality , I just shared it to kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down if requested.) Content For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10) Tips What might be an interesting thing to do, is aside from using regression modelling, is to set an arbitrary cutoff for your dependent variable (wine quality) at e.g. 7 or higher getting classified as 'good/1' and the remainder as 'not good/0'. This allows you to practice with hyper parameter tuning on e.g. decision tree algorithms looking at the ROC curve and the AUC value. Without doing any kind of feature engineering or overfitting you should be able to get an AUC of .88 (without even using random forest algorithm) KNIME is a great tool (GUI) that can be used for this. 
1 - File Reader (for csv) to linear correlation node and to interactive histogram for basic EDA.
2 - File Reader to 'Rule Engine Node' to turn the 10-point scale into a dichotomous variable (good wine and the rest); the rule to put in the Rule Engine is something like this: $quality$ > 6.5 => "good" TRUE => "bad"
3 - Rule Engine Node output to input of Column Filter Node to filter out your original 10-point feature (this prevents leakage).
4 - Column Filter Node output to input of Partitioning Node (your standard train/test split, e.g. 75%/25%; choose 'random' or 'stratified').
5 - Partitioning Node train data split output to input of Decision Tree Learner Node.
6 - Partitioning Node test data split output to input of Decision Tree Predictor Node.
7 - Decision Tree Learner Node output to input of Decision Tree Predictor Node.
8 - Decision Tree Predictor output to input of ROC Node (here you can evaluate your model based on the AUC value).
Inspiration Use machine learning to determine which physicochemical properties make a wine 'good'! Acknowledgements This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality , I just shared it to kaggle for convenience. (If I am mistaken and the public license type disallowed me from doing so, I will take this down at first request. I am not the owner of this dataset.) Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
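The same cutoff-and-evaluate workflow can be sketched in Python with scikit-learn (a synthetic stand-in dataframe replaces winequality-red.csv so the snippet is self-contained; only two of the eleven input columns are mocked, and the dependence of quality on alcohol is invented):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the real CSV; in practice you would use
# pd.read_csv("winequality-red.csv", sep=";").
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "alcohol": rng.normal(10, 1, n),
    "volatile acidity": rng.normal(0.5, 0.1, n),
})
# Give quality a loose dependence on alcohol so the tree has some signal.
df["quality"] = (3 + 0.4 * df["alcohol"] + rng.normal(0, 0.7, n)).round().clip(0, 10)

y = (df["quality"] > 6.5).astype(int)   # the cutoff from the text: 'good' vs 'bad'
X = df.drop(columns=["quality"])        # drop the 10-point scale (prevents leakage)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```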
CS1300-2024
Simulations used to give intuition into the reasoning behind why the normal and chi^2 curves are appropriate for A/B tests.
ahmed-sattar
This project performs a Shapiro-Wilk test to check whether a dataset follows a normal distribution. It also visualizes the data distribution with a histogram and overlays a normal distribution curve. The script uses Python libraries such as pandas, scipy, and matplotlib for data manipulation, statistical testing, and visualization.
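A minimal sketch of the Shapiro-Wilk step, assuming scipy (the sample here is generated, standing in for the real dataset):

```python
import numpy as np
from scipy import stats

# Generated sample standing in for the real data.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk test: small p-values reject the normality hypothesis.
stat, p = stats.shapiro(sample)
is_normal = p > 0.05  # fail to reject normality at the 5% level
```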
karencode
This is the code for a Shiny App that computes a p-value for a test statistic that follows the standard normal distribution. The app also shows the probability as an area under the z curve.
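The p-value computation behind such an app can be sketched in Python with scipy (the z value 1.96 is just an example input):

```python
from scipy.stats import norm

z = 1.96  # example test statistic
# Survival function sf(z) = 1 - cdf(z) gives the upper-tail area under
# the standard normal curve; doubling it gives the two-sided p-value.
p_upper = norm.sf(z)
p_two_sided = 2 * norm.sf(abs(z))
```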
jujuakd888
Built a reproducible R Markdown pipeline (tidyverse, ggplot2, pheatmap, survival) to analyse TCGA-style breast cancer data: tumour-vs-normal testing (Wilcoxon + FDR), heatmaps, volcano plots, and Kaplan–Meier curves; delivered cleaned CSV outputs and figures to a shared GitHub repo.
This project uses a Deep Learning Convolutional Neural Network (CNN) to classify chest X-ray images into three categories: Normal, Pneumonia Bacterial, and Pneumonia Viral. The model is trained with data augmentation and evaluated on a separate test set with visualization of accuracy and loss curves.
Dinesh-Narasimhan
Normal distribution is a symmetric, bell-shaped curve representing how data is distributed around the mean. In statistics, it’s used to model natural phenomena like heights, test scores, and measurement errors. It follows the empirical rule where ~68% of data lies within 1 standard deviation from the mean.
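The empirical rule mentioned above can be checked numerically (the mean-100 / sd-15 test-score parameters are an invented example):

```python
import numpy as np

# Simulate normally distributed scores, e.g. test scores with mean 100, sd 15.
rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=15, size=100_000)

# Empirical rule: ~68% of values lie within 1 standard deviation of the mean.
within_1sd = np.mean(np.abs(x - 100) <= 15)
```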
njafarov
Suppose you are an analyst for a company that makes sensors. Your company is working on a new Load Cell Sensor and wants to test its reliability with several different tests. Reliability specialists often describe the lifetime of a population of these types of products using a graphical representation called the bathtub curve. The bathtub curve consists of three periods: an infant mortality period with a decreasing failure rate followed by a normal life period (also known as "useful life") with a low, relatively constant failure rate and concluding with a wear-out period that exhibits an increasing failure rate.
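The three bathtub periods are commonly modelled with a Weibull hazard rate; a sketch (the shape parameters below are illustrative, not from the text):

```python
import numpy as np

# Weibull hazard h(t) = (k/lam) * (t/lam)**(k-1): shape k < 1 gives a
# decreasing failure rate (infant mortality), k = 1 a constant rate
# (useful life), and k > 1 an increasing rate (wear-out).
def weibull_hazard(t, k, lam=1.0):
    return (k / lam) * (t / lam) ** (k - 1)

t = np.array([0.5, 1.0, 2.0])
infant = weibull_hazard(t, k=0.5)    # decreasing failure rate
useful = weibull_hazard(t, k=1.0)    # constant failure rate
wearout = weibull_hazard(t, k=3.0)   # increasing failure rate
```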
Emil-Sila
Makes a Gauss fit for the input data parameter and a histogram based on the number of bins given in 'kwargs'. It saves 2 graphs that both include a normal distribution curve, histogram, and data. One of them contains the p-value of the chi-square test; the other contains the number of bins, size of bins, mean and standard deviation of the Gauss distribution, and the chi-square test results: chi-square statistic, p-value, and degrees of freedom.
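A rough Python analogue of such a fit (generated data stands in for the real input; scipy's norm.fit and chisquare are used, with ddof=2 accounting for the two fitted parameters):

```python
import numpy as np
from scipy import stats

# Generated sample standing in for the real input data.
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Fit a Gaussian (maximum likelihood: sample mean and std).
mu, sigma = stats.norm.fit(data)

# Histogram the data and compare observed vs expected bin counts.
counts, edges = np.histogram(data, bins=10)
expected = len(data) * np.diff(stats.norm.cdf(edges, mu, sigma))
expected *= counts.sum() / expected.sum()  # chisquare requires matching totals
chi2, p = stats.chisquare(counts, expected, ddof=2)  # 2 fitted parameters
```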
vtrevino
VALORATE is a procedure to accurately estimate the p-value of the difference between two survival curves using the log-rank test, especially in the case of largely unbalanced groups. Instead of using a normal or chi-square approximation, VALORATE estimates the null distribution by a weighted sum of conditional distributions over a co-occurrence parameter. VALORATE was designed for cancer genomics, where comparisons between survival groups are heavily unbalanced since the frequency of gene mutations is quite low. Nevertheless, VALORATE should work for standard log-rank tests.
krishnakish
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones). Content For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests): 1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol Output variable (based on sensory data): 12 - quality (score between 0 and 10) What might be an interesting thing to do, is aside from using regression modelling, is to set an arbitrary cutoff for your dependent variable (wine quality) at e.g. 7 or higher getting classified as 'good/1' and the remainder as 'not good/0'. This allows you to practice with hyper parameter tuning on e.g. decision tree algorithms looking at the ROC curve and the AUC value. Without doing any kind of feature engineering or overfitting you should be able to get an AUC of .88 (without even using random forest algorithm)
svasantd
library(pROC)          # install with install.packages("pROC")
library(randomForest)  # install with install.packages("randomForest")

#######################################
##
## Generate weight and obesity datasets.
##
#######################################
set.seed(420)  # this will make my results match yours
num.samples <- 100

## generate 100 values from a normal distribution with
## mean 172 and standard deviation 29, then sort them
weight <- sort(rnorm(n=num.samples, mean=172, sd=29))

## Now we will decide if a sample is obese or not.
## NOTE: This method for classifying a sample as obese or not
## was made up just for this example.
## rank(weight) returns 1 for the lightest, 2 for the second lightest, ...
## ... and it returns 100 for the heaviest.
## So what we do is generate a random number between 0 and 1. Then we see if
## that number is less than rank/100. So, for the lightest sample, rank = 1.
## This sample will be classified "obese" if we get a random number less than
## 1/100. For the second lightest sample, rank = 2, we get another random
## number between 0 and 1 and classify this sample "obese" if that random
## number is < 2/100. We repeat that process for all 100 samples.
obese <- ifelse(test=(runif(n=num.samples) < (rank(weight)/num.samples)),
  yes=1, no=0)
obese  ## print the contents of "obese": 1 = "obese", 0 = "not obese"

## plot the data
plot(x=weight, y=obese)

## fit a logistic regression to the data...
glm.fit <- glm(obese ~ weight, family=binomial)
lines(weight, glm.fit$fitted.values)

#######################################
##
## draw ROC and AUC using pROC
##
#######################################
## NOTE: By default, the graphs come out looking terrible.
## The problem is that ROC graphs should be square, since the x and y axes
## both go from 0 to 1. However, the window in which I draw them isn't square,
## so extra whitespace is added to pad the sides.
roc(obese, glm.fit$fitted.values, plot=TRUE)

## Now let's configure R so that it prints the graph as a square.
par(pty = "s")
## pty sets the aspect ratio of the plot region. Two options:
## "s" - creates a square plotting region
## "m" - (the default) creates a maximal plotting region
roc(obese, glm.fit$fitted.values, plot=TRUE)

## NOTE: By default, roc() uses specificity on the x-axis and the values range
## from 1 to 0. This makes the graph look like what we would expect, but the
## x-axis itself might induce a headache. To use 1-specificity (i.e. the
## False Positive Rate) on the x-axis, set "legacy.axes" to TRUE.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE)

## If you want to rename the x and y axes...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE,
  xlab="False Positive Percentage", ylab="True Positive Percentage")

## We can also change the color of the ROC line, and make it wider...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE,
  xlab="False Positive Percentage", ylab="True Positive Percentage",
  col="#377eb8", lwd=4)

## If we want to find out the optimal threshold we can store the
## data used to make the ROC graph in a variable...
roc.info <- roc(obese, glm.fit$fitted.values, legacy.axes=TRUE)
str(roc.info)

## and then extract just the information that we want from that variable.
roc.df <- data.frame(
  tpp=roc.info$sensitivities*100,        ## tpp = true positive percentage
  fpp=(1 - roc.info$specificities)*100,  ## fpp = false positive percentage
  thresholds=roc.info$thresholds)

head(roc.df)  ## head() shows the upper right-hand corner of the ROC graph,
              ## when the threshold is so low (negative infinity) that every
              ## single sample is called "obese": TPP = 100% and FPP = 100%
tail(roc.df)  ## tail() shows the lower left-hand corner, when the threshold
              ## is so high (infinity) that every single sample is called
              ## "not obese": TPP = 0% and FPP = 0%

## now let's look at the thresholds between TPP 60% and 80%...
roc.df[roc.df$tpp > 60 & roc.df$tpp < 80,]

## We can calculate the area under the curve...
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE,
  xlab="False Positive Percentage", ylab="True Positive Percentage",
  col="#377eb8", lwd=4, print.auc=TRUE)

## ...and the partial area under the curve.
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE,
  xlab="False Positive Percentage", ylab="True Positive Percentage",
  col="#377eb8", lwd=4, print.auc=TRUE, print.auc.x=45,
  partial.auc=c(100, 90), auc.polygon=TRUE, auc.polygon.col="#377eb822")

#######################################
##
## Now let's fit the data with a random forest...
##
#######################################
rf.model <- randomForest(factor(obese) ~ weight)

## ROC for random forest
roc(obese, rf.model$votes[,1], plot=TRUE, legacy.axes=TRUE, percent=TRUE,
  xlab="False Positive Percentage", ylab="True Positive Percentage",
  col="#4daf4a", lwd=4, print.auc=TRUE)

#######################################
##
## Now layer logistic regression and random forest ROC graphs..
##
#######################################
roc(obese, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE,
  xlab="False Positive Percentage", ylab="True Positive Percentage",
  col="#377eb8", lwd=4, print.auc=TRUE)
plot.roc(obese, rf.model$votes[,1], percent=TRUE, col="#4daf4a", lwd=4,
  print.auc=TRUE, add=TRUE, print.auc.y=40)
legend("bottomright", legend=c("Logistic Regression", "Random Forest"),
  col=c("#377eb8", "#4daf4a"), lwd=4)

#######################################
##
## Now that we're done with our ROC fun, let's reset the par() variables.
## There are two ways to do it...
##
#######################################
par(pty = "m")
marionLlvre
from dataclasses import dataclass from enum import Enum from random import * from math import * import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt import networkx as nx sns.set() # Some constants FRAME_RATE = 1 # Refresh graphics very FRAME_RATE hours DENSITY = 100 I0 = 0.03 SOCIAL_DISTANCE = 0.007 # in km SPEED = 6 # km/day BETA1 = 0.50 # Probality to gets infected (From "S" to "I") BETA2 = 0.8 # Probability to get infected if you are part of an infected cluster GAMMA1 = 7 * 24 # Number of hours before recovering (From "I" to "R") GAMMA2 = 0.003 # Probability to die (From "I" to "D") EPSILON = 0.05 # Probability to be Susceptible again (From "R" to "S") BORDER = True SIGMA = ((6/24)/(3* sqrt(2))) # en km LOCKDOWN= False proba_for_asymptomatique = 0.3 proba_for_infectious = 0.05 MAX_HOME_DISTANCE = 0.1 ## The locations of borders on the map ## The locations of borders on the map A = (0.,0.82142857) B = (0.46896552,0.53125) C = (0.57241379,0.58928571) D = (1.,0.33035714) class SIRState(Enum): SUSCEPTIBLE = 0 INFECTIOUS = 1 RECOVERED = 2 DEAD = 3 class District(Enum): D7 = 0 D15 = 1 def compute_district(x,y): fy = my_piecewise_curve(x) if y < fy: return District(1) else: return District(0) def my_piecewise_curve(x): #retourne le y if 0<=x< B[0]: return ((B[1]-A[1])/B[0]-A[0])*x+A[1] if B[0]<=x <C[0]: return ((C[1]-B[1]) /(C[0]-B[0]))*x+ (B[1]-(((C[1]-B[1]) /(C[0]-B[0]))*B[0])) if C[0]<= x: return ((D[1]-C[1])/(D[0]-C[0]))*x+ (C[1]-((D[1]-C[1])/(D[0]-C[0]))*C[0]) @dataclass class Person: x: float #Normalized x position y: float #Normalized y position succ: list # liste des voisins infectieux status : int district : District #n° du district BORDER = TRUE LOCKDOWN= True def __init__(self, x, y): self.x = x self.y = y self.district = compute_district(x, y) # District ne peut être modifié car dépendant de x,y d'origine self.origin = (x,y) # idem pour la position d'origine car invariable def move(self): # tout le monde bouge de la 
    # ... moves the same way
    def move(self):
        dx, dy = np.random.normal(0, SIGMA, size=2)  # displacement follows a normal law
        x = self.x + dx
        y = self.y + dy
        if 0 <= x <= 1 and 0 <= y <= 1:  # do not leave the map
            if BORDER and self.district == compute_district(x, y):  # district at t == district at t+1
                if LOCKDOWN and sqrt((x - self.origin[0])**2 + (y - self.origin[1])**2) > MAX_HOME_DISTANCE:
                    return self  # forbidden to travel beyond the lockdown radius
                self.x += dx
                self.y += dy
            if not BORDER:
                if LOCKDOWN and sqrt((x - self.origin[0])**2 + (y - self.origin[1])**2) > MAX_HOME_DISTANCE:
                    return self  # lockdown may apply even without the border restriction
                self.x += dx
                self.y += dy

    def update(self):
        return self

    def voisin_infected(self):  # any infectious neighbour?
        for neighbor in self.succ:
            if neighbor.state == SIRState.INFECTIOUS:
                return True
        return False

    def voisin_infected_same_district(self):
        # any infectious neighbour living in the same district as self?
        for neighbor in self.succ:
            if neighbor.state == SIRState.INFECTIOUS and neighbor.district == self.district:
                return True
        return False


# susceptible person who moves randomly
class SusceptiblePerson(Person):
    state = SIRState.SUSCEPTIBLE

    def update(self):
        # with an infectious neighbour, become infectious with probability BETA1
        if BORDER:  # only neighbours from the same district can contaminate
            has_infected_neighbor = self.voisin_infected_same_district()
        else:
            has_infected_neighbor = self.voisin_infected()
        if has_infected_neighbor and np.random.random() < BETA1:
            return InfectiousPerson(self.x, self.y)
        return self


class InfectiousPerson(Person):
    state = SIRState.INFECTIOUS
    age: int = 0

    def update(self):
        self.age += 1
        if np.random.rand() < GAMMA2:  # probability of dying while infected
            return DeadPerson(self.x, self.y)
        if self.age >= GAMMA1:  # infectious for more than GAMMA1 steps
            return RecoveredPerson(self.x, self.y)
        return self


class RecoveredPerson(Person):
    state = SIRState.RECOVERED

    def update(self):
        if np.random.rand() < EPSILON:  # immunity wanes: becomes susceptible again
            return SusceptiblePerson(self.x, self.y)
        return self  # still immune


class DeadPerson(Person):
    state = SIRState.DEAD

    def move(self):  # dead is dead
        pass


'''
Functions used to display and plot the curves (you should not have to change them)
'''
def display_map(people, ax=None):
    x = [p.x for p in people]
    y = [p.y for p in people]
    h = [p.state.name[0] for p in people]
    horder = ["S", "I", "R", "D"]
    ax = sns.scatterplot(x, y, hue=h, hue_order=horder, ax=ax)
    ax.set_xlim((0.0, 1.0))
    ax.set_ylim((0.0, 1.0))
    ax.set_aspect(224 / 145)
    ax.set_axis_off()
    ax.set_frame_on(True)
    ax.legend(loc=1, bbox_to_anchor=(0, 1))


count_by_population = None

def plot_population(people, ax=None):
    global count_by_population
    states = np.array([p.state.value for p in people], dtype=int)
    counts = np.bincount(states, minlength=4)
    entry = {
        "Susceptible": counts[SIRState.SUSCEPTIBLE.value],
        "Infectious": counts[SIRState.INFECTIOUS.value],
        "Dead": counts[SIRState.DEAD.value],
        "Recovered": counts[SIRState.RECOVERED.value],
    }
    if count_by_population is None:
        count_by_population = pd.DataFrame(entry, index=[0.])
    else:  # DataFrame.append was removed in pandas 2.x; use concat instead
        count_by_population = pd.concat(
            [count_by_population, pd.DataFrame(entry, index=[0.])],
            ignore_index=True)
    if ax is not None:
        count_by_population.index = np.arange(len(count_by_population)) / 24
        sns.lineplot(data=count_by_population, ax=ax)


'''
Main loop function, called at each turn
'''
def next_loop_event(t):
    print("Time =", t)
    # Move each person
    for p in people:
        p.move()
    update_graph(people)
    # Update the state of each person
    for i in range(len(people)):
        people[i] = people[i].update()
    if t % FRAME_RATE == 0:
        fig.clf()
        ax1, ax2 = fig.subplots(1, 2)
        display_map(people, ax1)
        plot_population(people, ax2)
    else:
        plot_population(people, None)


'''
Builds, for each person, the list of neighbours within SOCIAL_DISTANCE
'''
UNSEEN = 0
DONE = 1

def update_graph(people):  # update each person's list of successors (contacts)
    for i in people:
        i.succ = []
        for j in people:
            d = ((i.x - j.x)**2 + (i.y - j.y)**2)**0.5  # Euclidean distance
            if i is not j and d <= SOCIAL_DISTANCE:  # social distance not respected
                i.succ.append(j)  # j is a contact of i


'''
Function that creates the initial population
'''
def create_data():
    # each person is INFECTIOUS with probability I0, SUSCEPTIBLE otherwise
    a = []
    for i in range(DENSITY):
        if np.random.random() < I0:
            a.append(InfectiousPerson(np.random.rand(), np.random.rand()))
        else:
            a.append(SusceptiblePerson(np.random.rand(), np.random.rand()))
    return a  # list of everyone in the simulation


def create_data_test():
    # 100 susceptible people at (0.5, 0.25) and 100 infectious people at (0.5, 0.76)
    S = [SusceptiblePerson(0.5, 0.25) for i in range(100)]
    I = [InfectiousPerson(0.5, 0.76) for i in range(100)]
    return S + I


import matplotlib.animation as animation

people = create_data()
fig = plt.figure(1)
duration = 20  # in days
anim = animation.FuncAnimation(fig, next_loop_event,
                               frames=np.arange(duration * 24),
                               interval=100, repeat=False)
# To save the animation as a video
# anim.save("simulation.mp4", fps=5, dpi=100, writer="ffmpeg")
plt.show()
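The contact graph built by `update_graph` can be checked in isolation. Below is a minimal sketch of the same distance test on plain coordinate tuples; the `neighbors` helper and the `0.1` threshold are illustrative stand-ins, not names from the simulation:

```python
def neighbors(points, max_dist):
    """For each point, list the indices of the other points within max_dist."""
    succ = []
    for i, (xi, yi) in enumerate(points):
        # same Euclidean-distance test as update_graph, excluding the point itself
        succ.append([j for j, (xj, yj) in enumerate(points)
                     if j != i and ((xi - xj)**2 + (yi - yj)**2)**0.5 <= max_dist])
    return succ

pts = [(0.0, 0.0), (0.05, 0.0), (0.5, 0.5)]
print(neighbors(pts, 0.1))  # [[1], [0], []]
```

The first two points are within the threshold of each other, so each lists the other; the third point is isolated.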
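Under the `InfectiousPerson.update` rule, a person dies with per-step probability GAMMA2 and recovers once `age` reaches GAMMA1, so the overall chance of dying before recovery is 1 - (1 - GAMMA2)^GAMMA1. A quick check with assumed values (GAMMA1 and GAMMA2 are not defined in this excerpt; the numbers below are illustrative only):

```python
gamma1 = 7 * 24   # assumed: infectious for 7 days at 24 simulation steps per day
gamma2 = 0.001    # assumed per-step death probability

# probability of dying at some step before the recovery threshold is reached
p_death = 1 - (1 - gamma2) ** gamma1
print(round(p_death, 3))  # ≈ 0.155
```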