Found 445 repositories (showing 30)
sayantann11
Classification - Machine Learning

This is the 'Classification' tutorial, which is part of the Machine Learning course offered by Simplilearn. In this tutorial we will learn about classification algorithms, the types of classification algorithms, Support Vector Machines (SVM), Naive Bayes, Decision Trees, and Random Forest classifiers.

Objectives

Let us look at some of the objectives covered under this section of the Machine Learning tutorial:
- Define Classification and list its algorithms
- Describe Logistic Regression and Sigmoid Probability
- Explain K-Nearest Neighbors and KNN classification
- Understand Support Vector Machines, Polynomial Kernel, and Kernel Trick
- Analyze Kernel Support Vector Machines with an example
- Implement the Naïve Bayes Classifier
- Demonstrate Decision Tree Classifier
- Describe Random Forest Classifier

Classification: Meaning

Classification is a type of supervised learning. It specifies the class to which data elements belong and is best used when the output has finite and discrete values. It predicts a class for an input variable. There are two types of classification: binomial and multi-class.

Classification: Use Cases

Some of the key areas where classification is used:
- To find whether an email received is spam or ham
- To identify customer segments
- To decide whether a bank loan should be granted
- To predict whether a student will pass or fail an examination

Classification: Example

Social media sentiment analysis has two potential outcomes, positive or negative. [Figure: classification example]

A second chart shows the classification of the Iris flower dataset into its three sub-species, indicated by codes 0, 1, and 2. [Figure: Iris flower dataset] The test-set dots represent the assignment of new test data points to one class or the other based on the trained classifier model.

Types of Classification Algorithms

Let's have a quick look at the types of classification algorithms:
- Linear models: Logistic Regression, Support Vector Machines
- Nonlinear models: K-nearest Neighbors (KNN), Kernel Support Vector Machines (SVM), Naïve Bayes, Decision Tree Classification, Random Forest Classification

Logistic Regression: Meaning

Logistic regression is a regression model that is used for classification. It is widely used for binary classification problems and can also be extended to multi-class classification problems. Here, the dependent variable is categorical: y ∈ {0, 1}. A binary dependent variable can have only two values, like 0 or 1, win or lose, pass or fail, healthy or sick.

In this case, you model the probability that the output y is 1 or 0. This is called the sigmoid probability (σ): if σ(θᵀx) > 0.5, set y = 1, else set y = 0. Unlike Linear Regression (and its Normal Equation solution), there is no closed-form solution for finding the optimal weights of logistic regression. Instead, you must solve for them with maximum likelihood estimation, which finds the weights that make the observed data most probable. It can be used to calculate the probability of a given outcome in a binary model, like the probability of being classified as sick or of passing an exam. [Figure: logistic regression example]
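To make this concrete, here is a minimal sketch (not part of the original tutorial) of fitting a logistic regression classifier with scikit-learn; the dataset and parameter values are illustrative:

```python
# Minimal logistic regression sketch with scikit-learn (illustrative example).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000)  # weights found by maximum likelihood
clf.fit(X_train, y_train)

# predict_proba returns class probabilities; in the binary case, y = 1 when p > 0.5
print(clf.predict_proba(X_test[:3]))
print(clf.score(X_test, y_test))
```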
Sigmoid Probability

The probability in logistic regression is often represented by the sigmoid function (also called the logistic function or the S-curve):

S(t) = 1 / (1 + e^(−t))

In this equation, t represents the input value (for example, the number of hours studied) and S(t) represents the probability of passing the exam. For the sigmoid function g(z) = 1 / (1 + e^(−z)), g(z) tends toward 1 as z → ∞, and g(z) tends toward 0 as z → −∞.

K-nearest Neighbors (KNN)

The K-nearest Neighbors algorithm assigns a data point to a class based on a similarity measurement. It is a supervised method for classification. The steps of the KNN algorithm are as given below: [Figure: KNN distribution]
- Choose the number k and a distance metric (k = 5 is common).
- Find the k nearest neighbors of the sample that you want to classify.
- Assign the class label by majority vote.

KNN Classification

A new input point is classified into the category in which it has the greatest number of neighbors. For example: [Figure: KNN classification]
- Classify a patient as high risk or low risk.
- Mark an email as spam or ham.
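A minimal KNN sketch following the steps above (again illustrative, not the tutorial's own code):

```python
# K-nearest neighbors classification sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# k = 5 neighbors with Euclidean (Minkowski, p=2) distance, as suggested above
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)

print(knn.predict(X_test[:5]))   # class labels chosen by majority vote among the 5 neighbors
print(knn.score(X_test, y_test))
```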
Support Vector Machine (SVM)

SVMs are classification algorithms used to assign data to various classes. They involve detecting hyperplanes that segregate the data into classes. SVMs are very versatile and are capable of performing linear or nonlinear classification, regression, and outlier detection. Once the ideal hyperplanes are discovered, new data points can be easily classified. [Figure: support vector machines]

The optimization objective is to find the "maximum margin hyperplane," the one farthest from the closest points in the two classes (these points are called support vectors). In the figure, the middle line represents the hyperplane.

SVM Example

Hyperplanes with larger margins have lower generalization error. The positive and negative hyperplanes are represented by:

w0 + wTx_pos = 1 and w0 + wTx_neg = −1

Classification of any new input sample x_test: if w0 + wTx_test > 1, the sample is in the class on the positive-hyperplane side; if w0 + wTx_test < −1, it is in the class on the negative-hyperplane side. Subtracting the two hyperplane equations gives:

wT(x_pos − x_neg) = 2

The length of the vector w (its L2 norm) is ‖w‖ = sqrt(Σj wj²). Normalizing by the length of w, you arrive at:

wT(x_pos − x_neg) / ‖w‖ = 2 / ‖w‖   (SVM-1)

SVM: Hard Margin Classification

The left side of equation SVM-1 can be interpreted as the distance between the positive and negative hyperplanes; in other words, it is the margin to be maximized. Hence the objective is to maximize 2 / ‖w‖ under the constraint that all samples are classified correctly, which is represented as:

w0 + wTx(i) ≥ 1 if y(i) = 1, and w0 + wTx(i) ≤ −1 if y(i) = −1

This means that you are minimizing ‖w‖, with all positive samples on one side of the positive hyperplane and all negative samples on the other side of the negative hyperplane. This can be written concisely as:

y(i)(w0 + wTx(i)) ≥ 1 for all i

Minimizing ‖w‖ is the same as minimizing (1/2)‖w‖², which is preferred because it is differentiable even at w = 0. The approach listed above is called the "hard margin linear SVM classifier."

SVM: Soft Margin Classification

To allow the linear constraints to be relaxed for data that is not linearly separable, a slack variable ξ(i) is introduced; it measures how much the ith instance is allowed to violate the margin. The slack variable is simply added to the linear constraints:

y(i)(w0 + wTx(i)) ≥ 1 − ξ(i), with ξ(i) ≥ 0

Subject to these constraints, the new objective to be minimized becomes:

(1/2)‖w‖² + C Σi ξ(i)

You now have two conflicting objectives: minimizing the slack variables to reduce margin violations, and minimizing (1/2)‖w‖² to increase the margin. The hyperparameter C defines this trade-off. Large values of C correspond to larger error penalties (so smaller margins), whereas smaller values of C allow for more misclassification errors and larger margins.

SVM: Regularization

The concept of C is the reverse of regularization: higher C means lower regularization, which lowers the bias and raises the variance (risking overfitting). [Figure: effect of C]

IRIS Data Set

The Iris dataset contains measurements of 150 iris flowers from three different species: Setosa, Versicolor, and Virginica. Each row represents one sample, and the flower measurements in centimeters are stored as columns. These are called features.

IRIS Data Set: SVM

Let's train an SVM model using scikit-learn for the Iris dataset. [Figure: SVM model on Iris]

Nonlinear SVM Classification

There are two ways to solve nonlinear problems with SVMs: by adding polynomial features, or by adding similarity features. Polynomial features can be added to a dataset; in some cases, this creates a linearly separable dataset. In the figure on the left, there is only one feature, x1, and the dataset is not linearly separable. If you add x2 = (x1)², the data becomes linearly separable. [Figure: nonlinear SVM classification]

Polynomial Kernel

In scikit-learn, you can use a Pipeline for creating polynomial features; a sketch is given below. Classification results for the Moons dataset are shown in the figure. [Figure: polynomial kernel on Moons]
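A minimal sketch of this polynomial-features pipeline on the Moons dataset (hyperparameter values are illustrative):

```python
# Polynomial features + linear SVM on the Moons dataset (illustrative sketch).
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),  # explicitly add polynomial features
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge", max_iter=10000)),
])
polynomial_svm_clf.fit(X, y)
print(polynomial_svm_clf.score(X, y))
```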
Polynomial Kernel with Kernel Trick

[Figure: polynomial kernel with kernel trick]

For high-dimensional datasets, adding too many polynomial features can slow the model down. Instead, you can apply the kernel trick, which achieves the effect of polynomial features without actually adding them. The code shown in the figure (using the SVC class) trains an SVM classifier with a 3rd-degree polynomial kernel via the kernel trick. [Figure: polynomial kernel code] The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials.

Kernel SVM

Kernel SVMs are used for classification of nonlinear data. In the chart, nonlinear data is projected into a higher-dimensional space via a mapping function, where it becomes linearly separable. [Figure: kernel SVM projection] In the higher dimension, a linear separating hyperplane can be derived and used for classification. A reverse projection from the higher dimension back to the original feature space takes the boundary back to its nonlinear shape.

As mentioned previously, SVMs can be kernelized to solve nonlinear classification problems. You can create a sample dataset for the XOR gate (a nonlinear problem) with NumPy: 100 samples are assigned the class label 1, and 100 samples the class label −1. [Figure: XOR dataset] As you can see, this data is not linearly separable. [Figure: non-separable data] You now use the kernel trick to classify the XOR dataset created earlier; a sketch is given below. [Figure: kernel SVM on XOR]
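An illustrative version of the XOR experiment described above, using an RBF-kernel SVC (parameter values are examples, not the tutorial's exact settings):

```python
# XOR dataset + RBF-kernel SVM (illustrative sketch following the steps above).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X_xor = rng.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)  # two classes labeled 1 and -1

# The RBF kernel implicitly maps the data into a higher-dimensional space
svm = SVC(kernel="rbf", gamma=0.10, C=10.0, random_state=0)
svm.fit(X_xor, y_xor)
print(svm.score(X_xor, y_xor))
```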
Naïve Bayes Classifier

Have you ever wondered how your mail provider implements spam filtering, how online news channels perform news text classification, or how companies perform sentiment analysis of their audience on social media? All of this and more is done through a machine learning algorithm called the Naive Bayes classifier.

Naive Bayes is named after the 18th-century mathematician Thomas Bayes. The classifier works on the principle of conditional probability, as given by Bayes' theorem.

Advantages of Naive Bayes Classifier

Listed below are six benefits of the Naive Bayes classifier:
- Very simple and easy to implement
- Needs less training data
- Handles both continuous and discrete data
- Highly scalable with the number of predictors and data points
- Fast, so it can be used for real-time predictions
- Not sensitive to irrelevant features

Bayes Theorem

According to the Bayes model, the conditional probability P(Y|X) can be calculated as:

P(Y|X) = P(X|Y)P(Y) / P(X)

Estimating P(X|Y) directly requires a very large number of probabilities even for a relatively small vector X. For example, for a Boolean Y and 30 Boolean attributes in the X vector, you would have to estimate more than 2 billion probabilities P(X|Y): one for each of the 2³⁰ attribute combinations, for each value of Y. To make this practical, the Naïve Bayes classifier assumes that the attributes of X are conditionally independent of one another given Y. This reduces the number of probability estimates to 2 × 30 = 60 in the above example.

Naïve Bayes Classifier for SMS Spam Detection

Consider a labeled SMS database having 5,574 messages, each marked as spam or ham. [Figure: sample messages] Let's train a model with the Naïve Bayes algorithm to detect spam. The message lengths and their frequency in the training dataset are shown. [Figure: message-length histogram]

The logic used to train the spam detector (a code sketch is given at the end of this section):
1. Split each message into individual words/tokens (bag of words).
2. Lemmatize the data (each word takes its base form; e.g., "walking" or "walked" is replaced with "walk").
3. Convert the data to vectors using the scikit-learn CountVectorizer.
4. Apply TF-IDF to down-weight common words like "is," "are," and "and."
5. Apply scikit-learn's MultinomialNB to get the spam detector.

This spam detector can then be used to classify a random new message as spam or ham. Next, the accuracy of the spam detector is checked using the confusion matrix. For the SMS spam example above: [Figure: confusion matrix]

Accuracy rate = correct / total = (4827 + 592) / 5574 = 97.21%
Error rate = wrong / total = (155 + 0) / 5574 = 2.78%

Although the confusion matrix is useful, more precise metrics are provided by precision and recall. Precision refers to the accuracy of positive predictions:

Precision = TP / (TP + FP)

Recall refers to the ratio of positive instances that are correctly detected by the classifier (also known as the true positive rate, or TPR):

Recall = TP / (TP + FN)

Precision/Recall Trade-off

To detect age-appropriate videos for kids, you need high precision (low recall is acceptable) to ensure that only safe videos make the cut, even though a few safe videos may be left out. High recall (with low precision acceptable) is needed in store surveillance to catch shoplifters: a few false alarms are acceptable, but all shoplifters should be caught.
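Putting the numbered steps above together, here is a minimal sketch of such a spam detector; the CSV file name and column names are assumptions, not the tutorial's actual data file:

```python
# Naive Bayes SMS spam detector sketch; file name and columns are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

df = pd.read_csv("sms_spam.csv")  # hypothetical file with 'message' and 'label' columns
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], random_state=42)

spam_clf = Pipeline([
    ("bow", CountVectorizer()),     # bag of words
    ("tfidf", TfidfTransformer()),  # down-weight common words
    ("nb", MultinomialNB()),
])
spam_clf.fit(X_train, y_train)
print(confusion_matrix(y_test, spam_clf.predict(X_test)))
```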
Decision Tree Classifier

Decision Trees (DTs) can be used both for classification and regression. Their advantage is that they require very little data preparation: no feature scaling or centering at all. They are also the fundamental components of Random Forests, one of the most powerful ML algorithms. Unlike Random Forests and neural networks (which do black-box modeling), decision trees are white-box models, which means their inner workings can be clearly understood. In the case of classification, the data is segregated based on a series of questions, and any new data point is assigned to the selected leaf node. [Figure: decision tree classifier]

Start at the tree root and split the data on the feature that results in the largest information gain (IG). This splitting procedure is then repeated iteratively at each child node until the leaves are pure, meaning that the samples at each node all belong to the same class. In practice, you can set a limit on the depth of the tree to prevent overfitting; purity is compromised here, as the final leaves may still have some impurity. The figure shows the classification of the Iris dataset. [Figure: decision tree on Iris]

IRIS Decision Tree

Let's build a decision tree using scikit-learn for the Iris flower dataset and visualize it using the export_graphviz API (a sketch is given at the end of this section). The output of export_graphviz can be converted into PNG format. [Figure: Iris decision tree output]

In the rendered tree, the samples attribute is the number of training instances the node applies to, and the value attribute is the number of training instances of each class the node applies to. Gini impurity measures the node's impurity; a node is "pure" (gini = 0) if all training instances it applies to belong to the same class:

Gini = 1 − Σk (pk)², where pk is the fraction of class-k instances at the node.

For example, for the Versicolor (green) node, Gini = 1 − (0/54)² − (49/54)² − (5/54)² ≈ 0.168. [Figure: Iris decision tree sample]

Decision Boundaries

For the first node (depth 0), the solid line splits the data (Iris-Setosa on the left). Gini is 0 for the Setosa node, so no further split is possible. The second node (depth 1) splits the data into Versicolor and Virginica. If max_depth were set to 3, a third split would happen (vertical dotted line). [Figure: decision boundaries]

For a sample with petal length 5 cm and petal width 1.5 cm, the tree traverses to the depth-2 left node, so the probability predictions for this sample are 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54).

CART Training Algorithm

Scikit-learn uses the Classification and Regression Trees (CART) algorithm to train decision trees. CART splits the data into two subsets using a single feature k and threshold tk (for example, petal length ≤ 2.45 cm). This is done recursively for each node, with k and tk chosen so that they produce the purest subsets (weighted by their size). The objective is to minimize the cost function:

J(k, tk) = (m_left / m) · G_left + (m_right / m) · G_right

where G_left/right measures the impurity of each subset and m_left/right is the number of instances in each subset. The algorithm stops when max_depth is reached or when no further split can be found for any node. Other hyperparameters may be used to stop the tree: min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes.
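The sketch below illustrates the Iris decision tree described above: a depth-limited DecisionTreeClassifier, the probability prediction for the 5 cm / 1.5 cm sample, and export via export_graphviz (feature choice and file name are illustrative):

```python
# Decision tree on Iris with export_graphviz (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width, as in the decision-boundary discussion
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)  # depth limit prevents overfitting
tree_clf.fit(X, y)

# Probability estimates for a flower with 5 cm petal length, 1.5 cm petal width
print(tree_clf.predict_proba([[5.0, 1.5]]))  # approximately [0, 0.907, 0.093]

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",  # convert to PNG with: dot -Tpng iris_tree.dot -o iris_tree.png
    feature_names=["petal length (cm)", "petal width (cm)"],
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)
```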
Gini Impurity or Entropy

Entropy is one more measure of impurity and can be used in place of Gini. It is a degree of uncertainty, and information gain is the reduction in entropy as one traverses down the tree. Entropy is zero for a DT node when the node contains instances of only one class. The entropy for the depth-2 left node in the example given above is:

H = −(49/54) log2(49/54) − (5/54) log2(5/54) ≈ 0.445

Gini and entropy both lead to similar trees.

DT: Regularization

The figure shows two decision trees trained on the Moons dataset. [Figure: DT regularization] The decision tree on the right is restricted by min_samples_leaf = 4. The model on the left is overfitting, while the model on the right generalizes better.

Random Forest Classifier

A random forest can be considered an ensemble of decision trees (ensemble learning). The Random Forest algorithm:
1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set).
2. Grow a decision tree from the bootstrap sample. At each node, randomly select d features and split the node using the feature that provides the best split according to the objective function, for instance by maximizing the information gain.
3. Repeat steps 1 and 2 k times (k is the number of trees you want to create, each using a subset of samples).
4. Aggregate the predictions of the trees for a new data point and assign the class label by majority vote (pick the class selected by the greatest number of trees).

Random forests are opaque, which means it is difficult to visualize their inner workings. [Figure: random forest classifier] However, the advantages outweigh this limitation, since you do not have to worry about hyperparameters except k, the number of decision trees to be created from subsets of samples. Random forests are quite robust to noise from the individual decision trees, so you need not prune individual trees. The larger the number of decision trees, the more accurate the random forest prediction is (this, however, comes with higher computation cost).

Key Takeaways

Let us quickly run through what we have learned so far in this Classification tutorial:
- Classification algorithms are supervised learning methods that split data into classes. They can work on linear as well as nonlinear data.
- Logistic Regression classifies data using weighted parameters and a sigmoid conversion to calculate the probability of classes.
- K-nearest Neighbors (KNN) uses feature similarity to classify data.
- Support Vector Machines (SVMs) classify data by detecting the maximum-margin hyperplane between data classes.
- Naïve Bayes, a simplified Bayes model, can help classify data using conditional probability models.
- Decision Trees are powerful classifiers that apply tree-splitting logic until pure (or nearly pure) leaf node classes are attained.
- Random Forests apply ensemble learning to decision trees for more accurate classification predictions.

Conclusion

This completes the 'Classification' tutorial. In the next tutorial, we will learn 'Unsupervised Learning with Clustering.'
Mario-Kart-Felix
2020 was a roller coaster of major, world-shaking events. We all couldn't wait for the year to end. But just as 2020 was about to close, it pulled another fast one on us: the SolarWinds hack, one of the biggest cybersecurity breaches of the 21st century. The SolarWinds hack was a major event not because a single company was breached, but because it triggered a much larger supply chain incident that affected thousands of organizations, including the U.S. government.

What is SolarWinds?

SolarWinds is a major software company based in Austin, Texas, which provides system management tools for network and infrastructure monitoring, and other technical services, to hundreds of thousands of organizations around the world. Among the company's products is an IT performance monitoring system called Orion. As an IT monitoring system, SolarWinds Orion has privileged access to IT systems to obtain log and system performance data. It is that privileged position, and its wide deployment, that made SolarWinds a lucrative and attractive target.

What is the SolarWinds hack?

The SolarWinds hack is the commonly used term for the supply chain breach that involved the SolarWinds Orion system. In this hack, suspected nation-state hackers, identified as a group known as Nobelium by Microsoft and often simply referred to as the SolarWinds hackers by other researchers, gained access to the networks, systems and data of thousands of SolarWinds customers. The breadth of the hack is unprecedented and one of the largest, if not the largest, of its kind ever recorded.

More than 30,000 public and private organizations, including local, state and federal agencies, use the Orion network management system to manage their IT resources. As a result, the hack compromised the data, networks and systems of thousands of organizations when SolarWinds inadvertently delivered the backdoor malware as an update to the Orion software. SolarWinds customers weren't the only ones affected: because the hack exposed the inner workings of Orion users, the hackers could potentially gain access to the data and networks of those users' customers and partners as well, enabling the number of affected victims to grow exponentially from there. In short, hackers compromised a digitally signed SolarWinds Orion network monitoring component, opening a backdoor into the networks of thousands of SolarWinds government and enterprise customers.

How did the SolarWinds hack happen?

The hackers used a method known as a supply chain attack to insert malicious code into the Orion system. A supply chain attack works by targeting a third party with access to an organization's systems rather than trying to hack the networks directly. The third-party software, in this case the SolarWinds Orion Platform, creates a backdoor through which hackers can access and impersonate users and accounts of victim organizations. The malware could also access system files and blend in with legitimate SolarWinds activity without detection, even by antivirus software.

SolarWinds was a perfect target for this kind of supply chain attack. Because its Orion software is used by many multinational companies and government agencies, all the hackers had to do was install the malicious code into a new batch of software distributed by SolarWinds as an update or patch.

The SolarWinds hack timeline

Here is a timeline of the SolarWinds hack:
- September 2019: Threat actors gain unauthorized access to the SolarWinds network.
- October 2019: Threat actors test the initial code injection into Orion.
- Feb. 20, 2020: Malicious code known as Sunburst is injected into Orion.
- March 26, 2020: SolarWinds unknowingly starts sending out Orion software updates with the hacked code.

According to a U.S. Department of Homeland Security advisory, the affected versions of SolarWinds Orion are 2019.4 through 2020.2.1 HF1. More than 18,000 SolarWinds customers installed the malicious updates, with the malware spreading undetected. Through this code, hackers accessed SolarWinds customers' information technology systems, which they could then use to install even more malware to spy on other companies and organizations.

Who was affected?

According to reports, the malware affected many companies and organizations. Even government departments such as Homeland Security, State, Commerce and Treasury were affected, as there was evidence that emails were missing from their systems. Private companies such as FireEye, Microsoft, Intel, Cisco and Deloitte also suffered from this attack.

The breach was first detected by cybersecurity company FireEye. The company confirmed it had been infected with the malware when it saw the infection in customer systems. FireEye labeled the threat actor behind the SolarWinds hack "UNC2452" and identified the backdoor used to gain access to its systems through SolarWinds as "Sunburst."

Microsoft also confirmed that it found signs of the malware in its systems, as the breach was affecting its customers as well. Reports indicated Microsoft's own systems were being used to further the hacking attack, but Microsoft denied this claim to news agencies. Later, the company worked with FireEye and GoDaddy to block and isolate versions of Orion known to contain the malware, cutting hackers off from customers' systems. They did so by turning the domain used by the backdoor malware in Orion into a kill switch, which served as a mechanism to prevent Sunburst from operating further.

Nonetheless, even with the kill switch in place, the hack is still ongoing. Investigators have a lot of data to look through, as many companies using the Orion software aren't yet sure whether they are free from the backdoor malware. It will take a long time before the full impact of the hack is known.

Why did it take so long to detect the SolarWinds attack?

With attackers having first gained access to the SolarWinds systems in September 2019, and the attack not being publicly discovered or reported until December 2020, attackers may well have had 14 or more months of unfettered access. The time between when an attacker gains access and when the attack is actually discovered is often referred to as dwell time. According to a report released in January 2020 by security firm CrowdStrike, the average dwell time in 2019 was 95 days. Given that it took well over a year from the time the attackers first entered the SolarWinds network until the breach was discovered, the dwell time in this attack far exceeded the average.

The question of why it took so long to detect the SolarWinds attack has a lot to do with the sophistication of the Sunburst code and of the hackers who executed the attack. "Analysis suggests that by managing the intrusion through multiple servers based in the United States and mimicking legitimate network traffic, the attackers were able to circumvent threat detection techniques employed by both SolarWinds, other private companies, and the federal government," SolarWinds said in its analysis of the attack.
FireEye, which was the first firm to publicly report the attack, conducted its own analysis. In its report, FireEye described in detail the complex series of actions the attackers took to mask their tracks. Even before Sunburst attempts to connect to its command-and-control server, the malware executes a number of checks to make sure no antimalware or forensic analysis tools are running.

What was the purpose of the hack?

The purpose of the hack remains largely unknown. Still, there are many reasons hackers would want to get into an organization's system, including gaining access to future product plans or holding employee and customer information for ransom. It is also not yet clear what information, if any, hackers stole from government agencies, but the level of access appears to be deep and broad. There is speculation that many enterprises might be collateral damage, as the main focus of the attack was government agencies that use the SolarWinds IT management systems.

Who was responsible for the hack?

Federal investigators and cybersecurity agents believe a Russian espionage operation, most likely Russia's Foreign Intelligence Service, is behind the SolarWinds attack. The Russian government has denied any involvement in the attack, releasing a statement that said, "Malicious activities in the information space contradicts the principles of the Russian foreign policy, national interests and understanding of interstate relations." It also added that "Russia does not conduct offensive operations in the cyber domain."

Contrary to experts in his administration, then-President Donald Trump hinted around the time of the discovery of the SolarWinds hack that Chinese hackers might be behind the attack. However, he did not present any evidence to back up his claim. Shortly after his inauguration, President Joe Biden vowed that his administration intended to hold Russia accountable, through the launch of a full-scale intelligence assessment and review of the SolarWinds attack and those behind it. The president also created the position of deputy national security adviser for cybersecurity as part of the National Security Council. The role, held by veteran intelligence operative Anne Neuberger, is part of an overall bid by the Biden administration to refresh the federal government's approach to cybersecurity and better respond to nation-state actors.

Naming the attack: What are Solorigate, Sunburst and Nobelium?

The SolarWinds attack has a number of different names associated with it. While the attack is often referred to simply as the SolarWinds attack, that isn't the only name to know.

Sunburst: the name of the actual malicious code injection that hackers planted into the SolarWinds Orion IT monitoring system code. Both SolarWinds and CrowdStrike generally refer to the attack as Sunburst.

Solorigate: Microsoft initially dubbed the threat actor group behind the SolarWinds attack Solorigate. The name stuck and was adopted by other researchers as well as the media.

Nobelium: In March 2021, Microsoft decided that the primary designation for the threat actor behind the SolarWinds attack should actually be Nobelium, the idea being that the group is active against multiple victims, not just SolarWinds, and uses more malware than just Sunburst.
The China connection to the SolarWinds attack

While it is suspected that the initial Sunburst code and the attack against SolarWinds and its users came from a threat actor based in Russia, other nation-state threat actors have also used SolarWinds in attacks. According to a Reuters report, suspected nation-state hackers based in China exploited SolarWinds during the same period the Sunburst attack occurred. These suspected China-based threat actors targeted the National Finance Center, a payroll agency within the U.S. Department of Agriculture. It is suspected that the China-based attackers did not use Sunburst, but rather a different malware that SolarWinds identifies as Supernova.

Why is the SolarWinds hack important?

The SolarWinds supply chain attack is a global hack: threat actors turned the Orion software into a weapon, gaining access to several government systems and thousands of private systems around the world. Because the software, and by extension the Sunburst malware, has access to entire networks, many government and enterprise networks and systems face the risk of significant breaches.

The hack could also be the catalyst for rapid, broad change in the cybersecurity industry. Many companies and government agencies are now devising new methods to react to these types of attacks before they happen. Governments and organizations are learning that it is not enough to build a firewall and hope it protects them; they have to actively seek out vulnerabilities in their systems and either shore them up or turn them into traps against these types of attacks.

Since the hack was discovered, SolarWinds has recommended customers update their existing Orion platform. The company has released patches for the malware and other potential vulnerabilities discovered since the initial Orion attack. SolarWinds also recommended that customers unable to update Orion isolate SolarWinds servers and/or change passwords for accounts that have access to those servers. The greater White House cybersecurity focus will be crucial, some industry experts have said, but organizations should also consider adopting modern software-as-a-service tools for monitoring and collaboration. While the cybersecurity industry has advanced significantly in the last decade, these kinds of attacks show that there is still a long way to go to achieve truly secure systems.

The Nobelium group continues to attack targets

The suspected threat actor group behind the SolarWinds attack remained active in 2021 and hasn't stopped at just targeting SolarWinds. On May 27, 2021, Microsoft reported that Nobelium, the group allegedly behind the SolarWinds attack, infiltrated software from the email marketing service Constant Contact. According to Microsoft, Nobelium targeted approximately 3,000 email accounts at more than 150 organizations. The initial attack vector appears to be an account used by USAID. From that initial foothold, Nobelium was able to send out phishing emails in an attempt to get victims to click on a link that would deploy a backdoor Trojan designed to steal user information.
Manibarathi
High-throughput droplet microfluidic devices with fluorescence detection systems provide several advantages over conventional end-point cytometric techniques due to their ability to isolate single cells and investigate complex intracellular dynamics. While there have been significant advances in experimental droplet microfluidics, the development of complementary software tools has lagged. Existing quantification tools have limitations, including dependence on particular hardware platforms or difficulty analyzing a wide range of high-throughput droplet microfluidic data with a single algorithm. To address these issues, an all-in-one Python algorithm called FluoroCellTrack was developed, and its wide-ranging utility was tested on three different applications: quantification of cellular response to drugs, droplet tracking, and intracellular fluorescence. The algorithm imports all images collected using bright-field and fluorescence microscopy and analyzes them to extract useful information. Two parallel steps are performed: droplets are detected using a mathematical Circular Hough Transform (CHT), while single cells (or other contours) are detected by a series of steps defining respective color boundaries, involving edge detection, dilation, and erosion. These feature-detection steps are strengthened by segmentation and radius/area thresholding for precise detection and removal of false positives. Individually detected droplet and contour center maps are overlaid to obtain encapsulation information for further analyses. FluoroCellTrack demonstrates an average similarity of ~92-99% with manual analysis and significantly reduces analysis time, requiring ~30 min to analyze an entire cohort compared with the ~20 h needed for manual quantification.
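The description names standard OpenCV-style primitives (Circular Hough Transform, color thresholding, dilation/erosion, center overlay). Purely as an illustration of those building blocks, and not the authors' actual FluoroCellTrack code, a sketch might look like this (file names, color bounds, and thresholds are assumptions):

```python
# Illustrative sketch of the detection primitives described above (not the
# authors' code): CHT for droplets, color mask + morphology for cell contours.
import cv2
import numpy as np

bright = cv2.imread("brightfield.tif", cv2.IMREAD_GRAYSCALE)  # hypothetical file names
fluor = cv2.imread("fluorescence.tif")

# Droplet detection via Circular Hough Transform, with radius thresholding
droplets = cv2.HoughCircles(cv2.medianBlur(bright, 5), cv2.HOUGH_GRADIENT,
                            dp=1, minDist=40, param1=60, param2=30,
                            minRadius=15, maxRadius=60)

# Cell detection: color boundaries -> mask -> dilation/erosion -> contours
hsv = cv2.cvtColor(fluor, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, np.array([40, 60, 60]), np.array([80, 255, 255]))  # assumed green band
mask = cv2.erode(cv2.dilate(mask, None, iterations=2), None, iterations=2)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cells = [c for c in contours if cv2.contourArea(c) > 20]  # area threshold removes false positives

def centroid(c):
    # Contour centroid from image moments
    m = cv2.moments(c)
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

# Overlay center maps: a cell is encapsulated if its centroid lies within a droplet's radius
if droplets is not None:
    for (x, y, r) in np.round(droplets[0]).astype(int):
        inside = [c for c in cells
                  if np.hypot(centroid(c)[0] - x, centroid(c)[1] - y) < r]
        print(f"Droplet at ({x},{y}), r={r}: {len(inside)} cell(s)")
```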
rahulthawal
Dependencies

OpenCV: The Open Computer Vision library, used from Python, performs the image parsing and analysis. It is used to apply a masking filter to each image and then parse each image for the contours of the marker/target.

Numpy: The Numerical Python (NumPy) library provides the advanced array and matrix functionality for the mathematics used in the program. In particular, arrays are used for creating and passing the RGB (in this instance, BGR) vectors needed to build the appropriate mask.

Access

The Python program is designed to run automatically at the hours allocated to each member of Team 4 (see the timetable below). It first retrieves the system time of the executing host and compares it to the hard-coded times for each member. If the system time does not coincide with a member's time slot, the program informs the user through standard output and closes. Otherwise, the program continues with the Execution Procedure.

Function Definitions

readResponse(): Takes the encoded response from the server after issuing a command and parses the encoded white-space characters to produce a more user-friendly report.
getCoords(output): Given the output from the server, parses the data to find the X, Y, and Z coordinates.
sendAJMA(): Sends the angular commands to the server to move the arm. After issuing them, it acquires the resulting X, Y, and Z coordinates from the parsed return output.
sendCapture(): Sends the capture command to the server and prints the response.
find_marker(image): Given an input image, applies a mask over the pixels. After massaging the data, it attempts to find contours within the image that locate a target marker (for example, a piece of paper). A sketch of this approach is given at the end of this description.
download_image(url, filename): Issues an HTTP request for an image.
rawCommand(givenData): Parses the encoded response from the server and makes it more user-friendly.
printData(): Does the actual printing of parsed data for the user's benefit.
downloadCheckImage(): First attempts to download the latest image from the server and reports any errors it encounters; then parses the image for the marker, if possible, and displays the image to the user.
testCom(): With an initial list of commands for the arm, iterates through them while checking each image for the marker/box. After locating it, it centers the camera on the box horizontally and then vertically. Having centered the camera on the box, it uses forward kinematics to calculate the box's location.

Execution Procedure

The program first authenticates with the R12 server. On success, it calls the primary driver, testCom(), to find the box. This process instructs the arm to move, captures an image, downloads the image, and parses it for the target, all the while reporting pertinent data to the user for diagnostic purposes.
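As an illustration of the masking-and-contours approach described for find_marker (the BGR bounds and image file are assumptions, not the team's actual values):

```python
# Sketch of a find_marker-style routine: BGR mask, morphology, largest contour.
import cv2
import numpy as np

def find_marker(image):
    """Mask the image with BGR bounds, then return the largest contour found."""
    lower = np.array([0, 0, 120], dtype="uint8")   # assumed BGR bounds for the marker color
    upper = np.array([80, 80, 255], dtype="uint8")
    mask = cv2.inRange(image, lower, upper)
    mask = cv2.erode(cv2.dilate(mask, None, iterations=2), None, iterations=2)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return max(contours, key=cv2.contourArea)

image = cv2.imread("capture.jpg")  # hypothetical image downloaded from the arm's server
marker = find_marker(image)
if marker is not None:
    x, y, w, h = cv2.boundingRect(marker)
    print("Marker bounding box:", x, y, w, h)
```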
woshipzs
Code for Audio Sentiment Analysis of Call Center Data
SahaanaIyer
JPMorgan Chase has traders in all the major financial centers and creates a marketplace for asset classes around the globe for our investor clients. Trading teams are organized by asset class:
- Equities (stocks)
- Commodities
- Credit (debt and bonds)
- Currency & Emerging Markets
- Public Finance (government bonds)
- Interest Rates
- Securitized Products

For this project, a trader from the equities team (publicly listed company stocks) has requested functionality be added to their dashboard to allow them to input specific information so they can monitor a new trading strategy. To do this, you'll need to set up your system so you can interface with the relevant financial data feed, make the required calculations, and then present the results in a way that allows the traders to visualize and analyze the data in real time. The visualization of charts and data analysis our traders see is built on JPMorgan Chase's own open-source software called Perspective. You'll learn how to implement this to facilitate the trader's requested changes and deliver actionable insights. You'll have to gain an understanding of the user requirements and then build something that meets those requirements.

1: Interface with a stock price data feed. Interface with a stock price data feed and set up your system for analysis of the data (a rough sketch is given below). Skills: Financial Data, Python, Git, Basic Programming.
2: Use JPMorgan Chase frameworks and tools. Implement the Perspective open-source code in preparation for data visualization. Skills: React, TypeScript, Web Applications.
3: Display data visually for traders. Use Perspective to create the chart for the trader's dashboard. Skills: Technical Communication, Financial Analysis, Web Applications.
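As a rough illustration of task 1 only: the feed URL and JSON field names below are assumptions, not the program's actual API, and the ratio calculation is one plausible "required calculation":

```python
# Hypothetical sketch: poll a stock price feed and compute a price ratio
# between two stocks. Endpoint and JSON fields are assumed, not JPMC's API.
import json
import time
import urllib.request

FEED_URL = "http://localhost:8080/query?id=1"  # assumed local mock data feed

def fetch_quotes():
    with urllib.request.urlopen(FEED_URL) as resp:
        return json.loads(resp.read().decode())

while True:
    # Assumed shape: list of {"stock": ..., "top_bid": {"price": ...}, "top_ask": {"price": ...}}
    quotes = fetch_quotes()
    prices = {q["stock"]: (q["top_bid"]["price"] + q["top_ask"]["price"]) / 2
              for q in quotes}
    if len(prices) >= 2:
        a, b = sorted(prices)[:2]
        print(f"{a}/{b} mid-price ratio: {prices[a] / prices[b]:.4f}")
    time.sleep(1)  # poll once per second
```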
MightyHive-AdTech-Split-Testing-Data- (MightyHive: AdTech Split Testing)

Company website: http://mightyhive.com/call-center-remarketing/

Data source: Teamleada provides data science projects for hands-on experience. In the project "A/B Testing Analytics: MightyHive Project," the data is provided along with the schema to perform detailed data cleaning, data deduplication, data structuring, and analysis. Data available at: https://www.teamleada.com/projects/ab-testing-analytics-mightyhive-project/data-background/data-background

MightyHive is an advertising technology company that uses retargeting methods to send ads to users online. One product, Call Center Remarketing, uses call center log data to retarget online those consumers who did not make a purchase. As a data analyst, you are tasked with determining the results of an advertising campaign run with a MightyHive client, Martin's Travel Agency. Read the background information on retargeting by ReTargeter.

Data

The results of the advertising campaign for Martin's Travel Agency are given in two datasets:
- The Abandoned Dataset (https://github.com/sayandev/MightyHive-AdTech-Split-Testing-Data-/blob/master/Abandoned_Data_Seed.csv): individuals who called into Martin's Travel Agency's call center but did not make a purchase.
- The Reservation Dataset (https://github.com/sayandev/MightyHive-AdTech-Split-Testing-Data-/blob/master/Reservation_Data_Seed.csv): customers who called into Martin's Travel Agency's call center and made a reservation.

The schema for both datasets is provided below:
- Caller_ID: a unique ID generated for every incoming phone call to the call center
- Session: the year/month/day/time of each incoming phone call
- Incoming_Phone: phone number identified using caller identification
- Contact_Phone: phone number the caller submits
- Test_Control: experiment tag
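A minimal pandas sketch of the matching step implied by the schema; the phone-number matching logic is an assumed approach to the analysis, not code provided with the project:

```python
# Sketch: load both datasets and flag abandoned callers who later made a
# reservation, matching on phone number (assumed approach, not provided code).
import pandas as pd

abandoned = pd.read_csv("Abandoned_Data_Seed.csv")
reservation = pd.read_csv("Reservation_Data_Seed.csv")

# Deduplicate abandoned calls on the caller's phone numbers
abandoned = abandoned.drop_duplicates(subset=["Incoming_Phone", "Contact_Phone"])

# An abandoned caller "converted" if either of their phone numbers appears
# in the reservation dataset.
res_phones = (set(reservation["Incoming_Phone"].dropna())
              | set(reservation["Contact_Phone"].dropna()))
abandoned["converted"] = (abandoned["Incoming_Phone"].isin(res_phones)
                          | abandoned["Contact_Phone"].isin(res_phones))

# Compare conversion rates between the test and control groups
print(abandoned.groupby("Test_Control")["converted"].mean())
```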
chetanp2002
The Call-Center Data Analysis project examines key metrics like call duration, resolution rates, and customer satisfaction to evaluate call-center performance. By identifying trends and patterns, the project offers insights for improving efficiency and customer experience, while using predictive modeling to forecast call outcomes.
PawanBangera1
No description available
GRKdev
CallScribe is an integrated system for call center management, featuring audio file transcription and analysis, data management via a FastAPI backend, and a user-friendly Next.js frontend interface for viewing and interacting with conversation data.
mohdsalah
5/4/2018. DSP mini project describing a dimension reduction method, namely Kernel Principal Component Analysis (KPCA); data mining is concerned with finding meaningful patterns in large sets of data. The project also covers a technique used in applying the method, namely centering in feature space, and the numerical experiments that were performed on the iris dataset. PCA seeks to project the data onto a low-dimensional subspace that captures the highest possible amount of variance in the data; Kernel PCA first embeds the data into a high-dimensional space, called the feature space, where this projection is performed. Dimension reduction cuts the time and storage space required, improves the performance of machine learning models, reduces the computational power needed, and makes the data easier to visualize when reduced to very low dimensions such as 2D or 3D.
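A brief sketch of KPCA on the iris dataset with scikit-learn (kernel choice and gamma are illustrative; the project's own implementation may differ):

```python
# Kernel PCA on the iris dataset, reducing to 2 components for visualization.
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale features before the kernel step

# The RBF kernel implicitly embeds the data in a high-dimensional feature space;
# KernelPCA handles centering in that feature space internally.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (150, 2): easy to plot in 2D
```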
City of Los Angeles 311 Call Center Data Analysis Project
mdmahadihasan-tm
We analyze the dataset with SQL and use Tableau for the visualization.
Anmol-rocket
The Call Center Data Analysis project provides insights into call center performance using Python. It processes call data and generates visualizations to evaluate key metrics, including call volume trends, queue lengths, wait times, and agent efficiency. This project helps call centers optimize their operations by identifying patterns 🚀
adityabhatisaini
No description available
prafullgangrade
This project aims to analyze call center data using Python and its data analysis libraries such as Pandas, NumPy, and Matplotlib. The data set used in this project contains call center metrics.
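A small sketch of the kind of analysis described; the CSV file and column names ('call_start', 'duration_sec', 'agent') are hypothetical, not the project's actual dataset:

```python
# Illustrative call-center analysis with pandas and Matplotlib.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("call_center.csv", parse_dates=["call_start"])  # hypothetical file

# Call volume by hour of day
volume = df.groupby(df["call_start"].dt.hour).size()

# Average handling time per agent
aht = df.groupby("agent")["duration_sec"].mean().sort_values()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
volume.plot.bar(ax=ax1, title="Calls per hour")
aht.plot.barh(ax=ax2, title="Avg handling time (s)")
plt.tight_layout()
plt.show()
```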
Maheshwari1990
Created visualizations for the given call center company project data, using different analyses of KPIs.
Yogesh0318
This project analyzes the performance of a customer call center using a dataset containing 5000 call records. The aim is to evaluate agent efficiency, response time, customer satisfaction, and issue resolution. Various insights are drawn using data analysis and visualization techniques. The analysis helps management identify top-performing agents.
nawalasim
Call Center Data Analysis
Trillionier
No description available
LuisBuruato
No description available
anjot2807
By analyzing this call center dataset, I aim to help improve call center efficiency and customer satisfaction by identifying the best hours for outgoing calls. The insights gained from this analysis can inform scheduling decisions and lead to a more effective allocation of resources within the call center.
Shri-Abhirami-R
Power BI - Dashboard Creation associated with PwC
ManojeetMahato
No description available
sankalp130
This repository contains Tableau dashboard, visualization, and data analysis projects I have created.
Supritam-Pal
No description available
Kaneriadhruv
Improve customer satisfaction rates through call center data analysis
AbhijitWadekar
No description available
AgasteenData
SQL queries and Power BI analysis for sales data insights.
shatabdi0412
In this project we use a dataset from 'Real World Fake Data', import it into MySQL Workbench, clean it, and then analyze it. Finally, we use Tableau to visualize the data.