Found 70 repositories (showing 30)
sayantann11
Classification - Machine Learning

This is the 'Classification' tutorial, which is part of the Machine Learning course offered by Simplilearn. In this tutorial we will learn about classification algorithms, the types of classification algorithms, Support Vector Machines (SVM), Naive Bayes, Decision Trees, and the Random Forest classifier.

Objectives

Let us look at some of the objectives covered under this section of the Machine Learning tutorial:

- Define classification and list its algorithms
- Describe Logistic Regression and sigmoid probability
- Explain K-nearest Neighbors and KNN classification
- Understand Support Vector Machines, the polynomial kernel, and the kernel trick
- Analyze kernel Support Vector Machines with an example
- Implement the Naïve Bayes classifier
- Demonstrate the Decision Tree classifier
- Describe the Random Forest classifier

Classification: Meaning

Classification is a type of supervised learning. It specifies the class to which data elements belong and is best used when the output has finite and discrete values; it predicts a class for an input variable. There are two types of classification: binomial and multi-class.

Classification: Use Cases

Some of the key areas where classification is used:

- To find whether a received email is spam or ham
- To identify customer segments
- To decide whether a bank loan should be granted
- To identify whether a student will pass or fail an examination

Classification: Example

Social media sentiment analysis has two potential outcomes, positive or negative, as displayed by the chart given below.
https://www.simplilearn.com/ice9/free_resources_article_thumb/classification-example-machine-learning.JPG

This chart shows the classification of the Iris flower dataset into its three sub-species, indicated by codes 0, 1, and 2.
https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-flower-dataset-graph.JPG

The test-set dots represent the assignment of new test data points to one class or the other, based on the trained classifier model.

Types of Classification Algorithms

Let's have a quick look at the types of classification algorithms.

Linear models:
- Logistic Regression
- Support Vector Machines

Nonlinear models:
- K-nearest Neighbors (KNN)
- Kernel Support Vector Machines (SVM)
- Naïve Bayes
- Decision Tree classification
- Random Forest classification

Logistic Regression: Meaning

Logistic Regression is a regression model that is used for classification. It is widely used for binary classification problems and can also be extended to multi-class classification problems. Here, the dependent variable is categorical: y ∈ {0, 1}. A binary dependent variable can have only two values, like 0 or 1, win or lose, pass or fail, healthy or sick. In this case, you model the probability that the output y is 1 or 0; this is called the sigmoid probability (σ). If σ(θᵀx) > 0.5, set y = 1; else set y = 0.

Unlike Linear Regression (and its Normal Equation solution), there is no closed-form solution for finding the optimal weights of Logistic Regression. Instead, you must solve it with maximum likelihood estimation, which finds the parameters under which the observed data is most probable. The fitted model can then be used to calculate the probability of a given outcome in a binary setting, like the probability of being classified as sick or of passing an exam.
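The σ(θᵀx) > 0.5 decision rule is easy to see in code. The following is a minimal sketch (not part of the original tutorial) using scikit-learn on a made-up hours-studied dataset; the data and variable names are illustrative assumptions.

# Minimal sketch of the sigmoid decision rule, on invented pass/fail data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Logistic function: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical training data: hours studied -> fail (0) / pass (1)
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Reproduce the sigmoid probability manually: sigma(theta^T x + intercept)
x_new = np.array([[2.5]])
z = x_new @ clf.coef_.T + clf.intercept_
print(sigmoid(z))                # P(y = 1 | x), computed by hand
print(clf.predict_proba(x_new))  # the same probabilities from scikit-learn
print(clf.predict(x_new))        # 1 if sigma > 0.5, else 0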
https://www.simplilearn.com/ice9/free_resources_article_thumb/logistic-regression-example-graph.JPG

Sigmoid Probability

The probability in logistic regression is often represented by the sigmoid function (also called the logistic function or the S-curve):
https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-function-machine-learning.JPG

In this equation, t represents the data value (the number of hours studied, in this example) and S(t) represents the probability of passing the exam. Assume the sigmoid function:
https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-probability-machine-learning.JPG

g(z) tends toward 1 as z → ∞, and g(z) tends toward 0 as z → −∞.

K-nearest Neighbors (KNN)

The K-nearest Neighbors algorithm assigns a data point to a class based on a similarity measurement. It is a supervised method for classification. The steps of the KNN algorithm are given below:
https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-distribution-graph-machine-learning.JPG

1. Choose the number k and a distance metric (k = 5 is common).
2. Find the k nearest neighbors of the sample that you want to classify.
3. Assign the class label by majority vote.

KNN Classification

A new input point is classified into the category from which it has the largest number of neighbors. For example:
https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-classification-machine-learning.JPG

- Classify a patient as high risk or low risk.
- Mark an email as spam or ham.
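The three steps above translate almost line for line into NumPy. Here is a minimal from-scratch sketch (not the tutorial's code); the toy data and k value are assumptions for illustration.

# From-scratch KNN in NumPy: distances, k nearest neighbors, majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 1: Euclidean distance from x_new to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative data: two blobs labeled 0 (low risk) and 1 (high risk)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

print(knn_predict(X, y, np.array([3.5, 3.5]), k=5))  # expected: 1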
Support Vector Machine (SVM)

SVMs are classification algorithms used to assign data to various classes. They work by detecting hyperplanes that segregate the data into classes. SVMs are very versatile and are capable of performing linear or nonlinear classification, regression, and outlier detection. Once ideal hyperplanes are discovered, new data points can be easily classified.
https://www.simplilearn.com/ice9/free_resources_article_thumb/support-vector-machines-graph-machine-learning.JPG

The optimization objective is to find the "maximum-margin hyperplane", the one farthest from the closest points in the two classes (these points are called support vectors). In the given figure, the middle line represents the hyperplane.

SVM Example

Let's look at the image below to get an idea of SVM in general. Hyperplanes with larger margins have lower generalization error. The positive and negative hyperplanes are represented by:
https://www.simplilearn.com/ice9/free_resources_article_thumb/positive-negative-hyperplanes-machine-learning.JPG

Classification of any new input sample x_test:

- If w₀ + wᵀx_test > 1, the sample x_test lies in the class to the right of the positive hyperplane.
- If w₀ + wᵀx_test < −1, the sample x_test lies in the class to the left of the negative hyperplane.

When you subtract the two equations, you get:
https://www.simplilearn.com/ice9/free_resources_article_thumb/equation-subtraction-machine-learning.JPG

The length of the vector w (its L2 norm) is:
https://www.simplilearn.com/ice9/free_resources_article_thumb/length-of-vector-machine-learning.JPG

You normalize by the length of w to arrive at:
https://www.simplilearn.com/ice9/free_resources_article_thumb/normalize-equation-machine-learning.JPG

SVM: Hard Margin Classification

The left side of equation SVM-1 above can be interpreted as the distance between the positive and negative hyperplanes: in other words, the margin to be maximized. Hence the objective is to maximize the margin under the constraint that the samples are classified correctly, which is represented as:
https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-machine-learning.JPG

This means that you are minimizing ‖w‖. It also means that all positive samples are on one side of the positive hyperplane and all negative samples are on the other side of the negative hyperplane. This can be written concisely as:
https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-formula.JPG

Minimizing ‖w‖ is the same as minimizing ½‖w‖², and the squared form is better because it is differentiable even at w = 0. The approach listed above is called the "hard margin linear SVM classifier."

SVM: Soft Margin Classification

To allow the linear constraints to be relaxed for nonlinearly separable data, a slack variable ξ⁽ⁱ⁾ is introduced; it measures how much the i-th instance is allowed to violate the margin. The slack variable is simply added to the linear constraints:
https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-machine-learning.JPG

Subject to the above constraints, the new objective to be minimized becomes:
https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-formula.JPG

You now have two conflicting objectives: minimizing the slack variables to reduce margin violations, and minimizing ½‖w‖² to increase the margin. The hyperparameter C defines this trade-off. Large values of C correspond to larger error penalties (and smaller margins), whereas smaller values of C allow more misclassification errors and larger margins.

SVM: Regularization

The concept of C is the reverse of regularization: higher C means lower regularization, which lowers the bias and raises the variance (risking overfitting).
https://www.simplilearn.com/ice9/free_resources_article_thumb/concept-of-c-graph-machine-learning.JPG

IRIS Data Set

The Iris dataset contains measurements of 150 Iris flowers from three different species: Setosa, Versicolor, and Virginica. Each row represents one sample, and the flower measurements in centimeters are stored as columns; these are called features.

IRIS Data Set: SVM

Let's train an SVM model using scikit-learn for the Iris dataset:
https://www.simplilearn.com/ice9/free_resources_article_thumb/svm-model-graph-machine-learning.JPG
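The tutorial's training code is only available as an image, so here is a hedged sketch of what such a scikit-learn run can look like, including the C trade-off described above; the choice of petal features and the C values are illustrative assumptions.

# Sketch: soft-margin SVM on Iris with scikit-learn, varying C.
# The petal features and C values are illustrative assumptions.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, 2:4], iris.target  # petal length and petal width
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

for C in (0.01, 1.0, 100.0):  # small C: wide margin; large C: few violations
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    clf.fit(X_train, y_train)
    print(f"C={C}: test accuracy = {clf.score(X_test, y_test):.3f}")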
Nonlinear SVM Classification

There are two ways to handle nonlinear data with SVMs: by adding polynomial features, or by adding similarity features. Polynomial features can be added to a dataset; in some cases, this creates a linearly separable dataset.
https://www.simplilearn.com/ice9/free_resources_article_thumb/nonlinear-classification-svm-machine-learning.JPG

In the figure on the left, there is only one feature, x₁, and the dataset is not linearly separable. If you add x₂ = (x₁)² (figure on the right), the data becomes linearly separable.

Polynomial Kernel

In scikit-learn, one can use the Pipeline class for creating polynomial features. Classification results for the Moons dataset are shown in the figure.
https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-machine-learning.JPG

Polynomial Kernel with Kernel Trick

Let us look at the image below and understand the kernel trick in detail.
https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-with-kernel-trick.JPG

For high-dimensional datasets, adding too many polynomial features can slow the model down. You can apply the kernel trick to get the effect of polynomial features without actually adding them. The code shown below (using the SVC class) trains an SVM classifier with a 3rd-degree polynomial kernel via the kernel trick.
https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-equation-machine-learning.JPG

The hyperparameter coef0 controls the influence of the high-degree polynomial terms.

Kernel SVM

Kernel SVMs are used for the classification of nonlinear data. In the chart, nonlinear data is projected into a higher-dimensional space via a mapping function, where it becomes linearly separable.
https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-machine-learning.JPG

In the higher dimension, a linear separating hyperplane can be derived and used for classification. A reverse projection from the higher dimension back to the original feature space takes it back to a nonlinear shape. As mentioned previously, SVMs can be kernelized to solve nonlinear classification problems.

You can create a sample dataset for the XOR gate (a nonlinear problem) with NumPy: 100 samples are assigned the class label 1, and 100 samples the class label −1.
https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-graph-machine-learning.JPG

As you can see, this data is not linearly separable.
https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-non-separable.JPG

You now use the kernel trick to classify the XOR dataset created earlier.
https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-xor-machine-learning.JPG
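The tutorial's code exists only as images, so here is a hedged sketch of the two ideas: a 3rd-degree polynomial kernel via SVC, and an RBF kernel classifying XOR-style data. The degree, coef0, C, and gamma values are illustrative assumptions.

# Sketch: the kernel trick with scikit-learn's SVC on XOR-style data.
# Hyperparameter values (degree, coef0, C, gamma) are illustrative.
import numpy as np
from sklearn.svm import SVC

# XOR dataset: label 1 where exactly one coordinate is positive, else -1
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = np.where(np.logical_xor(X[:, 0] > 0, X[:, 1] > 0), 1, -1)

# (a) degree-3 polynomial kernel: polynomial features without creating them
poly_svm = SVC(kernel="poly", degree=3, coef0=1, C=5).fit(X, y)
print("poly kernel accuracy:", poly_svm.score(X, y))

# (b) RBF (Gaussian) kernel, the usual choice for this kind of data
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)
print("rbf kernel accuracy:", rbf_svm.score(X, y))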
Naïve Bayes Classifier

Have you ever wondered how your mail provider implements spam filtering, how online news channels perform news text classification, or how companies perform sentiment analysis of their audience on social media? All of this and more can be done with a machine learning algorithm called the Naive Bayes classifier, named after Thomas Bayes, the 18th-century statistician whose theorem it builds on. The Naive Bayes classifier works on the principle of conditional probability as given by Bayes' theorem.

Advantages of Naive Bayes Classifier

Listed below are six benefits of the Naive Bayes classifier:

- Very simple and easy to implement
- Needs less training data
- Handles both continuous and discrete data
- Highly scalable with the number of predictors and data points
- Fast, so it can be used for real-time predictions
- Not sensitive to irrelevant features

Bayes Theorem

According to the Bayes model, the conditional probability P(Y|X) can be calculated as:

P(Y|X) = P(X|Y)P(Y) / P(X)

Estimated directly, this requires a very large number of P(X|Y) probabilities even for a modest feature vector X. For example, for a Boolean Y and 30 Boolean attributes in the X vector, you would have to estimate about 2 × 2³⁰ ≈ 2 billion probabilities P(X|Y). To make this practical, the Naïve Bayes classifier assumes that the features of X are conditionally independent of each other given the value of Y. This reduces the number of probability estimates to 2 × 30 = 60 in the above example.

Naïve Bayes Classifier for SMS Spam Detection

Consider a labeled SMS database containing 5,574 messages, such as those given below:
https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-machine-learning.JPG

Each message in the dataset is marked as spam or ham. Let's train a model with the Naïve Bayes algorithm to detect spam. The message lengths and their frequency in the training dataset are shown below:
https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-spam-detection.JPG

The logic used to train the spam detector:

1. Split each message into individual words/tokens (bag of words).
2. Lemmatize the data (each word is reduced to its base form; "walking" or "walked" becomes "walk").
3. Convert the data to vectors using scikit-learn's CountVectorizer.
4. Apply TF-IDF weighting to down-weight very common words like "is," "are," and "and."
5. Apply scikit-learn's MultinomialNB (Naïve Bayes) module to get the spam detector.

This spam detector can then be used to classify a random new message as spam or ham. A hedged sketch of these steps follows.
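The five steps above map directly onto a scikit-learn pipeline. This sketch omits lemmatization for brevity, and the tiny message list is invented for illustration; a real run would use the 5,574-message corpus.

# Sketch of the spam-detection pipeline: CountVectorizer -> TF-IDF -> MultinomialNB.
# The message list is invented; lemmatization is omitted for brevity.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "WINNER!! Claim your free prize now",
    "Are we still meeting for lunch today?",
    "URGENT! Your account was selected for a cash reward",
    "Don't forget to bring the report tomorrow",
]
labels = ["spam", "ham", "spam", "ham"]

spam_detector = make_pipeline(
    CountVectorizer(),   # steps 1 and 3: tokenize and build count vectors
    TfidfTransformer(),  # step 4: down-weight very common words
    MultinomialNB(),     # step 5: the Naive Bayes classifier
)
spam_detector.fit(messages, labels)

print(spam_detector.predict(["free cash prize, claim now"]))  # likely ['spam']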
Next, the accuracy of the spam detector is checked using the confusion matrix. For the SMS spam example above, the confusion matrix is shown on the right.

Accuracy rate = correct / total = (4827 + 592) / 5574 = 97.21%
Error rate = wrong / total = (155 + 0) / 5574 = 2.78%
https://www.simplilearn.com/ice9/free_resources_article_thumb/confusion-matrix-machine-learning.JPG

Although the confusion matrix is useful, more precise metrics are provided by precision and recall.
https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-recall-matrix-machine-learning.JPG

Precision refers to the accuracy of positive predictions.
https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-formula-machine-learning.JPG

Recall refers to the ratio of positive instances that are correctly detected by the classifier (also known as the true positive rate, or TPR).
https://www.simplilearn.com/ice9/free_resources_article_thumb/recall-formula-machine-learning.JPG

Precision/Recall Trade-off

To detect age-appropriate videos for kids, you need high precision (at the cost of recall) to ensure that only safe videos make the cut, even though a few safe videos may be left out. High recall (with lower precision being acceptable) is needed for store surveillance to catch shoplifters: a few false alarms are acceptable, but all shoplifters must be caught.

Decision Tree Classifier

Decision Trees (DT) can be used both for classification and regression. An advantage of decision trees is that they require very little data preparation: no feature scaling or centering at all. They are also the fundamental components of Random Forests, one of the most powerful ML algorithms. Unlike Random Forests and neural networks (which do black-box modeling), Decision Trees are white-box models, meaning their inner workings can be clearly understood. In the case of classification, the data is segregated based on a series of questions, and any new data point is assigned to the selected leaf node.
https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-machine-learning.JPG

Start at the tree root and split the data on the feature that results in the largest information gain (IG). This splitting procedure is repeated iteratively at each child node until the leaves are pure, meaning the samples at each node all belong to the same class. In practice, you can set a limit on the depth of the tree to prevent overfitting; purity is compromised here, as the final leaves may still have some impurity. The figure shows the classification of the Iris dataset.
https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-graph.JPG

IRIS Decision Tree

Let's build a Decision Tree using scikit-learn for the Iris flower dataset and visualize it using the export_graphviz API.
https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-machine-learning.JPG

The output of export_graphviz can be converted into PNG format:
https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-output.JPG

- The samples attribute is the number of training instances the node applies to.
- The value attribute gives the number of training instances of each class the node applies to.
- Gini impurity measures the node's impurity: a node is "pure" (gini = 0) if all training instances it applies to belong to the same class.

https://www.simplilearn.com/ice9/free_resources_article_thumb/impurity-formula-machine-learning.JPG

For example, for the Versicolor node (green), the Gini impurity is 1 − (0/54)² − (49/54)² − (5/54)² ≈ 0.168.
https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-sample.JPG

Decision Boundaries

For the first node (depth 0), the solid line splits the data (Iris Setosa on the left). Gini is 0 for the Setosa node, so no further split is possible there. The second node (depth 1) splits the data into Versicolor and Virginica. If max_depth were set to 3, a third split would happen (the vertical dotted line).
https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-boundaries.JPG

For a sample with petal length 5 cm and petal width 1.5 cm, the tree traverses to the depth-2 left node, so the probability predictions for this sample are 0% for Iris Setosa (0/54), 90.7% for Iris Versicolor (49/54), and 9.3% for Iris Virginica (5/54).
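A hedged sketch of the Iris tree and the export_graphviz call described above; the max_depth value and output filename are illustrative choices, not the tutorial's exact code.

# Sketch: depth-limited Decision Tree on Iris, exported for Graphviz.
# max_depth=2 and the output filename are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target  # petal length and width, as in the figures

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=["petal length (cm)", "petal width (cm)"],
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)
# Convert to PNG with the Graphviz dot tool: dot -Tpng iris_tree.dot -o iris_tree.png

# The depth-2 left-node probabilities quoted above (0/54, 49/54, 5/54):
print(tree_clf.predict_proba([[5.0, 1.5]]))  # approx. [[0.0, 0.907, 0.093]]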
CART Training Algorithm

Scikit-learn uses the Classification and Regression Trees (CART) algorithm to train Decision Trees. CART splits the data into two subsets using a single feature k and a threshold t_k (for example, petal length ≤ 2.45 cm), and this is done recursively at each node. k and t_k are chosen such that they produce the purest subsets (weighted by their size). The objective is to minimize the cost function given below:
https://www.simplilearn.com/ice9/free_resources_article_thumb/cart-training-algorithm-machine-learning.JPG

The algorithm stops when one of the following occurs:

- max_depth is reached
- no further split can be found for a node

Other hyperparameters may be used to stop the tree growing: min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes.

Gini Impurity or Entropy

Entropy is another measure of impurity and can be used in place of Gini.
https://www.simplilearn.com/ice9/free_resources_article_thumb/gini-impurity-entrophy.JPG

Entropy is a measure of uncertainty, and information gain is the reduction in entropy as one traverses down the tree. Entropy is zero for a DT node when the node contains instances of only one class. The entropy for the depth-2 left node in the example given above is:
https://www.simplilearn.com/ice9/free_resources_article_thumb/entrophy-for-depth-2.JPG

Gini and entropy both lead to similar trees.

DT: Regularization

The following figure shows two decision trees trained on the Moons dataset.
https://www.simplilearn.com/ice9/free_resources_article_thumb/dt-regularization-machine-learning.JPG

The decision tree on the right is restricted by min_samples_leaf = 4. The model on the left is overfitting, while the model on the right generalizes better.

Random Forest Classifier

A random forest can be considered an ensemble of decision trees (ensemble learning). The Random Forest algorithm (a hedged scikit-learn sketch follows the key takeaways below):

1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set).
2. Grow a decision tree from the bootstrap sample. At each node, randomly select d features and split the node using the feature that provides the best split according to the objective function, for instance by maximizing the information gain.
3. Repeat steps 1 and 2 k times (k is the number of trees you want to create, each built from a subset of the samples).
4. Aggregate the predictions of the trees for a new data point and assign the class label by majority vote (pick the class selected by the largest number of trees).

Random Forests are opaque, which means it is difficult to visualize their inner workings.
https://www.simplilearn.com/ice9/free_resources_article_thumb/random-forest-classifier-graph.JPG

However, the advantages outweigh this limitation, since you do not have to worry about many hyperparameters beyond k, the number of decision trees to create. Random Forests are quite robust to noise from the individual decision trees, so you need not prune the individual trees. The larger the number of decision trees, the more accurate the Random Forest prediction tends to be (this, however, comes with higher computation cost).

Key Takeaways

Let us quickly run through what we have learned in this Classification tutorial:

- Classification algorithms are supervised learning methods that split data into classes; they can work on linear as well as nonlinear data.
- Logistic Regression classifies data using weighted parameters and a sigmoid transformation that yields class probabilities.
- The K-nearest Neighbors (KNN) algorithm uses feature similarity to classify data.
- Support Vector Machines (SVMs) classify data by finding the maximum-margin hyperplane between the data classes.
- Naïve Bayes, a simplified Bayes model, classifies data using conditional probability.
- Decision Trees are powerful classifiers that apply tree-splitting logic until pure (or nearly pure) leaf-node classes are attained.
- Random Forests apply ensemble learning to Decision Trees for more accurate classification predictions.

Conclusion

This completes the 'Classification' tutorial. In the next tutorial, we will learn 'Unsupervised Learning with Clustering.'
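As promised above, a hedged sketch of the Random Forest recipe in scikit-learn; n_estimators plays the role of k and max_features the role of d, and the specific values are illustrative assumptions.

# Sketch: Random Forest as an ensemble of bootstrapped decision trees.
# n_estimators corresponds to k and max_features to d in the recipe above.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # k: number of trees
    max_features="sqrt",  # d: features considered at each split
    bootstrap=True,       # each tree sees a bootstrap sample
    random_state=42,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))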
piyushpathak03
Recommendation Systems. This is a workshop on using machine learning and deep learning techniques to build recommendation systems.

Theory: ML & DL formulation, prediction vs. ranking, similarity, biased vs. unbiased
Paradigms: Content-based, collaborative filtering, knowledge-based, hybrid and ensembles
Data: Tabular, images, text (sequences)
Models: (Deep) matrix factorisation, auto-encoders, wide & deep, rank-learning, sequence modelling
Methods: Explicit vs. implicit feedback, user-item matrix, embeddings, convolution, recurrent
Domain signals: location, time, context, social
Process: Setup, encode & embed, design, train & select, serve & scale, measure, test & improve
Tools: python-data-stack: numpy, pandas, scikit-learn, keras, spacy, implicit, lightfm

Notes & Slides: Basics: Deep Learning AI Conference 2019: WhiteBoard Notes | In-Class Notebooks

Notebooks:
Movies - Movielens: 01-Acquire, 02-Augment, 03-Refine, 04-Transform, 05-Evaluation, 06-Model-Baseline, 07-Feature-extractor, 08-Model-Matrix-Factorization, 09-Model-Matrix-Factorization-with-Bias, 10-Model-MF-NNMF, 11-Model-Deep-Matrix-Factorization, 12-Model-Neural-Collaborative-Filtering, 13-Model-Implicit-Matrix-Factorization, 14-Features-Image, 15-Features-NLP
Ecommerce - YooChoose: 01-Data-Preparation, 02-Models
News - Hackernews
Product - Groceries

Python Libraries:
Deep recommender libraries: Tensorrec (built on TensorFlow), Spotlight (built on PyTorch), TFRanking (built on TensorFlow, for learning to rank)
Matrix-factorisation-based libraries: Implicit (implicit matrix factorisation), QMF (implicit matrix factorisation), LightFM (for hybrid recommendations), Surprise (scikit-learn-style API for traditional algorithms)
Similarity search libraries: Annoy (approximate nearest neighbour), NMSLib (kNN methods), FAISS (similarity search and clustering)

Learning Resources:
Reference slides: Deep Learning in RecSys by Balázs Hidasi; Lessons from Industry RecSys by Xavier Amatriain; Architecting Recommendation Systems by James Kirk; Recommendation Systems Overview by Raimon and Basilico
Benchmarks: MovieLens benchmarks for the traditional setup; Microsoft tutorial on recommendation systems at KDD 2019
Algorithms & approaches: Collaborative Filtering for Implicit Feedback Datasets; Bayesian Personalised Ranking for Implicit Data; Logistic Matrix Factorisation; Neural Network Matrix Factorisation; Neural Collaborative Filtering; Variational Autoencoders for Collaborative Filtering
Evaluation: Evaluating Recommendation Systems
creinders
Clustering algorithms (Mean shift and K-Means) from scratch in NumPy, PyTorch, TensorFlow, and JAX
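As a taste of what "from scratch in NumPy" typically means for K-Means, here is a hedged sketch (not this repository's code); the initialization scheme, iteration count, and toy data are assumptions.

# Hedged sketch of K-Means from scratch in NumPy: assign each point to the
# nearest centroid, then move each centroid to its cluster's mean.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # distances from every point to every centroid: shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)  # assignment step
        # update step (note: this sketch does not handle empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Three illustrative blobs
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
print(centroids)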
Samxx97
An implementation of fuzzy clustering algorithms in NumPy
n0obcoder
Implementation of some of the most-used clustering algorithms from scratch (using only NumPy)
mithunjmistry
An original hierarchical clustering implementation that does not use scikit-learn's built-in function. A dynamic programming approach is used, with NumPy for matrix operations.
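For reference, here is a minimal single-linkage agglomerative clustering sketch in NumPy. It is an assumption of what such an implementation can look like, not this repository's dynamic-programming approach.

# Naive single-linkage agglomerative clustering in NumPy (illustrative sketch).
import numpy as np

def single_linkage(X, n_clusters):
    clusters = [[i] for i in range(len(X))]                  # every point alone
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest minimum inter-point distance
        best, pair = np.inf, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 6])
print(single_linkage(X, n_clusters=2))  # two groups of point indices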
ryilkici
Kajoo is a practical interface for machine learning algorithms. The aim of this application is to make machine learning algorithms such as clustering and regression easy to use in practice. The interface is written with the Python PyQt5 and PySide2 packages, and the backend functions are written with the NumPy, Pandas, and scikit-learn packages.
A project on analyzing the shifting trends in research on the CSR activities of businesses and firms. The project includes topic modeling and clustering to extract the underlying subtopics of CSR, and time series analysis of changing trends since 1962. The project gives businesses insight into the important CSR areas to focus on, leading to better public recognition of firms. Python libraries used: Matplotlib, Gensim, Natural Language Toolkit, pyLDAvis, numpy, scikit-learn, networkx. DBMS: MongoDB. Algorithms: hierarchical single-linkage clustering to estimate the optimal number of topics, Latent Dirichlet Allocation for topic modeling.
Bharathi-A-7
Implemented in Python, the project uses an unsupervised learning model to classify customers' transaction data into clusters based on similarity. The project includes exploratory data analysis, cohort analysis to study people belonging to different cohorts, RFM analysis to dig deeper into purchasing patterns and retention, association mining, and, most importantly, clustering of customers using the K-Means algorithm. Libraries such as Pandas, NumPy, Matplotlib, and Seaborn were used to handle the data and aid the visualizations.
VireshAmbardar
Agriculture depends largely on the nature of the soil and on climatic conditions, and we often face unpredictable changes in climate such as non-seasonal rainfall, heat waves, or fluctuations in humidity, all of which cause great losses to farmers, who are then unable to utilize their agricultural land to its fullest. To address this, I have built a machine learning model to help farmers optimize agricultural production: the predictive model helps them understand which crop is best suited for harvest given a particular soil and climatic condition. Seven key factors are taken into account to determine exactly which crop should be grown and at what period of time: the amounts of nitrogen, phosphorus, and potassium in the soil, temperature in degrees Celsius, humidity, pH, and rainfall in mm. Tools used: Python & Jupyter Notebook. Libraries used: NumPy, Pandas, Seaborn, Matplotlib, ipywidgets, and sklearn. Machine learning algorithms used: clustering analysis and logistic regression.
ANALYZING ROAD SAFETY & TRAFFIC DEMOGRAPHICS IN THE UK (Multi-class Classification)

SUMMARY: Here, I aim to analyze the Road Safety and Traffic Demographics dataset (UK), containing accidents reported by the police between the years 2004 and 2017.

PROJECT GOALS: Identify the factors responsible for most of the reported accidents. Build a machine learning model capable of accurately predicting the severity of an accident. Provide recommendations to the Department for Transport (UK Government) to improve road safety policies and prevent recurrences of severe accidents where possible.

PACKAGES USED: scikit-learn, numpy, pandas, imblearn (imbalanced-learn), seaborn, matplotlib

MOTIVATION: The World Health Organization (WHO) reports that more than 1.25 million people die each year, and 50 million are injured, as a result of road accidents worldwide. Road accidents are the 10th leading cause of death globally; on current trends, they will become the 7th leading cause of death by 2030, making them a major public health concern. Between 2005 and 2016 there were roughly 2 million road accidents reported in the United Kingdom (UK) alone, of which 16,000 were fatal. As a big data project, I wanted to explore the traffic demographics data in greater detail using machine learning.

CONTEXT: The UK government amassed traffic data from 2004 to 2017, recording over 2 million accidents in the process and making this one of the most comprehensive traffic datasets available. It's a huge picture of a country undergoing change. Note that all the accident data comes from police reports, so this dataset does not include minor incidents. For the steps undertaken to pre-process and clean the data, please view the "Data Cleansing & Descriptive Analysis_UK Traffic Demographics.ipynb" file.

DESCRIPTIVE ANALYTICS (EDA): Tools used include Python, Tableau, and MS PowerBI.

[Figure: percent distribution of the target classes (accident severity)] As seen above, the data is highly imbalanced. For the detailed steps undertaken to deal with the imbalanced data, please view the "Modelling_Predictive Analytics_UK Traffic Demographics.ipynb" file. One linked article provides tips on choosing the correct performance metrics for a model trained on an imbalanced dataset; another describes several strategies to combat a severely imbalanced dataset, including resampling strategies (undersampling with Tomek Links or Cluster Centroids, oversampling with SMOTE), decision-tree-based models, and cost-sensitive training (penalizing algorithms). A hedged resampling sketch is shown at the end of this section.

[Figure: number of accidents by year and accident severity] The trend increases over the years. The spike between 2008 and 2009 reflects an enhancement to the reporting system introduced in the UK in 2009, whereby all accidents, including minor ones, had to be reported by the police so that the counts would match those from hospitals, insurance claims, etc.

[Figure: accident density by location (geomap)] Most accidents took place in major cities: Birmingham, London, Leeds, Newcastle.

[Figure: accidents by gender and age]

[Figure: accidents by day of the week and year] Most accidents take place on a Friday.

[Figure: vehicle manoeuvre at time of accident] Most accidents take place as a result of overtaking.

For more findings, please go to the "Images" folder.
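A hedged sketch of one of the resampling strategies mentioned above, SMOTE oversampling with imbalanced-learn; the synthetic data here merely stands in for the accident table.

# Sketch: oversampling minority classes with SMOTE (imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced stand-in for the severity classes (e.g., slight / serious / fatal)
X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           weights=[0.85, 0.12, 0.03], random_state=42)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes balanced via synthetic minority samples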
For the steps undertaken to carry out predictive modeling and hyper-parameter tuning, please view the "Modelling_Predictive Analytics_UK Traffic Demographics.ipynb" file.

RECOMMENDATIONS TO THE DEPARTMENT FOR TRANSPORT (UK): Decrease emergency response times during afternoon rush hours (15:00-19:00), especially on Fridays. Allocate resources to investigate high-density traffic points and identify new infrastructure needs to divert traffic from dual carriageways. Explore the conditions of vehicles and casualties, such as vehicle type, age of registered vehicles, and pedestrian movements, for policy makers. Adopt comprehensive distracted-driving laws that increase penalties for drivers who commit traffic violations like aggressive overtaking.

ACKNOWLEDGEMENTS: The license for this dataset is the Open Government Licence used by all data on data.gov.uk. The raw datasets are available from the UK Department for Transport website. I had a lot of fun working on this dataset and learned a lot in the process. I plan to further my research in the area of predictive modeling with imbalanced data and how to effectively build a highly robust model for future projects.
Shikha18Shukla
A collection of Machine Learning algorithm implementations in Python using libraries like scikit-learn, pandas, numpy, matplotlib, and seaborn. Covers Regression, Classification, Clustering, and Reinforcement Learning with practical datasets like Titanic, Boston Housing, Iris, and Mall Customers.
akashjborah97
Target: to analyze the ML job market in India using segmentation analysis, finding the companies most likely to hire an ML engineer or data analyst given his/her skill set. Techniques and algorithms used: machine learning in Python with libraries (numpy, pandas, scikit-learn, matplotlib), the elbow method, stability-based structure analysis, and k-means clustering. In this project, we took a dataset from the website naukri.com, analysed the skills and companies, and performed segmentation using a clustering algorithm. Results: segmentation analysis is an important step before embarking on any plan, so it is important to learn how to analyze the job market and the skills demanded by companies. By analyzing the trend, we observed that cluster 0 contains companies inclined towards hiring people with Python skills in data science and machine learning; cluster 1 contains companies likely to hire people whose skills are not oriented towards data analysis; cluster 2 contains companies inclined towards hiring people with Python and R skills in data science; cluster 3 contains companies inclined towards hiring people with Python skills in machine learning; cluster 4 contains companies likely to hire people whose skills are not oriented towards data analysis; and cluster 5 contains companies likely to hire people with Python and machine learning skills and minimal data science. The skills most demanded by recruiters are Python, data science, machine learning, and other IT skills. For the analysis of companies based on the experience demanded, it was observed that Wipro, HiringSign, Global Logic, Gojek, etc. did not appear in the top ranks before segmentation but did appear after segmentation was carried out on the minimum, average, and maximum experience data.
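A hedged sketch of the elbow method mentioned above: plot KMeans inertia against k and pick the bend. The blob data stands in for the naukri.com dataset, and the range of k is an illustrative choice.

# Sketch: the elbow method for choosing k in K-Means.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=6, random_state=42)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.title("Elbow method")
plt.show()  # the 'elbow' in the curve suggests a suitable k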
ChetanPatil28
Using a K-means clustering algorithm built from scratch in NumPy to segment images
k-k-yathin
Implementation of the K-Means clustering algorithm from scratch in Python using NumPy, with visualization using Matplotlib.
samueljcatania
Supervised K-NN classification and unsupervised K-Means clustering machine learning algorithms built from scratch in Python using NumPy and Matplotlib.
ananyamohapatra20
Implementation of the K-Means Clustering and Principal Component Analysis algorithms from scratch in Python, using NumPy and Pandas, with Matplotlib for visualization.
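A hedged sketch of what PCA from scratch in NumPy typically looks like (illustrative only, not this repository's code): center the data, eigendecompose the covariance matrix, and project onto the top components.

# PCA from scratch in NumPy (illustrative sketch).
import numpy as np

def pca(X, n_components=2):
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort descending by variance
    components = eigvecs[:, order[:n_components]]
    return X_centered @ components          # projected data

X = np.random.randn(200, 5) @ np.random.randn(5, 5)  # correlated toy data
print(pca(X, n_components=2).shape)  # (200, 2)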
sreekaryerragunta
A comprehensive collection of unsupervised learning algorithms implemented from scratch in Python with NumPy, demonstrating deep understanding of clustering, dimensionality reduction, and pattern discovery techniques.
AshishSharma2894
Here I implemented the K-Means and K-Median clustering algorithms without using any built-in library except NumPy, and used Pandas only for data upload.
bhagyavansh
A curated portfolio of 15 hands-on deep learning projects from my 4th semester, implementing core algorithms like Perceptrons, RNNs, LSTMs, CNNs, MLPs, and K-Means clustering. Demonstrates practical skills in Python, TensorFlow/Keras, NumPy, and Scikit-learn to solve logic gates, sequence prediction, text generation, and clustering problems.
PanagiwthsPapadopoulos
🖧 Algorithm for maximizing the Distance Quality Function (Q_d) in graph clustering. 📏 Optimizes partitions by minimizing intra-cluster distances and maximizing expected distances. Features 🏆 proto-cluster initialization, 🔗 cluster expansion, and 🎨 visualizations. Built with Python, networkx, numpy, and matplotlib. 🚀
LamimZakirPronay
In this project, we will use the Python libraries NumPy and Scikit-learn to implement a KMeans clustering algorithm. The simulated data will have only three clusters, which the clustering algorithm will identify. To get started, make sure you are using Python 2.7 or above by evaluating the following cell. A hedged sketch of the idea appears below.
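A hedged sketch of the setup this project describes: simulate three clusters and recover them with scikit-learn's KMeans. The sample counts and random seeds are illustrative assumptions.

# Sketch: simulate three clusters and recover them with KMeans.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=7)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("recovered centers:\n", kmeans.cluster_centers_)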
Sadanand-666
Completed the Data Science and Machine Learning program at Zalimaa Development: • Gained hands-on experience in data preprocessing, feature engineering, and statistical analysis. • Built and evaluated machine learning models including regression, classification, and clustering algorithms. • Applied Python libraries such as Pandas, NumPy, Matplotlib
miryeganeh
This repository contains a set of machine learning algorithms (classification and clustering) implemented from scratch, without the help of high-level libraries such as scikit-learn or even Pandas. In these algorithms, I've tried to use NumPy's vectorized operations to make the computations more efficient. These algorithms are the result of my tutoring classes with my students from Harvard University, UCLA, and other universities.
NGDSystems
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
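A hedged sketch of basic Faiss usage, an exact L2 search with IndexFlatL2; the dimensionality and dataset sizes are illustrative assumptions, not part of the description above.

# Sketch: exact nearest-neighbour search with Faiss's IndexFlatL2.
import numpy as np
import faiss

d = 64                                                # vector dimensionality
xb = np.random.random((10000, d)).astype("float32")   # database vectors
xq = np.random.random((5, d)).astype("float32")       # query vectors

index = faiss.IndexFlatL2(d)   # brute-force L2 index (exact results)
index.add(xb)                  # index the database
D, I = index.search(xq, 4)     # 4 nearest neighbours per query
print(I)                       # neighbour indices
print(D)                       # squared L2 distances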
noyshabtay
An implementation of the spectral clustering algorithm on data from the sklearn.datasets.make_blobs API, using the C language and a variety of Python packages such as NumPy, the C API, Matplotlib, and more. Additional info on the k-means implementation can be found in my k-means repository.
ngochuyenvu165
This project provides a step-by-step guide on applying the K-Means clustering algorithm and the RFM (Recency, Frequency, Monetary) model to segment customers and identify their behaviors. The project includes code examples in Python, using popular libraries like Pandas, NumPy, Scikit-learn, and Plotly
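A hedged sketch of the RFM-plus-KMeans idea this project walks through; the transaction table and column names are invented for illustration, not taken from the project.

# Sketch: build an RFM table with pandas, then segment with KMeans.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "invoice_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                    "2024-02-20", "2024-03-05", "2023-12-01"]),
    "amount": [50.0, 20.0, 200.0, 150.0, 80.0, 10.0],
})
snapshot = tx["invoice_date"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("customer_id").agg(
    recency=("invoice_date", lambda d: (snapshot - d.max()).days),
    frequency=("invoice_date", "count"),
    monetary=("amount", "sum"),
)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(rfm))
print(rfm.assign(segment=labels))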
sambit0000
A complete customer purchasing behaviour analysis was done on a retail dataset. Initially, data cleaning and EDA were done. Each customer was grouped into a cohort to analyze retention rates. An RFM model was created based on recency, frequency, and monetary values; RFM scores were given to each customer, and each customer was grouped into one of 4 clusters (high value, medium value, etc.) based on their RFM scores. Finally, customer segmentation was done using the K-means clustering algorithm: 5 clusters were selected using the elbow method, and customers were assigned to one of the clusters. Additionally, charts were made in Tableau to visualize the results better. The numpy, pandas, matplotlib, seaborn, and sklearn libraries were used in the project.
Sweeteally
Are you ready to start your path to becoming a Data Scientist! This comprehensive course will be your guide to learning how to use the power of Python to analyze data, create beautiful visualizations, and use powerful machine learning algorithms! Data Scientist has been ranked the number one job on Glassdoor, and the average salary of a data scientist is over $120,000 in the United States, according to Indeed! Data science is a rewarding career that allows you to solve some of the world's most interesting problems! This course is designed both for beginners with some programming experience and for experienced developers looking to make the jump to data science! This comprehensive course is comparable to other data science bootcamps that usually cost thousands of dollars, but now you can learn all that information at a fraction of the cost! With over 100 HD video lectures and detailed code notebooks for every lecture, this is one of the most comprehensive courses for data science and machine learning on Udemy! We'll teach you how to program with Python, how to create amazing data visualizations, and how to use machine learning with Python! Here are just a few of the topics we will be learning: programming with Python; NumPy with Python; using pandas DataFrames to solve complex tasks; using pandas to handle Excel files; web scraping with Python; connecting Python to SQL; using matplotlib and seaborn for data visualizations; using Plotly for interactive visualizations; and machine learning with scikit-learn, including linear regression, K-nearest neighbors, K-means clustering, decision trees, random forests, natural language processing, neural nets and deep learning, support vector machines, and much, much more! Enroll in the course and become a data scientist today!

What are the requirements? Some programming experience; admin permissions to download files.

What will I learn in this course? Use Python for data science and machine learning; use Spark for big data analysis; implement machine learning algorithms; learn to use NumPy for numerical data; learn to use pandas for data analysis; learn to use Matplotlib for Python plotting; learn to use Seaborn for statistical plots; use Plotly for interactive dynamic visualizations; use scikit-learn for machine learning tasks: K-means clustering, logistic regression, linear regression, random forests and decision trees, natural language processing and spam filters, neural networks, and support vector machines.

Who is the target audience? This course is meant for people with at least some programming experience.
Mansoor1565
Introduction

Pig and Python are widely used for executing complex Hadoop map-reduce-based data flows. Pig adds a layer of abstraction on top of Hadoop's map-reduce mechanisms, with the intention of letting developers take a high-level view of the data and the operations on that data. Pig lets us do things more directly. For instance, we may join two or more data sources; writing a join as map and reduce functions is a bit of a drag and is generally worth avoiding. Pig is great because it simplifies complex tasks: it offers a high-level scripting language that lets users take a big-picture view of their data flow. Pig is particularly valuable because it is extensible, and this article will emphasize its extensibility. By the end of this article, we will be able to write Pig Latin scripts that execute Python code as part of a larger map-reduce workflow.

Description

Pig is composed of two main parts: a high-level data-flow language called Pig Latin, and an engine that parses, optimizes, and executes Pig Latin scripts as a series of MapReduce jobs run on a Hadoop cluster. Pig is easy to write, comprehend, and maintain, since it is a data transformation language that lets the processing of data be described as a sequence of transformations. It is also highly extensible through User-Defined Functions (UDFs).

User-Defined Functions (UDFs)

A Pig UDF permits custom processing to be written in many languages, for example Python. It is a function that is available to Pig but written in a language that isn't Pig Latin. Pig lets us register UDFs for use within a Pig Latin script; a UDF has to fit a specific prototype.

An example Pig application is an Extract, Transform, Load (ETL) process, which extracts data from a data source, transforms it for querying and analysis, and loads the result onto a target data store. When Pig loads the data, it may execute projections, iterations, and other transformations. UDFs allow more complex algorithms to be applied during the transform phase. After Pig is done processing the data, it may be stored back in HDFS.

Pig Latin scripts

We can write the simplest Python UDF as:

from pig_util import outputSchema

@outputSchema('word:chararray')
def hi_world():
    return "hello world"

The data output from a function has a particular form. Pig likes it if we specify the schema of the data, because then it knows what it can do with that data; that's what the outputSchema decorator is for. There are a couple of different ways to state a schema. If the UDF above were saved in a file named my_udfs.py, we would be able to use it in a Pig Latin script like this:

-- first register it to make it available
REGISTER 'my_udfs.py' using jython as my_special_udfs

users = LOAD 'user_data' AS (name: chararray);
hello_users = FOREACH users GENERATE name, my_special_udfs.hi_world();

UDF arguments

UDFs have inputs and outputs as well. Look at the UDFs below:

def deal_with_a_string(s1):
    return s1 + " for the win!"
def deal_with_two_strings(s1, s2):
    return s1 + " " + s2

def square_a_number(i):
    return i * i

def now_for_a_bag(lBag):
    lOut = []
    for i, l in enumerate(lBag):
        lNew = [i] + l
        lOut.append(lNew)
    return lOut

The following shows these UDFs in a Pig Latin script:

REGISTER 'myudf.py' using jython as myudfs

users = LOAD 'user_data' AS (firstname: chararray, lastname: chararray, some_integer: int);
winning_users = FOREACH users GENERATE myudfs.deal_with_a_string(firstname);
full_names = FOREACH users GENERATE myudfs.deal_with_two_strings(firstname, lastname);
squared_integers = FOREACH users GENERATE myudfs.square_a_number(some_integer);
users_by_number = GROUP users by some_integer;
indexed_users_by_number = FOREACH users_by_number GENERATE group, myudfs.now_for_a_bag(users);

Beyond Standard Python UDFs

There are some restrictions. We can't use NumPy from Jython, and Pig doesn't actually permit Python filter UDFs; we can only do things like:

user_messages = LOAD 'user_twits' AS (name: chararray, message: chararray);
--add a field that says whether the message is naughty (1) or not (0)
messages_with_rudeness = FOREACH user_messages GENERATE name, message, contains_naughty_words(message) as naughty;
--then filter by the naughty field
filtered_messages = FILTER messages_with_rudeness by (naughty == 1);
--and finally strip away the naughty field
rude_messages = FOREACH filtered_messages GENERATE name, message;

Python Streaming UDFs

Pig also lets us hook into the Hadoop Streaming API, which allows us to get around the Jython issue when we need to. Hadoop lets us write mappers and reducers in any language that gives us access to stdin and stdout, so that's pretty much any language we want: Python 3, or even an esoteric language like COW. The following is a simple Python streaming script; let's call it simple_stream.py:

#! /usr/bin/env python
import sys

for line in sys.stdin:
    if len(line) == 0:
        continue
    l = line.split()  # split the line by whitespace
    for i, s in enumerate(l):
        # emit a tab-separated key-value pair for each word in the line
        sys.stdout.write("{key}\t{value}\n".format(key=i, value=s))

The point is that Hadoop will run the script on each node, so the hashbang line (#!) must be valid on every node, each import statement must be valid on every node, and any system-level files or resources accessed inside the Python script must be accessible in the same way on every node.

Using the simple_stream.py script:

DEFINE stream_alias 'simple_stream.py' SHIP('simple_stream.py');
user_messages = LOAD 'user_twits' AS (name: chararray, message: chararray);
just_messages = FOREACH user_messages generate message;
streamed = STREAM just_messages THROUGH stream_alias;
DUMP streamed;

The overall format we are using is:

DEFINE alias 'command' SHIP('files');

The alias is the name used to access the streaming function from inside the Pig Latin script. The command is the system command Pig will call when it needs to use the streaming function. Finally, SHIP tells Pig which files and dependencies to distribute to the Hadoop nodes so that the command will work.