Found 166 repositories (showing 30)
Aryia-Behroziuan
An ANN is a model based on a collection of connected units or nodes called "artificial neurons", which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit information, a "signal", from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called "edges". Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times. The original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis. Deep learning consists of multiple hidden layers in an artificial neural network. This approach tries to model the way the human brain processes light and sound into vision and hearing. Some successful applications of deep learning are computer vision and speech recognition.[68]

Decision trees (main article: Decision tree learning). Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making.

Support vector machines (main article: Support vector machines). Support vector machines (SVMs), also known as support vector networks, are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.[69] The model produced by an SVM training algorithm is a non-probabilistic, binary, linear classifier, although methods such as Platt scaling exist to use SVMs in a probabilistic classification setting.
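As an aside on the artificial-neuron computation described in the neural-network paragraph above (each neuron outputs a non-linear function of the weighted sum of its inputs), here is a minimal Python/NumPy sketch; the weights, bias and sigmoid activation are illustrative assumptions rather than part of any particular network.

import numpy as np

def sigmoid(z):
    # Non-linear activation applied to the weighted sum of inputs.
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias):
    # One artificial neuron: real-valued signals arrive over weighted edges,
    # are summed with a bias, and the result is squashed by the activation.
    return sigmoid(np.dot(weights, inputs) + bias)

# Example: three input signals feeding a single neuron (values are made up).
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.7])
print(neuron_output(x, w, bias=0.2))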
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

Regression analysis (main article: Regression analysis). Regression analysis encompasses a large variety of statistical methods to estimate the relationship between input variables and their associated features. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is often extended by regularization methods to mitigate overfitting and bias, as in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression (for example, used for trendline fitting in Microsoft Excel[70]), logistic regression (often used in statistical classification) or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to a higher-dimensional space.

Bayesian networks (main article: Bayesian network). A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independence with a directed acyclic graph (DAG). In a simple example, rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet. A Bayesian network could likewise represent the probabilistic relationships between diseases and symptoms: given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning. Bayesian networks that model sequences of variables, like speech signals or protein sequences, are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.

Genetic algorithms (main article: Genetic algorithm). A genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process of natural selection, using methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms were used in the 1980s and 1990s.[71][72] Conversely, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[73]

Training models. Machine learning models usually require a lot of data in order to perform well. When training a machine learning model, one typically needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, or data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model.

Federated learning (main article: Federated learning). Federated learning is an adapted form of distributed artificial intelligence for training machine learning models that decentralizes the training process, allowing users' privacy to be maintained by not needing to send their data to a centralized server. This also increases efficiency by decentralizing the training process to many devices.
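A small scikit-learn sketch of the regression methods mentioned in the regression-analysis paragraph above: ordinary least squares and its ridge-regularized extension. The synthetic data and the regularization strength alpha=1.0 are arbitrary choices for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 input variables
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)            # ordinary least squares
ridge = Ridge(alpha=1.0).fit(X, y)            # L2-regularized variant (ridge regression)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)     # shrunk toward zero to mitigate overfitting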
For example, Gboard uses federated machine learning to train search query prediction models on users' mobile phones without having to send individual searches back to Google.[74]

Applications. There are many applications for machine learning, including: agriculture, anatomy, adaptive websites, affective computing, banking, bioinformatics, brain–machine interfaces, cheminformatics, citizen science, computer networks, computer vision, credit-card fraud detection, data quality, DNA sequence classification, economics, financial market analysis,[75] general game playing, handwriting recognition, information retrieval, insurance, internet fraud detection, linguistics, machine learning control, machine perception, machine translation, marketing, medical diagnosis, natural language processing, natural language understanding, online advertising, optimization, recommender systems, robot locomotion, search engines, sentiment analysis, sequence mining, software engineering, speech recognition, structural health monitoring, syntactic pattern recognition, telecommunication, theorem proving, time series forecasting, and user behavior analytics.

In 2006, the media-services provider Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy of its existing Cinematch movie recommendation algorithm by at least 10%. A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[76] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and they changed their recommendation engine accordingly.[77] In 2010, The Wall Street Journal wrote about the firm Rebellion Research and their use of machine learning to predict the financial crisis.[78] In 2012, co-founder of Sun Microsystems, Vinod Khosla, predicted that 80% of medical doctors' jobs would be lost in the next two decades to automated machine learning medical diagnostic software.[79] In 2014, it was reported that a machine learning algorithm had been applied in the field of art history to study fine art paintings and that it may have revealed previously unrecognized influences among artists.[80] In 2019, Springer Nature published the first research book created using machine learning.[81]

Limitations. Although machine learning has been transformative in some fields, machine-learning programs often fail to deliver expected results.[82][83][84] Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.[85] In 2018, a self-driving car from Uber failed to detect a pedestrian, who was killed after a collision.[86] Attempts to use machine learning in healthcare with the IBM Watson system failed to deliver even after years of time and billions of dollars invested.[87][88]

Bias (main article: Algorithmic bias). Machine learning approaches in particular can suffer from different data biases. A machine learning system trained on current customers only may not be able to predict the needs of new customer groups that are not represented in the training data.
When trained on man-made data, machine learning is likely to pick up the same constitutional and unconscious biases already present in society.[89] Language models learned from data have been shown to contain human-like biases.[90][91] Machine learning systems used for criminal risk assessment have been found to be biased against black people.[92][93] In 2015, Google Photos would often tag black people as gorillas,[94] and in 2018 this still was not well resolved: Google reportedly was still using the workaround of removing all gorillas from the training data, and thus could not recognize real gorillas at all.[95] Similar issues with recognizing non-white people have been found in many other systems.[96] In 2016, Microsoft tested a chatbot that learned from Twitter, and it quickly picked up racist and sexist language.[97] Because of such challenges, the effective use of machine learning may take longer to be adopted in other domains.[98] Concern for fairness in machine learning, that is, reducing bias in machine learning and propelling its use for human good, is increasingly expressed by artificial intelligence scientists, including Fei-Fei Li, who reminds engineers that "There’s nothing artificial about AI...It’s inspired by people, it’s created by people, and—most importantly—it impacts people. It is a powerful tool we are only just beginning to understand, and that is a profound responsibility.”[99]

Model assessments. Classification of machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data into a training set and a test set (conventionally a 2/3 training and 1/3 test designation) and evaluates the performance of the trained model on the test set. In comparison, the K-fold cross-validation method randomly partitions the data into K subsets, and then K experiments are performed, each using one subset for evaluation and the remaining K-1 subsets for training the model. In addition to the holdout and cross-validation methods, the bootstrap, which samples n instances with replacement from the dataset, can be used to assess model accuracy.[100] In addition to overall accuracy, investigators frequently report sensitivity and specificity, meaning the True Positive Rate (TPR) and True Negative Rate (TNR) respectively. Similarly, investigators sometimes report the false positive rate (FPR) as well as the false negative rate (FNR). However, these rates are ratios that fail to reveal their numerators and denominators. The total operating characteristic (TOC) is an effective method to express a model's diagnostic ability. TOC shows the numerators and denominators of the previously mentioned rates, thus TOC provides more information than the commonly used receiver operating characteristic (ROC) and ROC's associated area under the curve (AUC).[101]

Ethics. Machine learning poses a host of ethical questions. Systems which are trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices.[102] For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants by similarity to previous successful applicants.[103][104] Responsible collection of data and documentation of algorithmic rules used by a system is thus a critical part of machine learning.
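A hedged sketch of the assessment techniques described in the model-assessments paragraph above: a holdout split (roughly 2/3 training, 1/3 test), sensitivity and specificity read off a confusion matrix, and K-fold cross-validation. The classifier and the synthetic data are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=300, random_state=0)

# Holdout method: conventionally about 2/3 training, 1/3 test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("sensitivity (TPR):", tp / (tp + fn))
print("specificity (TNR):", tn / (tn + fp))

# K-fold cross-validation: K experiments, each holding out one of K subsets.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold accuracy:", scores.mean())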
Because human languages contain biases, machines trained on language corpora will necessarily also learn these biases.[105][106] Other forms of ethical challenges, not related to personal biases, are seen more in health care. There are concerns among health care professionals that these systems might not be designed in the public's interest but as income-generating machines. This is especially true in the United States, where there is a long-standing ethical dilemma of improving health care but also increasing profits. For example, the algorithms could be designed to provide patients with unnecessary tests or medication in which the algorithm's proprietary owners hold stakes. There is huge potential for machine learning in health care to provide professionals with a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these "greed" biases, are addressed.[107]

Hardware. Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks (a particular narrow subdomain of machine learning) that contain many layers of non-linear hidden units.[108] By 2019, graphics processing units (GPUs), often with AI-specific enhancements, had displaced CPUs as the dominant method of training large-scale commercial cloud AI.[109] OpenAI estimated the hardware compute used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017), and found a 300,000-fold increase in the amount of compute required, with a doubling-time trendline of 3.4 months.[110][111]

Software. Software suites containing a variety of machine learning algorithms include a number of free and open-source packages.
Introduction to machine learning and data mining
How can a machine learn from experience, to become better at a given task? How can we automatically extract knowledge or make sense of massive quantities of data? These are the fundamental questions of machine learning. Machine learning and data mining algorithms use techniques from statistics, optimization, and computer science to create automated systems which can sift through large volumes of data at high speed to make predictions or decisions without human intervention. Machine learning as a field is now incredibly pervasive, with applications spanning the web (search, advertisements, and suggestions), national security, biochemical interactions, traffic and emissions, and astrophysics. Perhaps most famously, the $1M Netflix Prize stirred up interest in learning algorithms among professionals, students, and hobbyists alike. This class will familiarize you with a broad cross-section of models and algorithms for machine learning, and prepare you for research or industry application of machine learning techniques.
Background: We will assume basic familiarity with the concepts of probability and linear algebra. Some programming will be required; we will primarily use Matlab, but no prior experience with Matlab will be assumed. (Most or all code should be Octave compatible, so you may use Octave if you prefer.)
Textbook and Reading: There is no required textbook for the class. However, useful books on the subject for supplementary reading include Murphy's "Machine Learning: A Probabilistic Perspective"; Duda, Hart & Stork, "Pattern Classification"; and Hastie, Tibshirani, and Friedman, "The Elements of Statistical Learning".
sayantann11
Clustering in Machine Learning

Introduction to Clustering: Clustering is a type of unsupervised learning method. An unsupervised learning method is a method in which we draw inferences from datasets consisting of input data without labelled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects grouped on the basis of similarity and dissimilarity between them. For example, data points that lie close together in a scatter plot can be classified into one single group; in such a plot we can distinguish the clusters and identify, for instance, that there are 3 clusters. It is not necessary for clusters to be spherical. For example, in DBSCAN (Density-Based Spatial Clustering of Applications with Noise), data points are clustered using the basic concept that a data point lies within a given constraint from the cluster centre, and various distance methods and techniques are used for the calculation of outliers.

Why Clustering? Clustering is very important as it determines the intrinsic grouping among the unlabelled data present. There are no fixed criteria for a good clustering; it depends on the user and the criteria they may use to satisfy their need. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). The algorithm must make some assumptions about what constitutes the similarity of points, and each assumption makes different and equally valid clusters.

Clustering Methods:
Density-based methods: These methods consider the clusters as dense regions having some similarity, different from the lower-density regions of the space. They have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify Clustering Structure), etc.
Hierarchical methods: The clusters formed in this method form a tree-type structure based on the hierarchy, and new clusters are formed using the previously formed ones. It is divided into two categories: agglomerative (bottom-up approach) and divisive (top-down approach). Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
Partitioning methods: These methods partition the objects into k clusters, and each partition forms one cluster. This method is used to optimize an objective criterion similarity function, such as when distance is a major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
Grid-based methods: In these methods the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operations done on these grids are fast and independent of the number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (Clustering In Quest), etc.
Clustering Algorithms: The K-means clustering algorithm is the simplest unsupervised learning algorithm that solves the clustering problem. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean serving as a prototype of the cluster.

Applications of clustering in different fields:
Marketing: it can be used to characterize and discover customer segments for marketing purposes.
Biology: it can be used for classification among different species of plants and animals.
Libraries: it is used for clustering different books on the basis of topics and information.
Insurance: it is used to acknowledge the customers and their policies, and to identify frauds.
City planning: it is used to make groups of houses and to study their values based on their geographical locations and other factors present.
Earthquake studies: by learning the earthquake-affected areas, we can determine the dangerous zones.
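A minimal scikit-learn sketch of the K-means algorithm described above, in which each observation is assigned to the cluster with the nearest mean; the synthetic blob data and the choice of three clusters are assumptions made only for illustration.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (purely illustrative).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)     # the cluster prototypes (means)
print(kmeans.labels_[:10])         # cluster assignment of the first 10 points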
h-sami-ullah
Designing a machine learning algorithm to predict stock prices is a subject of interest for economists and machine learning practitioners. Financial modelling is a challenging task, not only from an analytical perspective but also from a psychological perspective. After the 2008 financial crisis, many financial companies and investors shifted their interest towards predicting future trends. Most of the existing methods for stock price forecasting are modelled using non-linear methods and evaluated on specific data sets, and these models are not able to generalize to diverse datasets. Financial time series data is highly dynamic in nature, which makes it difficult to analyze through statistical methods. Recurrent Neural Network (RNN) based Long Short-Term Memory (LSTM) networks are able to capture the patterns of sequential data, whereas statistical methods tried to generalize by memorizing data instead of recognizing patterns. In this work, we examined the performance of an LSTM model and statistical models on the stock prices of different companies to assess how well the models generalize. The experimental results of this study show that the LSTM network outperformed traditional statistical methods such as the ARIMA, MA and AR models. Furthermore, we noticed that the LSTM network was able to perform consistently on different data sets, while the statistical methods showed varied performance. Through this project, we addressed the gaps in current models of stock price prediction from both economic and machine learning perspectives.
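The following is a hedged sketch of the kind of LSTM setup the abstract describes, using Keras on a made-up price series turned into sliding windows; the window length, layer sizes and placeholder data are assumptions and not the authors' actual model.

import numpy as np
from tensorflow import keras

def make_windows(series, window=30):
    # Turn a 1-D price series into (samples, window, 1) inputs and next-step targets.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., None], y

prices = np.cumsum(np.random.randn(1000)) + 100.0   # placeholder series, not real data
X, y = make_windows(prices)

model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1], 1)),
    keras.layers.LSTM(50),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.predict(X[-1:], verbose=0))              # next-step forecast for the last window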
Learning to predict failures is one of the important steps towards improving the reliability of a cloud computing system; it also gives the ability to avoid failure incidents and the cost overhead they impose on the system. The breakthroughs of machine learning and cloud storage, which make use of the huge amounts of generated data, create a great opportunity to predict when a system or piece of hardware will malfunction or fail. Insights from statistical analysis of workload data from cloud providers can be used to improve the reliability of the system. This research discusses job usage data of tasks in the large "Google Cluster Workload Traces 2019" dataset, using multiple resampling techniques, namely Random Under Sampling, Random Oversampling and the Synthetic Minority Oversampling Technique (SMOTE), to handle the imbalanced dataset. It further applies multiple machine learning algorithms for job failure prediction on the imbalanced and balanced datasets: traditional machine learning algorithms (Logistic Regression, Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier and Extreme Gradient Boosting Classifier) and deep learning algorithms (Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)). The imbalanced and balanced settings are then compared in terms of model accuracy, error rate, sensitivity, f-measure, and precision. The results show that the Extreme Gradient Boosting Classifier and the Gradient Boosting Classifier are the best-performing algorithms both with and without imbalance-handling techniques, and that SMOTE is the best method to choose for handling imbalanced data. The deep learning models (LSTM and Gated Recurrent Unit) may not be the best in terms of accuracy, but based on the ROC curve they perform better than the XGBoost Classifier and Gradient Boosting Classifier.
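A hedged sketch of the imbalance-handling plus boosting pipeline described above, using imbalanced-learn's SMOTE and scikit-learn's GradientBoostingClassifier on synthetic data standing in for the Google cluster traces; class weights and model settings are illustrative assumptions.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the job-failure data: roughly 5% positive (failed) class.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training split so the test set keeps its natural imbalance.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te)))  # precision, recall, f-measure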
cran-task-views
CRAN Task View: Machine Learning & Statistical Learning
shashwat23
Titanic-Machine-Learning-from-Disaster
This repository contains a machine learning project for predicting the survival of passengers who travelled on the Titanic in 1912.
Problem Description: This project highlights my approach to the introductory machine learning competition on the Kaggle website, Titanic: Machine Learning from Disaster [1]. The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. This project analyses which people were likely to survive. In particular, tools of machine learning have been used to predict which passengers survived the tragedy.
Project Description: This project has been made in Python v3.4. It uses various data processing, visualisation and machine learning packages such as numpy, pandas, matplotlib, scikit-learn etc., which should be installed if the code is run on a local machine. The project uses a 5-step process (general procedure) for its prediction task, as follows [2]:
1. Perform a statistical analysis of the data and look over its characteristics, such as the data type of columns, number of instances, correlation of each attribute with the output variable, the mean and other information about the data, the correlation matrix, etc.
2. After performing statistical analysis, do a visual analysis by plotting the data. Analyse the scatter_matrix, plot box plots, etc. so as to know which attributes are relevant and which are not. Remove irrelevant attributes from the dataset for further analysis.
3. Make a list of all machine learning algorithms that can give good prediction results and spot check each one of them (apply each one to the dataset) to find which one is better for prediction. Use k-fold cross validation to calculate performance characteristics of each of the learners (accuracy, precision, recall, area under ROC curve, etc.) (see the spot-check sketch at the end of this entry).
4. Take some of the good-performing algorithms and perform a grid search / randomised search over their hyperparameters to find the optimal hyperparameters for the prediction task. Ensure that the optimal hyperparameters do not overfit the data by performing k-fold cross validation on learners using these tuned hyperparameters as well.
5. Use an ensemble or Voting Classifier on the selected algorithms to achieve better performance, or use any one of the above algorithms directly to perform predictions. Keep iterating over the above steps and tune them as needed so as to achieve better performance.
File Description:
titanic_predictor - contains Python code for predicting survival.
my_solution.csv - contains a sample output file generated from the algorithm.
train.csv - contains training data.
test.csv - contains testing data for making predictions.
readme.md - guide to this project.
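A hedged sketch of the spot-check step referenced above (several candidate learners compared with k-fold cross-validation); it assumes the Kaggle train.csv with its usual columns, and the simple preprocessing shown is an illustrative choice rather than the repository's actual code.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

train = pd.read_csv("train.csv")                       # Kaggle Titanic training data
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Age"] = train["Age"].fillna(train["Age"].median())
X = train[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]].fillna(0)
y = train["Survived"]

# Spot check a few candidate learners with 5-fold cross-validation.
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(random_state=0)),
                    ("svm", SVC())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())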
anandjha90
DESCRIPTION
One of the leading retail stores in the US, Walmart, would like to predict sales and demand accurately. There are certain events and holidays which impact sales on each day. Sales data are available for 45 Walmart stores. The business is facing a challenge due to unforeseen demands and sometimes runs out of stock, due to an inappropriate machine learning algorithm. An ideal ML algorithm will predict demand accurately and ingest factors like economic conditions, including CPI, the Unemployment Index, etc. Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.
Dataset Description
This is the historical data covering sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:
Store - the store number
Date - the week of sales
Weekly_Sales - sales for the given store
Holiday_Flag - whether the week is a special holiday week (1 - holiday week, 0 - non-holiday week)
Temperature - temperature on the day of sale
Fuel_Price - cost of fuel in the region
CPI - prevailing consumer price index
Unemployment - prevailing unemployment rate
Holiday Events
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
Analysis Tasks
Basic statistics tasks (a pandas sketch follows this entry):
Which store has maximum sales?
Which store has the maximum standard deviation, i.e., sales that vary a lot? Also, find out the coefficient of mean to standard deviation.
Which store(s) has a good quarterly growth rate in Q3 2012?
Some holidays have a negative impact on sales. Find out the holidays which have higher sales than the mean sales in the non-holiday season for all stores together.
Provide a monthly and semester view of sales in units and give insights.
Statistical Model
For Store 1 - build prediction models to forecast demand.
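A hedged pandas sketch of the basic statistics tasks listed above (maximum sales, largest standard deviation, coefficient of variation per store); the file name and CSV layout are assumptions based on the dataset description.

import pandas as pd

sales = pd.read_csv("Walmart_Store_sales.csv")          # assumed file name and layout

by_store = sales.groupby("Store")["Weekly_Sales"]

print("store with maximum total sales:", by_store.sum().idxmax())
print("store with maximum std dev:    ", by_store.std().idxmax())

# Coefficient of variation (std / mean) per store.
cv = by_store.std() / by_store.mean()
print(cv.sort_values(ascending=False).head())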
purvapruthi
This project is about developing an OCR system to detect text in images. Due to the complexity of natural scenes, illumination effects, and the non-uniformity of backgrounds, this task is a difficult computer vision problem. Our aim is to develop a more efficient system with better accuracy to recognize text in all kinds of images using machine learning and statistical techniques.
clandgrebe
NBA betting is a big business, accounting for upwards of $500 million in legal bets placed every month on games throughout the 82-game season. With this in mind, we wanted to explore how we could maximize our betting profits by making more accurate predictions of the outcomes of games through machine learning. We searched for the most thorough and reliable data available, eventually choosing to use the NBA's own highly trusted statistical database website, https://www.nba.com/stats/, along with a sports betting research website, https://www.sportsbookreviewsonline.com. As is discussed in later sections, the raw data lends no feasible value to our task because game data only become available after a game's outcome is known, so we spent considerable time cleaning and processing the data to curate statistical features. Using these features, we began implementing machine learning techniques (Logistic Regression, Random Forest, Gradient Boosting, and Neural Networks) to compare against a baseline model of predicting the home team to win in recent NBA seasons. We first implemented a modified 5-fold cross-validation approach on all of our models, followed by an exploration of using a hold-out validation set and a blending approach on the predictions from our models to see if we could improve our accuracy (our primary performance metric). Overall, we found success with our modeling techniques, and our best-performing models scored significantly higher than the baseline model and many pre-existing models.
jonwihl
In time-domain astronomy, data gathered from telescopes are usually represented in the form of light curves. These are time series that show the brightness variation of an object through a period of time. Based on the variability characteristics of the light curves, celestial objects can be classified into different groups (quasars, long period variables, eclipsing binaries, etc.) and consequently be studied in depth independently. In order to characterize this variability, some of the existing methods use machine learning algorithms that base their decisions on light-curve features. Features, the topic of the following work, are numerical descriptors that aim to characterize and distinguish the different variability classes. They can range from basic statistical measures such as the mean or the standard deviation to complex time-series characteristics such as the autocorrelation function. In this document we present a library with a compilation of some of the existing light-curve features. The main goal is to create a collaborative and open tool where every user can characterize or analyze an astronomical photometric database while also contributing to the library by adding new features. However, it is important to highlight that this library is not restricted to the astronomical field and could also be applied to any kind of time series. Our vision is to be capable of analyzing and comparing light curves from all the available astronomical catalogs in a standard and universal way. This would facilitate and make more efficient tasks such as modeling, classification, data cleaning, outlier detection and data analysis in general. Consequently, when studying light curves, astronomers and data analysts would be on the same wavelength and would not need to find a way of comparing or matching different features. In order to achieve this goal, the library should be run on every existing survey (MACHO, EROS, OGLE, Catalina, Pan-STARRS, etc.) and future surveys (LSST), and the results should ideally be shared in the same open way as this library.
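A small NumPy sketch of the kind of per-light-curve descriptors mentioned above (mean, standard deviation, lag-1 autocorrelation); it mirrors the idea of such features rather than the library's actual API.

import numpy as np

def basic_features(magnitudes):
    # Simple numerical descriptors of a light curve's variability.
    m = np.asarray(magnitudes, dtype=float)
    mean, std = m.mean(), m.std()
    # Lag-1 autocorrelation: correlation of the series with itself shifted by one step.
    autocorr = np.corrcoef(m[:-1], m[1:])[0, 1]
    return {"mean": mean, "std": std, "autocorr_lag1": autocorr}

# Example with a made-up periodic light curve plus noise.
t = np.linspace(0, 10, 200)
print(basic_features(np.sin(t) + 0.1 * np.random.randn(t.size)))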
sergiomorapardo
The use of statistical models in computer algorithms allows computers to make decisions and predictions, and to perform tasks that traditionally require human cognitive abilities. Machine learning is the interdisciplinary field at the intersection of statistics and computer science which develops such algorithms and interweaves them with computer systems.
DOMAIN: Electronics and Telecommunication
• CONTEXT: A communications equipment manufacturing company has a product which is responsible for emitting informative signals. The company wants to build a machine learning model which can help it predict the equipment's signal quality using various parameters.
• DATA DESCRIPTION: The data set contains information on various signal tests performed:
1. Parameters: various measurable signal parameters.
2. Signal_Quality: final signal strength or quality.
• PROJECT OBJECTIVE: The need is to build a regressor which can use these parameters to determine the signal strength or quality (as a number).
Steps and tasks [Total Score: 10 points] (a model-and-pickle sketch follows this entry):
1. Import data.
2. Data analysis & visualisation: perform relevant and detailed statistical analysis on the data; perform relevant and detailed univariate, bivariate and multivariate analysis. Hint: use your best analytical approach. You can even mix and match columns to create new columns which can be used for better analysis. Create your own features if required. Be highly experimental and analytical here to find relevant hidden patterns.
3. Design, train, tune and test a neural network regressor. Hint: use the best approach to refine and tune the data or the model. Be highly experimental here.
4. Pickle the model for future use.
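A hedged sketch of steps 3 and 4 (a neural-network regressor followed by pickling), using scikit-learn's MLPRegressor on synthetic data that stands in for the signal parameters and Signal_Quality target; the layer sizes and file name are illustrative assumptions.

import pickle
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the signal parameters / Signal_Quality target.
X, y = make_regression(n_samples=1000, n_features=11, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reg = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0))
reg.fit(X_tr, y_tr)
print("R^2 on held-out data:", reg.score(X_te, y_te))

# Step 4: pickle the trained model for future use.
with open("signal_quality_model.pkl", "wb") as f:
    pickle.dump(reg, f)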
satyamshorrf
Machine learning is a subfield of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable machines to learn from data, make decisions, and improve their performance on a task without being explicitly programmed.
Chaitra-Bhat383
Dataset corruption is a critical problem that needs to be addressed in the near future. In an era rife with technology, every company and organisation will want to leverage the power of machine learning and data analytics to overcome such problems. Detecting tainted data is a significant task that calls for statistically rigorous algorithms. We aim to address this issue using a novel strategy that makes use of the Adamic-Adar algorithm, which is frequently applied in social networks. To find outliers, we contrast this strategy with the prevailing K-means clustering technique.
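The Adamic-Adar index is normally defined for node pairs in a graph (a sum of 1/log(degree) over common neighbours). A hedged NetworkX sketch on a toy graph follows; the edges below are assumptions, and how the project actually builds a graph from the corrupted dataset is not shown here.

import networkx as nx

# Toy graph; in the project's setting, nodes and edges would be built from the dataset.
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")])

# Adamic-Adar index for candidate node pairs.
for u, v, score in nx.adamic_adar_index(G, [("a", "d"), ("b", "e")]):
    print(f"({u}, {v}) -> {score:.3f}")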
In this work, we propose a deep learning approach for feature extraction in the context of offshore oil well monitoring, exploiting the public 3W dataset, which is well known in the literature. There is a classification task with eight classes, each related to a particular condition of the machinery. Thanks to the peculiarities of the labels, the proposed framework is valid both for diagnostics and prognostics. In more detail, we compare two different approaches to feature extraction: the first is a statistical approach, widely used in the literature related to the considered dataset; the second is based on a convolutional 1D autoencoder. The extracted features are then used as input for several machine learning algorithms, namely Random Forest, Nearest Neighbors, Gaussian Naive Bayes and Quadratic Discriminant Analysis. The worth of the convolutional autoencoder is demonstrated by different experiments on various time horizons.
carrollstreet
Python library designed to streamline common and advanced analytical tasks in data analysis, statistics, and machine learning. It offers a broad range of tools for statistical testing, A/B test analysis, optimization algorithms, visualizations and mathematical computations.
ARTHANAREESWARAR
Rainfall prediction is a critical task in meteorology, agriculture, disaster preparedness, and resource management. It involves using historical weather data, satellite imagery, and various statistical and machine learning models to estimate future precipitation levels.
Python is a powerful open-source programming language known for its accessible libraries and frameworks built for machine learning modeling tasks. With the advancements in plant phenotyping technologies in the past few years, the application of machine learning models in phytopathology for plant disease symptom detection and severity estimation has gained increasing interest. This proposed workshop aims to introduce the Python language and its environment to plant pathologists, with hands-on activities on data manipulation, statistical analysis, and visualization. In addition, this workshop will also introduce machine learning packages and techniques for conducting classification modeling work using datasets from published phytopathological research. This workshop aims to assist plant pathologists with the initial learning curve of a new programming language and motivate them to explore and use Python to excel in their research.
DhivyaRenuka
GRIP: The Spark Foundation (Data Science and Business Analytics Intern)
Author: Dhivya S
Prediction using Supervised ML (Level - Beginner)
● Predict the percentage of a student based on the no. of study hours.
● This is a simple linear regression task as it involves just 2 variables.
● You can use R, Python, SAS Enterprise Miner or any other tool.
● Data can be found at http://bit.ly/w-data
● What will be the predicted score if a student studies for 9.25 hrs/day?
● Sample Solution: https://bit.ly/2HxiGGJ
Supervised machine learning is a type of machine learning in which the machine is fed training data which is labelled. Supervised machine learning is further categorized into regression and classification. Linear regression is probably the simplest approach for statistical learning. It is a good starting point for more advanced approaches, and in fact, many fancy statistical learning techniques can be seen as an extension of linear regression. Therefore, understanding this simple model will build a good base before moving on to more complex approaches. Output type: continuous (number).
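A minimal scikit-learn sketch of the beginner task above, including the 9.25 hrs/day prediction; it assumes the linked CSV is still served at that URL and has Hours and Scores columns.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Data link from the task description; assumed to serve a CSV with Hours and Scores columns.
data = pd.read_csv("http://bit.ly/w-data")

model = LinearRegression().fit(data[["Hours"]], data["Scores"])
print("predicted score for 9.25 hrs/day:",
      model.predict(pd.DataFrame({"Hours": [9.25]}))[0])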
dharineeshramtp2000
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Here we try to analyse the customers visiting the Shopping Mall
The Boston housing market is highly competitive, and you want to be the best real estate agent in the area. To compete with your peers, you decide to leverage a few basic machine learning concepts to assist you and a client with finding the best selling price for their home. Luckily, you’ve come across the Boston Housing dataset which contains aggregated data on various features for houses in Greater Boston communities, including the median value of homes for each of those areas. Your task is to build an optimal model based on a statistical analysis with the tools available. This model will then be used to estimate the best selling price for your clients' homes.
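A hedged sketch of one way to fit and tune such a model: a decision-tree regressor tuned by cross-validated grid search. The file name housing.csv and the MEDV target column are assumptions about how the dataset is stored locally.

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

data = pd.read_csv("housing.csv")                 # assumed local copy of the Boston data
X = data.drop(columns=["MEDV"])                   # MEDV: median home value (assumed target)
y = data["MEDV"]

# Tune tree depth by cross-validated grid search, scoring on R^2.
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"max_depth": list(range(1, 11))},
                      cv=5, scoring="r2")
search.fit(X, y)
print("best depth:", search.best_params_, "CV R^2:", search.best_score_)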
TeniaKovacs
Instructions
This assignment uses data from the UC Irvine Machine Learning Repository, a popular repository for machine learning datasets. In particular, we will be using the "Individual household electric power consumption Data Set", which I have made available on the course web site:
Dataset: Electric power consumption [20Mb]
Description: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
The following descriptions of the 9 variables in the dataset are taken from the UCI web site:
Date: date in format dd/mm/yyyy
Time: time in format hh:mm:ss
Global_active_power: household global minute-averaged active power (in kilowatt)
Global_reactive_power: household global minute-averaged reactive power (in kilowatt)
Voltage: minute-averaged voltage (in volt)
Global_intensity: household global minute-averaged current intensity (in ampere)
Sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
Sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing machine, a tumble-drier, a refrigerator and a light.
Sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water heater and an air conditioner.
Review criteria
Was a valid GitHub URL containing a git repository submitted?
Does the GitHub repository contain at least one commit beyond the original fork?
Please examine the plot files in the GitHub repository. Do the plot files appear to be of the correct graphics file format?
Does each plot appear correct?
Does each set of R code appear to create the reference plot?
Reviewing the Assignments
Keep in mind this course is about exploratory graphs, understanding the data, and developing strategies. Here's a good quote from a swirl lesson about exploratory graphs: "They help us find patterns in data and understand its properties. They suggest modeling strategies and help to debug analyses. We DON'T use exploratory graphs to communicate results." The rubrics should always be interpreted in that context. As you do your evaluation, please keep an open mind and focus on the positive. The goal is not to deduct points over small deviations from the requirements or for legitimate differences in implementation styles, etc. Look for ways to give points when it's clear that the submitter has given a good faith effort to do the project, and when it's likely that they've succeeded. Most importantly, it's okay if a person did something differently from the way that you did it. The point is not to see if someone managed to match your way of doing things, but to see if someone objectively accomplished the task at hand. To that end, keep the following things in mind:
DO:
Review the source code.
Keep an open mind and focus on the positive.
When in doubt, err on the side of giving too many points, rather than giving too few.
Ask yourself if a plot might answer a question for the person who created it.
Remember that not everyone has the same statistical background and knowledge.
DON'T:
Deduct just because you disagree with someone's statistical methods.
Deduct just because you disagree with someone's plotting methods.
Deduct based on aesthetics.
Loading the data
When loading the dataset into R, please consider the following:
The dataset has 2,075,259 rows and 9 columns. First calculate a rough estimate of how much memory the dataset will require before reading it into R. Make sure your computer has enough memory (most modern computers should be fine).
We will only be using data from the dates 2007-02-01 and 2007-02-02. One alternative is to read the data from just those dates rather than reading in the entire dataset and subsetting to those dates.
You may find it useful to convert the Date and Time variables to Date/Time classes in R using the strptime() and as.Date() functions.
Note that in this dataset missing values are coded as ?.
Making Plots
Our overall goal here is simply to examine how household energy usage varies over a 2-day period in February, 2007. Your task is to reconstruct the following plots, all of which were constructed using the base plotting system. First you will need to fork and clone the following GitHub repository: https://github.com/rdpeng/ExData_Plotting1
For each plot you should:
Construct the plot and save it to a PNG file with a width of 480 pixels and a height of 480 pixels.
Name each of the plot files as plot1.png, plot2.png, etc.
Create a separate R code file (plot1.R, plot2.R, etc.) that constructs the corresponding plot, i.e. code in plot1.R constructs the plot1.png plot. Your code file should include code for reading the data so that the plot can be fully reproduced. You must also include the code that creates the PNG file.
Add the PNG file and R code file to the top-level folder of your git repository (no need for separate sub-folders).
When you are finished with the assignment, push your git repository to GitHub so that the GitHub version of your repository is up to date. There should be four PNG files and four R code files, a total of eight files in the top-level folder of the repo.
Solar Energetic Particles (SEPs) can be associated with solar flares and coronal mass ejections (CMEs) and have energy spectra ranging from a few keV to many GeV. These events can occur without any notable indication and alter the radiation environment of the inner solar system, which can potentially lead to precarious conditions for humans in space, affect the interior of spacecraft's sensitive electronics, and trigger radio blackouts. Identifying the most critical physical parameters of the Solar Dynamics Observatory (SDO) for detecting SEPs can allow for a swift response against their adverse effects. With the profusion of high-quality time series data from the SDO, which accounts for the modulating background of magnetic activity and the inherently dynamic phenomena of pre-flare and post-flare phases (in contrast to the non-representative, point-in-time measurements employed earlier), the selection of vital parameters for solar flare classification using machine learning algorithms appears to be a well-fitted problem in this realm. The primary issue in dealing with multivariate time series (mvts) data is the large number of physical parameters sampled at a rapid frequency, making the data dimensionality very high and thus hindering the learning process. Moreover, manually selecting vital parameters is a tedious and costly task on which experts may not always agree. In response, we examined feature subset selection using multiple algorithms on mvts data and on the statistical features derived from mvts segments (vectorized data). We used the SWAN-SF (Space Weather Analytics for Solar Flares) benchmark dataset, collected from May 2010 to September 2018, to conduct our experiments. The comprehensive study gives a stable scheme for recognizing the critical physical parameters, which boosts the learning process and can be used as a blueprint to foretell future solar flare episodes.
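A hedged scikit-learn sketch of ranking statistical features derived from mvts segments, using mutual-information-based selection; the synthetic data stands in for SWAN-SF, and the choice of k=10 features is an arbitrary assumption.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in: rows = flare/non-flare samples, columns = statistical
# features (e.g. mean, std, skew) derived from each physical parameter's time series.
X, y = make_classification(n_samples=500, n_features=40, n_informative=8, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
top = np.argsort(selector.scores_)[::-1][:10]
print("indices of the 10 highest-scoring features:", top)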
Aryia-Behroziuan
In the late 1960s, computer vision began at universities which were pioneering artificial intelligence. It was meant to mimic the human visual system, as a stepping stone to endowing robots with intelligent behavior.[11] In 1966, it was believed that this could be achieved through a summer project, by attaching a camera to a computer and having it "describe what it saw".[12][13] What distinguished computer vision from the prevalent field of digital image processing at that time was a desire to extract three-dimensional structure from images with the goal of achieving full scene understanding. Studies in the 1970s formed the early foundations for many of the computer vision algorithms that exist today, including extraction of edges from images, labeling of lines, non-polyhedral and polyhedral modeling, representation of objects as interconnections of smaller structures, optical flow, and motion estimation.[11] The next decade saw studies based on more rigorous mathematical analysis and quantitative aspects of computer vision. These include the concept of scale-space, the inference of shape from various cues such as shading, texture and focus, and contour models known as snakes. Researchers also realized that many of these mathematical concepts could be treated within the same optimization framework as regularization and Markov random fields.[14] By the 1990s, some of the previous research topics became more active than the others. Research in projective 3-D reconstructions led to better understanding of camera calibration. With the advent of optimization methods for camera calibration, it was realized that a lot of the ideas were already explored in bundle adjustment theory from the field of photogrammetry. This led to methods for sparse 3-D reconstructions of scenes from multiple images. Progress was made on the dense stereo correspondence problem and further multi-view stereo techniques. At the same time, variations of graph cut were used to solve image segmentation. This decade also marked the first time statistical learning techniques were used in practice to recognize faces in images (see Eigenface). Toward the end of the 1990s, a significant change came about with the increased interaction between the fields of computer graphics and computer vision. This included image-based rendering, image morphing, view interpolation, panoramic image stitching and early light-field rendering.[11] Recent work has seen the resurgence of feature-based methods, used in conjunction with machine learning techniques and complex optimization frameworks.[15][16] The advancement of Deep Learning techniques has brought further life to the field of computer vision. The accuracy of deep learning algorithms on several benchmark computer vision data sets for tasks ranging from classification, segmentation and optical flow has surpassed prior methods.[citation needed]
sumitmandalsm359
Comprehensive ML & DL implementations using PyTorch and scikit-learn, including linear & softmax regression, LDA, SVMs, CNNs, ResNet fine-tuning with augmentation, Transformer image captioning, SimCLR self-supervised learning, unsupervised methods (K-Means, PCA, GMM) on CIFAR-10, MNIST, and housing data, and Text-to-Speech with Piper.
DKhimani23
Statistical Machine Learning Tasks - LaTeX format
rayenebech
Comparison of different statistical machine learning models vs. neural network-based models on text classification tasks.
JohnOfGod6
This project applies statistical machine learning techniques to regression and classification tasks, including polynomial regression and spam classification.
Shandilya21
Forum for Information Retrieval (FIRE): implementation of a statistical machine learning model for the analysis of tweets in a discriminative task.