Found 79 repositories(showing 30)
kunalj101
Data Science Hacks consists of tips, tricks to help you become a better data scientist. Data science hacks are for all - beginner to advanced. Data science hacks consist of python, jupyter notebook, pandas hacks and so on.
BexTuychiev
A book of subtle code tricks and gem resources for all things data, machine learning and deep learning.
xploitspeeds
* READ THE README FOR INFO!! * Incoming Tags- z score statistics,find mean median mode statistics in ms excel,variance,standard deviation,linear regression,data processing,confidence intervals,average value,probability theory,binomial distribution,matrix,random numbers,error propagation,t statistics analysis,hypothesis testing,theorem,chi square,time series,data collection,sampling,p value,scatterplots,statistics lectures,statistics tutorials,business mathematics statistics,share stock market statistics in calculator,business analytics,GTA,continuous frequency distribution,statistics mathematics in real life,modal class,n is even,n is odd,median mean of series of numbers,math help,Sujoy Krishna Das,n+1/2 element,measurement of variation,measurement of central tendency,range of numbers,interquartile range,casio fx991,casio fx82,casio fx570,casio fx115es,casio 9860,casio 9750,casio 83gt,TI BAII+ financial,casio piano,casio calculator tricks and hacks,how to cheat in exam and not get caught,grouped interval data,equation of triangle rectangle curve parabola hyperbola,graph theory,operation research(OR),numerical methods,decision making,pie chart,bar graph,computer data analysis,histogram,statistics formula,matlab tutorial,find arithmetic mean geometric mean,find population standard deviation,find sample standard deviation,how to use a graphic calculator,pre algebra,pre calculus,absolute deviation,TI Nspire,TI 84 TI83 calculator tutorial,texas instruments calculator,grouped data,set theory,IIT JEE,AIEEE,GCSE,CAT,MAT,SAT,GMAT,MBBS,JELET,JEXPO,VOCLET,Indiastudychannel,IAS,IPS,IFS,GATE,B-Tech,M-Tech,AMIE,MBA,BBA,BCA,MCA,XAT,TOEFL,CBSE,ICSE,HS,WBUT,SSC,IUPAC,Narendra Modi,Sachin Tendulkar Farewell Speech,Dhoom 3,Arvind Kejriwal,maths revision,how to score good marks in exams,how to pass math exams easily,JEE 12th physics chemistry maths PCM,JEE maths shortcut techniques,quadratic equations,competition exams tips and ticks,competition maths,govt job,JEE KOTA,college math,mean value theorem,L hospital rule,tech guru awaaz,derivation,cryptography,iphone 5 fingerprint hack,crash course,CCNA,converting fractions,solve word problem,cipher,game theory,GDP,how to earn money online on youtube,demand curve,computer science,prime factorization,LCM & GCF,gauss elimination,vector,complex numbers,number systems,vector algebra,logarithm,trigonometry,organic chemistry,electrical math problem,eigen value eigen vectors,runge kutta,gauss jordan,simpson 1/3 3/8 trapezoidal rule,solved problem example,newton raphson,interpolation,integration,differentiation,regula falsi,programming,algorithm,gauss seidal,gauss jacobi,taylor series,iteration,binary arithmetic,logic gates,matrix inverse,determinant of matrix,matrix calculator program,sex in ranchi,sex in kolkata,vogel approximation VAM optimization problem,North west NWCR,Matrix minima,Modi method,assignment problem,transportation problem,simplex,k map,boolean algebra,android,casio FC 200v 100v financial,management mathematics tutorials,net present value NPV,time value of money TVM,internal rate of return IRR Bond price,present value PV and future value FV of annuity casio,simple interest SI & compound interest CI casio,break even point,amortization calculation,HP 10b financial calculator,banking and money,income tax e filing,economics,finance,profit & loss,yield of investment bond,Sharp EL 735S,cash flow casio,re finance,insurance and financial planning,investment appraisal,shortcut keys,depreciation,discounting
dssg
Methods, tools, tips, and tricks for anyone interested in getting started doing data science for the social good.
Aryia-Behroziuan
An ANN is a model based on a collection of connected units or nodes called "artificial neurons", which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit information, a "signal", from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called "edges". Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times. The original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis. Deep learning consists of multiple hidden layers in an artificial neural network. This approach tries to model the way the human brain processes light and sound into vision and hearing. Some successful applications of deep learning are computer vision and speech recognition.[68] Decision trees Main article: Decision tree learning Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making. Support vector machines Main article: Support vector machines Support vector machines (SVMs), also known as support vector networks, are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.[69] An SVM training algorithm is a non-probabilistic, binary, linear classifier, although methods such as Platt scaling exist to use SVM in a probabilistic classification setting. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. Illustration of linear regression on a data set. Regression analysis Main article: Regression analysis Regression analysis encompasses a large variety of statistical methods to estimate the relationship between input variables and their associated features. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is often extended by regularization (mathematics) methods to mitigate overfitting and bias, as in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression (for example, used for trendline fitting in Microsoft Excel[70]), logistic regression (often used in statistical classification) or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to higher-dimensional space. Bayesian networks Main article: Bayesian network A simple Bayesian network. Rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet. A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independence with a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning. Bayesian networks that model sequences of variables, like speech signals or protein sequences, are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. Genetic algorithms Main article: Genetic algorithm A genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process of natural selection, using methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms were used in the 1980s and 1990s.[71][72] Conversely, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[73] Training models Usually, machine learning models require a lot of data in order for them to perform well. Usually, when training a machine learning model, one needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, and data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model. Federated learning Main article: Federated learning Federated learning is an adapted form of distributed artificial intelligence to training machine learning models that decentralizes the training process, allowing for users' privacy to be maintained by not needing to send their data to a centralized server. This also increases efficiency by decentralizing the training process to many devices. For example, Gboard uses federated machine learning to train search query prediction models on users' mobile phones without having to send individual searches back to Google.[74] Applications There are many applications for machine learning, including: Agriculture Anatomy Adaptive websites Affective computing Banking Bioinformatics Brain–machine interfaces Cheminformatics Citizen science Computer networks Computer vision Credit-card fraud detection Data quality DNA sequence classification Economics Financial market analysis[75] General game playing Handwriting recognition Information retrieval Insurance Internet fraud detection Linguistics Machine learning control Machine perception Machine translation Marketing Medical diagnosis Natural language processing Natural language understanding Online advertising Optimization Recommender systems Robot locomotion Search engines Sentiment analysis Sequence mining Software engineering Speech recognition Structural health monitoring Syntactic pattern recognition Telecommunication Theorem proving Time series forecasting User behavior analytics In 2006, the media-services provider Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy of its existing Cinematch movie recommendation algorithm by at least 10%. A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[76] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and they changed their recommendation engine accordingly.[77] In 2010 The Wall Street Journal wrote about the firm Rebellion Research and their use of machine learning to predict the financial crisis.[78] In 2012, co-founder of Sun Microsystems, Vinod Khosla, predicted that 80% of medical doctors' jobs would be lost in the next two decades to automated machine learning medical diagnostic software.[79] In 2014, it was reported that a machine learning algorithm had been applied in the field of art history to study fine art paintings and that it may have revealed previously unrecognized influences among artists.[80] In 2019 Springer Nature published the first research book created using machine learning.[81] Limitations Although machine learning has been transformative in some fields, machine-learning programs often fail to deliver expected results.[82][83][84] Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.[85] In 2018, a self-driving car from Uber failed to detect a pedestrian, who was killed after a collision.[86] Attempts to use machine learning in healthcare with the IBM Watson system failed to deliver even after years of time and billions of dollars invested.[87][88] Bias Main article: Algorithmic bias Machine learning approaches in particular can suffer from different data biases. A machine learning system trained on current customers only may not be able to predict the needs of new customer groups that are not represented in the training data. When trained on man-made data, machine learning is likely to pick up the same constitutional and unconscious biases already present in society.[89] Language models learned from data have been shown to contain human-like biases.[90][91] Machine learning systems used for criminal risk assessment have been found to be biased against black people.[92][93] In 2015, Google photos would often tag black people as gorillas,[94] and in 2018 this still was not well resolved, but Google reportedly was still using the workaround to remove all gorillas from the training data, and thus was not able to recognize real gorillas at all.[95] Similar issues with recognizing non-white people have been found in many other systems.[96] In 2016, Microsoft tested a chatbot that learned from Twitter, and it quickly picked up racist and sexist language.[97] Because of such challenges, the effective use of machine learning may take longer to be adopted in other domains.[98] Concern for fairness in machine learning, that is, reducing bias in machine learning and propelling its use for human good is increasingly expressed by artificial intelligence scientists, including Fei-Fei Li, who reminds engineers that "There’s nothing artificial about AI...It’s inspired by people, it’s created by people, and—most importantly—it impacts people. It is a powerful tool we are only just beginning to understand, and that is a profound responsibility.”[99] Model assessments Classification of machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data in a training and test set (conventionally 2/3 training set and 1/3 test set designation) and evaluates the performance of the training model on the test set. In comparison, the K-fold-cross-validation method randomly partitions the data into K subsets and then K experiments are performed each respectively considering 1 subset for evaluation and the remaining K-1 subsets for training the model. In addition to the holdout and cross-validation methods, bootstrap, which samples n instances with replacement from the dataset, can be used to assess model accuracy.[100] In addition to overall accuracy, investigators frequently report sensitivity and specificity meaning True Positive Rate (TPR) and True Negative Rate (TNR) respectively. Similarly, investigators sometimes report the false positive rate (FPR) as well as the false negative rate (FNR). However, these rates are ratios that fail to reveal their numerators and denominators. The total operating characteristic (TOC) is an effective method to express a model's diagnostic ability. TOC shows the numerators and denominators of the previously mentioned rates, thus TOC provides more information than the commonly used receiver operating characteristic (ROC) and ROC's associated area under the curve (AUC).[101] Ethics Machine learning poses a host of ethical questions. Systems which are trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices.[102] For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants.[103][104] Responsible collection of data and documentation of algorithmic rules used by a system thus is a critical part of machine learning. Because human languages contain biases, machines trained on language corpora will necessarily also learn these biases.[105][106] Other forms of ethical challenges, not related to personal biases, are more seen in health care. There are concerns among health care professionals that these systems might not be designed in the public's interest but as income-generating machines. This is especially true in the United States where there is a long-standing ethical dilemma of improving health care, but also increasing profits. For example, the algorithms could be designed to provide patients with unnecessary tests or medication in which the algorithm's proprietary owners hold stakes. There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these "greed" biases are addressed.[107] Hardware Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks (a particular narrow subdomain of machine learning) that contain many layers of non-linear hidden units.[108] By 2019, graphic processing units (GPUs), often with AI-specific enhancements, had displaced CPUs as the dominant method of training large-scale commercial cloud AI.[109] OpenAI estimated the hardware compute used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017), and found a 300,000-fold increase in the amount of compute required, with a doubling-time trendline of 3.4 months.[110][111] Software Software suites containing a variety of machine learning algorithms include the following: Free and open-source so
ant-vessicchio
Journey through the world of data science from beginner to advanced concepts and methodologies. In this repository, I will walk you through DS skills, hints, tips, and tricks all with Sports relevant examples!
baniasbaabe
A fine book including tips and tricks around Data Science with Python
Jupyter Notebook tips and tricks for the Berkeley Institute for Data Science lecture. http://bids.berkeley.edu/
nitinkaushik01
This repository contains content related to Data Science related Tips, Tricks and Career Advice
Tips and Tricks to setup a cloud machine for Analytics and Data Science with R, RStudio and Shiny Servers, Python and JupyterLab
gerhard1050
Best practices how to prepare data for Data Science and Analytics, Tips/Tricks and Code Examples on how to implement this with SAS Software
FRBCesab
:globe_with_meridians: Tips and tricks for data science
asimihsan
Repo for tools, tricks, and a virtual machine for "Intro to Data Science" on Udacity
nagdevAmruthnath
This repo contains various notebooks showing some fun stuff in data science, tips and tricks and etc.
Nitinguptadu
Feature extraction settings When starting a new data science project involving time series you probably want to start by extracting a comprehensive set of features. Later you can identify which features are relevant for the task at hand. In the final stages, you probably want to fine tune the parameter of the features to fine tune your models. You can do all those things with tsfresh. So, you need to know how to control which features are calculated by tsfresh and how one can adjust the parameters. In this section, we will clarify this. For the lazy: Just let me calculate some features So, to just calculate a comprehensive set of features, call the tsfresh.extract_features() method without passing a default_fc_parameters or kind_to_fc_parameters object, which means you are using the default options (which will use all feature calculators in this package for what we think are sane default parameters). For the advanced: How do I set the parameters for all kind of time series? After digging deeper into your data, you maybe want to calculate more of a certain type of feature and less of another type. So, you need to use custom settings for the feature extractors. To do that with tsfresh you will have to use a custom settings object: >>> from tsfresh.feature_extraction import ComprehensiveFCParameters >>> settings = ComprehensiveFCParameters() >>> # Set here the options of the settings object as shown in the paragraphs below >>> # ... >>> from tsfresh.feature_extraction import extract_features >>> extract_features(df, default_fc_parameters=settings) The default_fc_parameters is expected to be a dictionary, which maps feature calculator names (the function names you can find in the tsfresh.feature_extraction.feature_calculators file) to a list of dictionaries, which are the parameters with which the function will be called (as key value pairs). Each function parameter combination, that is in this dict will be called during the extraction and will produce a feature. If the function does not take any parameters, the value should be set to None. For example fc_parameters = { "length": None, "large_standard_deviation": [{"r": 0.05}, {"r": 0.1}] } will produce three features: one by calling the tsfresh.feature_extraction.feature_calculators.length() function without any parameters and two by calling tsfresh.feature_extraction.feature_calculators.large_standard_deviation() with r = 0.05 and r = 0.1. So you can control, which features will be extracted, by adding/removing either keys or parameters from this dict. It is as easy as that. If you decide to not calculate the length feature here, you delete it from the dictionary: del fc_parameters["length"] And now, only the two other features are calculated. For convenience, three dictionaries are predefined and can be used right away: tsfresh.feature_extraction.settings.ComprehensiveFCParameters: includes all features without parameters and all features with parameters, each with different parameter combinations. This is the default for extract_features if you do not hand in a default_fc_parameters at all. tsfresh.feature_extraction.settings.MinimalFCParameters: includes only a handful of features and can be used for quick tests. The features which have the “minimal” attribute are used here. tsfresh.feature_extraction.settings.EfficientFCParameters: Mostly the same features as in the tsfresh.feature_extraction.settings.ComprehensiveFCParameters, but without features which are marked with the “high_comp_cost” attribute. This can be used if runtime performance plays a major role. Theoretically, you could calculate an unlimited number of features with tsfresh by adding entry after entry to the dictionary. For the ambitious: How do I set the parameters for different type of time series? It is also possible, to control the features to be extracted for the different kinds of time series individually. You can do so by passing another dictionary to the extract function as a kind_to_fc_parameters = {“kind” : fc_parameters} parameter. This dict must be a mapping from kind names (as string) to fc_parameters objects, which you would normally pass as an argument to the default_fc_parameters parameter. So, for example using kind_to_fc_parameters = { "temperature": {"mean": None}, "pressure": {"max": None, "min": None} } will extract the “mean” feature of the “temperature” time series and the “min” and “max” of the “pressure” time series. The kind_to_fc_parameters argument will partly override the default_fc_parameters. So, if you include a kind name in the kind_to_fc_parameters parameter, its value will be used for that kind. Other kinds will still use the default_fc_parameters. A handy trick: Do I really have to create the dictionary by hand? Not necessarily. let’s assume you have a DataFrame of tsfresh features. By using feature selection algorithms you find out that only a subgroup of features is relevant. Then, we provide the tsfresh.feature_extraction.settings.from_columns() method that constructs the kind_to_fc_parameters dictionary from the column names of this filtered feature matrix to make sure that only relevant features are extracted. This can save a huge amount of time because you prevent the calculation of uncessary features. Let’s illustrate that with an example: # X_tsfresh containes the extracted tsfresh features X_tsfresh = extract_features(...) # which are now filtered to only contain relevant features X_tsfresh_filtered = some_feature_selection(X_tsfresh, y, ....) # we can easily construct the corresponding settings object kind_to_fc_parameters = tsfresh.feature_extraction.settings.from_columns(X_tsfresh_filtered) this will construct you the kind_to_fc_parameters dictionary that corresponds to the features and parameters (!) from the tsfresh features that were filtered by the some_feature_selection feature selection algorithm.
DOI-USGS
USGS targets tips and tricks: overview of conventions and best practices for using targets in USGS Data Science workflows
samuelcorradi
Data science and data engineer tips and tricks
EmAchieng
Python For Data Science Tips, Tricks and Techniques.
AndreeaDraghici
More Python Tips, Tricks and Techniques for Data Science
titilomt
Repository for Data Science Notebooks - Tricks for how to perform a Data Science project like Data Pre Processing, Data Analysis, a little bit ML, etc...
Abhijit2505
In computer science, a graph is an Abstract Data Type(ADT) that is meant to implement the undirected graph and directed graph concepts from the field of graph theory within mathematics.This repository contains tricks and techniques involved with the Graph data structures.
Abhijit2505
In computer science, a Binary Tree is a tree data structure in which each node has at most two children, A recursive definition using just set theory notions is that a (non-empty) binary tree is a tuple (L, S, R).This repository contains code on Binary-Tree, the tricks and techniques involved.
alshival
Data Science Tips/Tricks
TheAustinator
A data science container with RStudio and tricked out Jupyter Lab
Cepeda-R
Tricks and Tips of all Know-How aquired during Data Science Projects
cengelhard
using data science to make AIs to play trick-taking games like Hearts and Spades
codekunoichi
All things Data Science, learning journey, portfolio projects and all tips and tricks along the way.
b-i-king
Exercise Files from LinkedIn Course on 'Python for Data Science Tips, Tricks, & Techniques' by Ben Sullins
This repo is for the Linkedin Learning course: More Python Tips, Tricks, and Techniques for Data Science
fahdshaikh
🧠💡 Bite-sized, beginner-friendly Data Science cheatsheets in Python — from 📊 Pandas tricks to 🤖 ML basics. Each module is runnable, well-commented, and designed to help you learn faster, remember longer, and build real skills 🚀.