Found 32 repositories (showing 30)
sanand0
Official content for the IITM BS course on Tools in Data Science
Aryia-Behroziuan
An ANN is a model based on a collection of connected units or nodes called "artificial neurons", which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit information, a "signal", from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called "edges". Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times.

The original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games, and medical diagnosis. Deep learning consists of multiple hidden layers in an artificial neural network. This approach tries to model the way the human brain processes light and sound into vision and hearing. Some successful applications of deep learning are computer vision and speech recognition.[68]

Decision trees
Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making.

Support vector machines
Support vector machines (SVMs), also known as support vector networks, are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.[69] An SVM is a non-probabilistic, binary, linear classifier, although methods such as Platt scaling exist to use SVMs in a probabilistic classification setting. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
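As a rough illustration of the decision tree and kernel SVM classifiers described above, here is a minimal Python sketch; scikit-learn and the Iris toy dataset are assumptions made only for demonstration and are not named anywhere in this entry:

    # Minimal sketch: a decision tree and a kernel SVM on a toy dataset.
    # scikit-learn and the Iris data are assumed purely for illustration.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # Decision tree: leaves hold class labels, branches encode feature tests.
    tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

    # SVM with an RBF kernel: the "kernel trick" maps inputs into a
    # higher-dimensional feature space implicitly.
    svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

    print("decision tree accuracy:", tree.score(X_test, y_test))
    print("svm accuracy:", svm.score(X_test, y_test))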
Regression analysis
Regression analysis encompasses a large variety of statistical methods to estimate the relationship between input variables and their associated features. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is often extended by regularization methods to mitigate overfitting and bias, as in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression (for example, used for trendline fitting in Microsoft Excel[70]), logistic regression (often used in statistical classification), or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to a higher-dimensional space.

Bayesian networks
A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independence with a directed acyclic graph (DAG). A simple example: rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet. More generally, a Bayesian network could represent the probabilistic relationships between diseases and symptoms; given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning. Bayesian networks that model sequences of variables, like speech signals or protein sequences, are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.

Genetic algorithms
A genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process of natural selection, using methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms were used in the 1980s and 1990s.[71][72] Conversely, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[73]

Training models
Machine learning models usually require a lot of data in order to perform well. When training a machine learning model, one typically needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, or data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model.

Federated learning
Federated learning is an adapted form of distributed artificial intelligence for training machine learning models that decentralizes the training process, allowing users' privacy to be maintained by not needing to send their data to a centralized server. This also increases efficiency by decentralizing the training process to many devices. For example, Gboard uses federated machine learning to train search query prediction models on users' mobile phones without having to send individual searches back to Google.[74]
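The federated averaging idea behind federated learning can be sketched in a few lines; the three simulated "clients", the linear model, and the use of plain NumPy below are illustrative assumptions rather than details taken from this entry:

    # Toy federated averaging: each client updates the model on its own private
    # data, and only model weights (never the raw data) are sent to the server.
    import numpy as np

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])

    # Three simulated clients, each holding local data that never leaves the device.
    clients = []
    for _ in range(3):
        X = rng.normal(size=(50, 2))
        y = X @ true_w + rng.normal(scale=0.1, size=50)
        clients.append((X, y))

    w = np.zeros(2)                          # global model held by the server
    for _ in range(20):                      # communication rounds
        local_weights = []
        for X, y in clients:
            w_local = w.copy()
            for _ in range(5):               # a few local gradient steps per round
                grad = 2 * X.T @ (X @ w_local - y) / len(y)
                w_local -= 0.05 * grad
            local_weights.append(w_local)
        w = np.mean(local_weights, axis=0)   # server averages the client updates

    print("estimated weights:", w.round(3), "true weights:", true_w)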
Applications
There are many applications for machine learning, including agriculture, anatomy, adaptive websites, affective computing, banking, bioinformatics, brain–machine interfaces, cheminformatics, citizen science, computer networks, computer vision, credit-card fraud detection, data quality, DNA sequence classification, economics, financial market analysis,[75] general game playing, handwriting recognition, information retrieval, insurance, internet fraud detection, linguistics, machine learning control, machine perception, machine translation, marketing, medical diagnosis, natural language processing, natural language understanding, online advertising, optimization, recommender systems, robot locomotion, search engines, sentiment analysis, sequence mining, software engineering, speech recognition, structural health monitoring, syntactic pattern recognition, telecommunication, theorem proving, time series forecasting, and user behavior analytics.

In 2006, the media-services provider Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy of its existing Cinematch movie recommendation algorithm by at least 10%. A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[76] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and changed its recommendation engine accordingly.[77] In 2010, The Wall Street Journal wrote about the firm Rebellion Research and its use of machine learning to predict the financial crisis.[78] In 2012, co-founder of Sun Microsystems, Vinod Khosla, predicted that 80% of medical doctors' jobs would be lost in the next two decades to automated machine learning medical diagnostic software.[79] In 2014, it was reported that a machine learning algorithm had been applied in the field of art history to study fine art paintings and that it may have revealed previously unrecognized influences among artists.[80] In 2019, Springer Nature published the first research book created using machine learning.[81]

Limitations
Although machine learning has been transformative in some fields, machine-learning programs often fail to deliver expected results.[82][83][84] Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.[85] In 2018, a self-driving car from Uber failed to detect a pedestrian, who was killed after a collision.[86] Attempts to use machine learning in healthcare with the IBM Watson system failed to deliver even after years of time and billions of dollars invested.[87][88]

Bias
Machine learning approaches in particular can suffer from different data biases. A machine learning system trained on current customers only may not be able to predict the needs of new customer groups that are not represented in the training data.
When trained on man-made data, machine learning is likely to pick up the same constitutional and unconscious biases already present in society.[89] Language models learned from data have been shown to contain human-like biases.[90][91] Machine learning systems used for criminal risk assessment have been found to be biased against black people.[92][93] In 2015, Google Photos would often tag black people as gorillas,[94] and in 2018 this still was not well resolved: Google reportedly was still using the workaround of removing all gorillas from the training data, and thus could not recognize real gorillas at all.[95] Similar issues with recognizing non-white people have been found in many other systems.[96] In 2016, Microsoft tested a chatbot that learned from Twitter, and it quickly picked up racist and sexist language.[97] Because of such challenges, the effective use of machine learning may take longer to be adopted in other domains.[98] Concern for fairness in machine learning, that is, reducing bias in machine learning and propelling its use for human good, is increasingly expressed by artificial intelligence scientists, including Fei-Fei Li, who reminds engineers that "There's nothing artificial about AI... It's inspired by people, it's created by people, and—most importantly—it impacts people. It is a powerful tool we are only just beginning to understand, and that is a profound responsibility."[99]

Model assessments
Classification machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data into a training set and a test set (conventionally a 2/3 training and 1/3 test designation) and evaluates the performance of the trained model on the test set. In comparison, the K-fold cross-validation method randomly partitions the data into K subsets, and K experiments are then performed, each holding out one subset for evaluation and using the remaining K-1 subsets for training the model. In addition to the holdout and cross-validation methods, bootstrap, which samples n instances with replacement from the dataset, can be used to assess model accuracy.[100] In addition to overall accuracy, investigators frequently report sensitivity and specificity, meaning the true positive rate (TPR) and true negative rate (TNR) respectively. Similarly, investigators sometimes report the false positive rate (FPR) as well as the false negative rate (FNR). However, these rates are ratios that fail to reveal their numerators and denominators. The total operating characteristic (TOC) is an effective method to express a model's diagnostic ability. TOC shows the numerators and denominators of the previously mentioned rates, so TOC provides more information than the commonly used receiver operating characteristic (ROC) and ROC's associated area under the curve (AUC).[101]
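A minimal Python sketch of the holdout and K-fold cross-validation procedures described above; scikit-learn, the logistic-regression model, and the breast-cancer toy dataset are assumptions made only for illustration:

    # Sketch: holdout validation and K-fold cross-validation.
    # scikit-learn and its toy dataset are assumed purely for illustration.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)

    # Holdout: conventionally about 2/3 of the data for training, 1/3 for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
    holdout_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

    # K-fold cross-validation: K experiments, each evaluating on one held-out fold
    # and training on the remaining K-1 folds.
    fold_accuracies = cross_val_score(model, X, y, cv=5)

    print("holdout accuracy:", round(holdout_accuracy, 3))
    print("5-fold accuracies:", fold_accuracies.round(3))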
Ethics
Machine learning poses a host of ethical questions. Systems that are trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices.[102] For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants by similarity to previous successful applicants.[103][104] Responsible collection of data and documentation of the algorithmic rules used by a system is thus a critical part of machine learning. Because human languages contain biases, machines trained on language corpora will necessarily also learn these biases.[105][106]

Other forms of ethical challenges, not related to personal biases, are seen more in health care. There are concerns among health care professionals that these systems might not be designed in the public's interest but as income-generating machines. This is especially true in the United States, where there is a long-standing ethical dilemma of improving health care but also increasing profits. For example, the algorithms could be designed to provide patients with unnecessary tests or medication in which the algorithm's proprietary owners hold stakes. There is huge potential for machine learning in health care to give professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these "greed" biases, are addressed.[107]

Hardware
Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks (a particular narrow subdomain of machine learning) that contain many layers of non-linear hidden units.[108] By 2019, graphics processing units (GPUs), often with AI-specific enhancements, had displaced CPUs as the dominant method of training large-scale commercial cloud AI.[109] OpenAI estimated the hardware compute used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017) and found a 300,000-fold increase in the amount of compute required, with a doubling-time trendline of 3.4 months.[110][111]

Software
Software suites containing a variety of machine learning algorithms include the following: free and open-source software
mdzaheerjk
💻 AI/ML Enthusiast | Sharing code & learning in public 📚 Exploring Python, Data Science & Deep Learning 🛠️ Building real-world projects with modern AI tools 🤝 Open to contributions & innovative collaborations
akaidkhan
Introduction to R
R is a programming language: an object-oriented language created by statisticians that provides objects, operators, and functions allowing the user to explore, model, and visualize data. R grew out of the S language developed at AT&T Bell Labs. It is an open-source, free language, allowing anyone to use and modify it; R is licensed under the GNU General Public License, with copyright held by The R Foundation for Statistical Computing, and there are no subscription charges. R has a huge, active community: if you have a question about any function or library, you can Google it and get a proper answer right away. Because it is open source, many data scientists build packages and upload them to a repository called CRAN, from which you can download them. Over 7,800 packages are listed on CRAN; some of the most powerful and commonly used R packages are listed here. R is cross-platform: it runs on different operating systems and hardware, generally GNU/Linux, Macintosh, and Microsoft Windows, on both 32- and 64-bit processors. R is mainly used for statistical analysis and analytics. You might wonder why you should learn yet another language if you already know programming languages such as Java; the reason is that R is built for statistical analysis, and, as you will see in this course, its results are easy to interpret. R is a leading tool for statistics, data analysis, and machine learning. The language is more than a statistical package: you can build your own objects, functions, and packages, it is easy to use, and the coding style is quite simple. R enables you to interact with many data sources, including ODBC-compliant databases (Excel, Access), and it can also handle CSV files, SAS, SPSS, XML, and many other formats. Similarly, it can create very good visualizations, producing graphics output in PDF, JPG, PNG, and SVG formats and table output for LaTeX and HTML. It has a lot of built-in functions (packages and libraries), and the results are easy to interpret, which is why many industries, big and small, use R. Companies such as Microsoft and Google use R actively, with good reason: it is free and you can build a proof of concept with it. So be confident about learning R; it is hugely popular, and knowing R raises your market value in data science.
Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It's also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection—potentially aided by data science—can make treatment more effective. Currently, dermatologists evaluate every one of a patient's moles to identify outlier lesions or “ugly ducklings” that are most likely to be melanoma. Existing AI approaches have not adequately considered this clinical frame of reference. Dermatologists could enhance their diagnostic accuracy if detection algorithms take into account “contextual” images within the same patient to determine which images represent a melanoma. If successful, classifiers would be more accurate and could better support dermatological clinic work. As the leading healthcare organization for informatics in medical imaging, the Society for Imaging Informatics in Medicine (SIIM)'s mission is to advance medical imaging informatics through education, research, and innovation in a multi-disciplinary community. SIIM is joined by the International Skin Imaging Collaboration (ISIC), an international effort to improve melanoma diagnosis. The ISIC Archive contains the largest publicly available collection of quality-controlled dermoscopic images of skin lesions. In this competition, you’ll identify melanoma in images of skin lesions. In particular, you’ll use images within the same patient and determine which are likely to represent a melanoma. Using patient-level contextual information may help the development of image analysis tools, which could better support clinical dermatologists. Melanoma is a deadly disease, but if caught early, most melanomas can be cured with minor surgery. Image analysis tools that automate the diagnosis of melanoma will improve dermatologists' diagnostic accuracy. Better detection of melanoma has the opportunity to positively impact millions of people.
stephhazlitt
Talk about the shift towards the use of R and other data science tools in the BC Public Service
crosbycj89
The Electric Chain is an open science project dedicated to the intersection of solar energy, IoT and blockchain technology. One of the Electric Chain’s goals is to connect the world’s 7 million solar facilities, watching the skies 24/7 and posting live data to one blockchain for scientists, researchers and human progress. Initial focus is on verifying and publishing solar power generation data publicly in near real time. The Electric Chain project supports the development of open standards and tools to publish and read solar electricity generation data using the Solar Chain blockchain and/or other blockchain technologies.
This project is part of my thesis for my master's degree in Data Science at the University of Edinburgh. As part of the National Healthcare system, the Edinburgh Community Pulmonary Rehabilitation team is tasked with delivering high-quality care, promoting public health, and improving self-management. The Pulmonary Rehabilitation team, based in Leith, has been running since 2008 and receives over 650 referrals annually for Lothian-based individuals living with chronic lung conditions. The main aim of this project is to develop a web-based educational platform that can be used as a preparation tool for students attending the Pulmonary Rehabilitation team for their placement. The system will provide the knowledge and facilities to guide the student through interactive case studies and quizzes. The longer-term aim will be to expand the platform to make it available and relevant to other professionals working on respiratory conditions or looking to move into a respiratory post. The system will be implemented using a combination of HTML, CSS, and JavaScript client-side and an appropriate language, such as PHP or Python, server-side, to enable asynchronous interactions. The interface will need careful design and, once implemented, will need to be responsive to user actions (such as mouse clicks) in order to provide appropriate feedback. A lightweight communication protocol and API, e.g. based on JSON, may need to be designed to ensure that the interface can be decoupled from the server. The possibility of building the system as a plugin to a well-known platform such as WordPress or Drupal will also be explored. The student will interact and work closely with the Pulmonary Rehabilitation team and, over the course of the project, hold several meetings and contextual interviews in order to gather data and then evaluate the system. Existing efforts for other health conditions (such as stroke) will be surveyed when setting up the requirements for this system.
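To make the decoupled JSON API idea concrete, here is a hypothetical Python sketch using Flask; the framework, the endpoint paths, and the data shapes are all assumptions for illustration, since the entry only specifies "PHP or Python server-side" and a JSON-based protocol:

    # Hypothetical sketch of a lightweight JSON API decoupling the web interface
    # from the server. Flask, the routes, and the data model are assumptions.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical in-memory store; a real system would use a database.
    CASE_STUDIES = {1: {"title": "Example pulmonary rehab case study", "questions": []}}

    @app.route("/api/case-studies/<int:case_id>", methods=["GET"])
    def get_case_study(case_id):
        case = CASE_STUDIES.get(case_id)
        if case is None:
            return jsonify({"error": "not found"}), 404
        return jsonify(case)

    @app.route("/api/quiz-answers", methods=["POST"])
    def submit_answer():
        payload = request.get_json()  # e.g. {"case_id": 1, "answer": "B"}
        # Grading logic is stubbed out; the client receives structured JSON feedback.
        return jsonify({"received": payload, "correct": None})

    if __name__ == "__main__":
        app.run(debug=True)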
nathantypanski
The Lunar Mapping and Modeling Portal (LMMP) is a system that has been built to support lunar exploration activities that will enable return of both manned and unmanned missions to the Moon. It provides a web-based Portal and a suite of interactive visualization and analysis tools to enable mission planners, lunar scientists, and engineers to access mapped lunar data products from past and current lunar missions. It also addresses the lunar science community, the lunar commercial community, education and public outreach (E/PO), and anyone else interested in accessing or utilizing lunar data.
kaiusdepaula
Welcome to my Data Projects Showcase! Explore my diverse collection of publicly shared projects where I analyze datasets, derive insights, and employ various tools. From sales forecasting to NLP and optimizations, these projects demonstrate my expertise in data science. Contribute, engage, and discover the power of data!
MungaiJohnThuo
I aspire to devote myself to public health and become an excellent biostatistician. Data-driven science is indispensable for meeting the growing demands of medical and human-health problems amid the explosion of data, and I want to apply the tools of statistics by analyzing data from medical devices and presenting it with lists, tables, and graphics in statistical reports.
petterasla
Is global warming caused by humans? The Consensus Project collected thousands of peer-reviewed publications related to global warming and manually labeled them as "Skeptic", "Neutral" or "Pro" (www.skepticalscience.com/tcp.php). They arguably concluded that 97% of the papers up to 2011 agree that global warming is real and due to human factors. Their data is publicly available and can be explored through a nice online visualisation: www.skepticalscience.com/tcp.php. What happened after 2011? We don't know, because there is no data! Labelling thousands of articles by hand is expensive and time-consuming. However, labelling documents is a task that can be automated. It is called "document classification" in Natural Language Processing (NLP) and it is typically implemented using supervised classification, a machine learning technique. One part of this project is therefore building a document classification system, trained on the data from the Consensus Project, predicting the labels "Skeptic", "Neutral" or "Pro". This task has in fact a lot in common with another popular NLP task: opinion mining/sentiment analysis. Such a classifier requires features extracted from the text. How do we get the abstracts, or even better, the full text of articles? This involves searching for articles (search & information retrieval), downloading the source documents (crawling websites) and filtering out the text (text extraction from HTML or PDF). Data collection therefore forms a second major part of this project. The third part of this project concerns interactive visualisation. The visualisation mentioned above is nice, but so much more is possible. Provided that we can also extract the affiliation of authors, we can plot the distribution of climate skepticism on a world map, contrasting e.g. USA vs. Europe. What if we take the impact factors of journals into account? Is there more or less skepticism in high-impact journals (e.g. Nature, Science) than in low-impact journals (e.g. the Chinese Journal of Oceanology and Limnology)? There are many interesting options to explore (for some inspiration, see www.creativebloq.com/design-tools/data-visualization-712402).
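A minimal Python sketch of the document-classification part described above; scikit-learn, the TF-IDF features, the linear classifier, and the tiny example abstracts are all illustrative assumptions rather than choices made by the project:

    # Sketch: supervised document classification with TF-IDF features and a
    # linear classifier. The example abstracts and labels are hypothetical
    # placeholders mirroring the "Skeptic"/"Neutral"/"Pro" scheme.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    abstracts = [
        "Observed warming is consistent with anthropogenic greenhouse gas emissions.",
        "Natural solar variability alone explains recent temperature changes.",
        "This study measures sea surface temperature without attributing causes.",
    ]
    labels = ["Pro", "Skeptic", "Neutral"]

    classifier = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    classifier.fit(abstracts, labels)

    print(classifier.predict(["Human activity is the dominant cause of observed warming."]))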
Emptiinessss
No description available
jamartinez133
This is my public repository for the final assignment in the Tools for Data Science course on Coursera.
walkabillylab
Resources and tools for data analysis using R in public health, kinesiology, political science, ecology, and microplastics
chrispopiel
This is a public repository for submitting the final assignment in the Tools for Data Science course.
Jash47
Data Science Toolkit final project repository. Includes Jupyter Notebook, data science tool examples, public GitHub repository, and peer evaluation rubric. For Data Science Toolkit course students and anyone interested in data science. Contributions welcome.
This is a public repository for the Final Assignment Peer Evaluation in the IBM Tools for Data Science online course.
dimitrisgiampastos
This is a public repository for an assignment in the Tools for Data Science course by IBM, provided on Coursera.
ddhksl183
This course introduced you to multiple data science tools, and in this final project, you will use Jupyterlite Notebook, one of the easiest tools to share publicly.
This course introduced you to multiple data science tools, and in this final project, you will use Jupyterlite Notebook, one of the easiest tools to share publicly.
This course introduced you to multiple data science tools, and in this final project, you will use Jupyterlite Notebook, one of the easiest tools to share publicly. Leveraging Jupyterlite Notebook on Skills Network labs, you will create your Jupyterlite Notebook (in English) and share it via a public GitHub link.
vatsiiC
This course introduced us to multiple data science tools, and in this final project, we will use Jupyterlite Notebook, one of the easiest tools to share publicly. We will create our Jupyterlite Notebook and share it via a public GitHub link. We will need to include a combination of markdown and code cells.
the-eva-a
This repository contains the project for a data analysis group assignment focused on examining health measures across US states and their relation to early death (years lost). The analysis leverages Python tools like pandas, Plotly, and other data science libraries to visualize and interpret trends in public health data.
PranavKartha
A set of Jupyter notebooks in which I use the statsmodels and scikit-learn libraries in Python to conduct data science and machine learning on public datasets, such as Airbnb listings in China, and to investigate racial bias in a criminal risk assessment tool.
dremeke-pixel
My AI engineering portfolio features hands-on projects in Python, machine learning, and data science. I focus on building practical, real-world tools that improve healthcare and public health systems, using data, thoughtful problem-solving, and intelligent design to create solutions that scale and make a meaningful impact.
shrutiii11
Visualizing and Analyzing Crime Trends in Los Angeles (2020–Present). This project uses public crime data from the Los Angeles Police Department to analyze, visualize, and animate crime patterns across time and locations. It explores trends by date, area, and crime type using Python-based data science and visualization tools.
Adityism
An LLM-powered API to answer graded assignment questions for the Tools in Data Science (TDS) course. The API processes plain-text queries and file attachments, returning structured, accurate answers. Deployed on Vercel for public accessibility, allowing users to send cURL requests and receive JSON responses.
GVCL
The interactive visualization tool “Demographic Viz” is an outcome of collaborative work between the National Institute of Mental Health & Neurosciences (NIMHANS), Bangalore, and IIIT Bangalore. It contains different modules that provide a visual understanding of raw demographic data, pertaining to the effectiveness of public health programmes and the demographics related to any particular disease. In this application, we have implemented cartograms/maps, temporal charts, and bar charts to analyze the impact of different factors related to patient records. This data can be analyzed in the context of the metadata of the demographic data (patient health records), namely age, type of disease, gender, and patient geo-location. The tool is intended for doctors, hospitals, public health professionals, and others who perform research on these data. The effectiveness of the tool has been tested by exploring and analyzing patient health data from one of the internationally supported public health programmes deployed in India. The tool is generic and can be used with multivariate data from domains such as health, education, environment, and others, where additional geospatio-temporal location data may be present.
HarshithaRavindra
The interactive visualization tool “Demographic Viz” is an outcome of collaborative work between the National Institute of Mental Health & Neurosciences (NIMHANS), Bangalore, and IIIT Bangalore. It contains different modules that provide a visual understanding of raw demographic data, pertaining to the effectiveness of public health programmes and the demographics related to any particular disease. In this application, we have implemented cartograms/maps, temporal charts, and bar charts to analyze the impact of different factors related to patient records. This data can be analyzed in the context of the metadata of the demographic data (patient health records), namely age, type of disease, gender, and patient geo-location. The tool is intended for doctors, hospitals, public health professionals, and others who perform research on these data. The effectiveness of the tool has been tested by exploring and analyzing patient health data from one of the internationally supported public health programmes deployed in India. The tool is generic and can be used with multivariate data from domains such as health, education, environment, and others, where additional geospatio-temporal location data may be present.