Found 87 repositories (showing 30)
Chandrakant817
Statistics for Data Science and Machine Learning Handwritten Notes
This repository contains comprehensive notes on various statistical concepts and methodologies, designed to aid in the understanding and application of statistical analysis. It serves as a valuable resource for students, researchers, and professionals looking to enhance their statistical knowledge and skills.
rohanmistry231
A curated collection of resources, notes, and practice materials for preparing for data science interviews, covering algorithms, statistics, and machine learning concepts. Includes coding exercises, cheat sheets, and reference guides to aid in mastering technical interviews.
tereom
Class notes (in Spanish) for the computational statistics course, Master in Data Science, ITAM
Welcome to the repository of handwritten notes on Statistics and Probability for Data Science. These notes cover fundamental concepts essential for understanding data science methodologies and analyses.
jdstorey
Lecture slides and notes for Applied Statistics and Data Science course
tereom
Class notes for the multivariate statistics course, Master in Data Science, ITAM
tereom
Class notes (in Spanish) for the course of Mathematical Statistics with Resampling, Master in Data Science / Master in Computer Science ITAM.
EmmanuelLwele
Interview Coding Challenge - Data Science. Step 1 of the Data Scientist interview process. Follow the instructions below to complete this portion of the interview. Please note: although we do not set a time limit for this challenge, we recommend completing it as soon as possible, as we evaluate candidates on a first-come, first-served basis. If you have any questions, please feel free to email support@TheZig.io. We will do our best to clarify any issues you come across.

Prerequisites:
- A text editor. We recommend Visual Studio Code for the client-side code; it's lightweight, powerful, and free (https://code.visualstudio.com/).
- SQL Server Management Studio (https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017).
- R, a software environment for statistical computing and graphics. You can download R from the mirrors listed at https://cran.r-project.org/mirrors.html.
- Azure, Microsoft's cloud computing platform. You can create an account without a credit card by using the Azure Pass available at https://azure.microsoft.com/en-us/offers/azure-pass/.
- Git, for source control and for committing your final solution to a new private repo (https://git-scm.com/downloads). If you're not very familiar with git commands, here's a helpful cheat sheet: https://services.github.com/on-demand/downloads/github-git-cheat-sheet.pdf.

'R' Challenge: for each numbered section below, write R code and comments to solve the problem or to show your rationale. For sections that ask you to give outputs, provide the outputs in separate files named with the section number and the word "Output" (e.g., "Section 1 - Output"). Create a private repo and submit your modified R script along with any supporting files.

Load the dataset from the accompanying file "account-defaults.csv". This dataset contains information about loan accounts that either went delinquent or stayed current on payments within the loan's first year. FirstYearDelinquency is the outcome variable; all others are predictors. The objective of modeling with this dataset is to predict the probability that new accounts will become delinquent; it is primarily valuable to distinguish lower-risk accounts from higher-risk accounts (and not just to predict 'yes' or 'no' for new accounts).

- FirstYearDelinquency - indicates whether the loan went delinquent within the first year of the loan's life (values of 1)
- AgeOldestIdentityRecord - number of months since the first record was reported by a national credit source
- AgeOldestAccount - number of months since the oldest account was opened
- AgeNewestAutoAccount - number of months since the most recent auto loan or lease account was opened
- TotalInquiries - total number of credit inquiries on record
- AvgAgeAutoAccounts - average number of months since auto loan or lease accounts were opened
- TotalAutoAccountsNeverDelinquent - total number of auto loan or lease accounts that were never delinquent
- WorstDelinquency - worst days-delinquent status on an account in the first 12 months of the account's life; values of '400' indicate '400 or greater'
- HasInquiryTelecomm - indicates whether one or more telecommunications credit inquiries are on record within the last 12 months (values of 1)

Perform an exploratory data analysis on the accounts data. In your analysis, include summary statistics and visualizations of the distributions and relationships.

Build one or more predictive model(s) on the accounts data using regression techniques. Identify the strongest predictor variables and provide interpretations. Identify and explain issues with the model(s), such as collinearity. Calculate predictions and show model performance on out-of-sample data. Summarize the out-of-sample data in tiers from highest risk to lowest risk.

Split the dataset by the WorstDelinquency variable. For each subset, run a simple regression of FirstYearDelinquency ~ TotalInquiries. Extract the predictor's coefficient and p-value from each model. Store them in a list where the names of the list correspond to the values of WorstDelinquency.

Load the dataset from the accompanying file "vehicle-depreciation.csv". The dataset contains information about vehicles that our company purchases at auction, sells to customers, repossesses from defaulted accounts, and finally re-sells at auction to recover some of our losses. Perform an analysis and/or build a predictive model that provides a method to estimate the depreciation of vehicle worth (from auction purchase to auction sale). Use whatever techniques you want to provide insight into the dataset and walk us through your results - this is your chance to show off your analytical and storytelling skills!

- CustomerGrade - the credit risk grade of the customer
- AuctionPurchaseDate - the date that the vehicle was purchased at auction
- AuctionPurchaseAmount - the dollar amount spent purchasing the vehicle at auction
- AuctionSaleDate - the date that the vehicle was sold at auction
- AuctionSaleAmount - the dollar amount received for selling the vehicle at auction
- VehicleType - the high-level class of the vehicle
- Year - the year of the vehicle
- Make - the make of the vehicle
- Model - the model of the vehicle
- Trim - the trim of the vehicle
- BodyType - the body style of the vehicle
- AuctionPurchaseOdometer - the odometer reading of the vehicle at the time of purchase at auction
- AutomaticTransmission - indicates (with a value of 1) whether the vehicle has an automatic transmission
- DriveType - the drivetrain type of the vehicle
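The split-and-regress step above can be sketched as follows. The challenge asks for R, but as an illustration here is a pure-Python version on made-up rows (the values and WorstDelinquency group labels are invented; a real solution would read "account-defaults.csv" and get p-values from R's `summary(lm(...))` or Python's `scipy.stats.linregress`):

```python
# Hypothetical sketch of "split by WorstDelinquency, regress, collect results".
# Data below is synthetic, not from account-defaults.csv.
from collections import defaultdict

def slope_and_t(xs, ys):
    """Simple-regression slope of y ~ x and its t-statistic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    # residual standard error with n - 2 degrees of freedom
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = (sse / (n - 2) / sxx) ** 0.5
    # converting t to a p-value needs a t-distribution CDF
    # (scipy.stats.t.sf in Python, or lm()'s summary in R)
    return b1, b1 / se_b1

# rows: (WorstDelinquency, TotalInquiries, FirstYearDelinquency) -- made up
rows = [(0, 1, 0), (0, 3, 0), (0, 5, 1), (0, 7, 1),
        (30, 2, 0), (30, 4, 1), (30, 6, 1), (30, 8, 1)]

by_group = defaultdict(lambda: ([], []))
for worst, inq, delinq in rows:
    by_group[worst][0].append(inq)
    by_group[worst][1].append(delinq)

# the dict keys mirror the values of WorstDelinquency,
# like the named list the challenge asks for in R
results = {worst: slope_and_t(xs, ys) for worst, (xs, ys) in by_group.items()}
```

The same shape carries over directly to R: `split()` the data frame by WorstDelinquency, `lapply()` over the pieces fitting `lm()`, and keep the names of the resulting list.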
venkateshelangovan
This repository contains notes and resources for GATE Data Science (DA) preparation. It covers key subjects such as: Probability & Statistics, Linear Algebra, Calculus, Data Structures & Algorithms, Machine Learning, Artificial Intelligence, and Database Management Systems.
tereom
Class notes (in Spanish) for the course of Mathematical Statistics with Resampling, Master in Data Science / Master in Computer Science ITAM.
qiushiyan
math notes for statistics and data science
Added a PDF with detailed notes on statistics specifically for data science.
pbstark
Notes for Reproducible and Collaborative Statistical Data Science, Statistics 159/259, fall 2018, UC Berkeley
Socialblade is a well-known company that maintains statistics for YouTube channels, Instagram accounts, and many more. Their website features a page showing the top 5000 YouTube channels and some basic information about them. I wanted to use those values for performing EDA, so I decided to scrape it. The data contains Socialblade rankings of the top 5000 YouTube channels. The data can be used for finding useful insights and revealing possible correlations between the features of the channels and their respective rankings. Note: This work is not sponsored by Socialblade and is just one outcome of a fun project made using data science technologies. The project does not aim to violate any policies or privacy, since the data on the website is publicly available.
udayansawant
Trends-in-Programming-Languages

We're always staying on the pulse of the software development industry so that we can better prepare our students for a rapidly changing technology job market. Many students, and developers in general, are interested in working at top startups and tech companies. How can we tell which programming languages and technologies are used by the most people? How about which languages are growing and which are shrinking, so that we can tell which are most worth investing time in?

One excellent source of data is Stack Overflow, a programming question-and-answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine how the relative popularity of languages like R, Python, Java, and JavaScript has changed over time. Each Stack Overflow question has a tag, which marks a question to describe its topic or technology. For instance, there's a tag for languages like R or Python, and for packages like ggplot2 or pandas. The dataset used contains one observation for each tag in a year, including both the number of questions asked with that tag in the year and the total number of questions asked in that year. Rather than just the counts of the tags, the concern here is the percentage: the fraction of questions that year that have that tag.

What are the most asked-about tags? It's been fun to visualize and compare tags over time. The dplyr and ggplot2 tags may not have as many questions as R, but we can tell they're both growing quickly as well. We might like to know which tags have the most questions overall, not just within a particular year.

The TIOBE Programming Community index is an indicator of the popularity of programming languages. The index is updated once a month.
The ratings are based on the number of skilled engineers worldwide, courses, and third-party vendors. Popular search engines such as Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube, and Baidu are used to calculate the ratings. It is important to note that the TIOBE index is not about the best programming language or the language in which the most lines of code have been written. (https://www.tiobe.com/tiobe-index/)

Comparing R and Python: there is a lot of heated discussion over the topic, but there are some great, thoughtful articles as well. Some suggest Python is preferable as a general-purpose programming language, while others suggest data science is better served by a dedicated language and toolchain. The origins and development arcs of the two languages are compared, often to support differing conclusions. For individual data scientists, some common points to consider:

- Python is a great general programming language, with many libraries dedicated to data science.
- Many (if not most) general introductory programming courses now start teaching with Python.
- Python is the go-to language for many ETL and machine learning workflows.
- Many (if not most) introductory courses in statistics and data science now teach R.
- R has become the world's largest repository of statistical knowledge, with reference implementations for thousands, if not tens of thousands, of algorithms that have been vetted by experts. The documentation for many R packages includes links to the primary literature on the subject.
- R has a very low barrier to entry for doing exploratory analysis and converting that work into a great report, dashboard, or API.
- R with RStudio is often considered the best place to do exploratory data analysis.

(https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/)
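The "fraction, not count" idea from the Stack Overflow analysis above can be sketched in a few lines of Python. The field names and counts below are illustrative, not the actual Stack Exchange Data Explorer schema:

```python
# Each record: a year, a tag, the number of questions with that tag that
# year, and the total questions asked that year (all values made up).
from collections import Counter

records = [
    {"year": 2017, "tag": "r",      "number": 54_000,  "year_total": 2_100_000},
    {"year": 2017, "tag": "python", "number": 210_000, "year_total": 2_100_000},
    {"year": 2018, "tag": "r",      "number": 59_000,  "year_total": 1_900_000},
    {"year": 2018, "tag": "python", "number": 266_000, "year_total": 1_900_000},
]

# fraction of all questions that year carrying the tag -- comparable
# across years even when overall question volume changes
for rec in records:
    rec["fraction"] = rec["number"] / rec["year_total"]

# most asked-about tags overall: sum raw counts across years
totals = Counter()
for rec in records:
    totals[rec["tag"]] += rec["number"]
most_common = totals.most_common()
```

Working with fractions rather than raw counts is the key design choice: it separates a tag's share of attention from the overall growth of the site.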
AbuTaher003
This repository contains notes and code on statistics. Both academic statistics and statistics for data science are covered.
walidhossain99
This contains all the lecture notes from the Statistics for Data Science in Bangla playlist.
This is a repository containing notes on statistics and probability for data science, from the basics to advanced topics.
eecsjlee
A Structured Collection of Notes and Hands-On Projects for Learning Data Science, Covering Machine Learning, Statistics, Data Analysis, and More.
tommasofacchin
Personal study repository for the MITx MicroMasters in Statistics and Data Science. Contains concise LaTeX notes and worked exercises for the core courses.
Conni2
This repository contains my Jupyter Notebook notes covering the fundamentals of data science, perfect for beginners and enthusiasts looking to understand data analysis, statistics, and visualization.
dorukanc
A comprehensive collection of Python code, notes, and resources exploring the mathematical foundations essential for data science and artificial intelligence. This repo covers linear algebra, calculus, probability, statistics, and optimization, providing practical implementations and theoretical insights.
This repository is for coding activities developed in Jupyter notebooks using the Noteable platform (www.noteable.edina.ac.uk) for Scottish teachers and learners. These materials were developed between February and May 2021 and cover topics from the SQA curriculum in Computing Science, Mathematics, Statistics, and other fields involving data analysis. The content in these notebooks aims to provide support and learning materials for teachers to adopt and use the Noteable service across schools in Scotland, to deliver curriculum topics involving the analysis of numbers, data, or other types of information, and programming elements at Scottish Qualifications Authority National Levels 3, 4, and 5, Higher, and Advanced Higher.
lukaszkora
This is a "School Register" app - my academic project completed in 2013 while studying at the Warsaw School of Computer Science (Warszawska Wyższa Szkoła Informatyki). Depending on the user role, it allows adding/editing/deleting data about students, grades, and news (visible on the landing page). There is also a simple "statistics" module that reports the overall average grade for all students, for a given class, for a student, or for a course. You're welcome to check how it works: ADMIN - login: admin, password: adminadmin. Please note: the starting page is StronaGlowna.aspx (just for the purposes of this project).
elz-harri
While your data companions rushed off to jobs in finance and government, you remained adamant that science was the way for you. Staying true to your mission, you've joined Pymaceuticals Inc., a burgeoning pharmaceutical company based out of San Diego. Pymaceuticals specializes in anti-cancer pharmaceuticals. In its most recent efforts, it began screening for potential treatments for squamous cell carcinoma (SCC), a commonly occurring form of skin cancer. As a senior data analyst at the company, you've been given access to the complete data from their most recent animal study. In this study, 249 mice identified with SCC tumor growth were treated through a variety of drug regimens. Over the course of 45 days, tumor development was observed and measured. The purpose of this study was to compare the performance of Pymaceuticals' drug of interest, Capomulin, versus the other treatment regimens. You have been tasked by the executive team to generate all of the tables and figures needed for the technical report of the study. The executive team also has asked for a top-level summary of the study results.

## Instructions

Your tasks are to do the following:

* Before beginning the analysis, check the data for any mouse ID with duplicate time points and remove any data associated with that mouse ID. Use the cleaned data for the remaining steps.
* Generate a summary statistics table consisting of the mean, median, variance, standard deviation, and SEM of the tumor volume for each drug regimen.
* Generate a bar plot using both Pandas's `DataFrame.plot()` and Matplotlib's `pyplot` that shows the number of total mice for each treatment regimen throughout the course of the study. **NOTE:** These plots should look identical.
* Generate a pie plot using both Pandas's `DataFrame.plot()` and Matplotlib's `pyplot` that shows the distribution of female or male mice in the study. **NOTE:** These plots should look identical.
* Calculate the final tumor volume of each mouse across four of the most promising treatment regimens: Capomulin, Ramicane, Infubinol, and Ceftamin. Calculate the quartiles and IQR and quantitatively determine if there are any potential outliers across all four treatment regimens.
* Using Matplotlib, generate a box and whisker plot of the final tumor volume for all four treatment regimens and highlight any potential outliers in the plot by changing their color and style. **Hint**: All four box plots should be within the same figure. Use this [Matplotlib documentation page](https://matplotlib.org/gallery/pyplots/boxplot_demo_pyplot.html#sphx-glr-gallery-pyplots-boxplot-demo-pyplot-py) for help with changing the style of the outliers.
* Select a mouse that was treated with Capomulin and generate a line plot of tumor volume vs. time point for that mouse.
* Generate a scatter plot of mouse weight versus average tumor volume for the Capomulin treatment regimen.
* Calculate the correlation coefficient and linear regression model between mouse weight and average tumor volume for the Capomulin treatment. Plot the linear regression model on top of the previous scatter plot.
* Look across all previously generated figures and tables and write at least three observations or inferences that can be made from the data. Include these observations at the top of the notebook.

Here are some final considerations:

* You must use proper labeling of your plots, including properties such as plot titles, axis labels, legend labels, and _x_-axis and _y_-axis limits.
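A minimal pandas sketch of the first two tasks (duplicate-mouse removal and the per-regimen summary table), on a tiny made-up frame; the column names (`Mouse ID`, `Timepoint`, `Drug Regimen`, `Tumor Volume (mm3)`) are an assumption based on the instructions above:

```python
# Hypothetical miniature of the merged study data, not the real CSVs.
import pandas as pd

df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 5, 5, 0, 5],          # a1 has a duplicated timepoint
    "Drug Regimen": ["Capomulin"] * 3 + ["Ramicane"] * 2,
    "Tumor Volume (mm3)": [45.0, 44.2, 44.3, 45.0, 43.9],
})

# 1. Find mouse IDs with duplicated (Mouse ID, Timepoint) pairs and drop
#    every row belonging to those mice.
dup_ids = df.loc[df.duplicated(["Mouse ID", "Timepoint"]), "Mouse ID"].unique()
clean = df[~df["Mouse ID"].isin(dup_ids)]

# 2. Summary statistics of tumor volume per regimen.
summary = clean.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(
    ["mean", "median", "var", "std", "sem"]
)
```

Note that `duplicated()` alone only flags the second of each duplicate pair; the assignment asks to drop *all* rows for the offending mouse, hence the `isin()` filter.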
Nourrmadan
In this project, you will write a program that simulates the operation of a telephone system that might be found in a small business, such as your local restaurant. Only one person can answer the phone (a single-server queue), but there can be an unlimited number of calls waiting to be answered. Queue analysis considers two primary elements: the length of time a requester waits for service (the queue waiting time - in this case, the time the customer waits before placing an order) and the service time (the time it takes the customer to place an order). Your program will simulate the operation of the telephone and gather statistics during the process. The program requires two inputs to run the simulation: (1) the length of time, in hours, that the service will be provided, and (2) the maximum time it takes for the operator to take an order (the maximum service time). Four elements are required to run the simulation: a timing loop, a call simulator, a call processor, and a start-call function.

1. Timing loop: This is simply the simulation loop. Every iteration of the loop will be considered 1 minute of real time. The loop will continue until the service has been in operation for the requested amount of time (the input given above). When the operating period is complete, however, any waiting calls must be answered before ending the simulation. The timing loop has the following subfunctions: (a) determine whether a call was received (call simulator); (b) process the active call; (c) start a new call.
2. Call simulator: The call simulator will use a random number generator to determine whether a call has been received. Scale the random number to an appropriate range, such as 1 to 10. The random number should be compared with a defined constant. If the value is less than the constant, a call was received; if it is not, then no call was received. For the simulation, set the call level to 50%; that is, on average, a call will be received every 2 minutes. If a call is received, place it in a queue.
3. Process active call: If a call is active, test whether it has been completed. To check whether it is completed, you will again need to decide randomly whether the call is finished: check the time of the service and make sure that it is always less than the given maximum service time. If completed, print the statistics for the current call and gather the necessary statistics for the end-of-job report.
4. Start new call: If there are no active calls, start a new call if there is one waiting in the queue. Note that starting a call must calculate how long the call has been waiting in the queue.

(Al Alamein International University, Faculty of Computer Science and Engineering, CSE111 Data Structures)

During the processing, print the data shown in the table below after each call is completed (note: you will not get the same results). It shows the clock time, the call number, the arrival time (the time the call was made), the wait time (how long the call waited in the queue to be answered), the start time (when the call was answered), the service time (how long it took to serve this call), and the queue size (the total number of calls in the queue):

Clock time | Call number | Arrival time | Wait time | Start time | Service time | Queue size

At the end of the simulation, print out the following statistics gathered during the processing. Be sure to use an appropriate format for each statistic, such as float for averages:

1. Total calls: calls received during the operating time
2. Total idle time: total time during which no calls were being serviced
3. Total wait time: sum of wait times for all calls
4. Total service time: sum of service times for all calls
5. Maximum queue size: maximum number of calls waiting during the simulation
6. Average wait time: total wait time / number of calls
7. Average service time: total service time / number of calls

Run the simulator twice. Both runs should simulate 2 hours. In the first simulation, use a maximum service time of 2 minutes. In the second run, use a maximum service time of 5 minutes.

Programming Requirements: To implement this simulator, you need to enter the two values required to start the simulator from the command line: (1) the length of time, in hours, that the service will be provided, and (2) the maximum time it takes for the operator to take an order (the maximum service time). You need to write the following functions:

- run_simulator(): includes the main timing loop.
- call_simulator(): uses a random number generator to determine whether a call has been received.
- process_active_call(): if a call is active, tests whether it has been completed; if completed, prints the statistics for the current call and gathers the necessary statistics for the end-of-job report.
- start_new_call(): if there are no active calls, starts a new call if there is one waiting in the queue.
- finalise_report_simulator(): finishes the simulation by reporting the required data (see the list above).
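The timing loop described above can be sketched with only the standard library. This is a compact single-function sketch, not the five-function decomposition the assignment requires; the 1-10 scale and 50% call level follow the description, and the idle-time bookkeeping is a simplification:

```python
# Hypothetical sketch of the telephone-queue simulation (one minute per
# loop iteration; the real assignment reads its inputs from the command line).
import random
from collections import deque

CALL_THRESHOLD = 5  # on a 1-10 scale: <= 5 means "call received" (50%)

def run_simulator(hours, max_service_time, seed=None):
    rng = random.Random(seed)
    minutes = hours * 60
    queue = deque()      # arrival times of waiting calls
    active_end = None    # minute the current call finishes, if any
    clock = 0
    stats = {"total_calls": 0, "total_wait": 0, "total_idle": 0, "max_queue": 0}

    def maybe_receive_call():
        # call simulator: random 1-10 compared against the threshold
        if rng.randint(1, 10) <= CALL_THRESHOLD:
            queue.append(clock)
            stats["total_calls"] += 1

    # keep looping past the operating period until the queue drains
    while clock < minutes or queue or active_end is not None:
        clock += 1
        if clock <= minutes:
            maybe_receive_call()
        # process active call: has it finished this minute?
        if active_end is not None and clock >= active_end:
            active_end = None
        # start new call: service time is random, capped at the maximum
        if active_end is None and queue:
            arrival = queue.popleft()
            stats["total_wait"] += clock - arrival
            active_end = clock + rng.randint(1, max_service_time)
        if active_end is None:
            stats["total_idle"] += 1
        stats["max_queue"] = max(stats["max_queue"], len(queue))
    return stats
```

A two-hour run with a 2-minute maximum service time would be `run_simulator(2, 2)`; per-call printing and the end-of-job report are left out of the sketch.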
# Matplotlib Homework - The Power of Plots

## Background

What good is data without a good plot to tell the story? So, let's take what you've learned about Python Matplotlib and apply it to a real-world situation and dataset.

While your data companions rushed off to jobs in finance and government, you remained adamant that science was the way for you. Staying true to your mission, you've joined Pymaceuticals Inc., a burgeoning pharmaceutical company based out of San Diego. Pymaceuticals specializes in anti-cancer pharmaceuticals. In its most recent efforts, it began screening for potential treatments for squamous cell carcinoma (SCC), a commonly occurring form of skin cancer. As a senior data analyst at the company, you've been given access to the complete data from their most recent animal study. In this study, 249 mice identified with SCC tumor growth were treated through a variety of drug regimens. Over the course of 45 days, tumor development was observed and measured. The purpose of this study was to compare the performance of Pymaceuticals' drug of interest, Capomulin, versus the other treatment regimens. You have been tasked by the executive team to generate all of the tables and figures needed for the technical report of the study. The executive team also has asked for a top-level summary of the study results.

## Instructions

Your tasks are to do the following:

* Before beginning the analysis, check the data for any mouse ID with duplicate time points and remove any data associated with that mouse ID. Use the cleaned data for the remaining steps.
* Generate a summary statistics table consisting of the mean, median, variance, standard deviation, and SEM of the tumor volume for each drug regimen.
* Generate a bar plot using both Pandas's `DataFrame.plot()` and Matplotlib's `pyplot` that shows the number of total mice for each treatment regimen throughout the course of the study. **NOTE:** These plots should look identical.
* Generate a pie plot using both Pandas's `DataFrame.plot()` and Matplotlib's `pyplot` that shows the distribution of female or male mice in the study. **NOTE:** These plots should look identical.
* Calculate the final tumor volume of each mouse across four of the most promising treatment regimens: Capomulin, Ramicane, Infubinol, and Ceftamin. Calculate the quartiles and IQR and quantitatively determine if there are any potential outliers across all four treatment regimens.
* Using Matplotlib, generate a box and whisker plot of the final tumor volume for all four treatment regimens and highlight any potential outliers in the plot by changing their color and style. **Hint**: All four box plots should be within the same figure. Use this [Matplotlib documentation page](https://matplotlib.org/gallery/pyplots/boxplot_demo_pyplot.html#sphx-glr-gallery-pyplots-boxplot-demo-pyplot-py) for help with changing the style of the outliers.
* Select a mouse that was treated with Capomulin and generate a line plot of time point versus tumor volume for that mouse.
* Generate a scatter plot of mouse weight versus average tumor volume for the Capomulin treatment regimen.
* Calculate the correlation coefficient and linear regression model between mouse weight and average tumor volume for the Capomulin treatment. Plot the linear regression model on top of the previous scatter plot.
* Look across all previously generated figures and tables and write at least three observations or inferences that can be made from the data. Include these observations at the top of the notebook.

Here are some final considerations:

* You must use proper labeling of your plots, including properties such as plot titles, axis labels, legend labels, and _x_-axis and _y_-axis limits.
* See the [starter workbook](Pymaceuticals/pymaceuticals_starter.ipynb) for help on what modules to import and the expected format of the notebook.

## Hints and Considerations

* Be warned: These are very challenging tasks. Be patient with yourself as you trudge through these problems. They will take time, and there is no shame in fumbling along the way. Data visualization is equal parts exploration and equal parts resolution.
* You have been provided a starter notebook. Use the code comments as a reminder of steps to follow as you complete the assignment.
* Don't get bogged down in small details. Always focus on the big picture. If you can't figure out how to get a label to show up correctly, come back to it. Focus on getting the core skeleton of your notebook complete. You can always revisit old problems.
* While you are trying to complete this assignment, feel encouraged to refer constantly to Stack Overflow and the Pandas documentation. These are essential tools in every data analyst's tool belt.
* Remember, there are many ways to approach a data problem. The key is to break up your task into micro tasks. Try answering questions like:
  * How does my DataFrame need to be structured for me to have the right _x_-axis and _y_-axis?
  * How do I build a basic scatter plot?
  * How do I add a label to that scatter plot?
  * Where would the labels for that scatter plot come from?
* Again, don't let the magnitude of a programming task scare you off. Ultimately, every programming problem boils down to a handful of bite-sized tasks.
* Get help when you need it! There is never any shame in asking. But, as always, ask a _specific_ question. You'll never get a great answer to "I'm lost."

### Copyright

Trilogy Education Services © 2020. All Rights Reserved.
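The quartile/IQR outlier check from the boxplot step can be sketched in pure Python on made-up final tumor volumes. In the actual assignment you would use pandas' `Series.quantile`; note that quartile interpolation conventions differ slightly between implementations, so the cut points here may not match pandas exactly:

```python
# Hypothetical final tumor volumes for one regimen (values invented).
import statistics

final_volumes = [36.2, 37.1, 38.5, 40.0, 41.3, 42.8, 43.0, 65.0]

# statistics.quantiles(n=4) returns the three cut points [Q1, median, Q3]
q1, _, q3 = statistics.quantiles(final_volumes, n=4)
iqr = q3 - q1

# the standard 1.5*IQR rule for flagging potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in final_volumes if v < lower or v > upper]
```

With these invented values the single inflated volume falls above the upper fence; the same per-regimen check, repeated for Capomulin, Ramicane, Infubinol, and Ceftamin, is what the box-and-whisker plot visualizes.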
AndrewjSage
No description available
SanchitaMishra170676
No description available