Found 1,210 repositories (showing 30)
ananas-analytics
A hackable data integration & analysis tool that enables non-technical users to edit data processing jobs and visualise data on demand.
jobright-ai
Collection of 2026 New Grad Jobs in Data Analysis!
Aryia-Behroziuan
An ANN is a model based on a collection of connected units or nodes called "artificial neurons", which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit information, a "signal", from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called "edges". Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times. The original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis. Deep learning consists of multiple hidden layers in an artificial neural network. This approach tries to model the way the human brain processes light and sound into vision and hearing. 
Some successful applications of deep learning are computer vision and speech recognition.[68] Decision trees Main article: Decision tree learning Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making. Support vector machines Main article: Support vector machines Support vector machines (SVMs), also known as support vector networks, are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.[69] An SVM training algorithm is a non-probabilistic, binary, linear classifier, although methods such as Platt scaling exist to use SVM in a probabilistic classification setting. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. Illustration of linear regression on a data set. 
Regression analysis Main article: Regression analysis Regression analysis encompasses a large variety of statistical methods to estimate the relationship between input variables and their associated features. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is often extended by regularization (mathematics) methods to mitigate overfitting and bias, as in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression (for example, used for trendline fitting in Microsoft Excel[70]), logistic regression (often used in statistical classification) or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to higher-dimensional space. Bayesian networks Main article: Bayesian network A simple Bayesian network. Rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet. A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independence with a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning. Bayesian networks that model sequences of variables, like speech signals or protein sequences, are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. 
Genetic algorithms Main article: Genetic algorithm A genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process of natural selection, using methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms were used in the 1980s and 1990s.[71][72] Conversely, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[73] Training models Usually, machine learning models require a lot of data in order for them to perform well. Usually, when training a machine learning model, one needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, and data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model. Federated learning Main article: Federated learning Federated learning is an adapted form of distributed artificial intelligence to training machine learning models that decentralizes the training process, allowing for users' privacy to be maintained by not needing to send their data to a centralized server. This also increases efficiency by decentralizing the training process to many devices. 
For example, Gboard uses federated machine learning to train search query prediction models on users' mobile phones without having to send individual searches back to Google.[74] Applications There are many applications for machine learning, including: Agriculture Anatomy Adaptive websites Affective computing Banking Bioinformatics Brain–machine interfaces Cheminformatics Citizen science Computer networks Computer vision Credit-card fraud detection Data quality DNA sequence classification Economics Financial market analysis[75] General game playing Handwriting recognition Information retrieval Insurance Internet fraud detection Linguistics Machine learning control Machine perception Machine translation Marketing Medical diagnosis Natural language processing Natural language understanding Online advertising Optimization Recommender systems Robot locomotion Search engines Sentiment analysis Sequence mining Software engineering Speech recognition Structural health monitoring Syntactic pattern recognition Telecommunication Theorem proving Time series forecasting User behavior analytics In 2006, the media-services provider Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy of its existing Cinematch movie recommendation algorithm by at least 10%. 
A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[76] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and they changed their recommendation engine accordingly.[77] In 2010 The Wall Street Journal wrote about the firm Rebellion Research and their use of machine learning to predict the financial crisis.[78] In 2012, co-founder of Sun Microsystems, Vinod Khosla, predicted that 80% of medical doctors' jobs would be lost in the next two decades to automated machine learning medical diagnostic software.[79] In 2014, it was reported that a machine learning algorithm had been applied in the field of art history to study fine art paintings and that it may have revealed previously unrecognized influences among artists.[80] In 2019 Springer Nature published the first research book created using machine learning.[81] Limitations Although machine learning has been transformative in some fields, machine-learning programs often fail to deliver expected results.[82][83][84] Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.[85] In 2018, a self-driving car from Uber failed to detect a pedestrian, who was killed after a collision.[86] Attempts to use machine learning in healthcare with the IBM Watson system failed to deliver even after years of time and billions of dollars invested.[87][88] Bias Main article: Algorithmic bias Machine learning approaches in particular can suffer from different data biases. 
A machine learning system trained on current customers only may not be able to predict the needs of new customer groups that are not represented in the training data. When trained on man-made data, machine learning is likely to pick up the same constitutional and unconscious biases already present in society.[89] Language models learned from data have been shown to contain human-like biases.[90][91] Machine learning systems used for criminal risk assessment have been found to be biased against black people.[92][93] In 2015, Google photos would often tag black people as gorillas,[94] and in 2018 this still was not well resolved, but Google reportedly was still using the workaround to remove all gorillas from the training data, and thus was not able to recognize real gorillas at all.[95] Similar issues with recognizing non-white people have been found in many other systems.[96] In 2016, Microsoft tested a chatbot that learned from Twitter, and it quickly picked up racist and sexist language.[97] Because of such challenges, the effective use of machine learning may take longer to be adopted in other domains.[98] Concern for fairness in machine learning, that is, reducing bias in machine learning and propelling its use for human good is increasingly expressed by artificial intelligence scientists, including Fei-Fei Li, who reminds engineers that "There’s nothing artificial about AI...It’s inspired by people, it’s created by people, and—most importantly—it impacts people. It is a powerful tool we are only just beginning to understand, and that is a profound responsibility.”[99] Model assessments Classification of machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data in a training and test set (conventionally 2/3 training set and 1/3 test set designation) and evaluates the performance of the training model on the test set. 
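The holdout method described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the helper names `holdout_split` and `accuracy`, the toy data, and the fixed threshold "model" are all hypothetical and do not come from the repository or the quoted text.

```python
import random


def holdout_split(items, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data: (feature, label) pairs, label is 1 when feature >= 5.
data = [(x, int(x >= 5)) for x in range(12)]
train, test = holdout_split(data)  # conventional 2/3 train, 1/3 test

# Stand-in "model": a fixed threshold rule. A real model would be fit on
# `train` only and then evaluated on the unseen `test` set.
predictions = [int(x >= 5) for x, _ in test]
test_accuracy = accuracy(predictions, [y for _, y in test])
```

Keeping the test set out of training is the point of the split: accuracy measured on `test` estimates performance on data the model has not seen.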
In comparison, the K-fold-cross-validation method randomly partitions the data into K subsets and then K experiments are performed each respectively considering 1 subset for evaluation and the remaining K-1 subsets for training the model. In addition to the holdout and cross-validation methods, bootstrap, which samples n instances with replacement from the dataset, can be used to assess model accuracy.[100] In addition to overall accuracy, investigators frequently report sensitivity and specificity meaning True Positive Rate (TPR) and True Negative Rate (TNR) respectively. Similarly, investigators sometimes report the false positive rate (FPR) as well as the false negative rate (FNR). However, these rates are ratios that fail to reveal their numerators and denominators. The total operating characteristic (TOC) is an effective method to express a model's diagnostic ability. TOC shows the numerators and denominators of the previously mentioned rates, thus TOC provides more information than the commonly used receiver operating characteristic (ROC) and ROC's associated area under the curve (AUC).[101] Ethics Machine learning poses a host of ethical questions. Systems which are trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices.[102] For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants.[103][104] Responsible collection of data and documentation of algorithmic rules used by a system thus is a critical part of machine learning. Because human languages contain biases, machines trained on language corpora will necessarily also learn these biases.[105][106] Other forms of ethical challenges, not related to personal biases, are more seen in health care. 
There are concerns among health care professionals that these systems might not be designed in the public's interest but as income-generating machines. This is especially true in the United States where there is a long-standing ethical dilemma of improving health care, but also increasing profits. For example, the algorithms could be designed to provide patients with unnecessary tests or medication in which the algorithm's proprietary owners hold stakes. There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these "greed" biases are addressed.[107] Hardware Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks (a particular narrow subdomain of machine learning) that contain many layers of non-linear hidden units.[108] By 2019, graphic processing units (GPUs), often with AI-specific enhancements, had displaced CPUs as the dominant method of training large-scale commercial cloud AI.[109] OpenAI estimated the hardware compute used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017), and found a 300,000-fold increase in the amount of compute required, with a doubling-time trendline of 3.4 months.[110][111] Software Software suites containing a variety of machine learning algorithms include the following: Free and open-source so
sharmaroshan
Data visualization is emerging as one of the most essential skills in almost every IT and non-IT sector and job. Data visualizations support wiser decisions that can lead a business to bigger profits and help it understand the root causes and behavior of the people and customers associated with it. In this repository I discuss line plots, bar plots, scatter plots, and pie charts in depth, along with scientific plots, 3D plots, animated plots, and interactive plots for visualizing business problems of any complexity.
RafaelCartenet
Model Context Protocol (MCP) server for Databricks that empowers AI agents to autonomously interact with Unity Catalog metadata. Enables data discovery, lineage analysis, and intelligent SQL execution. Agents explore catalogs/schemas/tables, understand relationships, discover notebooks/jobs, and execute queries - greatly reducing ad-hoc query time.
jay-johnson
Create and manage multiple Kubernetes clusters using KVM on a bare metal Fedora 29 server. Includes helm + rook-ceph + nginx ingress + the stock analysis engine (jupyter + redis cluster + minio + automated cron jobs for data collection). Works on Kubernetes version v1.16.0; 1.16.3 was not working.
MNC-Aubin
No description available
marcgarnica13
Understanding gender differences in professional European football through Machine Learning interpretability and match actions data. This repository contains the full data pipeline implemented for the study *Understanding gender differences in professional European football through Machine Learning interpretability and match actions data*. We evaluated the main differentiating features of European male and female football players in match actions data under the assumption of finding significant differences and established patterns between genders. A methodology for unbiased feature extraction and objective analysis is presented based on data integration and machine learning explainability algorithms. Female (1511) and male (2700) data points were collected from event data categorized by game period and player position. Each data point included the main tactical variables supported by research and industry to evaluate and classify football styles and performance. We set up a supervised classification pipeline to predict the gender of each player by looking at their actions in the game. The comparison methodology did not include any qualitative enrichment or subjective analysis to prevent biased data enhancement or gender-related processing. The pipeline had three representative binary classification models: a logic-based Decision Tree, a probabilistic Logistic Regression, and a multilayer perceptron Neural Network. Each model tried to draw the differences between male and female data points, and we extracted the results using machine learning explainability methods to understand the underlying mechanics of the models implemented. Good predictive accuracy was consistent across the different models deployed. ## Installation Install the required Python packages ``` pip install -r requirements.txt ``` To handle heterogeneity and performance efficiently, we use PySpark from [Apache Spark](https://spark.apache.org/). PySpark provides an end-user API for Spark jobs. 
You might want to check how to set up a local or remote Spark cluster in [their documentation](https://spark.apache.org/docs/latest/api/python/index.html). ## Repository structure This repository is organized as follows: - Preprocessed data from the two different data streams is collected in [the data folder](data/). For the Opta files, it contains the event-based metrics computed from each match of the 2017 Women's Championship and a single file calculating the event-based metrics from the 2016 Men's Championship published [here](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5). Even though we cannot publish the original data source, the two Python scripts implemented to homogenize and integrate both data streams into event-based metrics are included in [the data gathering folder](data_gathering/) folder contains the graphical images and media used for the report. - The [data cleaning folder](data_cleaning/) contains descriptor scripts for both data streams and [the final integration](data_cleaning/merger.py) - [Classification](classification/) contains all the Jupyter notebooks for each model present in the experiment as well as some persistent models for testing.
valenserimedei
Welcome to the new era. One of the biggest challenges when studying the technical skills of data science is understanding how those skills and concepts translate into real jobs, like growth marketing. The main idea is to demonstrate how, with Python skills, you can make the best marketing decisions based on data. In this project, using Python packages such as pandas, I analyze marketing campaigns with machine learning, taking into account metrics such as CTR, conversion rate, and retention rate for each social network, to learn how to analyze campaign performance, measure customer engagement, and predict customer churn in order to improve a company's marketing strategy.
saif0666
Exploratory data analysis on occupational wage data using Python. Includes data cleaning, wage growth comparison, and visualizations of top-paying and most-employed jobs using Pandas, Matplotlib, and Seaborn.
MioYvo
Risk control system. Receives data from external systems, distributes jobs to workers, finds and analyzes risks according to predefined rules using our own rule engine, then makes a judgment or applies a penalty and sends the response back to the originating external system.
snehayadav23
This project aims to implement a detector of similar job descriptions using big data analysis techniques on the LinkedIn Jobs & Skills dataset. By leveraging PySpark, we will uncover insights and patterns in the job summary data to help employers and job seekers better understand the job market.
RecruiterRon
David Aplin Group, one of Canada's Best Managed Companies, has partnered with our client to recruit Junior Software Developers. New graduates or soon-to-graduate students are encouraged to apply! Our client is looking for Junior Software Developers to join their growing team. This position is responsible for the development, evaluation, implementation, and maintenance of new software solutions, including maintenance and development of existing applications. Applications involve data collection, data storage, machine learning, and data visualization. The Role: Designing, coding, and debugging software applications using front-end frameworks and enterprise applications - front-end, back-end, and full-stack development. Performing software analysis, code analysis, requirements analysis, software reviews, identification of code metrics, system risk analysis, software reliability analysis. Providing assistance with installations, system configuration, and third-party system integrations. Providing team members and clients with support and guidance. The Ideal Candidate: A Bachelor's degree or Diploma in Computer Science, Computer Engineering, Information Technology, or a similar field. Experience working with coding languages C#, JavaScript, Angular, React, Python, PHP jQuery, JSON, and Ajax. Solid understanding of web design and development principles. Good planning, analytical, and decision-making skills. A portfolio of web design, applications, and projects you have worked on including projects published on GitHub. Critical-thinking skills. In-depth knowledge of software prototyping and UX design tools. High personal code/development standards (peer testing, unit testing, documentation, etc). Team spirit and a sense of humour are always great. Goal-orientated and deadline-driven. COVID-19 considerations: All employees are currently working from home. Any equipment or materials required for work will be provided by the company via shipment to the employee's home. 
Company policy will continue to evolve through the COVID-19 pandemic and implement alternative working arrangements to ensure that all our people stay safe. If you are interested in this position and meet the above criteria, please send your resume in confidence directly to Jim Juacalla or Ron Cantiveros at Aplin Information Technology, A Division of David Aplin Group. We thank all applicants; however, only those selected for an interview will be contacted. Apply: https://jobs.aplin.com/job/409253/Junior-Software-Developers-New-Graduates
Mavengence
Jobs were scraped from LinkedIn to perform a data analysis for optimizing a CV and cover letter.
NatLabRockies
HPC jobs from NREL's Eagle supercomputer: data, analysis, prediction, visualization
tanvisenjaliya
The small size and lack of diversity of the available dataset of dermatoscopic pictures make it difficult to train neural networks for automated identification of pigmented skin lesions. The HAM10000 ("Human Against Machine with 10000 training images") dataset addresses this issue. It consists of dermatoscopic images from various populations, which were captured and preserved using various modalities. There are 10015 dermatoscopic images in the final dataset, which can be used as a training set for academic machine learning. The cases include a diverse range of pigmented lesions, including actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (solar lentigines / seborrheic keratoses and lichen-planus like keratoses, bkl), dermatofibroma (df), melanoma (mel), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas, pyogenic granulomas and hemorrhage, vasc). The dataset includes lesions with multiple images, which can be tracked by the lesion_id-column within the HAM10000_metadata file. The International Skin Imaging Collaboration (ISIC), a multinational partnership that has created the world's biggest public archive of dermoscopic pictures of skin, held the world's largest skin image analysis challenge. In 2018, the challenge was held in Granada, Spain, at the Medical Image Computing and Computer Assisted Intervention conference. Over 12,500 photos were included in the dataset, which was divided into three tasks. 900 individuals signed up for data download, with 115 completing the lesion segmentation task, 25 the lesion attribute detection task, and 159 the illness classification task.
The purpose of this project is to analyze the US job market for data jobs using the Indeed data (2019) for the top 10 US tech (IT) cities.
pranjaljain99
BITCOIN ANALYSIS USING BIG DATA JOBS
IrvanDimetrio
Scraping LinkedIn data-related job postings and analysing job market trends in Indonesia.
I analysed the data analyst jobs data set to find insights such as the most in-demand data analyst jobs, the most competitive ones, and so on.
LinkedIn trending jobs analysis BI project
SHIVASHANKAR-V07
SQL-based analysis of Data Analyst jobs (2023) - salaries, skills demand, and career insights using PostgreSQL, inspired by Luke Barousse’s SQL course.
michael7101
The objective of this project is to scrape data from a jobs site to analyze and rank the top skills employers are seeking when filling Data Analysis job positions.
This data science project focuses on analyzing the employment challenges faced by refugees in Egypt through both quantitative and qualitative data. The project will explore job unsustainability for employed refugees and the barriers unemployed refugees face when seeking jobs. Additionally, the project will include text analysis of open-ended resp.
aschaetzle
Pig Latin is a high-level language for data analysis tasks, developed at Yahoo! Research, which is automatically transformed into MapReduce jobs and executed in a Hadoop cluster. PigSPARQL is a translation from SPARQL 1.0 to Pig Latin, which allows SPARQL queries to be executed on large RDF graphs with MapReduce.
muabdalaleam
An inspection into the current market of data-field jobs & freelancing (data analysis, data science, ML development & data engineering)
MohabWafaie
A fully interactive and dynamic Power BI dashboard for analysing the top 5 most common data-related jobs in the job market.
priyachakradhari
No description available
jennifermarie6sl
Data Analyst Jobs Analysis
MasterMindRomii
Welcome to the DataScience Jobs Data Insights project, Day 4 of my data exploration journey! Today, we ventured into the intriguing realm of data science job data.