Found 1,943 repositories (showing 30)
mar-antaya
No description available
himanshub1007
# AD-Prediction
Convolutional Neural Networks for Alzheimer's Disease Prediction Using Brain MRI Images
## Abstract
Alzheimer's disease (AD) is characterized by severe memory loss and cognitive impairment. It is associated with significant changes in brain structure, which can be measured by magnetic resonance imaging (MRI). The observable preclinical structural changes provide an opportunity for early AD detection using image classification tools such as convolutional neural networks (CNNs). However, most AD-related studies so far have been limited by sample size, so finding an efficient way to train an image classifier on limited data is critical. In our project, we explored different CNN-based transfer-learning methods for AD prediction from structural brain MRI images. We find that both a pretrained 2D AlexNet with a 2D-representation method and a simple neural network with a pretrained 3D autoencoder improved prediction performance compared with a deep CNN trained from scratch. The pretrained 2D AlexNet performed even better (**86%**) than the 3D CNN with autoencoder (**77%**).
## Method
#### 1. Data
In this project, we used public brain MRI data from the **Alzheimer's Disease Neuroimaging Initiative (ADNI)** study. ADNI is an ongoing, multicenter cohort study that started in 2004. It focuses on understanding the diagnostic and predictive value of Alzheimer's disease-specific biomarkers. The ADNI study has three phases: ADNI1, ADNI-GO, and ADNI2. Both ADNI1 and ADNI2 recruited new AD patients and normal controls as research participants. Our data included a total of 686 structural MRI scans from both the ADNI1 and ADNI2 phases, with 310 AD cases and 376 normal controls. We randomly divided the total sample into a training dataset (n = 519), a validation dataset (n = 100), and a testing dataset (n = 67).
#### 2. Image preprocessing
Image preprocessing was conducted using Statistical Parametric Mapping (SPM) software, version 12.
The original MRI scans were first skull-stripped and segmented using a segmentation algorithm based on 6-tissue probability mapping, and then normalized to the International Consortium for Brain Mapping template of European brains using affine registration. Other configuration steps include bias, noise, and global intensity normalization. The standard preprocessing pipeline outputs 3D image files with a uniform size of 121x145x121. Skull-stripping and normalization ensure comparability between images by transforming the original brain image into a standard image space, so that the same brain substructures are aligned at the same image coordinates for different participants. Diluted or enhanced intensity was used to compensate for the structural changes. In our project, we used both the whole brain (including both grey matter and white matter) and grey matter only.
#### 3. AlexNet and Transfer Learning
Convolutional Neural Networks (CNNs) are very similar to ordinary neural networks. A CNN consists of an input layer and an output layer, as well as multiple hidden layers. The hidden layers are convolutional, pooling, or fully connected. ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These properties make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
#### 3.1. AlexNet
The net contains eight layers with weights; the first five are convolutional and the remaining three are fully connected. The overall architecture is shown in Figure 1. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. AlexNet maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU (as shown in Figure 1). The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer. The first convolutional layer filters the 224x224x3 input image with 96 kernels of size 11x11x3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5x5x48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3x3x256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3x3x192, and the fifth convolutional layer has 256 kernels of size 3x3x192. The fully-connected layers have 4096 neurons each.
#### 3.2. Transfer Learning
Training an entire convolutional network from scratch (with random initialization) is impractical [14] because it is relatively rare to have a dataset of sufficient size. An alternative is to pretrain a ConvNet on a very large dataset (e.g. ImageNet), and then use the ConvNet either as an initialization or as a fixed feature extractor for the task of interest.
Typically, there are three major transfer learning scenarios:
**ConvNet as fixed feature extractor:** We can take a ConvNet pretrained on ImageNet, remove the last fully-connected layer, and then treat the rest of the network as a fixed feature extractor for the target dataset. In AlexNet, this would be a 4096-D vector. These features are usually called CNN codes. Once we have these features, we can train a linear classifier (e.g. linear SVM or Softmax classifier) for our target dataset.
**Fine-tuning the ConvNet:** Another idea is to not only replace the last fully-connected layer in the classifier, but to also fine-tune the parameters of the pretrained network. Due to overfitting concerns, we may fine-tune only some higher-level part of the network. This suggestion is motivated by the observation that the earlier layers of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that can be useful for many kinds of tasks, while the later layers of the network become progressively more specific to the details of the classes contained in the original dataset.
**Pretrained models:** The released pretrained model is usually the final ConvNet checkpoint, so it is common to see people use the network for fine-tuning.
#### 4. 3D Autoencoder and Convolutional Neural Network
We take a two-stage approach where we first train a 3D sparse autoencoder to learn filters for convolution operations, and then build a convolutional neural network whose first layer uses the filters learned with the autoencoder.
#### 4.1. Sparse Autoencoder
An autoencoder is a 3-layer neural network that is used to extract features from an input such as an image. Sparse representations can provide a simple interpretation of the input data in terms of a small number of "parts" by extracting the structure hidden in the data.
The autoencoder has an input layer, a hidden layer and an output layer. The input and output layers have the same number of units, while the hidden layer contains more units for a sparse and overcomplete representation. The encoder function maps the input x to a representation h, and the decoder function maps the representation h to the output x. In our problem, we extract 3D patches from scans as the input to the network. The decoder function aims to reconstruct the input from the hidden representation h.
#### 4.2. 3D Convolutional Neural Network
Training the 3D convolutional neural network (CNN) is the second stage. The CNN we use in this project has one convolutional layer, one pooling layer, two linear layers, and finally a log softmax layer. After training the sparse autoencoder, we take the weights and biases of the encoder from the trained model and use them as the 3D filters of the 3D convolutional layer of the 1-layer convolutional neural network. Figure 2 shows the architecture of the network.
#### 5. Tools
In this project, we used Nibabel for MRI image processing and PyTorch for the neural network implementation.
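The kernel counts and shapes quoted in section 3.1 can be multiplied out as a quick sanity check on the size of the convolutional part of AlexNet. This is a minimal sketch, not code from the project: the names `CONV_LAYERS` and `conv_weight_count` are made up for illustration, biases are ignored, and the fully connected layers are left out.

```python
# Kernel counts and per-kernel shapes (height, width, input depth) as quoted
# in section 3.1. Names are illustrative; biases are not counted.
CONV_LAYERS = [
    ("conv1", 96, (11, 11, 3)),
    ("conv2", 256, (5, 5, 48)),
    ("conv3", 384, (3, 3, 256)),
    ("conv4", 384, (3, 3, 192)),
    ("conv5", 256, (3, 3, 192)),
]


def conv_weight_count(num_kernels, kernel_shape):
    """Number of weights in one convolutional layer (no biases)."""
    height, width, depth = kernel_shape
    return num_kernels * height * width * depth


weights_per_layer = {
    name: conv_weight_count(num_kernels, shape)
    for name, num_kernels, shape in CONV_LAYERS
}
total_conv_weights = sum(weights_per_layer.values())

for name, count in weights_per_layer.items():
    print(f"{name}: {count:,} weights")
print(f"total: {total_conv_weights:,} weights")
```

With these shapes the five convolutional layers hold roughly 2.3 million kernel weights in total, with the third layer being the largest.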
TiffinTech
No description available
machina-sports
Open-source agent skills for live sports data and prediction markets. Football, F1, Kalshi, Polymarket. Zero API keys. SKILL.md format.
SalilVishnuKapur
Understanding transportation mode from GPS (Global Positioning System) traces is an essential topic in the data mobility domain. In this paper, a framework is proposed to predict transportation modes. This framework follows a sequence of five steps: (i) data preparation, where GPS points are grouped in trajectory samples; (ii) point features generation; (iii) trajectory features extraction; (iv) noise removal; (v) normalization. We show that the extraction of the new point features: bearing rate, the rate of rate of change of the bearing rate and the global and local trajectory features, like medians and percentiles enables many classifiers to achieve high accuracy (96.5%) and f1 (96.3%) scores. We also show that the noise removal task affects the performance of all the models tested. Finally, the empirical tests where we compare this work against state-of-art transportation mode prediction strategies show that our framework is competitive and outperforms most of them.
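One of the point features named above, the bearing rate, can be sketched as follows. The description does not give its formula, so this assumes the bearing rate is the change in compass bearing between consecutive GPS segments divided by the elapsed time; the function names and the `(timestamp, lat, lon)` tuple layout are hypothetical.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing in degrees [0, 360) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def bearing_rates(points):
    """Bearing change per second between consecutive segments.

    `points` is a list of (timestamp_seconds, lat, lon) sorted by time.
    This definition of bearing rate is an assumption for illustration.
    """
    bearings = [
        bearing_deg(a[1], a[2], b[1], b[2])
        for a, b in zip(points, points[1:])
    ]
    rates = []
    for i in range(1, len(bearings)):
        dt = points[i + 1][0] - points[i][0]
        # Wrap the difference into [-180, 180) so a turn from 350 to 10
        # degrees counts as +20 rather than -340.
        diff = (bearings[i] - bearings[i - 1] + 180.0) % 360.0 - 180.0
        rates.append(diff / dt if dt > 0 else 0.0)
    return rates


# A short synthetic track: head north for 10 s, then turn east for 10 s.
track = [(0, 0.0, 0.0), (10, 0.001, 0.0), (20, 0.001, 0.001)]
rates = bearing_rates(track)
```

Per-trajectory features such as medians and percentiles would then be computed over lists like `rates`.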
ignaciorlando
In this work, we present an extensive description and evaluation of our method for blood vessel segmentation in fundus images based on a discriminatively trained, fully connected conditional random field model. Standard segmentation priors such as a Potts model or total variation usually fail when dealing with thin and elongated structures. We overcome this difficulty by using a conditional random field model with more expressive potentials, taking advantage of recent results enabling inference of fully connected models almost in real-time. Parameters of the method are learned automatically using a structured output support vector machine, a supervised technique widely used for structured prediction in a number of machine learning applications. Our method, trained with state of the art features, is evaluated both quantitatively and qualitatively on four publicly available data sets: DRIVE, STARE, CHASEDB1 and HRF. Additionally, a quantitative comparison with respect to other strategies is included. The experimental results show that this approach outperforms other techniques when evaluated in terms of sensitivity, F1-score, G-mean and Matthews correlation coefficient. Additionally, it was observed that the fully connected model is able to better distinguish the desired structures than the local neighborhood based approach. Results suggest that this method is suitable for the task of segmenting elongated structures, a feature that can be exploited to contribute with other medical and biological applications.
mehmetkahya0
A sophisticated Formula 1 race simulation tool that models and predicts F1 race outcomes with realistic parameters based on driver skills, team performance, track characteristics, and dynamic weather conditions.
mar-antaya
No description available
Developing a deep learning classification model for screening pharmaceutical compounds with hERG inhibitory activity (cardiotoxicity), and using the model to screen the CAS antiviral database to identify compounds with cardiotoxicity potential. The data is derived from "Drug Discovery Hackathon 2020: PS ID: DDT2-13" (https://innovateindia.mygov.in/ddh2020/problem-statements/). Details related to the project are also available at: (https://youtu.be/7tqaPmYQmCM). Note: the problem is solved here with a deep learning classification model instead of the linear discriminant analysis model named in the problem statement. Details of the project: In silico prediction of cardiotoxicity with high sensitivity and specificity for potential drug molecules would be of immense value. Hence, building classification-based machine learning models capable of efficiently predicting cardiotoxicity is critical. A data set of diverse pharmaceutical compounds with hERG channel inhibitory activity (blocker/non-blocker) is provided. The SMILES notations of all compounds are given. The set of compounds was divided into a training set and a test set using a 70:30 ratio. Simple, reproducible and easily transferable classification models were developed from the training set compounds using 2D descriptors. The models were validated on the test set compounds.
The model has the following quality.
Training set: ROC AUC: 0.977280; classification accuracy: 0.986058; precision: 0.993124; sensitivity/recall: 0.990235; F1 score: 0.991677; confusion matrix: [[ 892 33] [ 47 4766]].
Test set: ROC AUC: 0.649767; classification accuracy: 0.813670; precision: 0.883061; sensitivity/recall: 0.895122; F1 score: 0.889050; confusion matrix: [[ 165 243] [ 215 1835]].
The best model was also used to classify CAS antiviral database compounds for hERG channel inhibitory activity, and a list of compounds with cardiotoxicity potential was generated as a .csv file.
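Accuracy, precision, recall and F1 can be recomputed directly from the two confusion matrices. The sketch below assumes the matrices are laid out as [[TN, FP], [FN, TP]] (rows are true classes, columns are predicted classes), which is the layout that reproduces the reported precision values; the function name is illustrative. ROC AUC cannot be recovered this way because it needs the predicted scores.

```python
def metrics_from_confusion(matrix):
    """Derive summary metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


train_metrics = metrics_from_confusion([[892, 33], [47, 4766]])
test_metrics = metrics_from_confusion([[165, 243], [215, 1835]])

for label, metrics in (("train", train_metrics), ("test", test_metrics)):
    summary = ", ".join(f"{k}={v:.6f}" for k, v in metrics.items())
    print(f"{label}: {summary}")
```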
DerHefi
Finding explainable models to predict Formula 1 Qualifying Results
Aayushi-2808
# Cervical_cancer_detection_using_ML
# Introduction
According to the World Health Organisation (WHO), when detected at an early stage, cervical cancer is one of the most curable cancers. Hence, the main motive behind this project is to detect the cancer in its early stages so that it can be treated and managed effectively.
# Flow of project is as explained below:
This project is divided into 5 parts:
1. Data Cleaning
2. Exploratory Data Analysis
3. Baseline model: Logistic Regression
4. Ensemble Models: Bagging with Decision Trees, Random Forest and Boosting
5. Model Comparison and results
# Refer below for References:
Link to basic information regarding cervical cancer: https://www.cdc.gov/cancer/cervical/basic_info/index.htm
The dataset for tackling the problem is supplied by the UCI repository for Machine Learning. Link to Dataset: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29
The dataset contains a list of risk factors that lead up to the Biopsy examination. The generation of the predictor variable is taken care of in part 2 (Exploratory Data Analysis) of this report. We will try to predict the 'biopsy' variable from the dataset using Logistic Regression, Random Forest, Bagging with Decision Trees and Boosting with XGBoost Classifier.
# Results:
Based on our base model and the ensemble models we used, we observed:
1. After the entire process of training, hyperparameter tuning and tackling class imbalance was complete, we obtained the results depicted in the graphics.
2. Bagging and Random Forest give the highest accuracy and precision, 97.09 and 80% respectively.
3. Plotting the confusion matrix showed that Random Forest using upsampling and class weights gives 2 false positives and 3 false negatives, with an AUC of 0.87.
# Why random forest is the best model??
1. Comparing all of our models, RF has the maximum f1_score and accuracy along with Bagging, i.e. 76.2 and 97.09% respectively.
2. It also produces the same number of false negatives, with a recall of 72.73%, as all the other models.
3. We still consider RF better because of its added advantage that its decision trees are decorrelated compared to bagging, leading to lower variance and a greater ability to generalize.
# Conclusion:
On observing the feature importance of the best model, i.e. random forest, we can see that the most important features are Schiller, Hinselmann, HPV, Citology, etc. This also makes sense because Schiller and Hinselmann are actually the tests used to detect cervical cancer.
# Problems Faced:
A major problem encountered while training the model was that there was too little data to train on. By collaborating with all the hospitals in India, we could have enough data points to train a model with a higher recall, thus making the model better.
# Scope of Improvement
As next steps I would want to do exactly that: deploy the model and refine it. We may also modify the number of predictor variables, as it may well turn out that there are other predictors which are not present in our current dataset. This can only be found by practical implementation of our predictions.
emilrules
No description available
lajfhlwejrk
:tada: alzheimer's disease prediction based on brain PET/MRI images (baseline, f1-score 0.81)
frankndungu
Machine learning model that predicts Formula 1 race results for the 2025 Shanghai Grand Prix using historical performance data, team strengths, and driver characteristics. Features data visualization, team change handling, and position progression forecasting.
A hybrid AI-based stock market prediction system using LSTM, Random Forest, and XGBoost, built for real-world deployment with Optuna-powered tuning, feature-rich engineering, and ensemble prediction logic. Designed to optimize F1 score and accuracy, this system aims to generate reliable buy/sell signals on stocks.
frankndungu
Suzuka 2025 F1 race predictions by Otto.rentals — built for fans who love speed, stats, and bold visuals.
yuyangchee98
An interactive web application that allows Formula 1 fans to create and share their predictions for the latest F1 season. Users can drag and drop drivers into grid positions for each race, track points, and share their predictions with others.
Using NLP and ML, make a model to identify hate speech (racist or sexist tweets) in Twitter. Problem Statement: Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate. You are a data scientist at Twitter, and you will help Twitter in identifying the tweets with hate speech and removing them from the platform. You will use NLP techniques, perform specific cleanup for tweets data, and make a robust model. Domain: Social Media Analysis to be done: Clean up tweets and build a classification model by using NLP techniques, cleanup specific for tweets data, regularization and hyperparameter tuning using stratified k-fold and cross validation to get the best model. Content: id: identifier number of the tweet Label: 0 (non-hate) /1 (hate) Tweet: the text in the tweet Tasks: Load the tweets file using read_csv function from Pandas package. Get the tweets into a list for easy text cleanup and manipulation. To cleanup: Normalize the casing. Using regular expressions, remove user handles. These begin with '@’. Using regular expressions, remove URLs. Using TweetTokenizer from NLTK, tokenize the tweets into individual terms. Remove stop words. Remove redundant terms like ‘amp’, ‘rt’, etc. Remove ‘#’ symbols from the tweet while retaining the term. Extra cleanup by removing terms with a length of 1. Check out the top terms in the tweets: First, get all the tokenized terms into one large list. Use the counter and find the 10 most common terms. Data formatting for predictive modeling: Join the tokens back to form strings. This will be required for the vectorizers. Assign x and y. Perform train_test_split using sklearn. We’ll use TF-IDF values for the terms as a feature to get into a vector space model. Import TF-IDF vectorizer from sklearn. Instantiate with a maximum of 5000 terms in your vocabulary. Fit and apply on the train set. 
Apply on the test set. Model building: Ordinary Logistic Regression Instantiate Logistic Regression from sklearn with default parameters. Fit into the train data. Make predictions for the train and the test set. Model evaluation: Accuracy, recall, and f_1 score. Report the accuracy on the train set. Report the recall on the train set: decent, high, or low. Get the f1 score on the train set. Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s. Adjust the appropriate class in the LogisticRegression model. Train again with the adjustment and evaluate. Train the model on the train set. Evaluate the predictions on the train set: accuracy, recall, and f_1 score. Regularization and Hyperparameter tuning: Import GridSearch and StratifiedKFold because of class imbalance. Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters. Use a balanced class weight while instantiating the logistic regression. Find the parameters with the best recall in cross validation. Choose ‘recall’ as the metric for scoring. Choose stratified 4 fold cross validation scheme. Fit into the train set. What are the best parameters? Predict and evaluate using the best estimator. Use the best estimator from the grid search to make predictions on the test set. What is the recall on the test set for the toxic comments? What is the f_1 score?
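The cleanup and top-terms steps listed in the task can be sketched with the standard library alone. This is only an approximation of the described pipeline: a simple regular expression stands in for NLTK's `TweetTokenizer`, the stop-word set is a tiny made-up sample rather than a full list, the '#' symbols are dropped before tokenizing rather than after, and the two example tweets are invented.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a
# full list such as the one shipped with NLTK.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for", "on", "you"}
REDUNDANT_TERMS = {"amp", "rt"}


def clean_tweet(text):
    """Return the cleaned tokens of one tweet."""
    text = text.lower()                                  # normalize the casing
    text = re.sub(r"@\w+", " ", text)                    # remove user handles
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = text.replace("#", "")                         # drop '#', keep the term
    tokens = re.findall(r"[a-z0-9']+", text)             # stand-in tokenizer
    return [
        tok for tok in tokens
        if tok not in STOP_WORDS
        and tok not in REDUNDANT_TERMS
        and len(tok) > 1                                 # drop terms of length 1
    ]


tweets = [
    "RT @user1: Loving the #sunshine today! https://example.com/pic &amp; more",
    "@user2 the sunshine is great",
]
cleaned = [clean_tweet(t) for t in tweets]

# Collect every token into one counter and take the 10 most common terms.
top_terms = Counter(tok for toks in cleaned for tok in toks).most_common(10)

# Join tokens back into strings, as the vectorizer step expects.
documents = [" ".join(toks) for toks in cleaned]
```

The `documents` list is the form that a TF-IDF vectorizer would take as input in the modeling steps.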
amirdhami
Predicting the outcome of F1 races by classifying drivers into a given final placement based on their stats going into a race
frankndungu
Jeddah 2025 F1 race predictions by Otto.rentals — built for fans who love speed, stats, and bold visuals.
End-to-end churn prediction pipeline using XGBoost and MLflow. Features automated preprocessing, threshold optimization (F1: 0.637), and imbalance handling (scale_pos_weight). Optimized for 81% recall to maximize customer retention.
dhvani-k
EDA and Prediction of F1 Race Winners
jakearchibald
F1 predictions thingy. Just me playing with Django really
rohanbagel
No description available
paramramit305-a11y
💳 End-to-end ML web app for loan approval prediction | Logistic Regression | 88% Accuracy | F1: 80.9% | Deployed on Streamlit
lakerrenhu
The main purpose of this project is to develop a reliable prediction model that is able to predict whether a given molecule is a CDK2 inhibitor. In this project, a range of machine learning methods are applied to learn the prediction model for the cancer inhibitor. These methods include elastic net, support vector classifier (SVC), K-nearest neighbors (KNN), Decision Tree (DT), Random Forest (RF), Huber Regressor (HR), Lasso, Lasso plus cross-validation (Lasso CV), least-angle regression (LAR), Bayesian Ridge (BR), stochastic gradient descent classifier (SGDC), Ridge regression (RR), Logistic Regression (LR), Orthogonal Matching Pursuit (OMP), Multilayer Perceptron (MLP), Qlattice, and convolutional neural network (CNN). The performance of the models is evaluated on the test dataset and reported as precision, recall, f1-score, accuracy, and ROC.
vasukumar92
Rare Event Classification (Ensemble Modelling incorporating Random Under-sampling). The data consists of 10,500 credit applications, each classified as good or bad credit. However, there are only 500 bad credit applications. Since this is less than 5% of the data, classifying applicants as bad credit is referred to as a rare event problem. This is also known as anomaly detection in many applications. Approach: The best ratio is discovered by trying ratios from 50:50 to 85:15. An ensemble model is then built based on the optimum ratio selected. This is done by creating an ensemble of trees using the optimum ratio, fitting a model to each, making classification probability predictions for each, and then averaging those to get predicted classification probabilities. From that we can calculate the loss totaled over all the trees. The base model is a decision tree with a minimum leaf size of 5 and a minimum split size of 5. The optimum depth for this model is determined by optimizing the F1-score using 10-fold cross-validation.
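The random under-sampling step can be sketched as follows. The description does not say which side of each ratio is the majority class, so this assumes ratios are written as good:bad (majority:minority) and that all 500 bad applications are always kept; the function name, record layout and seed are hypothetical.

```python
import random


def undersample(majority, minority, majority_share, seed=0):
    """Keep every minority record and a random subset of the majority.

    `majority_share` is the fraction of the result that should come from
    the majority class, e.g. 0.50 for 50:50 or 0.85 for 85:15.
    """
    n_major = round(len(minority) * majority_share / (1.0 - majority_share))
    n_major = min(n_major, len(majority))
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(majority, n_major) + list(minority)


# Synthetic stand-ins for the 10,000 good and 500 bad credit applications.
good = [("good", i) for i in range(10_000)]
bad = [("bad", i) for i in range(500)]

balanced = undersample(good, bad, 0.50)
skewed = undersample(good, bad, 0.85)

print(len(balanced), len(skewed))
```

In an ensemble, each tree would be fitted to a different random draw (for example by varying `seed`) and the predicted probabilities averaged across trees.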
mirzahash
Problem Statement A non-banking financial institution (NBFI) or non-bank financial company (NBFC) is a Financial Institution that does not have a full banking license or is not supervised by a national or international banking regulatory agency. NBFC facilitates bank-related financial services, such as investment, risk pooling, contractual savings, and market brokering. An NBFI is struggling to mark profits due to an increase in defaults in the vehicle loan category. The company aims to determine the client’s loan repayment abilities and understand the relative importance of each parameter contributing to a borrower’s ability to repay the loan. Goal: The goal of the problem is to predict whether a client will default on the vehicle loan payment or not. For each ID in the Test_Dataset, you must predict the “Default” level. Datasets The problem contains two datasets, Train_Dataset and Test_Dataset. Model building is to be done on Train_Dataset and the Model testing is to be done on Test_Dataset. The output from the Test_Dataset is to be used to make prediction. Metric to measure The metric to measure is the F1_Score. F1_Score is the harmonic mean of Recall and Precision.
klimanyusuf
DESCRIPTION Using NLP and ML, make a model to identify hate speech (racist or sexist tweets) in Twitter. Problem Statement: Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate. You are a data scientist at Twitter, and you will help Twitter in identifying the tweets with hate speech and removing them from the platform. You will use NLP techniques, perform specific cleanup for tweets data, and make a robust model. Domain: Social Media Analysis to be done: Clean up tweets and build a classification model by using NLP techniques, cleanup specific for tweets data, regularization and hyperparameter tuning using stratified k-fold and cross-validation to get the best model. Content: id: identifier number of the tweet Label: 0 (non-hate) /1 (hate) Tweet: the text in the tweet Tasks: Load the tweets file using read_csv function from Pandas package. Get the tweets into a list for easy text cleanup and manipulation. To cleanup: Normalize the casing. Using regular expressions, remove user handles. These begin with '@’. Using regular expressions, remove URLs. Using TweetTokenizer from NLTK, tokenize the tweets into individual terms. Remove stop words. Remove redundant terms like ‘amp’, ‘rt’, etc. Remove ‘#’ symbols from the tweet while retaining the term. Extra cleanup by removing terms with a length of 1. Check out the top terms in the tweets: First, get all the tokenized terms into one large list. Use the counter and find the 10 most common terms. Data formatting for predictive modeling: Join the tokens back to form strings. This will be required for the vectorizers. Assign x and y. Perform train_test_split using sklearn. We’ll use TF-IDF values for the terms as a feature to get into a vector space model. Import TF-IDF vectorizer from sklearn. Instantiate with a maximum of 5000 terms in your vocabulary. 
Fit and apply on the train set. Apply on the test set. Model building: Ordinary Logistic Regression Instantiate Logistic Regression from sklearn with default parameters. Fit into the train data. Make predictions for the train and the test set. Model evaluation: Accuracy, recall, and f_1 score. Report the accuracy on the train set. Report the recall on the train set: decent, high, or low. Get the f1 score on the train set. Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s. Adjust the appropriate class in the LogisticRegression model. Train again with the adjustment and evaluate. Train the model on the train set. Evaluate the predictions on the train set: accuracy, recall, and f_1 score. Regularization and Hyperparameter tuning: Import GridSearch and StratifiedKFold because of class imbalance. Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters. Use a balanced class weight while instantiating the logistic regression. Find the parameters with the best recall in cross validation. Choose ‘recall’ as the metric for scoring. Choose stratified 4 fold cross validation scheme. Fit into the train set. What are the best parameters? Predict and evaluate using the best estimator. Use the best estimator from the grid search to make predictions on the test set. What is the recall on the test set for the toxic comments?
Quadrob
A school project made with Python to predict Formula 1 races using machine learning