Found 1,519 repositories (showing 30)
16 Text Preprocessing Techniques in Python for Twitter Sentiment Analysis.
FarooqMulla
Focuses on detecting spam messages in SMS text using Natural Language Processing (NLP) and Machine Learning techniques. It leverages text preprocessing, feature extraction, and classification algorithms to accurately predict whether a message is Spam or Ham (Not Spam).
A simple sentiment analysis project using NLP, written in Python as a Jupyter notebook. It shows how to do text preprocessing (removal of bad words, stop words, lemmatization, tokenization), how to save a trained model, and how to use that model in a real-life situation. The machine learning model used here is k-Nearest Neighbors. Various performance evaluation techniques are used, including the confusion matrix and scikit-learn's classification report, which gives the accuracy, precision, recall, and F1-score of the model. The target values being classified are positive and negative reviews.
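The pipeline described above can be sketched with scikit-learn in a few lines. This is a minimal illustration, not the repo's actual code; the toy reviews and labels are hypothetical stand-ins for the real dataset.

```python
# Minimal sketch of a k-NN sentiment pipeline: TF-IDF features,
# a KNeighborsClassifier, and scikit-learn's evaluation report.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

train_texts = ["great movie, loved it", "wonderful and fun",
               "terrible acting", "boring and bad"]
train_labels = ["positive", "positive", "negative", "negative"]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(train_texts)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, train_labels)

test_texts = ["loved the acting", "bad and boring"]
predictions = model.predict(vectorizer.transform(test_texts))

print(predictions)
print(confusion_matrix(["positive", "negative"], predictions))
print(classification_report(["positive", "negative"], predictions,
                            zero_division=0))
```

Saving and reloading the trained model, as the notebook demonstrates, would typically be done with `joblib.dump`/`joblib.load` on both the vectorizer and the classifier.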
This GitHub repository provides an implementation of the paper "MAGNET: Multi-Label Text Classification using Attention-based Graph Neural Network". MAGNET is a state-of-the-art approach to multi-label text classification, leveraging the power of graph neural networks (GNNs) and attention mechanisms.
venkat-0706
Build a model to classify text as positive, negative, or neutral. Apply NLP techniques for preprocessing and machine learning for classification. Aim for accurate sentiment prediction on various text formats.
gulabpatel
Basic and advanced NLP text preprocessing concepts and techniques
Siddarth305
It includes implementations of key NLP techniques, such as text preprocessing, tokenization, sentiment analysis, named entity recognition (NER), and more.
Elangovan0101
An AI-powered system for extracting and summarizing key legal information from complex legal documents using advanced Natural Language Processing (NLP) techniques. This project utilizes SpaCy for preprocessing and entity extraction, and Sumy for text summarization, to generate concise summaries of lengthy legal texts.
dyneth02
A specialized toolkit for Information Retrieval and Web Analytics. This repo covers the architecture of search engines, featuring custom implementations of inverted and positional indexing, Boolean retrieval, and text preprocessing pipelines. It includes N-gram analysis, cosine similarity foundations, and advanced NLP tokenization techniques.
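The inverted index at the heart of such a toolkit can be sketched in a few lines of plain Python: map each term to the set of document IDs containing it, which is the basis of Boolean retrieval. The documents here are hypothetical.

```python
# Minimal inverted index plus a Boolean AND query over posting sets.
from collections import defaultdict

docs = {
    0: "search engines build an inverted index",
    1: "a positional index also stores term positions",
    2: "boolean retrieval intersects posting lists",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

# Boolean AND: documents containing both "index" and "positional"
result = inverted_index["index"] & inverted_index["positional"]
print(sorted(result))  # → [1]
```

A positional index extends this by storing token offsets alongside each document ID, enabling phrase queries.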
The Movie Reviews dataset, imported from the NLTK library, has 1,000 positive and 1,000 negative reviews. I first imported the dataset into a pandas data frame, which makes the processing easier. The next step is to analyze the positive (+) and negative (-) reviews. I preprocessed the dataset using lemmatization and other standard NLP techniques. To extract features from the text I used the TfidfVectorizer from scikit-learn. Lastly, I used various modelling algorithms from scikit-learn to train on this data.
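The TF-IDF weighting behind TfidfVectorizer can be illustrated with the textbook formula in plain Python. (scikit-learn's actual implementation adds smoothing and normalization; this sketch, on hypothetical toy documents, shows only the underlying idea.)

```python
# Textbook TF-IDF: term frequency in a document times the log of
# inverse document frequency across the corpus.
import math

docs = ["great film great cast", "boring film", "great story"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(n_docs / df)

def tfidf(term, doc_index):
    return tf(term, tokenized[doc_index]) * idf(term)

# "great" is common (2 of 3 docs), so its weight is modest:
print(round(tfidf("great", 0), 4))   # tf = 0.5, idf = ln(3/2)
# "boring" occurs in only one document, so it gets a higher idf:
print(round(tfidf("boring", 1), 4))  # tf = 0.5, idf = ln(3/1)
```

Rare, discriminative terms thus score higher than ubiquitous ones, which is why TF-IDF features often outperform raw counts for review classification.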
VoxDroid
An efficient text classification pipeline for email subjects, leveraging NLP techniques and Multinomial Naive Bayes. Easily preprocess data, train the model, and categorize new email subjects. Ideal for NLP enthusiasts and those building practical email categorization systems using Python.
LeticiaSPedroso
Text classification and preprocessing techniques using the Brain.js library
prakash-ukhalkar
A comprehensive set of Jupyter notebooks that take you from NLP fundamentals to advanced techniques. Covers text preprocessing, POS tagging, NER, sentiment analysis (with VADER), text classification, word embeddings, and transformer models like BERT. Built with real-world datasets using NLTK, spaCy, scikit-learn, and Hugging Face Transformers.
This repository contains a collection of Natural Language Processing (NLP) projects completed as assignments for the NLP course at Sharif University of Technology. Each project demonstrates different aspects of NLP techniques, from text preprocessing and analysis to machine learning and deep learning applications.
This research aims to fine-tune an Arabic OCR model using Tesseract 5.0, enhancing text recognition accuracy through extensive data collection, preprocessing, and image generation. By leveraging advanced training techniques and data augmentation, we achieve significant improvements in word error rates (WER).
quangvuminh2000
Text preprocessing with NLP techniques
mariam-khediri
PixOCR Mini Project is an end-to-end OCR pipeline built using Python and Tesseract to extract text from diverse document types. It explores preprocessing techniques to improve recognition accuracy on real-world scanned and colored images.
huzaifa1-0
LegalSense is an AI-powered legal assistant using Retrieval-Augmented Generation (RAG) techniques. The system retrieves relevant legal content using vector embeddings and delivers accurate responses via a local LLM. The full RAG pipeline is implemented: text chunking, preprocessing, embedding creation, vector database setup, and response generation.
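The text-chunking step of a RAG pipeline like the one described can be sketched as fixed-size chunks with overlap, so that context is not lost at chunk boundaries. The chunk size and overlap values here are illustrative assumptions, not the project's actual settings.

```python
# Fixed-size, overlapping character chunking for a RAG ingestion step.
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into chunks of at most `chunk_size` characters,
    with `overlap` characters shared between consecutive chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

document = "a" * 250  # stand-in for a long legal document
chunks = chunk_text(document, chunk_size=100, overlap=20)
print([len(c) for c in chunks])  # → [100, 100, 90]
```

Each chunk would then be embedded and stored in the vector database; at query time the nearest chunks are retrieved and passed to the LLM as context.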
Safae26
A complete Bag of Words pipeline built with Python, NLTK, and spaCy. It demonstrates text preprocessing (tokenization, lowercasing, stopword removal, lemmatization) and converts text into numerical vectors using word frequency counts. Perfect for understanding fundamental NLP vectorization techniques.
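The Bag of Words steps that pipeline demonstrates can be sketched with only the standard library (the repo itself uses NLTK and spaCy; lemmatization is omitted here since it needs those libraries, and the stop-word list is a tiny illustrative subset).

```python
# Tokenize + lowercase + stop-word removal, then word-frequency vectors.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "and", "of"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenize + lowercase
    return [t for t in tokens if t not in STOPWORDS]  # stop-word removal

corpus = ["The cat and the dog", "A dog is a dog"]
processed = [preprocess(doc) for doc in corpus]

vocabulary = sorted({t for doc in processed for t in doc})
vectors = [[Counter(doc)[term] for term in vocabulary] for doc in processed]

print(vocabulary)  # → ['cat', 'dog']
print(vectors)     # → [[1, 1], [0, 2]]
```

Each document becomes a fixed-length count vector over the shared vocabulary, which is exactly the representation a downstream classifier consumes.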
YassineOUAHMANE
This project explores how Optical Character Recognition (OCR) can be improved using image thresholding techniques. By preprocessing images with different thresholding methods, the goal is to enhance text clarity and increase the accuracy of OCR extraction.
muqadasejaz
A machine learning project that uses Logistic Regression to classify emails as spam or not spam based on their content and metadata. The model is trained on labeled email data using text preprocessing techniques and converts text into numerical features to accurately detect unwanted messages.
This project integrates large language models with pathology text data to improve cancer classification. Using NLP techniques such as text preprocessing, Named Entity Recognition (NER), and feature engineering, we enhance clinical decision-making. Machine learning models are trained to improve accuracy and predictions in medical reports.
pSahoo-456
This project uses Natural Language Processing techniques to classify whether a given sentence indicates stress or not. Built using Python and NLP libraries, it demonstrates the use of text preprocessing, vectorization, and machine learning for real-world emotional analysis.
Designed a classification model based on the text reviews with 91.28% accuracy to understand Kindle customer feedback. Extracted reviews from an Amazon database and used TfidfVectorizer, stemming and lemmatizing techniques to preprocess data
Johnnywang1899
Topics covered: the imbalanced-learn library; supervised learning with the scikit-learn (sklearn) machine learning library for Python; linear models (linear regression and logistic regression); splitting a dataset into training and testing sets; accuracy, precision, sensitivity (recall), and F1 score; the confusion matrix; SVM (support vector machine: support vectors, hyperplane); data preprocessing: labelling (encoding, i.e. converting all text columns to numeric labels) and data scaling and normalization (StandardScaler: mean = 0, variance = 1); decision trees; ensemble learning with random forests (weak/moderate/strong learners); bootstrap aggregation; boosting (adaptive boosting (AdaBoost), gradient boosting); and class imbalance (solution 1: oversampling, via random oversampling or the synthetic minority oversampling technique (SMOTE); solution 2: undersampling, via random undersampling or cluster centroid undersampling; solution 3: SMOTEENN).
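Random oversampling, the simplest of the class-imbalance remedies listed above, can be sketched with the standard library (imbalanced-learn's RandomOverSampler automates this; SMOTE instead synthesizes new minority samples by interpolating between neighbors). The toy data is hypothetical.

```python
# Duplicate random minority-class samples until the classes are balanced.
import random
from collections import Counter

random.seed(0)

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = ["majority"] * 5 + ["minority"]

counts = Counter(y)
target = max(counts.values())

X_res, y_res = list(X), list(y)
for label, count in counts.items():
    pool = [x for x, lab in zip(X, y) if lab == label]
    for _ in range(target - count):
        X_res.append(random.choice(pool))
        y_res.append(label)

print(Counter(y_res))  # both classes now have 5 samples
```

Oversampling should be applied only to the training split, after the train/test split, so that duplicated samples never leak into the test set.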
Tirth8038
The main aim of the project is to analyze Twitter data describing the COVID situation and to build a text classification model that can sort tweets into 5 categories: Extremely Negative (0), Negative (1), Neutral (2), Positive (3), and Extremely Positive (4). The provided dataset contains tweets with dimension (37041, 2) and, supplied separately, numerical labels for the above categories with dimension (37041, 2). However, the tweets need to be cleaned, as they contain irrelevant elements such as mentions (@), HTTP links, HTML tags, punctuation marks, and URLs. Using regex functions, I removed those elements and stop words from the tweets. To normalize the terms, I applied the Porter Stemmer and the WordNet Lemmatizer to convert each term to its base form. Then, to convert the words into vectors of equal length, I tokenized the tweets, converted them to sequences, and post-padded the sequences with zeros, using the length of the longest tweet sequence as the maximum length. After preprocessing, the tweet dataset has dimension (37041, 286). For model selection, I built 3 different models: a Multinomial Naive Bayes baseline and 2 recurrent neural network models, namely a GRU architecture with a single embedding layer and 1 bidirectional layer followed by global average pooling (1D) and 2 dense layers, and an LSTM architecture with a single embedding layer followed by 2 bidirectional layers and 2 dense layers. In addition, I tried applying dropout (at a 40% rate) during training of the RNN models as well as early stopping to prevent overfitting, and found that early stopping gave better results than dropout.
To evaluate the models, I split the dataset into training, testing, and validation sets with an (80, 10, 10) ratio and calculated the macro F1 and AUC scores on the test data; using the confusion matrix, I calculated the accuracy by dividing the sum of the diagonal elements by the sum of all elements. In addition, I plotted training vs. validation loss and accuracy graphs to visualize the models' performance. Interestingly, by skipping preprocessing techniques such as stop-word removal, the Porter Stemmer, and the WordNet Lemmatizer, and using just a basic text cleaning function in the RNN model with the LSTM architecture, the accuracy of the model increased from 73.87% to 77.1%, with an AUC score of 0.95.
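The regex-based tweet cleaning step this project describes (removing mentions, links, HTML tags, and punctuation) can be sketched with the standard library. The patterns below are illustrative, not the project's exact ones.

```python
# Strip mentions, URLs, HTML tags, and punctuation, then normalize case.
import re

def clean_tweet(tweet):
    tweet = re.sub(r"@\w+", " ", tweet)            # mentions
    tweet = re.sub(r"https?://\S+", " ", tweet)    # URLs / HTTP links
    tweet = re.sub(r"<[^>]+>", " ", tweet)         # HTML tags
    tweet = re.sub(r"[^\w\s]", " ", tweet)         # punctuation
    return " ".join(tweet.lower().split())         # collapse whitespace

raw = "@WHO Stay safe!! Read more: https://example.com <br> #covid"
print(clean_tweet(raw))  # → "stay safe read more covid"
```

Stop-word removal, stemming, and lemmatization would follow this step before the text is tokenized and padded into fixed-length sequences.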
shubhampadole68
Millions of people use Twitter to express emotions such as happiness, sadness, and anger. Sentiment analysis is about detecting emotions, opinions, assessments, and attitudes, taking into consideration the way humans think; it classifies emotions into classes such as positive or negative. Nowadays, industries are interested in using textual data for sentiment analysis to extract people's views about their products and services. Sentiment analysis is very important for them to know the level of customer satisfaction so they can improve their services accordingly. To work with text data, they extract it from social media platforms. There are many social media sites, such as Google Plus, Facebook, and Twitter, that allow people to express opinions, views, and emotions about certain topics and events. The microblogging site Twitter is expanding rapidly among online social networking sites, with about 200 million users. Twitter was founded in 2006 and is currently the most famous microblogging platform; in 2017, 2 million users shared 8.3 million tweets in one hour. Twitter users post their thoughts, emotions, and messages on their profiles as tweets. A single tweet is limited to 140 characters. Twitter sentiment analysis is based on the field of NLP (natural language processing). For tweet text, we use NLP techniques such as tokenizing the words and removing stop words like "I", "me", "my", "our", "your", "is", and "was". Natural language processing also plays a part in preprocessing the data, for example cleaning the text and removing special characters and punctuation marks. Sentiment analysis is very important because it lets us know the trends in people's emotions on specific topics from their tweets.
NITHISHM2410
This repo includes modules that help with NLP-related tasks.
sharadpatell
Text preprocessing techniques used in NLP
atharvabhide
This project classifies question pairs as duplicate or non-duplicate by using NLP techniques for text preprocessing and feature extraction, then applying classification algorithms to the extracted features.