Found 1,519 repositories (showing 30)
16 Text Preprocessing Techniques in Python for Twitter Sentiment Analysis.
FarooqMulla
Focuses on detecting spam messages in SMS text using Natural Language Processing (NLP) and Machine Learning techniques. It leverages text preprocessing, feature extraction, and classification algorithms to accurately predict whether a message is Spam or Ham (Not Spam).
A simple sentiment analysis project using NLP, written in Python as a Jupyter notebook. It shows how to do text preprocessing (removal of bad words, stop words, lemmatization, tokenization), how to save a trained model, and how to use that model in a real-life situation. The machine learning model used here is k-Nearest Neighbors. Various performance evaluation techniques are used, including the confusion matrix and scikit-learn's classification report, which gives the accuracy, precision, recall, and F1-score of the model. The target values being classified are positive and negative reviews.
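The pipeline described above can be sketched with scikit-learn in a few lines. This is a minimal illustration, not the repo's actual code; the toy reviews and labels are hypothetical stand-ins for the real dataset.

```python
# Minimal sketch of a k-NN sentiment pipeline: TF-IDF features,
# a KNeighborsClassifier, and scikit-learn's evaluation report.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

train_texts = ["great movie, loved it", "wonderful and fun",
               "terrible acting", "boring and bad"]
train_labels = ["positive", "positive", "negative", "negative"]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(train_texts)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, train_labels)

test_texts = ["loved the acting", "bad and boring"]
predictions = model.predict(vectorizer.transform(test_texts))

print(predictions)
print(confusion_matrix(["positive", "negative"], predictions))
print(classification_report(["positive", "negative"], predictions,
                            zero_division=0))
```

Saving and reloading the trained model, as the notebook demonstrates, would typically be done with `joblib.dump`/`joblib.load` on both the vectorizer and the classifier.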
This GitHub repository provides an implementation of the paper "MAGNET: Multi-Label Text Classification using Attention-based Graph Neural Network". MAGNET is a state-of-the-art approach to multi-label text classification, leveraging the power of graph neural networks (GNNs) and attention mechanisms.
venkat-0706
Build a model to classify text as positive, negative, or neutral. Apply NLP techniques for preprocessing and machine learning for classification. Aim for accurate sentiment prediction on various text formats.
gulabpatel
Basic and advanced NLP text preprocessing concepts and techniques
Siddarth305
It includes implementations of key NLP techniques, such as text preprocessing, tokenization, sentiment analysis, named entity recognition (NER), and more.
Elangovan0101
An AI-powered system for extracting and summarizing key legal information from complex legal documents using advanced Natural Language Processing (NLP) techniques. This project utilizes SpaCy for preprocessing and entity extraction, and Sumy for text summarization, to generate concise summaries of lengthy legal texts.
dyneth02
A specialized toolkit for Information Retrieval and Web Analytics. This repo covers the architecture of search engines, featuring custom implementations of inverted and positional indexing, Boolean retrieval, and text preprocessing pipelines. It includes N-gram analysis, cosine similarity foundations, and advanced NLP tokenization techniques.
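The inverted index at the heart of such a toolkit can be sketched in a few lines of plain Python: map each term to the set of document IDs containing it, which is the basis of Boolean retrieval. The documents here are hypothetical.

```python
# Minimal inverted index plus a Boolean AND query over posting sets.
from collections import defaultdict

docs = {
    0: "search engines build an inverted index",
    1: "a positional index also stores term positions",
    2: "boolean retrieval intersects posting lists",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

# Boolean AND: documents containing both "index" and "positional"
result = inverted_index["index"] & inverted_index["positional"]
print(sorted(result))  # → [1]
```

A positional index extends this by storing token offsets alongside each document ID, enabling phrase queries.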
The Movie Reviews dataset, imported from the NLTK library, has 1,000 positive and 1,000 negative reviews. I first imported the dataset into a pandas data frame, which makes the processing easier. The next step is to analyze the positive (+) and negative (-) reviews. I preprocessed the dataset using lemmatization and other standard NLP techniques. To extract features from the text I used the TfidfVectorizer from scikit-learn. Lastly, I used various modelling algorithms from scikit-learn to train on this data.
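The TF-IDF weighting behind TfidfVectorizer can be illustrated with the textbook formula in plain Python. (scikit-learn's actual implementation adds smoothing and normalization; this sketch, on hypothetical toy documents, shows only the underlying idea.)

```python
# Textbook TF-IDF: term frequency in a document times the log of
# inverse document frequency across the corpus.
import math

docs = ["great film great cast", "boring film", "great story"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(n_docs / df)

def tfidf(term, doc_index):
    return tf(term, tokenized[doc_index]) * idf(term)

# "great" is common (2 of 3 docs), so its weight is modest:
print(round(tfidf("great", 0), 4))   # tf = 0.5, idf = ln(3/2)
# "boring" occurs in only one document, so it gets a higher idf:
print(round(tfidf("boring", 1), 4))  # tf = 0.5, idf = ln(3/1)
```

Rare, discriminative terms thus score higher than ubiquitous ones, which is why TF-IDF features often outperform raw counts for review classification.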
VoxDroid
An efficient text classification pipeline for email subjects, leveraging NLP techniques and Multinomial Naive Bayes. Easily preprocess data, train the model, and categorize new email subjects. Ideal for NLP enthusiasts and those building practical email categorization systems using Python.
LeticiaSPedroso
Text classification and preprocessing techniques using the Brain.js library
prakash-ukhalkar
A comprehensive set of Jupyter notebooks that take you from NLP fundamentals to advanced techniques. Covers text preprocessing, POS tagging, NER, sentiment analysis (with VADER), text classification, word embeddings, and transformer models like BERT. Built with real-world datasets using NLTK, spaCy, scikit-learn, and Hugging Face Transformers.
This repository contains a collection of Natural Language Processing (NLP) projects completed as assignments for the NLP course at Sharif University of Technology. Each project demonstrates different aspects of NLP techniques, from text preprocessing and analysis to machine learning and deep learning applications.
This research aims to fine-tune an Arabic OCR model using Tesseract 5.0, enhancing text recognition accuracy through extensive data collection, preprocessing, and image generation. By leveraging advanced training techniques and data augmentation, we achieve significant improvements in word error rates (WER).
quangvuminh2000
Text preprocessing with NLP techniques
mariam-khediri
PixOCR Mini Project is an end-to-end OCR pipeline built using Python and Tesseract to extract text from diverse document types. It explores preprocessing techniques to improve recognition accuracy on real-world scanned and colored images.
huzaifa1-0
LegalSense is an AI-powered legal assistant using Retrieval-Augmented Generation (RAG) techniques. The system retrieves relevant legal content using vector embeddings and delivers accurate responses via a local LLM. The full RAG pipeline is implemented: text chunking, preprocessing, embedding creation, vector database setup, and response generation.
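The text-chunking step of a RAG pipeline like the one described can be sketched as fixed-size chunks with overlap, so that context is not lost at chunk boundaries. The chunk size and overlap values here are illustrative assumptions, not the project's actual settings.

```python
# Fixed-size, overlapping character chunking for a RAG ingestion step.
def chunk_text(text, chunk_size=100, overlap=20):
    """Split text into chunks of at most `chunk_size` characters,
    with `overlap` characters shared between consecutive chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

document = "a" * 250  # stand-in for a long legal document
chunks = chunk_text(document, chunk_size=100, overlap=20)
print([len(c) for c in chunks])  # → [100, 100, 90]
```

Each chunk would then be embedded and stored in the vector database; at query time the nearest chunks are retrieved and passed to the LLM as context.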
Safae26
A complete Bag of Words pipeline built with Python, NLTK, and spaCy. It demonstrates text preprocessing (tokenization, lowercasing, stopword removal, lemmatization) and converts text into numerical vectors using word frequency counts. Perfect for understanding fundamental NLP vectorization techniques.
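The Bag of Words steps that pipeline demonstrates can be sketched with only the standard library (the repo itself uses NLTK and spaCy; lemmatization is omitted here since it needs those libraries, and the stop-word list is a tiny illustrative subset).

```python
# Tokenize + lowercase + stop-word removal, then word-frequency vectors.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "and", "of"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # tokenize + lowercase
    return [t for t in tokens if t not in STOPWORDS]  # stop-word removal

corpus = ["The cat and the dog", "A dog is a dog"]
processed = [preprocess(doc) for doc in corpus]

vocabulary = sorted({t for doc in processed for t in doc})
vectors = [[Counter(doc)[term] for term in vocabulary] for doc in processed]

print(vocabulary)  # → ['cat', 'dog']
print(vectors)     # → [[1, 1], [0, 2]]
```

Each document becomes a fixed-length count vector over the shared vocabulary, which is exactly the representation a downstream classifier consumes.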
YassineOUAHMANE
This project explores how Optical Character Recognition (OCR) can be improved using image thresholding techniques. By preprocessing images with different thresholding methods, the goal is to enhance text clarity and increase the accuracy of OCR extraction.
muqadasejaz
A machine learning project that uses Logistic Regression to classify emails as spam or not spam based on their content and metadata. The model is trained on labeled email data using text preprocessing techniques and converts text into numerical features to accurately detect unwanted messages.
This project integrates large language models with pathology text data to improve cancer classification. Using NLP techniques such as text preprocessing, Named Entity Recognition (NER), and feature engineering, we enhance clinical decision-making. Machine learning models are trained to improve accuracy and predictions in medical reports.
pSahoo-456
This project uses Natural Language Processing techniques to classify whether a given sentence indicates stress or not. Built using Python and NLP libraries, it demonstrates the use of text preprocessing, vectorization, and machine learning for real-world emotional analysis.
Designed a classification model based on the text reviews with 91.28% accuracy to understand Kindle customer feedback. Extracted reviews from an Amazon database and used TfidfVectorizer, stemming and lemmatizing techniques to preprocess data
Johnnywang1899
Topics covered: the imbalanced-learn library; supervised learning with the scikit-learn (sklearn) machine learning library for Python; linear models (linear regression and logistic regression); splitting a dataset into training and testing sets; accuracy, precision, sensitivity (recall), and F1 score; the confusion matrix; SVM (support vector machine: support vectors, hyperplane); data preprocessing: labelling (encoding, i.e. converting all text columns to numeric labels) and data scaling and normalization (StandardScaler: mean = 0, variance = 1); decision trees; ensemble learning with random forests (weak/moderate/strong learners); bootstrap aggregation; boosting (adaptive boosting (AdaBoost), gradient boosting); and class imbalance (solution 1: oversampling, via random oversampling or the synthetic minority oversampling technique (SMOTE); solution 2: undersampling, via random undersampling or cluster centroid undersampling; solution 3: SMOTEENN).
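Random oversampling, the simplest of the class-imbalance remedies listed above, can be sketched with the standard library (imbalanced-learn's RandomOverSampler automates this; SMOTE instead synthesizes new minority samples by interpolating between neighbors). The toy data is hypothetical.

```python
# Duplicate random minority-class samples until the classes are balanced.
import random
from collections import Counter

random.seed(0)

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = ["majority"] * 5 + ["minority"]

counts = Counter(y)
target = max(counts.values())

X_res, y_res = list(X), list(y)
for label, count in counts.items():
    pool = [x for x, lab in zip(X, y) if lab == label]
    for _ in range(target - count):
        X_res.append(random.choice(pool))
        y_res.append(label)

print(Counter(y_res))  # both classes now have 5 samples
```

Oversampling should be applied only to the training split, after the train/test split, so that duplicated samples never leak into the test set.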
Tirth8038
The main aim of the project is to analyze Twitter data describing the COVID situation and to build a text classification model that can sort tweets into 5 categories: Extremely Negative (0), Negative (1), Neutral (2), Positive (3), and Extremely Positive (4). The provided dataset contains tweets with dimension (37041, 2) and, supplied separately, numerical labels for the above categories with dimension (37041, 2). However, the tweets need to be cleaned, as they contain irrelevant elements such as mentions (@), HTTP links, HTML tags, punctuation marks, and URLs. Using regex functions, I removed those elements and stop words from the tweets. To normalize the terms, I applied the Porter Stemmer and the WordNet Lemmatizer to convert each term to its base form. Then, to convert the words into vectors of equal length, I tokenized the tweets, converted them to sequences, and post-padded the sequences with zeros, using the length of the longest tweet sequence as the maximum length. After preprocessing, the tweet dataset has dimension (37041, 286). For model selection, I built 3 different models: a Multinomial Naive Bayes baseline and 2 recurrent neural network models, namely a GRU architecture with a single embedding layer and 1 bidirectional layer followed by global average pooling (1D) and 2 dense layers, and an LSTM architecture with a single embedding layer followed by 2 bidirectional layers and 2 dense layers. In addition, I tried applying dropout (at a 40% rate) during training of the RNN models as well as early stopping to prevent overfitting, and found that early stopping gave better results than dropout.
To evaluate the models, I split the dataset into training, testing, and validation sets with an (80, 10, 10) ratio and calculated the macro F1 and AUC scores on the test data; using the confusion matrix, I calculated the accuracy by dividing the sum of the diagonal elements by the sum of all elements. In addition, I plotted training vs. validation loss and accuracy graphs to visualize the models' performance. Interestingly, by skipping preprocessing techniques such as stop-word removal, the Porter Stemmer, and the WordNet Lemmatizer, and using just a basic text cleaning function in the RNN model with the LSTM architecture, the accuracy of the model increased from 73.87% to 77.1%, with an AUC score of 0.95.
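The regex-based tweet cleaning step this project describes (removing mentions, links, HTML tags, and punctuation) can be sketched with the standard library. The patterns below are illustrative, not the project's exact ones.

```python
# Strip mentions, URLs, HTML tags, and punctuation, then normalize case.
import re

def clean_tweet(tweet):
    tweet = re.sub(r"@\w+", " ", tweet)            # mentions
    tweet = re.sub(r"https?://\S+", " ", tweet)    # URLs / HTTP links
    tweet = re.sub(r"<[^>]+>", " ", tweet)         # HTML tags
    tweet = re.sub(r"[^\w\s]", " ", tweet)         # punctuation
    return " ".join(tweet.lower().split())         # collapse whitespace

raw = "@WHO Stay safe!! Read more: https://example.com <br> #covid"
print(clean_tweet(raw))  # → "stay safe read more covid"
```

Stop-word removal, stemming, and lemmatization would follow this step before the text is tokenized and padded into fixed-length sequences.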
shubhampadole68
Millions of people use Twitter to express emotions such as happiness, sadness, and anger. Sentiment analysis is about detecting emotions, opinions, assessments, and attitudes, taking into consideration the way humans think; it classifies emotions into classes such as positive or negative. Nowadays, industries are interested in using textual data for sentiment analysis to extract people's views about their products and services. Sentiment analysis is very important for them to know the level of customer satisfaction so they can improve their services accordingly. To work with text data, they extract it from social media platforms. There are many social media sites, such as Google Plus, Facebook, and Twitter, that allow people to express opinions, views, and emotions about certain topics and events. The microblogging site Twitter is expanding rapidly among online social networking sites, with about 200 million users. Twitter was founded in 2006 and is currently the most famous microblogging platform; in 2017, 2 million users shared 8.3 million tweets in one hour. Twitter users post their thoughts, emotions, and messages on their profiles as tweets. A single tweet is limited to 140 characters. Twitter sentiment analysis is based on the field of NLP (natural language processing). For tweet text, we use NLP techniques such as tokenizing the words and removing stop words like "I", "me", "my", "our", "your", "is", and "was". Natural language processing also plays a part in preprocessing the data, for example cleaning the text and removing special characters and punctuation marks. Sentiment analysis is very important because it lets us know the trends in people's emotions on specific topics from their tweets.
NITHISHM2410
This repo includes modules that help with NLP-related tasks.
sharadpatell
Text preprocessing techniques used in NLP
atharvabhide
This project classifies question pairs as duplicate or non-duplicate by using NLP techniques for text preprocessing and feature extraction, then applying classification algorithms to the extracted features.