Found 1,527 repositories (showing 30)
nltk
NLTK Data
Pybot can change the way learners approach the Python programming language by making learning more interactive. This chatbot tries to answer almost every Python-related question or query the user asks. We are implementing NLP to improve the efficiency of the chatbot, and we will include a voice feature for more interactivity. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation. NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python" and "an amazing library to play with natural language."

The main issue with text data is that it is all in text format (strings), whereas machine learning algorithms need some sort of numerical feature vector to perform their task. So before starting any NLP project, we need to pre-process the text to make it ideal to work with:
- Case normalization: convert the entire text to uppercase or lowercase, so that the algorithm does not treat the same word in different cases as different words.
- Tokenization: the process of converting normal text strings into a list of tokens, i.e. the words we actually want. A sentence tokenizer finds the list of sentences, and a word tokenizer finds the list of words in strings.
- Noise removal: removing everything that isn't a standard number or letter.
- Stop word removal: some extremely common words, which appear to be of little value in helping select documents matching a user's need, are excluded from the vocabulary entirely. These words are called stop words.
- Stemming: the process of reducing inflected (or sometimes derived) words to their stem, base, or root form. For example, stemming the words "Stems", "Stemming", "Stemmed", and "Stemtization" yields the single word "stem".
- Lemmatization: a slight variant of stemming. The major difference is that stemming can often create non-existent words, whereas lemmas are actual words: the root stem you end up with is not necessarily something you can look up in a dictionary, but a lemma is. For example, "run" is the base form of "running" and "ran", and "better" and "good" share the same lemma, so they are considered the same.

A minimal sketch of this pipeline with NLTK follows below.
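A minimal sketch of those pre-processing steps using NLTK; the sample sentence and exact data packages are illustrative assumptions, not details of the Pybot project:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK data packages.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The Stems, Stemming and Stemmed examples were running better!"

# Case normalization: lowercase so "Stems" and "stems" become one token.
tokens = nltk.word_tokenize(text.lower())

# Noise and stop word removal: keep alphanumeric, non-stop-word tokens.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalnum() and t not in stop_words]

# Stemming may produce non-words; lemmatization returns dictionary words.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])
```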
The 5 courses in this University of Michigan specialization introduce learners to data science through the Python programming language. This skills-based specialization is intended for learners who have a basic Python or programming background, and want to apply statistical, machine learning, information visualization, text analysis, and social network analysis techniques through popular Python toolkits such as pandas, matplotlib, scikit-learn, nltk, and networkx to gain insight into their data. Introduction to Data Science in Python (course 1), Applied Plotting, Charting & Data Representation in Python (course 2), and Applied Machine Learning in Python (course 3) should be taken in order and prior to any other course in the specialization. After completing those, courses 4 and 5 can be taken in any order. All 5 are required to earn a certificate.
SakthiVigneshwaran
The project is a Book Recommendation System that uses item-based collaborative filtering to suggest books based on user preferences and ratings. It preprocesses the dataset and applies a similarity measure to find similar books. Built with Python, it utilizes Pandas for data handling, NumPy for calculations, and NLTK for text processing and feature extraction.
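A minimal sketch of the item-based approach, assuming a ratings table with user, book, and rating columns; the column names, toy data, and choice of cosine similarity are illustrative assumptions, not confirmed details of this repo:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-book ratings; a real dataset would be loaded from file.
ratings = pd.DataFrame({
    "user":   ["a", "a", "b", "b", "c", "c"],
    "book":   ["Dune", "Emma", "Dune", "1984", "Emma", "1984"],
    "rating": [5, 3, 4, 5, 2, 4],
})

# Pivot to a user x book matrix; missing ratings become 0.
matrix = ratings.pivot_table(index="user", columns="book", values="rating").fillna(0)

# Cosine similarity between book columns gives item-item similarity.
sim = pd.DataFrame(cosine_similarity(matrix.T),
                   index=matrix.columns, columns=matrix.columns)

# Books most similar to "Dune", excluding itself.
print(sim["Dune"].drop("Dune").sort_values(ascending=False))
```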
pln-fing-udelar
Transforms MCR 3.0 data so it can be read with the NLTK WordNet reader. Use this to load WordNet in Spanish, among other languages, from NLTK.
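Once such data is available to NLTK, Spanish lookups follow the standard multilingual WordNet interface; a small illustration using the Open Multilingual Wordnet data that ships with NLTK (this shows the general API, not this project's own loader):

```python
import nltk
from nltk.corpus import wordnet as wn

# The WordNet and Open Multilingual Wordnet data packages are required.
nltk.download("wordnet")
nltk.download("omw-1.4")

# Look up synsets for the Spanish word "perro" (dog).
for synset in wn.synsets("perro", lang="spa"):
    print(synset, synset.lemma_names("spa"))
```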
SUBHADIPMAITI-DEV
This project develops a Depression Detection System using Machine Learning on Twitter data. It predicts depression by analyzing tweets with SVM, Logistic Regression, Decision Trees, and NLTK in Python.
CompLin
Aelius is a suite of Python/NLTK-based modules and language data for training and evaluating POS taggers for Brazilian Portuguese and for annotating corpora in this language variety.
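For reference, NLTK itself ships a Brazilian Portuguese tagged corpus (mac_morpho) on which a baseline tagger can be trained; a minimal sketch of that general approach, assuming a recent NLTK version (this is not Aelius's own training code, and the train/test split sizes are arbitrary):

```python
import nltk
from nltk.corpus import mac_morpho

# mac_morpho is a Brazilian Portuguese tagged corpus bundled with NLTK.
nltk.download("mac_morpho")

tagged = list(mac_morpho.tagged_sents())
train, test = tagged[:8000], tagged[8000:10000]

# A unigram tagger with a default-tag fallback as a simple baseline.
baseline = nltk.DefaultTagger("N")
tagger = nltk.UnigramTagger(train, backoff=baseline)
print(f"accuracy: {tagger.accuracy(test):.3f}")
```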
madhurimarawat
This repository began as a 7th-semester minor project and evolved into our 8th-semester major project, "Advanced Stock Price Forecasting Using a Hybrid Model of Numerical and Textual Analysis." It utilizes Python, NLP (NLTK, spaCy), ML models, Grafana, InfluxDB, and Streamlit for data analysis and visualization.
MiyainNYC
Web crawling (Scrapy), text mining (NLTK), data mining (scikit-learn), and visualization (Python, Tableau). Classification and clustering on Google App text.
wxfsd
Install using the NLTK downloader: nltk.download()
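The downloader can run interactively or fetch specific packages by id; a typical invocation (the package ids shown are common examples, not requirements of this repo):

```python
import nltk

# With no arguments, opens the interactive downloader (GUI or text menu).
# nltk.download()

# Non-interactively, fetch only the packages a project needs.
nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stop word lists
nltk.download("wordnet")    # WordNet lexical database
```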
michaelmml
Automated PDF and text processing with spaCy and NLTK; information extraction from text based on grammatical structure; deployed on extracted raw search data.
Kairos-T
A Python sentiment analysis script that uses an NLP tool (NLTK's VADER model) to analyze text data and label it with sentiment scores.
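NLTK's VADER interface is compact; a minimal sketch (the example sentences are illustrative):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# VADER's lexicon must be downloaded once.
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()
for text in ["I love this!", "This is terrible."]:
    # polarity_scores returns neg/neu/pos plus a normalized compound score.
    print(text, sia.polarity_scores(text))
```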
FarhaKousar1601
This project, conducted in collaboration with Global Core Tech, focuses on analyzing sentiment in Flipkart reviews. Using Python and essential data science libraries like Pandas, Matplotlib, NLTK, and Seaborn, we aim to extract valuable insights into customer sentiments from the reviews.
rohitthapliyal2000
Opinion mining on provided data from various NLTK corpora to test and enhance the accuracy of the NaiveBayesClassifier model.
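As a reference point, NLTK's NaiveBayesClassifier is commonly trained on the movie_reviews corpus with bag-of-words features; a hedged sketch of that standard recipe (the vocabulary size and split are arbitrary choices, not necessarily this repo's setup):

```python
import random
import nltk
from nltk.corpus import movie_reviews

nltk.download("movie_reviews")

# (document word set, label) pairs from the corpus.
docs = [(set(movie_reviews.words(fid)), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.shuffle(docs)

# Binary presence features over the 2,000 most frequent words.
vocab = [w for w, _ in
         nltk.FreqDist(w.lower() for w in movie_reviews.words()).most_common(2000)]

def features(words):
    return {f"has({w})": (w in words) for w in vocab}

featuresets = [(features(words), label) for words, label in docs]
train, test = featuresets[200:], featuresets[:200]

classifier = nltk.NaiveBayesClassifier.train(train)
print("accuracy:", nltk.classify.accuracy(classifier, test))
classifier.show_most_informative_features(5)
```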
partoftheorigin
This repository contains my work while completing the University of Michigan data science specialization on Coursera described above: five courses introducing learners to data science through the Python programming language, with courses 1-3 taken in order before courses 4 and 5, and all five required to earn a certificate.
dalindev
Data mining using Tweepy; sentiment analysis using NLTK.
kawadhiya21
A chatbot built on basic NLP concepts with Python's NLTK and Tornado packages. It performs only basic interpretation, as it lacks training data for classification.
andreasvc
Data-Oriented Parsing implementation for NLTK applied to Esperanto morphology and syntax
rishy
Tags textual data using NLTK and Open Data from Wikipedia.
I scrape news data using webhose.io and stock price data using nsepy, then label the data by whether prices rose or fell on the day corresponding to the scraped news. For the word embeddings, I use GloVe, provided by Stanford University. I use stop words from NLTK to remove sentence fillers that do not change the context, then TF-IDF to remove the words that provide the least information. Finally, I use an LSTM to capture time-dependent relations in the dataset and predict whether a user should "buy" or "sell" the share, and with what confidence, based on the day's news events. (A sketch of the stop-word and TF-IDF steps follows below.)
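A minimal sketch of the stop-word and TF-IDF filtering steps described above, combining NLTK's stop word list with scikit-learn's vectorizer; the headlines are illustrative, not data from this project:

```python
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords")

headlines = [
    "Shares of the company rally after strong quarterly results",
    "Company shares fall as quarterly results disappoint",
]

# NLTK's stop word list strips sentence fillers; TF-IDF then down-weights
# terms that occur in every document, so low-weight terms can be dropped.
vectorizer = TfidfVectorizer(stop_words=stopwords.words("english"))
X = vectorizer.fit_transform(headlines)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```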
chezou
Example repository for NLTK execution on a PySpark cluster with Cloudera Data Science Workbench
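The usual pattern for running NLTK on a PySpark cluster is to call it from a UDF executed on each worker; a hedged sketch of that pattern, not this repo's exact code (it assumes the NLTK library and its data packages are already installed on every executor):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("nltk-on-spark").getOrCreate()

def tokenize(text):
    # Imported inside the UDF so each executor loads NLTK itself.
    import nltk
    return nltk.word_tokenize(text)

tokenize_udf = udf(tokenize, ArrayType(StringType()))

df = spark.createDataFrame([("NLTK runs fine inside Spark executors.",)], ["text"])
df.withColumn("tokens", tokenize_udf("text")).show(truncate=False)
```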
Uses raw data from the Enron spam datasets to create a corpus with Python, NLTK, and shell scripts.
surakshashukla
A phishing email detection analysis using Python and Apache Spark to detect phishing emails with three different classification methods. Python was used for loading the email data and for text parsing with the NLTK package; Apache Spark was used to run the analysis in a big data environment.
The Movie Reviews dataset is imported from the NLTK library and contains 1,000 positive and 1,000 negative reviews. I first import the dataset into a pandas data frame, which makes processing easier. The next step is to analyze the positive and negative reviews. I also preprocess the dataset using lemmatization and other standard NLP techniques. To extract features from the text, I use the TF-IDF vectorizer from scikit-learn. Lastly, I train various modelling algorithms from scikit-learn on this data.
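A condensed sketch of that workflow, loading the corpus into pandas and training one scikit-learn model; the choice of logistic regression and the split parameters are illustrative assumptions, and the lemmatization step is omitted for brevity:

```python
import nltk
import pandas as pd
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

nltk.download("movie_reviews")

# 1,000 positive and 1,000 negative reviews into a data frame.
df = pd.DataFrame(
    [(movie_reviews.raw(fid), cat)
     for cat in movie_reviews.categories()
     for fid in movie_reviews.fileids(cat)],
    columns=["review", "label"],
)

X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["label"], test_size=0.2, random_state=42, stratify=df["label"])

# TF-IDF features, then a linear classifier.
vectorizer = TfidfVectorizer(max_features=5000)
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(X_train), y_train)
print("test accuracy:", model.score(vectorizer.transform(X_test), y_test))
```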
MashaKubyshina
Data Science scripts - Pandas, Numpy, NLTK, NLP 2020-2021
NishthaChaudhary
Natural language processing (NLP) is an exciting branch of artificial intelligence (AI) that allows machines to break down and understand human language. I plan to walk through text pre-processing techniques, machine learning techniques and Python libraries for NLP. Text pre-processing techniques include tokenization, text normalization and data cleaning. Once in a standard format, various machine learning techniques can be applied to better understand the data. This includes using popular modeling techniques to classify emails as spam or not, or to score the sentiment of a tweet on Twitter. Newer, more complex techniques can also be used such as topic modeling, word embeddings or text generation with deep learning. We will walk through an example in Jupyter Notebook that goes through all of the steps of a text analysis project, using several NLP libraries in Python including NLTK, TextBlob, spaCy and gensim along with the standard machine learning libraries including pandas and scikit-learn.
viyatgandhi
Use of scikit-learn, networkx, scipy, numpy, and nltk to perform real-time analysis of data.
Using NLP and ML, build a model to identify hate speech (racist or sexist tweets) on Twitter.

Problem statement: Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate. You are a data scientist at Twitter, and you will help Twitter identify tweets with hate speech and remove them from the platform. You will use NLP techniques, perform cleanup specific to tweet data, and build a robust model.

Domain: Social Media

Analysis to be done: Clean up tweets and build a classification model using NLP techniques, tweet-specific cleanup, regularization, and hyperparameter tuning with stratified k-fold cross-validation to get the best model.

Content:
- id: identifier number of the tweet
- label: 0 (non-hate) / 1 (hate)
- tweet: the text of the tweet

Tasks:
- Load the tweets file using the read_csv function from the Pandas package, and get the tweets into a list for easy text cleanup and manipulation.
- Cleanup: normalize the casing; remove user handles (these begin with '@') and URLs using regular expressions; tokenize the tweets into individual terms using TweetTokenizer from NLTK; remove stop words and redundant terms like 'amp' and 'rt'; remove '#' symbols from the tweet while retaining the term; and, as extra cleanup, remove terms with a length of 1.
- Check out the top terms in the tweets: get all the tokenized terms into one large list, then use a Counter to find the 10 most common terms.
- Data formatting for predictive modeling: join the tokens back into strings (required for the vectorizers), assign x and y, and perform train_test_split using sklearn.
- Features: use TF-IDF values for the terms to get a vector space model. Import the TF-IDF vectorizer from sklearn, instantiate it with a maximum of 5,000 terms in your vocabulary, fit and apply it on the train set, then apply it on the test set.
- Model building: instantiate LogisticRegression from sklearn with default parameters, fit it on the train data, and make predictions for the train and test sets.
- Model evaluation: report accuracy, recall, and F1 score on the train set. If recall is low, the model is likely focusing on the 0s, so adjust for the class imbalance in the LogisticRegression model, train again with the adjustment, and evaluate.
- Regularization and hyperparameter tuning: import GridSearchCV and StratifiedKFold (because of the class imbalance), provide a parameter grid for the 'C' and 'penalty' parameters, use a balanced class weight when instantiating the logistic regression, choose 'recall' as the scoring metric, use a stratified 4-fold cross-validation scheme, and fit on the train set. Report the best parameters.
- Predict and evaluate using the best estimator: use the best estimator from the grid search to make predictions on the test set, and report the recall and F1 score on the test set for the toxic class.

A condensed sketch of this pipeline follows below.
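A condensed sketch of the pipeline those tasks describe, from tweet cleanup through the tuned model; the file name, cleanup regexes, and parameter grid values are illustrative assumptions:

```python
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

nltk.download("stopwords")

df = pd.read_csv("tweets.csv")  # assumed file name; columns: id, label, tweet

tok = TweetTokenizer()
stop = set(stopwords.words("english")) | {"amp", "rt"}

def clean(tweet):
    tweet = tweet.lower()                            # normalize casing
    tweet = re.sub(r"@\w+", "", tweet)               # remove user handles
    tweet = re.sub(r"http\S+|www\.\S+", "", tweet)   # remove URLs
    tweet = tweet.replace("#", "")                   # drop '#', keep the term
    terms = [t for t in tok.tokenize(tweet)
             if t not in stop and len(t) > 1]        # stop words, 1-char terms
    return " ".join(terms)                           # rejoin for the vectorizer

X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"].apply(clean), df["label"], stratify=df["label"], random_state=42)

vec = TfidfVectorizer(max_features=5000)
X_train_tfidf = vec.fit_transform(X_train)

# Balanced class weights counter the 0-heavy imbalance; grid search over
# C and penalty with stratified 4-fold CV, scored on recall.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
    scoring="recall",
    cv=StratifiedKFold(n_splits=4),
)
grid.fit(X_train_tfidf, y_train)
print("best params:", grid.best_params_)
print("test recall:", grid.score(vec.transform(X_test), y_test))
```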
alibolek
An interactive NLTK website on which you can run NLTK methods on your own data.
RafayKhattak
ToxiScan is a text analysis tool that uses the Natural Language Toolkit (NLTK) and a Naive Bayes classifier to detect toxicity in textual data.