Found 22 repositories (showing 22)
bvannah
Python code to extract a user's tweets and predict their Myers-Briggs personality type and financial risk tolerance. Uses a bag-of-words model trained on the Myers-Briggs Personality Type (MBTI) dataset to make predictions.
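The bag-of-words step can be sketched in plain Python; the helper below is illustrative, not the repository's actual code:

```python
from collections import Counter

def bag_of_words(texts):
    """Build a shared vocabulary and one count vector per text:
    a minimal bag-of-words sketch (hypothetical helper)."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors
```

A classifier such as multinomial Naive Bayes or logistic regression would then be trained on these count vectors against the MBTI labels.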
dmallya93
Building a search engine to discover web services specified by a natural-language query, inferring relationships using an ontology of Twitter data. Technologies used are NLTK, Python, Whoosh, Django, and the CMU Ark Tweet Parser. Fast information sharing on Twitter by millions of users all over the world leads to almost real-time reporting of events. It is extremely important for business and administrative decision makers to learn an event's popularity as quickly as possible, as it can buy them precious extra time to make informed decisions. We therefore introduce the problem of predicting the future popularity trend of events on microblogging platforms. Traditionally, trend prediction has been performed using time-series analysis of past popularity to forecast future popularity changes.
Aamir-Salaam
Political Sentiment Analysis of tweets using NLP and Machine Learning in Python for Election Results Prediction.
srishb28
## Data

This assignment is about part-of-speech tagging on Twitter data. The data is located in the ./data directory with a train and dev split. The test data is also included, but with deliberately false POS tags. You will develop and tune your models using only the train and dev sets, and generate predictions for the test data once you are done developing. Accuracy will be computed by the TA against the gold-standard labels.

This dataset contains tweets annotated with their universal part-of-speech tags: 379 tweets for training, 112 for dev, and 12 possible part-of-speech labels. The test corpus will contain 295 tweets. The format of the data files is straightforward: one line per token (with its label separated by whitespace), and sentences separated by an empty line. See the example below, and examine the text files yourself (always a good idea).

```
@paulwalk X
It PRON
's VERB
the DET
view NOUN
from ADP
where ADV
I PRON
'm VERB
living VERB
for ADP
two NUM
weeks NOUN
. .

Empire NOUN
State NOUN
Building NOUN
= X
ESB NOUN
. .

Pretty ADV
bad ADJ
storm NOUN
here ADV
last ADJ
evening NOUN
```

## Files

- data.py: The primary entry point that reads the data, and trains and evaluates the tagger implementation.

  ```
  usage: python data.py [-h] [-m MODEL] [--test]

  optional arguments:
    -h, --help            show this help message and exit
    -m MODEL, --model MODEL
                          'LR'/'lr' for logistic regression tagger
                          'CRF'/'crf' for conditional random field tagger
    --test                Make predictions for test dataset
  ```

- tagger.py: Code for two sequence taggers, logistic regression and CRF. Both taggers rely on feats.py and feat_gen.py to compute the features for each token. The CRF tagger also relies on viterbi.py to decode (which is currently incorrect), and on struct_perceptron.py for the training algorithm (which also needs Viterbi to be working).
- feats.py & feat_gen.py: Code to compute, index, and maintain the token features. The primary purpose of feats.py is to map the boolean features computed in feat_gen.py to integers, and to do the reverse mapping (if you want to know the name of a feature from its index). feat_gen.py computes the features of a token in a sentence, which you will be extending. The method there returns the computed features for a token as a list of strings (so it does not have to worry about indices, etc.).
- struct_perceptron.py: A direct port (with negligible changes) of the structured perceptron trainer from the pystruct project. The descriptions of the trainer's various hyperparameters are available here, but you should change them from the constructor in tagger.py. Only used for the CRF tagger.
- viterbi.py (and viterbi_test.py): General-purpose interface to a sequence Viterbi decoder in viterbi.py, which currently has an incorrect implementation. Once you have fixed the Viterbi implementation, running python viterbi_test.py should execute successfully without any exceptions.
- conlleval.pl: The official evaluation script for the CoNLL evaluation. Although it computes the same metrics as the Python code, it supports several extra features, such as (a) LaTeX-formatted tables, via -l, and (b) BIO annotation by default, turned off using -r. In particular, when evaluating the output prediction files (~.pred) for POS tagging:

  ```
  $ ./conlleval.pl -r -d \t < ./predictions/twitter_dev.pos.pred
  ```
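Since the assignment asks you to fix viterbi.py, a generic Viterbi decoder may help as a reference. The interface below (nested-list score matrices, no start/stop scores) is illustrative and differs from the assignment's own:

```python
def viterbi(obs_scores, trans_scores):
    """Generic Viterbi decoder sketch.

    obs_scores[t][s]:    emission score of state s at position t
    trans_scores[p][s]:  transition score from state p to state s
    Returns the highest-scoring state sequence as a list of state indices.
    (Illustrative only; the assignment's viterbi.py has its own interface.)
    """
    T, S = len(obs_scores), len(obs_scores[0])
    score = [[0.0] * S for _ in range(T)]   # best score ending in state s at t
    back = [[0] * S for _ in range(T)]      # backpointers for path recovery
    score[0] = list(obs_scores[0])
    for t in range(1, T):
        for s in range(S):
            best_prev = max(range(S),
                            key=lambda p: score[t - 1][p] + trans_scores[p][s])
            back[t][s] = best_prev
            score[t][s] = (score[t - 1][best_prev]
                           + trans_scores[best_prev][s] + obs_scores[t][s])
    # Recover the best path by walking the backpointers from the last position.
    last = max(range(S), key=lambda s: score[T - 1][s])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

The structured perceptron trainer would call this decoder to find the current model's best-scoring tag sequence for each training sentence.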
abdullahmoustaf
WeRateDogs Project Report

Introduction: The report below, made for the "WeRateDogs" project of the Udacity Data Analyst Nanodegree Program, explains the process my analysis went through. The goal of this project is to practice wrangling and cleaning data, using the tweet data of this Twitter account. The tweets went through a process in which I performed the following activities:

- Gathering Data
- Assessing Data
- Cleaning Data

Gathering Data

In this step, data is obtained from CSV files and loaded into tables, from which it goes through the wrangling process.

- Twitter archive data was loaded into the `twitter_archive` table, which contains the WeRateDogs Twitter archive; the file was provided by the course and imported into a dataframe.
- Image prediction data was imported from the image-prediction file provided by the course and hosted on Udacity's servers, and added to the `predictions` table. The tweet image predictions indicate whether the object in a given image is a dog or some other object.
- API data was provided through a file in the course material, as my Twitter developer account had not been created when I started the project; from this JSON file I read the Twitter data into the `api_df_now` table.

Assessing Data

In this step, data is assessed visually and programmatically to detect quality and tidiness issues in the gathered data.

- `twitter_archive` has missing data in multiple columns, for example "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", and "retweeted_status_user_id". Lowercase dog names were an issue too.
- Another issue is the dog-stage names that can cause confusion: doggo, pupper, floofer, and puppo.
- The timestamp column also needs attention, and the source of content needs to be organized.
- Rating values needed some changes.
- The image-prediction columns were confusing.
- The `api_df_now` file is separate from the Twitter archive data.

Cleaning Data

In this step, data is cleaned and added to the new tables `twitter_archive_clean`, `prediction_clean`, and `api_df_now_clean`, according to the issues observed while assessing the data.

1. Fixing quality issues:
   1. Dropped unnecessary columns containing missing data: "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", and "retweeted_status_user_id".
   2. Replaced missing "None" values with "NaN".
   3. Joined the `api_df_now` table with the `twitter_archive` table, renaming the `tweet_id` column.
   4. Combined all dog stages (doggo, pupper, floofer, and puppo) under one column named `dog`.
   5. Changed timestamp to datetime.
   6. Organized the source of content: Twitter for iPhone, Vine - Make a Scene, Twitter Web Client, and TweetDeck.
   7. Set default values for the rating numerator and denominator.
   8. Capitalized the first letters of dog names.
2. Tidiness:
   1. Renamed the image-prediction columns p1, p2, and p3 to potential_dog1, potential_dog2, and potential_dog3.
   2. Merged the cleaned data into the clean tables.

Conclusion

Through this project I've learned to express the data analysis process through code and the different tools offered by the JupyterLab application. Data wrangling is crucial in the data analysis process, as it is the only way to obtain reliable data for making proper decisions in any organization. Using Python made the process much easier and more efficient, and the libraries used allowed the data to be read and manipulated with relative ease, which will facilitate the process when dealing with much larger amounts of data, like Big Data. I believe that through my continued learning I will be able to dig deeper into more processes and tools, which will make the process more fruitful and efficient.
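A few of the cleaning steps described in this entry (replacing "None" with NaN, converting the timestamp, capitalizing names) might look like this in pandas. The column names and values are illustrative, not the project's actual schema:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the twitter_archive table (illustrative data only).
archive = pd.DataFrame({
    "name": ["None", "charlie", "None"],
    "timestamp": ["2017-08-01 00:00:00",
                  "2017-07-31 12:30:00",
                  "2017-07-30 08:15:00"],
})

archive["name"] = archive["name"].replace("None", np.nan)    # "None" -> NaN
archive["timestamp"] = pd.to_datetime(archive["timestamp"])  # string -> datetime
archive["name"] = archive["name"].str.capitalize()           # capitalize names
```

Each step mirrors one item in the quality-issues list; in the real project these operations run on copies of the gathered tables before merging.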
In this problem, we build a machine learning model that predicts which tweets are about real disasters and which ones aren't. We have a dataset of 10,000 tweets that were hand-labeled. This is an NLP problem, and we will be using various methods to determine the accuracy with which we can predict this.
fredrikmalmberg
An application built with Spark Streaming and Python that integrates with Twitter, analyses tweets, and predicts potentially viral tweets using a continuously trained model built in TensorFlow. The result is visualised in a basic web app.
Mateen-Abid
This project focuses on emotion detection from text using deep learning. It uses a BERT model fine-tuned on the GoEmotions dataset to classify user comments or tweets into one of 28 emotional categories. Built with Python, PyTorch, and Hugging Face Transformers, the project includes training, prediction, evaluation, testing, and visualization.
pillowTree3
This Python project uses NLP methods to predict tweet toxicity (Toxic: 1, Non-toxic: 0) from a provided dataset. It applies Bag of Words and TF-IDF for text representation and utilizes Decision Trees, Random Forest, Naive Bayes, k-NN, and SVM for prediction. Evaluation includes Precision, Recall, F1-Score, Confusion Matrices, and ROC-AUC curves.
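The TF-IDF weighting mentioned above can be sketched without libraries (a toy version of what a vectorizer does, using raw counts for tf and log(N/df) for idf; the helper name is illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF weights: tf = raw term count,
    idf = log(N / document frequency). Illustrative sketch only."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for d in tokenized for tok in set(d))
    weights = []
    for d in tokenized:
        tf = Counter(d)
        weights.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return weights
```

A term occurring in every document gets weight 0, while rarer terms are up-weighted, which is why TF-IDF often separates toxic from non-toxic vocabulary better than raw counts.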
joj19968
# Wrangle-and-Analyze-data

#### Introduction

Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python (and its libraries) and/or SQL.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

### Project Details

Your tasks in this project are as follows:

- Data wrangling, which consists of: gathering data, assessing data, and cleaning data
- Storing, analyzing, and visualizing your wrangled data
- Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations

#### Gathering Data for this Project

Gather each of the three pieces of data as described below in a Jupyter Notebook titled wrangle_act.ipynb:

1. The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv
2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
3. Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.

#### Assessing Data for this Project

After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

#### Cleaning Data for this Project

Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

#### Storing, Analyzing, and Visualizing Data for this Project

Store the clean DataFrame(s) in a CSV file, with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

#### Reporting for this Project

Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document. Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.
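The tweet_json.txt step described above can be sketched as follows. The function name is hypothetical; the JSON keys (id, retweet_count, favorite_count) follow the standard Twitter v1.1 payload:

```python
import json

import pandas as pd

def tweets_to_frame(lines):
    """Parse one JSON tweet per line, keeping only the fields the
    project brief asks for, and return them as a pandas DataFrame.
    (Illustrative helper, not part of the project template.)"""
    rows = []
    for line in lines:
        if not line.strip():
            continue
        t = json.loads(line)
        rows.append({"tweet_id": t["id"],
                     "retweet_count": t["retweet_count"],
                     "favorite_count": t["favorite_count"]})
    return pd.DataFrame(rows)

# Typical usage against the file the brief describes:
# with open("tweet_json.txt") as f:
#     api_df = tweets_to_frame(f)
```

Writing one JSON object per line (JSON Lines) is what makes this line-by-line read possible without loading the whole file as a single JSON document.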
A Data Science project on Python Jupyter Notebook. Prediction of the polarity of tweets using several classification models.
Sultavespa
Python-based sentiment analysis of tweets using NLP and machine learning. Includes scripts for data collection, model training, and sentiment prediction. Trained on Sentiment140 dataset.
This repository contains two files. File 1 performs sentiment analysis of recent tweets using Python and TensorFlow. The second file predicts Male vs Female using a simple classifier.
bhaveshrajput99
A machine learning project that analyzes the sentiment of tweets (positive or negative) using NLP techniques. It includes a Jupyter Notebook for training and a Python GUI for real-time sentiment prediction.
ShikhaIIMA
Python code for prediction of tweet category (disaster/no disaster). Text pre-processing (stop-word removal, stemming) and binary classification using SVM, Naive Bayes, and Decision Tree.
anujapdixit
Analyzed Twitter reviews for the 5 major US airlines and derived insights on possible improvements. Used natural language processing in Python to classify airline review text. Trained the model using a variety of machine learning algorithms, such as logistic regression, random forest, and kNN, for tweet classification, and tested it using model evaluation techniques such as holdout evaluation, cross-validation, and ROC/AUC curves to select the best model. Used grid search for hyperparameter tuning, yielding a best-tuned model with an accuracy of 77.4%.
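The grid-search step can be illustrated with a minimal, library-free sketch (a toy stand-in for what a tool like scikit-learn's GridSearchCV automates; all names here are hypothetical):

```python
from itertools import product

def grid_search(train_fn, score_fn, grid):
    """Try every combination of hyperparameter values in `grid`,
    train a model for each, and keep the best-scoring one.
    Illustrative sketch only."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        model = train_fn(**params)       # e.g. fit on the training split
        s = score_fn(model)              # e.g. accuracy on a validation split
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score
```

Scoring on a held-out split (rather than the training data) is what keeps the selected hyperparameters from simply overfitting.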
AishwaryaSelvarajan
This repository contains machine learning concepts implemented in Python, with workouts on the basics of NumPy, pandas, and matplotlib. Applications using these libraries include data visualisation, salary prediction using linear regression, image classification using K-Nearest Neighbours (KNN), web scraping and sentiment-intensity analysis of Amazon reviews and Twitter tweets using BeautifulSoup, face detection, MNIST handwritten digit recognition, and the design of a neural network.
abdelrahmansamir1
# Project - Wrangling and analyzing data from twitter archives

Part of Udacity's Data Analyst Nanodegree

## Project Overview

Gathered data from the archives of a Twitter account called WeRateDogs, which rates owned dogs in their tweets and adds a humorous comment. Gathered retweet and like counts for all the tweets from the Twitter API using the access library tweepy and read them into a pandas dataframe in the Jupyter Notebook. Downloaded a tsv file about the tweets programmatically using the requests and BeautifulSoup libraries in Python. Assessed quality and tidiness issues in the gathered datasets and cleaned all of them using pandas functions. Analysed the cleaned tables and created visualizations using the matplotlib library.

## What would you need to install?

To work on the project, you will need to install either Python 3 or Anaconda.

## Project Files

- twitter-archive-enhanced: CSV file containing data about the tweets, like tweet text, dog ratings, dog name, dog type, etc.
- image-predictions: TSV file containing all the dog-breed predictions for all the tweets, computed using a neural network that predicted the breeds from the dog images in the tweets. Other features include the image URL and breed_predicted, which has two values, True and False: True when a breed was predicted and False otherwise.
- tweet_json: A text file consisting of each tweet's JSON object.
- tweet_counts_clean: A file created from a cleaned dataset consisting of retweet and like counts for all the tweets. These counts were gathered using the Twitter API and the Python library tweepy.
- image_predictions_clean: Cleaned version of the image-predictions file.
- twitter_archive_master: A master dataset created by joining the cleaned versions of all three files.
- wrangle_act: Project workbook where all the wrangling, analyzing, and visualizing was done.
- wrangle_report: Project report discussing all the activities performed in the workbook.
- act_report: Report communicating the insights drawn from the wrangled datasets with the help of visualizations.
mohmednasr
The dataset that I will be wrangling, analyzing, and visualizing is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage. WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for me to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

Image Predictions File. One more cool thing: Udacity ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The result is a table of image predictions (the top three only) alongside each tweet ID, the image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4, since tweets can have up to four images), all available at "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv".

I also gathered each tweet's retweet count and favorite ("like") count at minimum, plus any additional data I found interesting. Using the tweet IDs in the WeRateDogs Twitter archive, I queried the Twitter API for each tweet's JSON data using Python's Tweepy library and stored each tweet's entire set of JSON data in a file called 'tweet_json.txt'.
Timmtet
In this project, the dataset that was wrangled (and analyzed and visualized) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. WeRateDogs has over 4 million followers and has received international media coverage. The project consists of three major sections: 1. data gathering, 2. data assessing, and 3. data cleaning. It involves three datasets: 1. the Enhanced Twitter Archive, 2. additional data obtained via the Twitter API, and 3. an Image Predictions File. The enhanced Twitter archive file was downloaded from the web. In the enhanced Twitter archive, the retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered from Twitter's API, at least for the 3000 most recent tweets. Hence, Twitter's API was queried to gather this valuable data: using the tweet IDs in the WeRateDogs Twitter archive, the Twitter API was queried for each tweet's JSON data using Python's Tweepy library, and each tweet's entire set of JSON data was stored in a file called tweet_json.txt. The Image Predictions File is a tsv file (image_predictions.tsv) recording what is present in each tweet's image according to a neural network. It was hosted on Udacity's servers and was downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv.
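The programmatic download described here can be sketched as follows. The project brief specifies the Requests library (where requests.get(url).content plays the same role); the standard-library urllib is substituted below only to keep the sketch dependency-free, and the opener parameter is an illustrative hook, not part of any project template:

```python
from urllib.request import urlopen

URL = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")

def download(url, path, opener=urlopen):
    """Fetch a file over HTTP and write its bytes to disk.
    `opener` defaults to urllib's urlopen; it is injectable so the
    sketch can be exercised without network access."""
    with opener(url) as resp, open(path, "wb") as f:
        f.write(resp.read())

# Typical usage, per the project brief:
# download(URL, "image_predictions.tsv")
```

With Requests, the body of `download` would instead be `resp = requests.get(url); resp.raise_for_status(); open(path, "wb").write(resp.content)`.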
KingsleyElo
Project for the Data Analyst Nanodegree Program at Udacity. The WeRateDogs Twitter account has over 9 million followers and has received international media coverage. Due to the popularity of WeRateDogs, we decided to look into the statistics of the account and extract some data from the tweets. The data used in this project was gathered from various sources: manually downloading the dataset that was already available, using the requests library to download an additional dataset from the link provided, and using the tweepy API to gather further data that was needed. This resulted in three datasets, which were then loaded into three separate pandas dataframes (WeRateDogs, api_tweets, and Image_prediction). These datasets were assessed for quality and tidiness issues, visually using Excel and programmatically using Python methods such as .info(), .head(), .describe(), .value_counts(), etc.
malmusfer
I completed this project as part of Udacity's Data Analyst Nanodegree. The project is based around the "WeRateDogs" Twitter page, a page which will kindly rate pictures and videos of dogs out of ten. Since dogs are all round fantastic creatures, all of WeRateDogs' ratings are above ten. They also tag each dog with a different category out of "doggo", "floofer", "pupper", or "puppo". An archive of this Twitter data for WeRateDogs' tweets was provided for this project as a CSV file. Two more sources of data were also gathered as part of this project: predictions for which type of dog is present in each picture (carried out previously, not by myself, by passing the pictures through an image classification algorithm) and additional tweet information acquired from Twitter. I approached this project using the three steps of data wrangling: gather, assess, clean. In the gather phase, the image prediction data was downloaded using Python's Requests library, and the additional Twitter information (i.e. retweet and favorite counts) was downloaded using the Twitter API. In the following assess step, I inspected the generated data frames to find any quality or tidiness issues. The cleaning step subsequently involved implementing steps to fix the quality and tidiness issues that were previously identified. After the data wrangling process, some exploration and analysis of the (now clean and tidy) data was carried out, and numerous interesting results were observed.