US Election 2020: Trump vs Biden on Twitter

Ivan Lai · Published in Analytics Vidhya · 6 min read · Nov 2, 2020

Fine Tuning RoBERTa in PyTorch on TPUs

At the Republican National Convention earlier this year, Donald Trump proclaimed on stage: “This is the most important election in US history”. Many would agree that he might not be exaggerating by much this time. Some even predict this election could have the highest voter turnout rate since 1908. Torn between the conflicting stances and policies of the two candidates, America’s choice will have far-reaching consequences for the rest of the world.

With so much at stake, it would be interesting to see if we can predict the outcome with publicly available data. I started collecting tweets from the beginning of October, hoping to use the power of NLP and transformers to predict state by state outcomes of the election. But it quickly became apparent that the user base of Twitter is pro-Democrat in general. According to a recent study conducted by the Pew Research Center, 69% of the highly prolific tweeters are Democrats.

If there is an intrinsic bias in the data, we cannot predict the outcome using this source alone. But we can look at the trend over the course of October and see if there is any hint for things to come.

Data

Twitter’s 30-day average daily tweet volume, as of 30 October 2020, is over 20M. To choose a relevant subset of a reasonable size, we select only tweets carrying the hashtags #Trump or #Biden, and exclude retweets. (See for example this guide on how to scrape tweets.) We are left with around 20K tweets each day for analysis.
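The selection rule above can be sketched as a simple filter. The tweet dicts and field names here are illustrative assumptions, not the actual scraper output:

```python
import re

# Keep only original tweets (no retweets) whose text contains #Trump or #Biden.
HASHTAG_RE = re.compile(r"#(trump|biden)\b", re.IGNORECASE)

def keep_tweet(tweet: dict) -> bool:
    if tweet.get("is_retweet", False):
        return False
    return HASHTAG_RE.search(tweet.get("text", "")) is not None

tweets = [
    {"text": "Go vote #Trump", "is_retweet": False},
    {"text": "Rally tonight #Biden", "is_retweet": True},   # retweet, dropped
    {"text": "Election day soon", "is_retweet": False},     # no hashtag, dropped
]
selected = [t for t in tweets if keep_tweet(t)]
print(len(selected))  # 1
```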

Image by author

Looking at the bar chart race of the most popular hashtags by date (built with Flourish Studio), one thing stands out: #Trump is by far the most popular, but the majority of these tweets, as we shall see, are negative in sentiment. This election reflects the polarization of US society into two camps, pro-Trump and anti-Trump. People who vote for Biden are more likely to be Trump haters than genuine Biden supporters. Twitter has also taken the unprecedented decision to suppress, if not outright ban, some tweets on the Hunter Biden story, and this may be why hashtags like #HunterBiden did not go viral on the platform despite very high viewership on Fox News.

powered by Flourish Studio

Training, Validation and Test datasets

To generate the training data set, we assume that certain hashtags used in the text part of the tweets reflect the voting intention. For Trump, the hashtags include:

  • #Trump2020Landslide,
  • #Trump2020LandslideVictory,
  • #Trump2020ToSaveAmerica,
  • #BidenCrimeFamily

And for Biden we use the following:

  • #TrumpCovid,
  • #TrumpVirus,
  • #VoteHimOut,
  • #TrumpIsANationalDisgrace,
  • #TrumpMeltDown,
  • #TrumpCrimeFamily
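The labelling rule can be sketched as follows, using the hashtag lists quoted above (a tweet matching neither list, or both, is dropped):

```python
# Rule-based labelling: pro-Trump hashtags -> 1, pro-Biden hashtags -> 0.
TRUMP_TAGS = {"#trump2020landslide", "#trump2020landslidevictory",
              "#trump2020tosaveamerica", "#bidencrimefamily"}
BIDEN_TAGS = {"#trumpcovid", "#trumpvirus", "#votehimout",
              "#trumpisanationaldisgrace", "#trumpmeltdown",
              "#trumpcrimefamily"}

def label_tweet(text: str):
    tokens = {w.lower() for w in text.split()}
    pro_trump = bool(tokens & TRUMP_TAGS)
    pro_biden = bool(tokens & BIDEN_TAGS)
    if pro_trump == pro_biden:   # neither, or contradictory: drop
        return None
    return 1 if pro_trump else 0

print(label_tweet("Four more years! #Trump2020Landslide"))  # 1
print(label_tweet("#VoteHimOut on Nov 3"))                  # 0
```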

The size of the training set is around 20K. Manually checking a sample of a hundred tweets, this method of generating a training set has an accuracy of only around 90%, as commenters could be expressing disapproval of a hashtag, or simply being sarcastic.

To generate the validation set, we cannot use the standard method of randomly splitting off part of the training set: it would be too easy for the algorithm to simply recognize the hashtags used, without analyzing the language at all, and overfit as a result. Instead we use another set of phrases, this time matched against the fields user_name and user_description. For example, if a user has the phrase “vote Biden” in their user_description, we can be fairly certain of their voting intention. The validation set contains around 2,500 tweets.

As for the test set, we will use all tweets where the field user_location contains a reference to the USA, with one filter: Twitter accounts opened after 01 September 2020 are excluded, to prevent fake accounts spreading propaganda from skewing the results. Tweets that were used in training and validation are also excluded, leaving around 600K tweets in this set.
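A minimal sketch of the validation and test selection rules. The phrase lists and field names here are assumptions standing in for the actual ones:

```python
from datetime import date

# Illustrative phrase lists, not the actual ones used.
BIDEN_PHRASES = ("vote biden", "biden harris", "bidenharris")
TRUMP_PHRASES = ("vote trump", "maga", "trump 2020")

def validation_label(user_description: str):
    """Assign a label only when the profile text is unambiguous."""
    desc = user_description.lower()
    biden = any(p in desc for p in BIDEN_PHRASES)
    trump = any(p in desc for p in TRUMP_PHRASES)
    if biden == trump:            # neither, or both: too ambiguous
        return None
    return 0 if biden else 1

def in_test_set(user_location: str, account_created: date) -> bool:
    # US location required; accounts opened after 1 Sep 2020 excluded.
    return "usa" in user_location.lower() and account_created < date(2020, 9, 1)

print(validation_label("Mom, teacher. Vote Biden!"))         # 0
print(in_test_set("Austin, Texas, USA", date(2019, 3, 14)))  # True
```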

Text preparation

Although stopword removal and stemming were part of the standard workflow in pre-transformer days, they now often make little difference to the results, thanks to the power of transformers. We will simply:

  1. concatenate three fields from the tweet: user_name, user_description and text as our input text;
  2. remove hyperlinks;
  3. remove punctuation except apostrophe;
  4. demojize the text using the emoji library;
  5. split camel case words like “BidenHarris” to “Biden Harris”; and
  6. convert all letters to lowercase.
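The six steps above can be sketched as a single function. The demojize step assumes the third-party `emoji` library, made optional here; steps 3–5 are reordered slightly so demojized tokens survive the punctuation strip:

```python
import re

try:
    from emoji import demojize
except ImportError:
    demojize = lambda s: s  # fall back: leave emojis untouched

def prepare(user_name: str, user_description: str, text: str) -> str:
    s = " ".join([user_name, user_description, text])  # 1. concatenate fields
    s = re.sub(r"https?://\S+", "", s)                 # 2. remove hyperlinks
    s = demojize(s)                                    # 4. emojis -> :text: tokens
    s = re.sub(r"([a-z])([A-Z])", r"\1 \2", s)         # 5. split camel case
    s = re.sub(r"[^\w\s']", " ", s)                    # 3. punctuation, keep '
    return re.sub(r"\s+", " ", s).strip().lower()      # 6. lowercase (tidy spaces)

print(prepare("Jane", "Mom of two", "BidenHarris all the way! https://t.co/x"))
# jane mom of two biden harris all the way
```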

Training PyTorch RoBERTa on a Google Colab TPU

Code for fine-tuning the PyTorch Huggingface RoBERTa-based model on TPU with pre-trained weights can be found in my Google Colab Notebook.

I will not go into much detail on the code here, as I believe the workflow is reasonably clear in the notebook itself, but I will note a few non-standard features. Since the training set was selected with a simple rule, the phrase by which a tweet was selected is masked, i.e. replaced with an empty string, with a probability of 0.5 during training. This should force the model to rely not just on those phrases but on the deeper linguistic features of the tweets.
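The masking trick can be sketched as below. The RNG is seeded only to make the example reproducible; in training the mask would be redrawn every epoch:

```python
import random

def mask_seed_phrase(text: str, phrase: str, rng: random.Random,
                     p: float = 0.5) -> str:
    """With probability p, blank out the phrase that selected this tweet."""
    return text.replace(phrase, "") if rng.random() < p else text

rng = random.Random(0)
tweet = "four more years #trump2020landslide"
masked = [mask_seed_phrase(tweet, "#trump2020landslide", rng)
          for _ in range(1000)]
frac = sum("#trump2020landslide" not in t for t in masked) / len(masked)
print(round(frac, 2))  # close to 0.5
```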

Since the training set itself is noisy, we will use soft-labelling: instead of using [0, 1] for the label of the classes, we use [0.1, 0.9] here. Empirically the model trains better with this modification. The training metric profile seems reasonable, with the validation accuracy reaching a respectable level after 4 epochs, typical of many downstream fine-tuning tasks.

Image by author
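To see why soft labels help with noisy targets, here is the effect on a plain cross-entropy, in pure Python for illustration (in the actual notebook this would be the PyTorch loss on RoBERTa logits):

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy between predicted class probabilities and a (soft) target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target))

pred = [0.2, 0.8]                        # model's predicted probabilities
hard = cross_entropy(pred, [0.0, 1.0])   # standard one-hot target
soft = cross_entropy(pred, [0.1, 0.9])   # soft target used here
print(hard < soft)  # the soft target discourages over-confident predictions
```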

Results

Before we dive into the results, let’s first pick a small sample from the test data set and spot-check the quality of the predictions. After checking 100 samples, we get an accuracy of around 80% for both Trump-supporting and Biden-supporting tweets.

There are a variety of reasons for the cases where the algorithm falls short. Sometimes there is simply not enough information in the text part of the tweets — it is the accompanying image or video that carries the main message, and we have not been analyzing them. Sometimes the algorithm is defeated by the nuance of the language, and other times the tweet is just plain confusing (at least to non-Americans).

For example, one tweet states “#Trump supporters hospitalized after being stranded in freezing cold at late-night rally”. The model gives it a 73% chance of being pro-Trump, but I have my doubts. Following the link in the tweet, it turns out to be the headline of an article in the Guardian, an anti-Trump left-wing paper. So, not quite.

With the sheer sample size for each day, if there is a genuine underlying pattern, the slight handicap on accuracy should not matter too much — the pattern should still be apparent.

With the caveat out of the way, we can finally look at the interesting part: how do the relative support rates of Trump and Biden vary as election day looms? In order to prevent prolific users from distorting the results, we will first average the results by user — every account will only have one vote per day.
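The one-vote-per-account rule can be sketched as a two-stage average: first within each (user, day) pair, then across users for each day. The field names are illustrative:

```python
from collections import defaultdict

def daily_support(predictions):
    """predictions: list of (user, day, score) tuples, score in [0, 1]."""
    per_user = defaultdict(list)
    for user, day, score in predictions:
        per_user[(user, day)].append(score)
    by_day = defaultdict(list)
    for (user, day), scores in per_user.items():
        by_day[day].append(sum(scores) / len(scores))  # one vote per user
    return {day: sum(v) / len(v) for day, v in by_day.items()}

# A prolific user ("a") tweets twice but still counts once.
preds = [("a", "10-01", 1.0), ("a", "10-01", 1.0), ("b", "10-01", 0.0)]
print(daily_support(preds))  # {'10-01': 0.5}, not 0.67
```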

Here is the overall picture for the whole country:

Image by author

As mentioned before, Twitter’s user base is generally pro-Democrat, and without other sources it is hard to draw any conclusions about the true level of support. The anti-Trump camp has a huge advantage amongst Twitter users.

Yet there is a very distinct trend: Donald Trump’s support rose gradually on Twitter throughout October, reaching a local peak on the 23rd, when Hunter Biden’s former business partner Tony Bobulinski testified against the Bidens. Overall there is a swing of over 10% in favor of Donald Trump.

This pattern broadly holds for individual states as well, with more noise due to smaller sample sizes. In Pennsylvania, one of the swing states, we observe the same, if somewhat more erratic, upward trajectory for Trump.

Image by author

Has the momentum swung enough for Trump to hold on for 4 more years? The whole world is holding its breath…

Jupyter notebooks for tweet scraping, EDA, data preparation and result analysis can be found in my GitHub repo. Thanks for reading.
