Using Neural Networks to Predict the Spread of Tweets Containing Disinformation
Jeremy Swack, Jinyang Liu, Aaraj Vij, Atharv Gupta
In Summer 2021, DisinfoLab established a Technical Analyst team to grow the research capacity of our lab. Our Technical Analysts build tools to collect large swaths of data for analysis and employ artificial intelligence to identify salient trends in disinformation. For our pilot project, we built a neural network that predicts the engagement of tweets containing disinformation based on thirteen variables. Such research may aid social media companies in implementing measures to stop the spread of disinformation on their platforms.
To train our neural network, we first collected Twitter data from two online platforms: BotSentinel and Hoaxy.
BotSentinel is a program that tracks inauthentic Twitter account activity. Using artificial intelligence and machine learning, BotSentinel generates authenticity ratings for Twitter accounts based on their adherence to Twitter's user guidelines. These ratings are not binary: they do not directly verify whether an account is a "bot" or a "real" person. Instead, the system rates each account on a scale from 1 to 100 and categorizes it as normal, questionable, disruptive, or problematic. Using this data, BotSentinel then compiles hourly updated lists of the top hashtags, two-word phrases, URLs, and mentions tweeted by likely inauthentic accounts. More information about how BotSentinel analyzes accounts can be found on their website.
Hoaxy is an online tool developed by the Indiana University Observatory on Social Media that visualizes the spread of articles and phrases on Twitter. Hoaxy tracks the sharing of links to stories from low-credibility sources and offers a Live Search function where users can search for the network spread of specific phrases and links. Hoaxy also calculates a bot score for users sharing a target piece of information, using a machine learning algorithm called Botometer.
For data collection, our analyst team uses three Python scripts:
The first script scrapes the trending two-word phrases identified by BotSentinel and feeds them as Live Search terms into Hoaxy. Hoaxy retrieves relevant tweets, and the results are exported to spreadsheets saved locally.
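As a rough illustration of the scraping step, the sketch below pulls two-word phrases out of an HTML fragment with BeautifulSoup. The CSS class `trending-phrase` and the page structure are hypothetical; BotSentinel's real markup is not documented here and may differ.

```python
# Minimal scraping sketch. The "trending-phrase" selector is a stand-in;
# BotSentinel's actual page structure may differ.
from bs4 import BeautifulSoup


def extract_trending_phrases(html: str) -> list[str]:
    """Pull two-word phrases out of a trending-list HTML fragment."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumes each phrase sits in an element with class "trending-phrase".
    return [el.get_text(strip=True) for el in soup.select(".trending-phrase")]
```

The extracted phrases would then be submitted to Hoaxy's Live Search one at a time.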
The second script uses the Twitter API to collect all publicly available user and tweet information, including the full text of each tweet.
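A sketch of the flattening step is shown below: it maps a Twitter API user object onto the account-level variables used later by the model. Field names follow the v1.1 user object (`friends_count`, `favourites_count`, `default_profile_image`, and so on); the lab's actual script may use a different API version or field set.

```python
# Hypothetical helper: flatten a Twitter API v1.1 user object into the
# account-level features the model consumes. Field names follow the v1.1
# schema; the lab's real script may differ.
def user_features(user: dict) -> dict:
    return {
        "created_at": user["created_at"],                       # account creation date
        "following": user["friends_count"],                     # users the account follows
        "followers": user["followers_count"],                   # users following the account
        "listed": user["listed_count"],                         # public lists
        "statuses": user["statuses_count"],                     # tweets posted
        "likes": user["favourites_count"],                      # tweets liked
        "verified": int(user["verified"]),                      # 1 if verified, else 0
        "default_profile_image": int(user["default_profile_image"]),  # 1 if default picture
    }
```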
The third script applies Valence Aware Dictionary and sEntiment Reasoner (VADER) natural language processing to determine the polarity, subjectivity, and sentiment (negative, neutral, or positive) of each tweet.
Using these scripts, our team collected approximately 20,000 data points between July and August 2021.
To analyze the collected data, we developed a neural network that predicts the number of Twitter interactions—retweets, mentions, and quote tweets—that a tweet will receive based on thirteen variables:
Hoaxy data:
  Bot score of the account
Twitter API data:
  Date the account was created
  Number of users the account follows
  Number of users following the account
  Number of public lists the account has
  Number of tweets the account has posted
  Number of tweets the account has liked
  Verification status of the account
  Default or custom profile picture
  Language of the target tweet
VADER natural language processing data:
  Polarity of the target tweet
  Subjectivity of the target tweet
  Sentiment score of the target tweet
To train the model, we split the data into 15,000 training data points and 5,000 testing data points. We built the model using Keras’ Sequential class to produce a single numerical output. To improve the model’s accuracy and prevent overfitting, we added batch normalization and dropout layers alongside the dense layers. The results are shown in the plot below.
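A minimal sketch of such a model is shown below, assuming the thirteen-feature input described above. The layer sizes and dropout rate are illustrative assumptions, not the lab's published architecture; the structure (dense layers interleaved with batch normalization and dropout, ending in a single numeric output) follows the description in the text.

```python
# Sketch of a Keras Sequential regression model with batch normalization
# and dropout. Layer widths and the 0.3 dropout rate are assumptions;
# the single output unit predicts the number of interactions.
from tensorflow import keras
from tensorflow.keras import layers


def build_model(n_features: int = 13) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),   # stabilizes training
        layers.Dropout(0.3),           # combats overfitting
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1),               # single numeric output: predicted interactions
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```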
This plot shows the actual number of Twitter interactions a tweet received on the X-axis against the model's predicted number of interactions on the Y-axis for the 5,000 testing data points. The closer a point lies to the blue line, the more accurate the prediction for that data point. Most points fall between zero and one hundred interactions, and the model was generally accurate in this range. At the extremes, however, where there was less data to train on, the model struggled to produce consistently accurate predictions.
To gather insight into the significance of each variable, we made a variable importance chart using the package SHAP. This package allows us to calculate SHapley Additive exPlanations (SHAP) values, or the average contribution of a variable to the model’s predictions across all permutations of the different variables. The variable importance chart is displayed below.
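The idea behind Shapley values can be illustrated directly: a feature's Shapley value is its marginal contribution to the prediction, averaged over every ordering in which features can be revealed. The toy sketch below computes this exactly for a small model with an arbitrary linear prediction function; the SHAP package approximates the same quantity efficiently for real neural networks.

```python
# Exact Shapley values for a tiny model, computed by brute force over all
# feature orderings. This is only feasible for a handful of features; the
# SHAP package approximates the same idea at scale.
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Average marginal contribution of each feature of x, relative to a
    baseline input, over all feature orderings."""
    n = len(x)
    values = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        current = list(baseline)          # start from the baseline input
        for i in order:
            before = predict(current)
            current[i] = x[i]             # reveal feature i
            values[i] += predict(current) - before
    return [v / len(perms) for v in values]
```

For a purely linear model such as `2*a + 3*b - c`, each feature's Shapley value recovers its coefficient times its deviation from the baseline, which matches the intuition that SHAP measures how much each variable pushes the prediction up or down.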
The top ten most impactful variables are shown on this chart. A variable with a blue bar means that a higher value of that variable increases the model’s prediction of the number of engagements a tweet will receive, while a red bar means that a higher value of that variable lowers the model’s prediction.
Based on our findings, the most important variable in predicting a tweet's engagement was the Hoaxy bot score of the account that posted it. Tweets from accounts with a higher bot score, and thus a higher probability of being bots, received a significant positive boost to their predicted number of interactions. The second most important variable was the date the account was created; its red bar means that tweets from older accounts received a significant negative push on their predicted interactions. Taken together, tweets from accounts that were newer and more likely to be bots tended to receive higher levels of interaction. The third most important variable, sentiment score, comes from the VADER natural language processing applied to the data. This value ranges from -1 (most negative) through 0 (neutral) to 1 (most positive) and indicates the sentiment of a given tweet. Its blue bar means that more positive tweets received a boost in their predictions.
Interestingly, the number of tweets or statuses an account had posted, the number of accounts it followed, and its number of followers were not among the most significant predictors of a tweet's interactions. A likely explanation is that viral tweets can originate from accounts of any size, depending on the kinds of interactions they attract.
We have begun examining the performance of this model on emerging articles containing disinformation. We identified 15 articles containing disinformation and utilized Hoaxy’s Live Search feature to collect 400 data points. We then separated these articles and data points into three distinct categories: COVID-19, International Relations, and United States Politics. We tested these data points on our existing model, which produced the following result.
For these known disinformation data points, our existing model performed poorly. Two possible explanations for this performance are:
The test data set is small, and therefore subject to high variance
Our metric for collecting data—tweets containing flagged phrases from BotSentinel—is an inadequate proxy for the true spread of disinformation on Twitter
DisinfoLab will continue to investigate these possibilities and create new training and test sets for an updated model.