Automated Archival and Visual Analysis of Tweets Mentioning #bog13, Bioinformatics, #rstats, and Others

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)
Automatically Archiving Twitter Results

Ever since Twitter gamed its own API and killed off great services like IFTTT triggers, I've been looking for a way to automatically archive tweets containing certain search terms of interest to me. Twitter's built-in search is limited, and I wanted to archive interesting tweets for future reference and to start playing around with some basic text / trend analysis.

Enter t, the Twitter command-line interface. t is a power tool for running all sorts of Twitter queries from the command line. See t's documentation for examples.

I wrote a script that uses t to search Twitter separately for each of a set of keywords and append the results to one file per keyword. The comments at the end of the script also show you how to commit changes to a git repository, push to GitHub, and automate the entire process to run twice a day with a cron job. Here's the code as of May 14, 2013:
#!/bin/bash
 
## twitterchive.sh
## Stephen Turner (stephenturner.us)
##
## Script uses the t command line client (https://github.com/sferik/t)
## to search twitter for keywords stored in the arr variable below.
##
## Must first install the t gem and authenticate with OAuth.
##
## Twitter enforces some API limits to how many tweets you can search for
## in one query, and how many queries you can execute in a given period.
##
## I'm not sure what these limitations are, but I've hit them a few times.
## To be safe, I would limit the number of queries to ~5, $n to ~200, and
## run no more than a couple times per day.
 
## declare an array variable containing all your search terms.
## prefix any hashtags with a \
declare -a arr=(bioinformatics metagenomics rna-seq \#rstats)
 
## How many results would you like for each query?
n=250
 
## cd into where the script is being executed from.
DIR="$(dirname "$(readlink "$0")")"
cd "$DIR" || exit 1
echo "$DIR"
pwd
 
echo
 
## now loop through the above array
for query in "${arr[@]}"
do
    ## if your query contains a hashtag, remove the "#" from the filename
    filename=$DIR/${query/\#/}.txt
    printf 'Query:\t%s\n' "$query"
    printf 'File:\t%s\n' "$filename"

    ## create the file for storing tweets if it doesn't already exist.
    if [ ! -f "$filename" ]
    then
        touch "$filename"
    fi

    ## use t (https://github.com/sferik/t) to search the last $n tweets in the query,
    ## concatenating that output with the existing file, sort and uniq that, then
    ## write the results to a tmp file.
    search_cmd="t search all -ldn $n '$query' | cat - $filename | sort | uniq | grep -v ^ID > $DIR/tmp"
    printf 'Search:\t%s\n' "$search_cmd"
    eval "$search_cmd"

    ## rename the tmp file to the original filename
    rename_cmd="mv $DIR/tmp $filename"
    printf 'Rename:\t%s\n' "$rename_cmd"
    eval "$rename_cmd"

    echo
done
 
## push changes to github.
## errors running git push via cron necessitated authenticating over ssh instead of https
# git commit -a -m "Update search results: $(date)"
# git push origin master
 
## Run with a cronjob, e.g. daily at noon:
## 00 12 * * * cd /path/to/twitterchive/ && ./twitterchive.sh
## (use "00 0,12 * * *" as the schedule fields to run twice a day instead)


That script, along with the results of searching for "bioinformatics", "metagenomics", "#rstats", "rna-seq", and "#bog13" (the Biology of Genomes 2013 meeting), is in the GitHub repository below. (Please note that these results update dynamically, and a Twitter search can return unsavory Tweets at any time.)

https://github.com/stephenturner/twitterchive
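
The merge step is what lets the script run repeatedly without piling up duplicates: new results are concatenated with the existing archive, sorted, and de-duplicated before the file is replaced. Here is a minimal sketch of just that step, pulled out into a function so it can be tried on sample files without calling Twitter. The function name merge_archive and the file names in the usage comment are made up for illustration; the pipeline itself mirrors the one in the script above.

```shell
# merge_archive NEW ARCHIVE   (hypothetical helper, not part of the repository)
# Print the union of freshly fetched lines (NEW) and the existing ARCHIVE,
# dropping exact duplicate lines and any line starting with "ID"
# (assumed to be the column header, as the script's own grep implies).
merge_archive() {
  cat "$1" "$2" | sort | uniq | grep -v '^ID'
}

# Usage, with hypothetical file names:
#   merge_archive new.txt rstats.txt > tmp && mv tmp rstats.txt
```

Because uniq compares whole lines, a tweet only counts as a duplicate when the line printed for it is identical on both runs.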

Analyzing Tweets using R

You'll also find an analysis subdirectory, containing some R code to produce barplots showing the number of tweets per day over the last month, frequency of tweets by hour of the day, the most used hashtags within a search, the most prolific tweeters, and a ubiquitous word cloud. Much of this code is inspired by Neil Saunders's analysis of Tweets from ISMB 2012. Here's the code as of May 14, 2013:
## Most of this code was adapted near-verbatim from Neil's post about ISMB 2012.
## http://nsaunders.wordpress.com/2012/08/16/twitter-coverage-of-the-ismb-2012-meeting-some-statistics/
 
## Modify this. This is where I keep this repo.
repoDir <- ("~/code/twitterchive/")
 
## Go to the analysis directory
setwd(paste(repoDir, "analysis", sep=""))
 
## Function needs better documentation
twitterchivePlots <- function (filename=NULL) {
  ## Load required packages
  require(tm)
  require(wordcloud)
  require(RColorBrewer)
  if (!is.character(filename)) stop("filename must be character")
  if (!file.exists(filename)) stop(paste("File does not exist:", filename))
  searchTerm <- sub("\\.txt$", "", basename(filename))
  message(paste("Filename:", filename))
  message(paste("Search Term: ", searchTerm))
  ## Read in the data and munge around the dates.
  ## I can't promise the fixed widths will always work out for you.
  message("Reading in data.")
  ## Function to trim leading and trailing whitespace from character vectors.
  trim.whitespace <- function(x) gsub("^\\s+|\\s+$", "", x)
  d <- read.fwf(filename, widths=c(18, 14, 18, 1000), stringsAsFactors=FALSE, comment.char="")
  ## Keep the columns as character (not factor) so the string functions below work.
  d <- as.data.frame(sapply(d, trim.whitespace), stringsAsFactors=FALSE)
  names(d) <- c("id", "datetime", "user", "text")
  d$user <- sub("@", "", d$user)
  ## The format has no year, so as.POSIXlt assumes the current year.
  d$datetime <- as.POSIXlt(d$datetime, format="%b %d %H:%M")
  d$date <- as.Date(d$datetime)
  d$hour <- d$datetime$hour
  d <- na.omit(d) # CRs cause a problem. explain this later.
  head(d)
  ## Number of tweets by date for the last n days
  recentDays <- 30
  message(paste("Plotting number of tweets by date in the last", recentDays, "days."))
  recent <- subset(d, date>=(max(date)-recentDays))
  byDate <- as.data.frame(table(recent$date))
  names(byDate) <- c("date", "tweets")
  png(paste(searchTerm, "barplot-tweets-by-date.png", sep="--"), w=1000, h=700)
  par(mar=c(8.5,4,4,1)) # margins are set here; barplot() itself does not take a mar argument
  with(byDate, barplot(tweets, names=date, col="black", las=2, cex.names=1.2, cex.axis=1.2,
                       main=paste("Number of Tweets by Date", paste("Term:", searchTerm), sep="\n")))
  dev.off()
  # ggplot(byDate) + geom_bar(aes(date, tweets), stat="identity", fill="black") + theme_bw() + ggtitle("Number of Tweets by Date") + theme(axis.text.x=element_text(angle=90, hjust=1))
  ## Number of tweets by hour
  message("Plotting number of tweets by hour.")
  byHour <- as.data.frame(table(d$hour))
  names(byHour) <- c("hour", "tweets")
  png(paste(searchTerm, "barplot-tweets-by-hour.png", sep="--"), w=1000, h=700)
  with(byHour, barplot(tweets, names.arg=hour, col="black", las=1, cex.names=1.2, cex.axis=1.2,
                       main=paste("Number of Tweets by Hour", paste("Term:", searchTerm), paste("Date:", Sys.Date()), sep="\n")))
  dev.off()
  # ggplot(byHour) + geom_bar(aes(hour, tweets), stat="identity", fill="black") + theme_bw() + ggtitle("Number of Tweets by Hour")
  ## Barplot of top 20 hashtags
  message("Plotting top 20 hashtags.")
  words <- unlist(strsplit(d$text, " "))
  head(table(words))
  ht <- words[grep("^#", words)]
  ht <- tolower(ht)
  ht <- gsub("[^A-Za-z0-9]", "", ht) # strip every character that is not a letter or number (including the "#")
  ht <- as.data.frame(table(ht))
  ht <- subset(ht, ht!="") # remove blanks
  ht <- ht[sort.list(ht$Freq, decreasing=TRUE), ]
  ht <- ht[-1, ] # remove the term you're searching for? it usually dominates the results.
  ht <- head(ht, 20)
  head(ht)
  png(paste(searchTerm, "barplot-top-hashtags.png", sep="--"), w=1000, h=700)
  par(mar=c(5,10,4,2))
  with(ht[order(ht$Freq), ], barplot(Freq, names=ht, horiz=T, col="black", las=1, cex.names=1.2, cex.axis=1.2,
                                     main=paste("Top hashtags", paste("Term:", searchTerm), paste("Date:", Sys.Date()), sep="\n")))
  dev.off()
  # ggplot(ht) + geom_bar(aes(ht, Freq), fill = "black", stat="identity") + coord_flip() + theme_bw() + ggtitle("Top hashtags")
  ## Top Users
  message("Plotting most prolific users.")
  users <- as.data.frame(table(d$user))
  colnames(users) <- c("user", "tweets")
  users <- users[order(users$tweets, decreasing=T), ]
  users <- subset(users, user!=searchTerm)
  users <- head(users, 20)
  head(users)
  png(paste(searchTerm, "barplot-top-users.png", sep="--"), w=1000, h=700)
  par(mar=c(5,10,4,2))
  with(users[order(users$tweets), ], barplot(tweets, names=user, horiz=T, col="black", las=1, cex.names=1.2, cex.axis=1.2,
                                             main=paste("Most prolific users", paste("Term:", searchTerm), paste("Date:", Sys.Date()), sep="\n")))
  dev.off()
  ## Word clouds
  message("Plotting a wordcloud.")
  words <- unlist(strsplit(d$text, " "))
  words <- grep("^[A-Za-z0-9]+$", words, value=T)
  words <- tolower(words)
  ## remove "RT" and "MT". grepl() is used because words[-grep(...)] drops everything when nothing matches.
  words <- words[!grepl("^[rm]t$", words)]
  words <- words[!(words %in% stopwords("en"))] # remove stop words
  words <- words[!(words %in% c("mt", "rt", "via", "using", 1:9))] # remove RTs, MTs, via, and single digits.
  wordstable <- as.data.frame(table(words))
  wordstable <- wordstable[order(wordstable$Freq, decreasing=T), ]
  wordstable <- wordstable[-1, ] # remove the hashtag you're searching for? need to functionalize this.
  head(wordstable)
  png(paste(searchTerm, "wordcloud.png", sep="--"), w=800, h=800)
  wordcloud(wordstable$words, wordstable$Freq, scale = c(8, .2), min.freq = 3, max.words = 200, random.order = FALSE, rot.per = .15, colors = brewer.pal(8, "Dark2"))
  #mtext(paste(paste("Term:", searchTerm), paste("Date:", Sys.Date()), sep=";"), cex=1.5)
  dev.off()
  message(paste(searchTerm, ": All done!\n"))
}
 
filelist <- list("../bioinformatics.txt", "../metagenomics.txt", "../rstats.txt", "../rna-seq.txt")
lapply(filelist, twitterchivePlots)


Also in that analysis directory you'll see periodically updated plots for the results of the queries above.

Analyzing Tweets mentioning "bioinformatics"

Using the bioinformatics query, here is the number of tweets per day over the last month:

Here is the frequency of "bioinformatics" tweets by hour:

Here are the most used hashtags (other than #bioinformatics):

Here are the most prolific bioinformatics Tweeps:

Here's a wordcloud for all the bioinformatics Tweets since March:

Analyzing Tweets mentioning "#bog13"

The 2013 CSHL Biology of Genomes Meeting took place May 7-11, 2013. I searched and archived Tweets mentioning #bog13 from May 1 through May 14 using this script. You'll notice in the code above that I'm no longer archiving this hashtag. I probably need a better way to temporarily add keywords to the search, but I haven't gotten there yet.
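
One possible approach, offered as a sketch and not something that exists in the repository: keep the search terms in a plain text file, one per line, with an optional expiry date after the term, and loop over only the terms that have not expired. The file name terms.txt, the function name active_terms, and the "term expiry" layout are all assumptions made for this example, and terms containing spaces are not handled.

```shell
# active_terms FILE [TODAY]   (hypothetical helper)
# Each non-blank line of FILE is "term" or "term YYYY-MM-DD".
# Print every term with no expiry date, or with an expiry on or after TODAY
# (default: the current date). ISO dates sort correctly as plain strings,
# so a string comparison is enough.
active_terms() {
  terms_file=$1
  today=${2:-$(date +%Y-%m-%d)}
  awk -v today="$today" \
    'NF && ($2 == "" || ($2 "") >= (today "")) { print $1 }' "$terms_file"
}

# Example terms.txt (hypothetical):
#   bioinformatics
#   #rstats
#   #bog13 2013-05-14
#
# The loop in twitterchive.sh could then read its queries like this:
#   active_terms terms.txt | while IFS= read -r query; do ...; done
```

With a file like the one in the comment, #bog13 would be searched through May 14 and then dropped without anyone editing the script.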

Here is the number of Tweets per day during that period. Tweets clearly peaked a couple of days into the meeting, with follow-up commentary trailing off quickly after the meeting ended.

Here is the frequency of Tweets by hour, clearly bimodal:

Top hashtags (other than #bog13). Interestingly, #bog14 was the most highly used hashtag, so I'm guessing lots of folks are looking forward to next year's meeting. Also, #ashg12 got lots of mentions, presumably because someone presented updated work from last year's ASHG meeting.

Here were the most prolific Tweeps - many of the usual suspects here, as well as a few new ones (new to me at least):

And finally, the requisite wordcloud:

More analysis

If you look in the analysis directory of the repo you'll find plots like these for other keywords (#rstats, metagenomics, rna-seq, and others to come). I would also like to do some sentiment analysis as Neil did in the ISMB post referenced above, but the sentiment package has since been removed from CRAN. I hear there are other packages for polarity analysis, but I haven't yet figured out how to use them. I've given you the code to do the mundane stuff (parsing the fixed-width files from t, for starters). I'd love to see someone take a stab at some further text mining / polarity / sentiment analysis!
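
For a quick look at an archive file without starting R, a rough hashtag count can also be done in the shell. This is a minimal sketch, not a port of the R function: the name top_hashtags is made up, it matches hashtags with a simple pattern instead of splitting on spaces, and it does not drop the search term's own hashtag the way the R code does.

```shell
# top_hashtags FILE [N]   (hypothetical helper)
# Print the N (default 20) most frequent hashtags in FILE as "hashtag count",
# lower-cased, most frequent first.
top_hashtags() {
  grep -o '#[A-Za-z0-9_][A-Za-z0-9_]*' "$1" |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -rn |
    head -n "${2:-20}" |
    awk '{ print $2, $1 }'
}

# Usage, assuming an archive file written by twitterchive.sh:
#   top_hashtags rstats.txt 10
```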

twitterchive - archive and analyze results from a Twitter search