I just signed up to Pinboard because I wanted a permanent resource to capture all the links I’ve posted and retweeted on Twitter. While Pinboard integrates well in terms of capturing links from your ongoing feed, it will only reach back through the previous 3,200 tweets due to Twitter’s API limit. So the first thing I wanted to do was process my long-term Twitter archive to get everything from the previous four years. You’d think other people would want this too, so something must exist to do it, right? Wrong.
In the end, I had to write a small Ruby hack to parse the data from the archive, grep the links and post them to Pinboard. It was actually fairly easy once I’d found a suitable example via Google and identified which variant of each Ruby gem I needed was currently maintained and working. I added a few comfort touches like expansion of shortened links and decoding of HTML entities (Twitter uses them; Pinboard doesn’t). What took the most time was understanding the Twitter archive data format, since there doesn’t appear to be any formal documentation for it. But it’s basically JSON, so is fairly readable once you’ve perused a few example tweets. [N.B. I’m not a JSON expert or a regular coder and have apparently forgotten the little Ruby I ever learned, so treat all this as the desperate grasps for comprehension of a total naif.]
Your tweets are all stored in datestamped files within the data/js/tweets directory of the archive. Each file is formatted in JSON except for the first line (beginning ‘Grailbird…’), which needs to be discarded:
require 'json'   # needed for JSON.parse

data = File.read(file).sub(/^Grailbird\.data\.tweets_([^=]*)=/, '')   # strip the JS assignment prefix
j = JSON.parse(data)
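For context, here’s a minimal sketch of looping over the whole directory (assuming you run it from the archive root; the exact file names are just what I saw in my own archive):

require 'json'

# Each monthly file (e.g. data/js/tweets/2013_02.js) holds one month of tweets.
Dir.glob('data/js/tweets/*.js').sort.each do |file|
  data   = File.read(file).sub(/^Grailbird\.data\.tweets_([^=]*)=/, '')
  tweets = JSON.parse(data)   # array of tweet hashes for that month
  puts "#{file}: #{tweets.length} tweets"
end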
What you’re left with is an array of individual tweets for that month, each element containing hashes of the various components of the tweet, beginning with the source (i.e. the app or website used to post the tweet). The key parts required for Pinboard posts are any hashtags and URLs from the entities hash, the text and the created_at field. Note that the urls entity doesn’t appear to be populated in older tweets (circa 2010), so you need to grep the text with a suitable regex to locate any links if this hash is empty (I used the URI.extract method for this).
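That fallback is essentially a one-liner; a sketch, assuming tweet holds one parsed tweet hash:

require 'uri'

# Old tweets: pull any http(s) links straight out of the text.
# Returns nil if the tweet contains no link at all.
link = URI.extract(tweet['text'], %w[http https]).first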
If the entity is populated then take the expanded_url field. (The urls entity is actually an array of URLs, but as a Pinboard post can only show one link, I only take the first element each time. However, there’s a fallback method to view any others, as discussed further below.) I used the LongURL module to try to expand each link to its final destination, bypassing any URL shortener used (the shortened link is itself often shortened again by Twitter’s t.co wrapper) and generating a meaningful link.
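A hedged sketch of that step, assuming the longurl gem’s LongURL.expand call and the tweet variable from above (dead shortener domains raise an error, so I’d guard the call):

require 'longurl'

# Prefer the first expanded_url when the entity is populated.
urls = tweet['entities']['urls']
link = urls.first['expanded_url'] unless urls.nil? || urls.empty?
begin
  link = LongURL.expand(link)   # follow t.co and any shortener behind it
rescue StandardError
  # leave link unexpanded; the original tweet text preserves it anyway
end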
Similarly, the hashtags entity is an array of hashtags in the tweet, so I iterate over that and gather the text item from each entry for Pinboard’s tagtext array parameter.
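In Ruby that gathering amounts to a map; again a sketch against the parsed tweet hash:

# Collect hashtag texts; Pinboard takes tags as a space-separated list.
tags = tweet['entities']['hashtags'].map { |h| h['text'] }.join(' ')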
The text part becomes the Pinboard description field; since this contains the original text including all the links as posted, it acts as a backup of the original URI(s) in the event that the link expansion doesn’t work correctly.

One thing I’ve learnt from this: obsolete URL shorteners are destructive to the Internet’s memory, since you’re left with no easy way to recover the original link destination. (Principal offender here is The Browser’s apparently defunct b.rw app, which means that their older posted links are now all invalid. Bit of a drawback for a curation site, that.) Also, many sites replace obsolete page links with redirects to their top-level home page (or the page of the company that bought them out), which is no help at all. I guess that’s the drawback of relying on an ‘ephemeral’ medium like Twitter for archiving.
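Putting the pieces together, here’s a sketch of the actual post, going straight at Pinboard’s documented v1 posts/add endpoint rather than through any particular gem; link and tags come from the snippets above, and the auth token is obviously a placeholder:

require 'net/http'
require 'uri'
require 'cgi'
require 'time'

params = {
  'url'         => link,
  'description' => CGI.unescapeHTML(tweet['text']),   # Twitter stores HTML entities; Pinboard wants plain text
  'tags'        => tags,
  'dt'          => Time.parse(tweet['created_at']).utc.iso8601,   # keep the tweet's datestamp
  'auth_token'  => 'username:API_TOKEN'   # placeholder: your Pinboard token
}
uri = URI('https://api.pinboard.in/v1/posts/add')
uri.query = URI.encode_www_form(params)
puts Net::HTTP.get_response(uri).code    # "200" on success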
The only tricky part concerns (native) retweets: the tweet contains details of your retweet, including the ‘RT’ header with the retweeted user’s name and abridged text, while the original tweet is nested in a retweeted_status field within that tweet. This means that when you find such a field, you need to pull the relevant details from the surrounding tweet (I use the text for Pinboard’s description field as it shows the attribution, and retain the datestamp of the retweet rather than the original) and then extract the retweet for the actual link (you can treat this as a normal element; i.e. unwrap the parent element and proceed as before). I put the original, unabridged tweet text in Pinboard’s extended description field.
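A sketch of that unwrapping, with tweet as the outer element as before:

# Native retweets nest the original tweet in retweeted_status.
if (original = tweet['retweeted_status'])
  extended = original['text']       # unabridged original text for Pinboard's extended field
  entities = original['entities']   # take links and hashtags from the original
else
  extended = ''
  entities = tweet['entities']
end
# tweet['text'] and tweet['created_at'] are still used for the description
# and datestamp, keeping the RT attribution and the time of the retweet.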
Unfortunately, I haven’t been able to process my entire archive as the one I’d previously downloaded only extends to February 2013 and, when I try to request an updated one from Twitter, it first wants to verify my email address and then fails to send a confirmation email for this purpose, despite continuing to send notifications successfully to the same address.
I’m surprised that there appears to be little other work undertaken in mining and analysing Twitter archives, as there are probably a number of vaguely useful stats and summaries that could be generated from them. But then I guess most of that isn’t readily monetisable, particularly as the data isn’t considered current.
Other bubbles
- Exploring your twitter archive with unix: David MacIver has a nice blog post on analysing your archive with jq, a command-line JSON parser.
- Original code fragment I used as the basis of my parser hack (in Japanese, but the code is easily understandable).