
An Archive of My Own

A post about a guy making an archive of his twitter data made the rounds lately, so I figured I should make my own post about my ongoing efforts in this regard. I mentioned in an earlier post that I like being able to use social media to dig through my own history. But as the first link above says, these social media sites can go away; nothing lasts forever. In the spirit of the Indieweb principle to own your data, for the past few months I’ve been working on archiving most of my social media content on my own site. I also intend to cut down on my social media accounts and activity, and having a backup of all the data on my own site helps with that effort.

Requirements and considerations:

  • I wanted my content from external sites to appear on this site, though not necessarily as posts; I would create new content types as necessary.
  • Privacy: I would only import content that I considered okay for public consumption. Almost all of my social media accounts are fully public anyway, except for Facebook, where I restrict access because I use it to post pictures of family. More on this when I talk about importing from Facebook.
  • Comments and reactions: Ideally I would be able to preserve comments and reactions to my content as well (tweet replies/likes, FB comments/reacts, Quora comments, etc.), but that kind of runs afoul of the privacy requirement, since technically those commenters did not consent to me publishing their comments on my own site. I decided to be more conservative in this regard. It would also have required a massive effort, since most of the backup data exports did not include third-party comments. Only selected comments from certain sources are included on this site.
  • Syndication links: As much as possible, I wanted to retain links to where an entry originally existed. Many of my older entries have been cross-posted across different social media platforms, so they will have multiple syndication links.

Another question: if I were to import content from Twitter, Facebook, etc., how would it appear on the blog? Would everything be imported as posts, turning the blog into one giant mess? I decided to maintain my traditional blog posts as the primary focus of the site, while the content from social media would largely be confined to some new sections (these are meant to map to Indieweb post types):

  • notes: microblog-like status updates
  • photos: photos, sketches, screenshots, and definitely memes. Basically anything with an image
  • links: links and bookmarks shared from various places
  • replies: replies to content on other sites. This is mostly just twitter replies at the moment.
  • reposts: reposts/shares of content on other sites. This would include twitter retweets and tumblr reblogs.

I also wanted two types of home page: one is the more traditional homepage listing only the most recent posts, while the other is a firehose of all posts plus the above content, currently here. Unfortunately, I couldn’t (yet) figure out how to paginate this page with Hugo, as the pagination support is limited to the homepage and listing pages and not to custom pages. Maybe someday.

Basically, I wrote a hodgepodge of different Python scripts to import data from various social media sources into this Hugo blog. It has been a huge effort, and frankly the sort of thing I expect only someone as insane as me would do. The scripts are available in this blog’s own repo, but they are not portable at all: they are highly specific to my use cases and require certain temp files I keep around on my local filesystem that are unavailable in the repo.
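
To give an idea of the general shape of these scripts, here is a minimal sketch: take a normalized item from some source and write it out as a Hugo content file under one of the sections listed above. The function name, field names, and directory layout here are illustrative assumptions, not the actual schema the site uses.

```python
import os
from datetime import datetime, timezone

def write_hugo_entry(item, section="notes", content_dir="content"):
    """Write one imported item as a Hugo markdown file with YAML front matter."""
    date = item["date"]  # a timezone-aware datetime
    path = os.path.join(content_dir, section, date.strftime("%Y"), item["slug"] + ".md")
    os.makedirs(os.path.dirname(path), exist_ok=True)

    lines = [
        "---",
        f"date: {date.isoformat()}",
        f"source: {item['source']}",  # e.g. twitter, tumblr, plurk
        "syndication:",               # links back to where this entry originally lived
    ]
    lines += [f"  - {url}" for url in item["syndication"]]
    lines += ["---", "", item["text"], ""]

    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

# Hypothetical example: a tweet imported as a note
write_hugo_entry({
    "date": datetime(2010, 5, 1, 12, 0, tzinfo=timezone.utc),
    "slug": "example-note",
    "source": "twitter",
    "syndication": ["https://twitter.com/example/status/123456"],
    "text": "An example imported note.",
})
```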

Some technical challenges I encountered:

  • Link rot. As much as possible I wanted to preserve the intent of the original items. This meant following and resolving link shorteners so that the original URL would be preserved in the archive here (instead of, say, the t.co URLs twitter uses or the bit.ly links twitter used to use); a sketch of this resolution appears after this list. But those shorteners didn’t always resolve to the same URL as they did before. Some sites have changed their structure so that the original URLs are no longer available and just redirect to their homepage. Some sites like imgur deleted old images from when they were starting out (they no longer delete any images as of now), so their links resolve to something like “imgur.com/removed”. Some URL shortener services are no longer around (like ping.fm), so for entries crossposted using ping.fm, the contexts are forever lost. What is archived here now represents a “best effort” at trying to reconstruct the original content as it was.
  • Permalink scheme. Given the link rot issue above, I needed to make sure I chose a reasonable permalink scheme for the new content types, since I would want to avoid changing the permalinks in the future.
  • Deduplication. Since I had a lot of content cross-posted across different social media sites, I had to do some amount of duplicate detection, i.e. I didn’t want the same content to be posted under two separate entries, one from twitter and one from tumblr. This involved some rudimentary matching based on content and date posted (also sketched after this list). The matching was challenging since the date posted wouldn’t always be consistent across platforms.
  • Hugo performance and limitations. With all this data imported, the number of content pages shot up from around 1,000 to roughly 15,000, pushing the Hugo static generation to its limits. Generation time shot up from under 10 seconds to just under 2 minutes. For a while, it was even reaching as long as 5-10 minutes, forcing me to do some template optimization to keep the performance reasonable. I’m not sure how much further I can optimize my templates; I tried switching to some other themes to see if they fared any better with this amount of content, but the generation times were generally in the same ballpark. Perhaps I will revisit this again in the future. I’m also starting to miss some features of dynamic sites, like the ability to create custom queries, something like “list all replies tagged #mtg”, which are proving difficult if not impossible to do with Hugo. There’s also the limitation on homepage pagination I mentioned earlier. I do still like having the entire site statically generated; perhaps in the future I will explore my options for more versatile lists and searches.
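
Here is roughly what the shortener resolution from the link rot bullet looks like. This is a simplified sketch using the requests library; the caching and the dead-link heuristic are illustrative assumptions, not the exact logic of the actual scripts.

```python
import requests

resolved_cache = {}  # shortened URL -> final URL (None if resolution failed)

def resolve_short_url(url, timeout=10):
    """Follow redirects from a shortener (t.co, bit.ly, ...) to the original URL."""
    if url in resolved_cache:
        return resolved_cache[url]
    final = None
    try:
        # A HEAD request with redirects enabled is usually enough to
        # discover where the shortened link ultimately points.
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        final = resp.url
        # Heuristic: dead targets often redirect to a generic page,
        # like imgur's "removed" placeholder mentioned above.
        if "imgur.com/removed" in final:
            final = None
    except requests.RequestException:
        # Shortener gone entirely (e.g. ping.fm): the context is lost.
        pass
    resolved_cache[url] = final
    return final
```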

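And a simplified sketch of the duplicate detection: two imported items are treated as the same cross-posted entry if their normalized text matches and their timestamps are close enough. The normalization rules and the 24-hour window are illustrative choices, not the actual values my scripts use.

```python
import re
from datetime import timedelta

def normalize(text):
    """Reduce a post's text to a comparable form across platforms."""
    text = re.sub(r"https?://\S+", "", text)   # URLs differ per shortener; drop them
    text = re.sub(r"\W+", " ", text.lower())   # keep only word characters
    return text.strip()

def is_duplicate(item_a, item_b, window=timedelta(hours=24)):
    if normalize(item_a["text"]) != normalize(item_b["text"]):
        return False
    # Cross-posting delays and timezone quirks mean the dates rarely
    # match exactly, so allow a generous window around them.
    return abs(item_a["date"] - item_b["date"]) <= window
```
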
I’ll also discuss each import source a bit, and the particular challenges each one posed.

  • Quora: I had already been planning to stop using Quora, as I mentioned in a previous post. Since Quora has no concept of exporting a backup of your data, importing my Quora posts involved a little bit of JS console sorcery and a little bit of web scraping as well. I imported each answer I had as a post on this site. After this import, I expect to no longer be using Quora. I’ve also unsubscribed from all their summary emails. It’s too bad, as there’s often some interesting content there, but more and more it feels like Experts Exchange.
  • Goodreads: For a while I used Goodreads to track my reading and post book reviews. Although Goodreads did provide a data export, it was kind of barebones, and importing required a lot of manual work. Luckily, there weren’t that many entries to import. I imported all of the book reviews into this blog as posts; any further reading I can just talk about on the blog directly. I do not expect to be using Goodreads again.
  • Plurk: Plurk is an old social media site that I used around 2010-2011, mainly because a certain friend group of mine was very active there. Looks like everyone stopped using it around 2012-2013 though. Their data export was very good, and I was able to import all the plurk content (most of it crossposted from twitter via ping.fm), including the plurk comments. (Those comments were all public anyway.)
  • Tumblr: I’ve had the tumblr blog since 2008! I enjoyed using the Tumblr account mainly because there are a lot of memes and content I follow there, but I realized a while back I could just follow those via RSS. All my tumblr blog content has been imported either as notes, links or reposts. I don’t expect to be using tumblr much moving forward, although I still like it as a syndication endpoint; I may eventually re-enable syndicating my content to tumblr. I also have a second (currently inactive) tumblr blog focused on comics that I have not yet imported, as the content there is primarily comic book covers and excerpts. I may do so in the future.
  • Pocket: I use Pocket mainly for link sharing, so all the stuff from Pocket gets imported as links. I still intend to use Pocket moving forward, and there is an active script that imports new links I’ve shared.
  • Wordpress: My blog was previously running on Wordpress before migrating to Hugo, of course, so no importing of the original posts was necessary. I did, however, run a short-lived comic-book-focused Wordpress blog (a predecessor of the Tumblr) from 2012-2013 on the ireadcomicbooks.com domain (no longer active). I imported the posts and comments from that old blog into this one.
  • Youtube: I imported a selection of videos uploaded to my Youtube channel, mostly game videos and sketch timelapses. That account also had some videos uploaded by my niece, but those are her content, not mine, so they were not migrated into this blog. I expect to still occasionally use Youtube for video uploads (since it’s very convenient for general internet sharing), but only sporadically, so I don’t have any automated migration set up.
  • Flickr: I don’t use this much anymore, but I did have a few photos uploaded here, so they are now merged into this site.
  • roywantsmeat and royondjango: roywantsmeat was the blogger incarnation of this blog. It’s still up! I mostly did this import to get syndication links, since presumably all the posts from that blog had been imported to Wordpress and then to this blog. But that was not true! Apparently there were some missed posts, so some content gets to live again. Roy on Django was a second short-lived blogger blog I had, focused on Python/Django; the posts from there now contain syndication links back to blogger as well.
  • Twitter: Twitter is still my most actively used social media channel, so a whole bunch of tweets have been imported as notes, replies and reposts (the routing is sketched after this list). A lot of the past content was cross-posted from other sites though. I have a script that actively imports new tweets as notes, replies and reposts. I also had a second twitter account dedicated to mtg; tweets from that account have been imported as well. If you reply to or like a tweet that points to one of my blog posts, that reply or like is also automatically imported as a comment on that post. (This is not retroactive.)
  • Instagram: Instagram was challenging to import. This was actually the first import I attempted. There was a data export, but the problem was it did not provide any way to derive the original instagram URL for each entry. For a while I researched how the instagram slug was derived, thinking maybe it could be algorithmically determined. Eventually I gave up and looked for other means. Since there is currently no public API access, I had to use a scraper library to get the instagram URLs, then used some matching algorithm to combine them with the data export, and voila, I was able to import all the entries as photos. I still plan on using Instagram moving forward (primarily for sharing sketches), but currently there is no viable way to automatically import newer entries. I tried using OwnYourGram for a while, but that ran into some trouble after some Instagram changes broke their scraping algorithm, so we’re stuck.
  • Facebook: Ah, this was the most challenging import of all. The challenges were: (a) I didn’t want all of my FB content to be publicly available on this blog, so a lot of manual vetting/review had to be done; (b) Facebook’s data export (in both JSON and HTML formats) did not make it easy to get the URL for any given post, so I had to do some scraping to get the correct URLs; (c) the data export wasn’t even very complete; among other things it didn’t include visibility settings (which would have helped with the manual review) or the application used to post (which would have helped with deduplication); and (d) the data model wasn’t straightforward, since a given URL could refer to a post, a photo album, a single photo, etc., and some items didn’t even have a single canonical URL; multiple URLs could point to the same thing. I actually ended up rewriting the import script completely (you can see in github I have an fb2.py) after I’d already imported some posts. While I will probably still be using FB moving forward (mostly for sharing family photos), I do not intend to import any further Facebook posts into this blog. (Any content which could be public will be cross-posted elsewhere instead.)
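
As for how imported tweets get routed into the sections described earlier, the rule is roughly the following. The field names match Twitter’s API v1.1 JSON; the routing itself is a hedged sketch rather than the exact code in my import script.

```python
def classify_tweet(tweet):
    """Route a tweet (as an API/export JSON dict) to a site section."""
    if "retweeted_status" in tweet:
        return "reposts"   # retweets become reposts
    if tweet.get("in_reply_to_status_id_str"):
        return "replies"   # replies to other people's tweets
    return "notes"         # plain status updates become notes

# Hypothetical examples:
assert classify_tweet({"full_text": "hello world"}) == "notes"
assert classify_tweet({"in_reply_to_status_id_str": "123"}) == "replies"
```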

If you weren’t keeping track: moving forward, I generally still intend to regularly use Twitter, Instagram, and sometimes Youtube, and somehow syndicate that content back to this site. In some future update, I will probably attempt to reverse this; that is, change it so that I post on this site and the content automatically gets syndicated to Twitter/Instagram/etc. as appropriate. This is the Indieweb POSSE principle, which makes a lot of sense, assuming I am able to overcome the technological hurdles.

For now, the archive is available for your perusal.
