John Terenz.io

Archiving My Archives: Removing Ancient Emails from Google

JT on 20171208

I've been a Gmail user since mid-2007 and have maintained healthy habits with Gmail by archiving, rather than deleting, most of my email. Recently I got to thinking about my personal privacy and security situation. I've focused a lot on hardening my services (i.e., using a password manager, 2-factor authentication, random security question answers, etc.) and on keeping my overall online presence nicely pruned, but I hadn't really considered the massive volume of email I was sitting on, going back over ten years.

While there would be many bad things about my Google account being compromised, the thought of some of my more ancient emails being in the mix made me decide that it was time to reduce my footprint here and move all emails from before 2010 out of my Google account and into an encrypted backup. I also decided that, going forward, it seems reasonable to keep only the last seven years of email warm, so this might become an annual tradition. Additionally, I benefited from freeing up some storage in my Google account (apparently I liked to email MP3s back in the day), and I went ahead and backed up more recent years as well just to have them. Here's how I did it...

At the end of 2013 Google released Takeout for Gmail. This tool allows you to export arbitrary Gmail labels in mbox format. I've used this tool before but not with this granularity. My strategy is to:

  1. Create a label encompassing all conversations for a given year
  2. Export that year in mbox format using Takeout
  3. Encrypt and back up that archive
  4. Delete all emails with that label from Gmail
  5. Repeat with following year

Creating the label

This is done easily using the before and after search operators in Gmail. To isolate all conversations from 2011, for example, we can use the query before:2012-01-01 after:2011-01-01 and then select all results from the search.

Search query

Next click the "label" button and create a new label for that year, applying it to all the conversations.

Create label

Apply label

A few notes before we head over to Takeout. Because Gmail groups messages into conversations, it is possible for a conversation to span a year boundary. In this case, a conversation might end up with more than one year label, and therefore when you export the mbox file for each label you will have duplicates. You can either accept this or, when you build your labels, filter out conversations already labeled in previous years (thus isolating them to the label of the year of the first message). An example could be: before:2012-01-01 after:2011-01-01 -label:2010 -label:2009 -label:2008 -label:2007. For my purposes I have decided to accept duplicates. And it goes without saying that you could just back up all conversations before a certain date in one file.
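If you are repeating this for many years, typing the queries by hand gets tedious. Here is a minimal sketch that builds the search strings described above, assuming your year labels are simply named after the year (as in the examples). The year_query helper and its exclude_labels parameter are names made up for illustration; the output is meant to be pasted into the Gmail search box.

```ruby
require 'date'

# Builds a Gmail search query selecting conversations from one calendar year.
# exclude_labels is an optional list of label names (here: earlier year
# labels) to filter out, so a conversation is only labeled once.
def year_query(year, exclude_labels: [])
  start_date = Date.new(year, 1, 1)
  end_date = Date.new(year + 1, 1, 1)
  parts = [
    "before:#{end_date.strftime('%Y-%m-%d')}",
    "after:#{start_date.strftime('%Y-%m-%d')}"
  ]
  parts.concat(exclude_labels.map { |label| "-label:#{label}" })
  parts.join(' ')
end

# Plain query for 2011, then the de-duplicating variant.
puts year_query(2011)
puts year_query(2011, exclude_labels: [2010, 2009, 2008, 2007])
```

The second call reproduces the example query from the paragraph above.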

Exporting

In Takeout you are presented with a dizzying array of Google products you can pull your data from. Right now I'm only focused on email so I click the "select none" button at the top. Then I go to "Gmail", activate it, and use the "select labels" option to select only 2011.

Create archive

On the next page I opt for a .tgz archive (which can be as large as 50 GB) and for a download link to be emailed to me when it is ready.

Backing up

I could write a larger post about personal backups, but to cut to the chase, I use a combination of offline backups and Amazon S3. I encrypt sensitive information locally with OpenSSL before backing it up to S3. I wrote a thin wrapper in C on top of its EVP symmetric encryption API so that I can easily manage IVs and leverage UNIX pipes, but all the encryption I do is compatible with the OpenSSL enc utility (using AES encryption). Obviously key management is a challenge here, but that's also for another post.

I am going to download the Google Takeout archive, find the .mbox file inside, compress it, encrypt it, and upload to S3 in a single step using gzip, my encryption tool (called enc which takes plaintext via stdin and outputs ciphertext to stdout), and the AWS CLI. The commands are below...

Download archive

john:Downloads$ tar -xf takeout-20171208T131619Z-001.tgz
john:Downloads$ cd Takeout/Mail/
john:Mail$ ls
total 851968
-rw-r--r--@ 1 john staff 407M Dec 8 16:09 2011.mbox
john:Mail$ cat 2011.mbox | gzip | enc -e | aws s3 cp - s3://mybucket/backups/2011.mbox.gz.enc
john:Mail$

One really nice thing about the AWS CLI is that you can actually upload data to S3 right from a UNIX pipe. In the second to last line I'm re-compressing, encrypting, and uploading the mbox file to my bucket for safekeeping.
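My enc tool isn't published here, so for anyone who wants to see the compress-then-encrypt idea in code, here is a rough sketch using Ruby's standard zlib and openssl libraries. It is not my C wrapper and it does not produce output compatible with the OpenSSL enc utility: the layout (a random 16-byte IV followed by AES-256-CBC ciphertext), the function names, and the raw 32-byte key are all assumptions made for illustration. It also reads everything into memory and provides no integrity check, so treat it as a demonstration rather than a backup tool.

```ruby
require 'openssl'
require 'zlib'

IV_LENGTH = 16 # AES block size in bytes

# Gzips the plaintext, encrypts it with AES-256-CBC under a 32-byte key,
# and returns the random IV followed by the ciphertext.
def compress_and_encrypt(plaintext, key)
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv # generates a fresh IV and sets it on the cipher
  iv + cipher.update(Zlib.gzip(plaintext)) + cipher.final
end

# Reverses compress_and_encrypt: splits off the IV, decrypts, then gunzips.
def decrypt_and_decompress(blob, key)
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.decrypt
  cipher.key = key
  cipher.iv = blob.byteslice(0, IV_LENGTH)
  ciphertext = blob.byteslice(IV_LENGTH, blob.bytesize - IV_LENGTH)
  Zlib.gunzip(cipher.update(ciphertext) + cipher.final)
end

# Round trip with a throwaway random key.
key = OpenSSL::Random.random_bytes(32)
blob = compress_and_encrypt("Subject: hello\n\nbody\n", key)
puts decrypt_and_decompress(blob, key)
```

Because the IV is stored at the front of the output, only the key needs to be kept secret and safe to restore the archive later.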

Deleting

This part isn't too hard, as you can easily select all the Gmail conversations under a label and move them to the trash (and empty the trash). The only note, again, is about conversations that overlap years. I decided that I would keep all messages in a conversation up to the most recent year. That means that even though I am deleting most emails from before 2010, there might be conversations that started in 2009 and ended in 2010 or even later. The query I used to delete by label for 2009 therefore excludes the labels of following years like this: label:2009 -label:2010 -label:2011 ... (you can stop once you see no more results from after 12/31/2009).
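The deletion query follows a simple pattern, so it can be generated rather than typed. A small sketch, again assuming labels are named after their year; delete_query and last_year are hypothetical names, and last_year is just the newest year label you have created.

```ruby
# Builds the Gmail query for deleting one year's label while sparing
# conversations that also carry the label of any later year.
def delete_query(year, last_year)
  later = ((year + 1)..last_year).map { |y| "-label:#{y}" }
  (["label:#{year}"] + later).join(' ')
end

puts delete_query(2009, 2011)
```

For 2009 with labels up to 2011 this produces the same query as shown above: label:2009 -label:2010 -label:2011.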

Searching and reading backups

What does one do with the mbox files now if one wants to search for something? Luckily this format is very simple and easy to deal with. Essentially, the format just concatenates all the individual email messages (each formatted according to RFC 2822) into a single file, with a special separator line between messages. That means you can simply use less, grep, vim, or whatever your favorite UNIX tool is to navigate and read the file (most messages should have a plain text version along with the HTML version). You can also use a desktop email client (think Apple Mail or Thunderbird on OS X) and import the files there. I even wrote a little Ruby script to split the mbox file into individual .eml files in the same directory, which can be QuickLook'ed in OS X.
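To make the "search for something" part concrete, here is a hedged sketch that walks an mbox stream message by message and lists the subjects of messages matching a pattern. It assumes the same "From <digits>@xxx" separator lines that Gmail's export uses in my files; each_message and subjects_matching are illustrative names. It does not unfold multi-line Subject headers or decode MIME-encoded subjects, so it is a starting point, not a full mail parser.

```ruby
require 'stringio'

# Yields each message in the mbox stream as one string, without the
# "From <digits>@xxx ..." separator line that precedes it.
def each_message(io)
  current = nil
  io.each_line do |raw_line|
    line = raw_line.scrub # replace invalid bytes so regex matching cannot raise
    if line.match?(/^From\s\d+@xxx\s/)
      yield current if current
      current = +''
    elsif current
      current << line
    end
  end
  yield current if current
end

# Returns the Subject header of every message whose raw text matches pattern.
def subjects_matching(io, pattern)
  results = []
  each_message(io) do |message|
    next unless message.match?(pattern)
    # Headers end at the first blank line.
    headers = message.split(/\r?\n\r?\n/, 2).first.to_s
    subject = headers[/^Subject:[ \t]*(.*)$/i, 1]
    results << (subject ? subject.strip : '(no subject)')
  end
  results
end

# Example with an in-memory mbox; for a real file, pass File.open('2009.mbox').
sample = "From 1@xxx Thu Jan 01 2009\nSubject: Lunch\n\nSee you at noon\n" \
         "From 2@xxx Fri Jan 02 2009\nSubject: Invoice\n\nPlease pay\n"
puts subjects_matching(StringIO.new(sample), /noon/)
```

Reading line by line keeps memory use low even for a multi-hundred-megabyte export, since only one message is held at a time.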

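# script.rb
# Splits a Gmail mbox export into numbered .eml files in the current directory.
filename = '2009.mbox'
count = 0
email_file = nil

File.open(filename, 'r') do |file|
  while (line = file.gets)
    # Each message starts with a separator line of the form "From <digits>@xxx <date>".
    if line.match(/^From\s\d+@xxx\s(.+)/)
      email_file.close if email_file
      count += 1
      email_file = File.open("#{count.to_s.rjust(6, '0')}.eml", 'w')
    elsif email_file
      # Skip anything before the first separator instead of crashing on a nil file.
      email_file.puts(line)
    end
  end
end

# Close the file holding the final message.
email_file.close if email_file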

Split .mbox file

Lastly, if there are attachments in the mbox file, they are encoded in base64. There are some published scripts (Google "mbox extract attachment") which extract them automatically, or you can isolate the base64 text and use the base64 command to decode it. For a .jpg file that might look like this (assuming you isolated the base64 payload into the file "test.jpg.64"):

base64 --decode test.jpg.64 > test.jpg
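If you would rather stay in Ruby (for instance, to extend the splitting script), the same decoding step can be sketched with String#unpack1, which is part of core Ruby. The decode_base64_file helper is a made-up name, and like the command above it assumes the file contains only the isolated base64 payload.

```ruby
# Decodes a file containing only a base64 payload and writes the raw bytes
# to dest_path. The 'm' directive tolerates the line breaks that mail
# clients insert into encoded attachments.
def decode_base64_file(source_path, dest_path)
  encoded = File.read(source_path)
  File.binwrite(dest_path, encoded.unpack1('m'))
end

# Usage, mirroring the shell command above:
# decode_base64_file('test.jpg.64', 'test.jpg')
```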

Conclusion

Encrypting, deleting, or moving parts of your online footprint offline is good for increasing security and decreasing the impact of a worst-case scenario. Google is also great in that it provides tools to export your data out of the Google ecosystem in standard, open formats. I hope you found this interesting and useful. Please comment on HN.