Sunday 27 August 2006

Backing up a gmail account

There are a lot of interesting and useful services on the internet these days, like gmail and blogger. But when I take advantage of them, I cannot help but feel pained by the fact that I am entering my content and information into something controlled by someone else. There is a certain satisfaction in having a copy of your own information, knowing that if the service goes away, or changes its terms of service, you still have everything you created to do with as you wish.

To ensure that I have a copy of the information in my gmail account, I have been keeping an eye out for a way to back it up. It was easy to find references to downloading the emails via POP, but that seemed a second-rate solution. How do I know it will back up the emails I sent? What about the labels I have added to threads? So, ideally, I wanted something that would download all of the original messages, the threads they belong to, the labels I have created and the labels assigned to each thread.

This last week, I found a little time to play around with libgmail. This is a Python library which seems to make the same requests that the gmail web interface does, in order to obtain the same information that the interface is given. It is a surprisingly small and simple library, but it was complete enough for me to write a Python script that does the depth of backing up I wanted. The script currently allows:

  • Fetching the raw emails for each message in my account (whether ones I have sent or received).
  • Noting which thread each message belongs to.
  • Noting which labels each thread has.
  • Noting all the labels I have created.
  • Using the previously backed up data to only download the changes, which is important because gmail locks down accounts in which it detects "unusual usage".
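
The last point, only downloading what has changed, can be sketched roughly as follows. This is not the actual script, just a minimal illustration of the idea. The names `incremental_backup`, `list_remote_ids` and `fetch_raw` are made up for the example and are not libgmail functions, and a plain dictionary stands in for the gmail account.

```python
def incremental_backup(saved, list_remote_ids, fetch_raw):
    """Fetch only the messages that are not already in `saved`.

    `saved` maps a message id to its raw source. `list_remote_ids` and
    `fetch_raw` are hypothetical callables standing in for whatever
    actually talks to the mail account.
    """
    new_ids = [msg_id for msg_id in list_remote_ids() if msg_id not in saved]
    for msg_id in new_ids:
        saved[msg_id] = fetch_raw(msg_id)
    return new_ids


# A dictionary pretending to be the remote account.
remote = {
    "m1": "Subject: hello\n\nfirst message",
    "m2": "Subject: re: hello\n\nsecond message",
}

# The previous backup already holds m1, so only m2 should be fetched.
saved = {"m1": remote["m1"]}
fetched = incremental_backup(saved, remote.keys, remote.__getitem__)
print(fetched)
```

Running it a second time fetches nothing, which is the property that keeps the number of requests made against the account low.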

The backed up data is stored in the Python pickle file format. This is not a problem at all, because writing a Python script to do pretty much anything with the data takes a minute at most, including going over all the emails looking for specific things, extracting an email, etc.
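
As an example of the kind of one-minute script I mean, here is a sketch that loads a pickled backup and looks for emails by subject. The layout of the pickled data used here (a dictionary with `messages`, `thread_of` and `labels_of` keys) is an assumption made up for the illustration, as is the `find_by_subject` helper. The real file written by the backup script may be organised differently.

```python
import email
import pickle
import tempfile
from pathlib import Path


def find_by_subject(backup, needle):
    """Return the ids of messages whose Subject header contains `needle`."""
    matches = []
    for msg_id, raw in backup["messages"].items():
        subject = email.message_from_string(raw)["Subject"] or ""
        if needle.lower() in subject.lower():
            matches.append(msg_id)
    return sorted(matches)


# Made-up sample data in the assumed layout.
sample = {
    "messages": {
        "m1": "From: a@example.com\nSubject: Backup plans\n\nbody one",
        "m2": "From: b@example.com\nSubject: Lunch\n\nbody two",
    },
    "thread_of": {"m1": "t1", "m2": "t2"},
    "labels_of": {"t1": ["projects"], "t2": []},
}

# Write the sample out and read it back, as a real backup file would be.
path = Path(tempfile.mkdtemp()) / "backup.pickle"
with open(path, "wb") as f:
    pickle.dump(sample, f)
with open(path, "rb") as f:
    backup = pickle.load(f)

hits = find_by_subject(backup, "backup")
print(hits)

# Extracting a single email is just a dictionary lookup.
print(backup["messages"][hits[0]])
```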

One of the things I was interested to see was that the size of all my emails came to about 80 megabytes, the same amount of space that gmail claimed my account was using. I had always doubted whether this figure was calculated fairly, and assumed it was either overestimated or included the space used for all the indexing information for my account. I was also surprised to see how many emails I had, although a lot came from mailing lists: 2018 threads in total, with around 11000 emails between them.
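
Totals like these are easy to work out from the backed up data. The sketch below shows one way it might be done, assuming the raw messages are held in a dictionary keyed by message id, with a second dictionary mapping each message to its thread. Both structures and the `backup_stats` name are invented for the example, and the sample data is tiny and made up.

```python
def backup_stats(messages, thread_of):
    """Count the messages, the distinct threads and the total raw size in bytes."""
    total_bytes = sum(len(raw.encode("utf-8")) for raw in messages.values())
    return {
        "messages": len(messages),
        "threads": len(set(thread_of.values())),
        "bytes": total_bytes,
    }


# Three tiny made-up messages spread over two threads.
messages = {"m1": "a" * 10, "m2": "b" * 20, "m3": "c" * 5}
thread_of = {"m1": "t1", "m2": "t1", "m3": "t2"}

stats = backup_stats(messages, thread_of)
print(stats)
```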

One drawback is that backing up my account marks the emails in my inbox as having been read. For all the features gmail has to make reading threaded conversations easy, losing track of what is unread makes it difficult to follow a thread. So the script needs to be used after you have read all your emails.

Anyway, it is a load off my mind to have this backup tool written. I have also offered it as a demo script to the contact address for libgmail. Since I have blogged about it though, I have decided to add it to a google code hosted project. You can find it here:

http://code.google.com/p/gmail-backup/