Using rsync to make a local copy of RCSB PDB

By | July 3, 2020

Why the pause between posts?

This post comes after a long pause. The pause wasn't for lack of interest; IDRACK has simply been busy lately. Recent activities included a nearly 4-week crash course in Bioinformatics (two 90-minute sessions per week), delivered through Zoom and attended by ~80 students from various universities in Pakistan, and another 4-week Python training camp (two 90-minute sessions per week), also delivered through Zoom.

rsync

rsync is a command line tool for syncing data between two locations, whether two directories on your own system or a remote location and your machine.

In certain situations rsync has two advantages:

  1. Broken downloads can be resumed (if you can call it that): when you rerun the command, files that were already transferred completely are not fetched again.
  2. When working with databases that are updated routinely, you download everything once. After each update, rsync (if used properly) fetches only the difference between the two releases instead of the whole database.

To illustrate this with a more tangible example, think of a database with 100 entries. You come across it, like it, and download all 100 entries. Later you come back and see that 10 more entries have been added. To update your copy, you can delete the old copy and download the new one with 110 entries, or you can download just the 10 new entries. Fetching the new entries by hand works while updates are few and far between. Most databases, however, hold tonnes of data, and the differences between releases can be significant. This is where rsync comes in handy.

Please be mindful that how helpful rsync is depends on what you are downloading and how the particular database is structured.

RCSB PDB

Today we will be talking about the protein structure database (rcsb.org). The way data is available for download from this database makes rsync helpful.

RCSB PDB data consists of protein structure files with four-character IDs, e.g. 1hv4, 1gcv. As of today there are 165957 PDB entries in the RCSB database, and each entry is a file. Every time this number changes with an update, you would do well to download just the difference and not the whole database again. This is where rsync comes in handy.

Before we use it, it is important to understand one thing: to update easily without downloading everything again, always run rsync in the same location. The files available for download from RCSB are organized in 1060 directories. Each directory has a two-character name that is linked to the PDB entries it contains. The link can be understood as follows.

  1. The directories holding the RCSB data have names going from 00 to zz. Each of the two characters in the name goes from 0 to 9 and then from a to z.
  2. These two characters correspond to the two middle characters of the PDB ID. For example, the two middle characters of 1hv4 are hv.
  3. The PDB 1hv4 therefore is located in the directory hv (in a compressed format).
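
The naming rule in the list above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and the file name pattern pdb<id>.ent.gz is an assumption about how the compressed files are named on the server, so check it against your own copy.

```python
import string
from pathlib import PurePosixPath

# Characters used in directory names: 0-9 first, then a-z.
ALPHABET = string.digits + string.ascii_lowercase


def possible_directory_names():
    """Return every two-character name from '00' to 'zz'.

    These are the names that are possible under the rule; the server
    only holds the ones that actually contain entries.
    """
    return [first + second for first in ALPHABET for second in ALPHABET]


def divided_path(pdb_id):
    """Return the relative path where a PDB entry is expected to live.

    The directory is the two middle characters of the four-character ID.
    The file name pattern 'pdb<id>.ent.gz' is an assumption.
    """
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    middle = pdb_id[1:3]
    return PurePosixPath(middle) / f"pdb{pdb_id}.ent.gz"
```

For example, divided_path("1hv4") points into the directory hv, which is where the list above says 1hv4 is stored.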

Why do I need to make a local copy?

The answer to this actually depends on what you are doing. My work usually deals with either looking at all the structures or at huge subsets of the publicly available ones, so I cannot afford to download them every time. Hence I make a local copy and keep it updated. I am making one right now on a new machine, which gave me some free time, and I am using that time to write about it.

Even if your work only looks at 2-3 PDB files, if you are going to do this a few times I would recommend that you make a local copy and use that.

Downloading RCSB using rsync

Before I show how to download the whole database: you might only want a few PDB files and can tolerate fetching them slowly. In that case you can automate your workflow by fetching the files with “wget”.

For instance, if you just want to download one PDB file, say 1hv4.pdb, you can use the following syntax on the command line:

wget https://files.rcsb.org/view/1HV4.pdb

So if you have a handful of files, you can automate this using Python with a for loop and os.system(…) or subprocess (or using your preferred method in your preferred language).
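
A minimal sketch of that for loop with subprocess might look like the following. It assumes wget is installed and on your PATH; the function names are made up for this example, and the URL pattern is simply the one from the wget command above.

```python
import subprocess

# URL pattern taken from the wget example above.
BASE_URL = "https://files.rcsb.org/view"


def wget_command(pdb_id):
    """Build the wget command (as an argument list) for one PDB ID."""
    return ["wget", f"{BASE_URL}/{pdb_id.upper()}.pdb"]


def download_all(pdb_ids):
    """Run wget once per ID and return the IDs whose download failed."""
    failed = []
    for pdb_id in pdb_ids:
        # Passing a list (no shell=True) avoids shell quoting problems.
        result = subprocess.run(wget_command(pdb_id))
        if result.returncode != 0:
            failed.append(pdb_id)
    return failed
```

Calling download_all(["1hv4", "1gcv"]) would fetch both files into the current directory and return an empty list if both downloads succeeded.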

Now onwards to rsync. The database provides a script that covers downloads in several formats. In this post, we are only interested in downloading the PDB files. The script is available here. I have extracted the relevant fields from it and summarized them as the one-liner shown below.

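rsync -rlpt -v -z --delete --port=33444 rsync.wwpdb.org::ftp/data/structures/divided/pdb/ ./ > ../pdb_rsync.log 2>/dev/null

The above is a single line of code and may appear wrapped. Make sure to copy/paste correctly. The log is written one level up (../pdb_rsync.log) because --delete removes any file in the destination directory that does not exist on the server, and a log file kept inside the mirror would count as one.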
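
If you prefer to launch the one-liner from Python, a sketch along these lines spells out what each flag does and adds an optional trial run. The function names and the log handling are my own choices for illustration, not part of the RCSB script; --dry-run is a standard rsync option that lists what would be transferred without changing anything.

```python
import subprocess

# Server address and port taken from the one-liner above.
SERVER = "rsync.wwpdb.org::ftp/data/structures/divided/pdb/"
PORT = 33444


def rsync_command(destination="./", dry_run=False):
    """Build the rsync command from the post as an argument list."""
    cmd = [
        "rsync",
        "-rlpt",      # recurse, keep symlinks, permissions and modification times
        "-v",         # verbose: list the files being transferred
        "-z",         # compress data during transfer
        "--delete",   # remove local files that no longer exist on the server
        f"--port={PORT}",
    ]
    if dry_run:
        cmd.append("--dry-run")  # only report what would change
    cmd += [SERVER, destination]
    return cmd


def run_mirror(destination, log_path, dry_run=False):
    """Run rsync, writing its output to log_path, and return the exit code.

    Keep log_path outside the destination directory, because --delete
    removes local files that are not present on the server.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(
            rsync_command(destination, dry_run),
            stdout=log,
            stderr=subprocess.DEVNULL,
        )
    return result.returncode
```

A first call with dry_run=True is a cheap way to check that the command points at the directory you intended before the real transfer starts.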

The script linked above lists a few locations around the world from which the same data can be downloaded. I chose rsync.wwpdb.org; if you want another location, see the appropriate FTP address listed in the script.

Running the rsync command

So now that you know what the command is, choose a location on your local machine. Create a folder with a suitable name that will remind you of its contents the next time you look at it, and navigate to it. (By navigate I mean enter the newly created directory using your Linux/OSX shell.) Once there, type or paste the single line shown above on the command line.

rsync will get to work. The first time you run this it will take a significant amount of time because, as mentioned earlier, everything gets downloaded. When you run it again from the same location, it will only download the part that is different.

There are several tutorials available online on rsync, so have a look at some. To summarize how this command works: the first time you run it, the directory on your machine is empty, so the whole FTP directory is replicated to this location. The next time you run the same command in the same location, rsync builds a list of your local directories and their contents and matches it against the one on the server. The difference between the two is worked out and only that difference is downloaded, making it faster.
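
The idea of downloading only the difference can be shown with a toy comparison in Python. This is a simplified illustration and not rsync's actual implementation: it assumes each side is described by a dictionary mapping a file path to its (size, modification time), which is roughly the information rsync looks at by default when deciding whether a file needs transferring.

```python
def plan_sync(remote, local):
    """Compare two {path: (size, mtime)} listings.

    Returns (to_download, to_delete):
      to_download - paths missing locally, or whose size/mtime differ
      to_delete   - local paths that no longer exist on the server
                    (what --delete would remove)
    """
    to_download = sorted(
        path for path, meta in remote.items() if local.get(path) != meta
    )
    to_delete = sorted(path for path in local if path not in remote)
    return to_download, to_delete
```

With a local listing that already matches the server for most files, to_download contains only the new or changed entries, which is why the second run of the rsync command finishes so much sooner than the first.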

For Windows users: I am not familiar with the flavor of rsync that comes pre-packed with Windows (if any does at all). A quick search online should help.

Conclusion

I can’t say anything about other databases out there, but the RCSB PDB is well structured and keeping an updated copy of the database on your machine is quite easy if you make use of the rsync command. It speeds up my workflow.

Hopefully everything here is clear; if not, write to IDRACK at contact@idrack.org. IDRACK is on Facebook and Twitter as well, with a YouTube channel on the way. Lastly, don’t forget to share this with your friends and colleagues. They might need this more than you do.
