Why the pause between posts?
This post comes after a long pause. The pause was not due to a lack of interest, but mostly because IDRACK has been busy lately. Recent activities included a nearly 4-week-long crash course in Bioinformatics (two 90-minute sessions per week), delivered through Zoom and attended by ~80 students from various universities in Pakistan, and another 4-week-long Python training camp (two 90-minute sessions per week), also delivered through Zoom.
rsync, short for remote sync, is a Linux command-line tool that can be used to sync data between two locations, be it locally between two directories on your system or between a remote location and your machine.
The advantages of rsync in certain situations are:
- Interrupted downloads can be resumed.
- Sometimes one wants to make a local copy of a database that is routinely updated. Using rsync one can create a local copy of the database once, and subsequently rsync can copy over only the difference between the updated database and the local copy, hence saving a tonne of time.
To illustrate point 2 above with a more tangible example, think of a database having 100 entries. You come across it and like the database. You can download all 100 entries. Later, you come across it again and see 10 more entries have been added. Now, if you want to update your copy of the database, you can delete the old copy and download the new one, with 110 entries. Or you can just (manually) download the 10 new entries which you know are different. However, this will only work if the updates are few and far between. Mostly, databases have tonnes of data and the differences between consecutive releases might be significant. This is where rsync can come in handy.
Please be mindful that rsync’s ability to be helpful depends on what you are downloading and how the particular database is structured.
Today we will be talking about the protein structure database (rcsb.org). The way data is structured on this database makes rsync helpful for download purposes.
RCSB PDB data includes protein structure files which have four-character IDs, e.g. 1hv4, 1gcv etc. As of today there are 165,957 PDB entries in the RCSB database. While there are several different kinds of records available for each of the 165,957 entries, in this discussion we will limit ourselves to the PDB structure data (ATOM/HETATM records). Every update changes this number, so you would do well to download just the difference and not the whole database again. This is where rsync will come in handy.
Before we use it, it is important to understand something first. To use rsync in such a way that you can easily update without downloading the whole thing again, always work in the same location. The files available for download from RCSB are organized in 1060 directories. Each directory has a two-character name which is linked to the PDB entries it contains. The link can be understood as follows.
- The directories holding the RCSB data have names going from 00 to zz. Each of the two characters in the name (e.g., 0 and 0 in 00) goes from 0 to 9 and then from a to z.
- These two characters correspond to the two middle characters of the PDB ID. For example, the two middle characters of 1hv4 are hv.
- The PDB 1hv4 therefore is located in the directory hv (in a compressed format).
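The naming rule in the list above can be sketched in a few lines of Python. This is only an illustration: the function names and the mirror_root argument are made up for this example, and the file name pattern pdb<id>.ent.gz is an assumption about how the compressed files are named in the "divided" layout. Note that the rule allows 36 x 36 possible names, so not every possible name has to exist on the server.

```python
import string
from pathlib import Path

# Characters allowed in a directory name: 0-9 followed by a-z.
ALPHABET = string.digits + string.ascii_lowercase


def possible_directories():
    """Return every two-character name from 00 to zz (36 * 36 names)."""
    return [a + b for a in ALPHABET for b in ALPHABET]


def directory_for(pdb_id):
    """Return the directory name for a PDB ID: its two middle characters."""
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4:
        raise ValueError(f"expected a four-character PDB ID, got {pdb_id!r}")
    return pdb_id[1:3]


def local_path(pdb_id, mirror_root):
    """Return where the compressed file for pdb_id should sit in a mirror.

    Assumption: files are named pdb<id>.ent.gz inside each directory.
    """
    pdb_id = pdb_id.strip().lower()
    return Path(mirror_root) / directory_for(pdb_id) / f"pdb{pdb_id}.ent.gz"
```

For example, directory_for("1hv4") gives hv, and local_path("1hv4", "/data/pdb") points inside /data/pdb/hv.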
Why do I need to make a local copy?
The answer to this actually depends on what research work you are doing. My work usually deals with either looking at all the structures, or huge subsets of the publicly available ones. So in my line of work I cannot afford to download them every time. Hence I make a local copy and keep it up-to-date. I am making one right now on a new machine. While rsync downloads, I have some free time and I am using this time to write about it (rsync).
If your work is going to look at even 2-3 PDB files and you are going to have to do this a few times, I would recommend that you make a local copy and use that (OK, so perhaps for 2-3 PDB files you do not need to maintain a local copy, perhaps for 2000? Especially if the structures that you work with keep changing).
Downloading RCSB using rsync
Before I show how to download the whole database: you might only want to download a few PDB files and can tolerate doing that slowly. In that case you can automate your workflow by fetching the files using “wget”.
For instance if you just want to download a PDB file, say 1hv4.pdb, you can use the following syntax on the command line
So if you have a handful of files, you can automate this using Python with a for loop and os.system(…) or subprocess (or using your preferred method in your preferred language).
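A minimal sketch of such a loop, using subprocess to call wget once per ID, could look like this. The URL pattern in URL_TEMPLATE is an assumption on my part, so check it against the RCSB download instructions before relying on it. The function names and the runner argument (there only so the loop can be tried without touching the network) are made up for this example.

```python
import subprocess
from pathlib import Path

# Assumption: single entries can be fetched from this URL pattern.
URL_TEMPLATE = "https://files.rcsb.org/download/{pdb_id}.pdb"


def wget_command(pdb_id, dest_dir="."):
    """Build the wget argument list for one PDB ID."""
    url = URL_TEMPLATE.format(pdb_id=pdb_id.strip())
    # -P tells wget which directory to save the file into.
    return ["wget", "-P", str(dest_dir), url]


def download_all(pdb_ids, dest_dir=".", runner=subprocess.run):
    """Run wget once per ID and return the IDs whose download failed."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdb_id in pdb_ids:
        result = runner(wget_command(pdb_id, dest_dir))
        # A non-zero exit code means wget reported a problem.
        if result.returncode != 0:
            failed.append(pdb_id)
    return failed
```

Calling download_all(["1hv4", "1gcv"], "my_pdbs") would then fetch both files into my_pdbs and return an empty list if both downloads succeed.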
Now onwards to rsync. The database provides a script in which several formats are available for download. In this post, we are only interested in downloading the PDB files. The script is available here. I have extracted the relevant fields from it and summarized it as a one liner (shown below).
rsync -rlpt -v -z --delete --port=33444 rsync.wwpdb.org::ftp/data/structures/divided/pdb/ ./ > ./log 2>/dev/null
The above is a single line of code and may appear wrapped. Make sure to copy/paste correctly.
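If you prefer to launch the update from Python, the one-liner can be wrapped roughly as follows. This is a sketch, not part of the official script. The names rsync_command, sync, mirror_dir and log_path are made up for this example. Passing the command as a list of arguments also avoids the copy/paste wrapping problem. Because --delete removes files in the destination that do not exist on the server, the sketch takes the log file location as a separate argument so it can be kept outside the mirror directory.

```python
import subprocess
from pathlib import Path

# Server address and port taken from the one-liner above.
SERVER = "rsync.wwpdb.org::ftp/data/structures/divided/pdb/"
PORT = 33444


def rsync_command(mirror_dir):
    """Build the same rsync call as the one-liner, as an argument list."""
    return [
        "rsync", "-rlpt", "-v", "-z", "--delete",
        f"--port={PORT}",
        SERVER,
        f"{Path(mirror_dir)}/",
    ]


def sync(mirror_dir, log_path):
    """Run rsync, write its output to log_path, and return the exit code."""
    Path(mirror_dir).mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log:
        result = subprocess.run(
            rsync_command(mirror_dir),
            stdout=log,                 # same as "> log" in the shell
            stderr=subprocess.DEVNULL,  # same as "2>/dev/null"
        )
    return result.returncode
```

A call such as sync("pdb_mirror", "pdb_rsync.log") run from the same parent directory every time would then play the role of the shell command.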
The script linked above has a few locations around the world from where (the same) data can be downloaded. I chose rsync.wwpdb.org. If you want other locations, see the appropriate FTP address listed in this script.
Running the rsync command
So now that you know what the command is, choose a location on your local machine. Create a folder with a suitable name, one that will remind you of its contents the next time you look at it, and navigate to it. (By navigate I mean using your Linux/macOS shell to enter the newly created directory.) Once there, type or paste the single command shown above on the command line.
rsync will get to work. The first time you run this, it will take a significant amount of time. This is because, as mentioned earlier, the first time you run rsync everything will get downloaded. However, if you do it properly (as shown above), in subsequent attempts to keep your database up-to-date, the run time for rsync will be much smaller.
There are several tutorials on rsync available online. To summarize how this command works: the first time you run it, the directory on your machine is empty, so the whole directory on the server is replicated to this location. The next time you run the same command in the same location, rsync lists the contents of your directory (the subdirectories and their files) and matches them against those on the server. The difference between the two is worked out and only that difference is downloaded, which makes the update much faster.
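The idea of "only download the difference" can be illustrated with a toy model in Python. This is a simplified sketch and not how rsync is actually implemented. It only mimics rsync's default check, which treats a file as unchanged when its size and modification time match, together with the effect of --delete. The function name plan_sync and the sizes, timestamps and the IDs 1abc and 1zzz in the sample data are invented for illustration.

```python
def plan_sync(remote, local):
    """Decide what a sync has to do.

    remote and local map a relative file path to a (size, mtime) tuple.
    Returns (to_transfer, to_delete) as sorted lists of paths.
    """
    # Transfer files that are missing locally or whose size/mtime differ.
    to_transfer = sorted(
        path for path, stamp in remote.items()
        if local.get(path) != stamp
    )
    # With --delete, local files that no longer exist remotely are removed.
    to_delete = sorted(path for path in local if path not in remote)
    return to_transfer, to_delete


# Invented sample listings, for illustration only.
remote = {
    "hv/pdb1hv4.ent.gz": (120_000, 1_600_000_000),
    "gc/pdb1gcv.ent.gz": (95_000, 1_600_000_500),
    "ab/pdb1abc.ent.gz": (80_000, 1_600_001_000),
}
local = {
    "hv/pdb1hv4.ent.gz": (120_000, 1_600_000_000),  # unchanged
    "gc/pdb1gcv.ent.gz": (94_000, 1_500_000_000),   # outdated copy
    "zz/pdb1zzz.ent.gz": (70_000, 1_400_000_000),   # gone from the server
}

to_transfer, to_delete = plan_sync(remote, local)
print("transfer:", to_transfer)
print("delete:", to_delete)
```

With an empty local listing every remote file ends up in to_transfer, which corresponds to the slow first run. On later runs only the changed and new files are listed.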
For Windows users: I am not familiar with the flavor of rsync that comes pre-packaged with Windows (if any). A quick Google search should help.
I can’t say anything about other databases out there, but the RCSB PDB is well structured and keeping an updated copy of the database on your machine is quite easy, if you make use of the rsync command. It definitely helps me complete my work in a sane amount of time.
Hopefully everything here is clear. If not, write to IDRACK at contact[at]idrack.org. IDRACK is on Facebook and Twitter as well, with a YouTube channel in the works.
Lastly, don’t forget to share this with your friends and colleagues. They might need this more than you do.