Friday, August 31, 2007

rsync speedup -- only 2.49

I had to transfer a 2.4GB svn working copy over the internet, and I wanted to save as much bandwidth as possible. A plain rsync of the working copy wouldn't have saved me enough bandwidth, since the duplicated data (not just blocks inside files, but entire duplicated files) is spread across different files.

I decided to try to let rsync's duplicate detection work across those files by creating a tar (not tar-gz) of the working copy. I hypothesized that there would be at least a 2x bandwidth saving, because the pristine copies (in .svn/text-base) would be exactly the same as the working files except where the working files were modified, and even then typically only a few lines are modified, out of perhaps 500-1500.

There should be even more savings, because the working copy is so large mainly due to a few branches (experimental working directories tracked via svnmerge) and 5 or so tags left lying around. (Older tags are removed and documented in a readme so they can be resurrected by referring to the revision number. We don't keep everything around, since we release very often and the tags are very large.)

Because of the tags, I expected a speedup of 10-20. I only got 2.49, though, which is good but close to an order of magnitude away from what I expected. I'll play with this some more. I just took another quick look at the rsync technical report, and it doesn't seem to invalidate my hypothesis:
"α searches through A to find all blocks of length S bytes (at any offset, not just multiples of S) that have the same weak and strong checksum as one of the blocks of B."
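
To make that step concrete, here is a minimal Python sketch of the matching described in the technical report. It is not rsync's actual code: the function names are made up for illustration, `new` and `old` stand in for A and B, and MD5 stands in for the strong checksum. The point is that the weak checksum can be rolled forward one byte at a time, so a duplicated block should be found at any offset inside one big tar.

```python
import hashlib

MOD = 1 << 16  # modulus used for both halves of the weak checksum


def weak_checksum(block):
    """Weak checksum of a block, returned as its two components (a, b)."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, size):
    """Slide the window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - size * out_byte + a) % MOD
    return a, b


def block_signatures(old, size):
    """Side holding the old data: checksums of its non-overlapping blocks."""
    sigs = {}
    for start in range(0, len(old) - size + 1, size):
        block = old[start:start + size]
        # MD5 is an assumption here, standing in for the strong checksum.
        entry = (hashlib.md5(block).digest(), start // size)
        sigs.setdefault(weak_checksum(block), []).append(entry)
    return sigs


def find_matches(new, sigs, size):
    """Side holding the new data: [(offset_in_new, block_index_in_old), ...]."""
    matches = []
    if len(new) < size:
        return matches
    pos = 0
    a, b = weak_checksum(new[:size])
    while True:
        hit = None
        # Only compute the strong checksum when the weak one matches.
        for strong, index in sigs.get((a, b), ()):
            if hashlib.md5(new[pos:pos + size]).digest() == strong:
                hit = index
                break
        if hit is not None:
            matches.append((pos, hit))
            pos += size  # skip past the matched block
            if pos + size > len(new):
                break
            a, b = weak_checksum(new[pos:pos + size])
        else:
            if pos + size >= len(new):
                break
            a, b = roll(a, b, new[pos], new[pos + size], size)
            pos += 1
    return matches


if __name__ == "__main__":
    old = b"AAAABBBBCCCC"
    new = b"xxBBBByAAAA"
    print(find_matches(new, block_signatures(old, 4), 4))
```

Note that the matching is against blocks of the old file on the other side, so how much a tar helps depends on what that old file already contains.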

Tweaking the block size might help. The block size is set according to the size of the file, and I expect it to be directly proportional to the file size (probably to minimize the amount of checksum data to transmit). Setting it smaller will mean more checksum data to send, but might find more matches. I expect to waste quite a lot of time and bandwidth on this :-).
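
A rough way to see the checksum side of that trade-off, as a Python sketch. It assumes 20 bytes of checksum per block (a 4-byte weak checksum plus a 16-byte strong one); real rsync may send fewer bytes per block, and the function name and the block sizes tried are just for illustration. rsync does have a --block-size option for overriding its own choice.

```python
# Assumption: 4-byte weak checksum + 16-byte strong checksum per block.
SIG_BYTES = 20


def checksum_overhead(file_size, block_size):
    """Bytes of checksum data needed to describe one file."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * SIG_BYTES


if __name__ == "__main__":
    tar_size = 2400 * 1000 * 1000  # roughly the 2.4GB tar
    for block_size in (700, 4096, 65536):
        overhead = checksum_overhead(tar_size, block_size)
        print(block_size, overhead, "%.2f%%" % (100.0 * overhead / tar_size))
```

This only counts the checksums. Whether smaller blocks find enough extra matches to pay for them is exactly what I'd have to measure.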
