Our Datacenter Technology

Technology

Jersey City on the Hudson River

The secret to how offsite backups work.

It’s based on a simple principle. But first, let’s explain what makes offsite backup difficult. In a word, it’s bandwidth. Bandwidth is the capacity of an internet connection to transfer data. Think of bandwidth as the size of the hose you use to fill your pool. With a small garden hose, filling the pool takes days; with a fire hose, it takes hours.

Your Internet connection is like the hose, and your files are the water. The narrower the available bandwidth, the longer it takes to send your data. If your bandwidth is limited, you may not be able to get your backups offsite within your nightly backup window, which puts you behind. If it’s too narrow, you’d perpetually fall behind and never catch up. To prevent this, you need to reduce the amount of data transmitted each night when you send your backups offsite.
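To put rough numbers on the hose analogy, here is a minimal sketch that estimates how long an upload takes and whether it fits in a nightly window. The function names, the decimal units (1 GB = 8,000 megabits), and the sample figures are assumptions chosen for illustration; real-world throughput is usually lower than the rated uplink speed.

```python
def transfer_hours(data_gb: float, uplink_mbps: float) -> float:
    """Hours needed to upload data_gb gigabytes at uplink_mbps megabits per second."""
    megabits = data_gb * 8 * 1000      # assumption: decimal units, 1 GB = 8,000 megabits
    seconds = megabits / uplink_mbps
    return seconds / 3600


def fits_window(data_gb: float, uplink_mbps: float, window_hours: float) -> bool:
    """True if the upload finishes inside the nightly backup window."""
    return transfer_hours(data_gb, uplink_mbps) <= window_hours


if __name__ == "__main__":
    # Hypothetical figures: a 10 Mbps uplink and an 8-hour nightly window.
    for size_gb in (50, 2):
        hours = transfer_hours(size_gb, 10)
        print(f"{size_gb} GB: {hours:.1f} h, fits window: {fits_window(size_gb, 10, 8)}")
```

Under these assumptions, a 50 GB backup needs roughly 11 hours and misses an 8-hour window, while 2 GB of changes finishes in under half an hour.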

The simple solution is to send only the changes. For years, this concept of sending just the changes has defined most backups. The traditional trick is to scan your files and compare them against last night’s backup. Any new or changed files are included in tonight’s backup.
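The scan-and-compare idea can be sketched in a few lines. This is only an illustration, assuming each night's state is recorded as a manifest that maps a file's relative path to a SHA-256 hash of its contents; `build_manifest` and `files_to_send` are hypothetical names, and real backup products may compare timestamps or sizes instead of hashing.

```python
import hashlib
import os


def build_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 digest of its contents."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            manifest[os.path.relpath(path, root)] = digest
    return manifest


def files_to_send(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return files that are new or whose contents changed since the previous manifest."""
    return sorted(path for path, digest in current.items()
                  if previous.get(path) != digest)
```

Note that this approach decides per file: if one byte changes, the whole file goes into tonight's backup.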

This file-change technique worked fine when backups were sent to a local tape drive or an external hard drive. But sending file changes over the Internet would cause a backlog if the files were large, because the entire file had to be shipped even if only a tiny portion of it had changed.

MetLife Stadium in East Rutherford, NJ

Enter the new technique, called block-level changes. It is best described with a simple example. Say you and I each had an identical copy of a 100-page manuscript, and you changed page 10 of your copy. Using the old technique, I’d have to photocopy all 100 pages to have an identical copy again. But what if, instead, I just photocopied your page 10 and swapped it in for mine? By transferring just one page, I’d have a complete backup of your document. That’s a block-level backup.

Block-level backups depend on both sides having a shared copy of all files. This first copy is referred to as the seed backup. It can be done over the Internet or, if the first backup is very large, through a shuttle service, which transfers the backup on an encrypted hard drive.

Once both sides are in sync, each nightly backup can scan for changes. If it detects any files have been changed, it can peer into the files and, using some clever algorithms, identify the differences and transmit just those blocks to the offsite backup. You can read the details of this algorithm online at.
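The manuscript analogy translates fairly directly into code. The sketch below is a simplified illustration, not the algorithm our software uses: it assumes fixed-size 4,096-byte blocks compared at fixed offsets, and the names `block_hashes`, `changed_blocks`, and `apply_blocks` are hypothetical. Tools such as rsync use rolling checksums so that they can also cope with data that shifts position inside a file, which this version does not attempt.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash every fixed-size block of the copy the offsite side already holds."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old_hashes: list[str], new_data: bytes,
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return only the blocks of new_data that differ from the old copy."""
    changes = {}
    for index, offset in enumerate(range(0, len(new_data), block_size)):
        block = new_data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old_hashes) or old_hashes[index] != digest:
            changes[index] = block
    return changes


def apply_blocks(old_data: bytes, changes: dict[int, bytes], new_length: int,
                 block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new file offsite from the old copy plus the changed blocks."""
    blocks = [old_data[i:i + block_size]
              for i in range(0, len(old_data), block_size)]
    for index, block in changes.items():   # indexes arrive in ascending order
        if index < len(blocks):
            blocks[index] = block           # swap in the changed "page"
        else:
            blocks.append(block)            # the file grew past its old end
    return b"".join(blocks)[:new_length]    # trim if the file shrank


if __name__ == "__main__":
    # A 100-"page" manuscript where only page 10 (index 9) changes.
    old = b"".join(bytes([page]) * BLOCK_SIZE for page in range(100))
    new = bytearray(old)
    new[9 * BLOCK_SIZE] = 255
    new = bytes(new)
    changes = changed_blocks(block_hashes(old), new)
    print(sorted(changes))                  # indexes of the blocks that must be sent
    assert apply_blocks(old, changes, len(new)) == new
```

In this example only one block out of 100 needs to cross the wire, and the offsite side can still reconstruct the full, current file.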

But this is only part of the solution. Offsite backups must also deal with a common file system change called a bulk move. It’s best explained by way of example. Say you have a folder called ‘Client-Proposals’ that stores all your proposals for the year. At the end of the year, you want to make way for the new year, but you’d like to keep the old proposals as an archive, so you rename the folder ‘Client-Proposals-From-2023’.

For argument’s sake, say 2023 was a busy year and you accumulated several thousand documents in your proposal folder (pictures, PDF files, CAD drawings, etc.). When the nightly backup runs, it sees a new folder with several thousand documents and notices that an old folder with several thousand files has been deleted. A naive backup would remove the old folder and start adding the new files it found. Since these all look like new files, there’s no opportunity to send just the changes, so it would upload every one of them in full, leading to several days, or possibly even weeks, of backlog.

Our technology has some patented solutions to resolve this. The short version is that before deleting any files, it checks the list of files it plans to add to see if any filenames look similar. If it finds similar-looking filenames, it cancels the planned delete & add and performs a rename followed by an update instead. If the files turn out to be different, the update operation replaces the entire content of the renamed files, which is essentially the same as the delete & add it had initially planned.
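To make the idea concrete, here is a rough sketch of turning a planned delete & add into a rename followed by an update. It is an illustration only and not the patented method: it assumes "similar" simply means an identical file name in a different folder, and `plan_operations` and the operation tuples are hypothetical.

```python
import posixpath


def plan_operations(deleted_paths: list[str], added_paths: list[str]) -> list[tuple]:
    """Pair planned deletes with planned adds that share a file name."""
    # Index the files we plan to add by their bare file name.
    adds_by_name: dict[str, list[str]] = {}
    for path in added_paths:
        adds_by_name.setdefault(posixpath.basename(path), []).append(path)

    operations = []
    matched = set()
    for old_path in deleted_paths:
        candidates = adds_by_name.get(posixpath.basename(old_path), [])
        if candidates:
            new_path = candidates.pop(0)
            matched.add(new_path)
            # Cancel the delete & add: rename offsite, then send only the differences.
            operations.append(("rename", old_path, new_path))
            operations.append(("update", new_path))
        else:
            operations.append(("delete", old_path))

    # Anything left unmatched really is a new file.
    for path in added_paths:
        if path not in matched:
            operations.append(("add", path))
    return operations


if __name__ == "__main__":
    plan = plan_operations(
        ["Client-Proposals/acme.pdf", "Client-Proposals/old.txt"],
        ["Client-Proposals-From-2023/acme.pdf", "notes/new.txt"],
    )
    for operation in plan:
        print(operation)
```

If a matched pair turns out to have different contents, the update step ends up sending the whole file, so nothing is lost compared with the original delete & add plan.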