- CrashPlan for Home
This article is intended for CrashPlan for Home users. For Code42 CrashPlan and CrashPlan PRO documentation, see our enterprise support site.
Data de-duplication is a core element of CrashPlan's backup process. It ensures both the efficiency and integrity of your backup data, and therefore it is not possible to turn it off. However, as several blogs have noted, it is possible to modify how de-duplication behaves.
This article explains the purpose of a specific data de-duplication setting, as well as the impact of changing this setting on your bandwidth, CPU usage, and backup data.
Key concept: Rolling de-duplication of data blocks
This article assumes you are already familiar with the basic concept of data de-duplication and how backup works. However, to understand the implications of changing de-duplication's configuration settings, it's also important to understand how CrashPlan prepares and de-duplicates data blocks.
For each file in its to-do list, the CrashPlan service performs the following tasks in sequence:
- CrashPlan evaluates the file and breaks it up into a series of smaller pieces called blocks.
- CrashPlan generates a computationally light checksum as a numerical way of identifying the data within the block.
- CrashPlan compares this checksum against its cache, which is stored in RAM on your computer, to determine if this block has already been backed up.
- CrashPlan chooses whether to send the block to the backup destination, depending on the checksum comparison:
  - If the checksum matches, CrashPlan generates a stronger checksum and compares that against the cache. If this checksum also matches, CrashPlan concludes that the block has not changed and does not send it to the backup destination.
  - If the checksum does not match, CrashPlan concludes that the block is new or has changed since the last backup and sends it to your backup destination.
- If CrashPlan determines that the block should be sent to your destinations, it compresses and encrypts the block before transmitting it.
- CrashPlan continues this process for each block until it reaches the end of the file, and then moves on to the next file in the to-do list.
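The per-block decision above can be sketched in Python. This is a hypothetical illustration, not CrashPlan's actual implementation: the block size, the choice of checksum algorithms (Adler-32 as the light checksum, SHA-256 as the stronger one), and the `cache` and `send` interfaces are all assumptions made for the sake of the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size; not CrashPlan's actual value

def backup_file(path, cache, send):
    """Walk a file block by block, sending only blocks not already backed up.

    `cache` maps a light checksum to a stronger checksum for every block
    already backed up; `send` transmits a block to the destination.
    """
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            weak = zlib.adler32(block)            # computationally light checksum
            strong = None
            if weak in cache:                     # light checksum seen before?
                strong = hashlib.sha256(block).digest()
                if cache[weak] == strong:         # stronger checksum also matches:
                    continue                      # block unchanged, don't resend
            cache[weak] = strong or hashlib.sha256(block).digest()
            # A real client would also encrypt the block before transmitting it.
            send(zlib.compress(block))
```

Running this twice over an unchanged file sends every block the first time and nothing the second time, because every block's checksums are found in the cache.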
Rolling block analysis
As CrashPlan's de-duplication algorithm works through the file block by block, it uses a rolling analysis to review subsequent blocks whenever it locates a changed file. If a file has changed since the last backup, CrashPlan backs up only the changed block(s), without restarting backup on the remaining blocks in the file.
This is critical to backing up your data efficiently and conserving your bandwidth. Without rolling block analysis, any time you back up a changed file (that is, whenever CrashPlan detects a new block within a sequence of blocks that have already been backed up), CrashPlan would back up the new blocks and then resend all of the subsequent blocks to your backup destinations.
While some files, like ISO images, never change, most file types are updated, whether by you or by the applications that use them, with varying frequency. For example, MP3 files change occasionally when metadata such as title, album, artist, and genre are added via ID3 tags. Other file types, like Outlook PST files, change frequently as emails, events, and tasks are added. Depending on the types of files you are backing up, rolling block analysis can facilitate a large amount of data de-duplication.
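The value of rolling analysis can be demonstrated with a small experiment. The sketch below is not CrashPlan's algorithm; it compares fixed-offset blocking against a simple rolling-checksum chunker (window size, divisor, and block size are all arbitrary demo values) to show what happens when a few bytes are inserted near the start of a file.

```python
import hashlib
import random

BLOCK = 64  # tiny block size for the demo; real backup blocks are far larger

def fixed_blocks(data):
    """Split at fixed offsets: an insertion shifts every later block."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def rolling_blocks(data, window=16, divisor=32):
    """Split where a rolling checksum over the last `window` bytes hits a
    target value, so block boundaries follow content rather than offsets."""
    blocks, start, acc = [], 0, 0
    for i in range(len(data)):
        acc += data[i]
        length = i - start + 1
        if length > window:
            acc -= data[i - window]           # slide the window forward
        if length >= window and acc % divisor == 0:
            blocks.append(data[start:i + 1])  # boundary found
            start, acc = i + 1, 0
    if start < len(data):
        blocks.append(data[start:])           # final partial block
    return blocks

def blocks_to_resend(split, old, new):
    """How many blocks of `new` are absent from `old`'s backed-up set."""
    known = {hashlib.sha256(b).digest() for b in split(old)}
    return sum(1 for b in split(new) if hashlib.sha256(b).digest() not in known)

random.seed(42)
old = bytes(random.randrange(256) for _ in range(8000))
new = old[:100] + b"EDIT" + old[100:]         # insert 4 bytes near the front

print("fixed-offset blocks to resend:", blocks_to_resend(fixed_blocks, old, new))
print("rolling blocks to resend:     ", blocks_to_resend(rolling_blocks, old, new))
```

With fixed offsets, nearly every block after the insertion point shifts and must be resent; with rolling boundaries, the chunker resynchronizes just past the edit, so only the block(s) touching the edit are new.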
Rolling data de-duplication maximum file size
As described above, rolling data de-duplication is essential to CrashPlan's ability to dynamically back up changes in your files while remaining efficient over the life of your backup. However, some blogs recommend changing this behavior to effectively eliminate rolling de-duplication. Before making any changes to CrashPlan, you should fully understand the implications of changing this setting.
Default configuration (recommended)
The default setting for rolling data de-duplication sets de-duplication's maximum file size to 0, meaning "unlimited": rolling de-duplication is enabled on files of any size.
This way, if a new data block is introduced, CrashPlan can back up the new blocks, no matter where they occur within the sequence of blocks that make up the file, while still being able to identify the old blocks that have already been backed up.
We strongly recommend leaving this configuration unchanged.
Altered configuration (unsupported)
Several blogs have recommended modifying a configuration file within the CrashPlan app to change this setting's value from 0 to 1.
This change tells CrashPlan to disable rolling de-duplication on any file over 1 byte in size, which in practice disables rolling de-duplication entirely. Consequently, any time new blocks are introduced within a sequence of blocks that have already been backed up, CrashPlan backs up the new blocks and resends all of the subsequent blocks to your backup destinations. As a result, changing this setting all but guarantees that you will send more data to your destinations over time.
Every backup is different, and the effects of this setting may have unintended consequences on your specific backup. Since CrashPlan is designed to back up your newest data first, this modification could result in constantly re-uploading large files that change frequently (such as Outlook PST files), leaving older files unprotected, even if the older files are more important to you.
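Back-of-the-envelope arithmetic shows how quickly this adds up. All of the figures below are hypothetical; they simply illustrate the claim above for a frequently changing file like an Outlook PST.

```python
# Hypothetical figures: a 1 GB PST file that gains a few new emails
# (roughly one changed 4 MB block) every day for a month. Pessimistically
# assume the change lands near the start of the file, so without rolling
# de-duplication nearly the whole file must be resent each day.
pst_size_mb = 1024          # assumed file size
changed_mb_per_day = 4      # assumed size of the genuinely new data
days = 30

with_rolling = changed_mb_per_day * days   # only the new blocks are sent
without_rolling = pst_size_mb * days       # the whole file is resent daily

print(f"with rolling de-duplication:    {with_rolling} MB uploaded")
print(f"without rolling de-duplication: {without_rolling} MB uploaded")
```

Under these assumptions, the altered configuration uploads roughly 30 GB in a month to protect about 120 MB of new data.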
Additionally, making this change requires editing one of CrashPlan's application files. While the recommended change is small, errors introduced when editing application files can have unintended consequences, including data loss. Also note that because this setting is changed outside the CrashPlan app, it could be overwritten and reverted at any time.
Finally, be advised that this is an unsupported configuration: our Customer Champions cannot assist you with tasks outside normal support, and you assume all risk of unintended behavior.
Frequently asked questions
Why does my backup speed increase after making this change?
The observed increase in backup speed is a false economy. CrashPlan uses more bandwidth to send data, but the efficiency of the backup process is substantially reduced, causing redundant data to be sent.
By design, CrashPlan performs work on the source computer so that data can be efficiently and securely transmitted and stored. Not only is every file broken down into blocks and encrypted before leaving the system, but each block is very carefully selected, compressed, and validated as truly new before it is sent. The process of analyzing blocks of data to ensure they haven't already been uploaded, while extremely efficient, is not instantaneous and it can require many re-reads of source data and CPU time to generate and compare checksums.
If you prevent rolling de-duplication, CrashPlan performs only a minimal analysis on each file. While this change can increase CrashPlan's bandwidth utilization when disk I/O is a bottleneck, it only looks like CrashPlan is "working faster" in the short term. The overall amount of data sent to your destinations can increase dramatically, causing the initial backup and future incremental backups to require substantially more time.
Does this change affect all computers equally?
This change has the greatest impact on systems whose backup speed is bottlenecked by CPU or disk I/O performance. For instance, NAS devices and older single-core computers may have limited computational resources, while data stored on slower external hard drives and network shares can have heavily limited disk read performance.
Most modern computers backing up from internal magnetic or SSD disks do not experience performance issues related to the disk I/O and CPU load created by CrashPlan's rolling de-duplication process. Users on these computers likely won't see an impact from this change.
On the other hand, systems backing up from slower disks with disk I/O limitations, especially network-attached disks, may experience a logarithmic drop in upload rate over time as the backup grows. This is because CrashPlan is waiting on checksum verification, sometimes across many hundreds or thousands of potential block windows, before sending new data. By preventing rolling de-duplication, disk I/O and CPU requirements are reduced, and users on these systems will likely experience faster upload times. However, they are also much more likely to send redundant data.
Why is CrashPlan using so much CPU?
The CrashPlan app includes settings for managing the amount of CPU time that the CrashPlan service is allowed to use for backup. If you have changed these settings but CrashPlan is still using a high percentage of CPU, it may be a symptom that disk I/O is constrained, or that antimalware software is interfering with CrashPlan.
Additionally, we always recommend confirming that you aren't backing up system or application files, which can cause CrashPlan to expend significant resources on files there is no advantage to backing up. Finally, if you haven't already done so, review our recommendations for speeding up your backup.
Is Code42 throttling my upload speed?
No. Code42 is committed to providing unlimited backup, and we do not throttle or modify incoming bandwidth to our cloud destinations. Our incoming bandwidth, and the bandwidth per server, is shared with and balanced across all active users. Additionally, Code42 monitors usage across our data centers and strategically adds capacity (servers, storage, and bandwidth) to our infrastructure to deliver the best upload speeds we can.
- Wikipedia: Checksums