How to Build a File Syncing Service
A few weeks ago, we announced the shutdown of LayerVault. Unfortunately over the past few years, we were unable to make the service financially viable. There were many reasons that we failed, but one of them was some of the technology decisions we made. We made certain assumptions that led to increased costs and increased brittleness of the system.
We learned many lessons, and this blog post will lay out the technical lessons we learned along with a few solutions. Finally, I’ll attempt to wrap all of this up into a final thought: were you to build a LayerVault today, what would that look like? I’ll lay out an open technical specification for the plumbing to build your own Google Drive, Dropbox, what-have-you, in a way that’s distributed and fault-tolerant. I will refer to this as the Perfect Sync System (“PSS”).
After spending years thinking about this problem, and looking at the cost structures of other companies that look like LayerVault, I’m convinced that the way forward is an open solution. There is still opportunity to deliver increased value and charge on top of an open solution, much like the relationship between git and GitHub.
The same open-source plumbing could power a spiritual successor to LayerVault, the LayerVault for doctors, the LayerVault for music, the LayerVault for dogs, and so on…
Lessons Learned
Assumption: At-rest storage should be the thing to optimize for
At LayerVault, we made the assumption that at-rest storage costs should be the dimension to optimize costs around. To deal with this, we built a binary diff-patch system that would transmit patches over the wire and store them on a central server. Then, we would apply the patches to the base file whenever we needed the latest version.
Unfortunately, systems that rely heavily on diff-patch approaches are often unnecessarily complicated. In a cloud-based environment, storage is just one of the things that must be optimized; CPU time and transfer costs must also be watched closely.
Thus, one of the primary cost centers for LayerVault was image processing. Our at-rest storage costs were usually only 1/10th of our total infrastructure bills on any given month. For a robust syncing and storage solution, binary diffing and patching is important but optional.
Assumption: End users will work on files simultaneously
One of the assumptions that LayerVault got right was that we expected users to work on files simultaneously. We structured how we handle conflicts from day one to keep this in mind.
This allowed us to never pop conflict dialogs mid-sync. A robust syncing solution should allow for files to be worked on simultaneously without error. Displaying the history of a file is a “porcelain” problem, in the git parlance.
Assumption: Users only care about the last version of the file
This is not an assumption that LayerVault made per se, but it’s one that many others make.
Generally, history is less important within a personal context. Within a professional context, history is extremely important and often has a price tag. (If you accidentally delete N hours of work, that mistake cost you N hours times your hourly rate.)
Assumption: Teams use primarily one operating system
A file syncing service is much less useful if it is not available on all operating systems that an end user interacts with, both desktop and mobile.
We were never able to get around to building a Windows client for LayerVault, and I believe that hurt potential adoption.
Assumption: Users prefer the “magic folder” approach
LayerVault, along with many other cloud storage providers, made the assumption that working out of a single, unified “magic folder” was the way to go.
This approach tends to work well for the first few weeks and months, but then hits a brick wall as more members are added to the team. Unless things are configured correctly, end users sync every single change from every other teammate.
One of the things that LayerVault did particularly well was our own flavor of selective syncing.
We called them projects. When you joined a project, your app would sync those files locally. When you left a project, those files would be removed. What we got wrong, however, was not giving those projects the ability to live outside of the central ~/LayerVault directory.
Thus, the ideal solution takes a repository-like approach similar to git. Every project is a self-contained repository. This greatly simplifies the permissions issues that will arise down the line in any organization of a certain size. It also provides more granular control over what is getting transferred.
A nice porcelain could expose this interface within a context menu in Finder, if the host system were OS X.
Assumption: Syncing speed was important, but not paramount
When developing a syncing service, you are defined by how quickly you can shuttle files around. The bigger players like Dropbox are aware of this, hence features like LAN Sync. By definition, a solution like BitTorrent Sync has this built in.
By choosing the nearest-neighbor for downloading file blobs, you can greatly increase the speed at which files are transferred. This also has the added bonus of cutting down on outgoing transfer costs by the service provider.
We never managed to implement something like this at LayerVault, and it impacted the speed at which our customers were able to work. The PSS has this built in.
Assumption: Versioning is not important
When developing LayerVault, we came to a very important realization: everyone wants version control whether they know it or not. Apple and Microsoft are attempting to bake this in at the operating system level with some success. I still don’t care much for their interfaces at the moment.
It’s surprising how little credence most of the existing syncing and storage services give to this: Box displays past versions in a funky interface, Dropbox will keep the last 30 days of versions and then destroy them, BitTorrent Sync offers no versioning whatsoever.
This makes sense. Traditionally (save for the case of Box), these applications tend to be more consumer-oriented. Versioning different documents on a personal level is a nice-to-have. In a professional context, versioning is an important safety net that can regularly avert disasters. None of the existing systems that I’ve seen are robust enough to provide the kind of disaster recovery that LayerVault included.
Assumption: Proprietary syncing is the way forward
There is currently no shortage of competition in the storage and syncing space. Google Drive, Box, and Dropbox are all heavily commoditized at this point and are in a race to the bottom. As of this writing, all of these companies do roughly the same thing. They are all competing on price and none of them are winning.
They are all attempting to add value by specializing in certain verticals. You can see evidence of some big value-add plays through Box’s acquisitions in the medical imaging vertical, Dropbox’s Project Harmony, and so forth. Although I don’t believe the core syncing problem to be solved yet, these companies currently cannot afford to focus on a more robust syncing solution. They must focus on value-add diversification or else die at the hand of the market.
This situation is not new to software. We do not yet have a monopolistic sync and storage solution the way we had a single operating system for many years (Windows) or a single photo editing software (Photoshop). The space is young and the last-movers have not yet been crowned.
Each of these services is working with a predictable cost structure: the price of data transfer and storage. Given the heated competition at the moment, this will continue to be a race to the bottom.
Although there are some defensible features in the market, this market is relatively easy to enter given open tools like rsync. Thus, I predict it will remain a race to the bottom for some time to come.
That is, unless an open solution is pursued. An open solution can abstract out the cost of storage quite a bit. If the open solution can run on your company’s NAS, you don’t have to pay a hefty Amazon S3 bill each month.
If you are a service hosting this open solution, you can transparently pass the storage costs through to the end-users. As a host for an end-user application like this, I would argue that you very rarely want to be merely charging a percentage on top of storage costs. Instead, you should charge on the value added and pass the variable storage costs through.
The Perfect Syncing Solution
After many years and many implicitly and explicitly tested hypotheses, here is what I believe the Perfect Syncing Solution will look like in terms of some core principles:
- It is open
- It is user-friendly
- It is scalable
- It is mindful of small hard drives
- It is mindful of bandwidth
- It uses known protocols
- It is secure
- It is fast
- It operates on a per-folder basis
- It is distributed
- It is visual
- It is implemented as a set of commandline tools
It is open
Given how fundamental I believe syncing, storage, and versioning to be to a professional workflow, the solution must be open source in today’s environment. It is a little silly that the big players are not sharing a common application for synchronizing file changes among customers.
Creating this as an open solution also has other benefits: because this must run on many different operating systems, the implementation costs quickly balloon. This is exactly the kind of problem I’ve seen many open-source projects overcome.
It is user-friendly
For this open solution to be successful, it has to offer a set-it-and-forget-it experience. The add-commit-push flow in a git or git-like solution is too complicated for your average user. Many users will want the ability to add commit messages or have explicit control over commits, but this is not required.
The user-friendliness of the application will mostly depend on the shininess of the porcelain, but many of the considerations must start at the plumbing level.
It is scalable
It’s amazing how productive individuals can be with modern digital tools. With programs like Photoshop, this productivity translates to lots and lots of raw data. When accounting for every version a prolific graphic designer produces, you will find yourself tracking hundreds of GBs of files in a single year.
The PSS must make design decisions that allow it to scale well without individual projects using up entire hard drives.
It is mindful of small hard drives
The birth of flash storage has led to snappy local drives. The win of speed is balanced out by smaller hard drive sizes for comparable prices. At the time of this writing, it’s not unusual to see flash drives top out at 512 GB.
In the case of creative professionals, work is usually done on the local drive. When a work session is finished, the work may be transferred over to a network drive.
This is a similar fashion by which the PSS should operate: work should be done locally and then have the option of being sent elsewhere. Hopefully that elsewhere has a much bigger drive, or is configured to send data to S3 or a similar service.
It is mindful of bandwidth
You would be surprised how pathetic office bandwidth can be at times. At LayerVault, customers in major metropolitan areas in the United States and the United Kingdom often had it the worst.
Thus, when transmitting data, the PSS must aim to use as little bandwidth as possible. It should also have the ability to limit the total speed at which it transfers files.
It uses known protocols
You would again be surprised how complicated the security environments can be at relatively small companies. This is sometimes a case of IT managers run amok, although many times it is due to an increased threat profile.
Thus, the solution should be able to function using no more than HTTPS and SSH.
It is secure
All transmitted data should be sent over encrypted channels to remotes. It’s 2015.
It is fast
As we experienced first-hand at LayerVault, every second is important in a professional setting. As such, files should be transferred from their nearest-neighbor when available. Files should only be transferred when necessary.
It operates on a per-folder basis
Throw out the magic folder analogy. It becomes such a headache when a subfolder is accidentally moved on the filesystem. The magic folder service assumes that that subfolder was deleted, because it has left the scope of the magic folder. Undeleting or moving the folder back into place results in a reupload of its contents in every syncing solution that exists today. Gross.
Thus, folders should be self-contained. All meta information should be kept in a hidden directory, similar to how git operates. This allows the end user to move the folder to the Desktop, into their Documents folder, or wherever they please without adverse behavior.
It should be noted that a particular porcelain could reinstate the magic folder analogy if it so chooses.
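To illustrate why the hidden directory makes the folder relocatable, here is a minimal sketch in Go (the implementation language I recommend later in this post) of how a client could locate the project root by walking upward until it finds a .vault directory, wherever the folder has ended up. The function name is illustrative, not part of any spec.

package vault

import (
	"errors"
	"os"
	"path/filepath"
)

// FindRoot walks upward from dir until it finds a directory containing a
// .vault folder, so a project keeps working no matter where the user has
// moved it on disk.
func FindRoot(dir string) (string, error) {
	for {
		info, err := os.Stat(filepath.Join(dir, ".vault"))
		if err == nil && info.IsDir() {
			return dir, nil
		}
		parent := filepath.Dir(dir)
		if parent == dir { // reached the filesystem root
			return "", errors.New("not inside a Vault project")
		}
		dir = parent
	}
}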
It is distributed
One of the biggest wins in recent version control has been its distributed nature. Git, Mercurial, and others are distributed version control systems (DVCS). This means that each repository is entirely self-contained and requires no network access to recall any history within the repository.
While this is the type of thing that gets engineers like myself excited, the bigger win is in the constraints it imposes on the system: because it is decentralized by nature, distributing and synchronizing data between remotes (to use the git vocabulary) must remain simple and idempotent.
I argue that this design, whether the original authors intended it or not, imposes healthy constraints on the design of the program.
It is visual
One of the highest-value things that LayerVault was able to offer was our ability to display different parts of different documents as PNGs. This allowed us to display different artboards, slices, layer comps, pages, and so forth on the web. You could preview files accurately without needing to open them up in their native application.
As such, document previews should be considered first-class citizens of the PSS. The document previews should either be stored as their raw representations, or as instruction sets for how to generate them.
It is implemented as a set of commandline tools
If you go digging around in Chapter 10 of Scott Chacon’s Pro Git, you’ll notice that git is a set of loosely connected tools that are oftentimes no more than a thin layer on top of existing *nix programs like cat, ls, and so forth. Each git command that a programmer may use (push, commit, fetch) is often no more than a composition of other git commands closer to the metal.
This design pattern is very powerful, and has a few huge wins. For one, it makes the system much more testable. It also makes the system much easier to implement in whole or in parts on different operating systems.
As such, I recommend Go be the implementation language for this project. You get easy, multi-platform distribution out of the box. Go strikes the right balance between speed and expressiveness. Lastly, Go’s testing frameworks are robust enough to make 100% coverage not only attainable but meaningful.
The Specification
Alright, let’s give this thing a name. For the purposes of nostalgia, I’ll call this system Vault. Let’s start defining this application from the outside in.
The goal is to build a content-addressable filesystem with some fundamentally different considerations for working with partially complete repositories, continuous syncing, and remote objects.
Having a Vault project synchronize with its remotes should be as simple as running the following in the project directory:
$ vault --watch
All meta information will be contained within a hidden .vault directory at every Vault project’s root. The meta-program that will tie together all of the subprograms will be called vault. All meta information will be stored in plain text files as JSON encoded using UTF-8.
By keeping this interface simple and familiar, we open up the possibility of powerful porcelains, much like how git has several graphical interfaces like gitk, Tower, and more.
The .vault directory
The .vault directory should be very familiar to anyone that has peeked inside a .git directory. Let’s take a look at an example:
.
├── config
├── meta
│   └── 07
│       └── 13d3ef2c774be533e24348b622391f638a4669
├── objects
│   ├── 07
│   │   └── 13d3ef2c774be533e24348b622391f638a4669
│   └── fa
│       └── cebeeffacebeeffacebeeffacebeeffacebeef
└── previews
    └── fa
        └── ce2beef2face2beef2face2face2beef2facef
The config file
The config file should be a JSON document encoded using UTF-8. Here’s an example, fleshed-out config file:
{
  "description": "Design files for the next big thing.",
  "remotes": [{
    "name": "origin",
    "url": "mysite.com/company/project.vault"
  }],
  "ignore_files": [
    ".DS_Store",
    "*.a",
    "*.idlk"
  ],
  "default_diff_method": "bsdiff"
}
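For illustration, here is a minimal Go sketch of loading that config file. The struct fields simply mirror the keys in the example above and are not a settled schema.

package vault

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// Remote names a location that a Vault project can push to or pull from.
type Remote struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}

// Config mirrors the keys in the example .vault/config document above.
type Config struct {
	Description       string   `json:"description"`
	Remotes           []Remote `json:"remotes"`
	IgnoreFiles       []string `json:"ignore_files"`
	DefaultDiffMethod string   `json:"default_diff_method"`
}

// LoadConfig reads and decodes .vault/config from the given project root.
func LoadConfig(root string) (*Config, error) {
	data, err := os.ReadFile(filepath.Join(root, ".vault", "config"))
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}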
Objects
Much like git does, raw file information may be stored in the objects folder with the blob type. Directory structures are stored using the tree type, but represented using JSON.
All non-blob objects should be defined by a git-like header, followed by their content encoded as UTF-8 JSON with all whitespace removed. Example code in Ruby, heavily borrowed from Pro Git:
require 'json'
require 'digest/sha1'
require 'zlib'
require 'fileutils'

# The object body: a JSON document with all whitespace removed, encoded as UTF-8.
content = {
  "url": "https://mysite.com/cats.psd",
  "size": 12345,
  "sha": "[SHA1 hash]"
}.to_json

# Prepend the git-style header: object type, content length, and a null byte
# separating the header from the body.
header = "remoteblob #{content.length}\0"
store = header + content

# Objects are addressed by the SHA1 of header plus body, then zlib-deflated
# on disk under .vault/objects/<first 2 chars>/<remaining 38 chars>.
sha1 = Digest::SHA1.hexdigest(store)
zlib_content = Zlib::Deflate.deflate(store)
path = '.vault/objects/' + sha1[0, 2] + '/' + sha1[2, 38]
FileUtils.mkdir_p(File.dirname(path))
File.open(path, 'wb') { |f| f.write zlib_content }
remoteblob objects
Importantly, we introduce two new subtypes to the blob type. First, the remoteblob type. This represents the file data at a remote location using a URL and a checksum of the file data. By doing this, we are able to offload large chunks of binary data to a centralized store like Amazon S3.
Two stumbling blocks with remoteblob objects will be authentication and availability. Securely authenticating in a distributed environment potentially requires distribution of authentication keys as well. Availability becomes an issue if certain URLs are not accessible by certain clients, e.g. a NAS exists in one client’s environment but not another.
{
  "url": "https://mysite.com/cats.psd",
  "size": 12345,
  "sha": "[SHA1 hash]"
}
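To make the idea concrete, here is a hedged Go sketch of resolving a remoteblob: download the URL it points to and confirm the bytes hash to the recorded checksum. The function name and the absence of any authentication are assumptions of the sketch, not part of the spec.

package vault

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
)

// FetchRemoteBlob downloads the file a remoteblob points at and confirms
// that its SHA1 matches the checksum recorded in the object. Retries and
// authentication are deliberately left out of this sketch.
func FetchRemoteBlob(url, wantSHA, dest string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()

	// Hash the bytes as they stream to disk.
	h := sha1.New()
	if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
		return err
	}

	got := hex.EncodeToString(h.Sum(nil))
	if got != wantSHA {
		return fmt.Errorf("remoteblob checksum mismatch: got %s, want %s", got, wantSHA)
	}
	return nil
}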
compositeblob objects
The second subtype we introduce will be the compositeblob type. This is a blob that is made up of two or more blob objects, defined as a base and one or more patches. In the interest of keeping things flexible, different diff/patch types are allowed, but the use of bsdiff is encouraged.
This approach is different from git packfiles for a few reasons. The goal here is to make compositeblob objects wholly interchangeable with blobs.
Here’s an example representation:
{
  "blobs": [
    "[SHA1 hash]",
    "[SHA1 hash]",
    "[SHA1 hash]"
  ],
  "resulting_sha1": "...",
  "diff_method": "bsdiff"
}
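As a sketch of how a client might rebuild a file from a compositeblob, the following Go snippet applies each patch on top of the base by shelling out to the stock bspatch tool (bspatch oldfile newfile patchfile). It assumes the blobs array is ordered base-first and relies on a hypothetical materializeBlob helper; neither assumption is dictated by the spec.

package vault

import (
	"fmt"
	"os/exec"
)

// CompositeBlob mirrors the JSON representation above: a base blob plus
// one or more patches that rebuild the full file.
type CompositeBlob struct {
	Blobs        []string `json:"blobs"`
	ResultingSHA string   `json:"resulting_sha1"`
	DiffMethod   string   `json:"diff_method"`
}

// Reassemble rebuilds the original file at dest by applying each patch in
// order with bspatch.
func (c *CompositeBlob) Reassemble(dest string) error {
	if c.DiffMethod != "bsdiff" {
		return fmt.Errorf("unsupported diff method %q", c.DiffMethod)
	}
	current, err := materializeBlob(c.Blobs[0]) // the base blob
	if err != nil {
		return err
	}
	patches := c.Blobs[1:]
	for i, patchSHA := range patches {
		patch, err := materializeBlob(patchSHA)
		if err != nil {
			return err
		}
		out := dest
		if i < len(patches)-1 {
			out = dest + fmt.Sprintf(".tmp%d", i) // intermediate result
		}
		// bspatch <oldfile> <newfile> <patchfile>
		if err := exec.Command("bspatch", current, out, patch).Run(); err != nil {
			return err
		}
		current = out
	}
	// A fuller implementation would now hash dest and compare it against
	// ResultingSHA, much like vault verify does for plain objects.
	return nil
}

// materializeBlob is a hypothetical stand-in for real object access: it would
// inflate the blob with the given SHA1 from .vault/objects into a temporary
// file and return that file's path.
func materializeBlob(sha string) (string, error) {
	return "", fmt.Errorf("materializeBlob(%s): not implemented in this sketch", sha)
}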
By this definition, we see that we will definitely need a vault verify program to ensure that a repository is valid and that our compositeblob objects are valid.
meta objects
We separate out any meta-related information into a separate meta directory. This directory holds metadata about different objects in the system. In the case of a flat file blob, this meta information may specify information about the document: its layers, its height and width, and so forth.
Meta objects are JSON documents with the git-style header. Their SHA1 should perfectly match the blob that they represent. The structure of this document will be decided later.
meta objects will determine which previews belong to a document.
preview objects
We also separate out any visual representations of documents into a separate previews directory. This directory holds visual representations of whole documents or different pieces of them as rasterized or vector image formats. By default, we will prefer the PNG file format.
To save on space, these previews can also be stored as instruction sets on how to generate the files themselves. For example:
{
  "name": "Home Screen Artboard",
  "file": "[SHA1 hash of the file object]",
  "height": 500,
  "width": 1000,
  "size": 54321,
  "instructions": {
    "exec": "sketchtool",
    "flags": "[some sketchtool flags]"
  }
}
Alternatively, these could also be remoteblob-style objects.
Sometimes, preview-generation will have external dependencies, such as in the case of a Sketch file. A system for downloading and installing these external dependencies will need to be developed.
These preview objects are defined with a git-style header with the preview type, as with all of the other objects.
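To illustrate the instruction-set idea, here is a hedged Go sketch that runs whatever executable and flags a preview object records and treats the tool’s standard output as the rendered image. The spec leaves the real sketchtool invocation open, so the argument handling here is purely an assumption.

package vault

import (
	"os"
	"os/exec"
	"strings"
)

// PreviewInstructions mirrors the "instructions" portion of a preview object.
type PreviewInstructions struct {
	Exec  string `json:"exec"`
	Flags string `json:"flags"`
}

// Generate re-renders a preview by running the recorded tool against the
// source file and writing its stdout to outPath. Installing the external
// dependency (sketchtool, for example) is out of scope for this sketch.
func (p *PreviewInstructions) Generate(sourcePath, outPath string) error {
	args := append(strings.Fields(p.Flags), sourcePath)
	out, err := exec.Command(p.Exec, args...).Output()
	if err != nil {
		return err
	}
	return os.WriteFile(outPath, out, 0o644)
}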
Wins
Because certain types of objects are swappable with others, we have an interesting setup with some potentially big wins. An unoptimized repository is always possible, and will be the fastest version of the repository to access and browse.
As we work out of the directory, you can imagine a subprocess of the vault --watch process is constantly optimizing the repository in one big way: by replacing blob objects with compositeblob objects. This cuts down on the total file size of the repository, no matter where it is present. It also minimizes the total amount of data that needs to be sent across the wire to a remote.
Then, on the origin side, we could begin swapping out different blob objects for remoteblob objects by transferring the contents to S3 and replacing the records. Eventually, we would be left with an extremely lightweight representation of the project and all of its past versions. If the origin was also interested in generating previews for the different filetypes, we could also kick off those processes, although it’s probably best to leave those to a client.
Thus, a rough order of operations for every client run loop is as follows (a Go sketch of this loop follows the list):
receive_filesystem_changes()
add_changed_files_to_database()
generate_previews()
diff_blobs()
push_to_remote()
pull_from_remote()
move_files_from_database_into_place()
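Here is a minimal Go sketch of that client loop with each step stubbed out. A real implementation would react to filesystem events rather than poll on a timer, and none of these function names are part of the spec.

package vault

import (
	"log"
	"time"
)

// RunClientLoop is a sketch of the vault --watch client loop. Each step is a
// stub standing in for the pseudocode above.
func RunClientLoop(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		if err := runOnce(); err != nil {
			log.Printf("vault: sync pass failed: %v", err)
		}
	}
}

func runOnce() error {
	receiveFilesystemChanges()
	addChangedFilesToDatabase()
	generatePreviews()
	diffBlobs()
	pushToRemote()
	pullFromRemote()
	moveFilesFromDatabaseIntoPlace()
	return nil
}

// Stubs mirroring the pseudocode; their implementations are the subject of
// the rest of this spec.
func receiveFilesystemChanges()       {}
func addChangedFilesToDatabase()      {}
func generatePreviews()               {}
func diffBlobs()                      {}
func pushToRemote()                   {}
func pullFromRemote()                 {}
func moveFilesFromDatabaseIntoPlace() {}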
For a server run loop it could be set up to do the following:
receive_changes_from_clients()
bounce_large_files_to_s3()
report_changes_to_clients()
Moving data
Because Vault projects are content-addressable filesystems, much in the same way git is, moving data between repositories is relatively easy. Just copy the files that don’t exist into place.
One big question mark is reconciling data. Because it’s important to minimize the amount of user interaction as someone works out of a Vault, certain conventions must be followed when it comes to merging data sets. I think, though, that it’s possible to design the application in such a way that all automatic actions are merely additive. Destructive actions must be explicitly triggered by an end user, e.g. removing certain versions from the history.
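Here is a sketch in Go of that additive copy, assuming the .vault/objects layout described above: walk the source object store and copy over only the objects the destination lacks. Because objects are content-addressed, anything already present at the same path can be skipped safely.

package vault

import (
	"io"
	"os"
	"path/filepath"
)

// CopyMissingObjects walks the source repository's object store and copies
// any object the destination does not already have.
func CopyMissingObjects(srcRoot, dstRoot string) error {
	srcObjects := filepath.Join(srcRoot, ".vault", "objects")
	return filepath.WalkDir(srcObjects, func(path string, d os.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		rel, err := filepath.Rel(srcObjects, path)
		if err != nil {
			return err
		}
		dst := filepath.Join(dstRoot, ".vault", "objects", rel)
		if _, err := os.Stat(dst); err == nil {
			return nil // already present; content-addressing makes this safe to skip
		}
		if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
			return err
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.Create(dst)
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}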
Working on the same file simultaneously on two different clients
One of the big requirements is the ability for two different clients to work on the same file. To accomplish this, we turn to git’s branching model. By tracking the HEAD of each different client, we will wind up with multiple leaf nodes on divergent branches in our working graph. Thus, we never disallow a push, but instead ingest all of the data.
Later, we can splice those divergent branches into one. This would be akin to a rebase in git. No data is destroyed, but instead just coalesced into one working timeline or branch.
Different pieces
Given what I’ve laid out here, and borrowing heavily from the git community, a few tools that make up the core functionality of Vault become obvious. They are:
# Make an existing folder into a Vault project
vault init
# Verify the validity of the project
vault verify
# Replace blob with remoteblob
vault remoteify [Blob SHA]
# Replace blob and its ancestry with a compositeblob
vault split [Blob SHA]
# Add a preview file for a given blob
vault preview add [File path] [Metadata]
# Add meta information about a given blob
vault meta add [Blob SHA] [Metadata]
# Add a file to the repo
vault file add [File path]
# Add a tree to the repo
vault tree add [Tree data]
# Push changes to all remotes
vault push
# Pull changes from all remotes
vault pull
# Push changes to a specific remote
vault push [Remote]
# Broadcast files available to those on the local network
vault broadcast
# Non-destructively splice divergent branches into one.
vault splice
# Destructively squash commits down based on total
# file size of each change or some other metric.
vault squash
# Main entry point for most applications for syncing
# files to remotes. It will call out to the other tools
vault --watch
Conclusion
So that about wraps up this exploratory blog post. Hopefully this effectively encapsulated some of the stumbling blocks we hit at LayerVault and laid out a way forward.
This new solution, nicknamed Vault, is defensible against many of the issues we saw while building LayerVault and adopts many of the great ideas pioneered by git. It is fundamentally different from existing storage and sync services, and represents a potentially powerful new application.
I likely don’t have enough time to work on this at the moment. Let me know if you’re interested in this project. We might get something started.