Secure Cloud Storage with deduplication?

Last week, dropbox's little problem made me realize how much I was trusting them to do things right, just because they once (admittedly, long ago) wrote they'd be using encryption such that they could not access my data, even if they wanted. The auth problem showed that today's dropbox reality couldn't be farther from that - apparently the keys are lying around and are in no way tied to user passwords.

So, following my own advice to take care about my data, I tried to understand better how the other services offering similar functionality to dropbox actually work.

Reading through their FAQs, there are a lot of impressive sounding crypto acronyms, but usually no explanation of their logic how things work - simple questions like what data is encrypted with what key, in what place, and which bits are stored where, remain unanswered.

I'm not a professional in security, let alone cryptography. But I can follow a logical chain of arguments, and I think providers of allegedly secure storage should be able to explain their stuff in a way that can be followed by logically thinking.

Failing to find these explanations, I tried the reverse. Below I try following logic to find out if or how adding sharing and deduplication to the obvious basic setup (private storage for one user) will or will not compromise security. Please correct me if I'm wrong!

Without any sharing or upload optimizing features, the solution is simple: My (local!) cloud storage client encrypts everything with a long enough password I chose and nobody else knows, before uploading. Result: nobody but myself can decrypt the data [1].

That's basically how encrypted disk images, password wallet and keychain apps etc. work. More precisely, many of them use two-stage encryption: they generate a large random key to encrypt the data, and then use my password to encrypt that random key. The advantage is performance, as the workhorse algorithm (the one that encrypts the actual data) can be one that needs a large key to be reasonably safe, but is more efficient. The algorithm that encrypts the large random key with my (usually shorter) password doesn't need to be fast, and thus can be very elaborate to gain better security from a shorter secret.

Next step is sharing. How can I make some of my externally stored files accessible to someone else, but still keeping it entirely secret from the cloud storage provider who hosts the data?

With a two stage encryption, there's a way: I can share the large random key for the files being shared (of course, each file stored needs to have it's own random key). The only thing I need to make sure is nobody but the intended receiver obtains that key on the way. This is easier said than done. Because having the storage provider manage the transfer, such as offering a convenient button to share a file with user xyz, this inevitably means I must trust the service about the identity of xyz. Best they can provide is an exchange that does not require the key be present in a decrypted form in the provider's system at any time. That may be an acceptable compromise in many cases, but I need to be aware I need a proof of identity established outside the storage provider's reach to really avoid they can possibly access my shared files. For instance, I could send the key in a GPG encrypted email.

So bottom line for sharing is: correctly implemented, it does not affect the security of the non-shared files at all, and if the key was exchanged securely directly between the sharing parties, it would even work without a way for the provider to access the shared data. With the one-button sharing convenience we're used to, the weak point is that the provider (or a malicious hacker inside the providers system) could technically forge identities and receive access to shared data in place of the person I actually wanted to share with. Not likely, but possible.

The third step is deduplication. This is important for the providers, as it saves them a lot of storage, and it is a convenience for users because if they store files that other users already have, the upload time is near zero (because it's no upload at all, the data is already there).

Unfortunately, the explanations around deduplication get really foggy. I haven't found a logically complete explanation from any of the cloud storage providers so far. I see two things that must be managed:

First, for deduplication to work, the service provider needs to be able to detect duplicates. If the data itself is encrypted with user specific keys, the same file from different users looks completely different at the provider's end. So neither the data itself, nor a hash over that data can be used to detect duplicates. What some providers seem to do is calculating a hash over the unencrypted file. But I don't really undstand why many heated forum discussions seem to focus on whether that's ok or not. Because IMHO the elephant in the room is the second problem:

If indeed files are stored encrypted with a secret only the user has access to, deduplication is simply not possible, even if detection of duplicates can work with sharing hashes. The only one who can decrypt a given file is the one who has the key. The second user who tries to upload a given file does not have (and must not have a way to obtain!) the first user's key for that file by definition. So even if encrypted data of that file is already there, it does not help the second user. Without the key, that data stored by the first user is just garbage for him.

How can this be solved? IMHO all attempts based on doing some implicit sharing of the key when duplicates are detected are fundamentally flawed, because we inevitably run into the proof of identity problem as shown above with user-initiated sharing, which becomes totally unacceptable here as it would affect all files, not only explicitly shared ones.

I see only one logical way for deduplication without giving the provider a way to read your files: By shifting from proof-of-identity for users to proof-of-knowledge for files. If I can present a proof that I had a certain file in my possession, I should be able to download and decrypt it from the cloud. Even if it was not me, but someone else who actually uploaded it in the first place. Still everyone else, including the storage provider itself, must not be able to decrypt that file.

I now imagine the following: instead of encrypting the files with a large random key (see above), my cloud storage client would calculate a hash over my file and use that hash as the key to encrypt the file, then store the result in the cloud. So the only condition to get that file back would be having had access to the original unencrypted file once before. I myself would qualify, of course, but anyone (totally unrelated to me) who has ever seen the same file could calculate the same hash, and will qualify as well. However, for whom the file was a secret, it remains a secret [2].

I wonder if that's what cloud storage providers claiming to do global deduplication actually do. But even more I wonder why so few speak clear text. It need not to be on the front page of their offerings, but a locigally conclusive explanation of what is happening inside their service is something that should be in every FAQ, in one piece, and not just bits spread in some forum threads mixed with a lot of guesses and wrong information!


[1] Of course, this is possible only within the limits of the crypto algorithms used. But these are widely published and reviewed, as well as their actual implementations. These are not absolutely safe, and implementation errors can make them additionally vulnerable. But we can safely trust that there are a lot of real cryptography expert's eyes on these basic building blocks of data security. So my working assumption is that the used encryption methods per se are good enough. The interesting questions are what trade offs are made to implement convenience features.

[2] Note that all this does not help with another fundamental weakness of deduplication: if someone wants to find out who has a copy of a known file, and can gain access to the storage provider's data, he'll get that answer out of the same information which is needed for deduplication. If that's a concern, there's IMHO no way except not doing deduplication at all.


The only one who really cares about your data is you!

Once more, yesterday's Dropbox authentication bug shows a fundamental weakness of centralized services. Dropbox is just a high profile example, but the underlying problem is that of unneeded centralisation.

Every teenager who starts using facebook is told how important it is to wisely choose what to put online and what not, and always be aware that nothing published on the internet can ever be completely deleted any more.

However, the way popular "cloud" services are built today unfortunately just ignore this, and choose a centralized implementation. Which means uploading everything first, and then trying (and sometimes sadly failing, like dropbox yesterday) to protect that data from unauthorized access.

Why? Because it is easier to implement. Yes, distributed systems are much harder to design and implement. But just choosing a centralized approach is inevitably generating single points of failure. I really don't think we can afford that risk for a long time any more.

It's not even only a technical problem. It's a mindset of delegating too much responsibility, which is fatal. Relying on a centralized storage to be "just secure" is delegating responsibility to others - responsibility that those others are unlikely to comply with.

The argument often goes: it's too hard for a smaller company to run their own servers and keep them secure, so better leave that to the big cloud providers, who are the experts and really care. That's simply not true. Like everyone else, they care about their profit. If they loose or expose your data, they care about the PR debacle this might be for them, but not for the data itself. The only one who really cares about what happens to your data - is you.

Even assuming the service provider was able to keep your data safe, there's another problem. As we have heard again in the discussion about Dropbox's TOS, there are legal constraints on what a cloud service may and may not do. For instance, they may not store your data encrypted such that "law enforcement" cannot access it under certain conditions. Which means that Dropbox can't offer encryption based on really private keys (only you have the key to your data, they don't) even if they wanted to.

What they could do, and IMHO must do in the long term, is offering a federated system. Giving you the choice to host the majority of data in a place where you are legally allowed to use strong encryption with your entirely private keys, such as your own server. Only for sharing with others, and only with the data actually being shared, smaller entities need to enter a bigger federation (which might be organized by a globally operating company).

That's how internet mail always worked - no mail sent among members of a organisation ever needs to leave the mail servers of that organisation. Same for Jabber/XMPP. This should become true for Dropbox, Facebook, Twitter etc. as well. They should really start structuring their clouds, and giving the option to keep critical data by yourself, without making this a decision against using the service at all.

Unfortunately, one of the few big projects that expressedly had federation on the agenda, Google Wave, has almost (but not entirely) disappeared after a big hype in 2009. Sadly, most probably exactly the fact that they did focus on federation and scalability so much and not on polishing their web interface, has made it a failure in the eye of the public.

Maybe we should really do away with that fuzzy term "the cloud" and start talking about small and big clouds, more and less private ones, and how and if they should or should not interact.

Still, one of the currently most opaque clouds is ahead of us - Apple's iCloud. Nothing at all was said in public about how security will work in the iCloud. And from what was presented, it seems for now it will have no cross-account sharing features at all.

The only thing that seem clear is that all of our data will be stored in that huge datacenter in North Carolina, so I guess that using iCloud when it launches in a few months will demand total trust on Apple to get it right (and as said above - this is a responsibility nobody can really take).

On the other hand, Apple could be foresighted enough to realize the need for federation in a later step, for example allowing future Time Capsules to act as in-house cloud server. After all, and unlike other players, Apple profits from selling hardware to us. And to base a speculation on my earlier speculation (still neither confirmed nor disproved), iCloud might be technically ready for that.

But whatever inherent motivation the big players may or may not have to improve the situation - it's up to us to realize there's no easy way around taking care of our data ourselves, and to ask for standards, infrastructure and services which make doing so possible.

iCloud sync speculation

Here's my last minute technical speculation what iCloud will be in terms of sync :-)

It'll be sync-enabled WebDAV on a large scale.

I spent the last 10 years working on synchronisation, in particular SyncML. SyncML is an open standard for synchronisation, created in 2000 by the then big players in the mobile phone industry together with some well known software companies.

SyncML remained a niche from a user perspective, despite the fact that almost every featurephone built in the last 9 years has SyncML built-in. And despite the fact that Steve Jobs himself pointed out Apples support for SyncML when he introduced iSync in July 2002 at Macworld NY.

As we have learnt by now, iSync (and with it, SyncML for the Apple universe) will be history with Lion. And featurephones are pretty much history as well, superseded by smartphones.

Unlike featurephones, smartphones never had SyncML built-in (a fact that allowed me earn my living by writing SyncML clients for these smartphone platforms...). The reason probably was that the vendors of the dominant smartphone operating systems, Palm and later Microsoft, already had their own, proprietary sync technologies in place (HotSync, ActiveSync). Only Symbian was committed to SyncML, but forced by the market share of ActiveSync-enabled enterprise software (Exchange) in 2005 they also licensed ActiveSync from Microsoft.

So did Apple for the iPhone. So did Google for Google calendar mobile sync. Third party vendors of collaboration server solutions did the same.

For a while, it seemed that the sync battle was won by ActiveSync. And by other proprietary protocols for other kinds of syncing, like dropbox for files, Google for docs, and a myriad of small "cloud" enabled apps which all do their own homebrew syncing.

Not a pleasant sight for someone like me who believes that seamless and standards based sync is as basic for mobile computing as IP connectivity was for the internet.

However, in parallel another standard for interconnecting calendars (not exactly syncing, see below) grew - CalDAV. CalDAV is an extension of WebDAV, which adds calendar specific queries and other functionality to WebDAV. And WebDAV is a mature and widely deployed extension of HTTP to allow clients not only reading from a web server, but also writing to it. Apple is a strong supporter of WebDAV since many years (iDisk is WebDAV storage), and is also a driving force behind CalDAV. Mac OS 10.5 Leopard and iOS 3.0 support CalDAV. And more recently, Apple implemented CardDAV in iOS 4.0 and proposed it as an internet draft to IETF, to support contact information the same way as CalDAV does for calendar entries.

This is all long and well known, and CalDAV is already widely used by many calendaring solutions.

There's one not-so-well-known puzzle piece however. I stumbled upon it a few month ago because I am generally interested in sync related stuff. But only now I realized it might be the rosetta stone for making iCloud. I did some extra googling today and found some clues that fit too nicely to be pure coincidence.

The puzzle piece is this: An IETF draft called "Collection synchronisation for WebDAV" [Update - by March 2012 it has become RFC6578]. The problem with WebDAV (and CalDAV, CardDAV) is that it is was designed as an access method, but not a sync method. While it is well possible to sync data via WebDAV, it does not scale well with large sync sets, because a client needs to browse through all the information available first just to detect the changes. With a large sync sets with possibly many hundred thousand files (think of your home folder) that's simply not working. The proposed extension fixes exactly this problem, and makes WebDAV and its derivates ready for efficient sync of arbitrarily huge sync sets, by making the server itself keep track of changes and report them to interested clients.

With this, a WebDAV based sync infrastructure reaching from small items like contacts and calendar entries to large documents and files (hello dropbox!) is perfectly feasible. Now why should iCloud be that infrastructure? That's where I started googling today for this blog entry.

I knew that the "Collection synchronisation for WebDAV" proposal was coming from Apple. But before I didn't pay attention to who was the author. I did now - it's Cyrus Daboo, who spent a lot of time writing Mulberry, an email client dedicated to make best possible use of the IMAP standard. Although usually seen as just another email protocol, IMAP is very much about synchronisation at a very complex level (because emails can be huge, and partial sync of items, as well as moving them wildly around within folder hierarchies must be handled efficiently), so Cyrus is certainly a true sync expert, with a lot of real-world experience. He joined Apple in 2006. Google reveals that he worked on the Calendar Server (part of Mac OS X server supporting CalDAV and CardDAV), and also contributed to other WebDAV related enhancements. It doesn't seem likely to me they hired him (or he would let them hire him) just for polishing the calendar server a bit...

Related to the imminent release of iCloud, I found a few events interesting: MobileMe users had to migrate to a new CalDAV based Calendar by May 11th, 2011. And just a month earlier, Cyrus issued the "WebDAV sync informal last call" before submitting the "Collection synchronisation for WebDAV" to IETF, and noted that there are "already several client and server implementations of this draft now". And did you notice how the iOS iWork apps just got kind of a document manager with folders? After becoming WebDAV aware only a few months ago?

So what I guess we'll see today:

  • a framework in both iOS5 and Mac OS X Lion which nicely wraps WebDAV+"Collection synchronisation for WebDAV" in a way that makes permanent incremental syncing for all sort of data a basic service of the OS every app can make use of.
  • a cloud based WebDAV+Sync storage - the iCloud
  • a home based WebDAV+Sync storage - new TimeCapsules and maybe AirPorts
  • and of course a lot of Apple Magic around all this. Like Back-to-my-Mac and FaceTime are clever mash-ups of many existing internet standards to make them work "magically", there will be certainly more to iCloud than just a WebDAV login (let alone all the digital media locker functionality many expect).

In about 5 hours we'll hopefully know...