Secure Cloud Storage with deduplication?
Last week, dropbox’s little problem made me realize how much I was trusting them to do things right, just because they once (admittedly, long ago) wrote they’d be using encryption such that they could not access my data, even if they wanted. The auth problem showed that today’s dropbox reality couldn’t be farther from that – apparently the keys are lying around and are in no way tied to user passwords.
So, following my own advice to take care about my data, I tried to understand better how the other services offering similar functionality to dropbox actually work.
Reading through their FAQs, there are a lot of impressive sounding crypto acronyms, but usually no explanation of their logic how things work – simple questions like what data is encrypted with what key, in what place, and which bits are stored where, remain unanswered.
I’m not a professional in security, let alone cryptography. But I can follow a logical chain of arguments, and I think providers of allegedly secure storage should be able to explain their stuff in a way that can be followed by logically thinking.
Failing to find these explanations, I tried the reverse. Below I try following logic to find out if or how adding sharing and deduplication to the obvious basic setup (private storage for one user) will or will not compromise security. Please correct me if I’m wrong!
Without any sharing or upload optimizing features, the solution is simple: My (local!) cloud storage client encrypts everything with a long enough password I chose and nobody else knows, before uploading. Result: nobody but myself can decrypt the data .
That’s basically how encrypted disk images, password wallet and keychain apps etc. work. More precisely, many of them use two-stage encryption: they generate a large random key to encrypt the data, and then use my password to encrypt that random key. The advantage is performance, as the workhorse algorithm (the one that encrypts the actual data) can be one that needs a large key to be reasonably safe, but is more efficient. The algorithm that encrypts the large random key with my (usually shorter) password doesn’t need to be fast, and thus can be very elaborate to gain better security from a shorter secret.
Next step is sharing. How can I make some of my externally stored files accessible to someone else, but still keeping it entirely secret from the cloud storage provider who hosts the data?
With a two stage encryption, there’s a way: I can share the large random key for the files being shared (of course, each file stored needs to have it’s own random key). The only thing I need to make sure is nobody but the intended receiver obtains that key on the way. This is easier said than done. Because having the storage provider manage the transfer, such as offering a convenient button to share a file with user xyz, this inevitably means I must trust the service about the identity of xyz. Best they can provide is an exchange that does not require the key be present in a decrypted form in the provider’s system at any time. That may be an acceptable compromise in many cases, but I need to be aware I need a proof of identity established outside the storage provider’s reach to really avoid they can possibly access my shared files. For instance, I could send the key in a GPG encrypted email.
So bottom line for sharing is: correctly implemented, it does not affect the security of the non-shared files at all, and if the key was exchanged securely directly between the sharing parties, it would even work without a way for the provider to access the shared data. With the one-button sharing convenience we’re used to, the weak point is that the provider (or a malicious hacker inside the providers system) could technically forge identities and receive access to shared data in place of the person I actually wanted to share with. Not likely, but possible.
The third step is deduplication. This is important for the providers, as it saves them a lot of storage, and it is a convenience for users because if they store files that other users already have, the upload time is near zero (because it’s no upload at all, the data is already there).
Unfortunately, the explanations around deduplication get really foggy. I haven’t found a logically complete explanation from any of the cloud storage providers so far. I see two things that must be managed:
First, for deduplication to work, the service provider needs to be able to detect duplicates. If the data itself is encrypted with user specific keys, the same file from different users looks completely different at the provider’s end. So neither the data itself, nor a hash over that data can be used to detect duplicates. What some providers seem to do is calculating a hash over the unencrypted file. But I don’t really undstand why many heated forum discussions seem to focus on whether that’s ok or not. Because IMHO the elephant in the room is the second problem:
If indeed files are stored encrypted with a secret only the user has access to, deduplication is simply not possible, even if detection of duplicates can work with sharing hashes. The only one who can decrypt a given file is the one who has the key. The second user who tries to upload a given file does not have (and must not have a way to obtain!) the first user’s key for that file by definition. So even if encrypted data of that file is already there, it does not help the second user. Without the key, that data stored by the first user is just garbage for him.
How can this be solved? IMHO all attempts based on doing some implicit sharing of the key when duplicates are detected are fundamentally flawed, because we inevitably run into the proof of identity problem as shown above with user-initiated sharing, which becomes totally unacceptable here as it would affect all files, not only explicitly shared ones.
I see only one logical way for deduplication without giving the provider a way to read your files: By shifting from proof-of-identity for users to proof-of-knowledge for files. If I can present a proof that I had a certain file in my possession, I should be able to download and decrypt it from the cloud. Even if it was not me, but someone else who actually uploaded it in the first place. Still everyone else, including the storage provider itself, must not be able to decrypt that file.
I now imagine the following: instead of encrypting the files with a large random key (see above), my cloud storage client would calculate a hash over my file and use that hash as the key to encrypt the file, then store the result in the cloud. So the only condition to get that file back would be having had access to the original unencrypted file once before. I myself would qualify, of course, but anyone (totally unrelated to me) who has ever seen the same file could calculate the same hash, and will qualify as well. However, for whom the file was a secret, it remains a secret .
I wonder if that’s what cloud storage providers claiming to do global deduplication actually do. But even more I wonder why so few speak clear text. It need not to be on the front page of their offerings, but a locigally conclusive explanation of what is happening inside their service is something that should be in every FAQ, in one piece, and not just bits spread in some forum threads mixed with a lot of guesses and wrong information!
 Of course, this is possible only within the limits of the crypto algorithms used. But these are widely published and reviewed, as well as their actual implementations. These are not absolutely safe, and implementation errors can make them additionally vulnerable. But we can safely trust that there are a lot of real cryptography expert’s eyes on these basic building blocks of data security. So my working assumption is that the used encryption methods per se are good enough. The interesting questions are what trade offs are made to implement convenience features.
 Note that all this does not help with another fundamental weakness of deduplication: if someone wants to find out who has a copy of a known file, and can gain access to the storage provider’s data, he’ll get that answer out of the same information which is needed for deduplication. If that’s a concern, there’s IMHO no way except not doing deduplication at all.