Programming Communications: Securing data on AppEngine datastore and other clouds. Encryption? Then what?

Recently there is a lot of discussion on programming sites about encrypting data on cloud environments like App Engine. This post is a Sunday evening rumination on the topic and a small experiment on the concept. Test code is available at the end of the post.

Assuming the following,

1) You have decided to bear the extra computing to store encrypted data in your AppEngine datastore table.

2) You have decided to bear the extra space and computing that may be needed to subsequently process your encrypted data. After all you just don't want to store the data. You want to be able to access it on some query.

If we encrypt the stuff we are putting on to AppEngine like infrastructure, how do we compare data on tables to answer queries? A thought is to hash the data bytes too and store it in a shadow table. Then use a one dimensional pattern recognition algorithm to compare.

Hashing your data: Suppose you have 10 bytes of data. You encrypted it and put it on your AppEngine table. Hash the 10 bytes taking 2 consecutive bytes at a time. i.e if your data is say "abcdef" hash(a,b) then hash(b,c) and so on. Store this sequence of hash on another table possibly with a foreign key reference. If you use MD5 hashing then, it mean you end up with 9 * 16 bytes of hash for 10 bytes of data. But as per the assumption you have decided to go this far to protect data on your cloud. This type of hashing (using signature of) 2 consecutive characters/bytes comes in Donald Knuth's art of programming book.

Comparing data now comes down to comparing the hashes generated above. Hashes of the same 2 bytes should match. So how similar are the two hashes can be answered by one dimensional pattern recognition? After hashing the original table data and input data, we have 2 byte sequences with negative, positive and zero values. They can be compared by finding the maximum contiguous sum on each byte sequence. If the two sums are same or near then the underlying data is also same or similar. This throws up the question of what does 'near' mean.

A small test: We have two strings. We generate a hash for each of the strings. The hash is a sequence of bytes. It is obtained by applying MD5 hash on every 2 consecutive bytes of the strings. From the output we can see that this could work. But, how to compare two byte sequences and its implementation on the underlying database also comes into question.

Snapshot of hashing code is here. This type of using signatures for document matching is mentioned in Knuth's art of programming.