Monday, September 04, 2006

The Negated Database

The Economist has an interesting article on a new cryptography idea: instead of securing the data with encryption that may be compromised, why don't we just store the negation of the data? i.e. The database contains everything except the intended data (if the data space is small, we may need to put some extra paddings to increase the entropy). Say a database of 8-digit phone numbers, that database will contain all 8-digit numbers, except those valid ones. So to check whether a phone number is valid, we need to make sure that number is not in the database.

The advantage of this could be the leakage of confidential data may be contained if the said database is breached. However, I can think of the following shortcomings:

  1. Any change of data formats (e.g. adding one more field) will warrant a respawning of the whole negative database. This will not be trivial if the data space is big enough.
  2. Searching and sorting will be a challenge as special algorithms must be in place to ensure the database is responsive enough.
I feel the underlying theory has something to do with entropy: This negative database has a much higher entropy than the normal ones, and therefore it will take more effort to deal with it (this applies for legitimate and illegitimate users!).

Update:
Author's homepage.
He has a paper with more information about this scheme, however it seems my two concerns are still not addressed...

2 comments:

Jimmy L. said...

Wah Chai, 2nd entry! :P My first long comment :)

I don't think the change of data would result in the TOTAL regeneration of negative data. You just need to identify the 'hole' and modify the current database's hole to that it includes the 'hole'.

'Entropy' in both database must be equivalent because 'entropy' == 'information'. If they are different, it means we are representing different things.

The idea is that if the database is to be breached... we gotta breach the ENTIRE database to know the real data. Piecewise stealing won't work.

It's like a bit mask -- we mask what we do not need. It doesn't work for general cases, only for finite sets that are possible to process in a 'convenient' time interval. The main issue is processing time, especially for char strings. Each char is 2^8. A char field of 20 chars ... we have 2^28 possibilities. -_-;; Now...

Cuppa Chai said...

You have a point, it boils down to how the data are represented in the database. But even filling in the 'holes' in the negative database alone is a daunting task.