Why Companies Should Perfect The Art of Lying — In Order to Preserve Your Privacy

The Catch-22 of data

Catch-22 … companies increasingly want to mine your data … but they face massive fines for breaches of personal data. Step forward … the art of lying … that age-old method of protecting our information. But this time it’s the mathematical art of lying.

With increasing fines for data breaches and the release of PII (Personally Identifiable Information), companies will have to get a whole lot smarter in the way that they gather data from users. One way is to use differential privacy, where we get a user/device/service to actually lie.

If you are interested, we have just published a paper on how we can use the RAPPOR method to preserve privacy [here]:

The problem with k-anonymity

Increasingly within Big Data we gather data sets which can contain sensitive information. For example, if we have a group of students who have taken a test, we might want to allow someone to determine the average grade for the cohort. But can we allow someone to determine this average without being able to determine the actual grade of each individual student?

The way we normally protect against this is to use k-anonymity, where we restrict any query on the sensitive data to a minimum of k records.

So let’s say that Bob, Alice and Eve have taken a test, and we set k=2. Can we still determine the marks of Bob, Alice and Eve?

Now Mallory tries to determine the grades of Bob, Alice and Eve, but the system limits his searches to at least two records. So he runs a query on Average(Bob, Alice) and gets 20%, then Average(Bob, Eve) and gets 30%, and then Average(Alice, Eve) and gets 40%. From these three equations it is possible to recover the marks of Bob, Alice and Eve as 10%, 30% and 50%, respectively [here] (a short sketch after the list below shows how). Here are other examples:

  • Bob/Alice average of 20%, Bob/Eve average of 30%, and Alice/Eve average of 40%. Try!
  • Bob/Alice average of 80%, Bob/Eve average of 60%, and Alice/Eve average of 20%. Try!
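
To see how the leak works, here is a minimal Python sketch (the function name is my own, and the marks are just the hypothetical values from the example above) that recovers each individual mark from the three pairwise averages:

# Recover individual marks from the three pairwise averages.
# (B+A)/2 + (B+E)/2 + (A+E)/2 = B + A + E, so each mark falls out by subtraction.
def recover_marks(avg_ba, avg_be, avg_ae):
    total = avg_ba + avg_be + avg_ae   # equals B + A + E
    bob = total - 2 * avg_ae           # subtract A + E
    alice = total - 2 * avg_be         # subtract B + E
    eve = total - 2 * avg_ba           # subtract B + A
    return bob, alice, eve

print(recover_marks(20, 30, 40))       # (10, 30, 50)

The same function can be applied to the other pairs of averages in the list above.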

Another way to defeat k-anonymity is to difference two queries: one which includes a person and one which excludes them. For example, if we want to determine whether Alice supports Manchester Utd, we perform a query of:

Number of people who support Manchester Utd?

and then:

Number of people who support Manchester Utd who are not Alice?

If the two values are equal, Alice does not support Manchester Utd; if they differ by one, she does.
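
As a minimal sketch of this differencing attack (the names and the supporter data here are made up purely for illustration):

# Two "allowed" counting queries which, when compared, reveal Alice's answer.
supporters = {"Alice": True, "Bob": False, "Eve": True}   # hypothetical data

q1 = sum(1 for person, fan in supporters.items() if fan)                        # all supporters
q2 = sum(1 for person, fan in supporters.items() if fan and person != "Alice")  # supporters who are not Alice

print("Alice supports Manchester Utd" if q1 == q2 + 1 else "Alice does not support Manchester Utd")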

Differential privacy with noise

With differential privacy we aim to add noise to the results of a query so that it is not possible to tell whether any single individual’s record is in the dataset. For this we normally use a Laplace distribution to generate the noise [here]:

https://asecuritysite.com/comms/dist2?mu=0.125&sig=0.3&samples=10000

Now let’s say that we have Bob, Alice, Eve, Mallory, Carol, Trent and Fred, and that the number of times they have accessed a certain site is 2, 3, 4, 5, 6, 7 and 8, respectively. We want to allow someone to determine the average number of accesses, but preserve Fred’s data.

Now if we determine the average of all seven values we get 5. If we take away Fred’s value of 8, the average of the remaining six values is 4.5. Comparing the two results (7 × 5 − 6 × 4.5 = 8) gives away Fred’s value.

But now let’s add some Laplace noise to the six remaining values. For example, if we take a mu value of 0.125 and a scale (lambda) value of 0.3, we get the following noise values [here]:

-0.19323 -0.1816416 -0.25693139 0.6917717 -0.02466541 0.4846251

When we add these to the values of 2, 3, 4, 5, 6 and 7, we get:

1.80677 2.8183584 3.74306861 5.6917717 5.97533459 7.4846251

and now when we take the average we get approximately 4.59, which is close to the true value of 4.5, but the noise means that comparing this result with the full average no longer reveals Fred’s exact value.
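
Here is a short sketch of the same calculation using NumPy’s Laplace sampler; the exact noise values will differ from those quoted above, as each run draws fresh random samples:

# Differential privacy sketch: add Laplace noise to the values before averaging.
import numpy as np

values = np.array([2, 3, 4, 5, 6, 7, 8])     # Bob, Alice, Eve, Mallory, Carol, Trent, Fred
print(values.mean())                          # 5.0 -- the full data set

without_fred = values[:-1]                    # remove Fred's value of 8
print(without_fred.mean())                    # 4.5 -- exact, and so leaks Fred's value

noise = np.random.laplace(loc=0.125, scale=0.3, size=without_fred.size)
print((without_fred + noise).mean())          # roughly 4.5, but the exact difference is masked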

The art of lying

Let’s take an example. If you have 10 friends on Facebook, and you want to know the split of male vs female, how can you do this without actually asking them to tell you their gender?

Well, we get them to toss a coin. If it is heads, they give a random answer (which may or may not be the truth); if it is tails, they tell the truth. So here we go:

Friend 1 (Male). Tosses coin: Tails. Tell: Male (Truth!)
Friend 2 (Male). Tosses coin: Tails. Tell: Male (Truth!)
Friend 3 (Male). Tosses coin: Heads. Tell: Female (Lie!)
Friend 4 (Female). Tosses coin: Tails. Tell: Female (Truth!)
Friend 5 (Female). Tosses coin: Heads. Tell: Female (Truth!)
Friend 6 (Female). Tosses coin: Heads. Tell: Male (Lie!)
Friend 7 (Male). Tosses coin: Tails. Tell: Male (Truth!)
Friend 8 (Male). Tosses coin: Tails. Tell: Male (Truth!)
Friend 9 (Male). Tosses coin: Heads. Tell: Female (Lie!)
Friend 10 (Female). Tosses coin: Heads. Tell: Male (Lie!)

Note: on a heads, the random answer has a 50/50 chance of being male or female, so it can coincide with the truth (as with Friend 5).

So the number of heads is the same as the number of tails (which is expected), and the reported results give 6 males and 4 females, which is the same as the actual cohort, but we can’t tell who was actually telling the truth. This is the basis of privacy-preserving methods within Big Data analysis, where we lie (or add noise).
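
Below is a minimal simulation of this coin-flip scheme (randomised response). The cohort is the one from the example, and because a truthful answer is given half the time and a random answer the other half, the true male fraction can be estimated as 2 × (reported male fraction) − 0.5:

# Randomised response: heads -> random answer, tails -> truthful answer.
import random

def reported_gender(true_gender):
    if random.random() < 0.5:                    # heads: answer at random
        return random.choice(["Male", "Female"])
    return true_gender                           # tails: tell the truth

cohort = ["Male"] * 6 + ["Female"] * 4           # the ten friends above
reports = [reported_gender(g) for g in cohort]

reported_male = reports.count("Male") / len(reports)
print(reports.count("Male"), "male,", reports.count("Female"), "female reported")
print("Estimated true male fraction:", 2 * reported_male - 0.5)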

I have illustrated the concept here with eight people:

Adding noise

One of the best ways to do this is with the Google RAPPOR method, which uses hashing and the addition of noise. For this we have a k-bit Bloom filter, and then three steps in the gathering of information.

Bloom filter

Before we look at RAPPOR, let’s look at Bloom filters. For this we take a number of hash values of a string and then set the corresponding bits in the Bloom filter. For example:

01234567890123456789012345678901
Add fred: 00000000000000100000010000000000 fred [21,14]
Add bert: 00000000100000100000010000000100 bert [29,8]
Add greg: 00000000100100100000011000000100 greg [11,22]

If we now search for “fred” we will see that bits 21 and 14 are set, so “fred” may be in the Bloom filter. After adding all three names, we have bit positions 8, 11, 14, 21, 22 and 29 set.

Now we can test for amy:
amy is not there [16,12]
Now we can test for greg:
greg may be in there [11,22]
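
For completeness, here is a minimal Bloom filter sketch in Python. The hash construction (bytes of a SHA-256 digest taken modulo 32) is an assumption for illustration, so it will not reproduce the exact bit positions quoted above:

# A 32-bit Bloom filter with two hash-derived positions per word.
import hashlib

M = 32                                           # filter size in bits

def positions(word, num_hashes=2):
    digest = hashlib.sha256(word.encode()).digest()
    return [digest[i] % M for i in range(num_hashes)]

bits = set()
for name in ["fred", "bert", "greg"]:
    bits.update(positions(name))                 # set the bits for each added word

def maybe_present(word):
    return all(p in bits for p in positions(word))

print(maybe_present("greg"))                     # True -> greg may be in there
print(maybe_present("amy"))                      # most likely False -> amy is not there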

and here is a video presentation:

Lying is good …

The first part of the process is the signal step, which involves taking the message, hashing it with a number of hash functions (h), and setting the corresponding bits in a Bloom filter (B).

Next we generate the Permanent Randomized Response (PRR), which will be memorized on the system. For each of the bits (0 to k−1), we create a fake Bloom filter (B_fake), which is set as:

B_fake[i] = 1 for a probability of 0.5 x f
B_fake[i] = 0 for a probability of 0.5 x f
B_fake[i] = B[i] for a probability of 1 - f

So if f = 0.5, we set a fake 0 with a 1-in-4 chance, a fake 1 with a 1-in-4 chance, and the actual bit from the Bloom filter (B) appears in the PRR with a 1-in-2 chance. This is equivalent to a coin flip, where if it is heads we lie about the result, and if it is tails we tell the truth.

This creates a fake Bloom filter with noise (based on the coin flip), where half the time we fake the bit, and the other half of the time we keep the correct bit. This bit array is memorized for future reports.
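
A sketch of the PRR step, following the probability rules above (the function name and the bit-list representation are my own):

# Permanent Randomised Response: randomise each bit of the Bloom filter once.
import random

def make_prr(bloom_bits, f=0.5):
    prr = []
    for b in bloom_bits:
        r = random.random()
        if r < 0.5 * f:
            prr.append(1)        # fake 1, with probability 0.5 * f
        elif r < f:
            prr.append(0)        # fake 0, with probability 0.5 * f
        else:
            prr.append(b)        # the real bit, with probability 1 - f
    return prr

# The result would be memorised and reused for every future report of the same value.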

When the data is to be reported, we perform the Instantaneous Randomized Response (IRR), which takes B_fake and creates S using:

If the bit in B_fake is set to 1, 
the corresponding bit in S is set to 1 with probability q.
If the bit in B_fake is set to 0,
the corresponding bit in S is set to 1 with probability p.
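
And a matching sketch of the IRR step; the values of q and p here are chosen purely for illustration:

# Instantaneous Randomised Response: a fresh randomisation is drawn for every report.
import random

def make_irr(prr_bits, q=0.75, p=0.5):
    return [1 if random.random() < (q if b == 1 else p) else 0 for b in prr_bits]

print(make_irr([1, 0, 0, 1, 0, 1, 0, 0]))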

Google illustrates the process as follows [here]:

Let’s take an example of some input data values:

client,cohort,value
1,24,'hello'
1,24,'hello'
2,34,'goodbye'
1,43,'hello'
1,40,'help'

This then produces [here]:

client,cohort,bloom,prr,irr
1,24,1000101000100000,0001001000110000,1011011011111100
1,24,1000101000100000,0001001000110000,1001001011001001
2,34,0000000001000111,0000000001011110,1000011100100111
1,43,0001000000100001,0101001110110011,1001001010101000
1,40,0010010000010000,0001110100011010,0101011111100010

We can see for the same client and cohort that we produce the same Bloom filter (1000101000100000), but the PRR has noise applied to it (0001001000110000). Finally, the IRR values differ even though we have the same input (1011011011111100 and 1001001011001001). In this way we cannot infer the input values.

If we use freq=1, all the bits will be random:

If we look in more detail at the first entry, we can see the bits that have changed (where an X marks a change), and we see that 8 bits are changed and 8 are not (so we see a random pattern):

1000101000100000
0001011001111001
X--XXX---X-XX--X

and with f=0.5, we have a 1 in 2 chance of a fake bit:

This time we only see four bits changing:

1000101000100000
0001001000110000
X--XX------X----
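
As a quick sanity check, a couple of lines of Python reproduce the diff pattern above:

bloom = "1000101000100000"
prr   = "0001001000110000"

print("".join("X" if a != b else "-" for a, b in zip(bloom, prr)))   # X--XX------X----
print(sum(a != b for a, b in zip(bloom, prr)))                       # 4 changed bits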

and with freq=0 all of the bits from B come through, and there are no random bits in B_fake:

client,cohort,bloom,prr,irr
1,24,1000101000100000,1000101000100000,1010101100000111
1,24,1000101000100000,1000101000100000,0000010111111000
2,34,0000000001000111,0000000001000111,1101010010001011
1,43,0001000000100001,0001000000100001,0000001011010111
1,40,0010010000010000,0010010000010000,1010111011001001

And there you go, almost perfect privacy. Try here.

Conclusions

GDPR has arrived! The days of companies mass harvesting data from users are gone. If a company wants to gather data, it will have to make sure that an individual cannot be spotted within the data, otherwise it has breached the citizen’s right to privacy. I appreciate that many companies address this by sending the user a generic consent statement, but that is using a sticking plaster to fix a hole in a dam.

Companies need to re-design and re-architect their systems, and put privacy and security at their core. Google has shown that it is keen to do this. Apple, on the other hand, seems to have a flawed implementation of differential privacy, which reveals a little too much, and where errors can actually cause problems (such as in health care applications).