Do You Trust The Government To Link Your Data Across Its Systems? Demand (UN)Linkable IDs!

We have data sources which can span across the public sector. This might relate to health records, tax records, and so on. The nightmare…

Do You Trust The Government To Link Your Data Across Its Systems? Demand (Un)Linkable IDs!

Demo of the method used in this article: [here]

We have data sources which can span across the public sector. This might relate to health records, tax records, and so on. The nightmare situation for many is where governments create a data lake, and where all of your data is linked and merged. This would allow your GP to see your tax return, and the tax inspect the chance to see your health record. An improved solution is thus for citizens to own their own identity, and then allow them to link it between disparate systems.

So, let’s say that you don’t trust your government to match your identity across different data systems. Two of the best method proposed for this matching are known as CL-15 and CL-17:

[CL-15] Camenisch, Lehmann. (Un)linkable Pseudonyms for Governmental Databases. CCS’15. [Link]
[CL-17] Camenisch , Lehmann. Privacy- preserving User-auditable Pseudonym Systems. IEEE EuroSP’17.

Within these papers, Camenisch and Lehmann outline a way that the user can own their own ID (UID), and then to store a pseudonym on System A (PsIDA), and another one on System B (PsIDB). A controller then links the two IDs together, but cannot tell the identity of the person involved.

In this case System A has a value of xa, and System B has a value of xb. Next the user ID (UID) is then used to generate the pseudo ID in each system. For System A the pseudo ID will be UID^{xa}, and in System B it will be UID^{xb}. Because the security of discrete logs, neither System A nor System B can determine the value of UID:

Now we want to link Bob from System A to System B. First System A encrypts is pseudo-ID (PsIDA) with the public key of System B:

Next we use homomorphic encryption to raise this cipher to the power of xb/xa:

This is the same as:

The controller can then pass this cipher value to System B, and through the wonder of homomorphic encryption we get:

System B can then use its private key to get:

And thus System B will know the pseudo ID, but the controller or System A cannot tell the ID stored on System A. Isn’t that magic? A sample run is [here]:

UID:        43
ID1:        929293739471222707
ID2:        21611482313284249
xa:     11
xb:     10

e:      65537
N:      509628704099738411389114128951088319
p:      826182604440061657
q:      616847536320538967
RSA Encryption parameters. Public key: [e,N].

Cipher1:    458944241885995240133251377203010898
Derived ID2: 21611482313284249

And we see the magic, and where ID1 has been transformed into ID2 using RSA encryption alongside homomorphic encryption! The source code in RSA is:

import pyprimes
import sys

import random
import math

def extended_euclidean_algorithm(a, b):
    """
    Returns a three-tuple (gcd, x, y) such that
    a * x + b * y == gcd, where gcd is the greatest
    common divisor of a and b.

    This function implements the extended Euclidean
    algorithm and runs in O(log b) in the worst case.
    """
    s, old_s = 0, 1
    t, old_t = 1, 0
    r, old_r = b, a

    while r != 0:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_s, s = s, old_s - quotient * s
        old_t, t = t, old_t - quotient * t

    return old_r, old_s, old_t

def inverse_of(n, p):
    """
    Returns the multiplicative inverse of
    n modulo p.

    This function returns an integer m such that
    (n * m) % p == 1.
    """
    gcd, x, y = extended_euclidean_algorithm(n, p)
    assert (n * x + p * y) % p == gcd

    if gcd != 1:
        # Either n is 0, or p is not a prime number.
        raise ValueError(
            '{} has no multiplicative inverse '
            'modulo {}'.format(n, p))
    else:
        return x % p
def rabinMiller(n):
     s = n-1
     t = 0
     while s&1 == 0:
         s = s/2
         t +=1
     k = 0
     while k<128:
         a = random.randrange(2,n-1)
         #a^s is computationally infeasible.  we need a more intelligent approach
         #v = (a**s)%n
         #python's core math module can do modular exponentiation
         v = pow(a,s,n) #where values are (num,exp,mod)
         if v != 1:
             i=0
             while v != (n-1):
                 if i == t-1:
                     return False
                 else:
                     i = i+1
                     v = (v**2)%n
         k+=2
     return True

def isPrime(n):
     #lowPrimes is all primes (sans 2, which is covered by the bitwise and operator)
     #under 1000. taking n modulo each lowPrime allows us to remove a huge chunk
     #of composite numbers from our potential pool without resorting to Rabin-Miller
     lowPrimes =   [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97
                   ,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173,179
                   ,181,191,193,197,199,211,223,227,229,233,239,241,251,257,263,269
                   ,271,277,281,283,293,307,311,313,317,331,337,347,349,353,359,367
                   ,373,379,383,389,397,401,409,419,421,431,433,439,443,449,457,461
                   ,463,467,479,487,491,499,503,509,521,523,541,547,557,563,569,571
                   ,577,587,593,599,601,607,613,617,619,631,641,643,647,653,659,661
                   ,673,677,683,691,701,709,719,727,733,739,743,751,757,761,769,773
                   ,787,797,809,811,821,823,827,829,839,853,857,859,863,877,881,883
                   ,887,907,911,919,929,937,941,947,953,967,971,977,983,991,997]
     if (n >= 3):
         if (n&1 != 0):
             for p in lowPrimes:
                 if (n == p):
                    return True
                 if (n % p == 0):
                     return False
             return rabinMiller(n)
     return False

def generateLargePrime(k):
     #k is the desired bit length
     r=100*(math.log(k,2)+1) #number of attempts max
     r_ = r
     while r>0:
        #randrange is mersenne twister and is completely deterministic
        #unusable for serious crypto purposes
         n = random.randrange(2**(k-1),2**(k))
         r-=1
         if isPrime(n) == True:
             return n
     return "Failure after "+`r_` + " tries."

import random

val=0
bits=60

p= generateLargePrime(bits)
q= generateLargePrime(bits)

n=p*q

e=65537

PHI=(p-1)*(q-1)

d=inverse_of(e,PHI)

uid = 11
xa = 23
xb = 7

if (len(sys.argv)>1):
    uid=int(sys.argv[1])
if (len(sys.argv)>2):
    xa=int(sys.argv[2])
if (len(sys.argv)>3):
    xb=int(sys.argv[3])

id1=pow(uid,xa,n)
id2=pow(uid,xb,n)

print "ID1:\t\t",id1

print "ID2:\t\t",id2
print "xa:\t\t",xa
print "xb:\t\t",xb

print "\ne:\t\t",e
print "N:\t\t",n
print "p:\t\t",p
print "q:\t\t",q

print "RSA Encryption parameters. Public key: [e,N]."

cipher1=pow(id1,e,n)

print "\nCipher1:\t",cipher1

val=pow(cipher1,xb,n)
val=pow(val,inverse_of(xa,PHI),n)

val=pow(val,d,n)
print "Derived ID2:",val

And there you go! Governments should NOT be linking citizen IDs across systems with common identifiers, and should be using pseudo IDs to identify each citizen on each data source. In the best case the citizen owns their own ID, and where they have control of the linking process.

The following gives an outline of the method we have defined here:

Last week, Will Abramson and myself chatted with Jan Camenisch, and it was such a treat to speak to someone with such great vision, and who also has the skills to implement things. You will see Camenisch’s work in Hyperledger Fabric:

MSP Implementation with Identity Mixer - hyperledger-fabricdocs master documentation
Idemix authentication technology provides the trust model and security guarantees that are similar to what is ensured…hyperledger-fabric.readthedocs.io