Random versus Encrypted Data

A copy of this paper can be obtained in PDF format here.


What does it mean for information to be leaked? What is information anyways? If we come across some data, can we tell if it contains information or not? Is it even meaningful to talk about data which carries zero information? These are important questions when it comes to encrypting file systems, because, as I will show below, there are small amounts of information leaked even in our best encrypted system.

If we start by defining information as data containing a message then the opposite of information, or zero information, is random data, ie, data which may conform to some spectral parameters but does not exhibit any correlation between data elements. In order for data to carry a message, elements in the sequence must be correlated in some way. For example, "c" followed by "a" followed by "t" conveys the message "cat" --- the order is important as is the proximity of the characters, and within the canon of the English language, this message elicits the image of a fuzzy creature which purrs. This correlation, extended over the data sequence, gives the message. Since "random" by definition means that there are no correlation, there can be no message. The ideal then in encryption would be to create a system that cannot be distinguished by any means and under any circumstances from random data. I'm not sure this is even possible, but I can show ways in which we fall short of the ideal.

To illustrate these points, consider the following two strings of random hex digits:



Which sting contains random data and which contains encrypted data? This is a trick question because the truth is, neither does. The first string was generated using

echo "Hi Mom" | aespipe | xxd -ps

with password "asdfjklqweruiopzxcvnm" while the second was generated using

dd if=/dev/urandom count=16 bs=1 | xxd -ps

which is a sequence calculated by the kernel's non-blocking pseudo-random number generator (PRNG). PRNG's are similar to encryption algorithms in that they use mathematical formulas to generate subsequent elements of the sequence. They conform to certain spectral requirements (eg heads or tails are equally probably) and endeavor to disguise correlation, and so they act as decent approximations to random number sequences. A better way to generate random numbers is to use random physical events, like rolling dice or weather fluctuations. The kernel provides /dev/random that gathers it entropy (ie randomness) from the hardware; however, since this requires system activity to accumulate before it can deliver those random numbers, it blocks. Try using dd if=/dev/random instead of /dev/urandom in the examples below and you'll see how annoying the wait can be! I'll use urandom throughout, but to make these true tests, we should use random.

A few more points to note before moving one:

1) It should not matter whether the attacker knows what cypher was used in encrypting the message --- in the above case I used 128-bit AES. The message is safe unless he also knows the secret password. A cracked cipher is one in which the attacker can obtain the clear message without the secret in a reasonable about of time. Mathematically,

decryption = F( encrypted message, cipher, secret )

runs in a reasonable amount of cpu time and gives garbage unless the right secret is supplied. In our example,

echo -n "b128755656bfb10f8bed1c06614afd55" | xxd -r -ps | aespipe -d

returns "Hi Mom" only when we supply the right secret. It uses only fractions of a second of cpu time and prepending "time" to the above command gives:

real 0m7.041s
user 0m0.000s
sys 0m0.028s

In contrast, a function like this

decryption = G( encrypted message, cipher )

which runs in a reasonable amount of cpu time does not exist for a good cipher. If it did and were found, then we would say that the cipher is cracked. Since the key size is 128-bits, this means there are 2^128 = 3.4x10^38 possible keys to try and one and only one will decrypt the message. Since each attempt takes about 0.03s this means a brute force attack would take about 10^37s = about 10^19 lifetimes of the known universe. To cover this key space, aespipe insists on a password of 20 chars long or more. Including upper/lower/numbers/special chars, this means 94 possibilities for each char and 94^20 = 2.9x10^39. Its up to the user now to choose randomly within this password spaces --- good luck! Since humans prefer easy to remember passwords to hard, the actual distribution of chosen passwords is not uniform and so brute force attacks, like John the Ripper", proceed by trying more probable passwords before less probably ones, leading to a dictionary attack.

2) Since knowing the cipher should not give the attacker any advantage, some implementations, like the luks extension to dm-crypt announce it in a header. Here's an example obtained using the following commands:

dd if=/dev/urandom of=dmcrypt-luks-A.bmp count=1 bs=1M
losetup /dev/loop0 dmcrypt-luks-A.bmp
cryptsetup luksFormat /dev/loop0
hexdump -C /dev/loop0 | more

00000000 4c 55 4b 53 ba be 00 01 61 65 73 00 00 00 00 00 |LUKS....aes.....|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 63 62 63 2d 65 73 73 69 |........cbc-essi|
00000030 76 3a 73 68 61 32 35 36 00 00 00 00 00 00 00 00 |v:sha256........|
00000040 00 00 00 00 00 00 00 00 73 68 61 31 00 00 00 00 |........sha1....|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 00 00 04 08 00 00 00 10 |................|
00000070 ca 4e ae 89 ce d0 83 9e 6b aa 2d 74 8d ae 9d 56 |.N......k.-t...V|
00000080 2e f1 63 81 69 18 a9 99 b5 cf 33 47 0e 03 f3 bd |..c.i.....3G....|
00000090 a0 b8 15 65 85 2e 6e f3 22 a1 7a 2e 2d a4 f6 50 |...e..n.".z.-..P|
000000a0 73 69 07 ff 00 00 00 0a 39 62 37 62 62 33 64 32 |si......9b7bb3d2|
000000b0 2d 37 62 31 63 2d 34 66 63 39 2d 39 32 33 36 2d |-7b1c-4fc9-9236-|
000000c0 39 36 61 61 64 31 36 64 62 36 35 65 00 00 00 00 |96aad16db65e....|
000000d0 00 ac 71 f3 00 04 18 31 ee c0 18 69 ed 2b d9 5d |..q....1...i.+.]|
000000e0 6e 55 b0 53 66 8d a1 13 e9 75 a9 80 fd 0e bb 38 |nU.Sf....u.....8|
000000f0 6c c5 4b 40 b9 c7 ee 2d 00 00 00 08 00 00 0f a0 |l.K@...-........|

The hexdump is pretty clear, but to make it even easier cryptsetup provides a utility which will present this information in a human readable form. The following

cryptsetup luksDump /dev/loop0


LUKS header information for /dev/loop0

Version: 1
Cipher name: aes
Cipher mode: cbc-essiv:sha256
Hash spec: sha1
Payload offset: 1032
MK bits: 128
MK digest: ca 4e ae 89 ce d0 83 9e 6b aa 2d 74 8d ae 9d 56 2e f1 63 81
MK salt: 69 18 a9 99 b5 cf 33 47 0e 03 f3 bd a0 b8 15 65
85 2e 6e f3 22 a1 7a 2e 2d a4 f6 50 73 69 07 ff
MK iterations: 10
UUID: 9b7bb3d2-7b1c-4fc9-9236-96aad16db65e

Key Slot 0: ENABLED
Iterations: 268337
Salt: ee c0 18 69 ed 2b d9 5d 6e 55 b0 53 66 8d a1 13
e9 75 a9 80 fd 0e bb 38 6c c5 4b 40 b9 c7 ee 2d
Key material offset: 8
AF stripes: 4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED

While none of this information helps the attacker to decode the encrypted data which follows the header, it does give away some information. It says "what follows is encrypted data, it is encoded using 128-bit AES, the mode in which this encryption is used is cipher block-chaining with initialization vector calculated using sha256 ..." This is harmless enough except that it diminishes plausible deniability meaning that if the attacker is able to coerce the victim into giving over the key, the victim has a more difficult time denying that there is any key to be given over and what the attacker is looking at is just random data. Some encryption implementations, like truecrypt, approach plausible deniability by creating a "hidden" volume. In this approach, there are two passwords, one which unlocks the true hidden volume and another which unlocks a fake volume. When coerced, the victim can give the key to unlock the fake volume while remaining silent about the true volume.

If we take the ideal of "zero information loss" then any luks-type header which discloses the metadata of the encrypted data compromises plausible deniability. In this case, implementations like dm-crypt without luks extensions or loop-aes are preferred. However, in cases where plausible deniability is not an issue, then a luks-type header has advantages, eg. the hal/dbus system in Gnome recognizes the luks header and initiates a dialog where the user can enter the password and obtain the decrypted volume as an icon on the desktop when he plugs in an encrypted pen drive. If all the potential victim is trying to protect against is theft, then deniability is not an issue. But if it comes to a repressive regime trying to commit human right abuses, then plausible deniability and zero information take on a new dimension.

Cryptanalysis with bitmaps

Let's return to the original question armed with the above knowledge. Is it possible to recognize the difference between (pseudo) random data and encrypted data even in the absence of any metadata? I'll demonstrate below that in some circumstances the answer is yes. To make my points as obvious as possible, I'll use bitmap images to demonstrate what's going on. Let's start with our reference systems

A B A - B

Fig 1

These images were created using The Gimp as 200x200 24-bit bmp images. (I had to render them as jpg to put on the web page, but that was the very last step. The original bitmaps can be obtained here.) The B.bmp file represents pure white, and its body is made up of all hex values FF. A.bmp is also mostly FF's except for the black A which are 00's. There are also some transitional values at the boarder between the black A and and white background. The image A-B.bmp was created by copying all of B.bmp to the clipboard, and pasting it into A as a new layer. The two layers were then "differenced" using the layer panel (you can bring it up using Windows => Dockable Dialogs => Layers).

Here are our reference random data images:

Random 1 Random 2 Random 1 - Random 2

Fig 2

These images were create using the following:

dd if=/dev/urandom of=random.bmp count=120054 bs=1
dd if=A.bmp of=random.bmp count=54 bs=1 conv=notrunc

The first line creates a random file the size of A.bmp and the second line adds the 54 byte bmp header. It is identical for all our bitmaps of the same size so we can just lift it from A.bmp. I added this header for all of the encrypted or random files below as a final step to render the file as a viewable bitmap.

I'll use this poor man's cryptanalysis to look at some implementations of encryption. I'll start with poor implementations and work towards our best practices. Using bitmaps is not rigorous, but its a nice technique to illustrate what we're looking for. At the time of this writing, I'm coding up a suite of test based off of Knuth's work on random number. Stay tuned!

Why embed encrypted data within random data and why using cipher block chaining

Almost all software which installs encrypted systems will lay down random data on the device first and default to some strong cipher (eg 128-bit aes) implemented with cipher block chaining. Let's use our bitmap cryptanalysis to see what happens if you don't.

First let's put A.bmp within an unencrypted ext2 file system for reference --- where do the blocks of A.bmp get put inside a filesystem? I produced this image as follows:

dd if=/dev/zero of=ext2-A.bmp count=480054 bs=1
losetup /dev/loop0 ext2-A.bmp
mke2fs /dev/loop0
mount /dev/loop0 zzz/
cp A.bmp zzz/
umount zzz/
losetup -d /dev/loop0
dd if=A400.bmp of=ext2-A.bmp count=54 bs=1 conv=notrunc

Note that A400.bmp is a 400x400 24-bit bmp which is 480054 bytes in size. The larger size accommodates a larger file for the ext2 filesystem so that one can copy A.bmp into it without running out of device space. It is also commensurate with the 200x200 bmp so that the A doesn't get wrapped unrecognizeably when the file system is rendered as a bmp image.

Let's similarly put A.bmp into a poorly encrypted ext2 file system. Here we make two mistakes: 1) we do not first fill up the file system with (pseudo) random data before encrypting and formatting, and secondly we don't use any cipher block chaining. This image was produced as follows

dd if=/dev/zero of=ext2-enc1-A.bmp count=480054 bs=1
losetup /dev/loop0 ext2-enc1-A.bmp
cryptsetup -c aes-ecb create ext2-enc1-A.bmp /dev/loop0
mke2fs /dev/mapper/ext2-enc1-A.bmp
mount /dev/mapper/ext2-enc1-A.bmp zzz/
cp A.bmp zzz/
umount zzz/
dmsetup remove ext2-enc1-A.bmp
losetup -d /dev/loop0
dd if=A400.bmp of=ext2-enc1-A.bmp count=54 bs=1 conv=notrunc

Since dd pulls its data out of /dev/zero, the file system stats with a base of all zeros below the encryption layer. Also, cryptsetup -c aes-ecb sets up a dm-crypt layer without any chaining, ie, each block is encrypted independently of any other block, so a clear block of all 0's is always encrypted into the same encrypted block. The pattern of the file within the file system clearly emerges.

Finally, let's repeat the above, but this time we'll correct one of our mistakes. We'll lay down random data under the encryption layer, but we'll still use aes-ecb.

Clear ext2 aes-ecb on Zero background aes-ecb on Random background

Fig 3

Its clear what's happening here. The since each block is encrypted independently of others, the underlying structure comes through. We are no where near the ideal of zero information.

Dm-crypt's default aes-cbc-essiv:sha256 with random filler

Laying down a random background and chaining is clearly important. There are several techniques of chaining, but the underlying idea is similar. In CBC mode, the most popular form of chaining, a block is XORed with the previous encrypted block before it itself is encrypted, thus creating a "chain". An initialization vector is used for the very first block. Since blocks are not encrypted independently of other blocks, a block of say all 0's will not always be encrypted in the same way, and the structure of the underlying clear file system doesn't emerge as obviously as in ecb. Let's repeat the previous experiment, but this time, we'll use aes-cbc-essiv:sha256. We'll still use all zeros and random data for the unencrypted background for comparison. Here's what one gets:

Clear ext2 aes-cbc-essiv:sha256 on Zero background aes-cbc-essiv:sha256 on Random background

Fig 4

The importance of a random background is apparent in this example. The chaining certainly obscure the structure of the underlying file system, but the outline of large regions of empty space are still discernible. When these empty regions are filled with random data, it becomes nearly impossible to tell what's encrypted and what is random. In fact, the 128-bit aes-cbc-essiv:sha256 passes all the standard tests for random number on both a local and global scale --- this is the subject of another writeup that I'll post another time.

Information leak despite random filler and chaining

At this point the reader may think he's safe and arrived at "zero information loss", but unfortunately, there is still another kind of attack that can be launched. This one depends on the attacker being able to watch the encrypted file system at snapshots in time. This might happen, for instance, if the victim backs up his data to a hard drive which he then stores off site. If the attacker sneaks between backups and images the disks, it then become possible for him to launch this king of attacks.

Here are three demonstrations of this attack:

a) aespipe is able to operate in several modes. One mode is a 128-bit aes-cbc with a simple initialization vector and one password. Another is multi-key-v3 mode, which also employs 128-bit aes-cbc, but uses 64 different keys to encrypt the blocks --- the first key for the first block, the second for the second and so on in a cyclical fashion. It also uses a 65th key plus MD5 for the initialization vector. Still, if an attacker has access to a file system before and after the addition of data, evidence emerges of the underlying encryption.

The following bitmaps were generated to illustrate the single key simple initialization vector mode:

aespipe < A.bmp > aesA.bmp
aespipe < B.bmp > aesB.bmp
dd if=A.bmp of=aesA.bmp count=54 bs=1 conv=notrunc
dd if=B.bmp of=aesB.bmp count=54 bs=1 conv=notrunc

The Gimp was then used as described above to produce aesA-aesB.bmp. Here are the results:

A.bmp under aes-cbc-plain B.bmp under aes-cbc-plain aes A - aes B

Fig 5

One can compare these to bitmaps for multi-key-v3. First we generate a file containing the 65 keys:

head -c 2925 /dev/urandom | uuencode -m - | head -n 66 | tail -n 65 | gpg --symmetric -a > key.gpg

And then we use the key file to produce the encrypted bitmaps:

aespipe -K key.gpg < A.bmp > aesA-v3.bmp
aespipe -K key.gpg < B.bmp > aesB-v3.bmp
dd if=A.bmp of=aesA-v3.bmp count=54 bs=1 conv=notrunc
dd if=A.bmp of=aesB-v3.bmp count=54 bs=1 conv=notrunc

Here are the results:

A.bmp under multikey-v3 aes B.bmp under multikey-v3 aes multikey-v3 aes A - multikey-v3 aes B

Fig 6

Its clear that the change to the file system is better blurred but there is still some information to be gained about the underlying unencrypted data.

b) Let's try the same test using 128-bit aes-cbc-essiv:sha256 on a random background. Here we'll employ luks extensions for convenience, but this is not necessary, nor does it change our conclusions:

dd if=/dev/urandom of=ext2-enc5-A.bmp count=1080054 bs=1
losetup /dev/loop0 ext2-enc5-A.bmp
cryptsetup luksFormat /dev/loop0
cryptsetup luksOpen /dev/loop0 ext2-enc5-A.bmp
mke2fs /dev/mapper/ext2-enc5-A.bmp
mount /dev/mapper/ext2-enc5-A.bmp zzz/
cp A.bmp zzz/
umount zzz/
dmsetup remove ext2-enc5-A.bmp
losetup -d /dev/loop0

cp ext2-enc5-A.bmp ext2-enc5-B.bmp
losetup /dev/loop0 ext2-enc5-B.bmp
cryptsetup luksOpen /dev/loop0 ext2-enc5-B.bmp
mount /dev/mapper/ext2-enc5-B.bmp zzz/
cp B.bmp zzz/A.bmp
umount zzz/
dmsetup remove ext2-enc5-B.bmp
losetup -d /dev/loop0

dd if=A600.bmp of=ext2-enc5-A.bmp count=54 bs=1 conv=notrunc
dd if=A600.bmp of=ext2-enc5-B.bmp count=54 bs=1 conv=notrunc

Here are the results:

A.bmp under aes-cbc-essiv:256 B.bmp under aes-cbc-essiv:256 aes-cbc:essiv256 A - aes-cbc:essiv256 B

Fig 7

c) To fairly compare loop-aes's multikey-v3 with dmcrypt aes-cbc-essiv:sha256, we repeat the previous test with loop-aes.

dd if=/dev/urandom of=ext2-enc6-A.bmp count=1080054 bs=1
losetup -e AES128 -K key.gpg /dev/loop0 ext2-enc6-A.bmp
mke2fs /dev/loop0
mount /dev/loop0 zzz/
cp A.bmp zzz/
umount zzz/
losetup -d /dev/loop0

cp ext2-enc6-A.bmp ext2-enc6-B.bmp
losetup -e AES128 -K key.gpg /dev/loop0 ext2-enc6-B.bmp
mount /dev/loop0 zzz/
cp B.bmp zzz/A.bmp
umount zzz/
losetup -d /dev/loop0

dd if=A600.bmp of=ext2-enc6-A.bmp count=54 bs=1 conv=notrunc
dd if=A600.bmp of=ext2-enc6-B.bmp count=54 bs=1 conv=notrunc

Here are the results:

A.bmp under multikey-v3 B.bmp under multikey-v3 multikey-v3 A - multikey-v3 B

Fig 8

If we zoom in on the bands, we can see that multikey-v3 loop-aes does a better job than aes-cbc-essiv:sha256 dmcrypt and hiding the underlying data:

Fig 9

Best Practice

Now that we know more about how information can leak from an encrypted file system, we can conclude with some advice for best practices. In addition to chosing multikey-v3 loop-aes, to avoid exposing the underlying structure of the clear file system, one can try filling up the empty space with (pseudo) random data. To illustrate, we repeated the previous example, but after copying A.bmp into the filesystem, we filled up the remaining empty space as follows:

cd zzz
dd if=/dev/urandom of=waste

When we remount the system, we first clear the space by removing "waste", do our work and then fill it back up again with (pseudo) random data:

cd zzz
rm waste
cp ../B.bmp A.bmp
dd if=/dev/urandom of=waste

The results follow:

A.bmp + random filler under multikey-v3 B.bmp + random filler under multikey-v3 multikey-v3 A+filler - multikey-v3 B+filler

Fig 10

Some of the underlying structure is still visible, but it is clear that we're getting closer to the situation in Fig 2.

Finally, let's go one more step. How about scattering A.bmp through the blocks of the file system, in other words, fragmenting it? Usually one wants to avoid fragmentation because it adds latency to IO during hard drive seeks. This is not an issue with solid state devices, such as pen drives or SSD's and so we should consider what benefit is gained in terms of encryption. First, let's see what happens to A.bmp when it is fragmented on an unencrypted drive. We do this by creating lots of little files and deleting every tenth one (and it bit more) before copying in A.bmp:

dd if=/dev/urandom of=ext2-frag.bmp count=1080054 bs=1
losetup /dev/loop0 ext2-frag.bmp
mke2fs -N 1024 /dev/loop0
mount /dev/loop0 zzz
cd zzz
for i in $(seq 1 1024) ; do dd if=/dev/zero of=$i.waste count=1 bs=1k ; done
rm -f *7.waste
rm -f ?7?.waste
cp ../A.bmp .
cd ..
umount zzz
dd if=A600.bmp of=ext2-frag.bmp count=54 bs=1 conv=notrunc

Here is the result:

Fig 11

So let's modify our technique of filling the empty space with random data by further forcing the fragmenation of A.bmp and B.bmp.

dd if=/dev/urandom of=aes-ext2-frag-A.bmp count=1080054 bs=1
losetup -e AES128 -K key.gpg /dev/loop0 aes-ext2-frag-A.bmp
mke2fs -N 1024 /dev/loop0
mount /dev/loop0 zzz/
cd zzz/
for i in $(seq 1 1024) ; do dd if=/dev/urandom of=$i.waste count=1 bs=1k ; done
rm -f *4.waste *7.waste
cp ../A.bmp .
for i in $(seq 1025 2048) ; do dd if=/dev/urandom of=$i.waste count=1 bs=1k ; done
cd ..
umount zzz/
losetup -d /dev/loop0

cp aes-ext2-frag-A.bmp aes-ext2-frag-B.bmp
losetup -e AES128 -K key.gpg /dev/loop0 aes-ext2-frag-B.bmp
mount /dev/loop0 zzz/
cd zzz/
cp ../B.bmp A.bmp
rm -f *.waste
cd ..
umount zzz/
losetup -d /dev/loop0

Using our subtraction attack, here's what we find:

A.bmp + random filler + fragmentation under multikey-v3 B.bmp + random filler + fragmentation under multikey-v3 multikey-v3 A+filler+frag - multikey-v3 B+filler+frag

Fig 12

The black bands (regions of no change) are still visible as in Fig 10, but are more scattered reflecting the fragmentation of the underlying file. It is unlikely we can totally eliminate these regions of unchanged data because they probably represent the file system's metadata, ie, the i-nodes and superblocks.