You Won’t Find ‘I’ in Bitcoin. Here’s Base-2, -16, -32, -45, -58, -64, and all the other bases

Photo by Nick Hillier on Unsplash

In cybersecurity and digital forensics, you never know what your data will look like or how it will be represented. At its core it is 1s and 0s, and these need to be interpreted as numbers (integers and floating-point values), characters, binary codes (opcodes), program values, or strings. Sometimes we take four bits at a time and represent the value as hex, and at other times we take six bits at a time and represent it in Base64 format. The real skill of a cybersecurity analyst is knowing how to interpret these bits and convert them into something that makes sense.

Converting to printable characters

The success of the Web is possibly one of the greatest achievements of humankind. But its core protocols have never really been allowed to evolve that much. Why? Because changing them would have a significant effect on the way the Web works. RFC 791 and RFC 793 brought us IP and TCP, and RFC 2068 brought us a protocol that has never really advanced that much: HTTP 1.1.

One thing about the Internet is that it was built to transfer text, mainly ASCII characters. So, one of the major challenges of building the Web was to convert our pesky binary data (which has bells, tabs and new lines) into a text form. For this we produce a binary stream of data, and then map it onto an ASCII character set that humans can read or that a machine can convert back into binary. If Bob and Alice can both encode to and decode from this character set, then they can send each other binary information, such as an image, in a text format.

But which character set should we use? If we used a full 8-bit character set, many of the values would map to control characters rather than printable ones.

Base-2 (“Binary”)

The first, and most basic, approach is to represent each bit as a ‘0’ or a ‘1’ character. This, of course, is highly inefficient, and expands the data eightfold. For “fred” we can represent our ASCII characters in a bit format, where we have:

01100110 01110010 01100101 01100100
f        r        e        d

This is actually a Base-2 form. You can try it here:

  • Message=“fred” (Encode), Base2 Try.
  • Message=“0110 0110 0111 0010 0110 0101 0110 0100” (Decode), Base2 Try.
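The bit-per-character mapping above can be sketched in a few lines of Python. This is a minimal illustration, and the function names to_base2 and from_base2 are our own (they do not come from any library):

```python
def to_base2(data: bytes) -> str:
    """Render each byte as eight '0'/'1' characters, separated by spaces."""
    return " ".join(format(byte, "08b") for byte in data)


def from_base2(text: str) -> bytes:
    """Rebuild bytes from a string of '0'/'1' characters; spaces are ignored."""
    bits = text.replace(" ", "")
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(to_base2(b"fred"))  # 01100110 01110010 01100101 01100100
print(from_base2("0110 0110 0111 0010 0110 0101 0110 0100"))  # b'fred'
```

Each input byte becomes eight output characters, which is where the eightfold expansion comes from.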

Base-16 (“Hex”)

Representing each bit with a character is inefficient. As an improvement, we can group our bits into fours, and map each group to its equivalent hex character (0-9, A-F). This gives us:

0110 0110 0111 0010 0110 0101 0110 0100
6    6    7    2    6    5    6    4

The Base-16 form of “fred” is thus “66726564” (and, in the same way, “help” becomes “68656C70”). Here is an example of the conversion:

Figure 2: Conversion to hex

We can try this example with:

  • Message=“fred” (Encode), Base16 Try.
  • Message=“66726564” (Decode), Base16 Try.
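As a rough sketch of the four-bit grouping, we can split each byte into its upper and lower nibble and look each one up in the hex digits. The helper to_hex is a name we have made up for illustration; in practice the built-in bytes.hex() and bytes.fromhex() do the same job:

```python
HEX_DIGITS = "0123456789ABCDEF"


def to_hex(data: bytes) -> str:
    """Map the upper and lower four bits of each byte to a hex character."""
    out = []
    for byte in data:
        out.append(HEX_DIGITS[byte >> 4])    # upper four bits
        out.append(HEX_DIGITS[byte & 0x0F])  # lower four bits
    return "".join(out)


print(to_hex(b"fred"))            # 66726564
print(to_hex(b"help"))            # 68656C70
print(bytes.fromhex("66726564"))  # b'fred'
```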

The Base16, Base32 and Base64 formats are defined in [RFC 4648].

Base-32

If you need a more compact form than Base16 (“Hex”), we can represent five bits with each character. For this we map the binary values from 00000 to 11111 to characters as follows [RFC 4648]:

Value Encoding  Value Encoding  Value Encoding  Value Encoding
    0 A             9 J            18 S            27 3
    1 B            10 K            19 T            28 4
    2 C            11 L            20 U            29 5
    3 D            12 M            21 V            30 6
    4 E            13 N            22 W            31 7
    5 F            14 O            23 X
    6 G            15 P            24 Y         (pad) =
    7 H            16 Q            25 Z
    8 I            17 R            26 2

In a RegEx format our character set is [A-Z2-7=]. Notice that we have lost ‘0’ and ‘1’, as these two characters can be confused with ‘O’ and ‘I’. An example is:

01100110 01110010 01100101 01100100
f        r        e        d
01100 11001 11001 00110 01010 11001 00 [000]
M     Z     Z     G     K     Z     A        =

A sample run proves this:

Message:  fred
Type: base32
Encoding: MZZGKZA=

And here are the two examples:

  • Message=“fred” (Encode), Base32 Try.
  • Message=“MZZGKZA=” (Decode), Base32 Try.
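The five-bit grouping can be sketched by hand and then checked against Python's standard base64 module. The function to_base32 below is our own illustrative helper (it is deliberately simple rather than fast), while base64.b32encode is the standard library call:

```python
import base64

B32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def to_base32(data: bytes) -> str:
    """Encode bytes by taking five bits at a time, as in the example above."""
    bits = "".join(format(byte, "08b") for byte in data)
    # Pad the bit string with zeros up to a multiple of five bits.
    bits += "0" * (-len(bits) % 5)
    chars = [B32_ALPHABET[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5)]
    # Pad the output with '=' up to a multiple of eight characters.
    chars.append("=" * (-len(chars) % 8))
    return "".join(chars)


print(to_base32(b"fred"))                   # MZZGKZA=
print(base64.b32encode(b"fred").decode())   # MZZGKZA=
print(base64.b32decode("MZZGKZA="))         # b'fred'
```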

Base-26

The English alphabet has 26 letters, so we could use all of these for our encoding “[A-Z]”.

Message:  fred
Type: base26
Encoding: FOREXTK
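One way to reproduce this output is to treat the whole message as a single large integer and write it in base 26, with ‘A’ as digit 0. This is only a sketch under that assumption: the helper int_base_encode is our own name, and we have checked it only against the “fred” sample above, so other inputs (for example those with leading zero bytes) may be handled differently by a real tool.

```python
import string


def int_base_encode(data: bytes, alphabet: str) -> str:
    """Treat data as one big-endian integer and write it in base len(alphabet)."""
    value = int.from_bytes(data, "big")
    base = len(alphabet)
    digits = []
    while value > 0:
        value, remainder = divmod(value, base)
        digits.append(alphabet[remainder])
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(digits)) or alphabet[0]


print(int_base_encode(b"fred", string.ascii_uppercase))  # FOREXTK
```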

Base-45

Base-45 format is used in applications such as the QR codes within vaccination passports. With this we take two bytes at a time [A, B] and then determine the values of [C, D, E] so that: (A×256)+B=C+(D×45)+(E×45×45). In practice, we compute (A×256)+B, then repeatedly divide by 45 and note the remainders, which give C, D and E in that order. Each value is then converted to a character with a lookup table. The character set used is: [0-9A-Z $%*+-./:]. Notice that it includes a space (‘ ’), so you need to watch out for these.

An example is [here]:

Input: test
Type: base45
Coding: 7WE QE

Here are two examples:

  • Message=“test”, Base45 Try.
  • Message=“7WE QE”, Base45 Try.
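The two-bytes-to-three-characters rule can be sketched as follows. The function to_base45 is our own illustrative name. For a final odd byte we emit just two characters, which is how we understand the Base45 specification (RFC 9285) to handle it:

```python
B45_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def to_base45(data: bytes) -> str:
    """Encode two bytes at a time as three Base45 characters [C, D, E]."""
    out = []
    for i in range(0, len(data), 2):
        chunk = data[i:i + 2]
        if len(chunk) == 2:
            value = chunk[0] * 256 + chunk[1]  # (A x 256) + B
            value, c = divmod(value, 45)       # first remainder is C
            e, d = divmod(value, 45)           # second remainder is D, quotient is E
            out.append(B45_ALPHABET[c] + B45_ALPHABET[d] + B45_ALPHABET[e])
        else:
            # A single trailing byte becomes two characters.
            d, c = divmod(chunk[0], 45)
            out.append(B45_ALPHABET[c] + B45_ALPHABET[d])
    return "".join(out)


print(to_base45(b"test"))  # 7WE QE
```

For “te” we have (116×256)+101=29797, which gives remainders 7 and 32 and a final quotient of 14, and these map to ‘7’, ‘W’ and ‘E’.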

Base-58

Base-58 is used in Bitcoin, where we have the character set [1-9A-HJ-NP-Za-km-z]. It was designed to remove the characters that could be misread in a Bitcoin wallet address. These are ‘0’ (zero), ‘I’ (upper case i), ‘O’ (upper case o), and ‘l’ (lower case L). An example is:

Input: fred
Type: base58
Coding: 3ctAMq

Compared with Base64, we lose six characters: the four alphanumeric characters ‘0’ (zero), ‘I’ (upper case i), ‘O’ (upper case o), and ‘l’ (lower case L), and the non-alphanumeric characters + (plus) and / (slash).

With Base58, we treat the bytes of the ASCII characters as one large binary number, and then keep dividing by 58, converting each remainder to a Base58 character. The alphabet becomes:

'123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'

So let’s take an example of ‘e’. With ‘e’ we have a decimal value of 101, so we divide by 58 to get:

1 remainder 43

and next we divide 1 by 58 and we get:

0 remainder 1

The remainders are read in reverse order, so we take the characters at position 1 and at position 43 (counting from zero) of the alphabet:

123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz

which gives:

2k

If we now take ‘ef’, we get 25958 (102 + 101 × 256), where each earlier character is moved up by one byte. Basically, we take the binary value of the whole string, and then repeatedly divide by 58 and take the remainder. So ‘ef’ is ‘01100101 01100110’ [here].

Two examples are:

  • Message=“fred” (encode), Base58 Try.
  • Message=“3ctAMq” (decode), Base58 Try.
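The repeated divide-by-58 method can be sketched as below. The function to_base58 is our own illustrative helper. Mapping each leading zero byte to a leading ‘1’ follows the usual Bitcoin convention, and is an assumption here rather than something shown in the examples above:

```python
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def to_base58(data: bytes) -> str:
    """Treat the bytes as one large integer and repeatedly divide by 58."""
    value = int.from_bytes(data, "big")
    digits = []
    while value > 0:
        value, remainder = divmod(value, 58)
        digits.append(B58_ALPHABET[remainder])
    # Assumed convention: each leading zero byte becomes a leading '1'.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    # Remainders come out least-significant first, so reverse them.
    return "1" * leading_zeros + "".join(reversed(digits))


print(to_base58(b"e"))                 # 2k
print(int.from_bytes(b"ef", "big"))    # 25958
print(to_base58(b"fred"))              # 3ctAMq
```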

Base-64

One of the most common formats for converting binary into a text format is Base64. With this we take six bits at a time and map them to 64 printable characters, with ‘=’ used for padding: “[A-Za-z0-9+/=]”:

Value Encoding  Value Encoding  Value Encoding  Value Encoding
    0 A            17 R            34 i            51 z
    1 B            18 S            35 j            52 0
    2 C            19 T            36 k            53 1
    3 D            20 U            37 l            54 2
    4 E            21 V            38 m            55 3
    5 F            22 W            39 n            56 4
    6 G            23 X            40 o            57 5
    7 H            24 Y            41 p            58 6
    8 I            25 Z            42 q            59 7
    9 J            26 a            43 r            60 8
   10 K            27 b            44 s            61 9
   11 L            28 c            45 t            62 +
   12 M            29 d            46 u            63 /
   13 N            30 e            47 v
   14 O            31 f            48 w         (pad) =
   15 P            32 g            49 x
   16 Q            33 h            50 y

With “help” we have:

01101000 01100101 01101100 01110000
h        e        l        p
011010 000110 010101 101100 011100 00 [0000]
a      G      V      s      c      A         = =

With this we pad with zeros at the end to make the bit stream a multiple of six bits. The number of Base64 characters also needs to be a multiple of four, so we pad the end of the Base64 string with ‘=’. Thus “help” is “aGVscA==” in Base64 and, by the same method, “fred” is “ZnJlZA==”. The Base64 mapping is:

Figure: Conversion to Base-64

Two examples are:

  • Message=“fred” (encode), Base64 Try.
  • Message=“ZnJlZA==” (decode), Base64 Try.
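The six-bit grouping and the two kinds of padding can be sketched by hand, and then compared with Python's standard base64 module. The function to_base64 is our own illustrative helper, written for clarity rather than speed:

```python
import base64

B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def to_base64(data: bytes) -> str:
    """Encode bytes by taking six bits at a time."""
    bits = "".join(format(byte, "08b") for byte in data)
    # Pad the bit string with zeros up to a multiple of six bits.
    bits += "0" * (-len(bits) % 6)
    chars = [B64_ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad the output with '=' up to a multiple of four characters.
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)


print(to_base64(b"help"))                   # aGVscA==
print(to_base64(b"fred"))                   # ZnJlZA==
print(base64.b64encode(b"fred").decode())   # ZnJlZA==
print(base64.b64decode("ZnJlZA=="))         # b'fred'
```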

Conclusions

Encoding and decoding should be one of the first lessons in cybersecurity and digital forensics (closely followed by Magic Numbers). For malware writers, using different bases is a standard way to avoid detection.

Some character sets

Here are some character sets for a few others:

Base2  [01]
Base3 [123]
Base5 [01234]
Base10 [0123456789]
Base26 [A-Z]
Base32 [A-Z2-7=]
Base45 [0-9A-Z $%*+-./:]
Base58 (bitcoin) [1-9A-HJ-NP-Za-km-z]
Base62 [0-9A-Za-z]
Base64 [A-Za-z0-9+/=]
Base67 [A-Za-z0-9-.!~_]
Base85 (Ascii85) [!"#$%&'()*+,-./0-9:;<=>?@A-Z[\]^_`a-u]
Base91 [A-Za-z0-9!#$%&()*+,./:;<=>?@[]^_`{|}~"]
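These character sets can also be used as a quick first filter when we meet an unknown string. The sketch below simply reports which of a few bases a string could belong to, judged on its characters alone. The names CHARSETS and candidate_bases are our own, and the patterns for Base32 and Base64 assume that ‘=’ only appears as trailing padding. A match only means the string is a candidate for that base, as many strings fit several character sets.

```python
import re

# Patterns built from the character sets listed above (plus Base16).
CHARSETS = {
    "Base2": r"[01]+",
    "Base10": r"[0-9]+",
    "Base16": r"[0-9A-Fa-f]+",
    "Base32": r"[A-Z2-7]+=*",
    "Base45": r"[0-9A-Z $%*+\-./:]+",
    "Base58": r"[1-9A-HJ-NP-Za-km-z]+",
    "Base64": r"[A-Za-z0-9+/]+=*",
}


def candidate_bases(text: str) -> list:
    """Return the names of the bases whose character set covers the whole string."""
    return [name for name, pattern in CHARSETS.items() if re.fullmatch(pattern, text)]


print(candidate_bases("7WE QE"))    # ['Base45']
print(candidate_bases("MZZGKZA="))  # ['Base32', 'Base64']
print(candidate_bases("3ctAMq"))    # ['Base58', 'Base64']
```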

Coding

The code is:

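import sys

import codext

# Defaults, overridden by the command-line arguments: <type> <message>
encoding = "base2"
message = "Testing"

if len(sys.argv) > 1:
    encoding = sys.argv[1]
if len(sys.argv) > 2:
    message = sys.argv[2]

print("Message:\t", message)
print("Type:\t\t", encoding)

# codext adds extra codecs (base2, base45, base58, ...) to the standard ones
encoded = codext.encode(message, encoding)
print("Coding:\t", encoded)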