Base64 Encoding

Join me as I do my best to break down base64 encoding, and explain it in a way that makes it easy to understand LOL

What follows is a description of base64 encoding and how it works at the bit level. Although I’ve used it extensively in my career, I never needed to know the underlying implementation. I had a decent grasp of it, but most descriptions felt lacking when I actually tried to understand how to implement it.

This is a quick write-up of what I learned and what I believe to be true about base64 encoding.

Given a string of “ABC”, we want to construct a base64 encoded representation. The final string will be “QUJD”.

Understanding the bits

One byte is 8 bits, so the bit-level representation of ABC is shown below.

Standard ASCII representation of bits

Each character is represented by 8 bits. The maximum value that can be represented using 8 bits is 255. The value is used as an index in a table of chars.

A B C
01000001 01000010 01000011
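
If you want to check these bit patterns yourself, here is a small Python sketch. It assumes plain ASCII input, and `to_bits` is a helper name I made up for illustration.

```python
def to_bits(text):
    """Return the 8-bit binary string for each character (ASCII assumed)."""
    return [format(ord(ch), "08b") for ch in text]


print(" ".join(to_bits("ABC")))  # 01000001 01000010 01000011
```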

Base64 representation of bits

Each character is represented by 6 bits. The maximum value that can be represented using 6 bits is 63. The value is used as an index in a table of chars.

Base64 Character Table

Each value from 0-63 can be used as an index to this lookup table to perform the encoding or decoding.

base64_chars = [
  "A","B","C","D","E","F","G","H",
  "I","J","K","L","M","N","O","P",
  "Q","R","S","T","U","V","W","X",
  "Y","Z","a","b","c","d","e","f",
  "g","h","i","j","k","l","m","n",
  "o","p","q","r","s","t","u","v",
  "w","x","y","z","0","1","2","3",
  "4","5","6","7","8","9","+","/"
]
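
As a rough sketch, the same table can be built from Python's `string` module instead of typing it out, along with a reverse lookup that comes in handy for decoding. The name `base64_index` is my own and is only here for illustration.

```python
import string

# Same 64 characters as the table above, in the same order.
base64_chars = list(
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)

# Reverse lookup for decoding: character -> 6-bit value.
base64_index = {ch: i for i, ch in enumerate(base64_chars)}

print(len(base64_chars))   # 64
print(base64_chars[16])    # Q
print(base64_index["Q"])   # 16
```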

Encoding

Q U J D
010000 010100 001001 000011

If we work in groups of 3 bytes, we have a stream of 24 bits: 3 characters of 8 bits = 24 bits. To convert this stream to base64, we concatenate those 24 bits and split them into 6-bit groups, giving us 4 base64 characters. This works because 4 characters of 6 bits also equals 24 bits.

This is accomplished with bit shifting and looks like this:

original = 01000001 01000010 01000011
base 64  = 010000 010100 001001 000011
  1. shift 6 bits from first char (010000)
original = 01 01000010 01000011
ch_1 = 010000
  1. shift 2 bits from first char, 4 from second char (2 + 4 = 6)
original = 0010 01000011
ch_1 = 010000
ch_2 = 010100
  1. shift 4 bits from second char (0010), 2 from third char (01) (4 + 2 = 6)
original = 000011
ch_1 = 010000
ch_2 = 010100
ch_3 = 001001
  1. shift 6 bits from third char (all that’s left) into ch_4
original = (empty, all 24 bits consumed)
ch_1 = 010000
ch_2 = 010100
ch_3 = 001001
ch_4 = 000011
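
The walkthrough above treats the bits as text. With real integers, the same idea might look something like the sketch below: pack the three bytes into one 24-bit number, then shift and mask out four 6-bit values. `split_24_bits` is a hypothetical helper name, not something from a library.

```python
def split_24_bits(b1, b2, b3):
    """Pack three byte values into 24 bits, then peel off four 6-bit values."""
    n = (b1 << 16) | (b2 << 8) | b3
    return [
        (n >> 18) & 0b111111,  # top 6 bits
        (n >> 12) & 0b111111,
        (n >> 6) & 0b111111,
        n & 0b111111,          # bottom 6 bits
    ]


values = split_24_bits(ord("A"), ord("B"), ord("C"))
print(values)                              # [16, 20, 9, 3]
print([format(v, "06b") for v in values])  # ['010000', '010100', '001001', '000011']
```

The values 16, 20, 9 and 3 are the table indexes for Q, U, J and D.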

Now that we have all four 6-bit values, we can look each one up in the table of base64 characters, since every value falls in the range 0-63. If the input does not divide evenly into groups of 3 bytes, we add one padding character (=) for each missing byte.

base64_chars = [
  "A","B","C","D","E","F","G","H",
  "I","J","K","L","M","N","O","P",
  "Q","R","S","T","U","V","W","X",
  "Y","Z","a","b","c","d","e","f",
  "g","h","i","j","k","l","m","n",
  "o","p","q","r","s","t","u","v",
  "w","x","y","z","0","1","2","3",
  "4","5","6","7","8","9","+","/"
]

# ch_1..ch_4 are bit strings such as "010000", so parse them as base 2
new_ch_1 = base64_chars[int(ch_1, 2)]
new_ch_2 = base64_chars[int(ch_2, 2)]
new_ch_3 = base64_chars[int(ch_3, 2)]
new_ch_4 = base64_chars[int(ch_4, 2)]

Result: base64 of ABC = QUJD

Now if we base64 encode ABCD we will end up with QUJDRA==.

This is because base64 always works on groups of 24 bits, for both encoding and decoding. Adding the new character D bumps our bit count from 24 to 32, and 32 bits (4 bytes) is not a multiple of 24 bits (3 bytes).

4 mod 3 = 1 leftover byte, so 3 - 1 = 2 bytes are missing

The leftover byte D (8 bits) is padded with zero bits up to 12 bits, which gives the two characters RA. Since the output always comes in groups of 4 characters, we then add 2 padding characters (=) to complete the group, for a total of 8 characters.
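
Here is a quick sketch of the padding arithmetic, cross-checked against Python's standard `base64` module. `padding_needed` is a name I picked for illustration.

```python
import base64  # standard library, used only to cross-check the results


def padding_needed(num_bytes):
    """Number of '=' characters for an input of num_bytes bytes."""
    return (3 - num_bytes % 3) % 3


for text in ["ABC", "ABCD", "ABCDE"]:
    pad = padding_needed(len(text))
    reference = base64.b64encode(text.encode("ascii")).decode("ascii")
    print(text, pad, reference)
```

For ABC, ABCD and ABCDE this prints padding counts of 0, 2 and 1, next to QUJD, QUJDRA== and QUJDREU=.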

Readable base64 encode/decode in Python

```python
base64_table = [
    "A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
    "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T",
    "U", "V", "W", "X", "Y", "Z", "a", "b", "c", "d",
    "e", "f", "g", "h", "i", "j", "k", "l", "m", "n",
    "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
    "y", "z", "0", "1", "2", "3", "4", "5", "6", "7",
    "8", "9", "+", "/"
]


def get_base64_ch(value):
    # Valid 6-bit values are 0-63; the table has exactly 64 entries.
    if value < 0 or value > 63:
        raise ValueError("value out of range 0-63")
    return base64_table[value]


def get_bin_from_char(ch):
    num = ord(ch)
    # This readable version only handles characters that fit in one byte.
    if num > 255:
        raise ValueError("only single-byte characters are supported")
    return format(num, '08b')


def get_bin_from_int(num):
    return format(num, '08b')


def base64_encode(text):
    encoded = ""
    binary_s = ""

    # Number of "=" characters needed to complete the last group of 3 bytes.
    padding = len(text) % 3
    if padding > 0:
        padding = 3 - padding

    # Concatenate the 8-bit representation of every character.
    for ch in text:
        binary_s += get_bin_from_char(ch)

    bs_len = len(binary_s)

    # Walk the bit string 6 bits at a time.
    for i in range(0, bs_len, 6):
        remaining = bs_len - i
        if remaining >= 6:
            chunk = binary_s[i:i+6]
        else:
            # Last chunk is short: pad it on the right with zero bits.
            chunk = binary_s[i:i+remaining].ljust(6, '0')

        num = int(chunk, 2)
        encoded += get_base64_ch(num)

    return encoded + "=" * padding


def base64_decode(text):
    decoded = ""
    binary_s = ""

    text = text.replace("=", "")

    # Turn every base64 character back into its 6-bit value.
    for ch in text:
        num = base64_table.index(ch)
        b = get_bin_from_int(num)
        binary_s += b[-6:]

    bs_len = len(binary_s)

    # Only decode complete bytes; the leftover bits are the zero padding
    # added by the encoder and would otherwise decode to a stray NUL char.
    for i in range(0, bs_len - bs_len % 8, 8):
        b = binary_s[i:i+8]
        decoded += chr(int(b, 2))

    return decoded


def main():
    strings = [
        "ABC",
        "ABCD",
        "ABCDE",
        "ABCDEF"
    ]

    for text in strings:
        print("original: {}".format(text))
        encoded = base64_encode(text)
        print("encoded : {}".format(encoded))
        decoded = base64_decode(encoded)
        print("decoded : {}".format(decoded))
        print("")


if __name__ == "__main__":
    main()
```

Yields the following output:

❯ python3 encoding.py
original: ABC
encoded : QUJD
decoded : ABC

original: ABCD
encoded : QUJDRA==
decoded : ABCD

original: ABCDE
encoded : QUJDREU=
decoded : ABCDE

original: ABCDEF
encoded : QUJDREVG
decoded : ABCDEF