Hello friend,
In this series, we are going to create a simple base64 encoder/decoder with Python.
We are going to do so from scratch; the aim is to understand the underlying process of how it works.
What is base64 ?
Before attacking base64, we need to understand what a numbering system is.
We, as humans, use base10 as our standard numbering system. If I ask you to list all of the digits you know, you are going to answer with:
0 1 2 3 4 5 6 7 8 9
Here, 10 is the number of symbols in that numbering system.
There is also base2, or binary, which, as you might guess, has only 2 symbols, 0 and 1, and is a fundamental component of computing, electronics and digital technology.
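To make the idea of a base concrete, here is a small sketch that writes the same number in base10 and in base2. The helper name to_base is hypothetical and only exists for this illustration; it is limited to bases up to 10.

```python
# Hypothetical helper for illustration: write a non-negative integer
# using the first `base` symbols of "0123456789" (so base <= 10).
def to_base(number, base):
    digits = "0123456789"[:base]  # the symbols available in this base
    if number == 0:
        return digits[0]
    out = ""
    while number > 0:
        # the remainder is the lowest digit; prepend it to the result
        out = digits[number % base] + out
        number //= base
    return out

print(to_base(13, 10))  # 13
print(to_base(13, 2))   # 1101
```

The value is the same in both cases; only the set of symbols used to write it changes.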
Okay, what about base64 now?
Following the same logic, base64 is a numbering system which uses 64 symbols to encode data.
Those symbols are [A-Z], [a-z], [0-9], + and /.
Here is the complete symbol table
Creating the symbol set
First of all, let's build this table using Python.
Here is a very simple way of doing it:
# creating the base64 symbol set
A_Z = [chr(i) for i in range(65,91)]
a_z = [chr(i) for i in range(97,123)]
zero_nine = [chr(i) for i in range(48,58)]
additional_chars = ['+','/']
base64_symbol_table = A_Z + a_z + zero_nine + additional_chars
print(base64_symbol_table)
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '/']
Let's break down how the above code works. I'll explain the first line and you do the math:
- range(65,91) creates a range of numbers from 65 to 90 (inclusive), which are the ASCII values of the uppercase letters A to Z.
- chr(i) is a built-in Python function that takes an integer argument and returns the corresponding character. So chr(65) returns the character 'A', chr(66) returns 'B', and so on.
- [chr(i) for i in range(65,91)] is a list comprehension that applies the chr() function to each integer in the range 65 to 90, and creates a new list with the resulting characters. The end result is a list of strings containing the uppercase letters from A to Z.
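As an alternative to spelling out the ASCII ranges by hand, the same table can be sketched with the constants from Python's standard string module. This is just another way to get the same 64 symbols, not a replacement for the code above.

```python
import string

# Same 64 symbols, built from the standard library's predefined constants
base64_symbol_table = list(
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)

print(len(base64_symbol_table))  # 64
print(base64_symbol_table[18])   # S
```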
Creating the encoder
A bit of theory
Now before attacking this part, we need to understand how the encoding part actually works.
To start, let's say we have data = "Hi". How would we proceed to encode it?
- Each ASCII character is represented in its binary format.
In our case H is 72 and i is 105.
Which gives 01001000 01101001
- Next, we group the bits together.
01001000 01101001 becomes 0100100001101001.
- Each symbol in base64 is represented as 6 bits, not 8 bits, so we are going to divide the above result into chunks of 6 bits.
0100100001101001 becomes 010010 000110 1001
- Now the last chunk, 1001, needs to have 6 bits, so we need some sort of padding.
010010 000110 1001 becomes 010010 000110 100100
Note: we added 00 at the end of 1001 to complete 6 bits.
- We convert each chunk back to decimal.
010010 000110 100100 becomes 18 6 36
- Those numbers are the indexes in our symbol table, where 18 corresponds to S, and so on.
18 6 36 becomes SGk
- Now a base64 string needs its length to be a multiple of 4. We accomplish that by adding = as padding.
The final result is SGk=
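The intermediate values of this walkthrough can be reproduced with a few lines of Python. This is only a quick trace that leans on built-ins (format() and int() with base 2) as shortcuts; it is not the from-scratch encoder this post builds.

```python
data = "Hi"

# 8-bit binary form of each character, joined into one string
bits = "".join(format(ord(c), "08b") for c in data)

# split into chunks of 6 bits
chunks = [bits[i:i + 6] for i in range(0, len(bits), 6)]

# pad the last chunk with zeros on the right
chunks[-1] = chunks[-1].ljust(6, "0")

# convert each chunk back to decimal
indexes = [int(chunk, 2) for chunk in chunks]

print(bits)     # 0100100001101001
print(chunks)   # ['010010', '000110', '100100']
print(indexes)  # [18, 6, 36]
```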
Here is a little diagram I created to help you visualize the algorithm
The code
Now let’s get to actually coding that.
Step 1 :
# converting the word into its binary format
binary = []
for c in word:
    binary.append(bin(ord(c)).split("b")[-1].rjust(8,'0'))
- Here, we loop through each character c of our string
- we grab its ASCII value with ord()
- we convert it to binary with bin()
print(bin(36))
0b100100
bin() prepends 0b to our desired value, so we split by "b" and take the right half.
- To make sure that our byte is 8 bits, we fill the start of our string with zeros to reach 8 characters using rjust().
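One thing worth keeping in mind: rjust() only pads, it never truncates. So this step assumes every character has a code point below 256, which is true for ASCII text. The sketch below shows what happens with the euro sign, and one possible workaround (encoding to UTF-8 first), which is my assumption and not part of the code in this post.

```python
# ASCII characters fit in 8 bits, so the 8-bit assumption holds
h_bits = bin(ord("H")).split("b")[-1].rjust(8, "0")
print(h_bits, len(h_bits))  # 01001000 8

# "\u20ac" is the euro sign; its code point is 8364, which needs 14 bits,
# and rjust() cannot shrink it back to 8
euro_bits = bin(ord("\u20ac")).split("b")[-1].rjust(8, "0")
print(len(euro_bits))  # 14

# Possible workaround (an assumption, not the post's code):
# encode to UTF-8 first and work on the resulting byte values
utf8_bits = [format(b, "08b") for b in "\u20ac".encode("utf-8")]
print(utf8_bits)  # ['11100010', '10000010', '10101100']
```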
Step 2 :
binary = "".join(binary)
Here, we just group what we got.
Step 3 :
new_word = []
while len(binary):
    new_word.append(binary[0:6])
    binary = binary[6:]
We form groups of 6 as explained earlier.
Step 4 :
new_word[-1] = new_word[-1].ljust(6, '0')
We grab the last element and pad it with zeros on the right until it is 6 digits long.
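The side on which the zeros are added matters. A quick check with the last chunk from the "Hi" example, using the built-in int() with base 2 as a shortcut to read the value:

```python
last_chunk = "1001"

# Padding on the right keeps these bits as the leading bits of the symbol
right = last_chunk.ljust(6, "0")
print(right, int(right, 2))  # 100100 36

# Padding on the left would give a different index, hence a different symbol
left = last_chunk.rjust(6, "0")
print(left, int(left, 2))    # 001001 9
```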
Step 5 :
result = ""
for byte in new_word:
    # byte is a string representing the 6 bit chunk
    index = bin_to_int(byte)
    result += base64_symbol_table[index]
Here we loop through each byte (really a 6-bit chunk) and convert it to an int using bin_to_int(), which is a small function I created.
The result is the index of our character in the symbol table; we look it up and append it to result.
Here is the bin_to_int() function:
# binary to decimal custom function
def bin_to_int(byte):
    result = 0
    k = 1
    for i in range(0,len(byte)):
        result += int(byte[len(byte) - 1 - i]) * k
        k = k * 2
    return result
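Since we are doing everything from scratch, the custom function is the point here, but it is reassuring to check it against Python's built-in int(text, 2), which performs the same conversion. The sketch below repeats the function so that it runs on its own, and compares the two on every possible 6-bit chunk.

```python
# Copy of the helper above, so that this snippet runs on its own
def bin_to_int(byte):
    result = 0
    k = 1  # weight of the current bit: 1, 2, 4, 8, ...
    for i in range(0, len(byte)):
        # walk the string from the rightmost (lowest) bit to the leftmost
        result += int(byte[len(byte) - 1 - i]) * k
        k = k * 2
    return result

# Compare with the built-in on every possible 6-bit chunk
all_chunks = [format(n, "06b") for n in range(64)]
matches = all(bin_to_int(chunk) == int(chunk, 2) for chunk in all_chunks)
print(matches)  # True
```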
Step 6 ~ Last step :
def get_multiple_of_4(num):
    while num % 4 != 0:
        num += 1
    return num
result = result.ljust(get_multiple_of_4(len(result)),'=')
return result
In this step, we make sure the resulting string has a length that is a multiple of 4.
To do so, we take the length of our string, find the closest multiple of 4 greater than or equal to it, and that's the length of our new string.
We use ljust(), which appends = until our string reaches the length specified in the first parameter.
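The number of = characters can also be computed directly with a bit of modular arithmetic instead of a loop. This is a sketch of an equivalent approach; the name padding_needed is hypothetical.

```python
# Hypothetical helper: how many "=" are needed to reach the next multiple of 4
def padding_needed(length):
    # -length % 4 is the distance from length up to the next multiple of 4
    return -length % 4

unpadded = "SGk"
padded = unpadded + "=" * padding_needed(len(unpadded))
print(padded)  # SGk=
```

In Python, the % operator always returns a non-negative result for a positive divisor, which is what makes -length % 4 work here.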
Wrapping Up
Putting all of the above together :
# creating base64 table index
A_Z = [chr(i) for i in range(65,91)]
a_z = [chr(i) for i in range(97,123)]
zero_nine = [chr(i) for i in range(48,58)]
additional_chars = ['+','/']
base64_symbol_table = A_Z + a_z + zero_nine + additional_chars

# binary to decimal custom function
def bin_to_int(byte):
    # example input: "110101"
    result = 0
    k = 1
    for i in range(0,len(byte)):
        result += int(byte[len(byte) - 1 - i]) * k
        k = k * 2
    return result

def b64_encode(word):
    # an empty input has nothing to encode (avoids indexing an empty list below)
    if not word:
        return ""

    # converting the word into its binary format
    binary = []
    for c in word:
        binary.append(bin(ord(c)).split("b")[-1].rjust(8,'0'))
    binary = "".join(binary)

    # splitting the binary format into chunks of 6 bits
    new_word = []
    while len(binary):
        new_word.append(binary[0:6])
        binary = binary[6:]

    # adding padding to the last elem
    new_word[-1] = new_word[-1].ljust(6, '0')

    # encoding the text
    result = ""
    for byte in new_word:
        index = bin_to_int(byte)
        result += base64_symbol_table[index]

    # adding padding of "="
    def get_multiple_of_4(num):
        while num % 4 != 0:
            num += 1
        return num
    result = result.ljust(get_multiple_of_4(len(result)),'=')
    return result
Let’s test it 😄
word = "Yassir"
print(f"{word} ---> {b64_encode(word)}")
#Program stdout
Yassir ---> WWFzc2ly
Now we should get our word if we decode the result.
echo "WWFzc2ly" | base64 -d
Yassir
It works !
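If you would rather cross-check from Python than from the shell, the standard library's base64 module gives the reference output for the two words used in this post. Note that b64encode() works on bytes, so the text is encoded to ASCII first.

```python
import base64

expected = {}
for word in ["Yassir", "Hi"]:
    # b64encode takes bytes and returns bytes, hence the encode/decode pair
    expected[word] = base64.b64encode(word.encode("ascii")).decode("ascii")
    print(f"{word} ---> {expected[word]}")

# Program stdout
# Yassir ---> WWFzc2ly
# Hi ---> SGk=
```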
In the next post, we are going to create a decoder, and a main function to nicely handle the input.
See you soon