Hamming Distance Revisited

I refactored the hamming distance algorithm into cpp and used bitwise operators to accomplish the same thing.

I already solved this a long time ago in python. I haven’t had a lot of personal time in general to blog. I’ve been writing down fun ideas for posts, and believe me, I have a TON in the works but unfortunately I don’t have much finished material to release yet.

Better Content?

As part of a cleanup effort of my backup drive, I’ve been merging, and deleting repos, and files as I consolidate things. I stumbled across many articles from my previous blogs/sites, and will be releasing that content as i touch it up or rewrite it.

One of my focuses needs to be on grammar, correct spelling, and content. I was going to use being busy as an excuse to just let the posts fly, but as I want these posts to be a reflection of me, and the way I approach problems, I don’t think it makes sense to just release them without checking them more thoroughly.

Regarding Hamming Distance

I did a reasonably good write up on hamming distance previously, but decided rathern than link it here, I would rewrite it.

To that end, I would describe hamming distance as “the sum difference between two values”.

In Cryptopals challenge 1-6, we are asked to compute the Hamming Distance between two strings. I will say this over, and over, and over again…. It is mentioned in the first exercise for Cryptopals that we ALWAYS OPERATE ON BITS.

As is expected, a solution written by different software engineers, can yield different results if the understanding about the GOAL is not agreed upon. This is where things like User Stories, and Test Driven Development can be crucial planning tools used to help deliver the correct solution, which could also be called the agreed upon solution.

Research, and planning the algorithm

When possible, I will read descriptions, RFCs, protocols, etc to understand how a piece of software should behave. In this case, since hamming distance is a simple calculation, I headed to wikipedia first to read about what it is.

I will say that having gone back to wikipedia this morning to talk through what makes the algorithms there wrong, I can see that the description is significantly better than it used to be.

Definition

The Hamming distance between two equal-length strings of symbols is the number of positions at which the corresponding symbols are different.

The symbols may be letters, bits, or decimal digits, among other possibilities. For example, the Hamming distance between:

  • “karolin” and “kathrin” is 3.
  • “karolin” and “kerstin” is 3.
  • “kathrin” and “kerstin” is 4.
  • 0000 and 1111 is 4.
  • 2173896 and 2233796 is 3.

So given the updated definition, I would say the algorithms on wikipedia are correct when determining the hamming distance of the characters as the symbol, but we want the sum distance of hamming distance using bits as the symbol or metric.

Input

String 1:

1
this is a test

String 2:

1
wokka wokka!!!

Expected Output

1
37

Using bytes as the distance metric in Python

This is the original algorithm I found on wikipedia, which determines the distance in CHARs which is not helpful for our use case.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# hamming.py
import sys

def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))

def main():
    distance = hamming_distance("this is a test", "wokka wokka!!!")
    print("hamming distance: {}").format(distance)

if __name__ == "__main__":
    main()

Yields the following output:

1
2
$ python hamming.py
hamming distance: 14

Using bits as the distance metric in C++

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
#include <iostream>
#include <string>
#include <bitset>

using std::cout;
using std::endl;
using std::string;
using std::bitset;

int main() {
    int hamming_distance = 0;
    string s1 = "this is a test";
    string s2 = "wokka wokka!!!";

    if (s1.length() == s2.length()) {
        cout << "lengths match, proceeding" << endl;
        for (int i = 0; i < s1.length(); i++) {
            bitset<8> b1(s1[i]);
            bitset<8> b2(s2[i]);

            for (int b = 0; b < b1.size(); b++) {
                hamming_distance += b1[b] ^ b2[b];
            }
        }
        cout << "hamming distance: " << hamming_distance << endl;
        return 0;
    } else {
        cout << "lengths do not match, sorry" << endl;
    }

    return 1;
}

Yields the following output:

1
2
3
❯ g++ hamming.cpp -o hamming && ./hamming         
lengths match, proceeding
hamming distance: 37

Table Showing Distance Calculation

- 0 1 2 3 4 5 6 7 8 9 10 11 12 13
s1 01110100 01101000 01101001 01110011 00100000 01101001 01110011 00100000 01100001 00100000 01110100 01100101 01110011 01110100
s2 01110111 01101111 01101011 01101011 01100001 00100000 01110111 01101111 01101011 01101011 01100001 00100001 00100001 00100001
sum 2 3 1 2 2 3 1 5 2 4 3 2 3 4
Built with Hugo
Theme Stack designed by Jimmy