foobuzz

by Valentin, March 28 2016, in science

equitext

I've never been satisfied with the vulnerability of substitution ciphers being the frequencies of characters in texts. My main concern is that it depends on the statistics of the language the message is written in. While languages such as English or French (and most natural languages I guess) sure have distinctive frequency distributions, we can imagine a language where letters appear approximately the same amount of time in written texts. In such case, a frequency analysis isn't feasible, at least on single letters.

While I'm no expert in cryptography, I think that this wouldn't mean that a substitution cipher isn't broken and I believe plenty of other attacks would still be possible. Brute force, to begin with. The 26 letters of the alphabet offer 26! possible substitution keys. That's 12 orders of magnitude less than 2^128 yet 128 bit keys are considered weak by today's standards.

Anyway, out of curiosity I cogitated about a method that would take a text and somehow transform it in order to produce another text in which every character appears the same amount of time. I ended up with a text-to-text encoding for which I've written an implementation in Python in a module called equitext. Here's a demo:

>>> import equitext
>>> message = "A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable) and was first introduced by Karl Pearson."
>>> equitext.histogram(message)
  ======================================================================== 0.145
i ================================================= 0.1
a ================================================= 0.1
t ============================================= 0.09
r =============================== 0.063
o ============================= 0.059
e ============================= 0.059
n =========================== 0.054
s ======================== 0.05
b =============== 0.032
u =============== 0.032
d ============= 0.027
l ============= 0.027
f =========== 0.023
h ========= 0.018
c ========= 0.018
m ====== 0.014
v ====== 0.014
p ====== 0.014
y ==== 0.009
g ==== 0.009
. ==== 0.009
) == 0.005
A == 0.005
q == 0.005
w == 0.005
I == 0.005
P == 0.005
K == 0.005
( == 0.005
>>> encoded = equitext.encode(message)
>>> encoded
' ImobKAfgh)levscPwtp.(ydarnqui mrqolb(fevItu.npgiwyasAdch)KP sdc.pqrAnmihtaovfuKegw)lbIP(y huynbqo(iswlfIr.eP)gmtAapvKcd oKyahgsmwe)ntfuIip.PlrvqAd(bc (tawq.vAcplugI)mfsndyoirbPeKh( aeifnrPvtdhqwImgpylu.cb)AKos(.PwtsIboufmipKeAahv gn)yqldrc grPKIvil(dyA)s.hcpuqtbmaeofwn oKs(c)AgP.uhevIlqarmtpwnfydbi vwmhKqlrbf()ietopysPc.IudngAa'
>>> equitext.histogram(encoded)
. ======================================================================== 0.033
m ======================================================================== 0.033
) ======================================================================== 0.033
b ======================================================================== 0.033
r ======================================================================== 0.033
( ======================================================================== 0.033
  ======================================================================== 0.033
v ======================================================================== 0.033
h ======================================================================== 0.033
A ======================================================================== 0.033
c ======================================================================== 0.033
o ======================================================================== 0.033
t ======================================================================== 0.033
w ======================================================================== 0.033
n ======================================================================== 0.033
I ======================================================================== 0.033
y ======================================================================== 0.033
s ======================================================================== 0.033
u ======================================================================== 0.033
f ======================================================================== 0.033
P ======================================================================== 0.033
p ======================================================================== 0.033
g ======================================================================== 0.033
i ======================================================================== 0.033
K ======================================================================== 0.033
e ======================================================================== 0.033
d ======================================================================== 0.033
a ======================================================================== 0.033
q ======================================================================== 0.033
l ======================================================================== 0.033
>>> equitext.decode(encoded)
'A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable) and was first introduced by Karl Pearson.'

The encoded text is about 1.44 more lengthly than the original, although this depends on the size of the alphabet of the original text, that is, the number of unique characters it contains. 1.44 is the number for a message containing only letters of the alphabet. I explain this in the next part.

Two weeks ago I published an encrypted message on Reddit on the /r/codes subreddit. The message has been first encoded using equitext then encrypted by a substitution whose key is a permutation of the text's alphabet chosen at random. The post haven't get much attention, but someone nevertheless figured out interesting properties.

How it works

In a first time we'll assume that the text to encode is only made of letters of the alphabet, although the encoding works with any Unicode string.

The text is splitted in chunks of 18 letters (magic number explained in a few lines). Let C be a given chunk. C is one of the many possible combinations of 18 letters. We can define an index on such combinations. For example, 'aaa...a' would be combination number 0, 'aaa...b' would be combination number 1, and so on, all the way up to 'zzz...z' which would be combination number 26^18-1. Having such an index, we can say that C is chunk number N.

Meanwhile, we can also define an index on permutations of the alphabet. For example, the regular 'abcde...yz' would be permutation number 0, then 'abcde...zy' would be permutation number 1, and so on, all the way up to 'zyxw...ba' which would be permutation number 26!-1. Having such an index, we can take its Nth element, where N is the combination index of chunk C. We have then encoded a chunk to a permutation. We are sure that there are at least N permutations because 26^18 <= 26! Actually 18 is the greatest n such that 26^n <= 26!

The encoded text being a sequence of permutations of the alphabet, every letter has the same number of occurrences in it. That's all.

Since each chunk of 18 letters is encoded to a permutation (26 letters), the text grows by a factor 26/18 = 1.44

Efficient indexing

Finding the index of a given chunk is equivalent to a conversion from base 26 to base 10 where the number in base 26 is represented using letters of the alphabet. Encoding in one direction to the other is a straightforward base-to-base conversion.

Finding the permutation corresponding to a given index is a bit more involved. It's done using something called the Lehmer code. In the following paragraphs I try to give a friendly explanation of how it works. I begin by explaining how to find the index of a given permutation, then explain how to find the permutation of a given index.

There are 26! permutations. We can partition the set of permutations in 26 subsets. The first subset contains all permutations begining with 'a', the second one contains all permutations begining with 'b', and so on. It's trivial that all these subsets have the same size of 26! / 26 = 25! We can then decide that permutations belonging to the first set (the "a..." set) all have their index between 0 and 25!. The permutations belonging to the "b..." set all have their index between 25! and 2×25!, and so on, all the way up to the "z..." set whose permutations all have their index between 25×25! and 26!. Note that in these bounds given as examples, the upper bound is exclusive.

Once we've identified to what set a permutation belongs to, we perform the same process inside this set, except that we've removed the previous letter from the available letters. For example, inside the "b..." set, there would be the "ba..." set, with indexes between 25! and 25!+24!, then the "bc..." set with indexes between 25!+24! and 25!+2×24! (note that we've jumped from a to c since b is already consumed). We iterate over the letters of the permutation and go to the right subset at each iteration according to the letter we read. The final set is associated to only one possible index, which is the index of the permutation.

As you can see, finding the index of a given permutation is actually a simple computation. Let's dumb things down even more by considering that there are only three letters 'a', 'b' and 'c' in the ['a', 'b', 'c'] order and we want the index of the 'bca' permutation. This index is:

1*2! + 1*1! + 0*0! = 2 + 1 + 0 = 3
  1. We take the ordinal of 'b' in ['a', 'b', 'c'] which is 1. We get the first term 1*2!
  2. We take the ordinal of 'c' in ['a', 'c'] (we've removed 'b' from the list) which is 1. We get the second term 1*1!
  3. We take the ordinal of 'a' in ['a'] (we've removed 'c' from the list) which is 0. We get the final term 0*0!.

The factorial number system

What's interesting here is that we have the decomposition of 3 in the factorial number system. The factorial number system is a system to write numbers where digits are coefficients of increasing factorial numbers instead of increasing powers. 3 is written 110 in "factorial base" because 3 = 1×2! + 1×1! + 0×0!

So how do you find the "factorial base" digits of an integer given in decimal base? Because that's gonna help us convert an integer to a permutation. Well, let's take 42. It's 13000 in factorial base. That's:

1*4! + 3*3! + 0*2! + 0*1! + 0*0!

Let's write that explicitly:

1*4*3*2*1 + 3*3*2*1 + 0*2*1 + 0*1 + 0

We can factor 1 from the first 4 terms:

1 * (1*4*3*2 + 3*3*2 + 0*2 + 0) + 0

Inside the inner-most group, we can factor 2 on the first 3 terms:

1 * (2 * (1*4*3 + 3*3 + 0) + 0) + 0

Inside th inner-most group, we can factor 3 on the first 2 terms:

1 * (2 * (3 * (1*4 + 3) + 0) + 0) + 0

Remember that this number is 42. From the way we've written it, it is now easy to understand how you can get its factoradic digits: if you divide 42 by 1 you end up with a remainder of 0 and a quotient of:

2 * (3 * (1*4 + 3) + 0) + 0

If you divide this by 2, you end up with a remainder of 0 and a quotient of:

3 * (1*4 + 3) + 0

If you divide this by 3, you end up with a remainder of 0 and a quotient of:

1*4 + 3

If you divide this by 4, you end up with a remainder of 3 and a quotient of:

1

If you devide this by 5, you end up with a remainder of 1 and a quotient of 0.

You stop there, and look at the remainders that you've accumulated: 0, 0, 0, 3 and 1. That's 3100 in reverse.

A convertion to "base factorial" is very similar to a convertion to a regular base: you recursively divide the quotient by the base until you reach 0 and the digits in the destination base are the remainders in reverse order. Instead that here, instead of "dividing by the base", you divide by increasing integers: divide by 1, then by 2, then by 3 and so forth.

Once we have the digits of an index in the factorial number system, it's easy to get its corresponding permutation. For example, let's take 3 (110) again with the ['a', 'b', 'c'] alphabet. You pop the letter with ordinal 1 from the alphabet: 'b'. The alphabet is now ['a', 'c']. You pop the letter with ordinal 1 from the alphabet: 'c'. The alphabet is now ['a']. You pop the letter with ordinal 0: 'a'. The permutation is indeed 'bca'.

Generalization

Equitext is trivially extended to any alphabet. The alphabet specific to the text (that is, the set of characters that the text uses) is extracted from the text thanks to a preliminary pass. It is then ordered using the Unicode code point of each of its characters. The length of chunks of text to be encoded to permutations is deduced from the size N of the alphabet. It's the greatest n such that N^n <= N!

The text might require to be padded in order for its length to be divisible by the length of a chunk. The text is padded by repeting the character of the alphabet whose ordinal indicates the length of the pad. The length of the pad can't be greater than the size of the alphabet because as a pad it has a lesser length than a chunk which itself has a lesser length than the size of the alphabet (otherwise N^n would be greater than N!), so we'll always have enough characters in the alphabet to encode the length of the pad using only one character.

Links