Inferred Caesar key not accurate on proper sentences

The approach I tried is to compute the percentage of each letter in the encrypted text and calculate the difference between each percentage and the reference English letter frequency. Next, I normalized the differences (I tried various methods; this time with a constant of 50 and a divisor of 3, which is most likely wrong) and took the average value as the cipher key.

Main program:

import os  # needed for os.path.exists below

attempts = 0
max_attempts = 3

while attempts < max_attempts:
    input_file_name = input("Please enter the file you want to analyze: ")

    if input_file_name.endswith('.txt') and os.path.exists(input_file_name):
        input_text = ManageFile.openFile(input_file_name)

        if input_text is None:
            print("Error: Unable to read the input file.")
            attempts += 1
        else:
            reference_file_name = input("Please enter the reference frequencies file: ")

            if reference_file_name.endswith('.txt') and os.path.exists(reference_file_name):
                reference_frequencies = ManageFile.openFile(reference_file_name)

                if reference_frequencies is None:
                    print("Error: Unable to read the reference frequencies file.")
                    attempts += 1
                else:
                    analyzer = FrequencyAnalyzer(input_text, reference_frequencies)
                    inferred_cipher_key = analyzer.analyze_text()
                    print("Inferred Caesar Cipher Key (Number of Shifts):", inferred_cipher_key)

                    decrypt_file = input("Do you want to decrypt the file? (y/n): ").lower()

                    if decrypt_file == 'y':
                        decrypted_text = Caesar(inferred_cipher_key).decrypt(input_text)
                        print("Decrypted Text:")
                        print(decrypted_text)
                        inner_attempts = 0
                        while inner_attempts < max_attempts:
                            output_file_name = input("Please enter an output file: ")

                            if output_file_name.endswith('.txt'):
                                if os.path.exists(output_file_name):
                                    overwrite = input(f"The file '{output_file_name}' already exists. Do you want to overwrite it? (y/n): ").lower()
                                    if overwrite != 'y':
                                        print("Decryption canceled.")
                                        break # Exit the loop if user doesn't want to overwrite and return to the main menu
                                    else:
                                        print(f"Overwriting the file '{output_file_name}'...")

                                ManageFile(input_text, inferred_cipher_key).toFile(output_file_name, decrypted_text)
                                print(f"Decrypted text has been saved to '{output_file_name}'")
                                break  # Exit the loop if a valid output file name is provided
                            else:
                                print("Invalid output file name. Please include '.txt' extension.")
                                inner_attempts += 1
                        else:
                            print("Failed to read the file after 3 attempts. Returning to the main menu.")
                    else:
                        print("Decryption canceled.")
                    break
            else:
                print("Invalid reference frequencies file name or file does not exist. Please include '.txt' extension.")
                attempts += 1
    else:
        print("Invalid input file name or file does not exist. Please include '.txt' extension.")
        attempts += 1

if attempts >= max_attempts:
    print("Failed to read the file(s) after 3 attempts. Returning to the main menu.")

Functions:

import collections
import string

class TextAnalyzer:
    def __init__(self, text):
        self._text = text
        self._processed_text = ""
        self._all_letters = string.ascii_uppercase
        self._cleaned_letters = []
        self._cleaned_letters_count = 0
        self._individual_letter_counts = {}
        self._top_5_letter_counts = []

    def _preprocess_text(self):
        self._cleaned_letters = list(filter(str.isalpha, self._text.upper()))
        self._cleaned_letters_count = len(self._cleaned_letters)
        self._individual_letter_counts = collections.Counter(self._cleaned_letters)
        self._top_5_letter_counts = self._individual_letter_counts.most_common(5)
        self._processed_text = " ".join(self._cleaned_letters)

    def analyze_text(self):
        self._preprocess_text()

class FrequencyAnalyzer(TextAnalyzer):
    def __init__(self, text, frequency_data):
        super().__init__(text)
        self._frequency_data = {}
        # Split on real newlines; strip() avoids a trailing empty line
        lines = frequency_data.strip().split("\n")
        for line in lines:
            char, percent = line.strip().split(',')
            self._frequency_data[char] = float(percent)
        self._letter_percentages = {}
        self._inferred_cipher_key = None

    def _calculate_letter_percentages(self):
        super()._preprocess_text()
        total_letters = sum(self._individual_letter_counts.values())
        letter_percentages = {
            letter: (count / total_letters) * 100 for letter, count in self._individual_letter_counts.items()
        }
        return letter_percentages

    def _normalize_letter_percentages(self, letter_percentages):
        normalized_diff = {
            letter: (letter_percentages[letter] - self._frequency_data[letter] + 50) % 50 / 3
            for letter in letter_percentages
        }
        return normalized_diff

    def _generate_report(self):
        letter_percentages = self._calculate_letter_percentages()
        normalized_diff = self._normalize_letter_percentages(letter_percentages)
        average_normalized_diff = sum(normalized_diff.values()) / len(normalized_diff)
        inferred_cipher_key = round(average_normalized_diff)
        #print("Inferred Caesar Cipher Key (Number of Shifts):", inferred_cipher_key)
        return inferred_cipher_key

        # print("Inferred Caesar Cipher Key (Number of Shifts):", inferred_cipher_key)

        # print("Final Letter Percentages:")
        # for char, percentage in letter_percentages.items():
        #     print(f"{char}: {percentage:.3f}%")

    def analyze_text(self):
        inferred_cipher_key = self._generate_report()
        return inferred_cipher_key

I have 2 encrypted files and 1 frequency reference file in percent:

English reference frequency text file:

A,8.2
B,1.5
C,2.8
D,4.3
E,12.7
F,2.2
G,2.0
H,6.1
I,7.0
J,0.15
K,0.77
L,4.0
M,2.4
N,6.7
O,7.5
P,1.9
Q,0.095
R,6.0
S,6.3
T,9.1
U,2.8
V,0.98
W,2.4
X,0.15
Y,2.0
Z,0.074

When I run the first encrypted file which contains:

Beqvstm, beqvstm tqbbtm abiz,
pwe Q ewvlmz epib gwc izm.

Cx ijwdm bpm asg aw pqop,
tqsm i lqiuwvl qv bpm asg.

and the frequency reference file via

Please enter the file you want to analyze: Mystery.txt
Please enter the reference frequencies file: englishtext.txt
Inferred Caesar Cipher Key (Number of Shifts): 8

I get 8 which is correct as the new decrypted text is

Twinkle, twinkle little star,
how I wonder what you are.

Up above the sky so high,
like a diamond in the sky.

but when I try the second encrypted text (which is a proper sentence):

Bpm nikba qv bpib kwuxtmf kiam qa ycmabqwvijtm.

which should give

The facts in that complex case is questionable.

as the decrypted text, the inferred cipher key should be 8, but I got 5 instead.

Please enter the file you want to analyze: facts.txt
Please enter the reference frequencies file: englishtext.txt
Inferred Caesar Cipher Key (Number of Shifts): 5

I even tried other proper sentences like

Wkh shrsoh duh zhdulqj irupdo dwwluh

which should give

The people are wearing formal attire

as the decrypted text. The inferred cipher key should be 3, but I got 4 instead.

Please enter the file you want to analyze: attire.txt
Please enter the reference frequencies file: englishtext.txt
Inferred Caesar Cipher Key (Number of Shifts): 4

What is wrong with the method I used, and is there a better approach?

tripleee On

Clearly, the sentences you tested with do not exhibit the character distribution which you expect from longer texts in English. The approach you used is inherently brittle for short texts.

Perhaps a more robust approach would be to collect statistics for pairs (or triplets, etc.) of adjacent letters; incidentally, analyzing sequences of adjacent letters in text was one of Andrey Markov's early applications of Markov chains.
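
As a rough illustration of the pair idea, here is a minimal sketch that tries every key and counts how many adjacent letter pairs in each candidate decryption belong to a set of common English pairs. The names `COMMON_BIGRAMS`, `bigram_hits` and `best_key_by_bigrams` are made up for this example, and the pair set is a short hand-picked list rather than a measured frequency table, so treat it as an assumption.

```python
# A short hand-picked set of letter pairs often cited as common in English.
# This is an assumption for illustration, not a measured frequency table.
COMMON_BIGRAMS = {
    "TH", "HE", "IN", "ER", "AN", "RE", "ON", "AT", "EN", "ND", "TI", "ES",
    "OR", "TE", "OF", "ED", "IS", "IT", "AL", "AR", "ST", "TO", "NT", "NG",
}


def decrypt(text, key):
    """Undo a Caesar shift of `key`, leaving non-letters untouched."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") - key) % 26 + ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") - key) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def bigram_hits(text):
    """Count adjacent character pairs that appear in COMMON_BIGRAMS."""
    upper = text.upper()
    # Pairs that contain a space or punctuation never match the set.
    return sum(upper[i:i + 2] in COMMON_BIGRAMS for i in range(len(upper) - 1))


def best_key_by_bigrams(ciphertext):
    """Return the key whose decryption contains the most common pairs."""
    return max(range(26), key=lambda k: bigram_hits(decrypt(ciphertext, k)))


print(best_key_by_bigrams("Bpm nikba qv bpib kwuxtmf kiam qa ycmabqwvijtm."))
```

A real implementation would weight each pair by its measured frequency instead of counting hits, but even this crude count separates the correct key from the others on the question's second sentence.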

For a less invasive change, perhaps try the top three (or top five etc) rotation keys suggested by the frequency analysis you already have, and check against a dictionary which one(s) seem the most plausible - or perhaps display the decrypted result for all of them to the user (ranked by probability?) and let them choose.
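
A minimal sketch of that idea follows. It assumes, purely for illustration, that each of the most common ciphertext letters might stand for E, and it uses a tiny hard-coded word set (`COMMON_WORDS`) as a stand-in for a real dictionary file. The helper names are hypothetical and are not part of the code in the question.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real dictionary; swap in a full word list.
COMMON_WORDS = {
    "the", "and", "that", "are", "is", "in", "of", "to",
    "you", "it", "was", "for", "with", "what",
}


def decrypt(text, key):
    """Undo a Caesar shift of `key`, leaving non-letters untouched."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") - key) % 26 + ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") - key) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def candidate_keys(ciphertext, top_n=3):
    """Assume each of the top_n most common ciphertext letters stands for E."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    most_common = Counter(letters).most_common(top_n)
    return [(ord(letter) - ord("E")) % 26 for letter, _ in most_common]


def rank_candidates(ciphertext, top_n=3):
    """Return (word_hits, key, plaintext) tuples, most plausible first."""
    ranked = []
    for key in candidate_keys(ciphertext, top_n):
        plaintext = decrypt(ciphertext, key)
        words = re.findall(r"[a-z]+", plaintext.lower())
        hits = sum(word in COMMON_WORDS for word in words)
        ranked.append((hits, key, plaintext))
    return sorted(ranked, reverse=True)


for hits, key, plaintext in rank_candidates("Wkh shrsoh duh zhdulqj irupdo dwwluh"):
    print(key, hits, plaintext)
```

Printing the whole ranked list, rather than only the winner, also covers the option of letting the user choose among the candidates.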

rossum On

The standard method for solving a Caesar cipher is 'Running Down the Alphabet'. With only 26 possible keys, one of which is the null key, you can try them all:

NBCM CM UH YRUGJFY
nbcm cm uh yrugjfy
ocdn dn vi zsvhkgz
pdeo eo wj atwilha
qefp fp xk buxjmib
rfgq gq yl cvyknjc
sghr hr zm dwzlokd
this is an example
uijt jt bo fybnqmf
vjku ku cp gzcorng
wklv lv dq hadpsoh
xlmw mw er ibeqtpi
ymnx nx fs jcfruqj
znoy oy gt kdgsvrk
aopz pz hu lehtwsl
bpqa qa iv mfiuxtm
cqrb rb jw ngjvyun
drsc sc kx ohkwzvo
estd td ly pilxawp
ftue ue mz qjmybxq
guvf vf na rknzcyr
hvwg wg ob sloadzs
iwxh xh pc tmpbeat
jxyi yi qd unqcfbu
kyzj zj re vordgcv
lzak ak sf wpsehdw
mabl bl tg xqtfiex
nbcm cm uh yrugjfy
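
A table like the one above can be produced with a few lines of Python. This is only a sketch; `shift_text` and `run_down_alphabet` are names invented for the example.

```python
def shift_text(text, key):
    """Shift each ASCII letter forward by `key` positions; keep everything else."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            result.append(ch)
    return "".join(result)


def run_down_alphabet(ciphertext):
    """Return (shift, candidate) pairs for all 26 possible shifts."""
    return [(key, shift_text(ciphertext, key)) for key in range(26)]


for key, candidate in run_down_alphabet("NBCM CM UH YRUGJFY"):
    print(f"{key:2d} {candidate.lower()}")
```

Row 6 reads "this is an example". Here the number is how far the ciphertext was shifted forward to reach the plaintext, so the original encryption shift would be 26 - 6 = 20.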

Letter frequencies, pair frequencies or triple frequencies can help you pick the right combination. For long texts just use a shorter sample of the cyphertext.
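
To let the program pick the row instead of a person, one simple option is to score every candidate against the letter frequencies. The sketch below uses the percentages from the question's reference file and scores a candidate by summing the reference percentage of each of its letters, so text made of common English letters scores highest. The scoring rule and the names `english_score` and `infer_key` are assumptions chosen for brevity; a chi-squared or log-likelihood score would be a more conventional choice.

```python
# Reference percentages copied from the question's englishtext.txt.
ENGLISH_PERCENT = {
    "A": 8.2, "B": 1.5, "C": 2.8, "D": 4.3, "E": 12.7, "F": 2.2, "G": 2.0,
    "H": 6.1, "I": 7.0, "J": 0.15, "K": 0.77, "L": 4.0, "M": 2.4, "N": 6.7,
    "O": 7.5, "P": 1.9, "Q": 0.095, "R": 6.0, "S": 6.3, "T": 9.1, "U": 2.8,
    "V": 0.98, "W": 2.4, "X": 0.15, "Y": 2.0, "Z": 0.074,
}


def decrypt(text, key):
    """Undo a Caesar shift of `key`, leaving non-letters untouched."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") - key) % 26 + ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") - key) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def english_score(text):
    """Sum the reference percentage of every letter; higher looks more English."""
    return sum(ENGLISH_PERCENT.get(ch, 0.0) for ch in text.upper())


def infer_key(ciphertext):
    """Try all 26 keys and return the one whose decryption scores highest."""
    return max(range(26), key=lambda k: english_score(decrypt(ciphertext, k)))


for sample in (
    "Bpm nikba qv bpib kwuxtmf kiam qa ycmabqwvijtm.",
    "Wkh shrsoh duh zhdulqj irupdo dwwluh",
):
    key = infer_key(sample)
    print(key, decrypt(sample, key))
```

Because every key is scored on the full decryption, one unusual letter such as X or Q lowers the total only slightly, which is why this tends to hold up better on short sentences than a single averaged difference.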