I am trying to write the more simple version of the Boyer Moore algorithm without prefix-function. It has to print all positions of symbols which have been compared with a pattern. And locally it passes the tests, but it fails when I commit it to the gitlab. I can't spot undefined behaviour here. I have garbage at the end.
#include <stdio.h>
#define MAX_PATTERN_LEN 16
#define BUF_SIZE 69
#define ALPH_SIZE 128
int read_str(int max_len, unsigned char *str_place) {
int count_read = 0;
for (int i = 0, ch; i < max_len; i++) {
if ((ch = getchar()) == '\n') {
str_place[i] = '\0';
break;
}
str_place[i] = (char)ch;
count_read++;
}
return count_read;
}
void calculate_shifts(const unsigned char *str, int len_str, int *badchar) {
for (int i = 0; i < ALPH_SIZE; i++)
badchar[i] = len_str;
for (int i = 0; i < len_str - 1; i++)
badchar[str[i]] = len_str - 1 - i;
}
void search(const unsigned char *str, const unsigned char *patt, int len_patt, int len_str) {
int badchar[ALPH_SIZE];
calculate_shifts(patt, len_patt, badchar);
int shift = 0;
while (shift <= (len_str - len_patt)) {
int j = len_patt - 1;
for (; j >= 0 && patt[j] == str[shift + j]; j--)
printf("%d ", shift + j + 1);
if (j < 0) {
shift += ((shift + len_patt) < len_str) ? badchar[patt[len_patt - 1]] : 1;
} else {
printf("%d ", shift + j + 1);
int shift_addition = badchar[str[shift + j]];
if ((shift_addition == len_patt) && (j < len_patt - 1) && (patt[len_patt - 1] == patt[0]))
shift_addition--;
shift += shift_addition;
}
}
}
int main(void) {
unsigned char str[BUF_SIZE + 1];
unsigned char patt[MAX_PATTERN_LEN + 1];
int len_patt = read_str(MAX_PATTERN_LEN + 1, patt);
int len_str = read_str(BUF_SIZE + 1, str);
if (!len_patt || !len_str)
return 0;
search(str, patt, len_patt, len_str);
return 0;
}
the test:
example
this is simple example
the correct output:
7 14 13 12 11 10 20 22 21 20 19 18 17 16
the actual output:
7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 ..
The difference (extra number 28 at the end of the output) can be caused by a missing newline character (
\n) at the end of the last line of the input. I was able to reproduce both of your outputs (with and without the 28) locally.TL;DR Fix the bug in your code by changing
(ch = getchar()) == '\n'to(ch = getchar()) == EOF || ch == '\n', and changing#define ALPH_SIZE 128to#define ALPH_SIZE 256.Maybe there is another bug, I haven't checked. If you encounter any other bugs, please ask it in a separate question on StackOverflow.
The rest of this answer explains how I diagnosed this bug.
AddressSanitizer
gcc -fsanitize=address) reveals a memory access problem if the newline is missing:Please note that I had to rerun it multiple times to trigger the AddressSanitizer error message.
Most of the time this is caused by a bug in your program, and often it's undefined behavior. The
in search /tmp/t.c:33clause in the output of AddressSanitizer indicates where in your source code the bad read happens (i.e. line 33, functionsearch).I've added some extra
printfs to your code to see what happens in line 33. The output is:The culprit is the last read access:
str[-369011759] == ??, this indicates thatshift + jbecomes -369011759, which is obviously incorrect, because correct values are small nonnegative integers.Please note that I had to rerun it multiple times to trigger the Address Sanitizer error message, and the number -369011759 was different each time.
It looks like
jhas a sane value of 6. To figure out the actual bug, the next step is checking whereshiftgets its large negative value.At this point the output line
str[27] == 255was suspicious to me. I checked your code whether it handles such large values instrcorrectly, and then I found a bug.One way to fix this newfound bug is changing
ALPH_SIZEfrom128to256. (I've verified that it makes the Address Sanitizer errors disappear.) Here is why. Line 39 contains the read ofbadchar[str[shift + j]]. For that to be valid, we need0 <= str[shift + j] && str[shift + j] < ALPH_SIZE, becauseALPH_SIZEis the size of thebadchararray. However, the maximum value forstr[...]is 255 (because it's an unsigned char), thus we needALPH_SIZE >= 256.Indeed, an out-of-bounds array access (read or write) is undefined behavior in C. To prove that this is actually happening, I've changed line 39 to
, and I've also changed line 28 to use malloc:
, and I've also added
free(badchar)later. With all that I've got:Indeed, the output of AddressSanitizer indicates the bug in line 39, and the output of the extra
fprintfcalls confirm that it's an out-of-bounds array access: reading of element 255 of the arraybadchar, which has only 128 elements.How did we end up with
str[27] == 255(thus readingbadchar[255])? That's easy: if there is no terminating\nat the end of the file when readingstr, thengetchar()returns EOF, which becomes 255 when assigned to anunsigned char. (The obvious fix, as above, is to stop at EOF instead, and changeALPH_SIZEto 256.)Why hasn't AddressSanitizer reported the read of out-of-bounds array index problem (as a stack-buffer-overflow) in line 39 before
badcharwas changed to usemalloc(reported as heap-buffer-overflow)? I don't quite understand this yet. AddressSanitizer should be able to report stack-buffer-overflow reliably. Probably it depends on the GCC and Clang version, because changing the commandgcctoclangmade the stack-buffer-overflow error appear reliably. Probably upgrading to the latest GCC and Clang would help. Valgrind is not as reliable for detecting stack-buffer-overflow reads as AddressSanitizer.