I'm working on a project for a class and I could use some guidance. I need to parse a character array into constituent parts - the specifications of which I am given - but I am unsure how to do so in C.
I have been given a file and each page of the file is read into a buffer as a character array like so:
typedef struct page_t {
    char reserved[PAGESIZE];
} page_t;
I have been given the following specifications about the pages read:
- Each page starts with a 2-byte gap offset, followed by key-value records, a gap at the indicated offset, and lastly an 8-byte address at the end pointing to the next page
- The key-value records are of the following form: an 8-byte unsigned integer key followed by a value, where the first 4 bytes are an unsigned integer indicating the length of the string part of the value, followed by a string of variable length (it will be the length indicated in the 4 bytes previously mentioned, so the total length of the value portion will be length+4)
- There can be multiple key-value records in the file but the sum of all key-value records will not exceed 4086 bytes and the gap is always at the end of the file prior to the address of the next page
Since I have not been given any more explanation about the format of the page read in, and I need to parse through the char array, I was wondering if I could do something like use the strtoul function to read the 8 bytes of the array at a time to find the correct key (and to skip over the key's values if they are not the key I am trying to match). I asked my TA about it and the answer I got was:
You can use functions that convert character (byte) arrays to numbers. Consider making a toy example program that converts a structure to a character array and back to see if scan/atoi/strtoll... have the expected behavior. If the functions do not work you can also consider reading iteratively. You may find them useful to extract the key/value size. The value as a string should work!
So I tried making a short program that converted a struct to an array and back and tried using strtoul on the string, but I'm not sure that I'm doing it correctly.
So my tester program looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

typedef struct record_test {
    uint64_t key;
    uint32_t val_size;
    char value[255];
} record_test;

int main( int argc, char ** argv ) {
    record_test record = {1234, 13, "asdfghjklqwer"};
    char page[4096];

    // print what is in record
    printf("Here's the record itself:\n");
    printf("key: %llu\n", record.key);
    printf("val_size: %u\n", record.val_size);
    printf("record: %s\n", record.value);

    memcpy(page, &record, sizeof(record_test));

    // print what is in page
    printf("Here's what's in the page:\n");
    printf("page: %s\n", page);

    // check page contents with pointer
    record_test* revert;
    revert = (record_test*)page;
    printf("Here's the reverted record using pointers:\n");
    printf("key: %llu\n", revert->key);
    printf("val_size: %u\n", revert->val_size);
    printf("record: %s\n", revert->value);

    // reading what is in page using strtoul
    char* endKey;
    char* value;
    printf("reading using strtoul:\n");
    printf("key: %lu\n", strtoul(page, &endKey, 8));
    printf("val size: %d\n", (int)strtoul(endKey, &value, 4));
    printf("value: %s\n", value);
}
And these are the results I'm getting from it when I use printf to follow it:
Here's the record itself:
key: 1234
val_size: 13
record: asdfghjklqwer
Here's what's in the page:
page: ?
Here's the reverted record using pointers:
key: 1234
val_size: 13
record: asdfghjklqwer
reading using strtoul:
key: 0
val size: 0
value: ?
So based on the pointer that I used to recast the struct, the character array does have the right information in it, but for whatever reason the character array itself shows ? when I try to print it, and similarly the printf statements showing what strtoul is reading show 0 for the integers. I'm not sure what's going on here. Why am I getting ? when that character isn't even in the value string? Can someone tell me where I am going wrong or if I can even use this function at all? Should I be trying to iterate through the character array using bitwise operations to read it instead?
Any help would be great! Thank you!
I'm going to try to help you understand what's happening here. When you do memcpy to "flatten" your structure, let's analyze what should be going into memory.

We start out with 1234. Convert that to hexadecimal and it becomes 04D2. A uint64_t is 64 bits wide, which is 8 bytes on your machine (you can verify this with sizeof(uint64_t)), so in memory you can expect the first 8 bytes to be 00 00 00 00 00 00 04 D2.

Next up, you have 13, which in hexadecimal is 0D, and it's in a uint32_t. That is half the size of a uint64_t, so 4 bytes long (again, you can verify with sizeof). This means the next 4 bytes would be 00 00 00 0D.

Finally, you have an array of 255 char. Each char is 1 byte long. Each letter in your text asdfghjklqwer gets converted to the ASCII code representing that letter, so the hexadecimal would be 61 73 64 66 67 ... and the rest of those 255 bytes are zero, because your initializer zero-fills the part of the array it doesn't set. Everything in page past the end of the copied record is never written, though, so that part is just whatever data happened to be in memory.

Now one final thing to keep in mind is the endianness of your computer. If your computer has an Intel or AMD processor, then it is little-endian. If you're unfamiliar with what endianness is, look at the Wikipedia article on endianness for an explanation. Simply put, endianness refers to the order in which the bytes of a multi-byte value are written to memory. Little-endian (which is probably what you have) means that the least significant byte is written first, at the lowest address.
So what does this mean? Up above I said the first 8 bytes in memory would be 00 00 00 00 00 00 04 D2. For little-endian machines, this isn't really true. The bytes are actually written right to left. What's actually in memory would be D2 04 00 00 00 00 00 00. Hopefully this makes sense.

So now, with some little modifications to your program, you can actually print out what's in your computer's memory and you can see more clearly what I am talking about.
First, in your program, change char page[4096]; to unsigned char page[4096];. The reason is that all this is easier to understand with unsigned characters. If you really want to know how signed and unsigned numbers work in a computer system, Google "two's complement" to learn more. For now, just change it to unsigned. Then add a loop to your program that prints each byte of page in hexadecimal.

When you run this program, it will execute the memcpy like before, and then print out the data stored at the page address. Try modifying your record and see if you can understand what my explanation is all about!

Hopefully this all made sense! Good luck!!!