A somewhat lengthy question, so please bear with me.
I am writing a parser to extract Objective-C metadata entities from Mach-O binaries, and I want to better understand how pointers to these metadata entities are stored/encoded in Mach-O files.
For example, using this Obj-C code:
#import <Foundation/Foundation.h>
@interface Person : NSObject
- (void) someMethod;
@end
@implementation Person
- (void) someMethod {}
@end
int main() {
return 0;
}
I compile it twice:
1. First using the following command:
clang++ -target arm64-apple-ios16 -isysroot /path/to/iphoneos_sdk \
-framework Foundation -o test test.m
Output from objdump -s test
...
Contents of section __DATA_CONST.__objc_classlist:
100008000 c0c00000 00000000 ........
...
Contents of section __DATA.__objc_data:
10000c098 01000000 00001080 01000000 00001080 ................
10000c0a8 00000000 00002080 00000000 00000000 ...... .........
10000c0b8 00c00000 00001000 98c00000 00001000 ................
10000c0c8 02000000 00001080 00000000 00002080 .............. .
10000c0d8 00000000 00000000 48c00000 00000000 ........H.......
Note that the class pointer is stored as just 0xc0c0 in the __objc_classlist section, while the class is actually located at address 0x0001 0000 c0c0 in the __objc_data section.
2. Then I compile the same input again using the following command (note that I have an Intel machine so the target here is x86_64 by default):
clang++ -framework Foundation -o test test.m
Output from objdump -s test
...
Contents of section __DATA_CONST.__objc_classlist:
100004000 d8800000 01000000 ........
...
Contents of section __DATA.__objc_data:
1000080b0 00000000 00000000 00000000 00000000 ................
1000080c0 00000000 00000000 00000000 00000000 ................
1000080d0 00800000 01000000 b0800000 01000000 ................
1000080e0 00000000 00000000 00000000 00000000 ................
1000080f0 00000000 00000000 68800000 01000000 ........h.......
In this case, the class pointer is stored in full as 0x0001 0000 80d8 in the __objc_classlist section, and we can follow that address directly to where the class is actually stored in the __objc_data section.
I also noticed other ways in which pointers are encoded. For example, I came across a case for ARM64 targets where a pointer to a metadata entity was stored as: 0x0000 9000 0000 3faf while the actual location is 0x0001 0000 3faf.
So, my question is: how does Objective-C/clang encode metadata entity pointers in Mach-O files?
You're looking at data that hasn't been dynamically linked yet. You need to be aware of dynamic linking operations in order to parse this meaningfully.
It depends. Specifically, it depends on which runtime linking format your binary uses for binds and rebases. Broadly speaking, there are two formats:
Dyld opcodes.
This is the "old" format, in use since macOS 10.6. In this format, all fixup metadata is stored separately from the data it applies to, which is why you get clean pointers surrounded by zeroes in your x86_64 binary. As its name suggests, it's an opcode-based sequence of instructions, which is stored somewhere in __LINKEDIT and pointed to by the LC_DYLD_INFO/LC_DYLD_INFO_ONLY load command in the Mach-O header. You can dump this info specifically with xcrun dyld_info -opcodes. The load command and the dyld opcodes are defined in mach-o/loader.h. Use of the opcodes has been somewhat detailed by Jonathan Levin, though for the actual implementation, see MachOAnalyzer.cpp and MachOLayout.cpp in the dyld source.
Chained fixups.
This is the "new" format, first introduced in iOS 12 on arm64e. In this format, the fixup metadata is stored inline, alongside the target data it applies to, which is what you're seeing in your arm64 binary.
This format was initially only used for arm64e binaries, and whether it is used depends on the target architecture and minimum OS version, but iOS 16 and macOS 13 targets now seem to use it for all architectures (I'm guessing your default macOS target is 12.x or lower, which is why your x86_64 binary still uses dyld opcodes).
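As a rough sketch of how a parser could tell the two formats apart: scan the load commands for LC_DYLD_CHAINED_FIXUPS versus LC_DYLD_INFO(_ONLY). The constants and struct layouts below are copied from mach-o/loader.h so the snippet stays self-contained off macOS; the helper name is my own invention, not an existing API:

```c
#include <stdint.h>
#include <string.h>

/* Constants from mach-o/loader.h. */
#define LC_REQ_DYLD            0x80000000u
#define LC_DYLD_INFO           0x22u
#define LC_DYLD_INFO_ONLY      (0x22u | LC_REQ_DYLD)
#define LC_DYLD_CHAINED_FIXUPS (0x34u | LC_REQ_DYLD)

/* 64-bit Mach-O header and generic load command, as in mach-o/loader.h. */
struct mach_header_64 {
    uint32_t magic, cputype, cpusubtype, filetype;
    uint32_t ncmds, sizeofcmds, flags, reserved;
};
struct load_command { uint32_t cmd, cmdsize; };

/* Hypothetical helper: given a buffer holding a 64-bit little-endian Mach-O,
   returns 1 for chained fixups, 0 for dyld opcodes, -1 if neither is found. */
int fixup_format(const uint8_t *buf) {
    struct mach_header_64 mh;
    memcpy(&mh, buf, sizeof mh);
    const uint8_t *p = buf + sizeof mh;
    for (uint32_t i = 0; i < mh.ncmds; i++) {
        struct load_command lc;
        memcpy(&lc, p, sizeof lc);
        if (lc.cmd == LC_DYLD_CHAINED_FIXUPS) return 1;
        if (lc.cmd == LC_DYLD_INFO || lc.cmd == LC_DYLD_INFO_ONLY) return 0;
        p += lc.cmdsize; /* each load command records its own size */
    }
    return -1;
}
```

A real parser would additionally validate the magic, handle fat binaries, and bounds-check cmdsize against sizeofcmds.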
The way this works is by first segmenting the binary into pages (which may or may not match the hardware page size), and recording the offset of the first value that needs to be fixed up in each page. The value at that offset then encodes the information needed to construct a valid pointer at load time, as well as the offset to the next such value, thereby forming the "fixup chain". Cramming all of this into a 64-bit (or sometimes even 32-bit) value is of course no small feat, so there are many subtly different formats to pick from, each optimised for a special use case (see mach-o/fixup-chains.h). But generally, the top bit tells you whether it's a bind or a rebase, N bits in the middle encode the distance to the next pointer, pointer authentication data, etc., and the remaining bits encode either the offset from the base of the image (for rebases) or an index into the import symbol table (for binds). Only one format can be chosen for the entire binary, so you will likely only have to implement two or three, and will never encounter the rest.
At that point you're left with the list of page offsets that lead to the first value on each page. If chained fixups are used in conjunction with dyld opcodes, then this is encoded somehow (I never looked at it) in the dyld opcode sequence with BIND_OPCODE_THREADED. If they are used stand-alone, then there is an LC_DYLD_CHAINED_FIXUPS load command in the Mach-O header, which points to a struct dyld_chained_fixups_header, which in turn points to a few more structs, encoded as offsets from itself. One of those holds the page starts, another holds the list of imported symbols, etc. See mach-o/fixup-chains.h again for those.
You can use xcrun dyld_info -fixup_chains and xcrun dyld_info -fixup_chain_details to examine this. In the more general case, you can also use xcrun dyld_info -fixups to display all bind and rebase targets, no matter whether they use dyld opcodes or fixup chains under the hood. But I suppose that won't help you much for the purpose of writing a parser.
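To make the arm64 case concrete, here is a minimal sketch of decoding one 64-bit chained-fixup value, assuming the DYLD_CHAINED_PTR_64/DYLD_CHAINED_PTR_64_OFFSET layout from mach-o/fixup-chains.h (a plausible format for a plain-arm64 iOS 16 binary; which format your binary actually uses must be read from dyld_chained_starts_in_segment.pointer_format, and other formats lay out their bits differently):

```c
#include <stdint.h>

/* Decoded view of one fixup in the dyld_chained_ptr_64_rebase /
   dyld_chained_ptr_64_bind layout. Field widths for rebases are
   target:36, high8:8, reserved:7, next:12, bind:1; binds store a
   24-bit import ordinal instead of a target. */
typedef struct {
    int      is_bind;  /* top bit: 1 = bind, 0 = rebase */
    uint64_t target;   /* rebase: vm offset from the image base */
    uint32_t ordinal;  /* bind: index into the import table */
    uint32_t next;     /* distance to the next fixup, in 4-byte strides */
} fixup64;

fixup64 decode_fixup64(uint64_t raw) {
    fixup64 f = {0};
    f.is_bind = (int)(raw >> 63);
    f.next    = (uint32_t)((raw >> 51) & 0xFFF);
    if (f.is_bind)
        f.ordinal = (uint32_t)(raw & 0xFFFFFF);     /* low 24 bits */
    else
        f.target  = raw & 0xFFFFFFFFFull;           /* low 36 bits */
    /* high8 (bits 36-43) is ignored here; it carries the top byte
       of the pointer for tagged-pointer style values. */
    return f;
}
```

Under this layout, the 0xc0c0 entry from your __objc_classlist dump decodes as a rebase with target offset 0xc0c0 and next == 0 (end of the chain on that page); adding the image's preferred load address 0x1 0000 0000 gives 0x1 0000 c0c0, the class in __objc_data.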