Debugging runtime differences from the same code base


I'm currently using the Veins library and simulation package to run some experiments. Because these have very long run times, I'm trying to use the university cluster servers (KITE 2.0/RHEL6.6/Lustre 2.5.29.ddnpf3). However, I've now encountered several different runtime bugs with the same code that runs perfectly fine on my local machine (Fedora 23). I'm looking for a way to easily debug this problem. I suspect that the cause lies in the different gcc version, or perhaps in some other system-level library that I can't change remotely, but I'm not sure. I'm certain that the OMNeT++ version is the same; the Veins library is supplied by me and is identical locally and remotely.

An example of the issues I've encountered is discussed here, which I eventually fixed like this (as far as I can tell, both versions have the same semantics... DimensionSet extends std::set, and DimensionSet::timeFreqDomain is a static const initialized with (Dimension::time, Dimension::frequency) as in the fix).

What is a good approach to look for the cause? Is there a simple way to "cross-compile" between these machines, or some way to diff the binaries to look for the cause? Where do I look for common ways to deal with problems like these?

Best answer

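I might have tracked the error down to an example of a static initialization order fiasco: MiXiM's Dimension::time is a static member, so it should not have been used to initialize other static members. The order in which static objects in different translation units are initialized is not specified by C++, so it can differ between compilers and linkers, which would explain why the same code works on one machine and crashes on another. Unfortunately, this is exactly what MiXiM (and, by extension, Veins) did, leading to such crashes.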

I have pushed commit 7807f47c (part of Veins 4.4), which gets rid of almost all static members, so that the whole of the framework should be safer to use.