C++ reference inconsistency

574 Views Asked by At

I'm using the yaml-cpp library to parse yaml. Abbreviated sample:

YAML::Node def = YAML::LoadFile(defFile);
for (auto itemPair = def.begin(); itemPair != def.end(); ++itemPair) {
    // Grab a reference so `itemPair->second` doesn't need to be copied all over the place
    auto& item = itemPair->second;

    // A few instances of the below in series
    if (item["key"].IsDefined()) { doSomething(item["key"].as<std::string>()); }

    // Problem happens here
    if (item["issue"].IsDefined()) {
        if (!item["issue"].IsMap()) { continue; }
        for (auto x = item["issue"].begin(); x != item["issue"].end(); ++x) {
            LOG(INFO) << "Type before: " << item.Type() << " : " << itemPair->second.Type();
            auto test = x->first.as<std::string>();
            LOG(INFO) << "Type after: " << item.Type() << " : " << itemPair->second.Type();
            // Using item as a map fails because it no longer is one!
            // Next loop attempt also crashes when it attempts to use [] on item.
        }
    }

The problem happens in the nested loop, where the reference taken at the beginning of the snippet suddenly changes, however the variable it's referencing seems to be unaffected:

I1218 12:44:04.697798 296012 main.cpp:123] Type before: 4 : 4
I1218 12:44:04.697813 296012 main.cpp:125] Type after: 2 : 4

My understanding of references is that they act as an alias for another variable. I understand the yaml library might be doing some magic behind the scenes which would change the underlying data, but I can't comprehend why the reference seems to be getting updated yet the original value remains.

Edit: Some serious mind-blowing behaviour is happening here. The reference gets "reset" back to the correct value after any call to itemPair->second.Type(). Thus if I add another log call:

LOG(INFO) << "Type after: " << item.Type() << " : " << itemPair->second.Type();
LOG(INFO) << "Type afterer: " << item.Type() << " : " << itemPair->second.Type();

The result:

I1218 12:58:59.965732 297648 main.cpp:123] Type before: 4 : 4
I1218 12:58:59.965752 297648 main.cpp:125] Type after: 2 : 4
I1218 12:58:59.965766 297648 main.cpp:126] Type afterer: 4 : 4

Reproducible example:

test.yaml:

---
one:
    key: x
    issue:
        first: 1
two:
    key: y
    issue:
        first: 1
        second: 2

main.cpp same as above but with hardcoded test.yaml, LOG swapped for std::cout, and the mock function:

#include <iostream>
#include <yaml-cpp/yaml.h>

void doSomething(std::string x) { std::cout << "Got key: " << x << std::endl; }

int main() {
    YAML::Node def = YAML::LoadFile("test.yaml");
    for (auto itemPair = def.begin(); itemPair != def.end(); ++itemPair) {
        // Grab a reference so `itemPair->second` doesn't need to be copied all over the place
        auto& item = itemPair->second;

        // A few instances of the below in series
        if (item["key"].IsDefined()) { doSomething(item["key"].as<std::string>()); }

        // Problem happens here
        if (item["issue"].IsDefined()) {
            if (!item["issue"].IsMap()) { continue; }
            for (auto x = item["issue"].begin(); x != item["issue"].end(); ++x) {
                std::cout << "Type before: " << item.Type() << " : " << itemPair->second.Type() << std::endl;
                auto test = x->first.as<std::string>();
                std::cout << "Type after: " << item.Type() << " : " << itemPair->second.Type() << std::endl;
                std::cout << "Type afterer: " << item.Type() << " : " << itemPair->second.Type() << std::endl;
                // Using item as a map fails because it no longer is one!
                // Next loop attempt also crashes when it attempts to use [] on item.
            }
        }
    }
}

Result:

$ ./build/out
Got key: x
Type before: 4 : 4
Type after: 2 : 4
Type afterer: 4 : 4
Got key: y
Type before: 4 : 4
Type after: 2 : 4
Type afterer: 4 : 4
Type before: 4 : 4
Type after: 2 : 4
Type afterer: 4 : 4
1

There are 1 best solutions below

0
On

Node is designed to hold a reference, and the iterator behaves like a pointer to std::pair<Node, Node> and will return a temporary Node. If you bind to that Node, you will bind to a destroyed Node. So you need a copy here. Change auto& to auto will solve the problem.

It is designed this way, because it doesn't want you to touch the underneath memory. Otherwise a possible dangling reference could be made when reallocating the memory.

An example for dangling reference:

std::vector<int> v{1};
auto &ref1 = v[0];

v.reserve(100); // reallocating, causing ref1 a dangling reference.

Also, I wrote why it is designed this way. See here: https://github.com/jbeder/yaml-cpp/issues/977#issuecomment-771041297 and I'll just copy it here.


Why reference is UB here.

When using ->, the iterator iter creates a temporary derefence result on the stack, returns its pointer, and destroys this object immediately after scope.

This is to make iter->second behaves alike the (*iter).second.

It is hard to decide when to destroy that object if you put that deref result on the heap.

The behavior is expected to be the same as (*iter).second. But (*iter).second is a rvalue, the compiler won't allow auto&. That's not the case in the iter->second, because the compiler thinks iter->second as a lvalue.

C++ standards makes p->m, the built-in member of pointer expression, a lvalue. So there's no way to forbid binding to reference.

In conclusion, the behavior is correct when

V list = iter->second;   // correct
V &list = iter->second;  // wrong
V &&list = iter->second; // COMPILE TIME ERROR
V &&list = std::move(iter->second); // still wrong

auto list = iter -> second;   // correct, list is V
auto &list = iter -> second;  // wrong,   list is V&
auto &&list = iter -> second; // wrong,   list is V&

V list = (*iter).second;   // correct
V &list = (*iter).second;  // COMPILE TIME ERROR
V &&list = (*iter).second; // correct

auto list = (*iter).second;   // correct, list is V
auto &list = (*iter).second;  // COMPILE TIME ERROR
auto &&list = (*iter).second; // correct, list is V&&

Here are some possible amendments for the author:

  1. Make detail::iterator_value object long living or just simply leak the memory.
  2. Remove operator->().
  3. Write into the document. Telling everyone to use auto.

Method 1 can cause much trouble. Method 2, 3 are good solutions, I think.

Why copy works like a reference here.

  1. In current design, every change went through a Node. Node is the public interface of the underlying memory. It is designed to become polymorphism. And the real type of the underlyding data is decided during running, which is unknown in the compile time. So auto& list = iter->second is impossible to bind to the correct underlying type.

This could be done with some efforts. It will be something like

auto& list = iter->second.data_as_ref<std::string>();

but still not that convenient to use.

  1. In the current design, you can get a copy by means of
auto list = iter->second.as<std::string>();

You cannot bind to it. It only allows you to copy, not write.

  1. That's because Node ensures you to use his interface to assign the value. It's quite important because assigning a data means 3 or more things to do.
  • If new data is one of the following type, it will encode it. std::pair, std::array, std::list, std::vector, std::map, bool, Binary
  • It assigns the data.
  • It assigns the type, one member in the enum class NodeType.
  • It assigns the status, a bool isDefined.

When reading, it also need to decode if the data was encoded. So it shouldn't give you direct write/read access.

  1. Also your ref could be dangling because a memory reallocation is possible.

Copy works like a reference is a must in the current design.

Conclusion

Use auto iter = iter->first; or use (*iter).first.