How could you recreate UE4 serialization of user-defined types?


The problem

The Unreal Engine 4 Editor allows you to add objects of your own types to the scene.
Doing so requires minimal work from the user: to make a class visible in the editor, you only need to add some macros, like UCLASS():

UCLASS()
class UMyInputComponent : public UInputComponent // you can instantiate it in the editor!
{
    GENERATED_BODY() // required by the Unreal code generator

    UPROPERTY(EditAnywhere)
    bool IsSomethingEnabled;
};

This is enough to allow the editor to serialize the data of objects created in the editor (remember: the class is user-defined, but the user doesn't have to hardcode the loading of specific fields. Also note that the UPROPERTY variable can be of a user-defined type as well). The data is then deserialized when the actual game loads. So how is it handled so painlessly?

My attempt - hardcoded loading for every new class

#include <memory>
#include <sstream>
#include <string>
#include <vector>

struct vec3 { float x, y, z; }; // example of a user-defined type

class Component // abstract class
{
public:
    virtual ~Component() = default; // objects are destroyed through base pointers
    virtual void LoadFromStream(std::stringstream& str) = 0;
    //virtual void SaveIntoStream(std::stringstream& str) = 0;
};

class UserCreatedComponent : public Component
{
    std::string Name;
    int SomeInteger = 0;
    vec3 SomeVector{};
public:
    // you have to write a function like this every time you create a new class
    void LoadFromStream(std::stringstream& str) override
    {
        str >> Name >> SomeInteger >> SomeVector.x >> SomeVector.y >> SomeVector.z;
    }
};

std::vector<std::unique_ptr<Component>> ComponentsFromStream(std::stringstream& str)
{
    std::vector<std::unique_ptr<Component>> components;
    std::string type;
    while (str >> type)
    {
        if (type == "UserCreatedComponent") // do this for every user-defined type...
            components.push_back(std::make_unique<UserCreatedComponent>());
        else
            continue; // unknown type name: skip this token

        components.back()->LoadFromStream(str);
    }

    return components;
}

Example of a UserCreatedComponent object stream representation:

UserCreatedComponent MyComponent 5 0.707 0.707 0.707

The engine user has to do these things every time they create a new class:
1. Modify ComponentsFromStream by adding another if.
2. Add two methods, one which loads from a stream and another which saves to a stream.
We want to simplify this so that the user only has to use a macro like UPROPERTY.

Our goal is to free the user from all this work and create a more extensible solution, like UE4's (described above).

Attempt at simplifying 1: Using type-int mapping

This section is based on the following: https://stackoverflow.com/a/17409442/12703830
The idea is that we map every new class to an integer, so when we create an object we can just pass the integer given in the stream to the factory.
Example of a UserCreatedComponent object stream representation:

1 MyComponent 5 0.707 0.707 0.707  

This solves the problem of working out the type of created object but also seems to create two new problems:

  • How should we map classes to integers? What would happen if we include two libraries containing classes that map themselves to the same number?
  • What will initializing e.g. components that need vectors for construction look like? We don't always use strings and ints for object construction (and streams give us pretty much only that).

Best answer

So how is it handled so painlessly?

The C++ language does not provide features that would make de/serialization of class instances as simple as it is in Unreal Engine. There are various ways to work around the language's limitations; Unreal uses a code generator.

The general idea is as follows:

  • When you start project compilation, a code generator is executed.
  • The code generator parses your header files and searches for macros which have special meaning, like UCLASS, USTRUCT, UENUM, UPROPERTY, etc.
  • Based on collected data, it generates not only code for de/serialization, but also for other purposes, like reflection (ability to iterate certain members), information about inheritance, etc.
  • After that, your code is finally compiled along with the generated code.

Note: this is also why you have to include "MyClass.generated.h" in all header files which declare UCLASS, USTRUCT and similar.

In other words, someone must write the de/serialization code in some form. The Unreal solution is that the author of such code is an application.
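To make the steps above a bit more concrete, here is a minimal sketch of the kind of code a generator could emit for a user class. All names in it (MyComponent, PropertyInfo, MyComponent_Properties, SaveObject, LoadObject) are made up for this illustration; it is not what Unreal's generator actually produces.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// --- written by the user ------------------------------------------------
struct MyComponent
{
    std::string Name;            // imagine each member marked with a macro
    int SomeInteger = 0;
    bool IsSomethingEnabled = false;
};

// --- hypothetical generator output --------------------------------------
// One entry per marked member: its name plus functions that know how to
// write and read exactly that member.
struct PropertyInfo
{
    const char* Name;
    void (*Save)(const MyComponent&, std::ostream&);
    void (*Load)(MyComponent&, std::istream&);
};

static const std::vector<PropertyInfo> MyComponent_Properties = {
    {"Name",
     [](const MyComponent& o, std::ostream& s) { s << o.Name << ' '; },
     [](MyComponent& o, std::istream& s) { s >> o.Name; }},
    {"SomeInteger",
     [](const MyComponent& o, std::ostream& s) { s << o.SomeInteger << ' '; },
     [](MyComponent& o, std::istream& s) { s >> o.SomeInteger; }},
    {"IsSomethingEnabled",
     [](const MyComponent& o, std::ostream& s) { s << o.IsSomethingEnabled << ' '; },
     [](MyComponent& o, std::istream& s) { s >> o.IsSomethingEnabled; }},
};

// Generic code: iterates the table instead of naming members by hand.
void SaveObject(const MyComponent& o, std::ostream& s)
{
    for (const PropertyInfo& p : MyComponent_Properties) p.Save(o, s);
}

void LoadObject(MyComponent& o, std::istream& s)
{
    for (const PropertyInfo& p : MyComponent_Properties) p.Load(o, s);
}
```

The same table can also serve simple reflection, e.g. listing member names in an editor UI. The user never touches the table: adding a member and re-running the generator would be enough.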

If you want to implement such a system yourself, be aware that it's a lot of work. I'm no expert in this field, so I'll just provide general information:

  • The primary purpose of code generators is to automate repetitive work, nothing more - in other words, there's no special magic. That means that "how objects are de/serialized" (how they're transformed between memory and a file) and "how the code which de/serializes is created" (whether it's written by a person or generated by an application) are two separate topics.
  • First, establish how objects are de/serialized. For example, std::stringstream can be used, or objects can be de/serialized from/to well-known formats like XML, JSON, BSON, YAML, etc., or a custom format can be defined.
  • Establish what the source of data for the generated de/serialization code is. In the case of Unreal Engine, it's the user code itself. But that's not the only way - for example, Protocol Buffers use a simple language whose only purpose is to define data structures, and the generator creates code which you can include and use.
  • If the source of data should be the C++ code itself, do not write your own C++ parser! (The only exceptions to this rule are educational purposes, or wanting to spend the rest of your life working on the parser.) Luckily, there are projects you can use - for example, Clang exposes its AST (e.g. through libclang or LibTooling).
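The separation described in the first two points can be sketched like this: one function describes what is serialized (the part a generator could write for you), while interchangeable "archive" classes decide how. Enemy, Serialize, TextWriter, TextReader and KeyValueWriter are hypothetical names for this example only.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

struct Enemy
{
    std::string Name;
    int Health;
    float Speed;
};

// WHAT is serialized. This is the repetitive part that could be generated.
template <typename Archive>
void Serialize(Archive& ar, Enemy& e)
{
    ar.Field("name", e.Name);
    ar.Field("health", e.Health);
    ar.Field("speed", e.Speed);
}

// HOW, variant 1: values separated by spaces, names are dropped.
class TextWriter
{
public:
    explicit TextWriter(std::ostream& out) : out_(out) {}
    template <typename T>
    void Field(const char* /*name*/, const T& value) { out_ << value << ' '; }
private:
    std::ostream& out_;
};

// Reads back what TextWriter wrote (same field order is assumed).
class TextReader
{
public:
    explicit TextReader(std::istream& in) : in_(in) {}
    template <typename T>
    void Field(const char* /*name*/, T& value) { in_ >> value; }
private:
    std::istream& in_;
};

// HOW, variant 2: one "name=value" line per field.
class KeyValueWriter
{
public:
    explicit KeyValueWriter(std::ostream& out) : out_(out) {}
    template <typename T>
    void Field(const char* name, const T& value) { out_ << name << '=' << value << '\n'; }
private:
    std::ostream& out_;
};
```

Switching the output format means passing a different archive to Serialize; the per-class description stays untouched, whoever (or whatever) wrote it.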

How should we map classes to integers? What would happen if we include two libraries containing classes that map themselves to the same number?

There's one fundamental problem with mapping classes to integers: there are unboundedly many possible class names but only finitely many values of a fixed-size integer, so it's not possible to uniquely map every possible class name to an integer.

Proof: create classes named Foo_[integer] and map each one to its [integer], i.e. Foo_0 -> 0, Foo_1 -> 1, Foo_2 -> 2, etc. Once you've used the biggest integer value, how do you map Bar_0?

You can start assigning the numbers sequentially as classes are added to a project, but as you correctly pointed out, what if you include a new library? You could start counting from some big number, like 1,000,000, but how do you determine what the first number for each library should be? There's no clear solution.

Some solutions to this problem are:

  • Define a clear subset of classes which can be de/serialized and assign sequential integers to these classes. The subset can be, for example, "only classes in my project, no library support".
  • Identify classes with two integers - one for the class, one for the library. This means you need some central register which assigns library integers uniquely (e.g. in the order they're registered).
  • Use a string which uniquely identifies the class, including the library name. This is what Unreal uses.
  • Generate a hash from the class and library name. There's a risk of hash collision - the better the hash you use, the lower the risk. For example git (the version control application) uses SHA-1 (which is considered unsafe today) to identify its objects (files, directories, commits) and the program is used worldwide without major issues.
  • Generate a UUID, a 128-bit mostly random number (with special rules). There's also a risk of collision, but it's generally considered highly improbable. Used by Java and by Unity, the game engine.
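As an illustration of the string-based option, here is a minimal sketch of a factory registry keyed by "Library.Class" names, which could replace the chain of ifs from the question. ComponentRegistry, REGISTER_COMPONENT, the Light class and the MyGame library name are all hypothetical; this is not how Unreal implements it.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Component
{
    virtual ~Component() = default;
    virtual std::string TypeName() const = 0;
};

class ComponentRegistry
{
public:
    using Factory = std::function<std::unique_ptr<Component>()>;

    static ComponentRegistry& Get()
    {
        static ComponentRegistry instance; // created on first use
        return instance;
    }

    // Returns false (and changes nothing) if the name is already taken.
    bool Register(const std::string& name, Factory factory)
    {
        return factories_.emplace(name, std::move(factory)).second;
    }

    // Returns nullptr for names nobody registered.
    std::unique_ptr<Component> Create(const std::string& name) const
    {
        auto it = factories_.find(name);
        if (it == factories_.end()) return nullptr;
        return it->second();
    }

private:
    std::map<std::string, Factory> factories_;
};

// Registers Class under "Library.Class" during static initialization.
#define REGISTER_COMPONENT(Library, Class)                                    \
    static const bool Class##_registered = ComponentRegistry::Get().Register( \
        #Library "." #Class, [] { return std::unique_ptr<Component>(new Class); })

// --- user code: one line per new class, no central function to edit -----
struct Light : Component
{
    std::string TypeName() const override { return "Light"; }
};
REGISTER_COMPONENT(MyGame, Light);
```

The loader then reads the type name from the stream and calls ComponentRegistry::Get().Create(name). In a code-generator setup, the REGISTER_COMPONENT line is exactly the kind of thing the generator would emit for you.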

What would happen if we include two libraries containing classes that map themselves to the same number?

That's called a collision. How it's handled depends on the design of the de/serialization code; there are two main approaches to this problem:

  • Detect it. For example, if your class identifier contains a library identifier, don't allow loading/registering a library with an ID which is already registered. If the ID doesn't include a library ID (e.g. the hash/UUID variant), don't allow registering a class whose ID is already taken. Throw an exception or exit the application.
  • Assume there's no collision. If a collision actually happens, the behaviour is undefined (so-called UB): the application will probably crash or act weirdly, and it might corrupt stored data.
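The "detect it" approach for the hash variant could look roughly like the sketch below. It uses 32-bit FNV-1a purely as an example hash (my choice, not something any particular engine is known to use), and TypeIdTable and its interface are hypothetical. The hash function is injectable only so that a collision can be provoked in a test.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// 32-bit FNV-1a hash of a qualified class name such as "MyGame.Light".
std::uint32_t HashName(const std::string& name)
{
    std::uint32_t hash = 2166136261u;  // FNV offset basis
    for (unsigned char c : name)
    {
        hash ^= c;
        hash *= 16777619u;             // FNV prime
    }
    return hash;
}

class TypeIdTable
{
public:
    using HashFn = std::uint32_t (*)(const std::string&);

    explicit TypeIdTable(HashFn hash = &HashName) : hash_(hash) {}

    // Returns the ID of the name. Registering the same name twice is fine;
    // a different name that hashes to an existing ID is rejected.
    std::uint32_t Register(const std::string& qualifiedName)
    {
        const std::uint32_t id = hash_(qualifiedName);
        auto result = names_.emplace(id, qualifiedName);
        if (!result.second && result.first->second != qualifiedName)
            throw std::runtime_error("type ID collision: " + qualifiedName +
                                     " vs " + result.first->second);
        return id;
    }

private:
    HashFn hash_;
    std::map<std::uint32_t, std::string> names_;
};
```

Failing loudly at registration time is usually much cheaper than debugging corrupted save files later, which is what the second approach risks.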

What will initializing e.g. components that need vectors for construction look like? We don't always use strings and ints for object construction (and streams give us pretty much only that).

This depends on what is required of the de/serialization code.

The simplest solution is actually to use a string of values separated by spaces.

For example, let's define the following structure:

struct Person
{
    std::string Name;
    float Age;
};

A vector of Person instances could look like: 3 Adam 22.2 Bob 34.5 Cecil 19.0 (i.e. first serialize the number of items (the vector size), then the individual items).
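A minimal sketch of that count-prefixed layout, assuming names never contain whitespace (SavePersons and LoadPersons are names I made up for the example):

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct Person
{
    std::string Name;
    float Age;
};

// Writes the item count first, then every item, all separated by spaces.
void SavePersons(const std::vector<Person>& persons, std::ostream& out)
{
    out << persons.size();
    for (const Person& p : persons)
        out << ' ' << p.Name << ' ' << p.Age;
}

// Reads the count, then that many items. Stops early on malformed input.
std::vector<Person> LoadPersons(std::istream& in)
{
    std::size_t count = 0;
    in >> count;

    std::vector<Person> persons;
    for (std::size_t i = 0; i < count; ++i)
    {
        Person p{};
        if (!(in >> p.Name >> p.Age))
            break; // truncated or malformed data
        persons.push_back(p);
    }
    return persons;
}
```

Note that the default stream formatting writes 19.0 as 19; it still reads back as the same value.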

However, what if you add, remove or rename a member? Previously serialized data would become unreadable. If you want a more robust solution, it might be better to use more structured data, for example YAML:

persons:
  - name: Adam
    age: 22.2
  - name: Bob
    age: 34.5
  - name: Cecil
    age: 19.0
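To show why named fields are more robust, here is a small sketch that reads a single record made of "key: value" lines. It is deliberately not a YAML parser (for real YAML you'd use an existing library); LoadPerson and the default values are assumptions for the example. Unknown keys are skipped and missing keys keep their defaults, so adding or removing a member doesn't break old data.

```cpp
#include <istream>
#include <sstream>
#include <string>

struct Person
{
    std::string Name = "unknown"; // defaults are used when a key is missing
    float Age = 0.0f;
};

// Reads one record consisting of "key: value" lines until the stream ends.
Person LoadPerson(std::istream& in)
{
    Person p;
    std::string line;
    while (std::getline(in, line))
    {
        const std::string::size_type colon = line.find(": ");
        if (colon == std::string::npos)
            continue; // not a key/value line

        const std::string key = line.substr(0, colon);
        const std::string value = line.substr(colon + 2);

        if (key == "name")
            p.Name = value;
        else if (key == "age")
            p.Age = std::stof(value); // throws on non-numeric text
        // any other key: written by a newer or older version, ignore it
    }
    return p;
}
```

Renaming a member is still a problem, but it can now be handled explicitly, e.g. by accepting both the old and the new key for a while.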

Final notes

The problem of de/serializing objects (in C++) is actually a big one, and various systems use various solutions. That's why this answer is so generic and doesn't provide exact code - there's no single silver bullet. Every solution has its advantages and disadvantages. Even a detailed description of just Unreal Engine's serialization system would fill a book.

So this answer assumes that the reader is able to search for the various topics mentioned, like the YAML file format, Protocol Buffers, UUIDs, etc.

Every mentioned solution to a sub-problem has lots of its own problems which weren't explored, for example de/serialization of strings with spaces or newlines from/to a simple string stream. If you need to solve such problems, it's recommended to first search for more specialized questions, or to write one if there's nothing to be found.
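For the string-with-spaces case specifically, one option in the standard library is std::quoted (C++14, from <iomanip>), which wraps the string in quotes on output and escapes embedded quotes. The two helper functions below are hypothetical and just show the round trip:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Writes the name in quotes (escaping embedded quotes), then the count.
std::string SaveNameAndCount(const std::string& name, int count)
{
    std::ostringstream out;
    out << std::quoted(name) << ' ' << count;
    return out.str();
}

// Reads back what SaveNameAndCount produced. Returns false on failure.
bool LoadNameAndCount(const std::string& text, std::string& name, int& count)
{
    std::istringstream in(text);
    return static_cast<bool>(in >> std::quoted(name) >> count);
}
```

With this, a name like My "big" door is stored as "My \"big\" door" and comes back unchanged, spaces included.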

Also, C++ is constantly evolving. For example, better support for reflection is being worked on, which might one day provide enough features to implement a high-quality de/serializer. However, if it's done at compile time, it would depend heavily on templates, which slow down compilation significantly and decrease code readability. That's why code generators might still be considered a better choice.