UE5 Very poor performance of Memcpy


I'm currently working on publishing images generated in Unreal to ROS, but now I'm facing strange performance issues.

I'm using Unreal 5.2 on Ubuntu 20, with a Ryzen 5 5600X, an RTX 3070 and 40 GB of RAM.

My method currently looks like this:

void FROSOutputServer::PublishImage(ros::Publisher& ImagePublisher, TSharedPtr<FSensorDataBase>& SensorData){ 
   
    TRACE_CPUPROFILER_EVENT_SCOPE(FROSOuputServer::PublishImage);
    FCameraData* CameraData = static_cast<FCameraData*>(SensorData.Get());
    
    sensor_msgs::ImagePtr ImgMsgPtr = boost::make_shared<sensor_msgs::Image>();

    ros::Time TimeStamp;
    TimeStamp.fromSec(CameraData->Timestamp.ToDouble());
    ImgMsgPtr->header.stamp = TimeStamp;
    ImgMsgPtr->header.frame_id = "test";
    ImgMsgPtr->step = CameraData->Width * 3; 
    ImgMsgPtr->height = CameraData->Height;
    ImgMsgPtr->width = CameraData->Width;
    ImgMsgPtr->encoding = "bgr8";
    ImgMsgPtr->is_bigendian = 0;
    
    {
        TRACE_CPUPROFILER_EVENT_SCOPE(FROSOuputServer::PublishImage::Copy);
        ImgMsgPtr->data.resize(CameraData->ImageData.Num());
        uint8* DestPtr = ImgMsgPtr->data.data();
        uint8* SrcPtr = CameraData->ImageData.GetData();
        FMemory::Memcpy(DestPtr, SrcPtr, CameraData->ImageData.Num());
    }
    {
        TRACE_CPUPROFILER_EVENT_SCOPE(FROSOuputServer::PublishImage::Publish);
        ImagePublisher.publish(ImgMsgPtr);
    }

    // Only for debugging purposes, this would be called implicitly by the shared pointer destructor
    {
        TRACE_CPUPROFILER_EVENT_SCOPE(FROSOuputServer::PublishImage::FreeUnrealPtr);
        SensorData.Reset();
    }

    {
        TRACE_CPUPROFILER_EVENT_SCOPE(FROSOuputServer::PublishImage::FreeBoostPtr);
        ImgMsgPtr.reset();
    }
}

With these Unreal data structs:

struct FSensorDataBase
{
    Utils::Time Timestamp;
};
struct FCameraData : public FSensorDataBase
{
    
    TArray<uint8> ImageData;
    uint32 Width;
    uint32 Height;
};

For testing purposes, I created a 1000x1000 image with 3 bytes per pixel (bgr8), resulting in a 3 MB TArray.
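As a quick sanity check of that size, the arithmetic can be written out as a small sketch. The helper names `Bgr8Step` and `Bgr8ImageBytes` are hypothetical and not part of the code above; they only assume the 3 bytes per pixel implied by the "bgr8" encoding and by `step = Width * 3`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers, not part of the original code.
// Bytes per row for a tightly packed bgr8 image (3 bytes per pixel).
constexpr std::size_t Bgr8Step(std::uint32_t Width)
{
    return static_cast<std::size_t>(Width) * 3;
}

// Total payload size in bytes: rows times bytes per row.
constexpr std::size_t Bgr8ImageBytes(std::uint32_t Width, std::uint32_t Height)
{
    return Bgr8Step(Width) * static_cast<std::size_t>(Height);
}

// 1000 x 1000 pixels at 3 bytes each is 3,000,000 bytes (about 3 MB).
static_assert(Bgr8ImageBytes(1000, 1000) == 3000000, "unexpected image size");
```

If `ImageData.Num()` differs from this value for the test image, the buffer would not match the `step` and `height` written into the message.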

I expected a modern CPU to take very little time to copy such a "small" amount of data, but when profiling with Unreal Insights (see the TRACE_CPUPROFILER_EVENT_SCOPE macros above) I got very poor results.

[Unreal Insights screenshot]

The copy operation takes about 13.5 ms, and freeing the boost pointer after the message is published takes another 20.8 ms. Unreal's shared pointer isn't freed in this scope.
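For comparison, the same copy could be timed outside of Unreal and ROS. The following is a minimal sketch under stated assumptions, not a measurement I have run inside the engine: `FCopyTiming` and `TimeCopy` are hypothetical names, plain `std::memcpy` stands in for `FMemory::Memcpy`, and a `std::vector<std::uint8_t>` stands in for the message's `data` field. It times the resize separately from the copy, since the Copy scope in the method above covers both.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical result type: elapsed milliseconds for each step.
struct FCopyTiming
{
    double ResizeMs;
    double MemcpyMs;
};

// Hypothetical helper: resizes Dst to match Src, then copies the bytes,
// timing both steps separately with a monotonic clock.
FCopyTiming TimeCopy(const std::vector<std::uint8_t>& Src,
                     std::vector<std::uint8_t>& Dst)
{
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    const auto T0 = Clock::now();
    // resize() value-initializes new elements, so growing the vector
    // allocates and writes zeros before any image data is copied.
    Dst.resize(Src.size());
    const auto T1 = Clock::now();

    if (!Src.empty())
    {
        std::memcpy(Dst.data(), Src.data(), Src.size());
    }
    const auto T2 = Clock::now();

    assert(Dst.size() == Src.size());
    return { Ms(T1 - T0).count(), Ms(T2 - T1).count() };
}
```

Calling `TimeCopy` with a 3,000,000-byte source and an empty destination gives a baseline for both steps on the same machine; a large gap between that baseline and the 13.5 ms seen in Unreal Insights would point at something other than the raw copy.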

Is there any way to optimize this, or am I simply running into CPU constraints due to Unreal's overhead?
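One variation that might be worth trying, offered as a sketch rather than a confirmed fix: `std::vector::resize` zero-fills the new elements before `Memcpy` overwrites them, whereas `assign` with an iterator range copies the source bytes directly into the newly allocated storage. `FillImageData` is a hypothetical helper name, and whether this changes the measured time inside Unreal is an assumption that would need profiling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the original code: replaces the
// contents of Dst with NumBytes bytes starting at Src in a single pass,
// without zero-filling the destination first.
void FillImageData(std::vector<std::uint8_t>& Dst,
                   const std::uint8_t* Src,
                   std::size_t NumBytes)
{
    assert(Src != nullptr || NumBytes == 0);
    Dst.assign(Src, Src + NumBytes);
}
```

In the method above this would correspond to something like `FillImageData(ImgMsgPtr->data, CameraData->ImageData.GetData(), CameraData->ImageData.Num());` in place of the resize and Memcpy pair.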
