Process set to have real-time priority, same operation takes spurious time spikes


I have to perform a very time-critical but simple operation (say, a simple buffer shift). I'm using C++ and the Eigen library.

The average time is fine and meets my requirement. The problem is that, now and then, the same operation takes much longer. I can tolerate up to 100us, but anything above that risks coordination and synchronization problems in the real-time application (even if these cases happen just three or four times).

I'm fully aware that Windows is NOT meant for time-critical real-time operation. The project relies on some SDKs, and I'm not sure they can run on an RTOS. Maybe I can try something Linux-based, but I'm not sure it's viable.

That being said, is there any way to further reduce these occurrences and flatten the execution time, so that the project behaves at least reasonably deterministically?

Here is a very simple and representative test:

#include <Eigen/Dense>
#include <complex>
#include <vector>
#include <chrono>
#include <algorithm>
#include <iostream>
#include <numeric>
#include <string>   // std::string
#include <cstdio>   // printf
#include <cnpy.h>
#include <Windows.h>
#include <filesystem>

#define DATA_DIR "C:/workspace/trash"

#define NOW std::chrono::steady_clock::now
#define DELTA_T_US(t0, t1) (float)(std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() / 1000.0)
void print_time_res(std::vector<float> vec, std::string title)
{
    auto _min = *std::min_element(vec.begin(), vec.end());
    auto _max = *std::max_element(vec.begin(), vec.end());
    auto _avg = std::accumulate(vec.begin(), vec.end(), 0.0) / vec.size();

    std::cout << title << " minimum ex time : " << _min << "us" << std::endl;
    std::cout << title << " maximum ex time : " << _max << "us" << std::endl;
    std::cout << title << " average ex time : " << _avg << "us" << std::endl;
    std::cout << std::endl;
}

typedef  std::complex<float> std_complex;
typedef float complex[2];
typedef Eigen::Array <std_complex, Eigen::Dynamic, 1> eigen_c_vec;
typedef Eigen::Array<float, Eigen::Dynamic, 1> eigen_vec;
int main()
{
    // NOTE: Requires Administrator privileges! Otherwise it will fall back to priority "High". Checked with Task Manager
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

    // Needed for other operations
    printf("SIMD Used by Eigen: %s\n", Eigen::SimdInstructionSetsInUse());
    
    eigen_vec vec = eigen_vec::Random(360);
    const int N = 10000000;
    std::vector<float> times(N);

    for (int i = 0; i < N; i++)
    {
        auto _then = NOW();
        // Vector shifting
        vec(Eigen::seqN(0, 360 - 30)) = vec(Eigen::seqN(30, 360 - 30)).eval();
        auto _now = NOW();
        // Measured in microseconds
        times[i] = DELTA_T_US(_then, _now);
    }
    times.pop_back();
    print_time_res(times, "Eigen shifting");

    // Just to plot with Python
    std::string filename = (std::filesystem::path(DATA_DIR) / "TimeData.npy").string();
    cnpy::npy_save(filename, times, "w");
    return 0;
}

CMake Listing:

add_executable(EigenTest src/EigenTest.cpp)
target_link_libraries(EigenTest PRIVATE Eigen3::Eigen cnpy::cnpy)
target_compile_options(EigenTest PUBLIC /arch:AVX2)

Results

SIMD Used by Eigen: AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Eigen shifting minimum ex time : 0us
Eigen shifting maximum ex time : 143us
Eigen shifting average ex time : 0.0956169us

[Image: plot of the measured execution times]

Igor Levicki On

To get more consistent execution times on modern operating systems that use paging and virtual memory, you need to ensure that your data is actually committed in memory (and ideally that the virtual memory pages containing it are locked, which requires increasing the working set size). Otherwise, your execution time will depend on whether accessing the data causes page faults.

Another thing to consider is the concept of priority boosting which desktop operating systems employ to increase the responsiveness of foreground applications and their windows. You can try changing the CPU scheduler to favor background services to see if that helps.
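Separately from the system-wide scheduler setting, a process can opt out of dynamic priority boosts for its own threads through `SetProcessPriorityBoost`. This is a different knob from the one described above and is shown only as a minimal sketch; `disable_priority_boost` is a hypothetical helper name, and the non-Windows branch is a stub so that the snippet compiles elsewhere. Windows only boosts threads whose base priority is between 0 and 15, so this mainly matters when the process ends up in the High class rather than the real-time class.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: turn off dynamic priority boosting for every thread of
// the current process and confirm the setting by reading it back.
// Returns true only if boosting is reported as disabled.
bool disable_priority_boost()
{
#ifdef _WIN32
    HANDLE proc = GetCurrentProcess();
    if (!SetProcessPriorityBoost(proc, TRUE))   // TRUE means "disable boosting"
        return false;
    BOOL disabled = FALSE;
    if (!GetProcessPriorityBoost(proc, &disabled))
        return false;
    return disabled != FALSE;
#else
    return false;   // stub: no equivalent is attempted off Windows
#endif
}
```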

Moreover, the following should all help to minimize latency and give more consistent results: disabling all visual effects; using a video card that supports hardware-accelerated GPU scheduling (video card drivers are one of the main causes of high DPC latency); disabling all unnecessary background processes, services, and scheduled tasks; setting the High Performance power scheme; and disabling CPU sleep states in the BIOS.

Using a Windows Server without GUI to run the application is also an option, as is running it on a machine without network and audio adapters.

That is about as much as you can do with Windows, which was never meant to be used as a real-time OS and which started its life with a cooperative multitasking model.

EDIT

To answer your additional questions: unless you control the data layout, you can't really use different memory allocation strategies.

The functions which allow allocation, locking, prefetching, unlocking, and freeing virtual memory in Windows are VirtualAlloc, VirtualLock, PrefetchVirtualMemory, VirtualUnlock, and VirtualFree.
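A minimal sketch of how some of those calls fit together, assuming a small buffer such as the 360-float vector from the question. The helpers `grow_working_set`, `alloc_locked`, and `free_locked` are hypothetical names, and the non-Windows branches are placeholders that only keep the snippet compilable elsewhere (they do not lock anything).

```cpp
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#else
#include <cstdlib>   // fallback only, so the sketch compiles off Windows
#endif

// Hypothetical helper: raise the working set limits so that larger regions
// can be locked. VirtualLock is bounded by the minimum working set size.
bool grow_working_set(std::size_t min_bytes, std::size_t max_bytes)
{
#ifdef _WIN32
    return SetProcessWorkingSetSize(GetCurrentProcess(), min_bytes, max_bytes) != 0;
#else
    (void)min_bytes; (void)max_bytes;
    return false;
#endif
}

// Hypothetical helper: reserve and commit `bytes`, then pin the pages in RAM.
// Returns nullptr on failure. The size is rounded up to whole pages by the OS.
void* alloc_locked(std::size_t bytes)
{
#ifdef _WIN32
    void* p = VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!p) return nullptr;
    if (!VirtualLock(p, bytes)) {        // fails if the working set minimum is too small
        VirtualFree(p, 0, MEM_RELEASE);
        return nullptr;
    }
    return p;
#else
    return std::calloc(1, bytes);        // placeholder: zeroed, but not locked
#endif
}

// Hypothetical helper: undo alloc_locked.
void free_locked(void* p, std::size_t bytes)
{
    if (!p) return;
#ifdef _WIN32
    VirtualUnlock(p, bytes);
    VirtualFree(p, 0, MEM_RELEASE);      // the size must be 0 with MEM_RELEASE
#else
    (void)bytes;
    std::free(p);
#endif
}
```

With Eigen, such a buffer could then be wrapped without copying, for example with `Eigen::Map<Eigen::ArrayXf>(static_cast<float*>(p), 360)`, so the shifting code operates on locked memory.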

Note that those APIs allocate memory in system page size increments, so allocating 1 byte will allocate 4KB or even 2MB. Before even attempting to write code that uses them, you should measure the impact of paging on real-world usage of your code, using performance counters in Resource Monitor or some other approach, such as this code:

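// compile with -DPSAPI_VERSION=1, link with psapi.lib
#include <windows.h>  // must come before psapi.h
#include <psapi.h>
#include <stdio.h>

void DumpMemoryStats()
{
    PROCESS_MEMORY_COUNTERS_EX pmcex = { 0 };

    pmcex.cb = sizeof(PROCESS_MEMORY_COUNTERS_EX);

    // Bail out on failure instead of printing the zero-initialized struct
    if (!GetProcessMemoryInfo(GetCurrentProcess(), (PROCESS_MEMORY_COUNTERS*)&pmcex, sizeof(PROCESS_MEMORY_COUNTERS_EX)))
    {
        printf("GetProcessMemoryInfo failed, error %lu\n", GetLastError());
        return;
    }

    // PageFaultCount is a DWORD (unsigned long), hence %lu; the sizes are SIZE_T
    printf("          Page faults : %lu\n", pmcex.PageFaultCount);
    printf("Peak Working Set Size : %zu\n", pmcex.PeakWorkingSetSize);
    printf("     Working Set Size : %zu\n", pmcex.WorkingSetSize);
    printf("        Private Usage : %zu\n", pmcex.PrivateUsage);
}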
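To relate page faults to a specific piece of work rather than to the whole process lifetime, the counter can be sampled before and after the code under test. The following is a sketch under that assumption; `page_fault_count` and `page_faults_during` are hypothetical helper names, and off Windows they simply return -1. The counter includes soft faults as well as hard ones, so a non-zero difference is a hint to investigate rather than proof of disk paging.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <psapi.h>
#endif

// Hypothetical helper: cumulative page fault count of the current process,
// or -1 if it cannot be read (including on non-Windows builds).
long long page_fault_count()
{
#ifdef _WIN32
    PROCESS_MEMORY_COUNTERS pmc = {};
    pmc.cb = sizeof(pmc);
    if (!GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        return -1;
    return static_cast<long long>(pmc.PageFaultCount);
#else
    return -1;
#endif
}

// Hypothetical helper: run `work` and return how many page faults the process
// took while it ran, or -1 if the counter was unavailable.
template <typename F>
long long page_faults_during(F&& work)
{
    const long long before = page_fault_count();
    work();
    const long long after = page_fault_count();
    if (before < 0 || after < 0) return -1;
    return after - before;
}
```

Wrapping the timed loop from the question in `page_faults_during([&] { /* loop */ })` would show whether the spikes coincide with any faults at all.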

In the most extreme scenario, you could even turn to Address Windowing Extensions and reserve a chunk of physical memory to make sure that the OS not only cannot page, but also cannot move your data around. Bear in mind that this is highly discouraged for general-use applications, and I mention it only for the sake of completeness when it comes to Windows memory management.

As for the Windows version, it turns out that Windows IoT Enterprise is the best choice: version 21H2 introduced soft real-time capability.

Relevant links:

To summarize those links, in order to minimize latency you need to:

  • Disable CPU idle states
  • Disable SysMain (Superfetch), DPS (Diagnostic Policy Service), Audiosrv (Windows Audio), and wuauserv (Windows Update) services
  • Disable Threaded DPCs (deferred procedure calls)
  • Allocate CPU cores for your real-time app using WindowsIoT CSP
  • Set your process priority class to real-time and set the process CPU affinity mask to target only the cores allocated in the previous step, leaving one core for system and ISR/DPC processing
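The last step of the list can be sketched in code as follows. This is only an illustration: `core_mask` and `pin_realtime` are hypothetical helper names, which cores to use is an assumption that depends on how the machine was configured, and the non-Windows branch is a stub so the snippet compiles elsewhere.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: affinity mask selecting cores [first, first + count),
// limited to the 64 logical processors of a single processor group.
std::uint64_t core_mask(unsigned first, unsigned count)
{
    std::uint64_t mask = 0;
    for (unsigned i = first; i < first + count && i < 64; ++i)
        mask |= (std::uint64_t{1} << i);
    return mask;
}

// Hypothetical helper: request the real-time priority class, restrict the
// process to the cores in `mask`, and raise the calling thread's priority.
// Returns true only if the real-time class was actually granted.
bool pin_realtime(std::uint64_t mask)
{
#ifdef _WIN32
    HANDLE proc = GetCurrentProcess();
    if (!SetPriorityClass(proc, REALTIME_PRIORITY_CLASS)) return false;
    // Fails if the mask names processors that are not available to the process.
    if (!SetProcessAffinityMask(proc, static_cast<DWORD_PTR>(mask))) return false;
    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL)) return false;
    // Without the required privilege Windows grants HIGH_PRIORITY_CLASS instead
    // of failing, so read the class back to see what was really applied.
    return GetPriorityClass(proc) == REALTIME_PRIORITY_CLASS;
#else
    (void)mask;
    return false;   // stub: nothing is changed off Windows
#endif
}
```

On a four-core machine, for example, `pin_realtime(core_mask(1, 3))` would keep the process on cores 1 to 3 and leave core 0 for the system and ISR/DPC processing.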

So as it turns out, Windows can be used for some real-time tasks after all. The peaks you are seeing in your plot are most likely caused by DPCs (video drivers are usually the worst offenders). If you are curious, you can use LatencyMon to record a DPC trace during your program execution to see if you can spot any overlap.