Is mmap() faster than read() for perf_event_open


I am looking to monitor a number of events (hardware, software, and hardware cache) in my application. As with most applications that do profiling, performance is key. In an ideal world I would be able to read the CPU PMU events directly for CPU Cycle Count and others using mrs instructions. But since the kernel disables PMU access from EL0 by default, I am stuck using perf with my application.

As it stands, my read_values() function below uses read() to read the results. I've been looking into ways to speed up the retrieval of these perf event values and came across this PMU HW Counter access document.

My question then is two fold:

  1. Is using mmap() instead of read() to retrieve the values from the fd a performance improvement? If so, how would I accomplish this?
  2. Is there a way to use an mrs instruction to retrieve the PMU registers directly? The PMU HW Counter access link above states this should be possible, though I am having trouble finding examples that explain how to do it.
struct read_format {
  uint64_t nr;          /* The number of events */
  struct {
    uint64_t value;     /* The value of the event */
    uint64_t id;        /* if PERF_FORMAT_ID */
  } values[];           /* flexible array member: nr entries follow */
};

int main() {
  struct perf_event_attr attr1;
  memset(&attr1, 0, sizeof(attr1));   /* unset fields must be zero */
  attr1.size = sizeof(attr1);
  attr1.type = PERF_TYPE_HARDWARE;
  attr1.config = PERF_COUNT_HW_CPU_CYCLES;
  attr1.disabled = 1;                 /* start disabled; enable the group below */
  attr1.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
  int main_fd = syscall(__NR_perf_event_open, &attr1, 0, -1, -1, 0);
  if (main_fd == -1) { return 1; }
  uint64_t id1;
  ioctl(main_fd, PERF_EVENT_IOC_ID, &id1);

  struct perf_event_attr attr2;
  memset(&attr2, 0, sizeof(attr2));
  attr2.size = sizeof(attr2);
  attr2.type = PERF_TYPE_HARDWARE;
  attr2.config = PERF_COUNT_HW_CACHE_REFERENCES;
  attr2.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
  int fd2 = syscall(__NR_perf_event_open, &attr2, 0, -1, main_fd, 0);
  if (fd2 == -1) { return 1; }
  uint64_t id2;
  ioctl(fd2, PERF_EVENT_IOC_ID, &id2);

  ioctl(main_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
  ioctl(main_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

  // read_values and log "START"

  // action

  // read_values and log "END"

  return 0;
}

int read_values(int main_fd) {
  char buffer[4096];
  ssize_t read_bytes = read(main_fd, buffer, sizeof(buffer));
  if (read_bytes == -1) { return 1; }

  struct read_format* rf = (struct read_format*) buffer;
  uint64_t values[rf->nr];
  for (uint64_t i = 0; i < rf->nr; i++) {
    values[i] = rf->values[i].value;   /* counter values are 64-bit */
  }
  return 0;
}

Changing the read() call in read_values() to mmap(NULL, sizeof(struct read_format), PROT_READ, MAP_SHARED, main_fd, 0) didn't work. Reading the buffer back after the mmap() call, it is not populated: the number of events (nr) comes back as 0.

int read_values(int main_fd) {
  void* buffer = mmap(NULL, sizeof(struct read_format),
                      PROT_READ, MAP_SHARED, main_fd, 0);
  if (buffer == MAP_FAILED) { return 1; }

  struct read_format* rf = (struct read_format*) buffer;
  if (rf->nr == 0) { return 1; }   /* this is what happens: nr reads as 0 */
  uint64_t values[rf->nr];
  for (uint64_t i = 0; i < rf->nr; i++) {
    values[i] = rf->values[i].value;
  }
  return 0;
}

Answer from Luis Colorado:

mmap() is not related to read() in any way. mmap() allows you to map a file into memory, but for an ordinary file the underlying software uses exactly the same mechanism to read the file from disk that read() uses (it moves disk blocks into in-memory buffers so that several processes can access the same file on disk).

The one difference that could make mmap() a better approach than read() is that it maps the kernel buffer (the file contents) directly into the process's virtual address space. But that does not automatically make it far faster. Accesses to the mapped memory still have to be mediated by the kernel (on each access) so that file buffers can be marked dirty when you write into the mapped region, and so that you only write into the data area and never into the housekeeping data. And when another process holds the inode locked through a read() or write() on the same file, your process must block too, to avoid accessing the data while the other process has the inode locked.

This means some mapping work has to be done for the memory you write, so that it actually lands in the kernel buffer backing the file, and this is done with a mechanism similar to the copy_from_user()/copy_to_user() copying that read() performs. Things are optimized to make this fast, but the average gain is nowhere near the difference between a memory access and a disk access.

Also, the file must be mmap()-able, which is not the general case: NFS files cannot be mapped (or not easily), sockets and pipes cannot be mapped, and devices cannot be mapped (well, it depends; some collaboration from the device driver is required). In general, a file that cannot be lseek()'d cannot be mapped.