How to troubleshoot the performance of a multi-threaded algorithm using a lock-free hashset?


So, I am currently working on graph sampling and developed an algorithm that runs on multiple CPU cores. It uses a lock-free hash set and a locking job queue. Basically, a thread takes a job from the queue and, while processing it, may add new jobs to the queue; this continues until all jobs are finished. The hash set is used to produce unique samples across all threads, and the main activity of a thread is inserting a sample into the hash set. That's why I implemented the hash set to be lock-free. This is the approach I followed.
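
To make the description concrete, here is a minimal sketch of the worker loop (Java purely for illustration, this is not my actual code; `sampleSet`, `jobs`, `pending` and `candidates` are made-up names, and `ConcurrentHashMap.newKeySet()` stands in for my custom lock-free hash set):

    import java.util.Set;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    class SamplerSketch {
        // Stand-in for the lock-free hash set: unique samples across all threads.
        static final Set<Long> sampleSet = ConcurrentHashMap.newKeySet();
        // The locking job queue (LinkedBlockingQueue locks on put/poll).
        static final BlockingQueue<Long> jobs = new LinkedBlockingQueue<>();
        // Number of outstanding jobs, so workers know when everything is done.
        static final AtomicInteger pending = new AtomicInteger();

        static void worker() throws InterruptedException {
            while (pending.get() > 0) {
                Long node = jobs.poll(1, TimeUnit.MILLISECONDS);
                if (node == null) continue;            // queue momentarily empty
                for (long next : candidates(node)) {
                    // The hot operation: add() returns true only for the thread
                    // that inserts the sample first, so samples stay unique.
                    if (sampleSet.add(next)) {
                        pending.incrementAndGet();     // new job discovered
                        jobs.put(next);
                    }
                }
                pending.decrementAndGet();             // current job finished
            }
        }

        static long[] candidates(long node) {
            return new long[0];                        // placeholder: graph access
        }
    }

The important property is that `sampleSet.add()` is by far the most frequent shared operation, which is why I made the set lock-free rather than the queue.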

I ran it on a machine with 16 cores and on a machine with 24 cores; however, the single-threaded version of the algorithm is always an order of magnitude faster than the multi-threaded one.

I don't really understand why, and I have run out of ideas to try.

I first checked with perf whether cache misses are the issue. Here are the outputs.

Single-threaded:

       146.062.677      cpu_core/cache-references/                                     (99,82%)
        94.803.081      cpu_atom/cache-references/                                     (0,20%)
        67.115.478      cpu_core/cache-misses/                                        (99,82%)
        49.006.160      cpu_atom/cache-misses/                                        (0,20%)
     8.206.501.423      cpu_core/cycles/                                              (99,82%)
     5.437.332.727      cpu_atom/cycles/                                              (0,20%)
    12.932.996.905      cpu_core/instructions/                                        (99,82%)
     6.916.605.356      cpu_atom/instructions/                                        (0,20%)
     2.751.128.600      cpu_core/branches/                                            (99,82%)
     1.550.092.039      cpu_atom/branches/                                            (0,20%)
            89.326      faults                                                      
                68      migrations                                                  

       1,688575537 seconds time elapsed

       1,577569000 seconds user
       0,112396000 seconds sys

Multi-threaded:

       264.637.972      cpu_core/cache-references/                                     (81,19%)
        57.264.669      cpu_atom/cache-references/                                     (33,65%)
       116.415.033      cpu_core/cache-misses/                                        (81,19%)
         1.638.935      cpu_atom/cache-misses/                                        (33,65%)
    26.938.532.172      cpu_core/cycles/                                              (81,19%)
    14.019.200.396      cpu_atom/cycles/                                              (33,65%)
    23.240.603.882      cpu_core/instructions/                                        (81,19%)
     2.915.516.452      cpu_atom/instructions/                                        (33,65%)
     4.629.780.022      cpu_core/branches/                                            (81,19%)
       563.800.488      cpu_atom/branches/                                            (33,65%)
           229.810      faults                                                      
               474      migrations                                                  

       2,308616324 seconds time elapsed

       2,295040000 seconds user
       4,175755000 seconds sys

So, I am not sure if I am interpreting these stats correctly: cache references count how often something is fetched from the cache, and cache misses count how often those fetches have to go to the next level of the memory hierarchy. (I think this refers to the L3 cache, right?) The multi-threaded version definitely has more misses, but it also has more references, and the ratio is pretty much the same as for the single-threaded version (roughly 46% vs. 44% on cpu_core). Or does the ratio not matter at all?
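
One thing I started to wonder is whether the miss ratio captures contention at all: every atomic insert into a shared structure bounces cache lines between cores, and those cycles are spent whether or not perf counts a miss. A minimal experiment like the following (Java again, purely illustrative, not my actual code) would isolate that effect by comparing threads hammering one shared counter against thread-private counters:

    import java.util.concurrent.atomic.AtomicLong;

    public class ContentionDemo {
        static final int THREADS = 16;
        static final long OPS_PER_THREAD = 10_000_000L;

        public static void main(String[] args) throws InterruptedException {
            System.out.println("shared:  " + run(true) + " ms");
            System.out.println("private: " + run(false) + " ms");
        }

        // Each thread performs OPS_PER_THREAD atomic increments, either all on
        // one shared AtomicLong (contended) or each on its own (uncontended).
        static long run(boolean shared) throws InterruptedException {
            AtomicLong sharedCounter = new AtomicLong();
            Thread[] threads = new Thread[THREADS];
            for (int i = 0; i < THREADS; i++) {
                AtomicLong counter = shared ? sharedCounter : new AtomicLong();
                threads[i] = new Thread(() -> {
                    for (long n = 0; n < OPS_PER_THREAD; n++) {
                        counter.incrementAndGet();  // CAS on a shared or private line
                    }
                });
            }
            long start = System.nanoTime();
            for (Thread t : threads) t.start();
            for (Thread t : threads) t.join();
            return (System.nanoTime() - start) / 1_000_000;
        }
    }

If contention is the culprit, the shared variant should be far slower despite executing the same instructions, which would match my perf output, where cycles roughly triple while the miss ratio barely moves. From what I've read, `perf c2c` can also pinpoint contended cache lines directly, which might be a way to confirm this.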

I also tried using a lock-free job queue, but the timing results were pretty much the same (admittedly, there is a bug in it, but it only triggers when the GC cleans up before the program finishes, so it shouldn't affect the measurements). This made sense to me: compared to how frequently the threads insert into the hash set, they rarely touch the queue.

I did compare it with a Python library that also does graph sampling, and funnily enough, it is about as slow as my multi-threaded version (at least in order of magnitude). So I wonder whether this is a scaling issue where you need a lot more cores before the parallelism outweighs the extra overhead of thread-safe operations.

And yeah, I am currently stuck. I don't necessarily expect to get it working, but I would like to understand why there is no speedup.
