I am currently benchmarking a project written in C++ to determine the hot spots and the threading efficiency, using Intel VTune. When running the program normally it runs for ~15 minutes. Using the hotspot analysis in VTune I can see that the function __kmp_fork_barrier is taking up roughly 40% of the total CPU time.
I therefore also wanted to look at the threading efficiency, but when I start the Threading analysis in VTune in hardware event-based sampling mode, the program never really starts and instead hangs at __kmp_acquire_ticket_lock. In user-mode sampling mode, the program immediately fails with a segfault (which does not occur when running it without VTune, nor when checking it with valgrind). When I use HPC Performance Characterization instead, VTune itself crashes.
Are those issues with VTune, or with my program? And how can I find the issues with the latter?
Threading analysis in VTune hangs at __kmp_acquire_ticket_lock
Asked by arc_lupus
__kmp_xxx calls are functions of the Intel/Clang OpenMP runtime. __kmp_fork_barrier is called when an OpenMP barrier is reached, and the time attributed to it is time threads spend waiting there. If you spend 40% of your time in this function, you most likely have a load-balancing issue among the OpenMP threads in your program. You need to fix this work imbalance to get better performance. You can use the (experimental) OMPT support of the runtime to track what the threads are doing and when.

VTune should have at least minimal support for profiling OpenMP programs. A VTune crash is likely a bug and should be reported on the Intel forum so that the VTune developers can fix it. On your side, you can check that your program always passes all OpenMP barriers in a deterministic way. For more information, you can look at the Intel VTune OpenMP tutorial.

Note that the VTune results also suggest that your OpenMP runtime is configured so that waiting threads actively poll the state of the other threads. This is good for reducing latency, but not always for performance or energy savings. You can control this behaviour of the runtime with the environment variable OMP_WAIT_POLICY.