What is the optimal number of processes per core? Say you're given a machine with 2 CPUs and 4 cores each, what is the number of processes that will give you the best performance?
Optimal number of processes?
3.5k views, asked by user3593171
The answer is, naturally: it depends. Obviously, if you're interested in the performance of a single-threaded application, other processes just clutter your machine and compete for the shared resources. So let's look at two cases where this question is interesting:

1. Running multiple independent single-threaded processes.
2. Running a single multi-threaded application.
The second case is easier to answer: it (.. wait for it ..) depends on what you're running! If you use locks, more threads may lead to higher contention and more conflicts. If you're lock-free (or even some flavor of wait-free), you may still have fairness issues. It also depends on how the work is balanced internally in your application, and on how your task scheduler works. There are simply too many possible designs out there today.
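The contention point can be seen even in a toy sketch. Here four threads increment a shared counter under a single lock: the result stays correct, but every increment serializes on that lock, so adding threads past a certain point buys you contention rather than speed. This is an illustrative sketch only; the names and counts are made up for the example.

```python
import threading

counter = 0
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:          # every thread serializes here
            counter += 1

threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Correct result, but all 40,000 increments ran one at a time
# through the lock, regardless of how many cores were available.
print(counter)
```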
If we assume you have perfect balancing between your threads and no overhead for an increased thread count, you can align this with the other use case, where you simply run multiple independent processes. In that case, performance may have several sweet spots. The first is when you reach the number of physical cores (in your case 8, assuming you have 4 physical cores per socket). At that point you're saturating your existing hardware to the max. However, if some SMT mechanism (like Hyper-Threading) is supported, you can double the overall number of cores, with 2 logical cores per physical one. This doesn't add any resources into the story, it just splits the existing ones, which may impose some penalty on the execution of each process, but on the other hand lets you run 2x the processes simultaneously.
The overall aggregated speedup may vary, but I've seen numbers of up to 30% on average on generic benchmarks. As a rule of thumb, processes that are memory-latency bound or have complicated control flow can benefit from this, since the core can still make progress while one thread is blocked. Code that leans more on execution bandwidth (like heavy floating-point calculations) or on memory bandwidth won't gain as much.
Beyond that number of processes, it may still be beneficial in some cases to add more. They won't run in parallel, but if the overhead of context switches isn't too high, and you want to minimize the average wait time (which is also a way to look at performance, one that isn't pure IPC), or you depend on getting output out as early as possible, there are scenarios where this is useful.
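The wait-time angle is easy to demonstrate: when tasks spend most of their time blocked rather than computing, running more of them than you have cores slashes total and average wait time, because a blocked task costs almost nothing. A toy sketch using threads and `time.sleep` as a stand-in for blocked processes (the durations and counts are arbitrary, chosen only to make the effect visible):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_task(_):
    time.sleep(0.05)  # stands in for I/O or an otherwise blocked process

def run(workers, tasks=8):
    """Run `tasks` blocking tasks on `workers` workers, return wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(blocking_task, range(tasks)))
    return time.perf_counter() - start

serial = run(1)        # tasks queue up behind each other: ~8 x 0.05s
oversub = run(8)       # all tasks overlap while blocked: ~0.05s
print(f"1 worker: {serial:.3f}s, 8 workers: {oversub:.3f}s")
```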
One last point: the "optimal" number of processes may even be lower than the number of cores if your processes saturate some other resource before reaching that point. If, for example, each thread requires a huge chunk of virtual memory, you may start thrashing and paging (a painful penalty). If each thread has a large data set that it uses over and over, you can fill up your shared cache and start losing from that point on by adding more threads. The same goes for heavy I/O, and so on.
As you can see, there's no right or wrong answer here; you simply need to benchmark your code on the systems you care about.
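Such a benchmark can be as simple as sweeping the worker count and timing a fixed batch of work. A minimal harness, assuming a CPU-bound workload (processes rather than threads are used to sidestep CPython's GIL; the toy workload, task counts, and sweep values are placeholders for your real job):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def busy(n):
    # Toy CPU-bound workload: sum of squares below n.
    return sum(i * i for i in range(n))

def benchmark(workers, tasks=8, n=200_000):
    """Time `tasks` copies of the workload on a pool of `workers`."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(busy, [n] * tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    # Sweep candidate process counts and look for the sweet spot.
    for w in (1, 2, 4, 8):
        print(f"{w} workers: {benchmark(w):.3f}s")
```

On a machine like the one in the question you'd expect the curve to flatten around 8 workers (physical cores), possibly improve a little more up to 16 with SMT, and then degrade.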