I've been reading about the von Neumann bottleneck, and as far as I understand, the problem is that the CPU cannot fetch an instruction and transfer data at the same time, since both operations require the same memory bus. So the problem is mainly the limited bus transfer rate. I've read about how to mitigate this problem, and one suggestion was parallel processing: the work doesn't depend on one core only, so when a core is stalled on a fetch, the other cores keep working independently, which cuts the computation time drastically.
Is this a correct understanding? If so, don't all of these cores share the same bus to memory, which is what caused the bottleneck in the first place?
The von Neumann bottleneck comes from code and data sharing the same memory bus. If you ignore the complex features of today's processors and imagine a simple 8-bit von Neumann processor with some RAM and some flash, the processor is constantly forced to wait for RAM operations to complete before it can load more instructions from flash.

Today the mitigation comes mostly from the processor's L1 and L2 caches and from the branch prediction logic embedded in the processor: instructions can be preloaded into the cache, leaving the memory bus free for data.

Parallelization can help in specific workloads, but the reality is that today's computing paradigm is not much affected by this bottleneck. Processors are very powerful, memories and buses are very fast, and if you need more throughput you can just add more cache to the processor (as Intel does with Xeons, and AMD with Opterons). Parallelization is also more a way of dodging the issue: a parallel workload is still subject to the same rules the processor architecture imposes. If anything, multi-threading should make the problem worse, because multiple workloads now compete for the same memory bus. Again, the solution was simply to add more cache between the memory bus and the processor cores.
As memories get faster, and processors not so much anymore, we might yet see this problem become an issue again. But then, a little bird told me biocomputers are the future of general-purpose computing, so hopefully the next major architecture will take past mistakes into account.