Does the Harvard architecture have the von Neumann bottleneck?

1.5k Views Asked by At

From the naming and this article I feel the answer is no, but I don't understand why. The bottleneck is how fast you can fetch data from memory. Whether you can fetch instruction at the same time doesn't seem to matter. Don't you still have to wait until the data arrive? Suppose fetching data takes 100 cpu cycles and executing instruction takes 1, the ability of doing that 1 cycle in advance doesn't seem to be a huge improvement. What am I missing here?

Context: I came across this article saying the Spectre bug is not going to be fixed because of speculative execution. I think speculative execution, for example branch prediction, makes sense for Harvard architecture too. Am I right? I understand speculative execution is more beneficial for von Neumann architecture, but by how much? Can someone give a rough number? On what extent can we say the Spectre will stay because of von Neumann architecture?

2

There are 2 best solutions below

0
On

The Harvard Architecture, separated instruction and data memories, is a mitigation of the von Neumann bottleneck. Backus' original definition of the bottleneck addresses a slightly more general problem than just instruction or data fetch and talks about the CPU/memory interface. In the paragraph before the money quote Backus talks about looking at the actual traffic on this bus,

Ironically, a large part of the traffic is not useful data but merely names of data that most of it consists of names as well as operations and data used only to compute such names.

In a Harvard architecture with a separated I/D bus, that will not change. It will still largely consist of names.

So the answer is a hard no. The Harvard architecture mitigates the von Neumann bottleneck but it doesn't solve it. Bluntly, it's a faster von Neumann bottleneck.

4
On

The term "von Neumann bottleneck" isn't just talking about Harvard vs. von Neumann architectures. It's talking about the entire idea of stored-program computers, which John von Neumann invented.

(Depending on context, some people may use it to mean the competition between code-fetch and data access; that does exacerbate the overall memory bottleneck without split caches. Or perhaps I'm mixing up terminology and the more general memory bottleneck for processors I discuss in the rest of this answer shouldn't be called the von Neumann bottleneck, although it is a real thing. See the memory wall section in Modern Microprocessors A 90-Minute Guide! )


The von Neumann bottleneck applies equally to both kinds of stored-program computers. And even to fixed-function (not stored-program) processors that keep data in RAM. (Old GPUs without programmable shaders are basically fixed-function but can still have memory bottlenecks accessing data).

Usually it's most relevant when looping over big arrays or pointer-based data structures like linked lists, so the code fits in an instruction cache and doesn't have to be fetched during data access anyway. (Computers too old to even have caches were just plain slow, and I'm not interested in arguing semantics of whether slowness even when there is temporal and/or spatial locality is a von Neumann bottleneck for them or not.)

https://whatis.techtarget.com/definition/von-Neumann-bottleneck points out that caching and prefetching is part of how we work around the von Neumann bottleneck, and that faster / wider busses make the bottleneck wider. But only stuff like Processor-in-Memory / https://en.wikipedia.org/wiki/Computational_RAM truly solves it, where an ALU is attached to memory cells directly, so there is no central bottleneck between computation and storage, and computational capacity scales with storage size. But von Neumann with a CPU and separate RAM works well enough for most things that it's not going away any time soon (given large caches and smart hardware prefetching, and out-of-order execution and/or SMT to hide memory latency.)


John von Neumann was a pioneer in early computing, and it's not surprising his name is attached to two different concepts.

Harvard vs. von Neumann is about whether program memory is in a separate address space (and a separate bus); that's an implementation detail for stored-program computers.


Spectre: yes, Spectre is just about data access and branch prediction, not accessing code as data or vice versa. If you can get a Spectre attack into program memory in a Harvard architecture in the first place (e.g. by running a normal program that makes system calls), then it can run the same as on a von Neumann.

I understand speculative execution is more beneficial for von Neumann architecture, but by how much?

What? No. There's no connection here at all. Of course, all high-performance modern CPUs are von Neumann. (With split L1i / L1d caches, but program and data memory are not separate, sharing the same address space and physical storage. Split L1 caches is often called "modified Harvard", which makes some sense on ISAs other than x86 where L1i isn't coherent with data caches so you need special flushing instructions before you can execute newly-stored bytes as code. x86 has coherent instruction caches, so it's very much an implementation detail.)

Some embedded CPUs are true Harvard, with program memory connected to Flash and data address space mapped to RAM. But often those CPUs are pretty low performance. Pipelined but in-order, and only using branch prediction for instruction prefetch.

But if you did build a very high performance CPU with fully separate program and data memories (so copying from one to the other would have to go through the CPU), there'd be basically zero different from modern high-performance CPUs. L1i cache misses are rare, and whether they compete with data access is not very significant.

I guess you'd have split caches all the way down, though; normally modern CPUs have unified L2 and L3 caches, so depending on the workload (big code size or not) more or less of L2 and L3 can end up holding code. Maybe you'd still use unified caches with one extra bit in the tag to distinguish code addresses from data addresses, allowing your big outer caches to be competitively shared between the two address-spaces.