Common hardware metrics such as IPC and cache miss rate can be collected at per-process granularity.
Why, then, can memory bandwidth usage only be collected system-wide, for the whole machine? This is the case in both Intel's PCM tool and AMD's uProf tool.
Is this due to a limitation of the hardware PMU?