I am a little confused about bank conflicts, avoiding them with memory padding, and coalesced memory access. What I've read so far: coalesced access to global memory is optimal. If it isn't achievable, shared memory can be used to reorder the data needed by the current block, thus making coalesced access possible. However, when using shared memory one has to look out for bank conflicts. One strategy to avoid them is to pad the arrays stored in shared memory by 1. Consider the example from this blog post, where each row of a 16x16 matrix is padded by one element, making it a 16x17 matrix in shared memory.
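To make the padding idea concrete, here is a small host-side sketch (plain Python, not GPU code) of which bank each thread hits when the threads walk down one column of the tile. It assumes 4-byte elements and a bank index of `word_index % num_banks`. The 16-bank setting is chosen to match the 16x16 example; newer GPUs have 32 banks, which is why the same trick is usually shown there with a 32x33 tile. The function name and parameters are made up for illustration.

```python
from collections import Counter

def max_conflict_degree(row_stride, num_threads=16, num_banks=16, column=0):
    """Largest number of threads hitting one bank when thread t reads tile[t][column].

    Assumes 4-byte elements, so the bank is simply the element index modulo num_banks.
    """
    banks = [(t * row_stride + column) % num_banks for t in range(num_threads)]
    return max(Counter(banks).values())

# 16x16 tile: every element of a column lands in the same bank.
unpadded = max_conflict_degree(row_stride=16)
# 16x17 tile: the extra element per row shifts each row by one bank.
padded = max_conflict_degree(row_stride=17)

# Same experiment under the assumption of 32 banks and a full warp.
unpadded_32 = max_conflict_degree(row_stride=32, num_threads=32, num_banks=32)
padded_32 = max_conflict_degree(row_stride=33, num_threads=32, num_banks=32)

print(unpadded, padded, unpadded_32, padded_32)
```

A result of 1 means every thread gets its own bank; a result equal to the thread count means the accesses are fully serialized under this model.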
Now I understand that memory padding might avoid bank conflicts, but doesn't that also mean the memory is no longer aligned? E.g. if I shift global memory by 1, thus misaligning it, one warp would need to access two memory lanes instead of one, because the last number is not in the same lane as all the others. So, as I understand it, coalesced memory access and memory padding are contradictory concepts, aren't they? Some clarification would be very much appreciated!
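The misalignment concern for global memory can be put into numbers with a rough sketch. It assumes 4-byte elements and 128-byte memory segments (what is called a "lane" above); the real segment size depends on the GPU generation and cache configuration. The function name is hypothetical. Note that this models global memory only: the 16x17 padding in the example is applied to the shared-memory array, so it does not move any global address.

```python
def segments_touched(element_indices, bytes_per_element=4, segment_bytes=128):
    """Number of distinct memory segments covered by one warp's accesses.

    Assumption: a segment is a naturally aligned block of segment_bytes bytes.
    """
    return len({(i * bytes_per_element) // segment_bytes for i in element_indices})

WARP_SIZE = 32

# Thread t reads element t: the warp covers exactly one aligned segment.
aligned = segments_touched(range(WARP_SIZE))
# Thread t reads element t + 1: the last element spills into the next segment.
shifted_by_one = segments_touched(range(1, WARP_SIZE + 1))

print(aligned, shifted_by_one)
```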
Too long for a comment so I'm putting it here. Still not a complete answer though.
In the meantime I found this post by Mark Harris, which demonstrates the use of shared memory to facilitate coalesced memory access. The important takeaway for this question seems to be:
My initial understanding was that if coalesced access to global memory is not possible, the data is read uncoalesced and then reordered in shared memory to allow subsequent coalesced accesses from shared memory. Instead, data is read contiguously from global memory, and the data actually needed can then be read from shared memory in a non-coalesced way. Harris also states that uncoalesced access to shared memory is not a problem, but unfortunately the post doesn't explain why.
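The access pattern described above can be sketched with plain index arithmetic for a tiled matrix transpose. This is my own host-side illustration, not code from Harris's post. It assumes a 32x32 tile with one padding column, 32 shared-memory banks, and a hypothetical 1024x1024 input; all names are made up.

```python
TILE = 32      # assumed tile edge, equal to the warp size
WIDTH = 1024   # hypothetical input row length in elements
HEIGHT = 1024  # hypothetical number of input rows
PAD = 1        # extra column in the shared tile

def warp_indices(bx, by, ty):
    """Element indices touched by one warp (tx = 0..31) of block (bx, by)."""
    # Global read: in[(by*TILE + ty) * WIDTH + bx*TILE + tx], stored as tile[ty][tx].
    global_read = [(by * TILE + ty) * WIDTH + bx * TILE + tx for tx in range(TILE)]
    # Global write: out[(bx*TILE + ty) * HEIGHT + by*TILE + tx], taken from tile[tx][ty].
    global_write = [(bx * TILE + ty) * HEIGHT + by * TILE + tx for tx in range(TILE)]
    # Shared read tile[tx][ty]: a strided walk down one column of the padded tile.
    shared_read = [tx * (TILE + PAD) + ty for tx in range(TILE)]
    return global_read, global_write, shared_read

def is_contiguous(indices):
    """True if each index is exactly one past the previous one."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))

g_read, g_write, s_read = warp_indices(bx=2, by=3, ty=5)
distinct_banks = len({i % 32 for i in s_read})

print(is_contiguous(g_read), is_contiguous(g_write), is_contiguous(s_read))
print(distinct_banks)
```

In this sketch both global accesses are contiguous within the warp, while the shared-memory read is strided. What matters for the strided read under this model is only that the 32 threads land in 32 different banks, which the padding column provides.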