A routine in BLAS Level 2 for banded matrix vector product exists, both for general and symmetric cases (links for MKL implementation).
Is there any way to use multiple vectors (without using an outside for-loop), to maximize performance in such cases?
I think that the Spike library is supposed to have such a routine for the symmetric case. I'm afraid I cannot be of any more help, though, as I have never used it.
The algorithm and implementation of Spike (for system solving) is outlined in [Polizzi & Sameh, Comp. Fluids (36), 2007].