A methodology for speeding up loop kernels by exploiting the software information and the memory architecture. (April 2015)