Intel uses two 256-bit AVX units per core. TR uses four 128-bit units; for 256-bit AVX it can pair two of the 128-bit units together with no performance penalty. The problem with TR's AVX is that one pair of 128-bit units can only handle half of the instructions (e.g. adds only) while the other pair handles the other half (e.g. multiplies only). On Intel, each unit can execute all instructions. This is why TR's AVX is so slow in comparison: code that leans mostly on one type of instruction only ever keeps one pair of units busy. Same issue with Ryzen. Fingers crossed TR2 (and Ryzen 2) fixes this. No reason why it shouldn't.
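
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is a toy model of the unit layout as described above, not a benchmark: the function names and the add/multiply mixes are made up for illustration, and it ignores clock speed, fused multiply-add, memory bandwidth, and everything else that matters in real code.

```python
# Toy model only: counts how many vector bits per cycle the AVX units
# could process in the best case, given the layout described in the post.

def universal_peak(n_units, width_bits):
    """Every unit can run every instruction, so all units stay busy."""
    return n_units * width_bits

def split_peak(width_bits, units_per_op, mix):
    """Each unit handles only one kind of instruction.

    units_per_op: e.g. {"add": 2, "mul": 2} (assumed layout)
    mix: fraction of the instruction stream per op, summing to 1.0
    The op that saturates its own units first limits the whole stream.
    """
    return min(
        units_per_op[op] * width_bits / frac
        for op, frac in mix.items()
        if frac > 0
    )

tr_units = {"add": 2, "mul": 2}  # four 128-bit units, split in two pairs

intel = universal_peak(2, 256)
tr_add_only = split_peak(128, tr_units, {"add": 1.0})
tr_balanced = split_peak(128, tr_units, {"add": 0.5, "mul": 0.5})

print(intel)        # 512
print(tr_add_only)  # 256.0
print(tr_balanced)  # 512.0
```

Under these assumptions, an evenly mixed add/multiply stream reaches the same per-cycle peak on both designs, while an add-only (or multiply-only) stream gets half of it on TR because the other pair of units sits idle.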