Solar Designer
2015-06-24 04:10:07 UTC
Hi,
This is to share some of my thoughts and results from a while ago.
As a side-effect of Katja's project, we realized that with our current
bcrypt code we're still way below the peak L1 data cache reads rate on
x86 CPUs:
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-51.html
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-52.html
As you can see, if L1 data cache reads were the bottleneck, i7-4770K
could theoretically do up to ~11.1k c/s, but it only achieves ~6.6k c/s
with our current code. That's about 60%. Not bad, but there might be
room for improvement.
The 11.1k figure assumes the 3.7 GHz max turbo clock rate with all cores
in use (I think it's realistic), but P reads from cache (not registers)
and moreover no combining of P reads. Since P is being read
sequentially, it may be possible to combine its 32-bit reads into fewer
larger reads (whether for one bcrypt instance or across instances), and
then we'd need fewer than 5 reads from cache per Blowfish round on
average (but rather 4.5 with 64-bit reads, or even closer to 4 with
wider than 64-bit reads). With this, the 11.1k figure might even be an
under-estimate. At 4.5 reads/round, it'd be 12.3k.
Now, the bottleneck does indeed appear to be different. It appears to
be taking too many instructions and uops to extract all of the S-box
indices and compute the effective addresses to keep the L1 data cache
read ports 100% busy at all times. (We only keep them about 60% busy.)
Can we optimize the instruction stream such that fewer uops would be
needed per S-box lookup on average? My gut feeling, especially after
the experiments that I'll describe below, is that we can't do much on
current CPUs. But nevertheless we can probably do a little bit better.
We've previously tried AVX2 with no luck (on Haswell; future CPUs might
or might not do better):
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-42.html
Alain has recently mentioned similar poor results for his attempts to
use AVX2, and he also got similar speeds with SSE4.1 (which means doing
the gather loads manually). It is not surprising that there's little
difference between Haswell's microcoded AVX2 gather loads and manually
performed ones at x86 instruction stream level. By "similar poor
results" I mean speeds of around 4k c/s on quad-core Haswell, whereas
our interleaving of scalar instructions, even when implemented in C,
achieves 6k+ c/s (such as the ~6.6k figure I mentioned above).
Alain also mentioned trying BMI (the extra scalar instructions that were
introduced at the same time with AVX2), with slight speed improvement.
I had previously tried BMI (in Nov 2013, per file timestamps) at C
source code level, and IIRC this didn't help (but didn't hurt much
either, leaving the chance that a pure asm implementation would be good).
The relevant BMI instructions are BEXTR, PEXT, and SHRX. BEXTR and PEXT
are available via _bextr_u32() and _pext_u64() intrinsics, respectively.
For SHRX, I tried inline asm, which hurts gcc's instruction scheduling.
I've now tried BEXTR in pure asm code as well, see below. I haven't yet
re-tried SHRX in pure asm; this is worth trying.
Anyway, my recent idea was to try MMX2, in particular making use of the
PEXTRW instruction introduced with Pentium 3 (along with the first SSE).
I must admit this didn't appear to allow us to use Haswell's ports any
more optimally, but I decided to give it a try anyway.
http://www.realworldtech.com/haswell-cpu/4/
Even with use of MMX2, there are still plenty of scalar operations for
extracting the S-box indices, and if that's not enough to use Haswell's
port 6 fully, we can mix in 1 or 2 scalar implementations as well.
Here are the speeds I got on Pentium 3, 1.0 GHz:
Original Pentium optimized asm code - 144 c/s
Pentium Pro/2/3 optimized asm code - 171 c/s
Pure C, 2x interleaving, gcc 4.6.3 - 211 c/s
New 2x2 (4 instances) MMX2 asm code - 244 c/s
(The "Pure C, 2x interleaving" version is what's actually in JtR for a
while now, when building for 32-bit x86 with not-too-old gcc.)
Wow. That's a 15%+ improvement over the previous best result. I wish
speeds on this CPU were still practically relevant.
Moving to i7-4770K, the same 32-bit binary still runs a bit faster than
a 32-bit build of JtR as publicly released, compiled with "gcc version
4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)". One instance:
Old: 1049 c/s
New: 1347 c/s
8 concurrent instances (separate processes in both cases):
Old: 760*8 = 6080 c/s
New: 772*8 = 6176 c/s
This is only a 1.6% improvement, but perhaps for CPUs without HT the
improvement is much greater (as seen from the single instance speeds).
On 64-bit builds, though, I only got this to run at cumulative speeds
like 780*8 = 6240 c/s, which is worse than 6595 c/s previously seen with
OpenMP (and even worse than the slightly better speeds that can be seen
with separate independent processes). Replacing one of the MMX2
instances with one or two scalar instances (thus, 3 or 4 instances
total per thread) somehow didn't help. (Maybe I just didn't try hard
enough.)
So we appear to have an improvement for crippled systems - ancient CPUs
like Pentium 3 or, more importantly, 32-bit builds running on lower-cost
HT-less versions of recent CPUs. Unfortunately, since the improvement
is not universal and since data layout changes are needed (64-bit P
reads that I mentioned above), I can't just commit this code as-is.
The 2x2 MMX2 code looks like:
#define P(N) BF_current+16*N
#define Sa(N) P(18)+0x400*N
#define Sb(N) P(18)+0x1000+0x400*N
#define Sc(N) P(18)+0x2000+0x400*N
#define Sd(N) P(18)+0x3000+0x400*N
#define BF_ROUND(Lab, Lcd, Rab, Rcd, N) \
movd Lab,tmp1; \
shldl $16,tmp1,tmp2; \
movzbl tmp1_hi,tmp5; \
pextrw $3,Lab,tmp4; \
movzbl tmp2_hi,tmp6; \
pextrw $2,Lab,tmp3; \
andl $0xFF,tmp1; \
movd Sa(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp2; \
movd Sa(1)(,tmp2,4),tmpm2; \
movzbl tmp4_hi,tmp6; \
movd Sa(2)(,tmp5,4),tmpm3; \
andl $0xFF,tmp4; \
movd Sa(3)(,tmp1,4),tmpm4; \
movd Lcd,tmp1; \
movzbl tmp3_hi,tmp5; \
punpckldq Sb(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp3; \
punpckldq Sb(1)(,tmp4,4),tmpm2; \
shldl $16,tmp1,tmp2; \
punpckldq Sb(2)(,tmp5,4),tmpm3; \
movzbl tmp1_hi,tmp5; \
paddd tmpm2,tmpm1; \
pextrw $3,Lcd,tmp4; \
punpckldq Sb(3)(,tmp3,4),tmpm4; \
movzbl tmp2_hi,tmp6; \
pxor tmpm3,tmpm1; \
pextrw $2,Lcd,tmp3; \
pxor P(N)+16,Rab; \
andl $0xFF,tmp1; \
paddd tmpm4,tmpm1; \
andl $0xFF,tmp2; \
pxor tmpm1,Rab; \
movd Sc(0)(,tmp6,4),tmpm1; \
movd Sc(1)(,tmp2,4),tmpm2; \
movzbl tmp4_hi,tmp6; \
movd Sc(2)(,tmp5,4),tmpm3; \
andl $0xFF,tmp4; \
movd Sc(3)(,tmp1,4),tmpm4; \
movzbl tmp3_hi,tmp5; \
punpckldq Sd(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp3; \
punpckldq Sd(1)(,tmp4,4),tmpm2; \
punpckldq Sd(2)(,tmp5,4),tmpm3; \
paddd tmpm2,tmpm1; \
punpckldq Sd(3)(,tmp3,4),tmpm4; \
pxor tmpm3,tmpm1; \
pxor P(N)+24,Rcd; \
paddd tmpm4,tmpm1; \
pxor tmpm1,Rcd
This is 50 instructions per round for 4 instances combined, so 12.5
instructions per round per instance. I think it's pretty good.
I've also tried BMI in several ways, including taking it to the extreme
in terms of minimizing x86 instruction count (likely not the optimal
choice since we actually need to optimize uops, and latency hiding).
That version is:
#define P(N) BF_current+8*N
#define Sa(N) P(18)+0x400*N
#define Sb(N) P(18)+0x1000+0x400*N
#define BF_ROUND(La, La_lo, La_hi, Lb, Lb_lo, Lb_hi, Ra, Rb, N) \
bextr %r14d,La,tmp1; \
bextr %r14d,Lb,tmp5; \
xorl P(N)+8,Ra; \
xorl P(N)+12,Rb; \
bextr %r15d,La,tmp2; \
bextr %r15d,Lb,tmp6; \
movl Sa(0)(,tmp1_64,4),tmp1; \
movl Sb(0)(,tmp5_64,4),tmp5; \
movzbl La_hi,tmp3; \
movzbl Lb_hi,tmp7; \
addl Sa(1)(,tmp2_64,4),tmp1; \
addl Sb(1)(,tmp6_64,4),tmp5; \
movzbl La_lo,tmp4; \
movzbl Lb_lo,tmp8; \
xorl Sa(2)(,tmp3_64,4),tmp1; \
xorl Sb(2)(,tmp7_64,4),tmp5; \
addl Sa(3)(,tmp4_64,4),tmp1; \
addl Sb(3)(,tmp8_64,4),tmp5; \
xorl tmp1,Ra; \
xorl tmp5,Rb
where the out-of-loop initialization code included:
movl $0x0818,%r14d
movl $0x0810,%r15d
This is 20 instructions per round for two interleaved instances, so 10
instructions per round per instance. Thus, BMI can beat MMX2 in x86
instruction count, but like I said that is likely suboptimal in other
aspects.
So many experiments and no luck yet on current non-crippled CPUs and
systems (64-bit with HT).
If I find more time for this, I'll probably proceed with high-level
optimizations first, just to improve things a little bit, achieving
something like 6.8k c/s on the i7-4770K with OpenMP:
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-50.html
by making the uses of Blowfish outside of the variable cost loop
use the interleaved implementation as well. And maybe 7k with
independent instances (no thread safety gives an extra register).
Alexander
This is to share some of my thoughts and results from a while ago.
As a side-effect of Katja's project, we realized that with our current
bcrypt code we're still way below the peak L1 data cache reads rate on
x86 CPUs:
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-51.html
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-52.html
As you can see, if L1 data cache reads were the bottleneck, i7-4770K
could theoretically do up to ~11.1k c/s, but it only achieves ~6.6k c/s
with our current code. That's about 60%. Not bad, but there might be
room for improvement.
The 11.1k figure assumes the 3.7 GHz max turbo clock rate with all cores
in use (I think it's realistic), but P reads from cache (not registers)
and moreover no combining of P reads. Since P is being read
sequentially, it may be possible to combine its 32-bit reads into fewer
larger reads (whether for one bcrypt instance or across instances), and
then we'd need fewer than 5 reads from cache per Blowfish round on
average (but rather 4.5 with 64-bit reads, or even closer to 4 with
wider than 64-bit reads). With this, the 11.1k figure might even be an
under-estimate. At 4.5 reads/round, it'd be 12.3k.
Now, the bottleneck does indeed appear to be different. It appears to
be taking too many instructions and uops to extract all of the S-box
indices and compute the effective addresses to keep the L1 data cache
read ports 100% busy at all times. (We only keep them about 60% busy.)
Can we optimize the instruction stream such that fewer uops would be
needed per S-box lookup on average? My gut feeling, especially after
the experiments that I'll describe below, is that we can't do much on
current CPUs. But nevertheless we can probably do a little bit better.
We've previously tried AVX2 with no luck (on Haswell; future CPUs might
or might not do better):
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-42.html
Alain has recently mentioned similar poor results for his attempts to
use AVX2, and he also got similar speeds with SSE4.1 (which means doing
the gather loads manually). It is not surprising that there's little
difference between Haswell's microcoded AVX2 gather loads and manually
performed ones at x86 instruction stream level. By "similar poor
results" I mean speeds of around 4k c/s on quad-core Haswell, whereas
our interleaving of scalar instructions, even when implemented in C,
achieves 6k+ c/s (such as the ~6.6k figure I mentioned above).
Alain also mentioned trying BMI (the extra scalar instructions that were
introduced at the same time with AVX2), with slight speed improvement.
I had previously tried BMI (in Nov 2013, per file timestamps) at C
source code level, and IIRC this didn't help (but didn't hurt much
either, leaving the chance that a pure asm implementation would be good).
The relevant BMI instructions are BEXTR, PEXT, and SHRX. BEXTR and PEXT
are available via _bextr_u32() and _pext_u64() intrinsics, respectively.
For SHRX, I tried inline asm, which hurts gcc's instruction scheduling.
I've now tried BEXTR in pure asm code as well, see below. I haven't yet
re-tried SHRX in pure asm; this is worth trying.
Anyway, my recent idea was to try MMX2, in particular making use of the
PEXTRW instruction introduced with Pentium 3 (along with the first SSE).
I must admit this didn't appear to allow us to use Haswell's ports any
more optimally, but I decided to give it a try anyway.
http://www.realworldtech.com/haswell-cpu/4/
Even with use of MMX2, there are still plenty of scalar operations for
extracting the S-box indices, and if that's not enough to use Haswell's
port 6 fully, we can mix in 1 or 2 scalar implementations as well.
Here are the speeds I got on Pentium 3, 1.0 GHz:
Original Pentium optimized asm code - 144 c/s
Pentium Pro/2/3 optimized asm code - 171 c/s
Pure C, 2x interleaving, gcc 4.6.3 - 211 c/s
New 2x2 (4 instances) MMX2 asm code - 244 c/s
(The "Pure C, 2x interleaving" version is what's actually in JtR for a
while now, when building for 32-bit x86 with not-too-old gcc.)
Wow. That's a 15%+ improvement over the previous best result. I wish
speeds on this CPU were still practically relevant.
Moving to i7-4770K, the same 32-bit binary still runs a bit faster than
a 32-bit build of JtR as publicly released, compiled with "gcc version
4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)". One instance:
Old: 1049 c/s
New: 1347 c/s
8 concurrent instances (separate processes in both cases):
Old: 760*8 = 6080 c/s
New: 772*8 = 6176 c/s
This is only a 1.6% improvement, but perhaps for CPUs without HT the
improvement is much greater (as seen from the single instance speeds).
On 64-bit builds, though, I only got this to run at cumulative speeds
like 780*8 = 6240 c/s, which is worse than 6595 c/s previously seen with
OpenMP (and even worse than the slightly better speeds that can be seen
with separate independent processes). Replacing one of the MMX2
instances with one or two scalar instances (thus, 3 or 4 instances
total per thread) somehow didn't help. (Maybe I just didn't try hard
enough.)
So we appear to have an improvement for crippled systems - ancient CPUs
like Pentium 3 or, more importantly, 32-bit builds running on lower-cost
HT-less versions of recent CPUs. Unfortunately, since the improvement
is not universal and since data layout changes are needed (64-bit P
reads that I mentioned above), I can't just commit this code as-is.
The 2x2 MMX2 code looks like:
#define P(N) BF_current+16*N
#define Sa(N) P(18)+0x400*N
#define Sb(N) P(18)+0x1000+0x400*N
#define Sc(N) P(18)+0x2000+0x400*N
#define Sd(N) P(18)+0x3000+0x400*N
#define BF_ROUND(Lab, Lcd, Rab, Rcd, N) \
movd Lab,tmp1; \
shldl $16,tmp1,tmp2; \
movzbl tmp1_hi,tmp5; \
pextrw $3,Lab,tmp4; \
movzbl tmp2_hi,tmp6; \
pextrw $2,Lab,tmp3; \
andl $0xFF,tmp1; \
movd Sa(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp2; \
movd Sa(1)(,tmp2,4),tmpm2; \
movzbl tmp4_hi,tmp6; \
movd Sa(2)(,tmp5,4),tmpm3; \
andl $0xFF,tmp4; \
movd Sa(3)(,tmp1,4),tmpm4; \
movd Lcd,tmp1; \
movzbl tmp3_hi,tmp5; \
punpckldq Sb(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp3; \
punpckldq Sb(1)(,tmp4,4),tmpm2; \
shldl $16,tmp1,tmp2; \
punpckldq Sb(2)(,tmp5,4),tmpm3; \
movzbl tmp1_hi,tmp5; \
paddd tmpm2,tmpm1; \
pextrw $3,Lcd,tmp4; \
punpckldq Sb(3)(,tmp3,4),tmpm4; \
movzbl tmp2_hi,tmp6; \
pxor tmpm3,tmpm1; \
pextrw $2,Lcd,tmp3; \
pxor P(N)+16,Rab; \
andl $0xFF,tmp1; \
paddd tmpm4,tmpm1; \
andl $0xFF,tmp2; \
pxor tmpm1,Rab; \
movd Sc(0)(,tmp6,4),tmpm1; \
movd Sc(1)(,tmp2,4),tmpm2; \
movzbl tmp4_hi,tmp6; \
movd Sc(2)(,tmp5,4),tmpm3; \
andl $0xFF,tmp4; \
movd Sc(3)(,tmp1,4),tmpm4; \
movzbl tmp3_hi,tmp5; \
punpckldq Sd(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp3; \
punpckldq Sd(1)(,tmp4,4),tmpm2; \
punpckldq Sd(2)(,tmp5,4),tmpm3; \
paddd tmpm2,tmpm1; \
punpckldq Sd(3)(,tmp3,4),tmpm4; \
pxor tmpm3,tmpm1; \
pxor P(N)+24,Rcd; \
paddd tmpm4,tmpm1; \
pxor tmpm1,Rcd
This is 50 instructions per round for 4 instances combined, so 12.5
instructions per round per instance. I think it's pretty good.
I've also tried BMI in several ways, including taking it to the extreme
in terms of minimizing x86 instruction count (likely not the optimal
choice since we actually need to optimize uops, and latency hiding).
That version is:
#define P(N) BF_current+8*N
#define Sa(N) P(18)+0x400*N
#define Sb(N) P(18)+0x1000+0x400*N
#define BF_ROUND(La, La_lo, La_hi, Lb, Lb_lo, Lb_hi, Ra, Rb, N) \
bextr %r14d,La,tmp1; \
bextr %r14d,Lb,tmp5; \
xorl P(N)+8,Ra; \
xorl P(N)+12,Rb; \
bextr %r15d,La,tmp2; \
bextr %r15d,Lb,tmp6; \
movl Sa(0)(,tmp1_64,4),tmp1; \
movl Sb(0)(,tmp5_64,4),tmp5; \
movzbl La_hi,tmp3; \
movzbl Lb_hi,tmp7; \
addl Sa(1)(,tmp2_64,4),tmp1; \
addl Sb(1)(,tmp6_64,4),tmp5; \
movzbl La_lo,tmp4; \
movzbl Lb_lo,tmp8; \
xorl Sa(2)(,tmp3_64,4),tmp1; \
xorl Sb(2)(,tmp7_64,4),tmp5; \
addl Sa(3)(,tmp4_64,4),tmp1; \
addl Sb(3)(,tmp8_64,4),tmp5; \
xorl tmp1,Ra; \
xorl tmp5,Rb
where the out-of-loop initialization code included:
movl $0x0818,%r14d
movl $0x0810,%r15d
This is 20 instructions per round for two interleaved instances, so 10
instructions per round per instance. Thus, BMI can beat MMX2 in x86
instruction count, but like I said that is likely suboptimal in other
aspects.
So many experiments and no luck yet on current non-crippled CPUs and
systems (64-bit with HT).
If I find more time for this, I'll probably proceed with high-level
optimizations first, just to improve things a little bit, achieving
something like 6.8k c/s on the i7-4770K with OpenMP:
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-50.html
by making the uses of Blowfish outside of the variable cost loop
use the interleaved implementation as well. And maybe 7k with
independent instances (no thread safety gives an extra register).
Alexander