Discussion:
[john-dev] optimizing bcrypt cracking on x86
Solar Designer
2015-06-24 04:10:07 UTC
Hi,

This is to share some of my thoughts and results from a while ago.

As a side-effect of Katja's project, we realized that with our current
bcrypt code we're still way below the peak L1 data cache reads rate on
x86 CPUs:

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-51.html
http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-52.html

As you can see, if L1 data cache reads were the bottleneck, i7-4770K
could theoretically do up to ~11.1k c/s, but it only achieves ~6.6k c/s
with our current code. That's about 60%. Not bad, but there might be
room for improvement.

The 11.1k figure assumes the 3.7 GHz max turbo clock rate with all cores
in use (I think that's realistic), but also that P is read from cache (not
registers) and that there's no combining of P reads. Since P is being read
sequentially, it may be possible to combine its 32-bit reads into fewer
larger reads (whether for one bcrypt instance or across instances), and
then we'd need fewer than 5 reads from cache per Blowfish round on
average (but rather 4.5 with 64-bit reads, or even closer to 4 with
wider than 64-bit reads). With this, the 11.1k figure might even be an
under-estimate. At 4.5 reads/round, it'd be 12.3k.
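
As a sanity check on where these figures come from, here's my own
back-of-the-envelope accounting in C (assuming the standard bcrypt
algorithm at cost 5 and 2 loads/cycle/core; it lands near, though not
exactly on, the slides' numbers, which may count the setup slightly
differently):

#include <stdio.h>

int main(void)
{
	/* My accounting, assuming standard bcrypt at cost 5: EksBlowfishSetup
	   runs 1 + 2*2^5 = 65 ExpandKeys, each (18 + 1024) / 2 = 521 Blowfish
	   encryptions, plus 64 * 3 = 192 encryptions of the magic string. */
	double encs = 65 * 521.0 + 64 * 3;
	double rounds = encs * 16;	/* 16 Blowfish rounds each */
	double reads = 2 * 4 * 3.7e9;	/* 2 loads/cycle x 4 cores x 3.7 GHz */
	printf("%.1fk c/s at 5 reads/round\n", reads / (rounds * 5) / 1e3);
	printf("%.1fk c/s at 4.5 reads/round\n", reads / (rounds * 4.5) / 1e3);
	return 0;
}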

Now, the bottleneck does indeed appear to be different. It appears to
be taking too many instructions and uops to extract all of the S-box
indices and compute the effective addresses to keep the L1 data cache
read ports 100% busy at all times. (We only keep them about 60% busy.)

Can we optimize the instruction stream such that fewer uops would be
needed per S-box lookup on average? My gut feeling, especially after
the experiments that I'll describe below, is that we can't do much on
current CPUs. But nevertheless we can probably do a little bit better.

We've previously tried AVX2 with no luck (on Haswell; future CPUs might
or might not do better):

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-42.html

Alain has recently mentioned similar poor results for his attempts to
use AVX2, and he also got similar speeds with SSE4.1 (which means doing
the gather loads manually). It is not surprising that there's little
difference between Haswell's microcoded AVX2 gather loads and manually
performed ones at x86 instruction stream level. By "similar poor
results" I mean speeds of around 4k c/s on quad-core Haswell, whereas
our interleaving of scalar instructions, even when implemented in C,
achieves 6k+ c/s (such as the ~6.6k figure I mentioned above).

Alain also mentioned trying BMI (the extra scalar instructions that were
introduced at the same time as AVX2), with a slight speed improvement.

I had previously tried BMI (in Nov 2013, per file timestamps) at C
source code level, and IIRC this didn't help (but didn't hurt much
either, leaving the chance that a pure asm implementation would be good).

The relevant BMI instructions are BEXTR, PEXT, and SHRX. BEXTR and PEXT
are available via _bextr_u32() and _pext_u64() intrinsics, respectively.
For SHRX, I tried inline asm, which hurts gcc's instruction scheduling.
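
For reference, here's a minimal C sketch (mine, not JtR's actual code; the
function and array names are made up) of BEXTR-based index extraction for
the Blowfish F function via the intrinsic:

#include <stdint.h>
#include <x86intrin.h>	/* _bextr_u32(); build with gcc -mbmi */

/* Blowfish F with BEXTR-based index extraction (a sketch) */
static inline uint32_t bf_f(const uint32_t S[4][256], uint32_t L)
{
	uint32_t a = _bextr_u32(L, 24, 8);	/* bits 31:24 */
	uint32_t b = _bextr_u32(L, 16, 8);	/* bits 23:16 */
	uint32_t c = _bextr_u32(L, 8, 8);	/* bits 15:8 */
	uint32_t d = L & 0xFF;			/* low byte needs no BEXTR */

	return ((S[0][a] + S[1][b]) ^ S[2][c]) + S[3][d];
}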

I've now tried BEXTR in pure asm code as well, see below. I haven't yet
re-tried SHRX in pure asm; this is worth trying.

Anyway, my recent idea was to try MMX2, in particular making use of the
PEXTRW instruction introduced with Pentium 3 (along with the first SSE).
I must admit this didn't appear to allow us to use Haswell's ports any
more optimally, but I decided to give it a try anyway.

http://www.realworldtech.com/haswell-cpu/4/

Even with use of MMX2, there are still plenty of scalar operations for
extracting the S-box indices, and if that's not enough to use Haswell's
port 6 fully, we can mix in 1 or 2 scalar implementations as well.

Here are the speeds I got on Pentium 3, 1.0 GHz:

Original Pentium optimized asm code - 144 c/s
Pentium Pro/2/3 optimized asm code - 171 c/s
Pure C, 2x interleaving, gcc 4.6.3 - 211 c/s
New 2x2 (4 instances) MMX2 asm code - 244 c/s

(The "Pure C, 2x interleaving" version is what's actually in JtR for a
while now, when building for 32-bit x86 with not-too-old gcc.)

Wow. That's a 15%+ improvement over the previous best result. I wish
speeds on this CPU were still practically relevant.

Moving to i7-4770K, the same 32-bit binary still runs a bit faster than
a 32-bit build of JtR as publicly released, compiled with "gcc version
4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)". One instance:

Old: 1049 c/s
New: 1347 c/s

8 concurrent instances (separate processes in both cases):

Old: 760*8 = 6080 c/s
New: 772*8 = 6176 c/s

This is only a 1.6% improvement, but perhaps for CPUs without HT the
improvement is much greater (as seen from the single instance speeds).

On 64-bit builds, though, I only got this to run at cumulative speeds
like 780*8 = 6240 c/s, which is worse than 6595 c/s previously seen with
OpenMP (and even worse than the slightly better speeds that can be seen
with separate independent processes). Replacing one of the MMX2
instances with one or two scalar instances (thus, 3 or 4 instances
total per thread) somehow didn't help. (Maybe I just didn't try hard
enough.)

So we appear to have an improvement for crippled systems - ancient CPUs
like Pentium 3 or, more importantly, 32-bit builds running on lower-cost
HT-less versions of recent CPUs. Unfortunately, since the improvement
is not universal and since data layout changes are needed (64-bit P
reads that I mentioned above), I can't just commit this code as-is.

The 2x2 MMX2 code looks like:

#define P(N) BF_current+16*N
#define Sa(N) P(18)+0x400*N
#define Sb(N) P(18)+0x1000+0x400*N
#define Sc(N) P(18)+0x2000+0x400*N
#define Sd(N) P(18)+0x3000+0x400*N
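
/* Layout notes (my annotations, not in the original code): BF_current
   holds 4 interleaved bcrypt instances, so each 32-bit P entry occupies
   16 bytes (one 4-byte lane per instance); pxor P(N)+16 and P(N)+24 below
   pick up the ab and cd lanes of P entry N+1.  The S-boxes are
   per-instance: Sa(0..3) are instance a's four 256-entry S-boxes (0x400
   bytes each), with Sb, Sc, Sd at 0x1000-byte strides for the other three
   instances.  Lab/Lcd and Rab/Rcd are MMX registers holding the L and R
   halves of two instances each; tmpN are 32-bit GPRs, tmpN_hi their
   %ah-style high-byte aliases, and tmpmN are MMX temporaries. */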

#define BF_ROUND(Lab, Lcd, Rab, Rcd, N) \
movd Lab,tmp1; \
shldl $16,tmp1,tmp2; \
movzbl tmp1_hi,tmp5; \
pextrw $3,Lab,tmp4; \
movzbl tmp2_hi,tmp6; \
pextrw $2,Lab,tmp3; \
andl $0xFF,tmp1; \
movd Sa(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp2; \
movd Sa(1)(,tmp2,4),tmpm2; \
movzbl tmp4_hi,tmp6; \
movd Sa(2)(,tmp5,4),tmpm3; \
andl $0xFF,tmp4; \
movd Sa(3)(,tmp1,4),tmpm4; \
movd Lcd,tmp1; \
movzbl tmp3_hi,tmp5; \
punpckldq Sb(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp3; \
punpckldq Sb(1)(,tmp4,4),tmpm2; \
shldl $16,tmp1,tmp2; \
punpckldq Sb(2)(,tmp5,4),tmpm3; \
movzbl tmp1_hi,tmp5; \
paddd tmpm2,tmpm1; \
pextrw $3,Lcd,tmp4; \
punpckldq Sb(3)(,tmp3,4),tmpm4; \
movzbl tmp2_hi,tmp6; \
pxor tmpm3,tmpm1; \
pextrw $2,Lcd,tmp3; \
pxor P(N)+16,Rab; \
andl $0xFF,tmp1; \
paddd tmpm4,tmpm1; \
andl $0xFF,tmp2; \
pxor tmpm1,Rab; \
movd Sc(0)(,tmp6,4),tmpm1; \
movd Sc(1)(,tmp2,4),tmpm2; \
movzbl tmp4_hi,tmp6; \
movd Sc(2)(,tmp5,4),tmpm3; \
andl $0xFF,tmp4; \
movd Sc(3)(,tmp1,4),tmpm4; \
movzbl tmp3_hi,tmp5; \
punpckldq Sd(0)(,tmp6,4),tmpm1; \
andl $0xFF,tmp3; \
punpckldq Sd(1)(,tmp4,4),tmpm2; \
punpckldq Sd(2)(,tmp5,4),tmpm3; \
paddd tmpm2,tmpm1; \
punpckldq Sd(3)(,tmp3,4),tmpm4; \
pxor tmpm3,tmpm1; \
pxor P(N)+24,Rcd; \
paddd tmpm4,tmpm1; \
pxor tmpm1,Rcd

This is 50 instructions per round for 4 instances combined, so 12.5
instructions per round per instance. I think it's pretty good.

I've also tried BMI in several ways, including taking it to the extreme
in terms of minimizing x86 instruction count (likely not the optimal
choice, since what we actually need to optimize is uops and latency hiding).
That version is:

#define P(N) BF_current+8*N
#define Sa(N) P(18)+0x400*N
#define Sb(N) P(18)+0x1000+0x400*N

#define BF_ROUND(La, La_lo, La_hi, Lb, Lb_lo, Lb_hi, Ra, Rb, N) \
bextr %r14d,La,tmp1; \
bextr %r14d,Lb,tmp5; \
xorl P(N)+8,Ra; \
xorl P(N)+12,Rb; \
bextr %r15d,La,tmp2; \
bextr %r15d,Lb,tmp6; \
movl Sa(0)(,tmp1_64,4),tmp1; \
movl Sb(0)(,tmp5_64,4),tmp5; \
movzbl La_hi,tmp3; \
movzbl Lb_hi,tmp7; \
addl Sa(1)(,tmp2_64,4),tmp1; \
addl Sb(1)(,tmp6_64,4),tmp5; \
movzbl La_lo,tmp4; \
movzbl Lb_lo,tmp8; \
xorl Sa(2)(,tmp3_64,4),tmp1; \
xorl Sb(2)(,tmp7_64,4),tmp5; \
addl Sa(3)(,tmp4_64,4),tmp1; \
addl Sb(3)(,tmp8_64,4),tmp5; \
xorl tmp1,Ra; \
xorl tmp5,Rb

where the out-of-loop initialization code included:

movl $0x0818,%r14d
movl $0x0810,%r15d
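
(For readers unfamiliar with BEXTR's control word format - start bit
position in bits 7:0, field length in bits 15:8 - 0x0818 extracts the 8
bits starting at bit 24, i.e. bits 31:24 of L, and 0x0810 extracts the 8
bits starting at bit 16, i.e. bits 23:16.  The two low indices come from
the movzbl of La_hi/La_lo, the %ah/%al style aliases, so all four S-box
indices are obtained without plain shift instructions.)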

This is 20 instructions per round for two interleaved instances, so 10
instructions per round per instance. Thus, BMI can beat MMX2 in x86
instruction count, but like I said that is likely suboptimal in other
aspects.

So many experiments and no luck yet on current non-crippled CPUs and
systems (64-bit with HT).

If I find more time for this, I'll probably proceed with high-level
optimizations first, just to improve things a little bit, achieving
something like 6.8k c/s on the i7-4770K with OpenMP:

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-50.html

by making the uses of Blowfish outside of the variable cost loop
use the interleaved implementation as well. And maybe 7k with
independent instances (no thread safety gives an extra register).

Alexander
Solar Designer
2015-06-24 06:43:47 UTC
Post by Solar Designer
Now, the bottleneck does indeed appear to be different. It appears to
be taking too many instructions and uops to extract all of the S-box
indices and compute the effective addresses to keep the L1 data cache
read ports 100% busy at all times. (We only keep them about 60% busy.)
BTW, optimizing the effective address calculation could make a
difference. In my testing on Haswell, not of bcrypt code but in
general, there is performance impact from using larger than 1-byte
displacement, as well as from using an index register (even if the
scaling factor is set to 1). Loads that use only a base register with
either no displacement or a 1-byte displacement (sufficient e.g. to skip
P-boxes and get to the first S-box) run faster (or rather, have less
impact on issue rate of adjacent instructions).

To avoid needing index scaling, we could use 64-bit S-box elements,
using bits 33 to 2 for data. Extracting the lowest pre-scaled index is
then quick. Extracting the remaining 3, not so much - can't just use
BEXTR, need an AND. Maybe the cost of an AND can be amortized, if
applied to two non-adjacent pre-scaled indices at once, and then BEXTR
could be used. BMI's 3-op ANDN instruction may be used to avoid MOVs.

Unfortunately, when using only a base register, we can't avoid either
needing large displacements or ADDing them to that register first. Yet
in my testing performance impact from using both a large displacement
and an index at once is worse than from using either of these alone.
So maybe large_displacement+base is better than small+base+index*4.
... Just tested on artificial code: yes, this is so, especially when the
instruction is a load-op rather than a pure load.

Unfortunately, there's also performance impact from larger than 1-byte
immediate values in instructions such as AND. So we'd need to minimize
the number of different masks and preload them into registers. This
means that this approach is possibly only worthwhile on CPUs with HT,
where we don't need more than 2x interleaving.
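
For illustration, the difference would be something like this (my sketch;
0x3FC is a made-up mask value):

movl $0x3FC,%r12d	# one-time setup, outside the hot loop
...
andl %r12d,%eax		# register form: no 4-byte immediate per use

versus repeating "andl $0x3FC,%eax", which encodes a 4-byte immediate at
every use.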

Alexander
Solar Designer
2017-03-26 18:52:45 UTC
Post by Solar Designer
Post by Solar Designer
Now, the bottleneck does indeed appear to be different. It appears to
be taking too many instructions and uops to extract all of the S-box
indices and compute the effective addresses to keep the L1 data cache
read ports 100% busy at all times. (We only keep them about 60% busy.)
BTW, optimizing the effective address calculation could make a
difference. In my testing on Haswell, not of bcrypt code but in
general, there is performance impact from using larger than 1-byte
displacement, as well as from using an index register (even if the
scaling factor is set to 1). Loads that use only a base register with
either no displacement or a 1-byte displacement (sufficient e.g. to skip
P-boxes and get to the first S-box) run faster (or rather, have less
impact on issue rate of adjacent instructions).
FWIW, I just found this note:

http://www.7-cpu.com/cpu/Haswell.html

L1 Data Cache Latency = 4 cycles for simple access via pointer
L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).

So maybe (just maybe) the "performance impact from using larger than
1-byte displacement, as well as from using an index register" was
actually from an extra cycle of latency on the load itself (the effect
of which on issue of other instructions may vary).
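
If anyone wants to measure the two cases on their own CPU, here's a rough
microbenchmark sketch (mine, not from the 7-cpu site; it prints ns per
dependent load, to be converted to cycles at your clock rate):

#include <stdio.h>
#include <time.h>

#define N 1024			/* 8 KB tables, L1d-resident */
#define ITERS 200000000UL

static double secs(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void)
{
	static size_t idx[N];
	static void *ptr[N];
	struct timespec t0, t1;
	size_t i, n;
	unsigned long j;
	void *p;

	for (i = 0; i < N; i++) {
		idx[i] = (i + 1) & (N - 1);		/* for n = p[n] */
		ptr[i] = &ptr[(i + 1) & (N - 1)];	/* for p = *p */
	}

	n = 0;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (j = 0; j < ITERS; j++)
		n = idx[n];		/* complex: mov (base,%rax,8),%rax */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("indexed: %.2f ns/load (n=%zu)\n",
	    secs(t0, t1) / ITERS * 1e9, n);

	p = &ptr[0];
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (j = 0; j < ITERS; j++)
		p = *(void **)p;	/* simple: mov (%rax),%rax */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("pointer: %.2f ns/load (p=%p)\n",
	    secs(t0, t1) / ITERS * 1e9, p);

	return 0;
}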
Post by Solar Designer
To avoid needing index scaling, we could use 64-bit S-box elements,
using bits 33 to 2 for data. Extracting the lowest pre-scaled index is
then quick. Extracting the remaining 3, not so much - can't just use
BEXTR, need an AND. Maybe the cost of an AND can be amortized, if
applied to two non-adjacent pre-scaled indices at once, and then BEXTR
could be used. BMI's 3-op ANDN instruction may be used to avoid MOVs.
Unfortunately, when using only a base register, we can't avoid either
needing large displacements or ADDing them to that register first. Yet
in my testing performance impact from using both a large displacement
and an index at once is worse than from using either of these alone.
So maybe large_displacement+base is better than small+base+index*4.
... Just tested on artificial code: yes, this is so, especially when the
instruction is a load-op rather than a pure load.
Unfortunately, there's also performance impact from larger than 1-byte
immediate values in instructions such as AND. So we'd need to minimize
the number of different masks and preload them into registers. This
means that this approach is possibly only worthwhile on CPUs with HT,
where we don't need more than 2x interleaving.
Alexander
Solar Designer
2017-03-27 11:55:34 UTC
Post by Solar Designer
Post by Solar Designer
BTW, optimizing the effective address calculation could make a
difference. In my testing on Haswell, not of bcrypt code but in
general, there is performance impact from using larger than 1-byte
displacement, as well as from using an index register (even if the
scaling factor is set to 1). Loads that use only a base register with
either no displacement or a 1-byte displacement (sufficient e.g. to skip
P-boxes and get to the first S-box) run faster (or rather, have less
impact on issue rate of adjacent instructions).
http://www.7-cpu.com/cpu/Haswell.html
L1 Data Cache Latency = 4 cycles for simple access via pointer
L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
So maybe (just maybe) the "performance impact from using larger than
1-byte displacement, as well as from using an index register" was
actually from an extra cycle of latency on the load itself (the effect
of which on issue of other instructions may vary).
Also, I just noticed this:

http://www.realworldtech.com/haswell-cpu/5/

"in Sandy Bridge [...] Address generation was a single cycle when the
(base + offset) is < 2K, with an extra cycle for larger (base + offset)
or (base + index + offset) addressing" and per that same article it
sounds like this remained the same in Haswell.

Alexander

Alain Espinosa
2015-06-24 15:56:43 UTC
-------- Original message --------
From: Solar Designer <***@openwall.com>
Date:06/24/2015 12:10 AM (GMT-05:00)
To: john-***@lists.openwall.com
Cc:
Subject: [john-dev] optimizing bcrypt cracking on x86

...Alain has recently mentioned similar poor results for his attempts to
use AVX2

One thing worth trying is to interleave scalar instructions with AVX2. I am not quite sure how AVX2 gather works, but if it uses one ALU port and one read port then there are 3 ALU ports and one read port idle. I will try this in the near future.

...shldl $16,tmp1,tmp2
(Latency 2, throughput 1)

Faster is:
mov tmp2, tmp1
shl tmp2, 16
(Latency 2, throughput 0.75)

Note that for Haswell, shld is slower in throughput than on older architectures.

...bextr %r14d,La,tmp1;

I get the BMI speedup using shrx. I think bextr is similar to shld, and it is faster to use shrx followed by and. Unfortunately, I don't have BMI instruction latency/throughput figures. There are none in the "Intel 64 and IA-32 Architectures Optimization Reference Manual", June 2013 version (too old?).

Regards,
Alain
Solar Designer
2015-06-24 21:03:02 UTC
Post by Alain Espinosa
One thing worth trying is to interleave scalar instructions with AVX2.
I didn't try exactly that - interleaving within the same thread - but I
did try running a mix of AVX2 and scalar bcrypt threads on the i7-4770K
back in 2013. No luck. On your HT-less CPU, you will actually have to
go for the effort of mixing instructions within one thread, and please
feel free to try, but based on that old result I don't expect you'd have
any luck.
Post by Alain Espinosa
...shldl $16,tmp1,tmp2
(Latency 2, throughput 1)
Per these files:

GenuineIntel00306C3_HaswellXeon_InstLatX64.txt
GenuineIntel00306C3_Haswell_InstLatX64.txt
GenuineIntel00306C3_Haswell_InstLatX64.txt-2013

from http://users.atw.hu/instlatx64/

"SHLD r32, r32, imm8" is latency 1, throughput 0.5.

("-2013" is my own older saved version, downloaded in 2013. In fact, my
copies of the *.txt might also be older than what's currently on the
website.)
Post by Alain Espinosa
mov tmp2, tmp1
shl tmp2, 16
(Latency 2, throughput 0.75)
You mean SHR, not SHL. Yes, I was using SHLD, but the equivalent
sequence will use SHR.
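
For clarity, as I read the 2x2 macro, only the low 16 bits of tmp2 (which
SHLD fills from tmp1's high half) are consumed afterwards, so

shldl $16,tmp1,tmp2	# tmp2 = (tmp2 << 16) | (tmp1 >> 16)

is used purely for its tmp2 = tmp1 >> 16 effect, and

movl tmp1,tmp2
shrl $16,tmp2

computes the same thing for the bits that matter.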

I've just tried this as well. I got speedup for 1 thread/core, but
significant slowdown for 2 threads/core. I also tried moving the two
instructions apart, which didn't affect the speeds much.
Post by Alain Espinosa
Note that for Haswell, shld is slower in throughput than on older architectures.
I am not seeing that.
Post by Alain Espinosa
...bextr %r14d,La,tmp1;
I get the BMI speedup using shrx. I think bextr is similar to shld, and it is faster to use shrx followed by and. Unfortunately, I don't have BMI instruction latency/throughput figures. There are none in the "Intel 64 and IA-32 Architectures Optimization Reference Manual", June 2013 version (too old?).
http://users.atw.hu/instlatx64/ has BMI instructions.

BEXTR is latency 2, throughput 1. SHRX is latency 1, throughput 0.5.

Yes, SHRX + AND might be faster, depending on context I guess.
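
In the notation of the code I posted, the alternative would look roughly
like this (with a plain shift count preloaded into %r14d instead of a
BEXTR control word; the AND mask could also be preloaded into a register
to avoid the immediate):

bextr %r14d,La,tmp1	# r14d = 0x0810: tmp1 = (La >> 16) & 0xFF, latency 2

vs.

shrx %r14d,La,tmp1	# r14d = 16: tmp1 = La >> 16, latency 1
andl $0xFF,tmp1		# latency 1, but an extra uop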

Alexander
Alain Espinosa
2015-06-24 21:57:55 UTC
-------- Original message --------
From: Solar Designer <***@openwall.com>
Date:06/24/2015 5:03 PM (GMT-05:00)
To: john-***@lists.openwall.com
Cc:
Subject: Re: [john-dev] optimizing bcrypt cracking on x86

..."SHLD r32, r32, imm8" is latency 1, throughput 0.5.

In "Intel 64 and IA-32 Architectures Optimization Reference Manual", version 2013, June; page 618 (latency-throughput):

06_3C,06_45 06_3A,06_3E,06_2A,06_2D
shld r32, r32, imm (2-1) (1-0.5)

On page 593:
06_3C,06_45 is Haswell
06_3A,06_3E is Ivy Bridge
06_2A,06_2D is Sandy Bridge

For older CPUs it shows (4-1).

Regards,
Alain
Alain Espinosa
2015-06-24 22:06:16 UTC
-------- Original message --------
From: Solar Designer <***@openwall.com>
Date:06/24/2015 5:03 PM (GMT-05:00)
To: john-***@lists.openwall.com
Cc:
Subject: Re: [john-dev] optimizing bcrypt cracking on x86

...I got speedup for 1 thread/core, but
significant slowdown for 2 threads/core.

This is another thing that is different in my tests (maybe my asm code is suboptimal). On a Core i3-2120 I get a 4% speedup interleaving 3 keys instead of 2. This is using 4 threads.

Regards,
Alain
Solar Designer
2015-06-24 22:45:02 UTC
Post by Alain Espinosa
...I got speedup for 1 thread/core, but
significant slowdown for 2 threads/core.
This is another thing that is different in my tests (maybe my asm code is suboptimal). On a Core i3-2120 I get a 4% speedup interleaving 3 keys instead of 2. This is using 4 threads.
Of course, on an HT-less CPU you need to interleave 3 or 4 instances
rather than just 2.

In fact, with my 2x2 MMX2 code I am experimenting with 4 parallel
instances (2x crippled SIMD, 2x interleaving) on i7-4770K as well.

Replacing those SHLD with MOV+SHR got me slowdown for 2 threads/core
even at 4 instances/thread. (But that's the 2x2 thing, not simple 4x
interleaving.)

Replacing those SHLD with SHRX similarly speeds things up for 1 thread/core
(just like MOV+SHR), but keeps performance the same for 2 threads/core
(unlike the slowdown seen with MOV+SHR). Unfortunately, it wastes a
register to hold the shift count, which may prevent other optimizations.

Alexander
Alain Espinosa
2015-06-24 23:18:56 UTC
-------- Original message --------
From: Solar Designer <***@openwall.com>
Date:06/24/2015 6:45 PM (GMT-05:00)
To: john-***@lists.openwall.com
Cc:
Subject: Re: [john-dev] optimizing bcrypt cracking on x86
Post by Alain Espinosa
This is another thing that is different in my tests (maybe my asm code is suboptimal). On a Core i3-2120 I get a 4% speedup interleaving 3 keys instead of 2. This is using 4 threads.
...Of course, on an HT-less CPU you need to interleave 3 or 4 instances
rather than just 2.

This is an HT-capable CPU: 2 cores, each with 2 HT threads.

Regards,
Alain
Solar Designer
2015-06-24 23:20:32 UTC
Post by Alain Espinosa
Post by Alain Espinosa
This is another thing that is different in my tests (maybe my asm code is suboptimal). On a Core i3-2120 I get a 4% speedup interleaving 3 keys instead of 2. This is using 4 threads.
...Of course, on an HT-less CPU you need to interleave 3 or 4 instances
rather than just 2.
This is an HT-capable CPU: 2 cores, each with 2 HT threads.
Oh, sorry. So you have both kinds of CPUs (HT and not) for your bcrypt
testing now? That's great.

Alexander
Alain Espinosa
2015-06-24 23:50:15 UTC
-------- Original message --------
From: Solar Designer <***@openwall.com>
Date:06/24/2015 7:20 PM (GMT-05:00)
To: john-***@lists.openwall.com
Cc:
Subject: Re: [john-dev] optimizing bcrypt cracking on x86

...So you have both kinds of CPUs (HT and not) for your bcrypt
testing now?

Yes. Attached are benchmarks from all my CPUs. There are 3 different architectures: Core 2 Duo, Sandy Bridge, and Haswell. Can you make a direct comparison between the Core i5 and the Core i7? Perhaps my assembly code is suboptimal.

One thing I notice is that you first tried 32-bit asm code. I use 64-bit asm code only. Can this make a difference?

Regards,
Alain
Solar Designer
2015-06-25 05:57:42 UTC
Post by Alain Espinosa
Yes. Attached are benchmarks from all my CPUs. There are 3 different architectures: Core 2 Duo, Sandy Bridge, and Haswell. Can you make a direct comparison between the Core i5 and the Core i7? Perhaps my assembly code is suboptimal.
Thanks. I didn't mean to say your code is worse than mine. It probably
is OK, possibly even better than mine - for your CPU at least.
Post by Alain Espinosa
One thing I notice is that you first tried 32-bit asm code. I use 64-bit asm code only. Can this make a difference?
Sure. When I first ported the 32-bit mode MMX2 code to 64-bit mode, it
actually ran slower than it did in 32-bit mode on the same Haswell CPU -
there were extra prefixes. I then revised it slightly to avoid the
prefixes, and it became the same speed. And then I could proceed to make
use of the extra 8 GPRs and of BMI, which I did - but with limited
success, as I documented.

I merely wanted to share the experiments and findings and ideas so far.
I do not claim to have achieved much.

In a way, I am relieved that there doesn't appear to be much room for
further attack speedup on current x86 CPUs.

Alexander
Solar Designer
2015-06-25 04:33:21 UTC
Post by Solar Designer
On 64-bit builds, though, I only got this to run at cumulative speeds
like 780*8 = 6240 c/s, which is worse than 6595 c/s previously seen with
OpenMP (and even worse than the slightly better speeds that can be seen
with separate independent processes).
I managed to improve this to 796*8 = 6368 c/s by removing some of the
large displacements on loads, and instead keeping them in base registers
(using the extra GPRs that we have in 64-bit mode for this). For the
288 bytes of P, an offset into the middle of this range may be put into
a register, and then 256 out of the 288 bytes may be accessed via 1-byte
displacements (or alternatively 248 out of 288, but then we can also
access the first S-box via the same base register with 0x78 in the
1-byte displacement). Also, remembering that R13 is special just like
RBP (no without-displacement encoding) can sometimes be helpful.
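
Concretely (my arithmetic): disp8 is signed, -128 to +127.  With the base
register at P+144, the range P+16 through P+271 covers 256 of the 288
bytes; with the base at P+168, it covers P+40 through P+295, i.e. 248
bytes of P, and base+0x78 = P+288 is exactly the start of the first S-box.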

This is still not good enough, though.

Alexander
Solar Designer
2015-07-01 10:05:05 UTC
Post by Solar Designer
Post by Solar Designer
On 64-bit builds, though, I only got this to run at cumulative speeds
like 780*8 = 6240 c/s, which is worse than 6595 c/s previously seen with
OpenMP (and even worse than the slightly better speeds that can be seen
with separate independent processes).
I managed to improve this to 796*8 = 6368 c/s by removing some of the
large displacements on loads, and instead keeping them in base registers
(using the extra GPRs that we have in 64-bit mode for this). For the
288 bytes of P, an offset into the middle of this range may be put into
a register, and then 256 out of the 288 bytes may be accessed via 1-byte
displacements (or alternatively 248 out of 288, but then we can also
access the first S-box via the same base register with 0x78 in the
1-byte displacement). Also, remembering that R13 is special just like
RBP (no without-displacement encoding) can sometimes be helpful.
Another related trick, which I haven't tried yet, is to interleave pairs
of S-boxes (from the same bcrypt instance or from different instances).
Then the same base register could be used to access two of such S-boxes
at once, with "4" in the displacement field (fits 1-byte, obviously) for
the second S-box in a pair. The index scaling would be by 8 rather than
by 4, but it's the same cost. This way, only 4 base registers would be
needed to access everything for 2 bcrypt instances, with at most 1-byte
displacements (thus avoiding 4-byte displacements, which appear to cost
extra on Haswell).
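
A minimal C sketch of such a paired layout (my illustration; the names
are made up):

#include <stdint.h>

/* Two 256-entry S-boxes interleaved pairwise: entry i of both boxes
   shares one 8-byte slot, so one base register reaches them both. */
struct spair {
	uint32_t s0;	/* S-box A, entry i: (base,i,8), no displacement */
	uint32_t s1;	/* S-box B, entry i: 4(base,i,8), 1-byte displacement */
};

static struct spair SP[256];

static inline uint32_t lookup_a(const struct spair *base, uint32_t i)
{
	return base[i].s0;
}

static inline uint32_t lookup_b(const struct spair *base, uint32_t i)
{
	return base[i].s1;
}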

Another advantage of such interleaving is that we're guaranteed to have
no cache bank conflict between lookups from these two S-boxes then. Per
my testing, this is irrelevant for Haswell, but it might be relevant on
other CPUs (and not only CPUs).
Post by Solar Designer
This is still not good enough, though.
Alexander