Discussion:
nVidia Maxwell support (was: john-users)
magnum
2014-09-23 22:00:54 UTC
For now, I think someone with a Maxwell GPU should try building and
benchmarking our descrypt-opencl on it with the S-boxes that use
bitselect().
Is it safe to assume that OpenCL's bitselect() function will boil down
to the new instructions on Maxwell? Or is it not that simple?

For example, we have many cases like this:

#ifdef USE_BITSELECT
#define F(x, y, z) bitselect((z), (y), (x))
#define G(x, y, z) bitselect((y), (x), (z))
#else
#define F(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
#define G(x, y, z) ((y) ^ ((z) & ((x) ^ (y))))
#endif

or this new one (courtesy of Milen), that I plan to add to a bunch of
kernels:

#ifdef USE_BITSELECT
#define SWAP32(x) bitselect(rotate(x, 24U), rotate(x, 8U), 0x00FF00FFU)
#else
inline uint SWAP32(uint x)
{
x = rotate(x, 16U);
return ((x & 0x00FF00FF) << 8) + ((x >> 8) & 0x00FF00FF);
}
#endif

The thing is, we currently only define USE_BITSELECT for AMD devices.
Would it be safer, in the nvidia case, to leave the non-bitselect
versions for the optimizer to consider? Or would it be safer to use
bitselect, or should it really not matter? It seems to still matter on AMD.

If use of bitselect() increases the chance for better low-level code for
nvidia too, maybe we should always define USE_BITSELECT (I'd still keep
the #ifdefs for quick benchmarks with/without them, as well as for
reference).

magnum
magnum
2014-10-21 15:26:00 UTC
Post by magnum
Is it safe to assume that OpenCL's bitselect() function will boil down
to the new instructions on Maxwell? Or is it not that simple?
#ifdef USE_BITSELECT
#define F(x, y, z) bitselect((z), (y), (x))
#define G(x, y, z) bitselect((y), (x), (z))
#else
#define F(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
#define G(x, y, z) ((y) ^ ((z) & ((x) ^ (y))))
#endif
I had a chance to try out a GTX 980. As I suspected, bitselect seems to
unpredictably affect "optimizer transparency": using it for the above
gains a few percent with some versions of SHA-1 and loses a few with
others.
Post by magnum
The thing is, we currently only define USE_BITSELECT for AMD devices.
Would it be safer, in the nvidia case, to leave the non-bitselect
versions for the optimizer to consider? Or would it be safer to use
bitselect, or should it really not matter? It seems to still matter on AMD.
If use of bitselect() increases the chance for better low-level code for
nvidia too, maybe we should always define USE_BITSELECT (I'd still keep
the #ifdefs for quick benchmarks with/without them, as well as for
reference).
I only tested WPAPSK for now because it's very flexible, so I could try
many combinations of alternate code just by messing around with #ifdefs.
I ended up using the same SHA-1 we use for AMD (Milen's, not Lukas' code
with a separate "SHA1SHORT" hard-coded for short inputs) and enabling
bitselect, for a ~5% gain (exceeding 3 billion SHA-1/s). Using bitselect
gained some speed with Milen's code and lost some with Lukas' code.

Another curious observation: compared to the default scalar mode, Lukas'
code gained a little speed with --force-vector=2 (i.e. using uint2)
while Milen's code lost some. Any larger vector width ruined
performance, as expected (perhaps more than expected).

magnum
magnum
2015-10-13 08:43:24 UTC
Moved here from john-users to spare the normal people.
BTW we now also use LOP3.LUT for many MD4, MD5 and SHA-2 OpenCL formats.
Some driver bug prevented me from using it in SHA-1 with nvidia 352.39
(the code is there, just disabled), and md5crypt has it disabled because
of a performance regression (still to be investigated). Some formats show
a fine boost but none as much as DEScrypt.
That "driver bug" was PEBCAK, fixed now. I also added a trivial Perl
script that (now) correctly calculates the truth table. Here's F5 for
RIPEMD-160:

$ ./truth.pl '((x) ^ ((y) | ~(z)))'
lut3(x, y, z, 0x2d) == ((x) ^ ((y) | ~(z)))

The result also works as-is for AVX-512 "ternarylogic", which will make
life simpler for us.

Most formats now have LOP3.LUT alternatives and they seem to work fine.
Some don't get any boost (which just means the toolchain already did a
good job) but I think md5crypt is the only one with a definite
performance regression (and it still has it disabled). We should get to
the bottom of that. BTW it would be very nice to have CUDA 7.5 on super.

magnum
magnum
2015-10-13 18:37:26 UTC
Post by magnum
Most formats now have LOP3.LUT alternatives and they seem to work fine.
Some don't get any boost (which just means the toolchain already did a
good job) but I think md5crypt is the only one with a definite
performance regression (and it still has it disabled). We should get to
the bottom of that. BTW it would be very nice to have CUDA 7.5 on super.
Comparison of md5crypt kernel compiled with bitselect vs. with explicit
LOP3.LUT for the function primitives:

Bitselect:
ptxas info : 0 bytes gmem, 54 bytes cmem[3]
ptxas info : Compiling entry function 'cryptmd5' for 'sm_52'
ptxas info : Function properties for cryptmd5
ptxas . 592 bytes stack frame, 0 bytes spill stores, 0 bytes
spill loads
ptxas info : Used 38 registers, 344 bytes cmem[0], 268 bytes cmem[2]

Explicit LOP3.LUT:
ptxas info : 0 bytes gmem, 54 bytes cmem[3]
ptxas info : Compiling entry function 'cryptmd5' for 'sm_52'
ptxas info : Function properties for cryptmd5
ptxas . 592 bytes stack frame, 0 bytes spill stores, 0 bytes
spill loads
ptxas info : Used 37 registers, 344 bytes cmem[0], 260 bytes cmem[2]

                 explicit   bitselect
PTX #lines         4293       4375
ISA #lines         4214       4177
DEPBAR               56         62
LOP32I               31         33
LOP3                372        372
.reuse              235        349
LOP3 w/ .reuse       95        103
IADD32              420        400
IADD3               381        383

Less DEPBAR should be a good thing but I think the much lower ".reuse"
number is not, and this may be the main problem. But we can't specify
which registers to use! Perhaps the register scheduling when using
inline PTX lop3 will improve over time. After reading some forum posts
about register slots I actually tried using alternate lop3 immediates,
shuffling x, y and z around. I could only conclude it *does* sometimes
matter... but the chance of actually controlling the situation appears
pretty slim to me.

LOP3 immediates used:
explicit: 0x39, 0x96, 0xca, 0xe4 (just the ones used in my functions).
bitselect: 0x4b, 0x96, 0xac, 0xb8, 0xca.

For reference, the natural truth table for just a bitselect is 0xd8, and
the alternatives when shuffling x, y and z around are 0xac, 0xb8, 0xca,
0xe2 and 0xe4. And 0x96 is (x ^ y ^ z) in any order. That leaves 0x4b to
investigate. Doing so, I think I located the relevant section in PTX vs.
ISA, but I don't understand what's happening there, and I gave up at
that point.

On another note, I find it strange that the difference in 2-op adds
doesn't match the difference in 3-op adds at all.

magnum
