Discussion: 64-bit rotate on AMD GCN
Solar Designer
2015-10-10 04:52:06 UTC
Claudio, magnum, Agnieszka -

I've just tried out Claudio's latest sha512crypt-opencl commits - good
improvement on GCN, thanks! Most of the time, I got these -test speeds
on super's -dev=2 (Tahiti 1050 MHz, Catalyst 15.7):

Speed for cost 1 (iteration count) of 5000
Raw: 39125 c/s real, 1456K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw: 38325 c/s real, 728177 c/s virtual

On some rare occasions, I got:

Speed for cost 1 (iteration count) of 5000
Raw: 55538 c/s real, 1456K c/s virtual

even though the speeds reported by -v=4 during auto-tuning never went
above 41k. It's weird.

Then I noticed that opencl_sha512.h uses:

#define ror(x, n) ((x >> n) | (x << (64UL-n)))

In the generated ISA code, there are no v_alignbit_b32 instructions.
So I tried to use rotate():

#define ror(x, n) rotate(x, 64UL-n)

and speeds went up (except for the very rare 55k seen above), e.g. on
four consecutive tests:

Speed for cost 1 (iteration count) of 5000
Raw: 44734 c/s real, 1310K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw: 48188 c/s real, 728177 c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw: 43690 c/s real, 728177 c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw: 44887 c/s real, 2383K c/s virtual

The ISA changed, but still did not use v_alignbit_b32 instructions.
Notably, while all IL sizes and the ISA sizes for kernel_crypt and
kernel_final remained unchanged (although the instructions and VGPR
usage changed), the ISA size for kernel_prepare went down:

-codeLenInByte = 163080 bytes;
+codeLenInByte = 162504 bytes;

(It's huge either way, though.)

Next I tried to use amd_bitalign() explicitly, initially for half of
the rotates:

#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : rotate((x), 64UL - (n)))

IL sizes went way up, ISA sizes significantly down (for all 3 kernels),
and speed went up:

Speed for cost 1 (iteration count) of 5000
Raw: 60124 c/s real, 2621K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw: 54841 c/s real, 2621K c/s virtual

Of course, there were now v_alignbit_b32 instructions in the generated
ISA code. I ran only these two benchmarks with this code revision
before proceeding further, to:

#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))

IL went further up, ISA further down, and the speeds went in an unclear
direction:

Speed for cost 1 (iteration count) of 5000
Raw: 54386 c/s real, 2912K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw: 54841 c/s real, 1456K c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw: 54161 c/s real, 819200 c/s virtual

Speed for cost 1 (iteration count) of 5000
Raw: 53718 c/s real, 728177 c/s virtual

even though the speeds reported during auto-tuning went up (above 62k
is now reported during auto-tuning, only to be followed by 54k'ish
speeds on the final benchmark). For example:

[***@super run]$ AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps ./john -test -form=sha512crypt-opencl -dev=2 -v=4
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]...
Device 2: Tahiti [AMD Radeon HD 7900 Series]
Calculating best GWS for LWS=64; max. 150ms single kernel invocation.
gws: 2048 5158 25790000 rounds/s 397.002ms per crypt_all()!
gws: 4096 14338 71690000 rounds/s 285.657ms per crypt_all()!
gws: 8192 31089 155445000 rounds/s 263.499ms per crypt_all()!
gws: 16384 59464 297320000 rounds/s 275.524ms per crypt_all()+
gws: 32768 59717 298585000 rounds/s 548.714ms per crypt_all()
gws: 65536 61746 308730000 rounds/s 1.061s per crypt_all()+
gws: 131072 62269 311345000 rounds/s 2.104s per crypt_all()
Calculating best LWS for GWS=65536
Testing LWS=64 GWS=65536 ... 43.049ms+
Testing LWS=128 GWS=65536 ... 45.408ms
Testing LWS=192 GWS=65472 ... 61.590ms
Testing LWS=256 GWS=65536 ... 42.687ms+
Calculating best GWS for LWS=256; max. 300ms single kernel invocation.
gws: 8192 31530 157650000 rounds/s 259.809ms per crypt_all()!
gws: 16384 58888 294440000 rounds/s 278.222ms per crypt_all()+
gws: 32768 60613 303065000 rounds/s 540.605ms per crypt_all()+
gws: 65536 61189 305945000 rounds/s 1.071s per crypt_all()
gws: 131072 61748 308740000 rounds/s 2.122s per crypt_all()+
gws: 262144 63347 316735000 rounds/s 4.138s per crypt_all()+
Local worksize (LWS) 256, global worksize (GWS) 262144
DONE
Speed for cost 1 (iteration count) of 5000
Raw: 55072 c/s real, 2621K c/s virtual

Oh, 63k during the auto-tuning even. Was that extrapolated from fewer
than 5000 iterations maybe? Is the instruction cache hit rate worse for
5000 iterations maybe? We may want to play with how we're splitting
the 5000 iterations across kernel invocations to hopefully regain this
speed for actual runs. ... or maybe we have it for actual runs already:

[***@super run]$ ./john -form=sha512crypt-opencl -dev=2 -v=5 -inc=alpha -min-len=8 -max-len=8 pw
[...]
Local worksize (LWS) 64, global worksize (GWS) 262144
[...]
0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
0g 0:00:00:16 0g/s 48907p/s 48907c/s 48907C/s bigetort..soyfryap
0g 0:00:00:17 0g/s 46044p/s 46044c/s 46044C/s bigetort..soyfryap
0g 0:00:00:18 0g/s 58060p/s 58060c/s 58060C/s soyfryam..calefryn
0g 0:00:00:19 0g/s 54985p/s 54985c/s 54985C/s soyfryam..calefryn
0g 0:00:00:20 0g/s 52245p/s 52245c/s 52245C/s soyfryam..calefryn
0g 0:00:00:21 0g/s 49742p/s 49742c/s 49742C/s soyfryam..calefryn
0g 0:00:00:23 0g/s 56839p/s 56839c/s 56839C/s calefrya..astefeto
0g 0:00:00:33 0g/s 55505p/s 55505c/s 55505C/s chumaist..metalito
0g 0:00:00:34 0g/s 53859p/s 53859c/s 53859C/s chumaist..metalito
0g 0:00:00:38 0g/s 55101p/s 55101c/s 55101C/s metality..singuapp
0g 0:00:00:39 0g/s 60386p/s 60386c/s 60386C/s singuazy..abbortom
0g 0:00:00:40 0g/s 58923p/s 58923c/s 58923C/s singuazy..abbortom
0g 0:00:00:41 0g/s 57473p/s 57473c/s 57473C/s singuazy..abbortom
0g 0:00:00:42 0g/s 56093p/s 56093c/s 56093C/s singuazy..abbortom
0g 0:00:00:43 0g/s 60864p/s 60864c/s 60864C/s abbortot..mcmyleow
abcdefgh (?)
1g 0:00:00:47 DONE (2015-10-10 07:24) 0.02114g/s 60976p/s 60976c/s 60976C/s abbortot..mcmyleow

61k average for a 47-second run, not bad. Maybe this includes a last
buffer that wasn't fully processed (4 or 5 seconds), though?

As a final experiment, I tried omitting the "- 32" from my uses of
amd_bitalign() that had it, because amd_bitalign() is defined (unlike
e.g. bitwise shifts in C) to take its shift count "& 31":

https://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops.txt
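
In that spec, the scalar form boils down to the following reference
behavior (a sketch paraphrasing the extension text; bitalign_ref is my
name for it, not a real function):

/* what amd_bitalign(src0, src1, src2) computes, per cl_amd_media_ops;
   note that the shift count is taken "& 31" */
inline uint bitalign_ref(uint src0, uint src1, uint src2)
{
	return (uint)((((ulong)src0 << 32) | src1) >> (src2 & 31));
}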

Specifically:

#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)/* - 32*/) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)/* - 32*/) << 32)))

Notice the two commented-out "- 32"'s. The resulting ISA size remained
the same, and so did the speeds, but the changed shift counts (some now
above 31) are actually encoded in the instructions. It's curious that
GCN even has room in the instructions to encode those redundant shift
counts, instead of having the "& 31" applied at compile time.

I think it's better to use the previous version, with the "- 32"'s
intact, as long as the rotate counts are compile-time constants, so this
subtraction is performed at compile time as well.
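
For readability, that recommended macro is equivalent to the following
inline-function form (just a sketch with my naming, assuming n is a
compile-time constant in the 1..63 range so that the branch and the
subtraction fold away):

inline ulong ror64_bitalign(ulong x, uint n)
{
	uint lo = (uint)x, hi = (uint)(x >> 32);

	if (n < 32)
		/* each output word takes bits from both input words */
		return amd_bitalign(hi, lo, n) |
		    ((ulong)amd_bitalign(lo, hi, n) << 32);
	/* rotating by 32 swaps the halves; then rotate by n - 32 */
	return amd_bitalign(lo, hi, n - 32) |
	    ((ulong)amd_bitalign(hi, lo, n - 32) << 32);
}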

Further speedup might be possible through switching the SWAP64() macro
from using rotate() to using the approach above. However, it is
currently used on both vector and scalar arguments, so my trivial
attempt at changing it like that failed. I guess we'd need to split it
into two macros: one for vectors and one for scalars. I am leaving this
for Claudio and/or magnum to experiment with.
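
As a starting point, a scalar-only byte swap built out of 32-bit halves
could look like this (just a sketch, my naming, untested and
unbenchmarked; the 32-bit rotates should map to cheap instructions on
GCN):

inline ulong swap64_scalar(ulong x)
{
	uint lo = (uint)x, hi = (uint)(x >> 32);
	/* classic two-rotate byte swap of each 32-bit half */
	uint slo = rotate(lo & 0x00FF00FFU, 24U) | rotate(lo & 0xFF00FF00U, 8U);
	uint shi = rotate(hi & 0x00FF00FFU, 24U) | rotate(hi & 0xFF00FF00U, 8U);
	/* the byte-swapped low half becomes the high half and vice versa */
	return ((ulong)slo << 32) | shi;
}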

Agnieszka - the 64-bit rotate optimizations discussed above are probably
also applicable to Argon2 and Lyra2.

Claudio - some additional info on the experiments above, in the same
order (first is your code, then my revisions as described above):

[***@super run]$ fgrep codeLenInByte ?/*.isa
1/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59044 bytes;
1/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60240 bytes;
1/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 163080 bytes;
2/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59044 bytes;
2/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60240 bytes;
2/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 162504 bytes;
3/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59164 bytes;
3/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60328 bytes;
3/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158704 bytes;
4/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59148 bytes;
4/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60316 bytes;
4/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158616 bytes;
5/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59148 bytes;
5/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60316 bytes;
5/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158616 bytes;
[***@super run]$ fgrep NumVgpr ?/*.isa
1/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 116;
1/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 116;
1/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 90;
2/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 115;
2/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 115;
2/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 91;
3/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 124;
3/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 119;
3/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 88;
4/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 119;
4/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 118;
4/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 87;
5/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 119;
5/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 118;
5/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 87;

There was a spike in NumVgprs for the mixed version (with bitalign only
used for half of the rotates), yet the speed was good.

ScratchSize remained the same as yours in all of these revisions.

1/_temp_0_Tahiti_kernel_crypt.isa:ScratchSize = 32 dwords/thread;
1/_temp_0_Tahiti_kernel_final.isa:ScratchSize = 32 dwords/thread;
1/_temp_0_Tahiti_kernel_prepare.isa:ScratchSize = 108 dwords/thread;

Alexander
Solar Designer
2015-10-10 05:35:43 UTC
Post by Solar Designer
#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))
I've just tried introducing the above revision of ror() into myrice's
xsha512_kernel.cl, which previously used rotate(), and speed went from:

[***@super run]$ AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps ./john -test=10 -form=xsha512-opencl -dev=2 -v=4
[...]
Local worksize (LWS) 128, global worksize (GWS) 8388608
DONE
Many salts: 278223K c/s real, 4973M c/s virtual
Only one salt: 56310K c/s real, 72389K c/s virtual

to:

[***@super run]$ AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps ./john -test=10 -form=xsha512-opencl -dev=2 -v=4
[...]
Local worksize (LWS) 128, global worksize (GWS) 8388608
DONE
Many salts: 345265K c/s real, 5082M c/s virtual
Only one salt: 58486K c/s real, 72315K c/s virtual

So we should expect 300M+ c/s for raw-sha512 as well (this is also seen
as e.g. "310395000 rounds/s" during auto-tuning for sha512crypt), with
on-GPU mask and hash comparisons, once we have those implemented
efficiently for this hash type (I think we don't yet?).

Similarly to sha512crypt, IL size went way up, and ISA size slightly down:

[***@super run]$ ls -l a b
a:
total 840
-rw-------. 1 solar solar 5733 Oct 10 08:13 _temp_0_Tahiti.cl
-rw-------. 1 solar solar 6838 Oct 10 08:13 _temp_0_Tahiti.i
-rw-------. 1 solar solar 163501 Oct 10 08:13 _temp_0_Tahiti.il
-rw-------. 1 solar solar 3445 Oct 10 08:13 _temp_0_Tahiti_kernel_cmp.il
-rw-------. 1 solar solar 5443 Oct 10 08:13 _temp_0_Tahiti_kernel_cmp.isa
-rw-------. 1 solar solar 159665 Oct 10 08:13 _temp_0_Tahiti_kernel_xsha512.il
-rw-------. 1 solar solar 506243 Oct 10 08:13 _temp_0_Tahiti_kernel_xsha512.isa

b:
total 984
-rw-------. 1 solar solar 6038 Oct 10 08:17 _temp_0_Tahiti.cl
-rw-------. 1 solar solar 11310 Oct 10 08:17 _temp_0_Tahiti.i
-rw-------. 1 solar solar 261999 Oct 10 08:17 _temp_0_Tahiti.il
-rw-------. 1 solar solar 3445 Oct 10 08:17 _temp_0_Tahiti_kernel_cmp.il
-rw-------. 1 solar solar 5443 Oct 10 08:17 _temp_0_Tahiti_kernel_cmp.isa
-rw-------. 1 solar solar 258163 Oct 10 08:17 _temp_0_Tahiti_kernel_xsha512.il
-rw-------. 1 solar solar 450253 Oct 10 08:17 _temp_0_Tahiti_kernel_xsha512.isa

[***@super run]$ fgrep codeLenInByte [ab]/*.isa
a/_temp_0_Tahiti_kernel_cmp.isa:codeLenInByte = 172 bytes;
a/_temp_0_Tahiti_kernel_xsha512.isa:codeLenInByte = 30432 bytes;
b/_temp_0_Tahiti_kernel_cmp.isa:codeLenInByte = 172 bytes;
b/_temp_0_Tahiti_kernel_xsha512.isa:codeLenInByte = 30140 bytes;
[***@super run]$ fgrep NumVgpr [ab]/*.isa
a/_temp_0_Tahiti_kernel_cmp.isa:NumVgprs = 3;
a/_temp_0_Tahiti_kernel_xsha512.isa:NumVgprs = 98;
b/_temp_0_Tahiti_kernel_cmp.isa:NumVgprs = 3;
b/_temp_0_Tahiti_kernel_xsha512.isa:NumVgprs = 106;

On a related note, sha512crypt-opencl is now almost the same speed as
sha256crypt-opencl on GCN, meaning that there must be lots of room for
improvement in the latter. sha512crypt-opencl:

gws: 262144 62827 314135000 rounds/s 4.172s per crypt_all()+
Local worksize (LWS) 256, global worksize (GWS) 262144
DONE
Speed for cost 1 (iteration count) of 5000
Raw: 55072 c/s real, 2383K c/s virtual

sha256crypt-opencl:

gws: 1048576 64687 323435000 rounds/s 16.209s per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1048576
DONE
Speed for cost 1 (iteration count) of 5000
Raw: 68089 c/s real, 5518K c/s virtual

Curiously, there's little speed difference between these two during
auto-tuning (e.g. 62827 vs. 64687 on the final and best lines here),
but more of a difference in the final benchmark results. Also, an
optimal GWS of 1048576 is very high for a slow hash.

Alexander
magnum
2015-10-12 17:41:46 UTC
Post by Solar Designer
#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))
Thanks, this went into
https://github.com/magnumripper/JohnTheRipper/issues/1819 and most or
all applicable formats now use the above.

magnum
Pavel Semjanov
2015-10-15 20:25:11 UTC
Post by magnum
Post by Solar Designer
#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))
Thanks, this went into
https://github.com/magnumripper/JohnTheRipper/issues/1819 and most or
all applicable formats now use the above.
magnum
It's not working for small numbers and a rotate by 8, like ror(0x220, 8).
I guess it's a bitalign error. The only mention I found is:
https://community.amd.com/thread/158878
--
SY / C4acT/\uBo Pavel Semjanov
_ _ _ http://www.semjanov.com
| | |-| |_|_| |-|
magnum
2015-10-15 21:02:56 UTC
Post by Pavel Semjanov
Post by magnum
Post by Solar Designer
#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))
Thanks, this went into
https://github.com/magnumripper/JohnTheRipper/issues/1819 and most or
all applicable formats now use the above.
It's not working for small numbers and a rotate by 8, like ror(0x220, 8).
https://community.amd.com/thread/158878
What device and driver version(s) did you see that with? I recall Atom
told me he'd seen rotate() fail with numbers divisible by 8. I'm pretty
sure he meant the OpenCL function but it could be the same underlying
bug. That was in June last year so maybe Cat 14.4 or something. I never
saw that very bug surface though.

sigma0 has a rotate by 8 but I see no problems with 15.7. I just tested
bull's 13.4 and it works fine (although it's a *LOT* slower than 15.7
with Myrice's formats and it can't even build any of Claudio's SHA-512
formats).

magnum
Solar Designer
2015-10-16 07:17:27 UTC
Post by Pavel Semjanov
It's not working for small numbers and a rotate by 8, like ror(0x220, 8).
https://community.amd.com/thread/158878
Ouch. When you say "on small numbers", do you mean only compile-time
constants, or also such numbers computed at runtime?
Post by Pavel Semjanov
What device and driver version(s) did you see that with? I recall Atom
told me he'd seen rotate() fail with numbers divisible by 8. I'm pretty
sure he meant the OpenCL function but it could be the same underlying
bug. That was in June last year so maybe Cat 14.4 or something. I never
saw that very bug surface though.
The closest I had heard of is this comment by Alain:

http://www.openwall.com/lists/john-dev/2015/08/23/17

"The 64 bit rotation is done manually, not using OpenCL rotate.
am_bitalign provides a very small speedup, but note that when used with
multiples of 8 it generate errors, at least when I test it, so we need
to use amd_bytealign then."

I never ran into these issues, even though the code I got into our
md5crypt-opencl actually uses amd_bitalign() only with multiples of 8
(since it does that for the unaligned writes). I also tried
amd_bytealign(), which was similar speed, but I chose to stay with
amd_bitalign() since it's the same as NVIDIA's funnel shift and so is
easier to substitute with that in our macros. OTOH, it wouldn't be hard
to multiply the constants by 8 in a macro if necessary.
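
For the record, if amd_bytealign() really does avoid the problem, a
byte-granular variant of the ror() above might look like this (just a
sketch, my naming, only valid when n is a multiple of 8, since
amd_bytealign() takes its count in bytes and masks it with 3):

#define ror64_by8(x, n) ((n) < 32 ? \
    (amd_bytealign((uint)((x) >> 32), (uint)(x), (uint)(n) / 8) | \
     ((ulong)amd_bytealign((uint)(x), (uint)((x) >> 32), (uint)(n) / 8) << 32)) : \
    (amd_bytealign((uint)(x), (uint)((x) >> 32), ((uint)(n) - 32) / 8) | \
     ((ulong)amd_bytealign((uint)((x) >> 32), (uint)(x), ((uint)(n) - 32) / 8) << 32)))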

Alexander
Pavel Semjanov
2015-10-16 08:08:55 UTC
Post by Solar Designer
Post by Pavel Semjanov
It's not working for small numbers and a rotate by 8, like ror(0x220, 8).
https://community.amd.com/thread/158878
Ouch. When you say "on small numbers", do you mean only compile-time
constants, or also such numbers computed at runtime?
The compile-time constant in my code is assigned to a variable, like:

#define sigma0_512(x) (ROR((x),1) ^ ROR((x),8) ^ ((x)>>7))
T1 = X15=U64(0x220);
...
s0 = sigma0_512(X15);
...

(Yes, it's SHA-512 ;)
Post by Solar Designer
Post by Pavel Semjanov
What device and driver version(s) did you see that with? I recall Atom
told me he'd seen rotate() fail with numbers divisible by 8. I'm pretty
sure he meant the OpenCL function but it could be the same underlying
bug. That was in June last year so maybe Cat 14.4 or something. I never
saw that very bug surface though.
I guess I had 15.7. I've just installed 15.7.1 and the bug still
exists. The GPU is R9 280x. And yes, it's Windows.
If you can't reproduce the bug, I'll send the full code. Anyway, I guess
it would be safer to define ROR8 separately.
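
Something along these lines, reusing the sigma0_512() definition quoted
above (just a sketch; whether rotate() itself is reliable here would
still need testing, given the earlier report about rotate() and
multiples of 8):

/* rotate right by 8 without amd_bitalign(): rotate(x, 56) == ror64(x, 8) */
#define ROR8(x) rotate((x), 56UL)
#define sigma0_512(x) (ROR((x), 1) ^ ROR8(x) ^ ((x) >> 7))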
--
SY / C4acT/\uBo Pavel Semjanov
_ _ _ http://www.semjanov.com
| | |-| |_|_| |-|
Solar Designer
2015-10-19 16:52:37 UTC
Post by Pavel Semjanov
#define sigma0_512(x) (ROR((x),1) ^ ROR((x),8) ^ ((x)>>7))
T1 = X15=U64(0x220);
...
s0 = sigma0_512(X15);
...
(Yes, it's SHA-512 ;)
Maybe I've just managed to reproduce this. It turns out that our
pbkdf2-hmac-sha512-opencl was failing on AMD GPUs (but working fine on
NVIDIA). I didn't notice when playing with ror() before because I was
focusing on sha512crypt-opencl (which worked fine on all of the GPUs).

Changing this line in opencl_sha2.h:

#define sigma0_64(x) ((ror64(x,1)) ^ (ror64(x,8)) ^ (x >> 7))

to:

#define sigma0_64(x) ((ror64(x,1)) ^ (rotate(x,56UL)) ^ (x >> 7))

makes the problem go away for Tahiti (but the speed is poor, at about
1/4 of Titan X, unlike for sha512crypt where these GPUs are similar).
Juniper is still failing (could be a different problem; I haven't looked
into that).

The fact that this matters for pbkdf2-hmac-sha512-opencl suggests that
in that format we end up doing some computation on constants, which we
could avoid by modifying the source code. Maybe this is also part of
the reason why it's unexpectedly slow on Tahiti (in case not all of the
computation on constants gets done at compile time).
Post by Pavel Semjanov
I guess I had 15.7. I've just installed 15.7.1 and the bug still
exists. The GPU is R9 280x. And yes, it's Windows.
If you can't reproduce the bug, I'll send the full code. Anyway, I guess
it would be safer to define ROR8 separately.
If the above was the same problem, then it looks like we can reproduce
it with 15.7 on Linux as well.

magnum - would you take this problem from here?

Alexander
magnum
2015-10-19 17:28:31 UTC
Post by Solar Designer
Post by Pavel Semjanov
#define sigma0_512(x) (ROR((x),1) ^ ROR((x),8) ^ ((x)>>7))
T1 = X15=U64(0x220);
...
s0 = sigma0_512(X15);
...
(Yes, it's SHA-512 ;)
Maybe I've just managed to reproduce this. It turns out that our
pbkdf2-hmac-sha512-opencl was failing on AMD GPUs (but working fine on
NVIDIA). I didn't notice when playing with ror() before because I was
focusing on sha512crypt-opencl (which worked fine on all of the GPUs).
#define sigma0_64(x) ((ror64(x,1)) ^ (ror64(x,8)) ^ (x >> 7))
#define sigma0_64(x) ((ror64(x,1)) ^ (rotate(x,56UL)) ^ (x >> 7))
makes the problem go away for Tahiti (but the speed is poor, at about
1/4 of Titan X, unlike for sha512crypt where these GPUs are similar).
Juniper is still failing (could be a different problem; I haven't looked
into that).
(...)
magnum - would you take this problem from here?
I tested this format specifically. Maybe that was on 15.9? I'll open an
issue and investigate.

magnum
Solar Designer
2015-10-19 19:52:20 UTC
Post by magnum
I tested this format specifically. Maybe that was on 15.9?
No, pbkdf2-hmac-sha512-opencl was failing for me with 15.7 today (before
that maybe-fix I mentioned), on super. BTW, 1.8.0-jumbo-1's version of
this format also fails on super now.
Post by magnum
I'll open an issue and investigate.
So it's this issue now:

https://github.com/magnumripper/JohnTheRipper/issues/1840

Thanks!

Alexander
magnum
2015-10-21 00:15:11 UTC
Post by Solar Designer
Post by magnum
I tested this format specifically. Maybe that was on 15.9?
No, pbkdf2-hmac-sha512-opencl was failing for me with 15.7 today (before
that maybe-fix I mentioned), on super. BTW, 1.8.0-jumbo-1's version of
this format also fails on super now.
Yes, I meant that perhaps my (alleged) tests were on 15.9, but I had a
chance to test today and it actually failed on 15.9 too. Anyway, a
workaround is in place now.

I now also found a workaround for oldoffice (#1497) so currently all
formats pass on Tahiti again.

Oh, except that lotus5-opencl intermittently fails or hangs (it does on
nvidia too), and only with -test=0. That's #1726 and I simply cannot
find the problem.

On another note, oldoffice needs GPU-mask (as do rakp and ntlmv2), but
I have yet to understand how to add it. Claudio managed to figure out
Sayantan's code without help; that was impressive.

magnum
