Discussion:
PHC: yescrypt on GPU
(too old to reply)
Agnieszka Bielec
2015-07-16 11:39:57 UTC
Permalink
Raw Message
hi Solar

I wanted to make split kernel in yescrypt to make that kernel will be
execuing less that 1 second like I did in pomelo but it seems that
this could took me a lot of time.
I can just set in auto-tune option that kernel can be executed much
longer than 1 s.
and I was talking with Lukas and he agreed with me and told that this
is not so important because anybody won't be using GPU to crack
yescrypt.

what you think about that?

btw.

http://www.openwall.com/lists/john-dev/2015/04/17/18

here you told that you will provide guidance, so can I get this guidance?
Solar Designer
2015-07-16 14:28:54 UTC
Permalink
Raw Message
Post by Agnieszka Bielec
I wanted to make split kernel in yescrypt to make that kernel will be
execuing less that 1 second like I did in pomelo but it seems that
this could took me a lot of time.
I can just set in auto-tune option that kernel can be executed much
longer than 1 s.
and I was talking with Lukas and he agreed with me and told that this
is not so important because anybody won't be using GPU to crack
yescrypt.
what you think about that?
I think it's OK to postpone this work, although having a split kernel
would be nice from multiple perspectives.
Post by Agnieszka Bielec
btw.
http://www.openwall.com/lists/john-dev/2015/04/17/18
here you told that you will provide guidance, so can I get this guidance?
Unfortunately, not right now. Hopefully, next week. I am traveling
with my laptop now and attending an event. Got quite a backlog of work
to do and emails to reply to.

I am sorry about this unfortunate timing.

Alexander
Agnieszka Bielec
2015-07-19 12:20:10 UTC
Permalink
Raw Message
I optimized yescrypt-opencl (960m) by copying one table to private memory
before(with some optimizations):

***@none ~/Desktop/r/run $ GWS=1024 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 247 c/s real, 247 c/s virtual
Only one salt: 247 c/s real, 247 c/s virtual

now:

***@none ~/Desktop/r/src $ m;r;GWS=1024 ./john --test --format=yescrypt-opencl

Make process completed.
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 409 c/s real, 407 c/s virtual
Only one salt: 409 c/s real, 407 c/s virtual

but if I want to run benchmarks for GWS=256,512 and 1024 I need to set
a quarter of needed memory in autotune
(I'm getting CL_MEM_OBJECT_ALLOCATION_FAILURE for GWS=2048)

***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 159 c/s 159 rounds/s 1.608s per crypt_all()!
gws: 512 161 c/s 161 rounds/s 3.176s per crypt_all()+
gws: 1024 145 c/s 145 rounds/s 7.029s per crypt_all()
Local worksize (LWS) 64, global worksize (GWS) 512
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 355 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 358 c/s virtual

If I set all of needed memory:

***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 158 c/s 158 rounds/s 1.612s per crypt_all()!
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 230 c/s real, 230 c/s virtual
Only one salt: 237 c/s real, 237 c/s virtual


and the other thing is that benchamrks estimate the speed inproperly

***@none ~/Desktop/r/run $ GWS=1024 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 407 c/s real, 407 c/s virtual
Only one salt: 409 c/s real, 409 c/s virtual

***@none ~/Desktop/r/run $ GWS=512 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 358 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 360 c/s virtual


***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 159 c/s 159 rounds/s 1.608s per crypt_all()!
gws: 512 161 c/s 161 rounds/s 3.176s per crypt_all()+
gws: 1024 145 c/s 145 rounds/s 7.029s per crypt_all()
Local worksize (LWS) 64, global worksize (GWS) 512
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 355 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 358 c/s virtual
Agnieszka Bielec
2015-07-21 19:00:51 UTC
Permalink
Raw Message
hi, I discovered right now that benchmarks for yescrypt should be done
using every password different like in pomelo (I don't know why, there
is no coalescing), I could testing it in cracking mode but I wasn't
doing this, and hence is the strange thing with estimating GWS

you should forget about previous benchmarks, sorry that I didn't
discovered this earlier

here is real test (using my modified bench.c)

***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl
--skip-self-testBenchmarking: yescrypt-opencl [Salsa20/8 OpenCL
(inefficient, development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost
4 (t) of 0, cost 5 (g) of 0
Many salts: 158 c/s real, 158 c/s virtual
Only one salt: 157 c/s real, 157 c/s virtual


and cracking run:

***@none ~/Desktop/r/run $ ./john --incremental=digits --min-length=8
--max-length=8 yes --format=yescrypt-opencl
Using default input encoding: UTF-8
Loaded 1 password hash (yescrypt-opencl [Salsa20/8 OpenCL
(inefficient, development use only)])
Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:42 0.01% (ETA: 2015-07-29 04:12) 0g/s 155.8p/s 155.8c/s
155.8C/s GPU:72°C util:99% 19910252..11122030
0g 0:00:00:46 0.01% (ETA: 2015-07-29 07:11) 0g/s 155.7p/s 155.7c/s
155.7C/s GPU:72°C util:100% 11122031..20112549
Agnieszka Bielec
2015-07-22 02:06:12 UTC
Permalink
Raw Message
Post by Agnieszka Bielec
hi, I discovered right now that benchmarks for yescrypt should be done
using every password different like in pomelo (I don't know why, there
is no coalescing), I could testing it in cracking mode but I wasn't
doing this, and hence is the strange thing with estimating GWS
you should forget about previous benchmarks, sorry that I didn't
discovered this earlier
I reverted back my changes in kernel what I did last week and I didn't
gained any speed at all
Lukas Odzioba
2015-07-22 11:42:16 UTC
Permalink
Raw Message
Post by Agnieszka Bielec
I reverted back my changes in kernel what I did last week and I didn't
gained any speed at all
Do you mean "copying one table to private memory" ?
Agnieszka Bielec
2015-07-22 12:06:37 UTC
Permalink
Raw Message
Post by Lukas Odzioba
Post by Agnieszka Bielec
I reverted back my changes in kernel what I did last week and I didn't
gained any speed at all
Do you mean "copying one table to private memory" ?
yes
Agnieszka Bielec
2015-07-22 19:11:48 UTC
Permalink
Raw Message
Post by Agnieszka Bielec
Post by Agnieszka Bielec
hi, I discovered right now that benchmarks for yescrypt should be done
using every password different like in pomelo (I don't know why, there
is no coalescing), I could testing it in cracking mode but I wasn't
doing this, and hence is the strange thing with estimating GWS
you should forget about previous benchmarks, sorry that I didn't
discovered this earlier
I reverted back my changes in kernel what I did last week and I didn't
gained any speed at all
sorry for making a chaos but now turned out that when I was testing
speed of 2 versions of john with this optimization and without speed
was the same but I didn't remove this optimization properly I have in
the code #define SP_COPY 1 but I'm using #ifdef SP_COPY - I forgot
about that this looks like here and I changed only 1 to 0 instead
remove whole #define and even it's slower with this optimization
Agnieszka Bielec
2015-07-22 22:36:55 UTC
Permalink
Raw Message
has anyone idea why copying parts of memory from __global to __private
makes my code slower when there are different passwords and faster
where all passwords are the same? I did in lyra2 something very
similar, maybe my code is too big and I have to do split kernels?
Agnieszka Bielec
2015-07-22 22:44:41 UTC
Permalink
Raw Message
I did in lyra2 something very similar
and it works there (I don't want misunderstanding)
magnum
2015-07-22 23:33:26 UTC
Permalink
Raw Message
Post by Agnieszka Bielec
has anyone idea why copying parts of memory from __global to __private
makes my code slower when there are different passwords and faster
where all passwords are the same? I did in lyra2 something very
similar, maybe my code is too big and I have to do split kernels?
Are there differences in length distribution in the two cases? If not,
Maybe in the slow case they end up spilling to local memory due to
harder register pressure.

magnum
Solar Designer
2015-07-23 02:00:03 UTC
Permalink
Raw Message
Post by magnum
Post by Agnieszka Bielec
has anyone idea why copying parts of memory from __global to __private
makes my code slower when there are different passwords and faster
where all passwords are the same?
Why faster for same passwords:

This is puzzling, but my guess (which could well be wrong) is that the
remaining global memory accesses have better locality of reference
(resulting in better cache hit rate) and/or coalescing potential than
all of them did before you moved some to private memory. In other
words, you moved the "bad" ones to private and kept the "good" ones in
global. But they are only "good" when the passwords are the same (and I
guess the salts as well, or there are few different ones), so this is of
no practical use.

Why slower for different passwords:

I guess your LWS or/and GWS became lower.
Post by magnum
Post by Agnieszka Bielec
I did in lyra2 something very
similar, maybe my code is too big and I have to do split kernels?
Split kernel may be good anyway, but this is most likely unrelated to
this specific occasion.
Post by magnum
Are there differences in length distribution in the two cases?
This should be irrelevant. The PHC finalists process the plaintext
password into a hash early on, and do not use the plaintext password
frequently. They are not like e.g. md5crypt in this respect.
Post by magnum
If not,
Maybe in the slow case they end up spilling to local memory due to
harder register pressure.
Maybe. This is a possibility with any changes to a kernel.

Alexander
Agnieszka Bielec
2015-07-23 22:45:26 UTC
Permalink
Raw Message
Post by Solar Designer
Post by magnum
Post by Agnieszka Bielec
has anyone idea why copying parts of memory from __global to __private
makes my code slower when there are different passwords and faster
where all passwords are the same?
This is puzzling, but my guess (which could well be wrong) is that the
remaining global memory accesses have better locality of reference
(resulting in better cache hit rate) and/or coalescing potential than
all of them did before you moved some to private memory. In other
words, you moved the "bad" ones to private and kept the "good" ones in
global. But they are only "good" when the passwords are the same (and I
guess the salts as well, or there are few different ones), so this is of
no practical use.
I guess your LWS or/and GWS became lower.
Post by magnum
Post by Agnieszka Bielec
I did in lyra2 something very
similar, maybe my code is too big and I have to do split kernels?
Split kernel may be good anyway, but this is most likely unrelated to
this specific occasion.
Post by magnum
Are there differences in length distribution in the two cases?
This should be irrelevant. The PHC finalists process the plaintext
password into a hash early on, and do not use the plaintext password
frequently. They are not like e.g. md5crypt in this respect.
Post by magnum
If not,
Maybe in the slow case they end up spilling to local memory due to
harder register pressure.
Maybe. This is a possibility with any changes to a kernel.
I have message during compiling that: 0 bytes spill stores, 0 bytes spill
Agnieszka Bielec
2015-07-24 22:08:29 UTC
Permalink
Raw Message
Post by Solar Designer
Post by magnum
Post by Agnieszka Bielec
has anyone idea why copying parts of memory from __global to __private
makes my code slower when there are different passwords and faster
where all passwords are the same?
This is puzzling, but my guess (which could well be wrong) is that the
remaining global memory accesses have better locality of reference
(resulting in better cache hit rate) and/or coalescing potential than
all of them did before you moved some to private memory. In other
words, you moved the "bad" ones to private and kept the "good" ones in
global. But they are only "good" when the passwords are the same (and I
guess the salts as well, or there are few different ones), so this is of
no practical use.
I guess your LWS or/and GWS became lower.
Post by magnum
Post by Agnieszka Bielec
I did in lyra2 something very
similar, maybe my code is too big and I have to do split kernels?
Split kernel may be good anyway, but this is most likely unrelated to
this specific occasion.
Post by magnum
Are there differences in length distribution in the two cases?
This should be irrelevant. The PHC finalists process the plaintext
password into a hash early on, and do not use the plaintext password
frequently. They are not like e.g. md5crypt in this respect.
Post by magnum
If not,
Maybe in the slow case they end up spilling to local memory due to
harder register pressure.
Maybe. This is a possibility with any changes to a kernel.
I had in my code

for()
{
copy to private
some operiations on private
copy to global
}

i changed this code to

memset(this private array,0,size of private array)//because I noticed
when I was working on parallel that kernel can slow down after using
uninitialized array
for()
{
some operations on private
}

and runned with --skip-self-test and speed was the same, even without
this memset. this is big array 8KB but I have in another place copying
64 B and this also decreases speed even when copying 8KB is turned off
Solar Designer
2015-07-25 12:22:04 UTC
Permalink
Raw Message
Post by Agnieszka Bielec
I had in my code
for()
{
copy to private
some operiations on private
copy to global
}
i changed this code to
memset(this private array,0,size of private array)//because I noticed
when I was working on parallel that kernel can slow down after using
uninitialized array
for()
{
some operations on private
}
and runned with --skip-self-test and speed was the same, even without
this memset.
OK. Why would you incur any accesses to an uninitialized array, though?

yescrypt fully initializes its S-boxes with non-zero data before the
very first invocation of pwxform, which uses them.
Post by Agnieszka Bielec
this is big array 8KB but I have in another place copying
64 B and this also decreases speed even when copying 8KB is turned off
Moving a 64 bytes array from global to private decreases speed? That's
surprising if so. Is this 64 bytes array frequently accessed? Which
one is it? The current sub-block buffer in pwxform? You should keep it
in private, I think.

The S-boxes should likely be in local on AMD and in private on NVIDIA,
although you do in fact need to test with them in global as well - in
fact, ideally you'd have this tri-state choice auto-tuned at runtime,
since the optimal one will likely vary across GPUs (even similar ones).

yescrypt pwxform S-boxes are similar to bcrypt's, but are twice larger
(8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit rather than
bcrypt's 32-bit), and there are only 2 of them (bcrypt has 4), which
reduces parallelism, but OTOH 4 such pwxform lanes are computed in
parallel, which increases parallelism. This is with yescrypt's current
default pwxform settings. We previously found that it's more optimal to
keep bcrypt's S-boxes in local or private (depending on GPU) rather than
in global, but the differences of pwxform (with particular settings) vs.
bcrypt might change this on some GPUs. Also, yescrypt's other uses of
global memory (for its large V array) might make use of global memory for
the S-boxes as well more optimal, since those other accesses to global
memory might limit the overall latency reduction possible with moving the
S-boxes to local or private memory, thereby skewing the balance towards
keeping them in global.

Alexander
Agnieszka Bielec
2015-07-25 16:17:06 UTC
Permalink
Raw Message
Post by Solar Designer
Post by Agnieszka Bielec
I had in my code
for()
{
copy to private
some operiations on private
copy to global
}
i changed this code to
memset(this private array,0,size of private array)//because I noticed
when I was working on parallel that kernel can slow down after using
uninitialized array
for()
{
some operations on private
}
and runned with --skip-self-test and speed was the same, even without
this memset.
OK. Why would you incur any accesses to an uninitialized array, though?
yescrypt fully initializes its S-boxes with non-zero data before the
very first invocation of pwxform, which uses them.
this was only experiment, I wanted to know the speed without copying
data from global memory
Post by Solar Designer
Post by Agnieszka Bielec
this is big array 8KB but I have in another place copying
64 B and this also decreases speed even when copying 8KB is turned off
Moving a 64 bytes array from global to private decreases speed? That's
surprising if so. Is this 64 bytes array frequently accessed? Which
one is it? The current sub-block buffer in pwxform? You should keep it
in private, I think.
in pwxform, 64 bytes - only once they are used
Post by Solar Designer
The S-boxes should likely be in local on AMD and in private on NVIDIA,
although you do in fact need to test with them in global as well - in
fact, ideally you'd have this tri-state choice auto-tuned at runtime,
since the optimal one will likely vary across GPUs (even similar ones).
yescrypt pwxform S-boxes are similar to bcrypt's, but are twice larger
(8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit rather than
bcrypt's 32-bit), and there are only 2 of them (bcrypt has 4), which
reduces parallelism, but OTOH 4 such pwxform lanes are computed in
parallel, which increases parallelism. This is with yescrypt's current
default pwxform settings. We previously found that it's more optimal to
keep bcrypt's S-boxes in local or private (depending on GPU) rather than
in global, but the differences of pwxform (with particular settings) vs.
bcrypt might change this on some GPUs. Also, yescrypt's other uses of
global memory (for its large V array) might make use of global memory for
the S-boxes as well more optimal, since those other accesses to global
memory might limit the overall latency reduction possible with moving the
S-boxes to local or private memory, thereby skewing the balance towards
keeping them in global.
today I was removing getting smaller and smaller parts of the code to
track down the slow part
and this is indeed in pwxform, and when I have
x0 += p0[0];
x0 ^= p1[0];
[...]
x1 += p0[1];
x1 ^= p1[1];

commented out speed is the same with copying and without
(but I have another version of pwxform using vectors now)
Solar Designer
2015-07-25 17:00:13 UTC
Permalink
Raw Message
Post by Agnieszka Bielec
today I was removing getting smaller and smaller parts of the code to
track down the slow part
and this is indeed in pwxform, and when I have
x0 += p0[0];
x0 ^= p1[0];
[...]
x1 += p0[1];
x1 ^= p1[1];
commented out speed is the same with copying and without
This makes sense. In fact, the compiler might be optimizing out the
copying when the copied data is not used at all.
Post by Agnieszka Bielec
(but I have another version of pwxform using vectors now)
Of course, you should be using 128-bit vectors (assuming the current
pwxform settings).

Alexander
Solar Designer
2015-10-06 00:16:09 UTC
Permalink
Raw Message
Agnieszka,
Post by Agnieszka Bielec
Post by Solar Designer
Post by Agnieszka Bielec
this is big array 8KB but I have in another place copying
64 B and this also decreases speed even when copying 8KB is turned off
Moving a 64 bytes array from global to private decreases speed? That's
surprising if so. Is this 64 bytes array frequently accessed? Which
one is it? The current sub-block buffer in pwxform? You should keep it
in private, I think.
in pwxform, 64 bytes - only once they are used
Post by Solar Designer
The S-boxes should likely be in local on AMD and in private on NVIDIA,
although you do in fact need to test with them in global as well - in
fact, ideally you'd have this tri-state choice auto-tuned at runtime,
since the optimal one will likely vary across GPUs (even similar ones).
yescrypt pwxform S-boxes are similar to bcrypt's, but are twice larger
(8 KB rather than bcrypt's 4 KB), use wider lookups (128-bit rather than
bcrypt's 32-bit), and there are only 2 of them (bcrypt has 4), which
reduces parallelism, but OTOH 4 such pwxform lanes are computed in
parallel, which increases parallelism. This is with yescrypt's current
default pwxform settings. We previously found that it's more optimal to
keep bcrypt's S-boxes in local or private (depending on GPU) rather than
in global, but the differences of pwxform (with particular settings) vs.
bcrypt might change this on some GPUs. Also, yescrypt's other uses of
global memory (for its large V array) might make use of global memory for
the S-boxes as well more optimal, since those other accesses to global
memory might limit the overall latency reduction possible with moving the
S-boxes to local or private memory, thereby skewing the balance towards
keeping them in global.
today I was removing getting smaller and smaller parts of the code to
track down the slow part
and this is indeed in pwxform, and when I have
x0 += p0[0];
x0 ^= p1[0];
[...]
x1 += p0[1];
x1 ^= p1[1];
commented out speed is the same with copying and without
(but I have another version of pwxform using vectors now)
Looking at your code now, I see that you indeed store the 64-byte X
array in pwxform in global memory. This is weird.

Then you have several versions of the functions, with one of two sets
chosen with "#define SP_COPY" (which is commented out by default).
With SP_COPY, your pwxform works on S-boxes in private memory. Without
SP_COPY, it works on S-boxes in global memory.

At least for AMD, we need to try a version with S-boxes in local memory,
which you don't appear to have ever tried. (For bcrypt's 4 KB S-boxes,
we got 2x+ better speeds on GCN with them in local than in global memory.
They're too large for private on GCN, where it's 1 KB per work-item max.)

There may also be correlation between where you choose to keep the
S-boxes and where it's optimal to keep other buffers. With the S-boxes
in global memory anyway, it is more likely optimal to keep other things
in global memory as well, to maximize the occupation rather than
minimize latency of every one computation (which you can't do much about
when the S-box lookups incur high latency anyway). So you need to try
multiple combinations to properly test the different approaches. You
can't test S-boxes in global vs. local memory on its own, without making
changes to where other stuff is stored, and conclusively state that one
is faster than the other - you need to perform more tests, with other
changes, in order to draw such conclusions.

As you're probably aware from the PHC list, I am working on finalizing
yescrypt now, and I need to choose between 5 (or maybe 6) pwxform rounds
with 12 KB S-boxes, and 4 (or maybe 5) pwxform rounds with 12 KB L1 and
128 KB (or maybe 64 KB) L2 S-boxes:

http://thread.gmane.org/gmane.comp.security.phc/3244/focus=3495

If all GPU attack implementations would happen to find it optimal to
keep the S-boxes in global memory anyway, then +1 round (e.g. 5 total)
with 12 KB is preferable over the 12 KB + 128 KB version with -1 round
(e.g. 4 total). (The 12 KB + 128 KB version might work better against
some FPGAs and ASICs, though. As well as future GPUs with more local
memory per computation power, if such appear.) However, if even at
12 KB local memory is still more optimal than global for the S-boxes on
at least some GPUs, then that's a reason for me to go for the 12 KB +
128 KB option (even if it costs removing a round). (In this paragraph,
I am using "local memory" to refer to any on-die memory.)

A secondary question is how often to write to the S-boxes. I can do it
just once per sub-block, or after every pwxform round (except for the
last round, which is followed by a write anyway), or twice per sub-block
if the two-level S-boxes are used.

So having better results, for better optimized code, would be helpful
right now. And you'll need to revise it to include the 12 KB version,
instead of the current 8 KB. I may provide it to you soon if you're
ready to work on this. Please let me know.

Then there's also this weird trick I just posted about to the PHC list:

http://thread.gmane.org/gmane.comp.security.phc/2938/focus=3496

where a BSTY miner implementation author chose to split the S-box
lookups across multiple work-items. He chose 16, but I think 4 would be
optimal (or maybe 8, if additionally splitting loads of the two S-boxes
across two sets of 4 work-items). This might in fact speed them up, so
might be worth trying (as an extra option, on top of 3 main ones).
He reported 372 h/s at 2 MB (N=2048 r=8) on HD 7750. Scaling to 7970,
this could be up to 372*2048/512*1000/800 = 1860, but probably a lot
less than that in practice (7750's narrower memory bus might be a better
fit). Your reported best result is 914 for 1.5 MB (r=6), so seemingly
much slower than his:

http://www.openwall.com/lists/john-dev/2015/07/27/6

We have a 7750 (a version with DDR3 memory, though) in "well", so you
may try your code on it and compare against the 372 figure directly.
And like I wrote, his byte-granular loads are likely not optimal, with
uint likely more optimal.

Thanks,

Alexander
Solar Designer
2015-10-06 00:19:11 UTC
Permalink
Raw Message
Post by Solar Designer
http://thread.gmane.org/gmane.comp.security.phc/2938/focus=3496
where a BSTY miner implementation author chose to split the S-box
lookups across multiple work-items. He chose 16, but I think 4 would be
optimal (or maybe 8, if additionally splitting loads of the two S-boxes
across two sets of 4 work-items). This might in fact speed them up, so
might be worth trying (as an extra option, on top of 3 main ones).
He reported 372 h/s at 2 MB (N=2048 r=8) on HD 7750. Scaling to 7970,
https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7xxx_Series
Post by Solar Designer
this could be up to 372*2048/512*1000/800 = 1860, but probably a lot
less than that in practice (7750's narrower memory bus might be a better
fit). Your reported best result is 914 for 1.5 MB (r=6), so seemingly
http://www.openwall.com/lists/john-dev/2015/07/27/6
We have a 7750 (a version with DDR3 memory, though) in "well", so you
may try your code on it and compare against the 372 figure directly.
And like I wrote, his byte-granular loads are likely not optimal, with
uint likely more optimal.
Alexander
Agnieszka Bielec
2015-10-14 22:48:23 UTC
Permalink
Raw Message
Post by Solar Designer
So having better results, for better optimized code, would be helpful
right now. And you'll need to revise it to include the 12 KB version,
instead of the current 8 KB. I may provide it to you soon if you're
ready to work on this. Please let me know.
hi, it's still actual?
I'm sorry, lately I was spending almost all my free time to preparing
for CTF's but now I see that I should change how I am spending my free
time, now I will be working on jtr harder
but what have the highest priority now?
still I have a bug in lyra2 ( and "FAILED (split() is only casing
sometimes (#4))" after "git pull upstream bleeding-jumbo")
I just read this email now
Solar Designer
2015-10-15 09:52:17 UTC
Permalink
Raw Message
Hi Agnieszka,
Post by Agnieszka Bielec
Post by Solar Designer
So having better results, for better optimized code, would be helpful
right now. And you'll need to revise it to include the 12 KB version,
instead of the current 8 KB. I may provide it to you soon if you're
ready to work on this. Please let me know.
hi, it's still actual?
Partially. I have since made the determination on which approach to go
with ("the 12 KB version" and not a version with separate targeting of
L2 cache).

However, you will in fact need to update jumbo's yescrypt support to use
that version.
Post by Agnieszka Bielec
I'm sorry, lately I was spending almost all my free time to preparing
for CTF's but now I see that I should change how I am spending my free
time, now I will be working on jtr harder
but what have the highest priority now?
still I have a bug in lyra2 ( and "FAILED (split() is only casing
sometimes (#4))" after "git pull upstream bleeding-jumbo")
I just read this email now
I guess reading emails should be a higher priority. ;-) Other than
that, for PHC you could update your JtR code to latest Argon2, which
changed quite a bit since what you integrated. I think there are 3
flavors of it now, including a new flavor with a MAXFORM chain that I
had proposed. You could also proceed to optimize it further - avoid
register spilling, try splitting the 8 parallel BLAKE2b's across
multiple work-items (combining their results via local memory, with a
barrier). Did you see the thread on the PHC list about yescrypt on GPU?
It's a similar approach I am suggesting here for Argon2.

As to yescrypt, I intend to provide a new version to you in a week from
now or so.

Thanks,

Alexander
Agnieszka Bielec
2015-10-15 10:04:51 UTC
Permalink
Raw Message
Post by Solar Designer
Did you see the thread on the PHC list about yescrypt on GPU?
It's a similar approach I am suggesting here for Argon2.
I see although I wasn't reading it closely so far, thanks

Loading...