Solar Designer

2015-10-10 04:52:06 UTC

Claudio, magnum, Agnieszka -

I've just tried out Claudio's latest sha512crypt-opencl commits - good
improvement on GCN, thanks! Most of the time, I got these -test speeds
on super's -dev=2 (Tahiti 1050 MHz, Catalyst 15.7):

Speed for cost 1 (iteration count) of 5000
Raw: 39125 c/s real, 1456K c/s virtual
Speed for cost 1 (iteration count) of 5000
Raw: 38325 c/s real, 728177 c/s virtual

On some rare occasions, I got:

Speed for cost 1 (iteration count) of 5000
Raw: 55538 c/s real, 1456K c/s virtual

even though the speeds reported by -v=4 during auto-tuning never went
above 41k. It's weird.

Then I noticed that opencl_sha512.h uses:

#define ror(x, n) ((x >> n) | (x << (64UL-n)))

In the generated ISA code, there are no v_alignbit_b32 instructions.

So I tried to use rotate():

#define ror(x, n) rotate(x, 64UL-n)

and speeds went up (except for the very rare 55k seen above), e.g. on
four consecutive tests:

Speed for cost 1 (iteration count) of 5000
Raw: 44734 c/s real, 1310K c/s virtual
Speed for cost 1 (iteration count) of 5000
Raw: 48188 c/s real, 728177 c/s virtual
Speed for cost 1 (iteration count) of 5000
Raw: 43690 c/s real, 728177 c/s virtual
Speed for cost 1 (iteration count) of 5000
Raw: 44887 c/s real, 2383K c/s virtual

The ISA changed, but still did not use v_alignbit_b32 instructions.

Notably, while all IL sizes and the ISA sizes for kernel_crypt and
kernel_final remained unchanged (although the instructions and VGPR
usage changed), the ISA size for kernel_prepare decreased:

-codeLenInByte = 163080 bytes;
+codeLenInByte = 162504 bytes;

(It's huge either way, though.)

Next I tried to use amd_bitalign() explicitly, initially for half of
the rotates:

#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : rotate((x), 64UL - (n)))
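(To see why this works for n < 32: with x split into 32-bit halves
hi:lo, the low word of the rotate is bits n..n+31 of hi:lo, which is
exactly amd_bitalign(hi, lo, n), and the high word is bits n..n+31 of
lo:hi, i.e. amd_bitalign(lo, hi, n) - each 64-bit rotate becomes two
32-bit funnel shifts.)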

IL sizes went way up, ISA sizes significantly down (for all 3 kernels),
and speed went up:

Speed for cost 1 (iteration count) of 5000
Raw: 60124 c/s real, 2621K c/s virtual
Speed for cost 1 (iteration count) of 5000
Raw: 54841 c/s real, 2621K c/s virtual

Of course, there were now v_alignbit_b32 instructions in the generated
ISA code. I ran only these two benchmarks with this code revision
before proceeding further, to:

#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n) - 32) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n) - 32) << 32)))
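Both branches are easy to sanity-check on the host - an untested plain
C sketch, where bitalign() merely simulates amd_bitalign() per the spec
and all names are made up:

#include <stdio.h>
#include <stdint.h>

/* Simulate amd_bitalign(): funnel shift right by (src2 & 31), keep low 32 bits */
static uint32_t bitalign(uint32_t src0, uint32_t src1, uint32_t src2)
{
	return (uint32_t)((((uint64_t)src0 << 32) | src1) >> (src2 & 31));
}

/* Reference 64-bit rotate right, valid for n = 1..63 */
static uint64_t ror_ref(uint64_t x, unsigned int n)
{
	return (x >> n) | (x << (64 - n));
}

/* The two-bitalign decomposition from the macro above */
static uint64_t ror_ba(uint64_t x, unsigned int n)
{
	uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);

	if (n < 32)
		return bitalign(hi, lo, n) | ((uint64_t)bitalign(lo, hi, n) << 32);
	return bitalign(lo, hi, n - 32) | ((uint64_t)bitalign(hi, lo, n - 32) << 32);
}

int main(void)
{
	uint64_t x = 0x0123456789abcdefULL;
	unsigned int n;

	for (n = 1; n < 64; n++)
		if (ror_ref(x, n) != ror_ba(x, n))
			printf("mismatch at n = %u\n", n);
	return 0;
}

If the decomposition is right, this prints nothing.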

IL went further up, ISA further down, and speeds went in no clear direction:

Speed for cost 1 (iteration count) of 5000
Raw: 54386 c/s real, 2912K c/s virtual
Speed for cost 1 (iteration count) of 5000
Raw: 54841 c/s real, 1456K c/s virtual
Speed for cost 1 (iteration count) of 5000
Raw: 54161 c/s real, 819200 c/s virtual
Speed for cost 1 (iteration count) of 5000
Raw: 53718 c/s real, 728177 c/s virtual

even though the speeds reported during auto-tuning went up (above 62k
now, only to be followed by the 54k-ish speeds on the final benchmark).
For example:

[***@super run]$ AMD_OCL_BUILD_OPTIONS_APPEND=-save-temps ./john -test -form=sha512crypt-opencl -dev=2 -v=4
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]...
Device 2: Tahiti [AMD Radeon HD 7900 Series]
Calculating best GWS for LWS=64; max. 150ms single kernel invocation.
gws: 2048 5158 25790000 rounds/s 397.002ms per crypt_all()!
gws: 4096 14338 71690000 rounds/s 285.657ms per crypt_all()!
gws: 8192 31089 155445000 rounds/s 263.499ms per crypt_all()!
gws: 16384 59464 297320000 rounds/s 275.524ms per crypt_all()+
gws: 32768 59717 298585000 rounds/s 548.714ms per crypt_all()
gws: 65536 61746 308730000 rounds/s 1.061s per crypt_all()+
gws: 131072 62269 311345000 rounds/s 2.104s per crypt_all()
Calculating best LWS for GWS=65536
Testing LWS=64 GWS=65536 ... 43.049ms+
Testing LWS=128 GWS=65536 ... 45.408ms
Testing LWS=192 GWS=65472 ... 61.590ms
Testing LWS=256 GWS=65536 ... 42.687ms+
Calculating best GWS for LWS=256; max. 300ms single kernel invocation.
gws: 8192 31530 157650000 rounds/s 259.809ms per crypt_all()!
gws: 16384 58888 294440000 rounds/s 278.222ms per crypt_all()+
gws: 32768 60613 303065000 rounds/s 540.605ms per crypt_all()+
gws: 65536 61189 305945000 rounds/s 1.071s per crypt_all()
gws: 131072 61748 308740000 rounds/s 2.122s per crypt_all()+
gws: 262144 63347 316735000 rounds/s 4.138s per crypt_all()+
Local worksize (LWS) 256, global worksize (GWS) 262144
DONE
Speed for cost 1 (iteration count) of 5000
Raw: 55072 c/s real, 2621K c/s virtual

Oh, 63k during the auto-tuning even. Was that extrapolated from fewer
than 5000 iterations maybe? Is the instruction cache hit rate worse for
5000 iterations maybe? We may want to play with how we're splitting the
5000 iterations across kernel invocations (see the sketch after the run
below) to hopefully regain this speed for actual runs. ... or maybe we
have it for actual runs already:

[***@super run]$ ./john -form=sha512crypt-opencl -dev=2 -v=5 -inc=alpha -min-len=8 -max-len=8 pw
[...]
Local worksize (LWS) 64, global worksize (GWS) 262144
[...]
0g 0:00:00:00 0g/s 0p/s 0c/s 0C/s
0g 0:00:00:16 0g/s 48907p/s 48907c/s 48907C/s bigetort..soyfryap
0g 0:00:00:17 0g/s 46044p/s 46044c/s 46044C/s bigetort..soyfryap
0g 0:00:00:18 0g/s 58060p/s 58060c/s 58060C/s soyfryam..calefryn
0g 0:00:00:19 0g/s 54985p/s 54985c/s 54985C/s soyfryam..calefryn
0g 0:00:00:20 0g/s 52245p/s 52245c/s 52245C/s soyfryam..calefryn
0g 0:00:00:21 0g/s 49742p/s 49742c/s 49742C/s soyfryam..calefryn
0g 0:00:00:23 0g/s 56839p/s 56839c/s 56839C/s calefrya..astefeto
0g 0:00:00:33 0g/s 55505p/s 55505c/s 55505C/s chumaist..metalito
0g 0:00:00:34 0g/s 53859p/s 53859c/s 53859C/s chumaist..metalito
0g 0:00:00:38 0g/s 55101p/s 55101c/s 55101C/s metality..singuapp
0g 0:00:00:39 0g/s 60386p/s 60386c/s 60386C/s singuazy..abbortom
0g 0:00:00:40 0g/s 58923p/s 58923c/s 58923C/s singuazy..abbortom
0g 0:00:00:41 0g/s 57473p/s 57473c/s 57473C/s singuazy..abbortom
0g 0:00:00:42 0g/s 56093p/s 56093c/s 56093C/s singuazy..abbortom
0g 0:00:00:43 0g/s 60864p/s 60864c/s 60864C/s abbortot..mcmyleow
abcdefgh (?)
1g 0:00:00:47 DONE (2015-10-10 07:24) 0.02114g/s 60976p/s 60976c/s 60976C/s abbortot..mcmyleow

61k average for a 47-second run, not bad. Maybe this includes a last
buffer that was not fully processed (4 or 5 seconds), though?
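(The splitting mentioned above is, roughly, the usual host-side loop -
a hypothetical sketch, with HASH_LOOPS standing for the per-invocation
round count:

/* one prepare pass, then the loop kernel repeatedly, then finalization */
clEnqueueNDRangeKernel(queue, kernel_prepare, 1, NULL, &gws, &lws, 0, NULL, NULL);
for (i = 0; i < 5000 / HASH_LOOPS; i++)
	clEnqueueNDRangeKernel(queue, kernel_crypt, 1, NULL, &gws, &lws, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel_final, 1, NULL, &gws, &lws, 0, NULL, NULL);
clFinish(queue);

Varying HASH_LOOPS changes how much work each invocation does, and with
it the instruction cache behavior.)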

As a final experiment, I tried omitting the "- 32" from my uses of
amd_bitalign() that had it, because amd_bitalign() is defined (unlike
e.g. bitwise shifts in C) to take its shift count "& 31":

https://www.khronos.org/registry/cl/extensions/amd/cl_amd_media_ops.txt
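(For reference, that document defines each component of the result as

dst = (uint)((((long)src0 << 32) | (long)src1) >> (src2 & 31));

so the hardware only ever sees the low 5 bits of the shift count.)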

Specifically:

#define ror(x, n) ((n) < 32 ? (amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)) | ((ulong)amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)) << 32)) : (amd_bitalign((uint)(x), (uint)((x) >> 32), (uint)(n)/* - 32*/) | ((ulong)amd_bitalign((uint)((x) >> 32), (uint)(x), (uint)(n)/* - 32*/) << 32)))

Notice the two commented-out "- 32"'s. The resulting ISA size remained
the same, and so did the speeds, but the changed shift counts (now some
above 31) are actually encoded in the instructions. It's curious that
GCN even has room in the instructions to encode those redundant shift
counts, instead of applying the "& 31" at compile time.

I think it's better to use the previous version, with the "- 32"'s
intact, as long as the rotate counts are compile-time constants, so this
subtraction is performed at compile time as well.
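(For SHA-512 they are: the only ror() counts come from the Sigma/sigma
macros, which in the usual notation - the exact spelling in
opencl_sha512.h may differ - are:

#define Sigma0(x) (ror((x), 28) ^ ror((x), 34) ^ ror((x), 39))
#define Sigma1(x) (ror((x), 14) ^ ror((x), 18) ^ ror((x), 41))
#define sigma0(x) (ror((x), 1) ^ ror((x), 8) ^ ((x) >> 7))
#define sigma1(x) (ror((x), 19) ^ ror((x), 61) ^ ((x) >> 6))

so both the "(n) < 32" ternary and the "- 32" fold away at compile
time.)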

Further speedup might be possible through switching the SWAP64() macro
from using rotate() to using the approach above. However, it is
currently used on both vector and scalar arguments, so my trivial
attempt at changing it like that failed. I guess we'd need to split it
into two macros: one for vectors and one for scalars (a possible shape
for that split is sketched below). I am leaving this for Claudio and/or
magnum to experiment with.
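Here's an untested sketch of the split, assuming SWAP64() is the usual
four-rotate byte swap and writing vtype for whatever vector type is in
use (both macro names are mine):

#define SWAP64_S(n) (ror((n) & 0x000000FF000000FFUL, 8) | \
                     ror((n) & 0x0000FF000000FF00UL, 24) | \
                     ror((n) & 0x00FF000000FF0000UL, 40) | \
                     ror((n) & 0xFF000000FF000000UL, 56))

#define SWAP64_V(n) (rotate((n) & (vtype)0x000000FF000000FFUL, (vtype)56) | \
                     rotate((n) & (vtype)0x0000FF000000FF00UL, (vtype)40) | \
                     rotate((n) & (vtype)0x00FF000000FF0000UL, (vtype)24) | \
                     rotate((n) & (vtype)0xFF000000FF000000UL, (vtype)8))

SWAP64_S then picks up the bitalign-based ror() above with compile-time
counts, while SWAP64_V keeps rotate(), which the compiler already
handles per lane.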

Agnieszka - the 64-bit rotate optimizations discussed above are probably
also applicable to Argon2 and Lyra2.

Claudio - some additional info on the experiments above, in the same
order (first is your code, then my revisions as described above):

[***@super run]$ fgrep codeLenInByte ?/*.isa
1/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59044 bytes;
1/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60240 bytes;
1/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 163080 bytes;
2/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59044 bytes;
2/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60240 bytes;
2/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 162504 bytes;
3/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59164 bytes;
3/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60328 bytes;
3/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158704 bytes;
4/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59148 bytes;
4/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60316 bytes;
4/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158616 bytes;
5/_temp_0_Tahiti_kernel_crypt.isa:codeLenInByte = 59148 bytes;
5/_temp_0_Tahiti_kernel_final.isa:codeLenInByte = 60316 bytes;
5/_temp_0_Tahiti_kernel_prepare.isa:codeLenInByte = 158616 bytes;

[***@super run]$ fgrep NumVgpr ?/*.isa
1/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 116;
1/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 116;
1/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 90;
2/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 115;
2/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 115;
2/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 91;
3/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 124;
3/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 119;
3/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 88;
4/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 119;
4/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 118;
4/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 87;
5/_temp_0_Tahiti_kernel_crypt.isa:NumVgprs = 119;
5/_temp_0_Tahiti_kernel_final.isa:NumVgprs = 118;
5/_temp_0_Tahiti_kernel_prepare.isa:NumVgprs = 87;

There was a spike in NumVgprs for the mixed version (with bitalign only
used for half of the rotates), yet the speed was good.

ScratchSize remained the same as yours in all of these revisions.

1/_temp_0_Tahiti_kernel_crypt.isa:ScratchSize = 32 dwords/thread;
1/_temp_0_Tahiti_kernel_final.isa:ScratchSize = 32 dwords/thread;
1/_temp_0_Tahiti_kernel_prepare.isa:ScratchSize = 108 dwords/thread;

Alexander
