I optimized yescrypt-opencl (on the GTX 960M) by copying one table to private memory.

Before (with some optimizations):

***@none ~/Desktop/r/run $ GWS=1024 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient, development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost 4 (t) of 0, cost 5 (g) of 0
Many salts: 247 c/s real, 247 c/s virtual
Only one salt: 247 c/s real, 247 c/s virtual

Now:

***@none ~/Desktop/r/src $ m;r;GWS=1024 ./john --test --format=yescrypt-opencl
Make process completed.
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient, development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost 4 (t) of 0, cost 5 (g) of 0
Many salts: 409 c/s real, 407 c/s virtual
Only one salt: 409 c/s real, 407 c/s virtual
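
For reference, the relative gain between these two runs can be worked out from the "Many salts" real c/s figures (247 before, 409 now). A quick sketch in Python, using only the numbers printed above:

```python
# Speeds copied from the two GWS=1024 runs above ("Many salts", real c/s).
before_cs = 247
after_cs = 409

speedup = after_cs / before_cs
print(f"speedup: {speedup:.2f}x ({(speedup - 1) * 100:.0f}% faster)")
# prints "speedup: 1.66x (66% faster)"
```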

But if I want autotune to run benchmarks for GWS=256, 512 and 1024, I need to set only a quarter of the needed memory in autotune (I'm getting CL_MEM_OBJECT_ALLOCATION_FAILURE for GWS=2048):

***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient, development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 159 c/s 159 rounds/s 1.608s per crypt_all()!
gws: 512 161 c/s 161 rounds/s 3.176s per crypt_all()+
gws: 1024 145 c/s 145 rounds/s 7.029s per crypt_all()
Local worksize (LWS) 64, global worksize (GWS) 512
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost 4 (t) of 0, cost 5 (g) of 0
Many salts: 355 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 358 c/s virtual

If I set all of the needed memory:

***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient, development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 158 c/s 158 rounds/s 1.612s per crypt_all()!
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost 4 (t) of 0, cost 5 (g) of 0
Many salts: 230 c/s real, 230 c/s virtual
Only one salt: 237 c/s real, 237 c/s virtual
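
The allocation failures are roughly what one would expect if the buffer grows linearly with GWS. A rough sketch, assuming the reported "memory per hash : 2.10 MB" is needed once per work-item and nothing else is counted (both are assumptions, not taken from the format's source; `total_mb` is a made-up helper for illustration):

```python
# Reported by the format: "memory per hash : 2.10 MB".
MB_PER_HASH = 2.10

def total_mb(gws, fraction=1.0):
    """Estimated buffer size in MB for a given GWS.

    fraction=0.25 models the "quarter of needed memory" setting;
    linear scaling with GWS is an assumption.
    """
    return MB_PER_HASH * gws * fraction

for gws in (256, 512, 1024, 2048):
    print(f"GWS={gws:5d}: full {total_mb(gws):7.1f} MB, "
          f"quarter {total_mb(gws, 0.25):7.1f} MB")
```

Under these assumptions, the largest size tried in each autotune run (GWS=256 with all of the memory, GWS=1024 with a quarter) comes to the same estimate of about 538 MB, and the next doubling would need about 1075 MB.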

The other thing is that the benchmarks estimate the speed improperly:

***@none ~/Desktop/r/run $ GWS=1024 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient, development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost 4 (t) of 0, cost 5 (g) of 0
Many salts: 407 c/s real, 407 c/s virtual
Only one salt: 409 c/s real, 409 c/s virtual

***@none ~/Desktop/r/run $ GWS=512 ./john --test --format=yescrypt-opencl
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient, development use only)]... Device 0: GeForce GTX 960M
memory per hash : 2.10 MB
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost 4 (t) of 0, cost 5 (g) of 0
Many salts: 358 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 360 c/s virtual

***@none ~/Desktop/r/run $ ./john --test --format=yescrypt-opencl --v=4
Benchmarking: yescrypt-opencl [Salsa20/8 OpenCL (inefficient, development use only)]... Device 0: GeForce GTX 960M
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=32 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125 -DHASH_SIZE=44
memory per hash : 2.10 MB
Calculating best global worksize (GWS); max. 100s total for crypt_all()
gws: 256 159 c/s 159 rounds/s 1.608s per crypt_all()!
gws: 512 161 c/s 161 rounds/s 3.176s per crypt_all()+
gws: 1024 145 c/s 145 rounds/s 7.029s per crypt_all()
Local worksize (LWS) 64, global worksize (GWS) 512
DONE
Speed for cost 1 (N) of 2048, cost 2 (r) of 8, cost 3 (p) of 11, cost 4 (t) of 0, cost 5 (g) of 0
Many salts: 355 c/s real, 358 c/s virtual
Only one salt: 358 c/s real, 358 c/s virtual
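
To put numbers on the mismatch: the autotune c/s figures are close to GWS divided by the time per crypt_all(), while the --test results for the same GWS are more than twice as high. A small sketch in Python, using only values copied from the output above:

```python
# (gws, seconds per crypt_all()) from the autotune output above.
autotune = [(256, 1.608), (512, 3.176), (1024, 7.029)]
# "Many salts" real c/s from the --test runs with GWS set explicitly.
benchmark_cs = {512: 358, 1024: 407}

for gws, seconds in autotune:
    estimate = gws / seconds  # candidates per second implied by autotune
    line = f"gws {gws:4d}: autotune ~{estimate:.1f} c/s"
    if gws in benchmark_cs:
        ratio = benchmark_cs[gws] / estimate
        line += f", benchmark {benchmark_cs[gws]} c/s ({ratio:.1f}x)"
    print(line)
```

With these inputs the GWS=512 benchmark comes out about 2.2x above the autotune estimate and the GWS=1024 benchmark about 2.8x above it, so autotune ranks 512 over 1024 even though the explicit GWS=1024 run is faster.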