Solar Designer

2015-09-02 15:20:25 UTC

magnum, Lei -

SHA-1's H(), aka F3(), is the same as SHA-2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().

In opencl_sha1.h we have:

#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

I've just tried changing this to:

#define F3(x, y, z) bitselect(x, y, (z) ^ (x))

and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to

1228K c/s).

The same pattern is also seen in:

[***@super src]$ grep -r 'bitselect.*\^.*bitselect' .

./opencl_sha1.h:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

./opencl/gpg_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

./opencl/sha1_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

./opencl/salted_sha_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

./opencl/pbkdf2_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))

./opencl/pbkdf2_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))

and maybe elsewhere, if written slightly differently.

In simd-intrinsics.c we have:

#if __XOP__

#define SHA1_H(x,y,z) \

tmp[i] = vcmov((x[i]),(y[i]),(z[i])); \

tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));

#else

#define SHA1_H(x,y,z) \

tmp[i] = vand((x[i]),(y[i])); \

tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));

#endif

This is suboptimal in two ways:

1. It doesn't use the more optimal expression above (can do 2 operations

instead of 3).

2. The check for __XOP__ prevents this optimization from being used for

other archs where we have non-emulated vcmov(). This is currently NEON

and AltiVec.

Once the expression is optimized, we could simply drop the check for
__XOP__: with emulated vcmov() it'd be 4 operations, the same count as
the current #else branch. I suggest that we don't, though, because the 4
operations in #else include some parallelism, whereas our vcmov()
emulation does not.

So I think we should either extend the check to also cover __ARM_NEON__
and __ALTIVEC__, or introduce a generic way of detecting non-emulated
vcmov() (e.g., a macro defined in pseudo_intrinsics.h indicating that
vcmov() is emulated, so we'd avoid it in that case).

The same applies to rawSHA1_ng_fmt_plug.c where we have:

#define R3(W, A, B, C, D, E) do { \

E = vadd_epi32(E, K); \

E = vadd_epi32(E, vxor(vcmov(D, B, C), vandnot(D, B))); \

E = vadd_epi32(E, W); \

B = vroti_epi32(B, 30); \

E = vadd_epi32(E, vroti_epi32(A, 5)); \

} while (false)

In fact, it's even worse there: as currently written, this expands to 5

operations when vcmov() is emulated, instead of 4. I think we should

put an #if around R3 and define it in two different ways: using the more
optimal 2-operation expression when vcmov() is non-emulated, and using
the 4-operation expression with parallelism (same as the current #else
branch in simd-intrinsics.c) when vcmov() is emulated.

magnum, will you take care of this, please? And test on GPUs and on XOP.

Lei, will you test/benchmark on NEON and AltiVec once magnum commits the

fixes, please?

Thanks,

Alexander
