Discussion:
SHA-1 H()
(too old to reply)
Solar Designer
2015-09-02 15:20:25 UTC
Permalink
Raw Message
magnum, Lei -

SHA-1's H() aka F3() is the same as SHA-2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().

In opencl_sha1.h we have:

#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))

I've just tried changing this to:

#define F3(x, y, z) bitselect(x, y, (z) ^ (x))

and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to
1228K c/s).

The same pattern is also seen in:

[***@super src]$ grep -r 'bitselect.*\^.*bitselect' .
./opencl_sha1.h:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/gpg_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/rar_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/sha1_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/salted_sha_kernel.cl:#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
./opencl/pbkdf2_kernel.cl:#define F(x, y, z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))
./opencl/pbkdf2_kernel.cl:#define F(x,y,z) (bitselect(x, y, z) ^ bitselect(x, (uint)0, y))

and maybe elsewhere, if written slightly differently.

In simd-intrinsics.c we have:

#if __XOP__
#define SHA1_H(x,y,z) \
tmp[i] = vcmov((x[i]),(y[i]),(z[i])); \
tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));
#else
#define SHA1_H(x,y,z) \
tmp[i] = vand((x[i]),(y[i])); \
tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));
#endif

This is suboptimal in two ways:

1. It doesn't use the more optimal expression above (can do 2 operations
instead of 3).

2. The check for __XOP__ prevents this optimization from being used for
other archs where we have non-emulated vcmov(). This is currently NEON
and AltiVec.

While we could simply drop the check for __XOP__ once we've optimized
the expression since it'd be 4 operations with emulated vcmov(), which
is same count as the current #else branch, I suggest that we don't,
because the 4 operations in #else include some parallelism whereas our
vcmov() emulation does not.

So I think we should either enhance the check with also checking for
__ARM_NEON__ and __ALTIVEC__, or introduce some generic way of checking
for non-emulated vcmov() (e.g., a macro defined in pseudo_intrinsics.h
that would indicate that vcmov() is emulated, so we'd avoid it then).

The same applies to rawSHA1_ng_fmt_plug.c where we have:

#define R3(W, A, B, C, D, E) do { \
E = vadd_epi32(E, K); \
E = vadd_epi32(E, vxor(vcmov(D, B, C), vandnot(D, B))); \
E = vadd_epi32(E, W); \
B = vroti_epi32(B, 30); \
E = vadd_epi32(E, vroti_epi32(A, 5)); \
} while (false)

In fact, it's even worse there: as currently written, this expands to 5
operations when vcmov() is emulated, instead of 4. I think we should
put an #if around R3 and define it in two different ways: using the more
optimal 2 operations expression when vcmov() is non-emulated, and using
the 4 operations expression with parallelism (same as the current #else
branch in simd-intrinsics.c) when vcmov() is emulated.

magnum, will you take care of this, please? And test on GPUs and on XOP.

Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?

Thanks,

Alexander
Solar Designer
2015-09-02 15:52:25 UTC
Permalink
Raw Message
magnum, Lei -
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
source files and are not always optimal:

[***@super src]$ grep -r 'define .*Maj' .
./unused/sha512_kernel.cl:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./rawSHA512_ng_fmt_plug.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./cuda_cryptsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./opencl/pwsafe_kernel.cl:#define Maj(x,y,z) (bitselect(y, x,(z^y)))
./opencl/pwsafe_kernel.cl:#define Maj(x, y, z) ((y & z) | (x & (y | z)))
./opencl_sha512.h: #define Maj(x,y,z) bitselect(x, y, z ^ x)
./opencl_sha512.h: #define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./cuda_cryptsha256.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./escrypt/sha256.c:#define Maj(x, y, z) ((x & (y | z)) | (y & z))
./cuda_xsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))
./opencl_sha2.h:#define Maj(x, y, z) bitselect(x, y, z ^ x)
./opencl_sha2.h:#define Maj(x, y, z) ((x & y) | (z & (x | y)))
./rawSHA256_ng_fmt_plug.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./opencl_sha256.h: #define Maj(x, y, z) bitselect(x, y, z ^ x)
./opencl_sha256.h: #define Maj(x, y, z) ((x & y) ^ (x & z) ^ (y & z))
./simd-intrinsics.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./simd-intrinsics.c:#define Maj(x,y,z) vcmov(x, y, vxor(z, y))
./cuda_pwsafe.h:#define Maj(x, y, z) ((y & z) | (x & (y | z)))
./cuda_rawsha256.h:#define Maj(x,y,z) ( (x & y) | (z & (x | y)) )
./cuda_rawsha512.h:#define Maj(x,y,z) ((x & y) ^ (x & z) ^ (y & z))

As you can see, some of these use 5 operations instead of 4, and some
use the parallelism-lacking approach with possibly emulated vcmov().

I think we should standardize on the parallelism-enabled 4 operation
expression for when there's no native bitselect() or vcmov() - for both
SHA-1 and SHA-2 in the same way.

This means we should probably check for AMD or at least NVIDIA Maxwell,
falling back to the 4 operation expression on older NVIDIA GPUs. (Need
to check which expression gets compiled to LOP3.LUT on Maxwell, though.)

A curious aspect is that Maj() is invariant with respect to the ordering
of its arguments. We can see it in the grep output above: some of the
expressions are the same except that they have x, y, z re-ordered in
different ways. We could test all 6 possible orderings in different
contexts (SHA-1 vs. SHA-256 vs. SHA-512, and different OpenCL kernels,
etc.) and see which is faster where (this might in fact differ).

Attached to this message is a program I used to search for possible
optimized expressions like this. No new findings from it, but it did
remind me of the issues I described in these two messages. I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).

Oh, and of course on Maxwell and AVX-512 MD5's I() (and all others) is
just one operation, if the compiler manages. We should check the
generated code for our kernels on Maxwell.

Lei - I think you haven't gotten around to introducing AVX-512 ternary
logic intrinsics yet, have you? Unfortunately, we don't yet have
hardware to test them on, but you could test on Intel SDE or/and by
emulating them with macros.

Alexander
magnum
2015-09-02 19:31:34 UTC
Permalink
Raw Message
Post by Solar Designer
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
Perhaps Ch() too:

#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))

This is 3 vs. 4 ops, right?

magnum
Solar Designer
2015-09-03 04:56:53 UTC
Permalink
Raw Message
Post by magnum
Post by Solar Designer
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
This is 3 vs. 4 ops, right?
On archs without AND-NOT, yes. So it's a good find, and I'm happy you
patched these.

However, on archs with AND-NOT either is 3 ops, and the one with AND-NOT
has some parallelism. This brings us to:

Maybe we need to adjust our emulation of vcmov() to use the form with
AND-NOT when we know that AND-NOT is available - and since we're dealing
with intrinsics, we do know and it usually is. Not only for Ch(), but
in general.

This will need to be benchmarked. Along with higher parallelism comes
higher register pressure. It is possible that optimal interleaving
factors will become lower than they are now.

Maybe both forms of emulation need to be kept in pseudo_intrinsics.h
with a way for us to choose one or the other. It might happen that the
optimal choice will vary by arch, CPU, compiler, format.

Alexander
magnum
2015-09-03 09:52:47 UTC
Permalink
Raw Message
Post by Solar Designer
Post by magnum
Post by Solar Designer
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
This is 3 vs. 4 ops, right?
On archs without AND-NOT, yes. So it's a good find, and I'm happy you
patched these.
However, on archs with AND-NOT either is 3 ops, and the one with AND-NOT
has some parallelism.
Maybe the and-not one is better on some GPU then? I need to test.
Apparently GCN has ANDN and NAND. Not sure about nvidia. I really hope
we don't need a '(~x) & z' and a 'z & (~x)' version too? Optimizers are
usully fascinating but sometimes very disappointing.
Post by Solar Designer
Maybe both forms of emulation need to be kept in pseudo_intrinsics.h
with a way for us to choose one or the other. It might happen that the
optimal choice will vary by arch, CPU, compiler, format.
But if it varies by format, we need to decide outside pseudo_intrinsics.h.

BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
(but almost hidden by normal variations).

magnum
Solar Designer
2015-09-03 18:40:03 UTC
Permalink
Raw Message
Post by magnum
Post by Solar Designer
Post by magnum
#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
This is 3 vs. 4 ops, right?
On archs without AND-NOT, yes. So it's a good find, and I'm happy you
patched these.
However, on archs with AND-NOT either is 3 ops, and the one with AND-NOT
has some parallelism.
Maybe the and-not one is better on some GPU then? I need to test.
Yes, that's possible.
Post by magnum
Apparently GCN has ANDN and NAND.
I need to take a fresh look at the arch manual, but in the generated
code I only see scalar ANDN, and never vector ANDN (nor NAND). They
defined scalar ANDN presumably because it's so useful for exec masks.

I see you've committed this:

+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif

but I think the check for amd_gcn(DEVICE_INFO) is wrong.

And why this change? -

-#if !gpu_nvidia(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
+#if !gpu_nvidia(DEVICE_INFO)
#define USE_BITSELECT 1
#elif gpu_nvidia(DEVICE_INFO)
#define OLD_NVIDIA 1
#endif
Post by magnum
Post by Solar Designer
Maybe both forms of emulation need to be kept in pseudo_intrinsics.h
with a way for us to choose one or the other. It might happen that the
optimal choice will vary by arch, CPU, compiler, format.
But if it varies by format, we need to decide outside pseudo_intrinsics.h.
We could include several versions of the macro in pseudo_intrinsics.h
and decide in the format via setting another macro (WANT_XXX) before
including pseudo_intrinsics.h.
Post by magnum
BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
(but almost hidden by normal variations).
On what hardware?

The parallelism vs. register pressure tradeoff is in fact non-obviously
beneficial. But on XOP there should be speedup from doing 1 op fewer.

Alexander
magnum
2015-09-03 19:29:37 UTC
Permalink
Raw Message
Post by Solar Designer
Post by magnum
Post by Solar Designer
Post by magnum
#define Ch(x, y, z) (z ^ (x & (y ^ z)))
#define Ch(x, y, z) ((x & y) ^ ( (~x) & z))
This is 3 vs. 4 ops, right?
On archs without AND-NOT, yes. So it's a good find, and I'm happy you
patched these.
Apparently GCN has ANDN and NAND.
I need to take a fresh look at the arch manual, but in the generated
code I only see scalar ANDN, and never vector ANDN (nor NAND). They
defined scalar ANDN presumably because it's so useful for exec masks.
+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif
but I think the check for amd_gcn(DEVICE_INFO) is wrong.
We currently never run vectorized on GCN anyway, unless forced by user -
if format supports it at all. But perhaps it should be
(amd_gcn(DEVICE_INFO) && (V_WIDTH < 2)) then?
Post by Solar Designer
And why this change? -
-#if !gpu_nvidia(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
+#if !gpu_nvidia(DEVICE_INFO)
#define USE_BITSELECT 1
#elif gpu_nvidia(DEVICE_INFO)
#define OLD_NVIDIA 1
#endif
I saw definite speedup for PBKDF2 and RAR iirc, and perhaps md5crypt.
But later I saw contradicting figures for other formats so I'm not sure
about this and things are in a state of flux. It might be that we should
revert to initially setting it (for Maxwell) in opencl_misc.h, and later
conditionally undefine it in certain formats.

Is bitselect() expected to always generate a LOP3.LUT? Even if it is, I
figure the optimizer just might be able to do better when given
bitselect-free code.

Besides all this, I see I introduced a bug: Now OLD_NVIDIA is defined
for Maxwell and that was not the intention. I'll fix that right away.
Post by Solar Designer
Post by magnum
BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
(but almost hidden by normal variations).
On what hardware?
AVX and AVX2. My overall feeling is SHA256 got a slight boost while
SHA512 did not and sometimes the latter got a very slight regression.
But I haven't really gone systematic yet. All my tests are very
inconclusive as of yet, the fluctuations are larger than the
boosts/regressions.

magnum
Solar Designer
2015-09-03 19:40:48 UTC
Permalink
Raw Message
Post by magnum
Post by Solar Designer
Post by magnum
Apparently GCN has ANDN and NAND.
I need to take a fresh look at the arch manual, but in the generated
code I only see scalar ANDN, and never vector ANDN (nor NAND). They
defined scalar ANDN presumably because it's so useful for exec masks.
+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif
but I think the check for amd_gcn(DEVICE_INFO) is wrong.
We currently never run vectorized on GCN anyway, unless forced by user -
if format supports it at all.
That's the SIMD vs. SIMT confusion again.

When talking ISA level:

By scalar, I mean the tiny scalar unit that is normally used for control
only. By vector, I mean the SIMD units.

Per the generated assembly code, there are no ANDN and NAND instructions
for the SIMD units at all. Trying to Google what their likely mnemonics
would be returns no hits. I think they just don't exist.

And it does not matter whether the kernel is vectorized or not. It uses
those same vector instructions either way. If vectorized, it gets
interleaved instructions, e.g. phpass-opecl:

v_add_i32 v43, vcc, v36, v43 // 00003D78: 4A565724
v_add_i32 v44, vcc, v37, v44 // 00003D7C: 4A585925
v_add_i32 v45, vcc, v38, v45 // 00003D80: 4A5A5B26
v_add_i32 v46, vcc, v35, v46 // 00003D84: 4A5C5D23
- v_not_b32 v51, v28 // 00003D88: 7E666F1C
- v_not_b32 v52, v29 // 00003D8C: 7E686F1D
- v_not_b32 v53, v30 // 00003D90: 7E6A6F1E
- v_not_b32 v54, v27 // 00003D94: 7E6C6F1B
- v_or_b32 v51, v43, v51 // 00003D98: 3866672B
- v_or_b32 v52, v44, v52 // 00003D9C: 3868692C
- v_or_b32 v53, v45, v53 // 00003DA0: 386A6B2D
- v_or_b32 v54, v46, v54 // 00003DA4: 386C6D2E
+ v_bfi_b32 v51, v28, v43, -1 // 00003D88: D2940033 0306571C
+ v_bfi_b32 v52, v29, v44, -1 // 00003D90: D2940034 0306591D
+ v_bfi_b32 v53, v30, v45, -1 // 00003D98: D2940035 03065B1E
+ v_bfi_b32 v54, v27, v46, -1 // 00003DA0: D2940036 03065D1B
v_xor_b32 v51, v36, v51 // 00003DA8: 3A666724
v_xor_b32 v52, v37, v52 // 00003DAC: 3A686925
v_xor_b32 v53, v38, v53 // 00003DB0: 3A6A6B26
v_xor_b32 v54, v35, v54 // 00003DB4: 3A6C6D23

(This also shows the effect of my MD5_I optimization.)
Post by magnum
But perhaps it should be (amd_gcn(DEVICE_INFO) && (V_WIDTH < 2)) then?
No.
Post by magnum
Post by Solar Designer
And why this change? -
-#if !gpu_nvidia(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
+#if !gpu_nvidia(DEVICE_INFO)
#define USE_BITSELECT 1
#elif gpu_nvidia(DEVICE_INFO)
#define OLD_NVIDIA 1
#endif
I saw definite speedup for PBKDF2 and RAR iirc, and perhaps md5crypt.
But later I saw contradicting figures for other formats so I'm not sure
about this and things are in a state of flux. It might be that we should
revert to initially setting it (for Maxwell) in opencl_misc.h, and later
conditionally undefine it in certain formats.
Is bitselect() expected to always generate a LOP3.LUT? Even if it is, I
figure the optimizer just might be able to do better when given
bitselect-free code.
Yes, we should review the generated code. It is unclear what source
code is more likely to result in optimal use of LOP3.LUT.
Post by magnum
Besides all this, I see I introduced a bug: Now OLD_NVIDIA is defined
for Maxwell and that was not the intention. I'll fix that right away.
Yes. Thanks.
Post by magnum
Post by Solar Designer
Post by magnum
BTW early tests indicate that 5916a57 made SHA-512 very slightly worse
(but almost hidden by normal variations).
On what hardware?
AVX and AVX2. My overall feeling is SHA256 got a slight boost while
SHA512 did not and sometimes the latter got a very slight regression.
But I haven't really gone systematic yet. All my tests are very
inconclusive as of yet, the fluctuations are larger than the
boosts/regressions.
That's not surprising. I only expect much difference on XOP.

Alexander
magnum
2015-09-04 10:59:23 UTC
Permalink
Raw Message
Post by Solar Designer
Post by magnum
Post by Solar Designer
+#if cpu(DEVICE_INFO) || amd_gcn(DEVICE_INFO)
+#define HAVE_ANDNOT 1
+#endif
but I think the check for amd_gcn(DEVICE_INFO) is wrong.
We currently never run vectorized on GCN anyway, unless forced by user -
if format supports it at all.
That's the SIMD vs. SIMT confusion again.
Oh, right :confused:. That macro wouldn't have any effect though, since
GCN has USE_BITSELECT. But I have now dropped it for GCN. Actually CPUs
also got USE_BITSELECT so any sections for HAVE_ANDNOT are currently
unused but I like to keep them there for the future.

magnum
magnum
2015-09-02 23:27:05 UTC
Permalink
Raw Message
Post by Solar Designer
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj()
And it turns out that while we appear to be optimally using bitselect()
or vcmov() for Maj(), the fallback expressions that we use vary across
(...)
As you can see, some of these use 5 operations instead of 4, and some
use the parallelism-lacking approach with possibly emulated vcmov().
I think we should standardize on the parallelism-enabled 4 operation
expression for when there's no native bitselect() or vcmov() - for both
SHA-1 and SHA-2 in the same way.
https://github.com/magnumripper/JohnTheRipper/commit/c5f50a9
https://github.com/magnumripper/JohnTheRipper/commit/5916a57

All done (I changed some 4-op Ch() to 3-op as well). I mostly see slight
boosts but some formats may show regression (they fluctuate). I
committed this but need to test more, and on other hardware than my laptop.
Post by Solar Designer
A curious aspect is that Maj() is invariant with respect to the ordering
of its arguments. We can see it in the grep output above: some of the
expressions are the same except that they have x, y, z re-ordered in
different ways. We could test all 6 possible orderings in different
contexts (SHA-1 vs. SHA-256 vs. SHA-512, and different OpenCL kernels,
etc.) and see which is faster where (this might in fact differ).
Definitely, I've seen silly boosts/regressions just from doing that.
It's annoying that the compiler can't figure it out for us - especially
if it would turn out eg. different GPU's like different ordering.
Post by Solar Designer
Attached to this message is a program I used to search for possible
optimized expressions like this. No new findings from it, but it did
remind me of the issues I described in these two messages. I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).
Would you care to explain what it does/outputs or do I need to reverse
it? I don't quite get it.

magnum
Solar Designer
2015-09-03 19:46:37 UTC
Permalink
Raw Message
Post by magnum
Post by Solar Designer
Attached to this message is a program I used to search for possible
optimized expressions like this. No new findings from it, but it did
remind me of the issues I described in these two messages. I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).
Would you care to explain what it does/outputs or do I need to reverse
it? I don't quite get it.
I'm sorry for the confusing code. It just evolved that way. Hopefully,
the new revision is slightly better in that its output is easier to use,
and that I posted a message in here with an example, even though the
code might be even more confusing.

Alexander
Solar Designer
2015-09-03 15:15:29 UTC
Permalink
Raw Message
Post by Solar Designer
Attached to this message is a program I used to search for possible
optimized expressions like this. [...] I was hoping
it might find a 2 operation expression for MD5's I(), but no luck.
I've enhanced the program, and had better luck today:

$ ./search3 y n n n 2>&1 | cut -d' ' -f7- | sort -u
165
sel(ff, 55, 0f) = f5; 33 ^ f5 = c6;
sel(ff, 55, 0f) = f5; f5 ^ 33 = c6;

#define I(x, y, z) (bitselect(0xffffffffU, (x), (z)) ^ (y))

To remind, the original was:

#define I(x, y, z) ((y) ^ ((x) | ~(z)))

I think it's the first time MD5 I() has been shown to be implementable
with only 2 operations on architectures without OR-NOT (and without
XNOR, and of course also without "ternary" logic instructions such as
Maxwell's or AVX-512's, which would turn this into 1 operation with no
effort).

Now that I think of it, the expression is actually very simple and I
should have been able to arrive at it without a program. bitselect()
with the all-ones constant is directly usable to implement OR-NOT. :-)
Post by Solar Designer
It doesn't yet test two bitselect()'s per expression, though - this is
worth adding and trying again (many possibilities to test there).
It does now, and it also tests constants as you can see. New version
attached. And here's a table produced with it:

$ ./search3.sh
SEL XNOR ORN ANDN COUNT MD5_I
yes yes yes yes 190 yes
yes yes yes no 190 yes
yes yes no yes 190 yes
yes yes no no 178 yes
yes no yes yes 177 yes
yes no yes no 177 yes
yes no no yes 177 yes
yes no no no 165 yes
no yes yes yes 144 yes
no yes yes no 114 yes
no yes no yes 114 yes
no yes no no 72 no
no no yes yes 131 yes
no no yes no 95 yes
no no no yes 95 no
no no no no 59 no

This shows the number of different truth tables possible for 3 inputs
with at most 2 operations on different architectures. (It is assumed
that AND, OR, NOT and constants are always available, so only possible
instruction set extensions are listed in the table.) The theoretical
maximum (actually achieved with Maxwell's or AVX-512's "ternary" logic
instructions) is 256.

The last column shows whether MD5's I() is implementable with 2
operations or not. Similar tables can be generated for other
expressions, such as those found in other MD4/MD5, SHA-1, and SHA-2
basic functions - but for those we readily knew seemingly optimal
expressions using bitselect(), so I focused on MD5's I() for now.

Unfortunately, the program became rather long. I'm sure it can be
greatly simplified - and perhaps it should in fact be rewritten and
tested against this version in order for us to have greater confidence
it actually finds all possible truth tables rather than only a (very
large) subset.

As to machine code improvements from this change to I(), here's GCN
assembly from our md5crypt kernel. Before the change:

v_not_b32 v5, v7 // 00003DA8: 7E0A6F07
v_or_b32 v5, v1, v5 // 00003DAC: 380A0B01
v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02
v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D
v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905
v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235
v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904
v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901

After the change (surprisingly, register allocation and offsets look
unchanged):

v_bfi_b32 v5, v7, v1, -1 // 00003DA8: D2940005 03060307
v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02
v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D
v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905
v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235
v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904
v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901

That's 7 instructions instead of 8. However, there's no code size
reduction in bytes, since v_bfi_b32 occupies 8 bytes (and it does so
even when no immediate value operand is used).

The effect of this will need to be tested across multiple OpenCL kernels
and pieces of C+intrinsics code across multiple architectures. Besides
GPUs, also on AMD CPUs with XOP.

Alexander
Solar Designer
2015-09-03 20:06:59 UTC
Permalink
Raw Message
Post by Solar Designer
$ ./search3 y n n n 2>&1 | cut -d' ' -f7- | sort -u
165
sel(ff, 55, 0f) = f5; 33 ^ f5 = c6;
sel(ff, 55, 0f) = f5; f5 ^ 33 = c6;
#define I(x, y, z) (bitselect(0xffffffffU, (x), (z)) ^ (y))
#define I(x, y, z) ((y) ^ ((x) | ~(z)))
I think it's the first time MD5 I() has been shown to be implementable
with only 2 operations on architectures without OR-NOT
atom confirmed that this wasn't in oclHashcat yet, but I guess now it
Post by Solar Designer
$ ./search3.sh
SEL XNOR ORN ANDN COUNT MD5_I
yes yes yes yes 190 yes
yes yes yes no 190 yes
yes yes no yes 190 yes
yes yes no no 178 yes
yes no yes yes 177 yes
yes no yes no 177 yes
yes no no yes 177 yes
yes no no no 165 yes
no yes yes yes 144 yes
no yes yes no 114 yes
no yes no yes 114 yes
no yes no no 72 no
no no yes yes 131 yes
no no yes no 95 yes
no no no yes 95 no
no no no no 59 no
Note that MD5 I() is also implementable in 2 ops on archs with XNOR and
ANDN, but no ORN and no SEL. Do these exist? (Usually if XNOR is
available, then ORN is also available.)

#define I(x, y, z) ((~(x) & (z)) ^ ~(y))
Post by Solar Designer
This shows the number of different truth tables possible for 3 inputs
with at most 2 operations on different architectures. (It is assumed
that AND, OR, NOT and constants are always available, so only possible
instruction set extensions are listed in the table.) The theoretical
maximum (actually achieved with Maxwell's or AVX-512's "ternary" logic
instructions) is 256.
Of course, I meant AND, OR, XOR, NOT there. (Forgot to list XOR.)
Post by Solar Designer
As to machine code improvements from this change to I(), here's GCN
v_not_b32 v5, v7 // 00003DA8: 7E0A6F07
v_or_b32 v5, v1, v5 // 00003DAC: 380A0B01
v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02
v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D
v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905
v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235
v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904
v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901
After the change (surprisingly, register allocation and offsets look
v_bfi_b32 v5, v7, v1, -1 // 00003DA8: D2940005 03060307
v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02
v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D
v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905
v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235
v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904
v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901
That's 7 instructions instead of 8. However, there's no code size
reduction in bytes, since v_bfi_b32 occupies 8 bytes (and it does so
even when no immediate value operand is used).
The effect of this will need to be tested across multiple OpenCL kernels
and pieces of C+intrinsics code across multiple architectures. Besides
GPUs, also on AMD CPUs with XOP.
For md5crypt-opencl, we're getting some slowdown - but that kernel is
weird (as discussed elsewhere, it currently uses global memory in its
inner loop). We need to optimize it some more, then re-test this change.

For phpass-opencl, there's decent speedup as mentioned above.

I didn't try any others yet.

Alexander
Lukas Odzioba
2015-09-03 20:42:05 UTC
Permalink
Raw Message
Post by Solar Designer
For md5crypt-opencl, we're getting some slowdown - but that kernel is
weird (as discussed elsewhere, it currently uses global memory in its
inner loop). We need to optimize it some more, then re-test this change.
I did not notice a significant slowdown, after a couple of tests it
seems that results are within measurement error and max seems to be
the same 1851k c/s.
When we reduce memory bottleneck it should work as expected.
Solar Designer
2015-09-03 21:02:56 UTC
Permalink
Raw Message
Post by Lukas Odzioba
Post by Solar Designer
For md5crypt-opencl, we're getting some slowdown - but that kernel is
weird (as discussed elsewhere, it currently uses global memory in its
inner loop). We need to optimize it some more, then re-test this change.
I did not notice a significant slowdown, after a couple of tests it
seems that results are within measurement error and max seems to be
the same 1851k c/s.
Is this --test speed on a lower clocked Tahiti?

The slowdown I saw was during actual cracking at length 8, something
like 3000K+ to 2900K (and further down to 2700K when the card heats up
and drops to non-turbo speed, 1050 MHz to 997.5 MHz in this test).
Post by Lukas Odzioba
When we reduce memory bottleneck it should work as expected.
Yes, I expect so.

Alexander
Lukas Odzioba
2015-09-03 21:40:18 UTC
Permalink
Raw Message
Post by Solar Designer
Is this --test speed on a lower clocked Tahiti?
Yes it is -test result on my 7970, I did not modify clocks.
aticonfig --odgc shows 1050 MHz at peak.

Thanks,
Lukas
Solar Designer
2015-09-04 07:27:06 UTC
Permalink
Raw Message
Post by Lukas Odzioba
Post by Solar Designer
Is this --test speed on a lower clocked Tahiti?
Yes it is -test result on my 7970, I did not modify clocks.
aticonfig --odgc shows 1050 MHz at peak.
I guess you have a different Catalyst version, then. I am getting
~2150K for --test on super's -dev=2 now, so 1050 MHz with 15.7.

Alexander
Solar Designer
2015-09-04 08:00:33 UTC
Permalink
Raw Message
Post by Solar Designer
#define I(x, y, z) (bitselect(0xffffffffU, (x), (z)) ^ (y))
[...]
Post by Solar Designer
Now that I think of it, the expression is actually very simple and I
should have been able to arrive at it without a program. bitselect()
with the all-ones constant is directly usable to implement OR-NOT. :-)
SEL XNOR ORN ANDN COUNT MD5_I
[...]
Post by Solar Designer
yes no yes yes 177 yes
yes no yes no 177 yes
yes no no yes 177 yes
yes no no no 165 yes
Since one SEL is usable to implement ORN, there should be no increase in
the number of different functions achieved with the addition of ORN to
the instruction set on top of SEL. Yet the table shows an increase.
I think the same holds for ANDN. I think the correct number for SEL
alone should thus be at least 177.

This indicates that the program still does not try all combinations.
I think the problem might be that it doesn't try enough constants along
with SELs. While op() can return constants, uses of op() count towards
the number of operations, which is limited to 2, and there's just one
other place where constants are added (and one more where they might get
copied, but in limited ways).

The program should be re-designed, or some existing tool should be used.
I just felt it'd be quicker and more transparent to write the program
from scratch than to (re)learn how to use an existing tool.

Alexander
Lei Zhang
2015-09-08 07:16:11 UTC
Permalink
Raw Message
Post by Solar Designer
Lei - I think you haven't gotten around to introducing AVX-512 ternary
logic intrinsics yet, have you?
I'll look into that.
Post by Solar Designer
Unfortunately, we don't yet have
hardware to test them on, but you could test on Intel SDE or/and by
emulating them with macros.
I've been following the news of Skylake. It's disappointing the desktop product line won't support AVX-512; only Xeon and Xeon Phi will have it, which are not launched yet. For the moment, Intel SDE is the only option...


Lei
Lei Zhang
2015-09-14 08:33:55 UTC
Permalink
Raw Message
Post by Lei Zhang
Post by Solar Designer
Lei - I think you haven't gotten around to introducing AVX-512 ternary
logic intrinsics yet, have you?
I'll look into that.
Post by Solar Designer
Unfortunately, we don't yet have
hardware to test them on, but you could test on Intel SDE or/and by
emulating them with macros.
I've been following the news of Skylake. It's disappointing the desktop product line won't support AVX-512; only Xeon and Xeon Phi will have it, which are not launched yet. For the moment, Intel SDE is the only option...
Ok, I've now introduced the ternary logic instruction to JtR, and get it to work on SDE.

In case it's helpful, here're some benchmark figures of JtR running on SDE:

Benchmarking: Raw-MD4 [MD4 512/512 AVX512F 16x3]... DONE
Raw: 312712 c/s real, 309647 c/s virtual

Benchmarking: Raw-MD5 [MD5 512/512 AVX512F 16x3]... DONE
Raw: 264240 c/s real, 264240 c/s virtual

Benchmarking: Raw-SHA1 [SHA1 512/512 AVX512F 16x]... DONE
Raw: 130772 c/s real, 129490 c/s virtual

Benchmarking: Raw-SHA256 [SHA256 512/512 AVX512F 16x]... DONE
Raw: 70304 c/s real, 70304 c/s virtual

Benchmarking: Raw-SHA512 [SHA512 512/512 AVX512F 8x]... DONE
Raw: 30875 c/s real, 30875 c/s virtual


BTW, SDE runs much more smoothly than I expected. At least those formats listed above ran quite fast on it.


Lei
Solar Designer
2015-09-14 12:27:25 UTC
Permalink
Raw Message
Post by Lei Zhang
Ok, I've now introduced the ternary logic instruction to JtR, and get it to work on SDE.
Cool. Thanks!
Yes, can you show benchmarks before/after the addition of ternary logic
instructions? And, did you fully make use of them (turning all of the
3-input basic functions of MD4/MD5/SHA-1/SHA-2 into single instructions)
or only to define vcmov() for now? You should do both, so that we also
have vcmov() available for use where it's sufficient (e.g., MD5's F and
G) and in possible future code where we might not introduce proper
ternary logic expressions right away.
Post by Lei Zhang
BTW, SDE runs much more smoothly than I expected. At least those formats listed above ran quite fast on it.
You mean the program's interactive response time, not the c/s rates.
The c/s rates suggest there's pure emulation with no JIT. But that's
fine for our purposes.

Alexander
Lei Zhang
2015-09-14 14:16:54 UTC
Permalink
Raw Message
Post by Solar Designer
And, did you fully make use of them (turning all of the
3-input basic functions of MD4/MD5/SHA-1/SHA-2 into single instructions)
or only to define vcmov() for now?
I turned all 3-input functions to using a single TERNLOG instruction, except for those that are already using a single CMOV (I thought one CMOV is good enough, but forgot it might be emulated). Now that you mentioned it, I also used TERNLOG to emulate CMOV. Here's the latest results:

Benchmarking: Raw-MD4 [MD4 512/512 AVX512F 16x3]... DONE
Raw: 219184 c/s real, 219184 c/s virtual

Benchmarking: Raw-MD5 [MD5 512/512 AVX512F 16x3]... DONE
Raw: 138917 c/s real, 140293 c/s virtual

Benchmarking: Raw-SHA1 [SHA1 512/512 AVX512F 16x]... DONE
Raw: 99216 c/s real, 99216 c/s virtual

Benchmarking: Raw-SHA256 [SHA256 512/512 AVX512F 16x]... DONE
Raw: 48839 c/s real, 49328 c/s virtual

Benchmarking: Raw-SHA512 [SHA512 512/512 AVX512F 8x]... DONE
Raw: 22019 c/s real, 22019 c/s virtual

Compared to the previous figures (please refer to my last message), using TERNLOG to emulate CMOV makes JtR slower on SDE. Maybe SDE's emulation of TERNLOG is just not efficient.


And here's the results without using any TERNLOG instructions:

Benchmarking: Raw-MD4 [MD4 512/512 AVX512F 16x3]... DONE
Raw: 444356 c/s real, 448800 c/s virtual

Benchmarking: Raw-MD5 [MD5 512/512 AVX512F 16x3]... DONE
Raw: 225172 c/s real, 227424 c/s virtual

Benchmarking: Raw-SHA1 [SHA1 512/512 AVX512F 16x]... DONE
Raw: 212784 c/s real, 212784 c/s virtual

Benchmarking: Raw-SHA256 [SHA256 512/512 AVX512F 16x]... DONE
Raw: 63413 c/s real, 63413 c/s virtual

Benchmarking: Raw-SHA512 [SHA512 512/512 AVX512F 8x]... DONE
Raw: 27440 c/s real, 27168 c/s virtual

I think that further confirms my statement above: SDE's emulation of TERNLOG is inefficient.
Post by Solar Designer
Post by Lei Zhang
BTW, SDE runs much more smoothly than I expected. At least those formats listed above ran quite fast on it.
You mean the program's interactive response time, not the c/s rates.
That's exactly what I meant :D


Lei
Solar Designer
2015-09-14 14:20:35 UTC
Permalink
Raw Message
Post by Lei Zhang
I think that further confirms my statement above: SDE's emulation of TERNLOG is inefficient.
OK. We don't currently have any better anyway.

Will you submit a pull request?

Thanks!

Alexander
Lei Zhang
2015-09-14 15:01:19 UTC
Permalink
Raw Message
Post by Solar Designer
Post by Lei Zhang
I think that further confirms my statement above: SDE's emulation of TERNLOG is inefficient.
OK. We don't currently have any better anyway.
Will you submit a pull request?
Yeah, I already sent one, but one of the formats failed the online tests. Still working on it.


Lei
magnum
2015-09-14 17:03:57 UTC
Permalink
Raw Message
Post by Lei Zhang
Post by Solar Designer
Post by Lei Zhang
I think that further confirms my statement above: SDE's emulation of TERNLOG is inefficient.
OK. We don't currently have any better anyway.
Will you submit a pull request?
Yeah, I already sent one, but one of the formats failed the online tests. Still working on it.
That failure was unrelated. The PR is merged now, good stuff!

magnum
Solar Designer
2015-09-14 17:40:47 UTC
Permalink
Raw Message
Post by magnum
That failure was unrelated. The PR is merged now, good stuff!
Yeah, looks good to me.

A minor detail, in this piece:

#if __AVX512F__
#undef vcmov
#define vcmov(x, y, z) vternarylogic(x, y, z, 0xE4)
#define vternarylogic _mm512_ternarylogic_epi32
#endif

I think we need to add:

#undef VCMOV_EMULATED

right after the "#undef vcmov". Since we have a one instruction
implementation, it is as good as native, non-emulated vcmov.

Alexander
magnum
2015-09-14 20:39:40 UTC
Permalink
Raw Message
Post by Solar Designer
Post by magnum
That failure was unrelated. The PR is merged now, good stuff!
Yeah, looks good to me.
#if __AVX512F__
#undef vcmov
#define vcmov(x, y, z) vternarylogic(x, y, z, 0xE4)
#define vternarylogic _mm512_ternarylogic_epi32
#endif
#undef VCMOV_EMULATED
right after the "#undef vcmov". Since we have a one instruction
implementation, it is as good as native, non-emulated vcmov.
Damn good catch. Fixed now.

BTW do you think we could use inline PTX to define a LOP3.LUT
instruction on nvidia, like you did with the funnel shifts? Or would it
possibly be worse than having the optimizer miss one or two, due to the
caveats of inline asm?

magnum
Solar Designer
2015-09-14 21:06:41 UTC
Permalink
Raw Message
Post by magnum
BTW do you think we could use inline PTX to define a LOP3.LUT
instruction on nvidia, like you did with the funnel shifts?
Yes, I thought of this too. We could want to check the generated code
first (it might already be using LOP3.LUT everywhere it should), or we
could just do the inline asm right away to ensure we'll always have
LOP3.LUT there no matter how the compiler might be changed.
Post by magnum
Or would it
possibly be worse than having the optimizer miss one or two, due to the
caveats of inline asm?
I saw no drawbacks from using inline PTX asm, since instruction
scheduling is performed in the PTX to ISA translation anyway.

This is very different from inline asm in C code compiled for a CPU,
where using inline asm for tiny pieces of code (such as for individual
instructions) breaks the C compiler's instruction scheduling.

Alexander
magnum
2015-10-06 00:32:17 UTC
Permalink
Raw Message
Post by Solar Designer
Post by magnum
BTW do you think we could use inline PTX to define a LOP3.LUT
instruction on nvidia, like you did with the funnel shifts?
Yes, I thought of this too. We could want to check the generated code
first (it might already be using LOP3.LUT everywhere it should), or we
could just do the inline asm right away to ensure we'll always have
LOP3.LUT there no matter how the compiler might be changed.
I implemented a shared lop3_lut(a, b, c, imm) function in de6c7c6 but
it's not enabled anywhere yet: I only tested md5crypt so far and it got
about 5% performance loss. I also tried only using it for one function
at a time but any of them results in performance loss - even F and G
which are both pure bitselects otherwise. I was expecting no difference
at all, at worst.
Post by Solar Designer
Post by magnum
Or would it
possibly be worse than having the optimizer miss one or two, due to the
caveats of inline asm?
I saw no drawbacks from using inline PTX asm, since instruction
scheduling is performed in the PTX to ISA translation anyway.
This is very different from inline asm in C code compiled for a CPU,
where using inline asm for tiny pieces of code (such as for individual
instructions) breaks the C compiler's instruction scheduling.
Something did not end up well. I'll compare resulting PTX and ISA and
try to figure out what happens.

magnum
magnum
2015-10-06 23:47:06 UTC
Permalink
Raw Message
Post by magnum
I implemented a shared lop3_lut(a, b, c, imm) function in de6c7c6 but
it's not enabled anywhere yet: I only tested md5crypt so far and it got
about 5% performance loss. I also tried only using it for one function
at a time but any of them results in performance loss - even F and G
which are both pure bitselects otherwise. I was expecting no difference
at all, at worst.
Here's a PTX diff with *only* F changed from bitselect() to inline asm
(I replaced all register numbers to <num> for simpler diff):

@@ -190,142 +190,130 @@
add.s32 %r<num>, %r<num>, -117830708;
shf.l.wrap.b32 %r<num>, %r<num>, %r<num>, 12;
add.s32 %r<num>, %r<num>, %r<num>;
- and.b32 %r<num>, %r<num>, %r<num>;
- not.b32 %r<num>, %r<num>;
- and.b32 %r<num>, %r<num>, -271733879;
- or.b32 %r<num>, %r<num>, %r<num>;
+ mov.u32 %r<num>, -271733879;
+ // inline asm
+ lop3.b32 %r<num>, %r<num>, %r<num>, %r<num>, 228;
+ // inline asm
ld.local.u32 %r<num>, [%rd4+72];
add.s32 %r<num>, %r<num>, %r<num>;

So if I read it right we replace "and, not, and immediate, or" with "mov
immediate, lop3". I can't see why that would decrease speed with 1%?
Even if the version with no inline PTX does end up as LOP3 (it should) -
why does the explicit version get slower?

Since we don't have CUDA 7.5 installed on super I can't look at the
resulting ISA - ptxas won't assemble this one, for some reason not even
the version without inline lop3.lut. It does assemble some other
kernels, and I have seen separate logic instructions in PTX end up as
LOP3 in the ISA. But for this comparison I'll need to continue my
digging somewhere else, later.

magnum
magnum
2015-09-02 18:19:02 UTC
Permalink
Raw Message
Post by Solar Designer
SHA-1's H() aka F3() is the same as SHA-2's Maj(), yet somehow we're
using less optimal expressions for it on systems with bitselect().
#define F3(x, y, z) (bitselect(x, y, z) ^ bitselect(x, 0U, y))
#define F3(x, y, z) bitselect(x, y, (z) ^ (x))
and got some speedup for pbkdf2-hmac-sha1-opencl on GCN (1200K to
1228K c/s).
(...)
magnum, will you take care of this, please? And test on GPUs and on XOP.
Good find, I totally missed this one. I'm glad I moved dozens of code
copies into opencl_sha1.h and similar headers: Much less code to
maintain. I'll look into it (including the stuff I snipped from my quote).

magnum
magnum
2015-09-04 10:37:22 UTC
Permalink
Raw Message
Post by Solar Designer
#if __XOP__
#define SHA1_H(x,y,z) \
tmp[i] = vcmov((x[i]),(y[i]),(z[i])); \
tmp[i] = vxor((tmp[i]),vandnot((x[i]),(y[i])));
#else
#define SHA1_H(x,y,z) \
tmp[i] = vand((x[i]),(y[i])); \
tmp[i] = vor((tmp[i]),vand(vor((x[i]),(y[i])),(z[i])));
#endif
TL;DR: The changes were good but made no big deal.

Pre and post 5916a57, here's Bull, OMP, complete list (of SHA formats
that can run OMP):

$ ../run/relbench -v o2 n2 | grep Ratio | sort
Ratio: 0.95438 real, 0.95335 virtual xsha, Mac OS X 10.4 - 10.6:Only one
salt
Ratio: 0.96187 real, 0.96203 virtual Raw-SHA512:Raw
Ratio: 0.98151 real, 0.98330 virtual SSHA512, LDAP:Only one salt
Ratio: 0.98306 real, 0.99574 virtual HMAC-SHA224:Many salts
Ratio: 0.98649 real, 0.98802 virtual sapg, SAP CODVN F/G (PASSCODE):Many
salts
Ratio: 0.98959 real, 0.99234 virtual xsha512, Mac OS X 10.7:Only one salt
Ratio: 0.98996 real, 1.00000 virtual Drupal7, $S$ (x16385):Raw

Manually re-testing Drupal7 (SHA-512) a couple of times shows it
probably neither got a boost not a regression (or perhaps a very very
slight boost: Same real speed, virtual boosted one (1) c/s - pretty much
consistently).

Ratio: 0.99001 real, 0.99875 virtual SybaseASE, Sybase ASE:Only one salt
Ratio: 0.99010 real, 0.99754 virtual Blackberry-ES10 (101x):Raw
Ratio: 0.99010 real, 1.00000 virtual Fortigate, FortiOS:Many salts
Ratio: 0.99050 real, 0.99650 virtual LastPass, sniffed sessions:Only one
salt
Ratio: 0.99055 real, 0.99534 virtual LastPass, sniffed sessions:Many salts
Ratio: 0.99085 real, 1.00077 virtual aix-ssha512, AIX LPA {ssha512}:Raw
Ratio: 0.99294 real, 0.99320 virtual HMAC-SHA256:Many salts
Ratio: 0.99383 real, 0.99884 virtual PBKDF2-HMAC-SHA256:Raw

Manually re-testing PBKDF2-HMAC-SHA256 a couple of times show neither
boost nor regression.

Ratio: 0.99415 real, 0.99611 virtual Blockchain, My Wallet (x10):Raw
Ratio: 0.99504 real, 0.99646 virtual HMAC-SHA512:Only one salt
Ratio: 0.99594 real, 1.00429 virtual Fortigate, FortiOS:Only one salt
Ratio: 0.99615 real, 0.99667 virtual Raw-SHA512-ng:Raw
Ratio: 0.99617 real, 1.00486 virtual Salted-SHA1:Only one salt
Ratio: 0.99625 real, 0.99634 virtual xsha512, Mac OS X 10.7:Many salts
Ratio: 0.99913 real, 0.99918 virtual SSHA512, LDAP:Many salts
Ratio: 1.00000 real, 0.97838 virtual HMAC-SHA224:Only one salt
Ratio: 1.00000 real, 0.99748 virtual RAR5:Raw
Ratio: 1.00000 real, 1.00000 virtual aix-ssha256, AIX LPA {ssha256}:Raw
Ratio: 1.00000 real, 1.00000 virtual HMAC-SHA512:Many salts
Ratio: 1.00000 real, 1.00000 virtual keyring, GNOME Keyring:Raw
Ratio: 1.00000 real, 1.00000 virtual mssql12, MS SQL 2012/2014:Many salts
Ratio: 1.00000 real, 1.00000 virtual sha512crypt, crypt(3) $6$
(rounds=5000):Raw
Ratio: 1.00000 real, 1.00118 virtual pwsafe, Password Safe:Raw
Ratio: 1.00000 real, 1.00142 virtual PBKDF2-HMAC-SHA512, GRUB2 / OS X
10.8+:Raw
Ratio: 1.00000 real, 1.00154 virtual sha256crypt, crypt(3) $5$
(rounds=5000):Raw
Ratio: 1.00000 real, 1.00230 virtual eCryptfs (65536x):Raw
Ratio: 1.00000 real, 1.00233 virtual lp, LastPass offline:Raw
Ratio: 1.00000 real, 1.00233 virtual rar, RAR3 (4 characters):Raw
Ratio: 1.00000 real, 1.00250 virtual SybaseASE, Sybase ASE:Many salts
Ratio: 1.00143 real, 1.00143 virtual tc_aes_xts, TrueCrypt AES256_XTS:Raw
Ratio: 1.00153 real, 1.00000 virtual HMAC-SHA384:Many salts
Ratio: 1.00238 real, 1.00115 virtual Raw-SHA256:Raw
Ratio: 1.00285 real, 0.99916 virtual HMAC-SHA1:Only one salt
Ratio: 1.00286 real, 1.00286 virtual tc_sha512, TrueCrypt AES256_XTS:Raw
Ratio: 1.00536 real, 1.01673 virtual saph, SAP CODVN H (PWDSALTEDHASH)
(SHA1x1024):Many salts
Ratio: 1.00539 real, 1.00566 virtual Raw-SHA224:Raw
Ratio: 1.00587 real, 1.00583 virtual Oracle12C:Raw
Ratio: 1.00897 real, 1.01051 virtual RAKP, IPMI 2.0 RAKP (RMCP+):Only
one salt
Ratio: 1.00931 real, 0.99772 virtual 7z, 7-Zip (512K iterations):Many salts
Ratio: 1.00965 real, 1.00000 virtual Django (x10000):Raw
Ratio: 1.00981 real, 1.02342 virtual keychain, Mac OS X Keychain:Raw
Ratio: 1.00996 real, 1.02035 virtual mscash2, MS Cache Hash 2 (DCC2):Raw
Ratio: 1.00997 real, 0.99876 virtual HMAC-SHA384:Only one salt
Ratio: 1.01000 real, 1.00000 virtual Bitcoin:Raw
Ratio: 1.01001 real, 1.00000 virtual Raw-SHA384:Raw
Ratio: 1.01005 real, 1.00457 virtual HMAC-SHA256:Only one salt
Ratio: 1.01193 real, 1.01197 virtual Salted-SHA1:Many salts
Ratio: 1.01314 real, 1.01314 virtual Citrix_NS10, Netscaler 10:Many salts
Ratio: 1.01546 real, 1.01677 virtual saph, SAP CODVN H (PWDSALTEDHASH)
(SHA1x1024):Only one salt
Ratio: 1.01617 real, 1.01439 virtual sha1crypt, NetBSD's sha1crypt:Raw
Ratio: 1.01626 real, 1.00000 virtual cloudkeychain, 1Password Cloud
Keychain:Raw
Ratio: 1.01628 real, 1.01764 virtual xsha, Mac OS X 10.4 - 10.6:Many salts
Ratio: 1.01634 real, 1.01507 virtual aix-ssha1, AIX LPA {ssha1}:Raw
Ratio: 1.01648 real, 1.01506 virtual sxc, StarOffice .sxc:Raw
Ratio: 1.01661 real, 1.01036 virtual sapg, SAP CODVN F/G (PASSCODE):Only
one salt
Ratio: 1.01695 real, 1.02453 virtual krb5-18, Kerberos 5 db etype 18:Raw
Ratio: 1.01699 real, 1.01871 virtual krb5pa-sha1, Kerberos 5 AS-REQ
Pre-Auth etype 17/18:Raw
Ratio: 1.01699 real, 1.02115 virtual EFS:Raw
Ratio: 1.01818 real, 1.01938 virtual STRIP, Password Manager:Raw
Ratio: 1.01858 real, 0.99275 virtual 7z, 7-Zip (512K iterations):Only
one salt
Ratio: 1.01898 real, 1.02020 virtual RAKP, IPMI 2.0 RAKP (RMCP+):Many salts
Ratio: 1.01924 real, 1.01681 virtual dmg, Apple DMG:Raw
Ratio: 1.01924 real, 1.02404 virtual OpenBSD-SoftRAID (8192 iterations):Raw
Ratio: 1.01935 real, 1.02062 virtual EncFS:Raw
Ratio: 1.02013 real, 1.01613 virtual LUKS:Raw
Ratio: 1.02058 real, 1.02058 virtual ZIP, WinZip:Raw
Ratio: 1.02315 real, 1.02172 virtual HMAC-SHA1:Many salts
Ratio: 1.02499 real, 1.02473 virtual Office, 2007/2010/2013:Raw
Ratio: 1.02616 real, 1.02497 virtual agilekeychain, 1Password Agile
Keychain:Raw
Ratio: 1.02778 real, 1.02604 virtual fde, Android FDE:Raw
Ratio: 1.03001 real, 1.01997 virtual PBKDF2-HMAC-SHA1:Raw
Ratio: 1.03393 real, 1.03394 virtual mssql12, MS SQL 2012/2014:Only one salt
Ratio: 1.03590 real, 1.03066 virtual ODF:Raw
Ratio: 1.03826 real, 1.02821 virtual Citrix_NS10, Netscaler 10:Only one salt
Ratio: 1.04312 real, 1.03429 virtual wpapsk, WPA/WPA2 PSK:Raw

I think the summary makes very good sense in this case since only
relevant formats were included:

Number of benchmarks: 82
Minimum: 0.95438 real, 0.95335 virtual
Maximum: 1.04312 real, 1.03429 virtual
Median: 1.00411 real, 1.00192 virtual
Median absolute deviation: 0.01072 real, 0.00599 virtual
Geometric mean: 1.00588 real, 1.00577 virtual
Geometric standard deviation: 1.01514 real, 1.01412 virtual

magnum
Solar Designer
2015-09-04 11:10:43 UTC
Permalink
Raw Message
Post by magnum
TL;DR: The changes were good but made no big deal.
Pre and post 5916a57, here's Bull, OMP, complete list (of SHA formats
[...]
Post by magnum
Number of benchmarks: 82
Minimum: 0.95438 real, 0.95335 virtual
Maximum: 1.04312 real, 1.03429 virtual
Median: 1.00411 real, 1.00192 virtual
Median absolute deviation: 0.01072 real, 0.00599 virtual
Geometric mean: 1.00588 real, 1.00577 virtual
Geometric standard deviation: 1.01514 real, 1.01412 virtual
Thanks!

Is there a code size reduction in the XOP build? Perhaps check with the
"size" command on the john binary and on individual *.o files.

Alexander
Lei Zhang
2015-09-08 07:04:57 UTC
Permalink
Raw Message
Post by Solar Designer
Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?
On AltiVec (4xOMP):

[before]
pbkdf2-sha1: 35840 c/s real, 8982 c/s virtual
pbkdf2-sha256: 14194 c/s real, 3566 c/s virtual
pbkdf2-sha512: 5944 c/s real, 1489 c/s virtual

[after]
pbkdf2-sha1: 36141 c/s real, 9057 c/s virtual
pbkdf2-sha256: 14336 c/s real, 3592 c/s virtual
pbkdf2-sha512: 5936 c/s real, 1498 c/s virtual


On NEON (2xOMP):

[before]
pbkdf2-sha1: 578 c/s real, 289 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual

[after]
pbkdf2-sha1: 501 c/s real, 250 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual


There's no significant change on Altivec, while SHA1 somehow gets slower on NEON.


Lei
Solar Designer
2015-09-08 08:47:25 UTC
Permalink
Raw Message
Lei,
Post by Lei Zhang
Post by Solar Designer
Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?
Is this 4 threads likely across different CPU cores? That's no good.
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores). So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core. You should be
able to do that with OpenMP env vars (affinity).

Please also run non-OpenMP benchmarks (thus, using 1 thread on 1 core
only) for reference.
Post by Lei Zhang
[before]
pbkdf2-sha1: 35840 c/s real, 8982 c/s virtual
pbkdf2-sha256: 14194 c/s real, 3566 c/s virtual
pbkdf2-sha512: 5944 c/s real, 1489 c/s virtual
[after]
pbkdf2-sha1: 36141 c/s real, 9057 c/s virtual
pbkdf2-sha256: 14336 c/s real, 3592 c/s virtual
pbkdf2-sha512: 5936 c/s real, 1498 c/s virtual
Thanks, but why are you testing these 3 hash types? I think we made
relevant changes to SHA-1 (optimized H using vcmov() as discussed in
this thread), MD5 (ditto, using my newly found expression for I), and
MD4 (ditto, realizing that G is the same as SHA-2 Maj).

We also revised how vcmov() is emulated and what we do when it is
emulated, but this should not affect AltiVec and NEON because those have
non-emulated vcmov(). We also adjusted SHA-256's interleaving factor on
XOP, but that's just XOP.

There should be no change to SHA-256 and SHA-512 on AltiVec and NEON.
Post by Lei Zhang
[before]
pbkdf2-sha1: 578 c/s real, 289 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual
[after]
pbkdf2-sha1: 501 c/s real, 250 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual
There's no significant change on Altivec,
OK, but you need to run 8 threads/core benchmarks.
Post by Lei Zhang
while SHA1 somehow gets slower on NEON.
It might need higher interleaving factor now. You haven't even tried
introducing interleaving for these archs, have you? (I don't recall.)

I think AltiVec probably won't need interleaving if we target modern
POWER chips with multiple hardware threads per core, but NEON will.

Also, as I suggested in the "MD5 on XOP, NEON, AltiVec" thread:

"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."

I think you should do that before benchmarking and before tuning of the
interleaving factors for MD5.

Thanks again,

Alexander
magnum
2015-09-08 10:32:11 UTC
Permalink
Raw Message
Is this 4 threads likely across different CPU cores? That's no good.
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores). So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core. You should be
able to do that with OpenMP env vars (affinity).
Please also run non-OpenMP benchmarks (thus, using 1 thread on 1 core
only) for reference.
Post by Lei Zhang
[before]
pbkdf2-sha1: 35840 c/s real, 8982 c/s virtual
pbkdf2-sha256: 14194 c/s real, 3566 c/s virtual
pbkdf2-sha512: 5944 c/s real, 1489 c/s virtual
[after]
pbkdf2-sha1: 36141 c/s real, 9057 c/s virtual
pbkdf2-sha256: 14336 c/s real, 3592 c/s virtual
pbkdf2-sha512: 5936 c/s real, 1498 c/s virtual
Thanks, but why are you testing these 3 hash types? I think we made
relevant changes to SHA-1 (optimized H using vcmov() as discussed in
this thread), MD5 (ditto, using my newly found expression for I), and
MD4 (ditto, realizing that G is the same as SHA-2 Maj).
We also revised how vcmov() is emulated and what we do when it is
emulated, but this should not affect AltiVec and NEON because those have
non-emulated vcmov(). We also adjusted SHA-256's interleaving factor on
XOP, but that's just XOP.
There should be no change to SHA-256 and SHA-512 on AltiVec and NEON.
Lei, you could also run the testparas.pl script (with OMP_NUM_THREADS
and GOMP_CPU_AFFINITY) before and after. But we'd want normal benchmarks
(like you did but for pbkdf2-hmac-md4/5 and -sha1) too, so we can assess
the virtual figures.

magnum
Lei Zhang
2015-09-09 15:43:41 UTC
Permalink
Raw Message
Post by Solar Designer
Lei,
Post by Lei Zhang
Post by Solar Designer
Lei, will you test/benchmark on NEON and AltiVec once magnum commits the
fixes, please?
Is this 4 threads likely across different CPU cores?
I think so. The benchmark results just fluctuated too bad when I utilize the maximum number of hardware threads, so I switched to a small number of threads, without binding them to a specific core though.
Post by Solar Designer
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores). So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core. You should be
able to do that with OpenMP env vars (affinity).
I'll post the updated results later.
Post by Solar Designer
Post by Lei Zhang
[before]
pbkdf2-sha1: 578 c/s real, 289 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual
[after]
pbkdf2-sha1: 501 c/s real, 250 c/s virtual
pbkdf2-sha256: 276 c/s real, 138 c/s virtual
pbkdf2-sha512: 125 c/s real, 62.7 c/s virtual
There's no significant change on Altivec,
OK, but you need to run 8 threads/core benchmarks.
Why? Our ZedBoard has only two cores.
Post by Solar Designer
Post by Lei Zhang
while SHA1 somehow gets slower on NEON.
It might need higher interleaving factor now. You haven't even tried
introducing interleaving for these archs, have you? (I don't recall.)
No, I haven't. I'll put this on my todo list.
Post by Solar Designer
"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."
I just checked the manuals. NEON does support OR-NOT, but AltiVec seems to only support NOT-OR (~(a|b)). So only NEON can benefit from this optimization perhaps.


Lei
Solar Designer
2015-09-09 16:51:38 UTC
Permalink
Raw Message
Post by Lei Zhang
Post by Solar Designer
Is this 4 threads likely across different CPU cores?
I think so. The benchmark results just fluctuated too bad when I utilize the maximum number of hardware threads, so I switched to a small number of threads, without binding them to a specific core though.
Have you also tried e.g. staying just one core's worth (8 threads) below
the maximum? And things like GOMP_SPINCOUNT=10000 and
GOMP_CPU_AFFINITY=0-155 (or whatever is the full range of logical CPU
numbers). Of course, these things only make sense if the system is
otherwise idle. If you can actually see where other load comes from,
then you need to reduce your use of the cores accordingly.
Post by Lei Zhang
Post by Solar Designer
Post by Lei Zhang
There's no significant change on Altivec,
OK, but you need to run 8 threads/core benchmarks.
Why? Our ZedBoard has only two cores.
That comment of mine only applied to your "There's no significant change
on Altivec" - thus, it didn't apply to ZedBoard.
Post by Lei Zhang
Post by Solar Designer
"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."
I just checked the manuals. NEON does support OR-NOT, but AltiVec seems to only support NOT-OR (~(a|b)). So only NEON can benefit from this optimization perhaps.
Oh, OK. MD5_I is also implementable with XNOR, but I think AltiVec
lacks that too. That's not a big deal since it has bitselect.

Alexander
Lei Zhang
2015-09-10 16:02:56 UTC
Permalink
Raw Message
<html><head></head><body dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">On Sep 9, 2015, at 11:43 PM, Lei Zhang &lt;***@gmail.com&gt; wrote:<br><blockquote type="cite"><br><blockquote type="cite">What we need for benchmarking is the maximum number of threads supported<br>in hardware on a certain number of CPU cores (on 1 core is OK if you<br>can't reliably use the entire machine's cores). &nbsp;So on POWER8 I guess<br>you'll run 8 threads all locked to one physical CPU core. &nbsp;You should be<br>able to do that with OpenMP env vars (affinity).<br></blockquote><br>I'll post the updated results later.<br></blockquote><div><br></div><div>The format prefix 'pbkdf2-hmac-' is omitted below.</div><br>On ARM (2xOMP):<br><br>[before]<br>MD4:<span class="Apple-tab-span" style="white-space:pre"> </span>2576 c/s real, 1288 c/s virtual<br>MD5:<span class="Apple-tab-span" style="white-space:pre"> </span>1651 c/s real, 825 c/s virtual<br>SHA1:<span class="Apple-tab-span" style="white-space:pre"> </span>578 c/s real, 290 c/s virtual<br><br>[after]<br>MD4:<span class="Apple-tab-span" style="white-space:pre"> </span>2608 c/s real, 1304 c/s virtual<br>MD5:<span class="Apple-tab-span" style="white-space:pre"> </span>1600 c/s real, 803 c/s virtual<br>SHA1:<span class="Apple-tab-span" style="white-space:pre"> </span>501 c/s real, 250 c/s virtual<br><br>MD4 becomes a little bit faster; MD5 &amp; SHA1 become slower.<br><br><br>On Power (8xOMP, bound to a single core)<br><br>[before]<br>MD4:<span class="Apple-tab-span" style="white-space:pre"> </span>28248 c/s real, 3531 c/s virtual<br>MD5:<span class="Apple-tab-span" style="white-space:pre"> </span>19980 c/s real, 2497 c/s virtual<br>SHA1:<span class="Apple-tab-span" style="white-space:pre"> </span>10593 c/s real, 1322 c/s virtual<br><br>[after]<br>MD4:<span class="Apple-tab-span" style="white-space:pre"> </span>31207 c/s real, 3882 c/s virtual<br>MD5:<span class="Apple-tab-span" style="white-space:pre"> </span>19980 c/s real, 2489 c/s virtual<br>SHA1:<span class="Apple-tab-span" style="white-space:pre"> </span>11273 c/s real, 1409 c/s virtual<br><br>On Power (1xOMP)<br><br>[before]<br>MD4:<span class="Apple-tab-span" style="white-space:pre"> </span>13398 c/s real, 13398 c/s virtual<br>MD5:<span class="Apple-tab-span" style="white-space:pre"> </span>10626 c/s real, 10626 c/s virtual<br>SHA1:<span class="Apple-tab-span" style="white-space:pre"> </span>8533 c/s real, 8533 c/s virtual<br><br>[after]<br>MD4:<span class="Apple-tab-span" style="white-space:pre"> </span>14628 c/s real, 14628 c/s virtual<br>MD5:<span class="Apple-tab-span" style="white-space:pre"> </span>10935 c/s real, 10935 c/s virtual<br>SHA1:<span class="Apple-tab-span" style="white-space:pre"> </span>8947 c/s real, 8947 c/s virtual<br><br>At least there's no performance drop on Power. BTW, It looks Power's SMT performance is not very impressive.<br><br><br>Lei</body></html>
Lei Zhang
2015-09-10 16:04:21 UTC
Permalink
Raw Message
Post by Lei Zhang
Post by Solar Designer
What we need for benchmarking is the maximum number of threads supported
in hardware on a certain number of CPU cores (on 1 core is OK if you
can't reliably use the entire machine's cores). So on POWER8 I guess
you'll run 8 threads all locked to one physical CPU core. You should be
able to do that with OpenMP env vars (affinity).
I'll post the updated results later.
The format prefix 'pbkdf2-hmac-' is omitted below.

On ARM (2xOMP):

[before]
MD4: 2576 c/s real, 1288 c/s virtual
MD5: 1651 c/s real, 825 c/s virtual
SHA1: 578 c/s real, 290 c/s virtual

[after]
MD4: 2608 c/s real, 1304 c/s virtual
MD5: 1600 c/s real, 803 c/s virtual
SHA1: 501 c/s real, 250 c/s virtual

MD4 becomes a little bit faster; MD5 & SHA1 become slower.


On Power (8xOMP, bound to a single core)

[before]
MD4: 28248 c/s real, 3531 c/s virtual
MD5: 19980 c/s real, 2497 c/s virtual
SHA1: 10593 c/s real, 1322 c/s virtual

[after]
MD4: 31207 c/s real, 3882 c/s virtual
MD5: 19980 c/s real, 2489 c/s virtual
SHA1: 11273 c/s real, 1409 c/s virtual

On Power (1xOMP)

[before]
MD4: 13398 c/s real, 13398 c/s virtual
MD5: 10626 c/s real, 10626 c/s virtual
SHA1: 8533 c/s real, 8533 c/s virtual

[after]
MD4: 14628 c/s real, 14628 c/s virtual
MD5: 10935 c/s real, 10935 c/s virtual
SHA1: 8947 c/s real, 8947 c/s virtual

At least there's no performance drop on Power. BTW, It looks Power's SMT performance is not very impressive.


Lei
Solar Designer
2015-09-10 16:26:47 UTC
Permalink
Raw Message
Lei,
Post by Lei Zhang
The format prefix 'pbkdf2-hmac-' is omitted below.
[before]
MD4: 2576 c/s real, 1288 c/s virtual
MD5: 1651 c/s real, 825 c/s virtual
SHA1: 578 c/s real, 290 c/s virtual
[after]
MD4: 2608 c/s real, 1304 c/s virtual
MD5: 1600 c/s real, 803 c/s virtual
SHA1: 501 c/s real, 250 c/s virtual
MD4 becomes a little bit faster; MD5 & SHA1 become slower.
For MD5, you can (and should) repair this with OR-NOT.

For SHA-1, please investigate: take a look at the generated code in
both cases.
Post by Lei Zhang
On Power (8xOMP, bound to a single core)
[before]
MD4: 28248 c/s real, 3531 c/s virtual
MD5: 19980 c/s real, 2497 c/s virtual
SHA1: 10593 c/s real, 1322 c/s virtual
[after]
MD4: 31207 c/s real, 3882 c/s virtual
MD5: 19980 c/s real, 2489 c/s virtual
SHA1: 11273 c/s real, 1409 c/s virtual
On Power (1xOMP)
[before]
MD4: 13398 c/s real, 13398 c/s virtual
MD5: 10626 c/s real, 10626 c/s virtual
SHA1: 8533 c/s real, 8533 c/s virtual
[after]
MD4: 14628 c/s real, 14628 c/s virtual
MD5: 10935 c/s real, 10935 c/s virtual
SHA1: 8947 c/s real, 8947 c/s virtual
At least there's no performance drop on Power. BTW, It looks Power's SMT performance is not very impressive.
Why, a 2x+ speedup compared to 1 thread is very good. It's similar to
or even better than what we're seeing with interleaving on x86.

Thanks,

Alexander
Lei Zhang
2015-09-12 08:53:42 UTC
Permalink
Raw Message
Post by Solar Designer
Post by Lei Zhang
On Power (8xOMP, bound to a single core)
[before]
MD4: 28248 c/s real, 3531 c/s virtual
MD5: 19980 c/s real, 2497 c/s virtual
SHA1: 10593 c/s real, 1322 c/s virtual
[after]
MD4: 31207 c/s real, 3882 c/s virtual
MD5: 19980 c/s real, 2489 c/s virtual
SHA1: 11273 c/s real, 1409 c/s virtual
On Power (1xOMP)
[before]
MD4: 13398 c/s real, 13398 c/s virtual
MD5: 10626 c/s real, 10626 c/s virtual
SHA1: 8533 c/s real, 8533 c/s virtual
[after]
MD4: 14628 c/s real, 14628 c/s virtual
MD5: 10935 c/s real, 10935 c/s virtual
SHA1: 8947 c/s real, 8947 c/s virtual
At least there's no performance drop on Power. BTW, It looks Power's SMT performance is not very impressive.
Why, a 2x+ speedup compared to 1 thread is very good. It's similar to
or even better than what we're seeing with interleaving on x86.
I don't really get it...

On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core. OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.


Lei
Solar Designer
2015-09-12 09:57:45 UTC
Permalink
Raw Message
Lei,
Post by Lei Zhang
On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core.
This happens, but it's not very common. Usually, speedup from running 2
threads/core is much less than 2x.
Post by Lei Zhang
OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.
SMT isn't only a way to increase resource utilization of a core when
running many threads. It's also a way to achieve lower latency due to
fewer context switches in server workloads (with lots of concurrent
requests) and to allow CPU designers to use higher instruction latencies
and achieve higher clock rate. (Note that my two uses of the word
latency in the previous sentence refer to totally different latencies:
server response latency on the order of milliseconds may be improved,
but instruction latency on the order of nanoseconds may be harmed at the
same time.) Our workload uses relatively low latency instructions:
integer only, and with nearly 100% L1 cache hit rate. Some other
workloads like multiplication of large matrices (exceeding L1 data
cache) might benefit from more hardware threads per core (or explicit
interleaving, but that's uncommon in scientific workloads except through
OpenCL and such), and that's also a reason for Power CPU designers to
support and possibly optimize for more hardware threads per core.

Finally, SMT provides middle ground between increasing the number of
ISA-visible CPU registers (which is limited by instruction size and the
number of register operands you can encode per instruction, as well as
by the need to maintain compatibility) and increasing the number of
rename registers. With SMT, there are sort of more ISA-visible CPU
registers: total across the many hardware threads. Those registers are
as good as ISA-visible ones for the purpose of replacing the need to
interleave instructions within 1 thread, yet they don't bump into
instruction size limitations.

I expect that on a CPU with more than 2 hardware threads the speed
growth with the increase of threads/core in use is spread over the 1 to
max threads range. So e.g. the speedup at only 2 threads on an 8
hardware threads CPU may very well be less than the speedup at 2 threads
on a 2 hardware threads CPU. I don't necessarily expect that the
speedup achieved at max threads is much or any greater than that
achieved at 2 threads on a CPU where 2 is the max. There's potential
for it to be greater (in the sense that the thread count doesn't limit
it to at most 2), but it might or might not be greater in practice.

Alexander
Solar Designer
2015-10-19 13:51:02 UTC
Permalink
Raw Message
Lei,

I just came across a recently posted article on this very topic:
performance scaling with POWER8's SMT (albeit in context of the
different reporting on AIX vs. Linux):

http://www.ibm.com/developerworks/library/l-processor-utilization-difference-aix-lop-trs/index.html

"Simultaneous multithreading (SMT) performance characterization shown in
Figure 6 is taken from the IBM POWER8 specification. This figure shows
that SMT8 provides 2.2 times better performance compared to single
threaded on POWER8."

The article also mentions that "a single-threaded application" run "on
an IBM POWER7 SMT4 system" "shows the core utilization as approximately
63% to 65%".

So the expected speedup when going from 1 thread/core to 8 threads/core
on POWER8 is 2.2 times, and the expected speedup when going from 1
thread/core to 4 threads/core on POWER7 is 1.5 to 1.6 times. Of course,
actual speedup will vary by application.

Alexander

P.S. I don't normally top-post, but it's one of those rare cases where I
find this appropriate - needing to quote a lot of context, yet not
Post by Solar Designer
Post by Lei Zhang
On my laptop, where each core supports 2 hardware threads, running 2 threads gets a 2x speedup compared to 1 thread on the same core.
This happens, but it's not very common. Usually, speedup from running 2
threads/core is much less than 2x.
Post by Lei Zhang
OTOH, each Power8 core supports up to 8 hardware threads, so I'd expect a higher speedup than just 2x.
SMT isn't only a way to increase resource utilization of a core when
running many threads. It's also a way to achieve lower latency due to
fewer context switches in server workloads (with lots of concurrent
requests) and to allow CPU designers to use higher instruction latencies
and achieve higher clock rate. (Note that my two uses of the word
server response latency on the order of milliseconds may be improved,
but instruction latency on the order of nanoseconds may be harmed at the
integer only, and with nearly 100% L1 cache hit rate. Some other
workloads like multiplication of large matrices (exceeding L1 data
cache) might benefit from more hardware threads per core (or explicit
interleaving, but that's uncommon in scientific workloads except through
OpenCL and such), and that's also a reason for Power CPU designers to
support and possibly optimize for more hardware threads per core.
Finally, SMT provides middle ground between increasing the number of
ISA-visible CPU registers (which is limited by instruction size and the
number of register operands you can encode per instruction, as well as
by the need to maintain compatibility) and increasing the number of
rename registers. With SMT, there are sort of more ISA-visible CPU
registers: total across the many hardware threads. Those registers are
as good as ISA-visible ones for the purpose of replacing the need to
interleave instructions within 1 thread, yet they don't bump into
instruction size limitations.
I expect that on a CPU with more than 2 hardware threads the speed
growth with the increase of threads/core in use is spread over the 1 to
max threads range. So e.g. the speedup at only 2 threads on an 8
hardware threads CPU may very well be less than the speedup at 2 threads
on a 2 hardware threads CPU. I don't necessarily expect that the
speedup achieved at max threads is much or any greater than that
achieved at 2 threads on a CPU where 2 is the max. There's potential
for it to be greater (in the sense that the thread count doesn't limit
it to at most 2), but it might or might not be greater in practice.
Alexander
Lei Zhang
2015-09-10 16:04:19 UTC
Permalink
Raw Message
Post by Solar Designer
"[...] we'll need to revise MD5_I in simd-intrinsics.c to use [...]
the obvious expression with OR-NOT on NEON and AltiVec (IIRC, those
archs have OR-NOT, which might be lower latency than select)."
Experimented on ARM:

[use CMOV]
Benchmarking: PBKDF2-HMAC-MD5 [PBKDF2-MD5 128/128 NEON 4x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 1600 c/s real, 800 c/s virtual

[use OR-NOT]
Benchmarking: PBKDF2-HMAC-MD5 [PBKDF2-MD5 128/128 NEON 4x]... (2xOMP) DONE
Speed for cost 1 (iteration count) of 1000
Raw: 1651 c/s real, 825 c/s virtual


Lei
Loading...