2016-01-21 11:08:11 UTC
FYI, I've just committed the loader and cracker optimizations from
September 2015 to the core tree. There are a few differences from what
went into jumbo back then, but almost all of those are on purpose, and
should remain as differences. For example, show_uid_in_cracks is
jumbo-specific, which affected these changes in a few places. So when
merging these, you should mostly keep the code currently in jumbo as-is,
even if the new core code does those things differently.

One exception to that is the SSE2 vs. SSE checks for the prefetching.
It turns out those prefetch instructions and intrinsics are available
with plain SSE (Pentium 3) rather than requiring SSE2 (Pentium 4), so
let's check __SSE__ and #include <xmmintrin.h>, rather than check
__SSE2__ and #include <emmintrin.h>.
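
To illustrate the check, here's a minimal sketch. The PREFETCH_NTA and
HAVE_PREFETCH macros and the sum_indexed() helper are hypothetical names
made up for this example, not what's in the tree. Only __SSE__,
<xmmintrin.h>, _mm_prefetch() and _MM_HINT_NTA are the real thing.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>	/* _mm_prefetch() and _MM_HINT_* come with plain SSE */
#define HAVE_PREFETCH 1
#define PREFETCH_NTA(addr) _mm_prefetch((const char *)(addr), _MM_HINT_NTA)
#else
/* No SSE: prefetching becomes a no-op, the code still builds and works */
#define HAVE_PREFETCH 0
#define PREFETCH_NTA(addr) ((void)(addr))
#endif

/*
 * Hypothetical helper: sum the table entries selected by idx[], issuing a
 * prefetch for the next entry while the current one is being used.
 */
static unsigned int sum_indexed(const unsigned int *table,
    const unsigned int *idx, size_t n)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (i + 1 < n)
			PREFETCH_NTA(&table[idx[i + 1]]);
		sum += table[idx[i]];
	}

	return sum;
}
```

A prefetch is only a hint, so the result is the same whether or not
HAVE_PREFETCH ends up being 1.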

More importantly, our use of the NTA hint probably results in
performance regressions for some hash counts (neither very small nor
very large), as it reduces use of L2+ caches. I've actually seen at
least one such regression, where replacing the hint with T0 helped.

Unfortunately, NTA is in fact better than T0 for huge password hash
files, like the 29M test case. Maybe we need to move this portion of
code (the whole prefetching cracker) into an inline-able function, and
have the compiler specialize it in two different ways. And we'd need
some threshold parameter to choose one or the other per-salt.
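
Here's a rough sketch of how that specialization might look. It assumes
the compiler inlines the generic function and propagates the constant
flag, so that each wrapper ends up with only one kind of prefetch. All
function names are hypothetical, and the threshold is passed in as a
parameter because I don't know what a good value would be.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#define PREFETCH_NTA(addr) _mm_prefetch((const char *)(addr), _MM_HINT_NTA)
#define PREFETCH_T0(addr) _mm_prefetch((const char *)(addr), _MM_HINT_T0)
#else
#define PREFETCH_NTA(addr) ((void)(addr))
#define PREFETCH_T0(addr) ((void)(addr))
#endif

/*
 * Generic prefetching loop (hypothetical stand-in for the real cracker).
 * use_nta is a compile-time constant in both callers below, so after
 * inlining the compiler can drop the branch and keep a single hint.
 */
static inline unsigned int lookup_generic(const unsigned int *table,
    const unsigned int *idx, size_t n, int use_nta)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (i + 1 < n) {
			if (use_nta)
				PREFETCH_NTA(&table[idx[i + 1]]);
			else
				PREFETCH_T0(&table[idx[i + 1]]);
		}
		sum += table[idx[i]];
	}

	return sum;
}

/* The two specializations */
static unsigned int lookup_nta(const unsigned int *table,
    const unsigned int *idx, size_t n)
{
	return lookup_generic(table, idx, n, 1);
}

static unsigned int lookup_t0(const unsigned int *table,
    const unsigned int *idx, size_t n)
{
	return lookup_generic(table, idx, n, 0);
}

/* Per-salt choice: NTA only at or above the (assumed) hash count threshold */
static int want_nta(size_t hash_count, size_t nta_threshold)
{
	return hash_count >= nta_threshold;
}

static unsigned int lookup(const unsigned int *table,
    const unsigned int *idx, size_t n,
    size_t hash_count, size_t nta_threshold)
{
	if (want_nta(hash_count, nta_threshold))
		return lookup_nta(table, idx, n);
	return lookup_t0(table, idx, n);
}
```

Plain "static inline" is only a request. With gcc, something like
__attribute__((always_inline)) on lookup_generic() might be needed to
make sure the two variants are in fact specialized.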