Discussion:
[john-dev] USB-FPGA development
a***@openwall.net
2016-04-26 13:16:04 UTC
Hi,

I'm developing an application for FPGA and looking forward to integrating it into JtR.
The hardware is ZTEX USB-FPGA module 1.15y.


Implementing communication between the host software and the USB-FPGA module turned out to be time-consuming, for the following reasons:
- The Ztex SDK doesn't include a C host software library. There's a Java library only;
- There are no examples or applications that communicate with the Ztex board at high speed. E.g. various Bitcoin mining applications operate at no more than 0.5 MB/s;
- The hardware is a multi-FPGA board with 4 FPGA chips and 1 USB device controller. The FPGA chips don't have independent communication interfaces, and at first glance they don't look like independent devices. So it's the application developer's task to design a system of properly arranged I/O so that the host and the FPGAs communicate in an orderly manner.

I've extracted a sub-project, ztex/examples/inouttraffic, which can be used as a starting point for creating Ztex USB-FPGA applications.
The project is available at

https://github.com/Apingis/ztex_inouttraffic
The project includes an HDL (Verilog) part, the USB device controller's firmware and C host software.

There are issues with integrating ZTEX USB-FPGA applications into JtR. The relevant features of the device are:
- low I/O speed (compared to e.g. a GPU). That's a problem for hashes that are fast to compute.
- the device doesn't have much memory. Each chip has 0.5 MB of internal memory and there's no other memory on the device. That seems to be enough for many hash types. However, when it comes to implementing an on-chip candidate password generator, there isn't enough memory to store wordlist or charset data.

Denis
Lukas Odzioba
2016-04-26 19:31:34 UTC
Post by a***@openwall.net
- low I/O speed (compared to e.g. a GPU). That's a problem for hashes that are fast to compute.
- the device doesn't have much memory. Each chip has 0.5 MB of internal memory and there's no other memory on the device. That seems to be enough for many hash types. However, when it comes to implementing an on-chip candidate password generator, there isn't enough memory to store wordlist or charset data.
Hi Denis,
could you please mention specific benchmark IO numbers?
For very slow hashes we would be ok with 10 000 c/s times password
length - let's say 10, which is 100 kB/s (800 kbit/s) per board.
For fast hashes we have a bottleneck even on GPUs with GB/s transfers,
so it is not a big deal I guess, it just narrows number of potential
hashes which we do have plenty.
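A minimal sketch of the arithmetic behind the slow-hash estimate above. It assumes one byte per password character and no packet framing overhead; the function name is made up for illustration.

```python
def required_bandwidth(candidates_per_second, bytes_per_candidate):
    """Host-to-board bandwidth (bytes/s) needed to feed candidates at a given rate.

    Assumption: one byte per character, no protocol overhead.
    """
    return candidates_per_second * bytes_per_candidate


# 10 000 c/s with 10-byte candidates, as in the estimate above
slow_hash = required_bandwidth(10_000, 10)
print(slow_hash, "bytes/s, i.e.", slow_hash * 8 // 1000, "kbit/s")
```

Under these assumptions a slow hash needs only a small fraction of a USB 2.0 link, which is why the link speed matters mainly for fast hashes.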
Is your code (and underlying libraries) thread safe?
Does increasing alignment over 8 bytes make any difference?
Will you be able to do computations and fetch data simultaneously on
the FPGA side with the current code?
Did you test it on more than 1 board?

Thanks,
Lukas
a***@openwall.net
2016-04-26 21:20:54 UTC
Hi Lukas,
Post by Lukas Odzioba
Hi Denis,
could you please mention specific benchmark IO numbers?
For very slow hashes we would be ok with 10 000 c/s times password length - let's say 10, which is 100 kB/s (800 kbit/s) per board.
For fast hashes we have a bottleneck even on GPUs with GB/s transfers,
so it is not a big deal I guess, it just narrows number of potential hashes which we do have plenty.
At the test site, it shows 20 MB/s. There's a small increase if the FPGA's buffers are doubled, quadrupled etc., and the absolute maximum (achieved if the host sends/receives in 512K pieces) is about 30 MB/s.
Post by Lukas Odzioba
Is your code (and underlying libraries) thread safe?
So far I haven't implemented parallel processing of several boards. I'm considering using asynchronous USB transfer functions, OpenMP, or several threads or processes. So far it's a standalone test application, and I'm not yet sure how exactly it would integrate into the JtR code. Any suggestions are welcome.
Post by Lukas Odzioba
Does increasing alignment over 8 bytes make any difference?
There's no difference in I/O performance. The 8-byte alignment comes from the FIFO width on the FPGA side (the end of the FIFO opposite the I/O pins); you can change it so that the input data better fits the application logic.
Post by Lukas Odzioba
Will you be able to do computations and fetch data simultaneously on the FPGA side with the current code?
Yes. While some pieces of virtual hardware operate the I/O pins and store/fetch data to/from the internal FIFOs, other FPGA logic operates at the other ends of the FIFOs and does computations at the same time.
Post by Lukas Odzioba
Did you test it on more than 1 board?
Yes, I tested it. Right now it's successfully doing sequential usb_bulk_transfer() blocking calls.


Denis
Lukas Odzioba
2016-04-26 22:04:00 UTC
Post by a***@openwall.net
At the test site, it shows 20 MB/s. There's a small increase if the FPGA's buffers are doubled, quadrupled etc., and the absolute maximum (achieved if the host sends/receives in 512K pieces) is about 30 MB/s.
This should be OK for a start.
Post by a***@openwall.net
So far I haven't implemented parallel processing of several boards. I'm considering using asynchronous USB transfer functions, OpenMP, or several threads or processes. So far it's a standalone test application, and I'm not yet sure how exactly it would integrate into the JtR code. Any suggestions are welcome.
To integrate it into john, you need some format that you want to
compute on this board.
Did you work on something like that, or maybe you just have
communication and the rest is still to do in this project?
Long story short we need three parts of the puzzle:
- fpga application that will compute hash(salt,password)
- communication, which you already have
- hooks in JtR to glue it up

If you don't have the first step yet, we need to select some slow hash
format first.
I guess solar or other guys on the list will have some suggestions
which I'd like to hear.

Thanks,
Lukas
Royce Williams
2016-04-26 22:17:28 UTC
Post by Lukas Odzioba
Post by a***@openwall.net
At the test site, it shows 20 MB/s. There's a small increase if the FPGA's
buffers are doubled, quadrupled etc., and the absolute maximum (achieved if
the host sends/receives in 512K pieces) is about 30 MB/s.
This should be OK for a start.
Post by a***@openwall.net
So far I haven't implemented parallel processing of several boards. I'm
considering using asynchronous USB transfer functions, OpenMP, or several
threads or processes. So far it's a standalone test application, and I'm
not yet sure how exactly it would integrate into the JtR code. Any
suggestions are welcome.
To integrate it into john, you need some format that you want to
compute on this board.
Did you work on something like that, or maybe you just have
communication and the rest is still to do in this project?
Long story short we need three parts of the puzzle:
- fpga application that will compute hash(salt,password)
- communication, which you already have
- hooks in JtR to glue it up
If you don't have the first step yet, we need to select some slow hash
format first.
I guess solar or other guys on the list will have some suggestions
which I'd like to hear.
Not knowing any better, descrypt seems like a good candidate.

https://github.com/Gifts/descrypt-ztex-bruteforcer

Royce
Solar Designer
2016-04-27 01:15:33 UTC
Post by Royce Williams
Post by Lukas Odzioba
I guess solar or other guys on the list will have some suggestions
which I'd like to hear.
Not knowing any better, descrypt seems like a good candidate.
https://github.com/Gifts/descrypt-ztex-bruteforcer
descrypt is what Denis has already been playing with in simulation,
prior to switching to this ZTEX board communication sub-project as a
prerequisite for actual implementation on the physical FPGAs. So, yes,
the plan was/is for him to revisit this now.

And maybe bcrypt next, or at the same time, if someone else wants to
work on it - Katja, maybe you, now that communication is working?

Alexander
Solar Designer
2016-04-27 01:27:22 UTC
Denis,

Thank you for bringing this in here, finally. And thank you for all of
your work so far!
Post by a***@openwall.net
So far I haven't implemented parallel processing of several boards. I'm considering using asynchronous USB transfer functions, OpenMP, or several threads or processes. So far it's a standalone test application, and I'm not yet sure how exactly it would integrate into the JtR code. Any suggestions are welcome.
I think async functions will likely give us the most flexibility.

On a related note, I've been planning an enhancement to JtR's formats
interface to make it async. Right now, the current batch of hashes has
to be fully computed by the time crypt_all() returns.(*) The change
I've been envisioning is to support having two batches of hashes in
flight at a time, with only the previous batch (not the latest batch)
required to be fully computed by the time a crypt_all() call returns.
We don't have this yet, but when we do it would benefit from having
async I/O and async computation off-loading functions.
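The "two batches in flight" idea can be sketched as follows. This is only an illustration of the scheduling, not JtR's formats interface: the class, the method signatures and the stand-in computation are hypothetical, and a single worker thread plays the role of the device.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_batch(batch):
    """Stand-in for off-loaded hashing (hypothetical; a real format would use a device)."""
    return [sum(word.encode()) for word in batch]


class TwoInFlight:
    """Start batch N, and only require batch N-1 to be finished on return."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)  # the "device"
        self._pending = None                            # future for the latest batch

    def crypt_all(self, batch):
        """Queue `batch`; return results of the previous batch (None on the first call)."""
        previous = self._pending
        self._pending = self._pool.submit(compute_batch, batch)
        # Both the previous and the new batch are outstanding while we wait here.
        return previous.result() if previous is not None else None

    def finish(self):
        """Collect the last outstanding batch and release the worker."""
        last, self._pending = self._pending, None
        result = last.result() if last is not None else None
        self._pool.shutdown()
        return result
```

After `crypt_all()` returns, the caller can prepare the next batch on the host while the latest one is still being computed, which is where the benefit of asynchronous I/O would come from.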

(*) This isn't exactly correct: it may be possible to postpone finishing
computation until cmp_all() returns, if it's called, or split computation
across get_hash*() functions. But that's a detail not essential to the
above paragraph, hence the footnote.

I also think you may want to switch to implementing the actual password
hashing (descrypt or/and bcrypt) along with your current I/O framework
now (blocking functions), and only then approach improving the I/O.

Alexander
a***@openwall.net
2016-04-27 19:55:14 UTC
Post by Solar Designer
I also think you may want to switch to implementing the actual password hashing (descrypt or/and bcrypt) along with your current I/O framework now (blocking functions), and only then approach improving the I/O.
As mentioned above, I've already created a descrypt application for the FPGA, and it works in the simulator. It includes onboard hash comparison. I estimate its performance at no less than 140 MH/s per chip (560 MH/s per board), and there's enough room for various optimizations.
Now I'm planning to:
- attach I/O;
- create a JtR format and develop the JtR integration;
- create an on-chip candidate generator. I'm considering mask mode.

Denis
a***@openwall.net
2016-08-01 15:56:44 UTC
Hi,

The FPGA-side application for descrypt is ready. Here are the details:

1. Communication framework improvement.
URL: https://github.com/Apingis/ztex_inouttraffic
The purpose of the improvement is to create an API independent of hardware implementation details (such as how switching between FPGAs is done on the Ztex board, or USB details).
In addition, the host side and the FPGA side exchange sequential packets of application data, and the framework provides functions for that.

2. Word Generator.
- Generates a word every cycle
- Implemented as a parametrized Verilog module (char_bits=7,ranges_max=8)
- It allows generation based on a supplied word list. However, it doesn't allow inserting the same word at more than 1 position, and I think such an ability wouldn't be useful for descrypt anyway.
- Allows specifying the starting index and the number of candidates to generate, for easy distribution of load among multiple FPGAs and boards
- C header: https://github.com/Apingis/ztex_inouttraffic/blob/master/host/pkt_comm/word_gen.h
- The generator and the rest of the framework use less than 2% of the FPGA's resources.
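The "starting index and number of candidates" feature lends itself to a simple split of the keyspace among devices. A minimal sketch, assuming contiguous slices of nearly equal size; the function name is hypothetical and this is not the framework's actual distribution code.

```python
def split_range(total, parts):
    """Split candidates 0..total-1 into `parts` contiguous (start, count) slices.

    Slice sizes differ by at most one candidate.
    """
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(total, parts)
    slices, start = [], 0
    for i in range(parts):
        count = base + (1 if i < extra else 0)
        slices.append((start, count))
        start += count
    return slices


# e.g. one slice per FPGA on a 4-FPGA board
for fpga, (start, count) in enumerate(split_range(10, 4)):
    print(f"fpga {fpga}: start={start} count={count}")
```

Each (start, count) pair would then be sent to one generator instance, so no candidate is produced twice and none is skipped.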

3. crypt(3) Standard DES password cracker for Ztex 1.15y FPGA board.
URL: https://github.com/Apingis/ztex-descrypt
The project is based on the communication framework.
Features:
- After generation, candidates are transferred to the "arbiter" unit. The arbiter's task is to distribute candidates among cores and gather results.
- The design is split into cores. Each core includes a 16-stage crypt pipeline and 1 bsearch comparator. At any given time, 16 candidates are in the crypt pipeline and another 16 are in the comparator.
- Results are output in 2 types of packets: "Comparator found equality" and "Processing of an input packet done".
- The current version includes a built bitstream with 24 cores that operate at 216 MHz. The comparators have up to 1023 entries in the hash table and operate at 156 MHz. That occupies 57% of the FPGA's resources.
- It performs at 700 MH/s with 511 or fewer hash table entries. If more hash table entries are used, the comparison stage becomes a bottleneck and performance decreases by approximately 10%.
- The architecture has reached an internal limit, so adding more cores doesn't improve performance. That's because a crypt takes 400 cycles to complete and the arbiter processes one candidate every cycle. The word generator also generates one candidate every cycle. So in theory, at 25 cores x 16 candidates the system reaches its architectural limit; the actual limit is a little less than that.
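The 25-core limit in the last item follows from the numbers given in the list. A simplified model, assuming each core accepts 16 new candidates every 400 cycles and that nothing else stalls; the function names are made up for this sketch.

```python
CRYPT_CYCLES = 400    # cycles for one crypt to complete (from the post)
PIPELINE_DEPTH = 16   # candidates held in one core's crypt pipeline (from the post)


def cores_at_limit(crypt_cycles=CRYPT_CYCLES, pipeline_depth=PIPELINE_DEPTH):
    """Cores that can be kept busy by a source issuing one candidate per cycle."""
    return crypt_cycles / pipeline_depth


def candidates_per_cycle(cores, crypt_cycles=CRYPT_CYCLES, pipeline_depth=PIPELINE_DEPTH):
    """Candidates consumed per cycle, capped at 1 by the arbiter and the generator."""
    return min(1.0, cores * pipeline_depth / crypt_cycles)


print(cores_at_limit())            # cores needed to absorb one candidate per cycle
print(candidates_per_cycle(24))    # share of the one-per-cycle limit used by 24 cores
print(candidates_per_cycle(30))    # extra cores beyond the limit add nothing
```

In this model the 24-core bitstream already consumes almost one candidate per cycle, which is consistent with the remark that more cores would not help.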


Now I'm concentrating on JtR integration.
The biggest issue is the usage of the candidate generator. There's nothing on this in 1.8.0-Jumbo-1.
I see mask mode development in bleeding-jumbo. The "mask" defined in mask.h and the on-board generator configuration are about the same and serve the same function. I've got a question:
Mask mode doesn't use the "format" API. However, everything processed in mask mode must utilize all software features, such as crash recovery, distribution across nodes, collection of stats etc., including features that are not implemented yet. That would definitely result in code duplication, which in turn would require more effort for further maintenance and development. What do you think?
There was a proposal by Solar to use the "format" API for mask mode: http://www.openwall.com/lists/john-dev/2012/04/30/4
If that proposal was attempted and rejected, it would be interesting to know the reasons. Were there any issues that were impossible to resolve?

Denis
a***@openwall.net
2016-10-12 17:13:51 UTC
Hi,

The descrypt-ztex format is ready. I've created a pull request against bleeding-jumbo:
https://github.com/magnumripper/JohnTheRipper/pull/2307

I've been running a test on 3 boards for the 2nd day.

It works; however, there are still issues.
1. Mask implementation details.
1.1. In mask mode, a plaintext candidate has to be reconstructed out of the template key and mask data. For that, mask mode creates an array 'mask_int_cand.int_cand' with 4 bytes for every possible candidate. Formats then use the array for lookups.
The problem is that the array uses too many resources: with mask "?w?a?a?a?a" it uses 310 MBytes of RAM and ~2 sec. of CPU time for initialization. I was unable to check with a mask such as "?w?b?b?b?b" because ?b doesn't seem to work correctly with a 7-bit format, but I calculate it would use up to 4 Gbytes of RAM and would cause a substantial delay on program startup.
The descrypt-ztex format uses divisions to reconstruct plaintext candidates and doesn't use the 'mask_int_cand.int_cand' array. With an on-device comparator that filters out the overwhelming majority of computation results, reconstruction of plaintext candidates becomes a rarely used function (called several times per second). Also, I can't exclude cases where Ztex devices are connected to a cost-optimized host system; in such cases the host system might not have enough RAM.
So it would be great to skip allocation and initialization of the 'mask_int_cand.int_cand' array if the format doesn't use it.
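To illustrate the trade-off in item 1.1, here is a minimal sketch of reconstructing a candidate from its index by repeated division, next to the size of a 4-bytes-per-candidate lookup table. It is not the descrypt-ztex code: the function names, the ordering (last position varies fastest) and the 95-character set assumed for ?a are all assumptions of this sketch.

```python
# Assumption: ?a covers the 95 printable ASCII characters 0x20..0x7E.
PRINTABLE = "".join(chr(c) for c in range(0x20, 0x7F))


def candidate_from_index(index, charsets):
    """Rebuild the candidate for `index` (mixed-radix decoding, last position fastest)."""
    chars = []
    for charset in reversed(charsets):
        index, digit = divmod(index, len(charset))
        chars.append(charset[digit])
    if index:
        raise ValueError("index is outside the mask's keyspace")
    return "".join(reversed(chars))


def lookup_table_bytes(charsets, bytes_per_entry=4):
    """Memory needed to precompute one entry per possible candidate."""
    size = bytes_per_entry
    for charset in charsets:
        size *= len(charset)
    return size


mask = [PRINTABLE] * 4  # the "?a?a?a?a" part of the mask
print(candidate_from_index(0, mask), candidate_from_index(95 ** 4 - 1, mask))
print(f"{lookup_table_bytes(mask) / 2 ** 20:.1f} MiB for the table")
```

With 95 characters in four positions the table comes to roughly the 310 MBytes reported above, while the division-based version needs no table and costs a few divisions per reconstructed candidate.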

2. Self-test.
2.1. For the test array, I've generated several hashes with the same salt and the same partial binaries. That is, the on-device comparator is loaded with the first 35 bits of the hashes, and these are the same, resulting in false positives. That causes the self-test to fail. If those hashes are used to create a password file, then it works as expected: false positives are successfully ruled out with cmp_exact(), including the case where several false positives occur in one crypt_all() call.
2.2. "Warning: salt() returned misaligned pointer" self-test message. The format has a 2-byte salt and I've set salt alignment to 2 bytes. Is that correct, or should I set salt_align to ARCH_WORD? The salt is rarely accessed on the host system.
2.3. I've implemented a warning for when the mask is too short, which results in performance degradation because the USB 2.0 link doesn't have enough bandwidth. The warning appears during self-test and it looks confusing. How can a format know when self-test is running, to suppress the warning? Is it planned to add usage of mask in self-test?

3. FMT_REMOVE. How does a format know when some binary was removed? If that's possible, I'd prefer to keep the comparator configuration until it actually changes. Skipping unnecessary transfers to the device would improve performance in the case where only one salt is being audited. So far FMT_REMOVE is not implemented.
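The false positives in item 2.1 come from comparing only part of each hash on the device. A minimal sketch of such a two-stage check and of the expected false-positive rate follows. The helper names are hypothetical, which 35 bits the comparator actually uses is an assumption (the top bits here), and the rate estimate assumes uniformly distributed hash values.

```python
PARTIAL_BITS = 35   # bits loaded into the on-device comparator (from the post)
HASH_BITS = 64      # size of a descrypt binary


def partial_key(binary):
    """Top PARTIAL_BITS bits of a 64-bit hash value (bit selection is an assumption)."""
    return binary >> (HASH_BITS - PARTIAL_BITS)


def device_filter(computed, loaded):
    """What a partial comparator can decide: does the partial value match any entry?"""
    return partial_key(computed) in {partial_key(b) for b in loaded}


def host_exact_check(computed, loaded):
    """Full comparison on the host, which rules out false positives."""
    return computed in loaded


def expected_false_positives_per_second(hash_rate, entries):
    """Expected partial matches per second among wrong candidates."""
    return hash_rate * entries / 2 ** PARTIAL_BITS


target = 0x0123456789ABCDEF
near_miss = target ^ 1  # same top 35 bits, different full value
print(device_filter(near_miss, [target]), host_exact_check(near_miss, [target]))
print(f"{expected_false_positives_per_second(700e6, 511):.1f} per second")
```

At 700 MH/s with 511 loaded entries this estimate gives on the order of ten partial matches per second, which fits the "called several times per second" remark about candidate reconstruction in item 1.1.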


Denis
atom
2016-11-07 16:57:17 UTC
This is nice work Denis, my congratulations! I'd love to see you adding a patch
for hashcat, too. Maybe I can help you with other algorithms as well.

Anyway, I had some problems getting this to work. There are two problems, and
I know how to solve at least one of them.

1. I have three boards, but they all had some custom firmware installed,
one that wasn't "USB-FPGA Module 1.15y (default)". The following line in
ztex_scan.c will fail:

else if (!strncmp("USB-FPGA Module 1.15y (default)", dev->product_string, 31)) {

The user will end up with an error "no valid ZTEX devices found" and JtR
will quit. I've recompiled with ZTEX_DEBUG=1, so that it prints
the current firmware name on the board. I used that name to replace the string
in the comparison above, then it worked. I think you need to find a
different solution to that, or else simply ignore the running firmware and
overwrite it.

2. There seems to be some timing issue somewhere when it comes to uploading
the bitstream. JtR prints the following error:

SN XXXXXXXXXX: uploading bitstreams.. ok
SN XXXXXXXXXX: device_list_check_bitstreams(): no bitstream or wrong type

Since I have multiple devices, JtR will start cracking as long as at least one
of them did not run into that error on start. After some time (while JtR
cracks), JtR will try to upload the bitstream again to the other (failed)
devices. This sometimes works, but sometimes it does not. This will repeat
until all devices are initialized.

Here's a result after running for 2 days on a single hash:

0g 2:17:04:07 7.35% (ETA: 2016-12-11 22:04) 0g/s 2081Mp/s 2081Mc/s 2081MC/s aaa;o7}r..ae\;o7}r

That's around 693MH/s per board.
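The figures in the status line can be cross-checked with a little arithmetic. The sketch below assumes the three boards mentioned earlier in this post share the load evenly, and it guesses a keyspace of 8 positions with 95 printable characters each; the mask itself is not shown in the post, so that keyspace is an assumption.

```python
from datetime import timedelta

TOTAL_RATE = 2081e6      # c/s from the status line, all boards together
BOARDS = 3               # boards mentioned in the post
ELAPSED = timedelta(days=2, hours=17, minutes=4, seconds=7)
KEYSPACE = 95 ** 8       # assumption: 8 positions x 95 printable characters

per_board = TOTAL_RATE / BOARDS
full_run = timedelta(seconds=KEYSPACE / TOTAL_RATE)
progress = ELAPSED / full_run   # timedelta / timedelta gives a float

print(f"{per_board / 1e6:.1f} MH/s per board")
print(f"{progress:.2%} of the assumed keyspace after {ELAPSED}")
print(f"full run: about {full_run.days} days")
```

Under these assumptions the computed progress agrees with the percentage in the status line, which suggests the guessed keyspace is at least plausible.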

- Jens
Royce Williams
2016-11-07 17:23:09 UTC
Post by atom
This is nice work Denis, my congratulations! I'd love to see you adding a patch
for hashcat, too. Maybe I can help you with other algorithms as well.
Intriguing! :)
Post by atom
Anyway, I had some problems getting this to work. There are two problems, and
I know how to solve at least one of them.
1. I have three boards, but they all had some custom firmware installed, one
that wasn't "USB-FPGA Module 1.15y (default)". The following line in
ztex_scan.c will fail:
else if (!strncmp("USB-FPGA Module 1.15y (default)", dev->product_string, 31)) {
The user will end up with an error "no valid ZTEX devices found" and JtR
will quit. I've recompiled with ZTEX_DEBUG=1, so that it prints
the current firmware name on the board. I used that name to replace the string
in the comparison above, then it worked. I think you need to find a
different solution to that, or else simply ignore the running firmware and
overwrite it.
+1. FWIW for others until there's a better fix, I worked around this
by using the EZ-USB SDK tools to reflash the default firmware:

# cd ./ztex/default/usb-fpga-1.15y
# ./prog-1.15y.sh
Firmware to EEPROM upload time: 2155 ms
Writing configuration data.

... and then power-cycling the board.
Post by atom
2. There seems to be some timing issue somewhere when it comes to uploading
the bitstream. JtR prints the following error:
SN XXXXXXXXXX: uploading bitstreams.. ok
SN XXXXXXXXXX: device_list_check_bitstreams(): no bitstream or wrong type
Since I have multiple devices, JtR will start cracking as long as at least one
of them did not run into that error on start. After some time (while JtR
cracks), JtR will try to upload the bitstream again to the other (failed)
devices. This sometimes works, but sometimes it does not. This will repeat
until all devices are initialized.
I only had this problem with one of my devices, and it has persistently
been the same device.
Post by atom
0g 2:17:04:07 7.35% (ETA: 2016-12-11 22:04) 0g/s 2081Mp/s 2081Mc/s 2081MC/s aaa;o7}r..ae\;o7}r
That's around 693MH/s
That's similar to my performance.

Royce
Elijah SmarTeam
2016-11-07 20:42:59 UTC
Slightly off-topic: how do you power those 1.15y boards? From the site I'm getting
a rather wide selection of "4.5 to 16v 5.5/2.1/positive center" (
http://wiki.ztex.de/doku.php?id=en:ztex_boards:ztex_fpga_boards:power_supply_selection).
But what about amps? As I have only one of those, I don't want to burn it :)
Maybe some well-known replacement option exists for the power supply unit? (from
some routers or other consumer-grade devices)
Solar Designer
2016-11-07 20:59:05 UTC
Post by Elijah SmarTeam
Slightly off-topic: how do you power those 1.15y boards? From the site I'm getting
a rather wide selection of "4.5 to 16v 5.5/2.1/positive center" (
http://wiki.ztex.de/doku.php?id=en:ztex_boards:ztex_fpga_boards:power_supply_selection).
But what about amps? As I have only one of those, I don't want to burn it :)
Maybe some well-known replacement option exists for the power supply unit? (from
some routers or other consumer-grade devices)
This would be on-topic on the john-users list.

You can use an ATX power supply. You can also use a power adapter as
you described above. I tried a 12V/5A and it was getting too hot to the
touch (even though the actual load current was well under 4A). I later
bought some 12V/6A adapters and used those - much cooler. All of them
were cheap ones (about $12) off eBay. Maybe a higher quality 12V/5A
would not get as hot.

Do not use laptop power adapters. They usually have the same plugs and
polarity, but deliver 19V. ZTEX boards have 16V Zener diodes parallel
to input to prevent overvoltage like this, but you probably don't want
to test and see what burns first and in what exact way. ;-)

Alexander
Solar Designer
2016-11-07 21:52:46 UTC
Post by Elijah SmarTeam
wide selection of "4.5 to 16v 5.5/2.1/positive center"
I'm not sure if you realize, but obviously you should want to stay
close to the upper end of the voltage range, to keep the current down.
Since you don't want to get dangerously close to 16V (or you'd risk
exceeding it), the range of widely available power adapters
pretty much means 12V is the way to go. Besides, 12V is what those
boards were most likely previously powered with (and thus tested with).

That said, I just recalled that Butterfly Labs' Jalapeno Bitcoin ASICs
shipped with 13V/6A adapters with the correct plugs. If you have one, you
should be able to reuse it as well. (I noticed this when
retiring mine, but didn't try its adapter with a ZTEX board.) It will
probably be very slightly more efficient (lower current, so lower loss
in the cable).

Alexander
Elijah SmarTeam
2016-11-09 08:11:20 UTC
Jens, can you please clarify how this speed
correlates to hashcat tests on GPU? Are those numbers "simply comparable",
or is there something to take into consideration? There is benchmark data from
Jeremy <https://gist.github.com/epixoip/6ee29d5d626bd8dfe671a2d8f188b77b>, and
according to it the speed for the 1080 GTX FE is around 900 MH/s for hashcat's
descrypt. So does this mean that the performance of this "40W max" 1.15y device
for crypt(3) is about 77% of one of the top GPUs available?
Elijah SmarTeam
2016-11-13 16:57:01 UTC
similar thing for me

*john -test -format=descrypt-ztex*
Benchmarking: descrypt-ztex, traditional crypt(3) [DES ZTEX]...
SN XXXXXXXXXX: uploading bitstreams.. ok
SN XXXXXXXXXX: device_list_check_bitstreams(): no bitstream or wrong type
no valid ZTEX devices found
*john -test -format=descrypt-ztex*
Benchmarking: descrypt-ztex, traditional crypt(3) [DES ZTEX]...
SN XXXXXXXXXX: uploading bitstreams.. SN XXXXXXXXXX: usb_bulk_write returns -1 (LIBUSB_ERROR_IO)
failed
no valid ZTEX devices found
*john -test -format=descrypt-ztex*
Benchmarking: descrypt-ztex, traditional crypt(3) [DES ZTEX]...
SN XXXXXXXXXX: uploading bitstreams.. SN XXXXXXXXXX: usb_bulk_write returns -1 (LIBUSB_ERROR_IO)
failed
no valid ZTEX devices found
*john -test -format=descrypt-ztex*
Benchmarking: descrypt-ztex, traditional crypt(3) [DES ZTEX]...
SN XXXXXXXXXX: uploading bitstreams.. ok
1 device(s) ZTEX 1.15y ready
SN: XXXXXXXXXX productId: 10.15.0.0 "inouttraffic UFM 1.15y" busnum:3 devnum:41
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Warning: salt() returned misaligned pointer
DONE
Warning: "Many salts" test limited: 5/256
Many salts: 1224K c/s real, 2674K c/s virtual
Only one salt: 1180K c/s real, 2570K c/s virtual
Frank Dittrich
2016-11-13 17:25:18 UTC
Post by Elijah SmarTeam
Warning: "Many salts" test limited: 5/256
Many salts: 1224K c/s real, 2674K c/s virtual
Only one salt: 1180K c/s real, 2570K c/s virtual
BTW: That warning means, you'll need a longer test to get correct c/s
rates reported.
Maybe you can try --test=60 instead of just --test.

Frank
magnum
2016-11-14 06:56:11 UTC
Post by Frank Dittrich
Warning: "Many salts" test limited: 5/256
Many salts: 1224K c/s real, 2674K c/s virtual
Only one salt: 1180K c/s real, 2570K c/s virtual
BTW: That warning means, you'll need a longer test to get correct c/s
rates reported.
Maybe you can try --test=60 instead of just --test.
There's even an easter egg for this: Use --test=-1 and the tests will
run for as long as is needed to process all salts.

magnum
a***@openwall.net
2016-11-14 07:27:58 UTC
Hi,

Thank you all for the testing. I appreciate feedback and suggestions. Some issues, including severe ones, emerged after testing on more boards.

The major issue is: on some boards bitstream upload fails.
Actually bitstream upload consists of two steps: 1)upload and 2)check if bitstream operates.
1) - The following error:
"SN XXXXXXXXXX: uploading bitstreams.. SN XXXXXXXXXX: usb_bulk_write returns -1 (LIBUSB_ERROR_IO)
failed"
means general USB I/O error while upload. That can be faulty hardware, cable, etc. (You often get same error when you unplug or power-off the board while upload). On such error please try uploading bitstream using FWLoader from Ztex SDK - repeat 10 times or so to see it problem persists.
Usage: java -cp [<path to>]FWLoader.jar FWLoader [-sf <fpga_id:0-3>] -uf <jtr_dir>/run/ztex/ztex115y_descrypt.bit
2) - The other error:
"SN XXXXXXXXXX: uploading bitstreams.. ok
SN XXXXXXXXXX: device_list_check_bitstreams(): no bitstream or wrong type"
means the upload was done but the subsequent check resulted in an error.
It looks like there's some issue with the low-speed communication interface (the one that goes via usb_control_messages, endpoint 0 and the MCU). That's used to check if the bitstream operates. There's some not very well written asynchronous circuitry in the HDL unit, which I was going to rewrite.
In such a case, could you please use FWLoader to upload the bitstream to all 4 FPGAs and then run JtR several times? That would help to narrow down the problem.
else if (!strncmp("USB-FPGA Module 1.15y (default)", dev->product_string, 31)) {
The user will end up with an error "no valid ZTEX devices found" and JtR will quit. I've recompiled with ZTEX_DEBUG=1, so that it prints the name of the firmware currently on the board. I used that name to replace the string in the comparison above, and then it worked. I think you need to find a different solution to that, if not simply ignore the running firmware and overwrite it.
When I was implementing that, I thought that if a board has 3rd party firmware, then the user has intentionally dedicated the board to some other application. BTC miners (at least those I took a look at) often allow the user to select which devices to use and never write firmware into EEPROM. I guessed the user would typically write custom firmware into EEPROM (using the Ztex SDK) to reduce power-up time and/or to simplify running from the command line. Anyway, it is a good question what would be best for JtR to do in such a case. "Greedy behavior", that is, using all boards regardless of what's installed, looks like an easy solution, but what if somebody runs several applications? For now, I'm going to introduce an informational message when there are boards with 3rd party firmware.
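A rough sketch of how such an informational message could be driven is below. It is only an illustration: classify_firmware() and count_third_party() are hypothetical names, not functions from the JtR tree; the only thing taken from the code quoted above is the product string and the 31-character prefix comparison.

```c
#include <string.h>

/* Hypothetical classification of a board by its USB product string. */
enum fw_kind { FW_DEFAULT, FW_THIRD_PARTY };

/* Product string from the strncmp() comparison quoted above (31 chars). */
static const char default_fw[] = "USB-FPGA Module 1.15y (default)";

enum fw_kind classify_firmware(const char *product_string)
{
	/* Compare the prefix only, like the quoted check does. */
	if (!strncmp(default_fw, product_string, sizeof(default_fw) - 1))
		return FW_DEFAULT;
	return FW_THIRD_PARTY;
}

/*
 * Count boards that run something other than the default firmware,
 * so that a single informational message can be printed for them
 * instead of silently skipping (or silently overwriting) them.
 */
int count_third_party(const char *const *product_strings, int n)
{
	int i, skipped = 0;

	for (i = 0; i < n; i++)
		if (classify_firmware(product_strings[i]) == FW_THIRD_PARTY)
			skipped++;
	return skipped;
}
```

With a count like this, the host code could report how many boards were left alone and still quit with "no valid ZTEX devices found" only when no usable board remains.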
Denis

Solar Designer
2016-11-09 22:11:11 UTC
Permalink
Hi Denis,

I am sorry I didn't reply to your questions earlier.
Post by a***@openwall.net
2. Self-test.
2.1. For the test array, I've generated several hashes with the same salt and partial binaries. That is, the on-device comparator is loaded with the first 35 bits of the hashes and they are the same, resulting in false positives. That causes the self-test to fail. If those hashes are used for creation of a password file then it works as expected: false positives are successfully ruled out with cmp_exact(), including the case where several false positives occur in one crypt_all() call.
I guess you mean these two, which you had to comment out because of the
issue above? -

// These cause self-test to fail:
// cmp_one() returns true and cmp_exact() returns false
// because of partial binaries in the comparator
//
//{"35LSBeq.RUJA.", "==tCG*l2"},
//{"35LSBeq.Xbkho", "==*]fyOo"},
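To illustrate why those two test vectors trip the self-test, here is a minimal sketch of a 35-bit partial comparison next to a full one. The assumptions are mine: the hash is treated as a 64-bit value, the partial binary is taken to be its 35 least significant bits (as the "35LSBeq" names suggest), and partial_match() / exact_match() are hypothetical stand-ins for the on-device comparator and for the role cmp_exact() plays on the host.

```c
#include <stdint.h>

#define PARTIAL_BITS 35
#define PARTIAL_MASK ((UINT64_C(1) << PARTIAL_BITS) - 1)

/* Stand-in for the on-device comparator: only the low 35 bits count. */
int partial_match(uint64_t computed, uint64_t target)
{
	return ((computed ^ target) & PARTIAL_MASK) == 0;
}

/* Stand-in for the host-side full comparison. */
int exact_match(uint64_t computed, uint64_t target)
{
	return computed == target;
}
```

Two hashes that differ only above bit 34 pass partial_match() and fail exact_match(), which is the "cmp_one() returns true and cmp_exact() returns false" situation described in the comment.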
Post by a***@openwall.net
2.2. "Warning: salt() returned misaligned pointer" self-test message. The format has a 2-byte salt and I've set salt alignment to 2 bytes - is that correct or I should set salt_align to ARCH_WORD? The salt is rarely accessed on the host system.
"is that correct or I should set salt_align to ARCH_WORD?" - neither.

You have:

static void *salt(char *ciphertext)
{
	static unsigned char out[2];

	int salt = DES_raw_get_salt(ciphertext);
	out[0] = salt;
	out[1] = salt >> 8;

	return out;
}

which doesn't provide any guaranteed alignment for its return value (the
"out" array). Thus, to match this function above, you should set
SALT_ALIGN to 1. This would be correct assuming that you don't depend
on the salts having any specific alignment in other places; if you do,
then you'd need to adjust salt() to provide that alignment and to set
SALT_ALIGN accordingly. (But I think you're fine with no alignment and
SALT_ALIGN set to 1.)
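If alignment were needed, the adjustment to salt() mentioned above could look roughly like this. It is a sketch, not the format's actual code: salt_aligned() and its raw_salt parameter are hypothetical, standing in for salt() and for the value returned by DES_raw_get_salt().

```c
#include <stdint.h>

#define SALT_SIZE 2

/* Packs a raw salt value into a buffer with guaranteed alignment. */
void *salt_aligned(int raw_salt)
{
	/*
	 * The uint16_t member makes the whole union aligned at least as
	 * strictly as uint16_t, so the byte array inherits that alignment.
	 */
	static union {
		unsigned char bytes[SALT_SIZE];
		uint16_t align;
	} out;

	out.bytes[0] = raw_salt;      /* low byte */
	out.bytes[1] = raw_salt >> 8; /* high byte */

	return out.bytes;
}
```

With a buffer like this, SALT_ALIGN could be set to the alignment of uint16_t (2 on common platforms). With the plain unsigned char array, 1 remains the matching value.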
Post by a***@openwall.net
2.3. I've implemented a warning when the mask is too short and that results in performance degradation because the USB 2.0 link doesn't have enough bandwidth. The warning appears during self-test and it looks confusing. How can a format know when self-test is running, to suppress the warning? Is it planned to add usage of mask in self-test?
Oh, so this is why I saw that spurious warning.

I think yes, you can detect self-test, but I'd rather someone else
(more familiar with jumbo specifics) answers this. magnum, Jim?
Post by a***@openwall.net
3. FMT_REMOVE. How does format know when some binary was removed? If that's possible I'd prefer to keep comparator configuration until it actually changes - skipping unnecessary transfers to device would improve performance in case where only one salt is being audited. So far FMT_REMOVE is not implemented.
We added FMT_REMOVE retroactively, after Sayantan's OpenCL formats
started relying on salt->list being up to date, so that we wouldn't
break their assumption. Thus, I think your specific question wasn't
actually discussed. I think Sayantan just did it somehow. My guess is
formats should be checking salt->count in order to detect whether
anything was removed or not without having to traverse the list first.
Looking at the code now, I see e.g. opencl_DES_bs_plug.c:
update_buffer() implements something like this, albeit with some
thresholds - perhaps that's to avoid too frequent updates of the GPU's
list of hashes (not after every hash cracked, but once in a while, not
to slow down cracking when a lot of hashes are getting cracked quickly).
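The salt->count idea could be sketched as follows. This is my guess at a minimal form, not what the OpenCL formats do: struct salt_view is a cut-down stand-in that carries only the count field, and struct cmp_cache and comparator_needs_update() are hypothetical names.

```c
/* Cut-down stand-in for the per-salt data: only the field used here. */
struct salt_view {
	int count; /* number of hashes still loaded for this salt */
};

/* Hypothetical record of what was last sent to the device. */
struct cmp_cache {
	const struct salt_view *salt;
	int count;
};

/*
 * Returns 1 when the comparator configuration has to be re-sent
 * (different salt, or hashes were removed since the last transfer),
 * 0 when the configuration already on the device can be kept.
 */
int comparator_needs_update(struct cmp_cache *cache,
                            const struct salt_view *salt)
{
	if (cache->salt == salt && cache->count == salt->count)
		return 0;

	cache->salt = salt;
	cache->count = salt->count;
	return 1;
}
```

In the single-salt case this skips the transfer on every crypt_all() call until a hash is actually cracked and removed. A threshold, as in update_buffer(), could be layered on top to avoid re-sending after every single crack.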

I hope the above helps.

Now some requests from me:

Please get your Verilog sources and other source files required for
rebuilding the bitstream into the JtR source tree as well, in some
subdirectory. They won't be used during normal JtR builds, but we'd
like to have the full source code in the same tree anyway.

Please add copyright and license statements to all of your source files.

If some of the files directly build upon others' work (and still contain
remnants of it sufficient to be subject to copyright), then please be
sure to include their copyright statements as well, and make sure their
work can be placed under the same license. (That is, was already
licensed in the same or in a compatible way.)

Thanks,

Alexander
magnum
2016-11-10 20:23:29 UTC
Permalink
Post by Solar Designer
Post by a***@openwall.net
2.3. I've implemented a warning when the mask is too short and that results in performance degradation because the USB 2.0 link doesn't have enough bandwidth. The warning appears during self-test and it looks confusing. How can a format know when self-test is running, to suppress the warning?
Oh, so this is why I saw that spurious warning.
I think yes, you can detect self-test, but I'd rather someone else
(more familiar with jumbo specific) answers this. magnum, Jim?
You can test for bench_running (declare it as extern volatile int). It
will be true during self-test as well as benchmarking.
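Something along these lines, as a sketch only: warn_short_mask() and the warning text are made up for illustration, and bench_running is defined locally here just so the snippet compiles on its own. In the format it would be the extern declaration described above.

```c
#include <stdio.h>

/*
 * In the JtR tree this would be "extern volatile int bench_running;".
 * It is defined here only to keep the sketch standalone.
 */
volatile int bench_running;

/*
 * Prints the short-mask warning unless a self-test or benchmark is
 * running. Returns 1 if the warning was printed, 0 otherwise.
 */
int warn_short_mask(int mask_too_short)
{
	if (!mask_too_short || bench_running)
		return 0;

	fprintf(stderr,
	    "Warning: mask is too short, expect reduced performance\n");
	return 1;
}
```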
Post by Solar Designer
Post by a***@openwall.net
Is it planned to add usage of mask in self-test?
You can use '-test -mask' for benchmarking (optionally with a mask of
choice), but it won't (yet) test/verify proper functionality (self tests
are skipped).

magnum