Science and Simulation Performance

Beyond rendering workloads, we also have a number of key math-heavy workloads of the sort that systems like the Magnetar X64T were designed for.

AES Encoding

Algorithms built on AES have spread far and wide as a ubiquitous tool for encryption. This is another CPU-limited test, and modern CPUs have special AES instruction pathways to accelerate its performance. We often see scaling in both frequency and core count with this benchmark. We use the latest version of TrueCrypt and run its benchmark mode over 1 GB of in-DRAM data. Results shown are the GB/s average of encryption and decryption.
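As a sketch of how a throughput number like this is derived (this is not TrueCrypt's code; Python's standard library has no AES primitive, so a trivial XOR transform stands in for the cipher, and `throughput_gbs`, `xor_cipher`, and `KEY` are hypothetical names):

```python
import os
import time

def throughput_gbs(transform, buf, repeats=3):
    """Time `transform` over `buf` and return the best GB/s rate."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        transform(buf)
        best = min(best, time.perf_counter() - start)
    return len(buf) / best / 1e9

# Stand-in "cipher": XOR with a repeating one-byte key. A real run would
# call an AES routine so the CPU's dedicated AES units are exercised.
KEY = 0x5A

def xor_cipher(buf):
    return bytes(b ^ KEY for b in buf)

if __name__ == "__main__":
    data = os.urandom(1 << 20)  # 1 MiB here; the review uses 1 GB in DRAM
    print(f"{throughput_gbs(xor_cipher, data):.3f} GB/s")
```

The real benchmark reports the average of the encrypt and decrypt rates rather than a single direction.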

(5-3) AES Encoding

Our test here is limited to 64 threads, which the X64T uses to full effect, applying its maximum frequency across all of the cores.

Agisoft Photoscan 1.3.3

Photoscan stays in our benchmark suite from the previous benchmark scripts, but is updated to the 1.3.3 Pro version. It is a variable-threaded workload with many segments, so features such as Speed Shift or XFR on the latest processors come into play.

The concept of Photoscan is about translating many 2D images into a 3D model - so the more detailed the images, and the more you have, the better the final 3D model in both spatial accuracy and texturing accuracy. The algorithm has four stages, with some parts of the stages being single-threaded and others multi-threaded, along with some cache/memory dependency in there as well. For some of the more variable threaded workload, features such as Speed Shift and XFR will be able to take advantage of CPU stalls or downtime, giving sizeable speedups on newer microarchitectures.

For the update to version 1.3.3, the Agisoft software now supports command line operation. Agisoft provided us with a set of new images for this version of the test, and a python script to run it. We’ve modified the script slightly by changing some quality settings for the sake of the benchmark suite length, as well as adjusting how the final timing data is recorded. The python script dumps the results file in the format of our choosing. For our test we obtain the time for each stage of the benchmark, as well as the overall time.
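A minimal sketch of the stage-timing approach described above (this is not Agisoft's actual script; the stage names and the dummy workloads are placeholders for the real API calls, which would align photos, build the dense cloud, and so on):

```python
import json
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Record wall time for one pipeline stage under `name`."""
    start = time.perf_counter()
    yield
    timings[name] = time.perf_counter() - start

# Hypothetical stage bodies; a real script would invoke the Agisoft
# pipeline stages here instead of these dummy loops.
with stage("align"):
    sum(range(100_000))
with stage("dense_cloud"):
    sum(range(100_000))

# Record the overall time alongside the per-stage times, then dump the
# results file in the format of our choosing.
timings["total"] = sum(timings.values())
print(json.dumps(timings, indent=2))
```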

(1-1) Agisoft Photoscan 1.3, Complex Test

Because Photoscan is a more varied workload, the gains from a chip like this are more muted.

3D Particle Movement v2.1: Non-AVX and AVX2/AVX512

This is the latest version of the benchmark designed to simulate semi-optimized scientific algorithms taken directly from my doctoral thesis. This involves randomly moving particles in a 3D space using a set of algorithms that define random movement. Version 2.1 improves over 2.0 by passing the main particle structs by reference rather than by value, and decreasing the number of double->float->double recasts the compiler was adding in.

The initial version of v2.1 is a custom C++ binary of my own code, with flags in place to allow for multiple loops of the code and a custom benchmark length. By default this version runs six times and outputs the average score to the console, which we capture with a redirection operator that writes to file.
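For illustration, here is a simplified Python analogue of the benchmark's core loop (the real 3DPM is compiled C++; the sphere-point-picking method shown is just one of the movement algorithms the benchmark compares, and the particle and step counts are arbitrary):

```python
import math
import random
import time

def move_particles(particles, steps):
    """Randomly move each particle by one unit step, in place.

    Mutating the list in place mirrors the v2.1 fix of passing the
    particle structs by reference instead of copying them.
    """
    for _ in range(steps):
        for p in particles:
            # Pick a uniformly random direction on the unit sphere.
            theta = random.uniform(0.0, 2.0 * math.pi)
            z = random.uniform(-1.0, 1.0)
            r = math.sqrt(1.0 - z * z)
            p[0] += r * math.cos(theta)
            p[1] += r * math.sin(theta)
            p[2] += z

def score(n_particles=1000, steps=100):
    """Return millions of particle movements per second."""
    particles = [[0.0, 0.0, 0.0] for _ in range(n_particles)]
    start = time.perf_counter()
    move_particles(particles, steps)
    elapsed = time.perf_counter() - start
    return n_particles * steps / elapsed / 1e6

if __name__ == "__main__":
    print(f"{score():.2f} million movements/sec")
```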


An example run on an i7-6950X

For v2.1, we also have a fully optimized AVX2/AVX512 version, which uses intrinsics to get the best performance out of the software. This was done by a former Intel AVX-512 engineer who now works elsewhere. According to Jim Keller, there are only a couple dozen or so people who understand how to extract the best performance out of a CPU, and this guy is one of them. To keep things honest, AMD also has a copy of the code, but has not proposed any changes.

The 3DPM test is set to output millions of movements per second, rather than time to complete a fixed number of movements. This way the data scales linearly with performance, and is easier to read as a result.

(2-2) 3D Particle Movement v2.1 (Peak AVX)

Because the Intel processors have AVX-512, they win here. This is one of the fundamental differences that might push people in the direction of an Intel system, should their code be AVX-512 accelerated. For AVX2 code paths, the Magnetar takes another 36% lead over the stock processor.

NAMD 2.13 (ApoA1): Molecular Dynamics

One of the popular science fields is modelling the dynamics of proteins. By looking at how the energy of active sites within a large protein structure changes over time, scientists can calculate the activation energies required for potential interactions. This becomes very important in drug discovery. Molecular dynamics also plays a large role in protein folding, in understanding what happens when proteins misfold, and in what can be done to prevent it. Two of the most popular molecular dynamics packages in use today are NAMD and GROMACS.

NAMD, or Nanoscale Molecular Dynamics, has already been used in extensive coronavirus research on the Frontera supercomputer. Typical simulations using the package are measured in how many nanoseconds per day can be calculated with the given hardware, and the ApoA1 protein (92,224 atoms) has been the standard model for molecular dynamics simulation.

Luckily the compute can home in on a typical 'nanoseconds-per-day' rate after only 60 seconds of simulation; however, we stretch that out to 10 minutes to take a more sustained value, as by that time most turbo limits should have expired. The simulation itself works with 2 femtosecond timesteps.
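The nanoseconds-per-day arithmetic itself is simple; a quick sketch (the step count and wall time below are illustrative values, not measured results):

```python
def ns_per_day(steps_completed, wall_seconds, timestep_fs=2.0):
    """Convert completed MD timesteps into simulated nanoseconds per day."""
    simulated_ns = steps_completed * timestep_fs / 1e6  # fs -> ns
    return simulated_ns / wall_seconds * 86_400  # scale to a full day

# e.g. 500,000 steps of 2 fs in 600 s of wall time is 1 ns simulated in
# 10 minutes, i.e. about 144 ns/day
print(ns_per_day(500_000, 600))
```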

(2-5) NAMD ApoA1 Simulation

NAMD requires a lot more core-to-core communication as well as memory access, and so we reach an asymptotic limit in our test here.

DigiCortex v1.35

DigiCortex is a pet project for the visualization of neuron and synapse activity in the brain. The software comes with a variety of benchmark modes, and we take the small benchmark which runs a 32k neuron/1.8B synapse simulation, similar to a small slug.

The results on the output are given as a fraction of whether the system can simulate in real-time, so anything above a value of one is suitable for real-time work. The benchmark offers a 'no firing synapse' mode, which in essence detects DRAM and bus speed, however we take the firing mode which adds CPU work with every firing.

I reached out to the author of the software, who has added in several features to make the software conducive to benchmarking. The software comes with a series of batch files for testing, and we run the ‘small 64-bit nogui’ version with a modified command line to allow for ‘benchmark warmup’ and then perform the actual testing.

The software originally shipped with a benchmark that recorded the first few cycles and output a result. So while on fast multi-threaded processors this made the benchmark last less than a few seconds, on slow dual-core processors it could run for almost an hour. There is also the issue of DigiCortex starting with a base neuron/synapse map in 'off mode', giving a high result in the first few cycles as none of the nodes are active yet. We found that the performance settles into a steady state after a while (when the model is actively in use), so we asked the author to allow for a 'warm-up' phase and for the benchmark to report the average over a given sample time.

For our test, we give the benchmark 20000 cycles to warm up and then take the data over the next 10000 cycles for the test – on a modern processor these take 30 seconds and 150 seconds respectively. This is then repeated a minimum of 10 times, with the first three results rejected.

We also have an additional flag on the software to make the benchmark exit when complete (which is not default behavior). The final results are output into a predefined file, which can be parsed for the result. The number of interest for us is the ability to simulate this system in real-time, and results are given as a factor of this: hardware that can simulate double real-time is given the value of 2.0, for example.
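A sketch of how the final score is assembled from the per-run real-time factors (the run values below are made up; only the drop-the-first-three averaging mirrors the procedure described above):

```python
def realtime_factor(simulated_seconds, wall_seconds):
    """A value >= 1.0 means the hardware keeps up with real time."""
    return simulated_seconds / wall_seconds

def benchmark_score(run_factors, rejected=3):
    """Average the per-run real-time factors, dropping the first runs."""
    kept = run_factors[rejected:]
    return sum(kept) / len(kept)

# Hypothetical per-run factors from repeated 10000-cycle samples:
runs = [0.8, 0.9, 0.95, 1.1, 1.2, 1.0, 1.1, 1.2, 1.0, 1.1]
print(f"{benchmark_score(runs):.3f}x real-time")  # prints 1.100x real-time
```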


(3-1) DigiCortex 1.35 (32k Neuron, 1.8B Synapse)

DigiCortex is another benchmark that reaches an asymptotic limit from memory access, with more focus on memory and interconnect latency as well. The differences between the X64T we tested and the stock 3990X likely come down to the memory configuration.

SPEC2017rate

For a final benchmark, I want to turn to SPEC. Due to the nature of running SPEC2017 rate with all 128 threads, the run-time for this benchmark is over 16 hours, and so we've had to prioritize other testing to speed up the review process. We only completed a single cycle of SPEC2017rate128, however we did record the following scores:

  • Average SPEC2017int rate 128 (estimated): 254.8
  • Average SPEC2017fp rate 128 (estimated): 234.0

The full sub-test results are in Bench. We normally showcase the results as a geomean as well: the Magnetar X64T scores 164.1, compared to Intel's 28-core, which scores 111.1.
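For reference, a geomean like this is a geometric mean across the individual sub-test scores, which can be computed as follows (the three scores in the example are made up, not the review's sub-test data):

```python
import math

def geomean(scores):
    """Geometric mean, as SPEC uses to combine sub-test scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

print(round(geomean([100.0, 200.0, 400.0]), 1))  # -> 200.0
```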

97 Comments

  • KillgoreTrout - Wednesday, September 9, 2020 - link

    Intelol
  • close - Wednesday, September 9, 2020 - link

    This shows some awesome performance but the tradeoff is the limited memory capacity. If you don't need that, great. If you do then Threadripper is not the best option.
  • twotwotwo - Wednesday, September 9, 2020 - link

    Hmm, so you're saying AnandTech needs a 3995WX or 2x7742 workstation sample? :)
  • close - Wednesday, September 9, 2020 - link

    A stack of them even :). Thing is memory support doesn't make for a more interesting review, doesn't really change any of the bars there. It's a tick box "supports up to 2TB of RAM".

    Memory support is of the things that makes an otherwise absurdly expensive workstation like the Mac Pro attractive (that and the fact that for whoever needs to stay within that ecosystem the licenses alone probably cost more than a stack of Pros).
  • oleyska - Wednesday, September 9, 2020 - link

    https://www.lenovo.com/no/no/thinkstation-p620

    will probably be able to help.
  • close - Wednesday, September 9, 2020 - link

    The P620 supports up to 512GB of RAM. Generally OK and probably delivers on every other aspect, but for those few who need 1.5-2TB of RAM it still wouldn't cut it. For that the go-to is usually a Xeon, or more recently an EPYC.
  • schujj07 - Wednesday, September 9, 2020 - link

    Remember that Threadripper Pro supports 2TB of RAM in an 8 channel setup. While getting 2TB/socket isn't cheap, it is a possibility.
  • rbanffy - Thursday, September 10, 2020 - link

    I wonder about the impact of the 8-channel config on single-threaded workloads. The 256MB of L3 is already quite ample, to the point that I'm unsure how diminished the returns are.
  • sjerra - Monday, September 28, 2020 - link

    This is my biggest concern, and it's rarely considered or studied in reviews. Design space exploration.
    CAE over many design variations. Hundreds of design variations calculated as much as possible in parallel over the available cores (one core per variation, but each grabbing a slice of the memory). I've tested this on a 7960X, purposely running it on dual channel and quad channel memory. On dual channel memory, at 12 parallel calculations (so 6 cores/channel) I measured a 46% increase in the calculation time per sample. In quad channel, at 12 parallel calculations (so 3 cores/channel) I measured a 30% reduction per calculation. (Can anyone explain the worse results for quad channel?)
    Either way, it leaves me to conclude that 64 cores with 4-channel memory for this type of workload is a big no-go. Something to keep in mind. I'm now spec'ing a dual processor workstation with two lower core count processors and fully populated memory channels (either EPYC (2x32c, 16 channels) or Xeon (2x24c, 12 channels); still deciding).
  • sjerra - Monday, September 28, 2020 - link

    Edit: 30% increase of course.
