The core algorithm is only a few lines of code and is already pretty quick:
After the initial variable allocations, the outer loop iterates over all of the source wires, and the inner loop calculates the magnetic field that each of those sources contributes at every node.
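The double-loop structure described above can be sketched roughly as follows (a minimal sketch, not the actual code from this post; the layout of the sources and nodes matrices, and names like bfield!, mid, and dL, are illustrative assumptions):

```julia
using LinearAlgebra

const μ₀ = 4π * 1e-7  # vacuum permeability [T·m/A]

# Biot–Savart sketch: each row of `sources` holds a wire segment's midpoint
# (columns 1:3) and its current-direction × length vector (columns 4:6);
# each row of `nodes` is a point where the field is evaluated.
function bfield!(B, nodes, sources)
    for i in axes(sources, 1)             # outer loop: source wire segments
        mid = @view sources[i, 1:3]       # segment midpoint
        dL  = @view sources[i, 4:6]       # current direction × segment length
        for j in axes(nodes, 1)           # inner loop: field nodes
            r = nodes[j, :] .- mid        # displacement from segment to node
            B[j, :] .+= (μ₀ / 4π) .* cross(dL, r) ./ norm(r)^3
        end
    end
    return B
end
```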
Vectorized
The vectorized version replaces the inner loop with built-in Julia matrix math:
The initial allocations are a bit of an eyesore compared to the previous version, but a necessary trade-off. I’d love to get rid of the outer loop as well, but unfortunately I didn’t find a way…
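One way the vectorized version might look, with the inner loop replaced by broadcast operations over all nodes at once (again a sketch under the same assumed matrix layout, not the post's exact code — the up-front buffers are the "eyesore" allocations mentioned above):

```julia
using LinearAlgebra

const μ₀ = 4π * 1e-7  # vacuum permeability [T·m/A]

function bfield_vec!(B, nodes, sources)
    # Buffers allocated once, reused on every outer-loop iteration
    r  = similar(nodes)                          # N×3 displacement vectors
    d3 = Vector{Float64}(undef, size(nodes, 1))  # |r|³ per node
    for i in axes(sources, 1)
        mid = @view sources[i, 1:3]
        dL  = @view sources[i, 4:6]
        r .= nodes .- mid'                       # all displacements at once
        d3 .= vec(sum(abs2, r, dims=2)) .^ 1.5   # cubed distances
        rx, ry, rz = eachcol(r)
        # cross(dL, rⱼ) for every node j, written out component-wise
        B[:, 1] .+= (μ₀ / 4π) .* (dL[2] .* rz .- dL[3] .* ry) ./ d3
        B[:, 2] .+= (μ₀ / 4π) .* (dL[3] .* rx .- dL[1] .* rz) ./ d3
        B[:, 3] .+= (μ₀ / 4π) .* (dL[1] .* ry .- dL[2] .* rx) ./ d3
    end
    return B
end
```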
Multi-Threaded
To split the problem up into similarly-sized pieces for each thread, first I need to ‘chunk’ or partition it:
This function takes the current thread number it, the total number of threads available Nt, and the length of the sources matrix, N. It then determines which indices of the sources matrix the thread should index into, i.e. which part of the problem that thread should solve.
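A partitioning helper along these lines can be written in one short function (a sketch with an assumed name — integer division hands each thread a contiguous, near-equal slice of the N source rows, with sizes differing by at most one):

```julia
# Thread `it` of `Nt` gets a contiguous slice of the indices 1:N.
# The floor divisions guarantee the slices tile 1:N exactly.
function chunk(it, Nt, N)
    lo = (it - 1) * N ÷ Nt + 1
    hi = it * N ÷ Nt
    return lo:hi
end
```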
Then, the outer function can spawn new threads to solve each chunk of the problem:
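The threaded driver might look something like this (a sketch, not the post's actual code: the serial kernel is passed in as an argument to keep the example self-contained, and each task accumulates into its own buffer so no two threads write to the same memory):

```julia
using Base.Threads

function bfield_threaded!(B, nodes, sources, kernel!)
    Nt = nthreads()
    N = size(sources, 1)
    # One accumulator per task avoids data races on the shared output B
    parts = [zeros(size(B)) for _ in 1:Nt]
    @sync for it in 1:Nt
        Threads.@spawn begin
            # This task's chunk of the source rows (same split as `chunk`)
            rows = (it - 1) * N ÷ Nt + 1 : it * N ÷ Nt
            kernel!(parts[it], nodes, view(sources, rows, :))
        end
    end
    B .= reduce(+, parts)  # sum the partial fields
    return B
end
```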
Always Check for Excessive Memory Allocations
This is one of the first performance tips in the Julia documentation for good reason. Allocating memory is slow and expensive compared to not allocating at all or reusing memory that is already allocated. My first iteration of the Julia algorithm required 1.12 gigabytes of allocations because I was re-allocating memory every time through the calculation loop! After I realized why this was happening, the efficient version of the code required just 23.8 kilobytes of memory allocations - a roughly 47,000-fold reduction, and an extremely embarrassing one.
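A minimal before/after of the trap described above (illustrative toy functions, not the post's code): building fresh temporaries inside the hot loop versus writing into one preallocated buffer.

```julia
# Allocates new temporary arrays on every iteration — allocations grow with N
function accumulate_alloc(rows)
    total = zeros(3)
    for i in axes(rows, 1)
        tmp = rows[i, :] .* 2.0              # fresh arrays each pass
        total .+= tmp
    end
    return total
end

# Reuses one preallocated buffer — allocation count stays constant
function accumulate_prealloc(rows)
    total = zeros(3)
    tmp = zeros(3)                           # allocated once, up front
    for i in axes(rows, 1)
        tmp .= @view(rows[i, :]) .* 2.0      # writes into the existing buffer
        total .+= tmp
    end
    return total
end
```

Both return the same result; only the second keeps allocations out of the loop.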
Similarly, Write Functions to Modify Existing Data Structures In-Place (When Appropriate)
Julia has a simple naming convention for functions that mutate their arguments: function mutate_vars!(args). As with modifying arrays in place, the trailing ! signals that an input variable may be modified. For example, in my code I wrote a function that computes a vector cross product in place, eliminating the need to re-allocate the output vector every time the function is called:
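An in-place cross product along these lines might look like this (a sketch with an assumed name — the result is written into the preallocated output c rather than into a newly allocated vector):

```julia
# Computes a × b and stores the result in the preallocated vector `c`.
# The `!` marks the function as mutating its first argument.
function cross!(c, a, b)
    c[1] = a[2] * b[3] - a[3] * b[2]
    c[2] = a[3] * b[1] - a[1] * b[3]
    c[3] = a[1] * b[2] - a[2] * b[1]
    return c
end

c = zeros(3)
cross!(c, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # c is now [0.0, 0.0, 1.0]
```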
Every time I call the function, I pass the same vector c as an argument for it to modify. For the baseline benchmark, this saved 1000x1000 allocations, or about 8 MB. Again - that’s not a lot for a small problem, but for larger problems that are thousands of times this size, it adds up quickly.