
Core Algorithm

Simple Loops

The core algorithm is only a few dozen lines of code and is already pretty quick:

@views function solve(nodes::AbstractArray{Float64}, sources::AbstractArray{Float64})

    B = zeros(size(nodes))
    a = Vector{Float64}(undef,3)
    b = similar(a); c = similar(a); cxa = similar(a)
    d = 0.0     # Probably not necessary to allocate a scalar...

    for j in axes(sources)[2]
        d = mu0 * sources[7,j] / (4pi)
        a .= sources[4:6,j] .- sources[1:3,j] 
        for i in axes(nodes)[1]
            b .= sources[1:3,j] .- nodes[i,:] 
            c .= sources[4:6,j] .- nodes[i,:]
            cross!(cxa, c, a)

            B[i,:] .+= cxa .* d .* (1/(norm(cxa)^2)) .* 
                                        (dot(a,c)/norm(c) .- dot(a,b)/norm(b))
        end
    end

    return B
end

After initial variable allocations, the outer loop goes through all of the source wires, and the inner loop calculates the magnetic field generated by each of those sources for all nodes.
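The indexing above implies a particular input layout, worth making explicit (this is my reading of the code): each column of `sources` is one straight wire segment, and each row of `nodes` is one field-evaluation point.

```julia
# Input layout implied by the indexing in `solve`:
#   sources[:, j] = [x1, y1, z1, x2, y2, z2, I]  - segment endpoints and current
#   nodes[i, :]   = [x, y, z]                    - field-evaluation point
# One 1 m segment along z carrying 1 A, observed at two nodes:
sources = reshape([0.0, 0.0, -0.5, 0.0, 0.0, 0.5, 1.0], 7, 1)
nodes = [0.1  0.0  0.0;
         0.0  0.2  0.0]
```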

Vectorized

The vectorized version replaces the inner loop with built-in Julia matrix math:

@views function solve2(nodes::AbstractArray{Float64}, sources::AbstractArray{Float64})

    Nn = size(nodes)[1]
    B = zeros(Nn,3); b = zeros(Nn,3); c = zeros(Nn,3); cxa = zeros(Nn,3) 
    a = zeros(3) 
    norm_cxa = zeros(Nn); dot_ac = zeros(Nn); norm_c = zeros(Nn); dot_ab = zeros(Nn) 
    norm_b = zeros(Nn) 
    d = 0.0

    for j in axes(sources)[2]

        # The use of "dot assignment" i.e. `.=` prevents re-allocation of memory 
        #  within the loop
        d = mu0 * sources[7,j] / (4pi)
        a .= sources[4:6,j] .- sources[1:3,j] 
        b .= sources[1:3,j]' .- nodes 
        c .= sources[4:6,j]' .- nodes

        # Functions with mutating arguments also prevent re-allocation of memory
        crosscols!(cxa, c, a)
        normrows!(norm_cxa, cxa)
        dotrows!(dot_ac, c, a) 
        dotrows!(dot_ab, b, a) 
        normrows!(norm_c, c) 
        normrows!(norm_b, b)

        multcols!(cxa, (d .* (norm_cxa.^(-2)) .* (dot_ac./norm_c .- dot_ab./norm_b))) 
        B .+= cxa
    end

    return B
end

The initial allocations are a bit of an eyesore compared to the previous version, but they're a necessity: pre-allocating the buffers lets every pass through the outer loop reuse the same memory. I'd love to get rid of the outer loop as well, but unfortunately I didn't find a way.
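The row-wise helpers (`crosscols!`, `normrows!`, `dotrows!`, `multcols!`) aren't shown in this section. Here is a sketch of how they might look, assuming each mutates its first argument and treats each row of the matrix as a 3-vector; these are guessed signatures, not the definitions from the full post.

```julia
function crosscols!(out, A, v)     # out[i,:] = A[i,:] × v
    for i in axes(A, 1)
        out[i,1] = A[i,2]*v[3] - A[i,3]*v[2]
        out[i,2] = A[i,3]*v[1] - A[i,1]*v[3]
        out[i,3] = A[i,1]*v[2] - A[i,2]*v[1]
    end
    return out
end

function normrows!(out, A)         # out[i] = |A[i,:]|
    for i in axes(A, 1)
        out[i] = sqrt(A[i,1]^2 + A[i,2]^2 + A[i,3]^2)
    end
    return out
end

function dotrows!(out, A, v)       # out[i] = A[i,:] ⋅ v
    for i in axes(A, 1)
        out[i] = A[i,1]*v[1] + A[i,2]*v[2] + A[i,3]*v[3]
    end
    return out
end

function multcols!(A, s)           # scale row i of A by s[i], in-place
    for i in axes(A, 1), k in axes(A, 2)
        A[i,k] *= s[i]
    end
    return A
end
```

None of these allocate: each writes into memory that the caller already owns, which is exactly what keeps the inner loop of `solve2` allocation-free.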

Multi-Threaded

To split the problem up into similarly-sized pieces for each thread, first I need to ‘chunk’ or partition it:

function threadindices(it::Integer, Nt::Integer, N::Integer) 

    Nperthread = div(N, Nt)
    remainder = rem(N, Nt) 

    if it == 1
        i1 = 1 
        i2 = i1 + Nperthread - 1 
    elseif it == Nt 
        i2 = N 
        i1 = i2 - Nperthread - remainder + 1 
    else
        i1 = (it-1)*Nperthread + 1
        i2 = i1 + Nperthread - 1 
    end 

    return i1:i2
end

This function takes the current thread number it, the total number of threads available Nt, and the number of source columns, N. It then determines which indices of the sources matrix the thread should index into, i.e. which part of the problem that thread should solve.
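For example, splitting N = 10 sources across Nt = 3 threads gives the first two threads 3 columns each, while the last thread absorbs the remainder (the definition is repeated here so the snippet runs standalone):

```julia
function threadindices(it::Integer, Nt::Integer, N::Integer)
    Nperthread = div(N, Nt)
    remainder = rem(N, Nt)
    if it == 1
        i1 = 1
        i2 = i1 + Nperthread - 1
    elseif it == Nt
        i2 = N
        i1 = i2 - Nperthread - remainder + 1
    else
        i1 = (it-1)*Nperthread + 1
        i2 = i1 + Nperthread - 1
    end
    return i1:i2
end

chunks = [threadindices(it, 3, 10) for it in 1:3]
# chunks == [1:3, 4:6, 7:10] - together they cover 1:10 exactly once
```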

Then, the outer function can spawn new threads to solve each chunk of the problem:

@views function solve3(nodes::AbstractArray{Float64}, sources::AbstractArray{Float64}; 
                        Nt::Integer = 0)

    Ns = size(sources)[2]
    if Nt == 0 
        Nt = Threads.nthreads()
    end

    # Spawn a new task for each thread by splitting up the source array
    tasks = Vector{Task}(undef, Nt)
    for it = 1:Nt 
        @views tasks[it] = Threads.@spawn solve2(nodes, 
                                            sources[:,threadindices(it, Nt, Ns)])
    end
    
    # Get the result from each calculation and add it to the output array 
    B = zeros(size(nodes))
    for it = 1:Nt 
        B .+= fetch(tasks[it]) 
    end 

    return B
end

Always Check for Excessive Memory Allocations

This is one of the first performance tips in the Julia documentation for good reason. Allocating memory is slow and expensive compared to not allocating at all, or to reusing memory that is already allocated. My first iteration of the Julia algorithm allocated 1.12 gigabytes because I was re-allocating memory every time through the calculation loop! After I realized why this was happening, the efficient version of the code required just 23.8 kilobytes of memory allocations - an embarrassingly large improvement.

See this Julia Discourse thread for more info.
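A quick way to check is the @allocated macro (or the allocation count that @time prints). A hypothetical comparison, with both versions wrapped in functions and run once first so compilation isn't counted:

```julia
double(A)     = A .* 2             # allocates a fresh array on every call
double!(B, A) = (B .= A .* 2; B)   # writes into B's existing memory

A = rand(1000); B = similar(A)
double(A); double!(B, A)           # warm up so compilation isn't measured

@allocated double(A)               # ~8 KiB: one new 1000-element Float64 array
@allocated double!(B, A)           # far fewer bytes: no new array
```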

Avoid Memory Re-Allocation with a “Dot-Call”

In Julia, reassigning or modifying an array like this:

A = rand(1000) 
A *= 2

actually re-allocates memory for A instead of modifying the existing array at its current place in memory.

Instead, if the original array should be modified in-place, use the “dot-call” syntax:

 A .*= 2 
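One way to see the difference directly is to compare the array's memory address before and after, using pointer as a quick (if blunt) illustration:

```julia
A = rand(1000)
p = pointer(A)
A *= 2                    # allocates a new array and rebinds the name A
@assert pointer(A) != p   # different memory

B = rand(1000)
q = pointer(B)
B .*= 2                   # modifies B where it already lives
@assert pointer(B) == q   # same memory
```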

The memory savings can be substantial:

julia> A = rand(1000,1000)
julia> function slow(A::AbstractArray{Float64})
           A *= 2
       end
julia> function fast(A::AbstractArray{Float64})
           A .*= 2
       end
julia> @benchmark slow($A)
BenchmarkTools.Trial: 6619 samples with 1 evaluation.
 Range (min … max):  178.917 μs …   2.691 ms  ┊ GC (min … max):  0.00% … 73.97%
 Time  (median):     585.791 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   752.145 μs ± 384.731 μs  ┊ GC (mean ± σ):  21.59% ± 22.77%

              ▅█▅▃▃▂                       ▁▁▂▁▁▁▂▁▁▁▁▁▁      
  █▁▁▁▁▁▁▁▁▁▁▁▁███████▆▄▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▇███████████████████▇▇ 
  179 μs        Histogram: log(frequency) by time       1.88 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> @benchmark fast($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  148.667 μs …  3.356 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     150.208 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   151.284 μs ± 32.238 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂██▂                                                       
  ▁▁▂████▃▃▆██▇▅▃▂▂▃▂▂▁▁▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ 
  149 μs          Histogram: frequency by time          162 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

In this case, the dot-call syntax is 5x faster!

This is also covered in the documentation.

Similarly, Write Functions to Modify Existing Data Structures In-Place (When Appropriate)

Julia has a simple convention for functions that mutate their arguments: an exclamation mark at the end of the name, as in mutate_vars!(args). Like in-place array modification, this signals that an input variable may be modified. For example, in my code I write a function that computes a vector cross-product in-place, eliminating the need to re-allocate the output vector every time the function is called:

function cross!(c::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    c .= (a[2]*b[3]-a[3]*b[2], a[3]*b[1]-a[1]*b[3], a[1]*b[2]-a[2]*b[1])
end

Every time I call the function, I pass the same vector c as an argument for it to modify. For the baseline benchmark, this saved 1000x1000 allocations, or about 8 MB. Again - that’s not a lot for a small problem, but for larger problems that are thousands of times this size, it adds up quickly.
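As a quick sanity check, the in-place version can be compared against cross from the LinearAlgebra standard library (the definition is repeated so the snippet runs standalone):

```julia
using LinearAlgebra

function cross!(c::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    c .= (a[2]*b[3]-a[3]*b[2], a[3]*b[1]-a[1]*b[3], a[1]*b[2]-a[2]*b[1])
end

a = [1.0, 2.0, 3.0]; b = [4.0, 5.0, 6.0]
c = zeros(3)
cross!(c, a, b)
# c now holds [-3.0, 6.0, -3.0], the same result as cross(a, b),
# but the output was written into pre-allocated memory.
```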


Copyright © 2024 ryan@freestatelabs