Core Algorithm
Simple Loops
The core algorithm is only a few lines of code and is already fairly quick:
@views function solve(nodes::AbstractArray{Float64}, sources::AbstractArray{Float64})
    # Requires `using LinearAlgebra` for norm and dot; mu0 is the vacuum
    # permeability constant, and cross! is defined later in this post
    B = zeros(size(nodes))
    a = Vector{Float64}(undef, 3)
    b = similar(a); c = similar(a); cxa = similar(a)
    d = 0.0 # Probably not necessary to allocate a scalar...
    for j in axes(sources)[2]
        d = mu0 * sources[7,j] / (4pi)
        # Source segment vector: end point minus start point
        a .= sources[4:6,j] .- sources[1:3,j]
        for i in axes(nodes)[1]
            # Vectors from the field point to the segment's start and end points
            b .= sources[1:3,j] .- nodes[i,:]
            c .= sources[4:6,j] .- nodes[i,:]
            cross!(cxa, c, a)
            B[i,:] .+= cxa .* d .* (1/(norm(cxa)^2)) .*
                       (dot(a,c)/norm(c) .- dot(a,b)/norm(b))
        end
    end
    return B
end
After the initial variable allocations, the outer loop iterates over all of the source wires, and the inner loop calculates the magnetic field generated by each of those sources at every node.
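As a quick usage sketch (the layout is inferred from the indexing above: nodes is an Nn-by-3 matrix of field points, and each column of sources holds a segment's start point in rows 1-3, end point in rows 4-6, and current in row 7; the example geometry here is just illustrative):

using LinearAlgebra

const mu0 = 4pi * 1e-7  # vacuum permeability [T*m/A]

# 100 field points along the x-axis
nodes = hcat(collect(range(-1.0, 1.0, length=100)), zeros(100), zeros(100))

# A single straight segment on the z-axis carrying 1 A
sources = reshape([0.0, 0.0, -0.5,   # start point
                   0.0, 0.0,  0.5,   # end point
                   1.0], 7, 1)       # current

B = solve(nodes, sources)  # Nn-by-3 matrix of field vectors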
Vectorized
The vectorized version replaces the inner loop with built-in Julia matrix math:
@views function solve2(nodes::AbstractArray{Float64}, sources::AbstractArray{Float64})
    Nn = size(nodes)[1]
    B = zeros(Nn,3); b = zeros(Nn,3); c = zeros(Nn,3); cxa = zeros(Nn,3)
    a = zeros(3)
    norm_cxa = zeros(Nn); dot_ac = zeros(Nn); norm_c = zeros(Nn); dot_ab = zeros(Nn)
    norm_b = zeros(Nn)
    d = 0.0
    for j in axes(sources)[2]
        # The use of "dot assignment" i.e. `.=` prevents re-allocation of memory
        # within the loop
        d = mu0 * sources[7,j] / (4pi)
        a .= sources[4:6,j] .- sources[1:3,j]
        b .= sources[1:3,j]' .- nodes
        c .= sources[4:6,j]' .- nodes
        # Functions with mutating arguments also prevent re-allocation of memory
        crosscols!(cxa, c, a)
        normrows!(norm_cxa, cxa)
        dotrows!(dot_ac, c, a)
        dotrows!(dot_ab, b, a)
        normrows!(norm_c, c)
        normrows!(norm_b, b)
        multcols!(cxa, (d .* (norm_cxa.^(-2)) .* (dot_ac./norm_c .- dot_ab./norm_b)))
        B .+= cxa
    end
    return B
end
The initial allocations are a bit of an eyesore compared to the previous version, but a necessity. I’d love to get rid of the outer loop too, but unfortunately I didn’t find a way…
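The helper functions crosscols!, normrows!, dotrows!, and multcols! aren't shown here; the following is one possible set of implementations, inferred from how they're called above (each row of the Nn-by-3 matrices is treated as a 3-vector):

# Cross product of each row of A (Nn-by-3) with the 3-vector b, written into C
@views function crosscols!(C, A, b)
    C[:,1] .= A[:,2].*b[3] .- A[:,3].*b[2]
    C[:,2] .= A[:,3].*b[1] .- A[:,1].*b[3]
    C[:,3] .= A[:,1].*b[2] .- A[:,2].*b[1]
end

# Euclidean norm of each row of A, written into the vector out
@views function normrows!(out, A)
    out .= sqrt.(A[:,1].^2 .+ A[:,2].^2 .+ A[:,3].^2)
end

# Dot product of each row of A with the 3-vector b, written into out
@views function dotrows!(out, A, b)
    out .= A[:,1].*b[1] .+ A[:,2].*b[2] .+ A[:,3].*b[3]
end

# Scale each row of A by the corresponding entry of the vector s, in place
function multcols!(A, s)
    A .*= s
end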
Multi-Threaded
To split the problem up into similarly-sized pieces for each thread, first I need to ‘chunk’ or partition it:
function threadindices(it::Integer, Nt::Integer, N::Integer)
    Nperthread = div(N, Nt)
    remainder = rem(N, Nt)
    if it == 1
        i1 = 1
        i2 = i1 + Nperthread - 1
    elseif it == Nt
        i2 = N
        i1 = i2 - Nperthread - remainder + 1
    else
        i1 = (it-1)*Nperthread + 1
        i2 = i1 + Nperthread - 1
    end
    return i1:i2
end
This function takes the current thread number it, the total number of threads available Nt, and the number of columns in the sources matrix, N. It then determines which indices of the sources matrix the thread should index into; i.e. which part of the problem that thread should solve.
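For example, splitting 10 source columns across 3 threads gives the last thread the remainder:

julia> threadindices.(1:3, 3, 10)
3-element Vector{UnitRange{Int64}}:
 1:3
 4:6
 7:10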
Then, the outer function can spawn new threads to solve each chunk of the problem:
@views function solve3(nodes::AbstractArray{Float64}, sources::AbstractArray{Float64};
                       Nt::Integer = 0)
    Ns = size(sources)[2]
    if Nt == 0
        Nt = Threads.nthreads()
    end
    # Spawn a new task for each thread by splitting up the source array
    tasks = Vector{Task}(undef, Nt)
    for it = 1:Nt
        @views tasks[it] = Threads.@spawn solve2(nodes,
            sources[:,threadindices(it, Nt, Ns)])
    end
    # Get the result from each calculation and add it to the output array
    B = zeros(size(nodes))
    for it = 1:Nt
        B .+= fetch(tasks[it])
    end
    return B
end
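Note that Julia has to be started with multiple threads for solve3 to actually run in parallel, e.g. with the --threads (or -t) flag:

$ julia --threads=8
julia> Threads.nthreads()
8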
Always Check for Excessive Memory Allocations
This is one of the first performance tips in the Julia documentation for good reason. Allocating memory is slow and expensive compared to not allocating at all or reusing memory that is already allocated. My first iteration of the Julia algorithm allocated 1.12 gigabytes because I was re-allocating memory every time through the calculation loop! After I realized why this was happening, the efficient version of the code required just 23.8 kilobytes of memory allocations - an embarrassingly large improvement.
See this Julia Discourse thread for more info.
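A quick way to catch this is to check the allocation counts that @time and BenchmarkTools.jl report; as a sketch:

using BenchmarkTools

# @time prints total allocations alongside the runtime; a surprisingly large
# number usually means something inside a hot loop is allocating
@time solve(nodes, sources)

# @benchmark runs the code many times and reports a memory estimate and
# allocation count, which is more reliable than a single timing
@benchmark solve($nodes, $sources)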
Avoid Memory Re-Allocation with a “Dot-Call”
In Julia, reassigning or modifying an array like this:
A = rand(1000)
A *= 2
actually allocates a brand-new array and rebinds A to it, instead of modifying the existing array at its current place in memory.
Instead, if the original array should be modified in-place, use the “dot-call” syntax:
A .*= 2
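For longer expressions, the @. macro is a convenient way to apply the dot to every operation at once:

@. A = 2A + 1   # equivalent to A .= 2 .* A .+ 1; modifies A in place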
The memory savings can be substantial:
julia> A = rand(1000,1000);

julia> function slow(A::AbstractArray{Float64})
           A *= 2
       end;

julia> function fast(A::AbstractArray{Float64})
           A .*= 2
       end;

julia> @benchmark slow($A)
BenchmarkTools.Trial: 6619 samples with 1 evaluation.
 Range (min … max):  178.917 μs … 2.691 ms    ┊ GC (min … max):  0.00% … 73.97%
 Time  (median):     585.791 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   752.145 μs ± 384.731 μs  ┊ GC (mean ± σ):  21.59% ± 22.77%
 Memory estimate: 7.63 MiB, allocs estimate: 2.

julia> @benchmark fast($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  148.667 μs … 3.356 ms   ┊ GC (min … max):  0.00% … 0.00%
 Time  (median):     150.208 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   151.284 μs ± 32.238 μs  ┊ GC (mean ± σ):   0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
In this case, the dot-call version is about 5x faster on average, and allocates nothing!
This is also covered in the documentation.
Similarly, Write Functions to Modify Existing Data Structures In-Place (When Appropriate)
Julia has a simple convention for functions that mutate their arguments: an exclamation mark at the end of the name, as in function mutate_vars!(args). Similarly to modifying arrays in place, this signals that an input variable may be modified. For example, in my code I wrote a function that computes a vector cross-product in-place, eliminating the need to re-allocate the output vector every time the function is called:
function cross!(c::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    c .= (a[2]*b[3] - a[3]*b[2], a[3]*b[1] - a[1]*b[3], a[1]*b[2] - a[2]*b[1])
end
Every time I call the function, I pass the same vector c as an argument for it to modify. For the baseline benchmark, this saved 1000x1000 allocations, or about 8 MB. Again - that’s not a lot for a small problem, but for larger problems that are thousands of times this size, it adds up quickly.
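A quick comparison against the allocating cross from LinearAlgebra shows the difference; a minimal sketch using BenchmarkTools.jl:

using LinearAlgebra, BenchmarkTools

a, b = rand(3), rand(3)
c = zeros(3)

@btime cross($a, $b)       # allocates a fresh 3-vector on every call
@btime cross!($c, $a, $b)  # writes into the existing c; zero allocations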