<feed xmlns="http://www.w3.org/2005/Atom"><title>Oscar Peace's blog</title><link href="https://www.oscarcp.net" /><updated>2026-05-16T11:56:27.384312</updated><subtitle>Feed for the blog at www.oscarcp.net</subtitle><id>https://www.oscarcp.net/</id><link href="https://www.oscarcp.net/feeds/atom" rel="self" type="application/atom+xml" /><entry><title>Optimising Go with SIMD assembly</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/optimising-goputer-with-assembly</id><updated>2026-01-21T00:00:00</updated><summary>Rendering is slow, but it can be made a lot faster</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;&lt;em&gt;Note: This is a continuation from my &lt;a href="/blog/making-pong-in-goputer" hx-get="/blog/making-pong-in-goputer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;previous post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Ever since goputer's switch to software rendering - instead of relying on frontends to do rendering operations themselves in hardware - drawing anything to the screen has been a bit of a bottleneck. Turns out doing stuff in hardware is a lot faster than doing it in software, especially when, because of some very bad decisions earlier on, none of these operations work on 4-byte values and thus cannot be easily vectorised by a good compiler&lt;sup id="fnref:bytesimd"&gt;&lt;a class="footnote-ref" href="#fn:bytesimd"&gt;[1]&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;The following post details my experience optimising my Go code with assembly. If you are already familiar with SIMD operations then feel free to skip the next section, but it may be helpful for anyone who isn't. This article focuses on x86 vector instructions, but the principles of SIMD are the same for all CPU architectures.&lt;/p&gt;
&lt;span class="header" id="what-is-vectorisation-and-simd"&gt;&lt;a href="#what-is-vectorisation-and-simd"&gt;#&lt;/a&gt;&lt;h1&gt;What is vectorisation and SIMD?&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Firstly SIMD stands for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;S&lt;/strong&gt;ingle&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;I&lt;/strong&gt;nstruction&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M&lt;/strong&gt;ultiple&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;D&lt;/strong&gt;ata&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence this means that for a given instruction, which has multiple parameters (registers and/or memory locations), these parameters may themselves contain multiple pieces of data to be operated on in parallel.&lt;/p&gt;
&lt;p&gt;Also, as opposed to scalar instructions which will operate on standard sized registers, vector instructions mostly operate on &lt;strong&gt;vector registers&lt;/strong&gt;. On x86 vector registers can be 128, 256, or 512 bits wide.&lt;/p&gt;
&lt;p&gt;&lt;span style="color: red;"&gt;Vector&lt;/span&gt;isation refers to the process of rewriting a piece of code to make use of &lt;span style="color: red;"&gt;vector&lt;/span&gt; instructions.&lt;/p&gt;
&lt;p&gt;A very simple example of vectorisation would be to say you had a loop which added two (very long) arrays together. Below is some Python&lt;sup id="fnref:pythonvector"&gt;&lt;a class="footnote-ref" href="#fn:pythonvector"&gt;[2]&lt;/a&gt;&lt;/sup&gt; pseudocode to demonstrate this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Traditionally the above code would be executed in a &lt;strong&gt;scalar&lt;/strong&gt; manner. I.e. one addition would have to be executed for every pair of integers, a and b. This means that 1024 additions would have to be performed. While this may execute fairly quickly for small-ish arrays of integers such as this, it will get increasingly slower as the arrays grow.&lt;/p&gt;
&lt;p&gt;A "vectorised" version of the above code would be as follows (for this example, a and b can be thought of as an array of &lt;code&gt;uint32&lt;/code&gt;'s):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vpaddd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: The &lt;code&gt;extend&lt;/code&gt; method appends every element of c to the end of result.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Compared to before we are now adding 16 integers together at a time, instead of just 1. This means our code will now execute (in theory) 16x faster.&lt;/p&gt;
&lt;p&gt;Our &lt;code&gt;vpaddd&lt;/code&gt; function can be thought of as an instruction intrinsic. An intrinsic is a "function" that tells the compiler that you want this operation to be carried out with this specific instruction. In our case, &lt;code&gt;vpaddd&lt;/code&gt; refers to the "Add Packed Integers" instruction, specifically adding packed doublewords. A packed instruction operates on each value of its element size (byte/word/doubleword etc.) separately, as opposed to treating the register like one big number. In real code which uses intrinsics this would be &lt;code&gt;_mm_add_epi32&lt;/code&gt;&lt;sup id="fnref:addintrinsic"&gt;&lt;a class="footnote-ref" href="#fn:addintrinsic"&gt;[3]&lt;/a&gt;&lt;/sup&gt; for 128-bit registers, or &lt;code&gt;_mm512_add_epi32&lt;/code&gt; for the 512-bit registers used here.&lt;/p&gt;
&lt;p&gt;In our case, each integer is 32 bits wide, and we are using 512 bit &lt;code&gt;zmm&lt;/code&gt; registers when performing the add instruction.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/pre%20simd%20add.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Values in ZMM registers before adding, when i is zero&lt;/figcaption&gt;&lt;p&gt;The offset value in the above diagram refers to the number of bits each value is offset from the start of the register. Typically a vector register would be displayed with its most significant bit first, but in this article, the least significant bit is displayed first.&lt;/p&gt;
&lt;p&gt;We then perform the &lt;code&gt;vpaddd&lt;/code&gt; instruction:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;vpaddd zmm2, zmm1, zmm0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This takes each packed (packed, i.e. they are stored next to each other in the register, no gaps) double-word (32 bit integer)&lt;sup id="fnref:byteworddouble"&gt;&lt;a class="footnote-ref" href="#fn:byteworddouble"&gt;[4]&lt;/a&gt;&lt;/sup&gt; in &lt;code&gt;zmm0&lt;/code&gt; and &lt;code&gt;zmm1&lt;/code&gt;, and adds them together, storing the result in &lt;code&gt;zmm2&lt;/code&gt;. In Intel assembly syntax, operands follow the order of &lt;code&gt;destination, source&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;After we perform our &lt;code&gt;vpaddd&lt;/code&gt; instruction, &lt;code&gt;zmm2&lt;/code&gt; now has our result:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/post%20simd%20add.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
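&lt;p&gt;For readers who prefer Go to diagrams, the following is a plain-Go model of what the instruction does. It is only a sketch and does not emit &lt;code&gt;vpaddd&lt;/code&gt;: the names &lt;code&gt;lanes&lt;/code&gt;, &lt;code&gt;vpaddd&lt;/code&gt; and &lt;code&gt;addSlices&lt;/code&gt; are made up for illustration, and the loop inside &lt;code&gt;vpaddd&lt;/code&gt; stands in for work the hardware does in a single instruction.&lt;/p&gt;

```go
package main

import "fmt"

// lanes is the number of 32-bit values that fit in one 512-bit register.
const lanes = 512 / 32

// vpaddd models the packed add: every lane is added on its own and wraps
// on overflow, so nothing carries over into the neighbouring lane.
func vpaddd(a, b [lanes]uint32) [lanes]uint32 {
	var out [lanes]uint32
	for i := range out {
		out[i] = a[i] + b[i]
	}
	return out
}

// addSlices adds a and b one register-sized chunk at a time, then handles
// any leftover elements one by one. It assumes b is at least as long as a.
func addSlices(a, b []uint32) []uint32 {
	result := make([]uint32, 0, len(a))
	for len(a) >= lanes {
		var va, vb [lanes]uint32
		copy(va[:], a[:lanes])
		copy(vb[:], b[:lanes])
		c := vpaddd(va, vb)
		result = append(result, c[:]...)
		a, b = a[lanes:], b[lanes:]
	}
	for i := range a { // scalar tail for lengths that are not a multiple of 16
		result = append(result, a[i]+b[i])
	}
	return result
}

func main() {
	a := make([]uint32, 1024)
	b := make([]uint32, 1024)
	for i := range a {
		a[i] = uint32(i)
		b[i] = uint32(i)
	}
	sum := addSlices(a, b)
	fmt.Println(sum[:4], sum[1023])

	// Overflow in lane 0 wraps to zero and does not carry into lane 1.
	wrapped := vpaddd([lanes]uint32{0xFFFFFFFF, 1}, [lanes]uint32{1, 1})
	fmt.Println(wrapped[0], wrapped[1])
}
```

&lt;p&gt;The second call in &lt;code&gt;main&lt;/code&gt; shows the "packed" part: lane 0 overflows and wraps around, while lane 1 is unaffected.&lt;/p&gt;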
&lt;p&gt;In reality our &lt;code&gt;.extend&lt;/code&gt; operation at the end would translate to a &lt;code&gt;vmovdqa64&lt;/code&gt; (move double quadword aligned) instruction, in order to move our data back to our result array.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;vmovdqa64 [rax], zmm2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: The register &lt;code&gt;rax&lt;/code&gt; contains the address of the current position in our result array. In practice any 64-bit general-purpose register could hold this value.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Adding isn't the only operation you can perform on a vector register though. Pretty much any operation that can be applied to a scalar register can be applied to vector registers as well, whether it be other arithmetic operations such as multiplication, or logical operations, and shifting.&lt;/p&gt;
&lt;p&gt;There also exists a set of instructions specifically for operating on vector registers, such as broadcasting, shuffling, extraction/insertion, or other permutations. How these work will be detailed in the rest of the post.&lt;/p&gt;
&lt;span class="header" id="using-assembly-with-go"&gt;&lt;a href="#using-assembly-with-go"&gt;#&lt;/a&gt;&lt;h1&gt;Using assembly with Go&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;The first half of this section focuses on writing raw assembly for use with Go, and the second half focuses on making use of the Avo library. I would recommend reading both halves as it is important to understand the assembly that Avo generates, and it is also sometimes easier to debug your code by reading the generated assembly.&lt;/p&gt;
&lt;span class="header" id="raw-assembly"&gt;&lt;a href="#raw-assembly"&gt;##&lt;/a&gt;&lt;h2&gt;Raw assembly&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;&lt;em&gt;Note: From now on, Go's Plan 9-inspired assembly syntax is used. The main difference from Intel syntax is the operand order: source first, then destination.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Go makes it quite easy to link with assembly. It is used a lot in the standard library, especially in the &lt;code&gt;math&lt;/code&gt; package, to make use of hardware specific features for various operations.&lt;/p&gt;
&lt;p&gt;For example &lt;a href="https://github.com/golang/go/blob/455282911aba7512e2ba045ffd9244eb97756247/src/math/exp_amd64.s" target="_blank"&gt;&lt;code&gt;math/exp_amd64.s&lt;/code&gt;&lt;/a&gt; contains optimised code for calculating a base &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; exponential.&lt;/p&gt;
&lt;p&gt;A much simpler example would be something that &lt;a href="https://github.com/mmcloughlin/avo/tree/c096992c06ffcc996c6ebb924d9d5cf1e53e49d4/examples/add" target="_blank"&gt;adds two numbers&lt;/a&gt;&lt;sup id="fnref:avocredit"&gt;&lt;a class="footnote-ref" href="#fn:avocredit"&gt;[5]&lt;/a&gt;&lt;/sup&gt;, i.e:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#include &amp;quot;textflag.h&amp;quot;

// func Add(x uint64, y uint64) uint64
TEXT &#183;Add(SB), NOSPLIT, $0-24
    MOVQ x+0(FP), AX
    MOVQ y+8(FP), CX
    ADDQ AX, CX
    MOVQ CX, ret+16(FP)
    RET
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;An explainer on the syntax used above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;TEXT &#183;Add(SB), NOSPLIT, $0-24
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Above is the function declaration. &lt;code&gt;SB&lt;/code&gt; is a virtual register used to refer to the static base pointer, which can be thought of as the origin of memory.&lt;/p&gt;
&lt;p&gt;Names and descriptions for the pseudo-registers from the &lt;a href="https://go.dev/doc/asm" target="_blank"&gt;Go documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;FP: Frame pointer: arguments and locals.&lt;/li&gt;
&lt;li&gt;PC: Program counter: jumps and branches.&lt;/li&gt;
&lt;li&gt;SB: Static base pointer: global symbols.&lt;/li&gt;
&lt;li&gt;SP: Stack pointer: the highest address within the local stack frame.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;code&gt;NOSPLIT&lt;/code&gt; tells the compiler not to insert a special piece of code that checks if the stack needs to grow. I don't fully understand this myself, and for our purposes it isn't important, but for those interested there is a detailed explanation in &lt;a href="https://mcyoung.xyz/2025/07/07/nosplit/" target="_blank"&gt;this blogpost&lt;/a&gt; by Miguel Young de la Sota.&lt;/p&gt;
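&lt;p&gt;&lt;code&gt;$0-24&lt;/code&gt; gives two sizes: the function needs 0 bytes of local stack frame, and its arguments and return value take up 24 bytes (three 8-byte &lt;code&gt;uint64&lt;/code&gt; values), which live on the caller's frame.&lt;/p&gt;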
&lt;p&gt;The move instruction, &lt;code&gt;MOVQ x+0(FP), AX&lt;/code&gt;, can be read as: move the parameter &lt;code&gt;x&lt;/code&gt;, offset zero bytes from the frame pointer &lt;code&gt;FP&lt;/code&gt;, to the register &lt;code&gt;AX&lt;/code&gt;. The same applies to the second move instruction, except an offset of 8 bytes instead of 0.&lt;/p&gt;
&lt;p&gt;The final move instruction can be read as: move the value in &lt;code&gt;CX&lt;/code&gt; to the return parameter, offset 16 bytes from the frame pointer.&lt;/p&gt;
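&lt;p&gt;The offsets and the &lt;code&gt;24&lt;/code&gt; in the declaration can be worked out by hand. Below is a small sketch in plain Go that does the arithmetic for &lt;code&gt;func Add(x, y uint64) uint64&lt;/code&gt;. The &lt;code&gt;slot&lt;/code&gt; type and &lt;code&gt;layoutAdd&lt;/code&gt; function are hypothetical helpers, and the sketch ignores alignment padding, which only works here because every value is 8 bytes wide.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"unsafe"
)

// slot describes where one argument or result sits relative to FP.
type slot struct {
	name   string
	offset uintptr
	size   uintptr
}

// layoutAdd places x, y and the return value one after another, the way
// the stack-based calling convention used by hand-written assembly does.
// It returns the slots and the total number of argument bytes.
func layoutAdd() ([]slot, uintptr) {
	var x, y, ret uint64
	slots := []slot{
		{name: "x", size: unsafe.Sizeof(x)},
		{name: "y", size: unsafe.Sizeof(y)},
		{name: "ret", size: unsafe.Sizeof(ret)},
	}
	var next uintptr
	for i := range slots {
		slots[i].offset = next
		next += slots[i].size
	}
	return slots, next
}

func main() {
	slots, total := layoutAdd()
	for _, s := range slots {
		fmt.Printf("%s+%d(FP), %d bytes\n", s.name, s.offset, s.size)
	}
	// Frame size is 0 because Add keeps everything in registers.
	fmt.Printf("$0-%d\n", total)
}
```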
&lt;p&gt;This can then be linked to the following method stub:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The method stub can then be used in any normal Go code, just as a normal function would.&lt;/p&gt;
&lt;p&gt;While the above approach is fine for small functions, larger functions make argument sizes, constant data, and register allocation a real burden, so it is useful to have some way to take this "boilerplate" work away.&lt;/p&gt;
&lt;span class="header" id="here-comes-avo"&gt;&lt;a href="#here-comes-avo"&gt;##&lt;/a&gt;&lt;h2&gt;Here comes Avo&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;&lt;a href="https://github.com/mmcloughlin/avo" target="_blank"&gt;Avo&lt;/a&gt; is a Go library that can be used to generate assembly by just writing normal Go code. The assembly example used previously was actually from Avo's examples directory. You still have to "write" the assembly yourself, but Avo will take care of all the boring stuff like argument sizes and Plan9 syntax for you.&lt;/p&gt;
&lt;p&gt;Here is the Go code which could have been written instead to generate that assembly. In Avo, as in Go's assembly, operands follow a source, destination order:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;github.com/mmcloughlin/avo/build&amp;quot;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Add&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;NOSPLIT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;func(x, y uint64) uint64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP64&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP64&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;ADDQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;Store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ReturnIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;RET&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;Generate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice how &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are now actually variables, in this case virtual registers (64 bit) which have been allocated for us and initialized by Avo.&lt;/p&gt;
&lt;p&gt;We then use the &lt;code&gt;Store&lt;/code&gt; method to generate the instruction which pushes the result back onto the stack.&lt;/p&gt;
&lt;p&gt;Avo also generates the stub file.&lt;/p&gt;
&lt;span class="header" id="using-simd-with-go"&gt;&lt;a href="#using-simd-with-go"&gt;#&lt;/a&gt;&lt;h1&gt;Using SIMD with Go&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;As stated previously, the primary operation I wanted to optimise with SIMD was rendering. This is not only because it was a bottleneck, but also because rendering is an ideal candidate (or in my case slightly less than ideal) for a SIMD workflow.&lt;/p&gt;
&lt;p&gt;In my case, I also couldn't use any AVX-512 instructions as my CPU does not support it.&lt;/p&gt;
&lt;span class="header" id="video-clearing"&gt;&lt;a href="#video-clearing"&gt;##&lt;/a&gt;&lt;h2&gt;Video clearing&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;The first operation I wanted to speed up with SIMD was &lt;code&gt;int vc&lt;/code&gt;, the video clear interrupt. A simpler approach would have been a string instruction such as &lt;code&gt;rep stosb&lt;/code&gt;, which fills memory with a repeated value, since I didn't care what was already in the buffer and only wanted to overwrite it. However, the buffer format (RGB8, three bytes per pixel) meant that this was not possible.&lt;/p&gt;
&lt;p&gt;The steps I needed to take in order to do this were as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Assemble the colour&lt;/li&gt;
&lt;li&gt;Move the colour data to memory&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;"Assembling" the colour meant loading the RGB values from the parameters, then placing this data in the correct order in a vector register, or actually multiple, in order for it to be ordered properly.&lt;/p&gt;
&lt;p&gt;Loading the parameters then ordering the colour correctly in a 32 bit register is quite simple:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;red&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;green&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;green&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;blue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP32&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nx"&gt;MOVL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;U32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;MOVB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;As8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;SHLL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ORB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;green&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;As8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;SHLL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ORB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;red&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;As8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The value in this register can then be moved to an XMM register using a &lt;code&gt;MOVD&lt;/code&gt; (move double word) instruction. Also a quick note, x86 is &lt;strong&gt;little endian&lt;/strong&gt;, which means that the least significant byte of a value is stored first, at the lowest address. This is why we move the blue in first, then green, then red, as opposed to the other way round: red ends up in the lowest byte. This is important when we consider the order of bytes in the XMM register.&lt;/p&gt;
&lt;p&gt;Once we've constructed the colour, we can move it to the lowest 4 bytes of the XMM register:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;MOVD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once this data was in XMM, the rest of XMM needed to be filled with it. Because it is a 3 byte pattern and XMM registers are 16 bytes wide, the pattern does not fit a whole number of times into one register. It only lines up again after 48 bytes, so it has to be repeated across 3 XMM registers in order to work properly.&lt;/p&gt;
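&lt;p&gt;The number of registers needed falls out of a least common multiple. The sketch below works it out in Go; the constant and function names are invented for this example, and the phase numbering (0 = red, 1 = green, 2 = blue) assumes red is the first byte of each pixel, as in the packing code above.&lt;/p&gt;

```go
package main

import "fmt"

const (
	pixelBytes    = 3  // one RGB8 pixel
	registerBytes = 16 // one XMM register
)

// gcd is Euclid's algorithm.
func gcd(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

// repeatLength is the shortest run of bytes that holds both a whole
// number of pixels and a whole number of registers.
func repeatLength() int {
	return pixelBytes * registerBytes / gcd(pixelBytes, registerBytes)
}

// startPhases reports, for each register in that run, which byte of the
// pixel the register starts on.
func startPhases() []int {
	phases := make([]int, repeatLength()/registerBytes)
	for i := range phases {
		phases[i] = (i * registerBytes) % pixelBytes
	}
	return phases
}

func main() {
	fmt.Println("bytes before the pattern repeats:", repeatLength())
	fmt.Println("start phase of each register:", startPhases())
}
```

&lt;p&gt;Each of the three registers therefore has to begin at a different point in the red, green, blue sequence.&lt;/p&gt;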
&lt;p&gt;An appropriate instruction for this is the shuffle instruction, specifically &lt;a href="https://www.felixcloutier.com/x86/pshufb" target="_blank"&gt;&lt;code&gt;PSHUFB&lt;/code&gt;&lt;/a&gt; or &lt;em&gt;Packed Shuffle Bytes&lt;/em&gt;. It re-orders the bytes of one vector register according to a shuffle mask held in another. TL;DR: given source &lt;code&gt;x&lt;/code&gt;, destination &lt;code&gt;y&lt;/code&gt;, and mask &lt;code&gt;m&lt;/code&gt;, &lt;code&gt;y[i] = x[m[i]]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/simd%20shuffle.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Values are coloured to show which index the mask sources its value from&lt;/figcaption&gt;&lt;p&gt;In this case the register is filled with a &lt;strong&gt;&lt;span style="color: #d623f4;"&gt;purple-ish&lt;/span&gt;&lt;/strong&gt; colour.&lt;/p&gt;
&lt;p&gt;Or in code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;shuffle_mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GLOBL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;shuffle_mask&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RODATA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;NOPTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;String&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="nx"&gt;VPSHUFB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shuffle_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There is a problem here: because 16 is not a multiple of 3, the mask is not a whole number of sequences (it ends in 0 instead of 2). Multiple registers with slightly different masks must therefore be used in order to form one complete, repeatable sequence.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;first_mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GLOBL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;first_mask&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RODATA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;NOPTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;String&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="nx"&gt;second_mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GLOBL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;second_mask&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RODATA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;NOPTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;String&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="nx"&gt;third_mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GLOBL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;third_mask&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RODATA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;NOPTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;String&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="nx"&gt;MOVD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;VPSHUFB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;first_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;VPSHUFB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;second_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;VPSHUFB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;third_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
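&lt;p&gt;To see why three masks are enough, the shuffle can be modelled in plain Go. This is only a sketch: &lt;code&gt;pshufb&lt;/code&gt; and &lt;code&gt;rotatedMask&lt;/code&gt; are hypothetical helpers, not part of goputer or avo, and the colour bytes are made up for illustration.&lt;/p&gt;

```go
package main

import "fmt"

// pshufb models the byte shuffle on one 128 bit register: output byte i is
// src[mask[i] mod 16], or zero when the top bit of mask[i] is set.
func pshufb(src, mask [16]byte) [16]byte {
	var out [16]byte
	for i, m := range mask {
		if m > 0x7F {
			continue // top bit set: this byte stays zero
		}
		out[i] = src[m%16]
	}
	return out
}

// rotatedMask builds a 16 byte mask that cycles 0, 1, 2 beginning at start.
// start values 0, 1 and 2 reproduce first_mask, second_mask and third_mask.
func rotatedMask(start int) [16]byte {
	var mask [16]byte
	for i := range mask {
		mask[i] = byte((start + i) % 3)
	}
	return mask
}

func main() {
	// Hypothetical colour: R=0x11, G=0x22, B=0x33 in the low three bytes.
	src := [16]byte{0x11, 0x22, 0x33}

	var stream []byte
	for _, start := range []int{0, 1, 2} {
		shuffled := pshufb(src, rotatedMask(start))
		stream = append(stream, shuffled[:]...)
	}

	fmt.Printf("% x\n", stream[14:20]) // bytes either side of the first register boundary
	fmt.Println(len(stream))
}
```

&lt;p&gt;Three 16 byte registers hold 48 bytes, which is exactly 16 RGB triples, so storing them back to back repeats the pattern without a seam at the register boundaries.&lt;/p&gt;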

&lt;p&gt;Once we have shuffled the contents correctly for each register, we can store that data in memory:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;Label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;loop&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;96&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;112&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;144&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;160&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;176&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;ADDQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;192&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;CMPQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;JBE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LabelRef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;loop&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;max&lt;/code&gt; is a value calculated before this (unrolled&lt;sup id="fnref:unrolling"&gt;&lt;a class="footnote-ref" href="#fn:unrolling"&gt;[6]&lt;/a&gt;&lt;/sup&gt;) loop which stores the bounds of the array. The &lt;code&gt;Mem&lt;/code&gt; struct tells avo to interpret this argument as a memory address, with the start (base) of this address being the value of &lt;code&gt;ptr&lt;/code&gt;, and then offset by the value of &lt;code&gt;Disp&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;VMOVDQA&lt;/code&gt; is the "Move Double Quadword Aligned" instruction, which moves data between a vector register and main memory in either direction. The "Aligned" part means that the memory address must be aligned on a certain boundary, in our case 16 bytes, the width of an XMM register; the instruction faults if the address is misaligned. This is because CPUs &lt;strong&gt;mainly&lt;/strong&gt; access memory in aligned chunks, so an aligned access tends to be faster than an &lt;em&gt;unaligned&lt;/em&gt; memory operation, which may straddle two cache lines.&lt;/p&gt;
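&lt;p&gt;As a rough scalar picture of what the unrolled loop does, the sketch below fills a buffer 192 bytes at a time. &lt;code&gt;clearFrame&lt;/code&gt; is a hypothetical name, the buffer size is invented, and the sketch assumes the buffer length is a multiple of 192; it says nothing about how goputer computes &lt;code&gt;max&lt;/code&gt; or handles any leftover bytes.&lt;/p&gt;

```go
package main

import "fmt"

// clearFrame fills buf with a repeating RGB pattern, 192 bytes per iteration,
// mirroring the four groups of three 16 byte stores in the unrolled loop.
// Assumption: len(buf) is a multiple of 192; a tail would need separate handling.
func clearFrame(buf []byte, r, g, b byte) {
	// 48 bytes stand in for the three shuffled registers (xmm0, xmm1, xmm2).
	var pattern [48]byte
	for i := range pattern {
		pattern[i] = [3]byte{r, g, b}[i%3]
	}

	for off := 0; len(buf)-off >= 192; off += 192 {
		for group := 0; group != 4; group++ {
			copy(buf[off+group*48:], pattern[:])
		}
	}
}

func main() {
	frame := make([]byte, 384) // hypothetical 128 pixel buffer
	clearFrame(frame, 0x11, 0x22, 0x33)
	fmt.Printf("% x\n", frame[189:195]) // bytes across the iteration boundary
}
```

&lt;p&gt;Because 192 is a multiple of both 16 and 3, every iteration starts on a register boundary and on the first channel of a pixel.&lt;/p&gt;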
&lt;span class="header" id="video-areas"&gt;&lt;a href="#video-areas"&gt;##&lt;/a&gt;&lt;h2&gt;Video areas&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;A "video area" (&lt;code&gt;int va&lt;/code&gt;) is the term used in goputer for drawing a rectangle to the screen. The logic for drawing an opaque video area is broadly the same as video clearing, apart from a different loop, so I have decided not to include it for brevity. Instead, this section focuses on drawing a transparent area and the additional code which is required to do that.&lt;/p&gt;
&lt;p&gt;The main difference between drawing an opaque colour and a transparent colour is that we now need to not only load data from memory beforehand, but also perform arithmetic on it.&lt;/p&gt;
&lt;p&gt;Because goputer's framebuffer doesn't actually store alpha values, blending is done in a sort of fake way:&lt;/p&gt;
&lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#x0003D;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;/msub&gt;&lt;mi&gt;&amp;#x000D7;&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mo&gt;&amp;#x0002B;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;/msub&gt;&lt;mi&gt;&amp;#x000D7;&lt;/mi&gt;&lt;mi&gt;&amp;#x000AC;&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mo&gt;&amp;#x0226B;&lt;/mo&gt;&lt;mn&gt;8&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;
&lt;p&gt;Or in a code form:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="p"&gt;^&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
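&lt;p&gt;A small runnable version of the formula for a single channel may help when checking values by hand. &lt;code&gt;blendChannel&lt;/code&gt; is a hypothetical helper and the inputs are invented; it also shows why 16 bits of headroom are sufficient.&lt;/p&gt;

```go
package main

import "fmt"

// blendChannel applies the blend formula to a single colour channel.
// alpha and its bitwise inverse always sum to 255, so the intermediate
// value is at most 255*255 = 65025 and fits in an unsigned 16 bit word.
func blendChannel(src, dst, alpha byte) byte {
	sum := uint16(src)*uint16(alpha) + uint16(dst)*uint16(^alpha)
	return byte(sum >> 8)
}

func main() {
	// Hypothetical values: white drawn over black at alpha 230 (0xE6).
	fmt.Println(blendChannel(255, 0, 230))

	// Blending a colour with itself loses a little, because the weights sum
	// to 255 while the shift divides by 256.
	fmt.Println(blendChannel(200, 200, 230))
}
```

&lt;p&gt;The second call shows the approximation at work: the weights add up to 255 rather than 256, so the result can come out one lower than the exact blend.&lt;/p&gt;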

&lt;p&gt;Fortunately, this is fairly easy to vectorise. The first part of the operation, &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;/msub&gt;&lt;mi&gt;&amp;#x000D7;&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt;, can be performed ahead of the main loop because it is constant for each channel. However, before we do that, the byte values in the register need to be widened to 16 bit words, in order to avoid overflow and preserve accuracy.&lt;/p&gt;
&lt;p&gt;This can be done with the &lt;code&gt;VPMOVZXBW&lt;/code&gt;, or packed move with zero extend, instruction. In our use case, it will take every byte in an XMM register, zero extend it to a word (that is, pad it with zeros rather than leaving garbage data in the upper bits), and then pack those words into a YMM register. For the alpha channel, where the alpha value is 230 (0xE6), each byte becomes 0x00E6 when widened:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/simd%20widen%20zero%20extend.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;p&gt;In order to fill the alpha registers with the non-inverted and inverted alpha values before they are widened, the &lt;code&gt;VPBROADCAST&lt;/code&gt; instruction (specifically with the &lt;code&gt;B&lt;/code&gt; for byte suffix) can be used. This instruction tells the CPU to take the first byte/word/double word etc. of one register, and then fill another (or the same) register with copies of it.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/simd%20broadcast.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
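&lt;p&gt;The broadcast and widen steps can be mimicked with fixed-size arrays in Go. This is a scalar sketch with hypothetical helper names (&lt;code&gt;broadcastByte&lt;/code&gt;, &lt;code&gt;widenZeroExtend&lt;/code&gt;), where a &lt;code&gt;[16]byte&lt;/code&gt; stands in for an XMM register and a &lt;code&gt;[16]uint16&lt;/code&gt; for a YMM register.&lt;/p&gt;

```go
package main

import "fmt"

// broadcastByte models the byte broadcast: every byte of the register
// receives the same value.
func broadcastByte(v byte) [16]byte {
	var out [16]byte
	for i := range out {
		out[i] = v
	}
	return out
}

// widenZeroExtend models the zero-extending widen: each of the 16 bytes
// becomes a 16 bit word whose upper byte is zero.
func widenZeroExtend(in [16]byte) [16]uint16 {
	var out [16]uint16
	for i, b := range in {
		out[i] = uint16(b)
	}
	return out
}

func main() {
	alpha := byte(230) // 0xE6, the example value from the text

	widened := widenZeroExtend(broadcastByte(alpha))
	inverted := widenZeroExtend(broadcastByte(^alpha))

	fmt.Printf("0x%04x 0x%04x\n", widened[0], inverted[15])
}
```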
&lt;p&gt;With all the data in place, now the final colour result needs to be calculated. As mentioned previously, there exists a set of instructions for operating on packed values. So calculating the first part of the expression is as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;VPMULLW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;widened_shuffled_colour_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_alpha_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_shuffled_colour_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This instruction multiplies each pair of packed words, and then stores the low 16 bits (bits 0-15) of each result.&lt;/p&gt;
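&lt;p&gt;Keeping only the low half of each product is the same as wrapping arithmetic on 16 bit integers, which Go's &lt;code&gt;uint16&lt;/code&gt; does by default. The sketch below uses a hypothetical &lt;code&gt;mulLow&lt;/code&gt; helper and made-up inputs to show one product that fits and one that does not.&lt;/p&gt;

```go
package main

import "fmt"

// mulLow models a packed 16 bit multiply that keeps only the low half of each
// product. Go's uint16 arithmetic wraps modulo 65536, which gives the same bits.
func mulLow(a, b [16]uint16) [16]uint16 {
	var out [16]uint16
	for i := range out {
		out[i] = a[i] * b[i]
	}
	return out
}

func main() {
	var colour, alpha [16]uint16
	colour[0], alpha[0] = 0x00FF, 0x00E6 // 255 * 230 = 58650, fits in 16 bits
	colour[1], alpha[1] = 0x0100, 0x0100 // 65536, so the low 16 bits are zero

	product := mulLow(colour, alpha)
	fmt.Printf("0x%04X 0x%04X\n", product[0], product[1])
}
```

&lt;p&gt;Since both operands here started life as bytes, the product can never exceed 255 * 255 = 65025, so nothing is lost by discarding the high half.&lt;/p&gt;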
&lt;p&gt;As for computing the rest of the expression, firstly the data needs to be loaded from memory and widened. Also, unlike before where the data in memory could be discarded, this time the 16th byte needs to be preserved for the next iteration - we are only writing 15 bytes at a time. This can be done with the &lt;code&gt;PEXTRB&lt;/code&gt; (extract byte) and &lt;code&gt;PINSRB&lt;/code&gt; (insert byte) instructions. I &lt;strong&gt;wouldn't actually recommend&lt;/strong&gt; using these, as you will see later, but I am including them here so you can learn from my mistakes.&lt;/p&gt;
&lt;p&gt;Load memory data, and extract last byte:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;VMOVDQU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;XORL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;last_byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;last_byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;PEXTRB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;last_byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;Imm(15)&lt;/code&gt; tells Avo that this is an immediate value - a value embedded in the instruction operands, as opposed to being stored in a register or memory. In this case it will automatically determine the size, but similar functions exist for generating an immediate of a specific size.&lt;/p&gt;
&lt;p&gt;Widen memory data, then multiply it with the inverted alpha values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;VPMOVZXBW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// &amp;quot;Multiply Packed Signed Integers and Store Low Result&amp;quot;&lt;/span&gt;
&lt;span class="nx"&gt;VPMULLW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_inverted_alpha_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Finally, add the two values together and perform the packed shift operation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;// &amp;quot;Add Packed Integers&amp;quot;&lt;/span&gt;
&lt;span class="nx"&gt;VPADDW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_shuffled_colour_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// &amp;quot;Shift Packed Data Right Logical&amp;quot;&lt;/span&gt;
&lt;span class="nx"&gt;VPSRLW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now the values had been calculated, they need to be packed back down into bytes, and stored back into main memory.&lt;/p&gt;
&lt;p&gt;To pack the numbers back down, &lt;code&gt;PACKUSWB&lt;/code&gt; ("Pack With Unsigned Saturation") is used. Saturation refers to how the words (which can be from 0-65535) are compressed back down into the byte range of 0-255. If the value is &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo&gt;&amp;#x0003E;&lt;/mo&gt;&lt;mn&gt;255&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;, then it is clipped at 255. If it is &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo&gt;&amp;#x0003C;&lt;/mo&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;, then it is clipped at zero. If in-between, then it is kept the same.&lt;/p&gt;
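&lt;p&gt;The clamping rule is easy to model for a single word. The sketch below assumes the source words are read as signed 16 bit values, which is how the instruction reference describes the pack; &lt;code&gt;saturateToByte&lt;/code&gt; is a hypothetical name and the inputs are invented.&lt;/p&gt;

```go
package main

import "fmt"

// saturateToByte clamps a signed 16 bit word into the unsigned byte range.
func saturateToByte(w int16) byte {
	switch {
	case w > 255:
		return 255
	case 0 > w:
		return 0
	}
	return byte(w)
}

func main() {
	fmt.Println(saturateToByte(300), saturateToByte(-5), saturateToByte(229))

	// 0xE51A is 58650 as an unsigned word but negative as a signed one, so
	// packing it before the shift would clamp to 0. After shifting right by
	// 8 it is 0x00E5 and passes through unchanged.
	unshifted := uint16(0xE51A)
	fmt.Println(saturateToByte(int16(unshifted)), saturateToByte(int16(unshifted>>8)))
}
```

&lt;p&gt;This is one reason the shift has to come before the pack: once every word is in the range 0-255, saturation never changes a value.&lt;/p&gt;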
&lt;p&gt;Because the pack instruction operates on each 128-bit lane of the YMM register separately, the colour values are now split between the two lanes. Note that this takes place after the rightwards shift, hence each value starts with 0x00:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/pack%20word%20to%20byte.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Values which will be kept are in bold. The line through the centre marks the lane border.&lt;/figcaption&gt;&lt;p&gt;Luckily this can be rectified with the &lt;code&gt;VPERMQ&lt;/code&gt; (Qwords Element Permutation) instruction. This instruction functions in a similar way to the shuffle instruction from earlier, but this time it operates on quad-words (64 bit) and the mask is only one byte, holding four two-bit values for the locations. The 256-bit YMM register is split up into 4 quad-word slots.&lt;/p&gt;
&lt;p&gt;In this case the mask is &lt;code&gt;0xD8&lt;/code&gt;, or &lt;code&gt;0b11011000&lt;/code&gt; in binary form. The destination is implicit, based on the location of the bits within the mask (the lowest two bits control Q0), and the value of each pair of bits selects the source.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;00&lt;/td&gt;
&lt;td&gt;Q0&lt;/td&gt;
&lt;td&gt;Q0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Q1&lt;/td&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01&lt;/td&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;td&gt;Q1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Q3&lt;/td&gt;
&lt;td&gt;Q3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The result:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/simd%20permute.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
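&lt;p&gt;To make the mask a bit more concrete, here is a small Go model of the permutation. It is only a sketch: &lt;code&gt;permuteQwords&lt;/code&gt; and the sample values are invented for illustration, and it follows the documented behaviour of the instruction, where each destination slot reads two bits of the mask to pick its source slot.&lt;/p&gt;

```go
package main

import "fmt"

// permuteQwords models the permute on four 64-bit slots: for destination
// slot i, bits 2i and 2i+1 of the mask hold the index of the source slot.
func permuteQwords(src [4]uint64, mask uint8) [4]uint64 {
	var dst [4]uint64
	for i := range dst {
		sel := (mask >> (2 * i)) % 4 // two-bit source index for slot i
		dst[i] = src[sel]
	}
	return dst
}

func main() {
	// Assumption for the demo: after the per-lane pack, the wanted halves
	// (tagged 0xA.) sit in slots 0 and 2, with unwanted data in 1 and 3.
	src := [4]uint64{0xA0, 0xB1, 0xA2, 0xB3}
	fmt.Printf("%#x\n", permuteQwords(src, 0xD8))
}
```

&lt;p&gt;With &lt;code&gt;0xD8&lt;/code&gt; the two middle slots swap places, which is what brings both wanted halves into the lower lane.&lt;/p&gt;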
&lt;p&gt;When all the data is in the correct place in YMM, the lower lane (bits 0-127) can be extracted and placed in an XMM register. This data can then be written back into memory, but only after the last byte is inserted.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;VEXTRACTI128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;PINSRB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;last_byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: This time the move instruction is the &lt;/em&gt;&lt;em&gt;unaligned&lt;/em&gt;&lt;em&gt; variant, as the pointer is incremented by 15 bytes each time, so it does not stay on the 16-byte boundary that the aligned variant requires.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;All of the code in this section has been extracted from the routine used in goputer, which is available &lt;a href="https://github.com/sccreeper/goputer/blob/668142325cf8a3f1ab1b9b425dbc34162c1820ab/pkg/vm/asm/video_area_alpha.go" target="_blank"&gt;on GitHub&lt;/a&gt;. The next section explains the rest of that routine in a bit more detail. It is not necessary for understanding &lt;a href="#is-it-actually-faster"&gt;the conclusion&lt;/a&gt; sections, so feel free to skip it.&lt;/p&gt;
&lt;span class="header" id="extra-notes-on-the-loop-implementation"&gt;&lt;a href="#extra-notes-on-the-loop-implementation"&gt;###&lt;/a&gt;&lt;h3&gt;Extra notes on the loop implementation&lt;/h3&gt;&lt;/span&gt;&lt;p&gt;The loop itself is a standard &lt;code&gt;for y { for x }&lt;/code&gt; style loop:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;Label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_loop&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;MOVQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;CMPL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nx"&gt;JB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LabelRef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_blit_remaining&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nx"&gt;Label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_loop_x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="c1"&gt;// Loop body&lt;/span&gt;
&lt;span class="c1"&gt;// &lt;/span&gt;

&lt;span class="nx"&gt;Label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_loop_end&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;XORL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;counter_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;counter_x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ADDQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;U32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;960&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;INCL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;counter_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;CMPL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;counter_y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;JB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LabelRef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_loop&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
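
&lt;p&gt;For anyone more comfortable reading Go than assembly, the control flow above is roughly equivalent to the sketch below. It does no drawing and only counts which path each pixel would take. The name &lt;code&gt;walkArea&lt;/code&gt; and the constants are assumptions made for this example: 3 bytes per pixel is inferred from 15 bytes covering 5 pixels, and 960 is the value added to the row pointer in the listing.&lt;/p&gt;

```go
package main

import "fmt"

const (
	bytesPerPixel = 3   // assumed RGB layout, matching 15 bytes = 5 pixels
	rowStride     = 960 // bytes added to the row pointer in the listing
	vectorPixels  = 5   // pixels handled per vector step
)

// walkArea mirrors the control flow of the assembly loop without drawing:
// for every row it counts how many pixels would take the vector path and
// how many fall through to the scalar path.
func walkArea(width, rows int) (vectorPx, scalarPx, lastRowOffset int) {
	ptr := 0
	for y := 0; y != rows; y++ {
		lastRowOffset = ptr // MOVQ(ptr, RDI): start of the current row
		remaining := width
		for remaining >= vectorPixels {
			vectorPx += vectorPixels
			remaining -= vectorPixels
		}
		scalarPx += remaining // leftovers are drawn one pixel at a time
		ptr += rowStride      // ADDQ(U32(960), ptr): move down one row
	}
	return vectorPx, scalarPx, lastRowOffset
}

func main() {
	v, s, off := walkArea(12, 3)
	fmt.Println(v, s, off)
}
```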

&lt;p&gt;The body of the loop contains an additional branch to check if we are drawing &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo&gt;&amp;#x0003C;&lt;/mo&gt;&lt;mn&gt;5&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt; pixels (&lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mn&gt;15&lt;/mn&gt;&lt;mtext&gt;&amp;#x000A0;bytes&lt;/mtext&gt;&lt;mi&gt;&amp;#x000F7;&lt;/mi&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;mo&gt;&amp;#x0003D;&lt;/mo&gt;&lt;mn&gt;5&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;). This contains instructions which draw in a scalar as opposed to vector manner:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;MOVBWZX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;IMULW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inv_alpha_16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ADDW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;red_16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;SHRW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;MOVB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;As8L&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The instructions are the same for green and blue, but with a &lt;code&gt;Disp&lt;/code&gt; value in the &lt;code&gt;Mem&lt;/code&gt; struct of 1 and 2 respectively.&lt;/p&gt;
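&lt;p&gt;Written out in Go, the scalar path for one channel looks something like the sketch below. This is my own restatement rather than code from the repository: &lt;code&gt;blendChannel&lt;/code&gt; is a made-up name, and it assumes the colour register holds the source colour already multiplied by alpha and that the inverse alpha is &lt;code&gt;255 - alpha&lt;/code&gt;.&lt;/p&gt;

```go
package main

import "fmt"

// blendChannel mirrors the five scalar instructions for one colour channel.
// premultiplied is assumed to be colour*alpha, computed once outside the loop.
func blendChannel(dst uint8, invAlpha, premultiplied uint16) uint8 {
	tmp := uint16(dst)   // MOVBWZX: load the byte, zero-extended to 16 bits
	tmp *= invAlpha      // IMULW: scale the existing value by the inverse alpha
	tmp += premultiplied // ADDW: add the pre-multiplied source colour
	tmp >>= 8            // SHRW: shift right by 8 to get back into byte range
	return uint8(tmp)    // MOVB: store the low byte
}

func main() {
	const alpha = 128
	pixel := []uint8{10, 200, 90}  // one RGB pixel already in video memory
	colour := []uint16{255, 0, 64} // colour being drawn over it
	for i := range pixel {
		pixel[i] = blendChannel(pixel[i], 255-alpha, colour[i]*alpha)
	}
	fmt.Println(pixel)
}
```

&lt;p&gt;Because the alpha and inverse alpha add up to 255, the intermediate sum never exceeds 255 × 255, so it always fits in 16 bits.&lt;/p&gt;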
&lt;span class="header" id="is-it-actually-faster"&gt;&lt;a href="#is-it-actually-faster"&gt;#&lt;/a&gt;&lt;h1&gt;Is it actually faster?&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Now only one question remained:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;"Is this faster than before?"&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After all, it would be a bit pointless if it wasn't.&lt;/p&gt;
&lt;p&gt;According to goputer's builtin profiler the video clear operation was now &lt;strong&gt;~16% faster&lt;/strong&gt; in canned benchmarks&lt;sup id="fnref:benchmarks"&gt;&lt;a class="footnote-ref" href="#fn:benchmarks"&gt;[7]&lt;/a&gt;&lt;/sup&gt;. Not exactly a huge improvement but still something.&lt;/p&gt;
&lt;p&gt;The video area operation with no alpha was actually on par with the scalar code from before. I tried optimising it further with cache prefetch instructions but these had no measurable effect.&lt;/p&gt;
&lt;p&gt;Initially, the alpha area operation was &lt;strong&gt;slower&lt;/strong&gt;, but after removing the extract/insert byte instructions and replacing them with normal memory -&amp;gt; register and vice versa move instructions, there was a &lt;strong&gt;~60%&lt;/strong&gt; improvement.&lt;/p&gt;
&lt;p&gt;The final code examples used in this article, after improvements, are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/pkg/vm/asm/video_area.go" target="_blank"&gt;pkg/vm/asm/video_area.go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/pkg/vm/asm/video_area_alpha.go" target="_blank"&gt;pkg/vm/asm/video_area_alpha.go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/pkg/vm/asm/video_clear.go" target="_blank"&gt;pkg/vm/asm/video_clear.go&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;span class="header" id="conclusion"&gt;&lt;a href="#conclusion"&gt;#&lt;/a&gt;&lt;h1&gt;Conclusion&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;In conclusion, while some of the improvements made aren't exactly as much as what is theoretically possible with SIMD, they are still something. At the end of the day computers are very fast, and as such will run un-optimised code just as well as optimised code. &lt;/p&gt;
&lt;p&gt;As I write this, Go is planning to add experimental support for &lt;a href="https://antonz.org/go-1-26/#simd" target="_blank"&gt;vector operations&lt;/a&gt; as a builtin feature in the standard library in the next version, 1.26. So if I had done this a few months later it might have been a lot easier, and in the future it'll hopefully support more architectures (even including WASM), so I won't have to touch assembly at all. On the other hand, it has certainly been very interesting learning the ins and outs of x86 assembly and SIMD in particular.&lt;/p&gt;
&lt;p&gt;Finally, apart from the addition of a high level language, the rest of the improvements I have &lt;a href="https://github.com/sccreeper/goputer/issues" target="_blank"&gt;planned&lt;/a&gt; for goputer are fairly minor things, so there &lt;em&gt;probably&lt;/em&gt; won't be another post like this for a while. For now, look at the other posts on &lt;a href="/posts?tags=goputer" hx-get="/posts?tags=goputer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;goputer&lt;/a&gt; or the repository &lt;a href="https://github.com/sccreeper/goputer" target="_blank"&gt;itself&lt;/a&gt;.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn:bytesimd"&gt;
&lt;p&gt;SIMD instructions work better when using multiples of 4 elements, see this &lt;a href="https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/data-layout-basics-2/" target="_blank"&gt;ARM documentation&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:bytesimd" title="Jump back to footnote 1 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:pythonvector"&gt;
&lt;p&gt;In reality the reference Python interpreter, CPython, does not vectorise code. Vectorisation in Python can be achieved by using libraries such as &lt;a href="https://numpy.org/" target="_blank"&gt;numpy&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:pythonvector" title="Jump back to footnote 2 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:addintrinsic"&gt;
&lt;p&gt;See &lt;a href="https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq#intel-c-c++-compiler-intrinsic-equivalents" target="_blank"&gt;paddd&lt;/a&gt; on the x86 reference.&amp;#160;&lt;a class="footnote-backref" href="#fnref:addintrinsic" title="Jump back to footnote 3 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:byteworddouble"&gt;
&lt;p&gt;On x86 the terms byte (8 bits), word (16 bits), double-word (32 bits), and quad-word (64 bits) are sometimes used to denote integer sizes.&amp;#160;&lt;a class="footnote-backref" href="#fnref:byteworddouble" title="Jump back to footnote 4 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:avocredit"&gt;
&lt;p&gt;This example is from the &lt;code&gt;avo&lt;/code&gt; library.&amp;#160;&lt;a class="footnote-backref" href="#fnref:avocredit" title="Jump back to footnote 5 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:unrolling"&gt;
&lt;p&gt;"Unrolling" refers to the practice of including multiple iterations of a loop in its body, in order to reduce the total number of branches which occur.&amp;#160;&lt;a class="footnote-backref" href="#fnref:unrolling" title="Jump back to footnote 6 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:benchmarks"&gt;
&lt;p&gt;Benchmarks were conducted using these examples and this hardware.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Video clearing:&lt;/strong&gt; &lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/examples/video_clear_bench.gpasm" target="_blank"&gt;examples/video_clear_bench.gpasm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Video area, no alpha:&lt;/strong&gt; &lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/examples/area_test.gpasm" target="_blank"&gt;examples/area_test.gpasm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Video area, with alpha:&lt;/strong&gt; &lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/examples/alpha_area_test.gpasm" target="_blank"&gt;examples/alpha_area_test.gpasm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;System specification:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU:&lt;/strong&gt; Ryzen 7 7730U (8C, 16T)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory:&lt;/strong&gt; 16GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WSL OS:&lt;/strong&gt; Fedora 42, 6.6.87.2-microsoft-standard-WSL2&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host OS:&lt;/strong&gt; Windows 11, 10.0.26200&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a class="footnote-backref" href="#fnref:benchmarks" title="Jump back to footnote 7 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>Making pong with my fantasy VM</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/making-pong-in-goputer</id><updated>2025-10-28T00:00:00</updated><summary>Finally, a real project</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;&lt;em&gt;Note: This is a continuation from my &lt;a href="/blog/goputer-profiler" hx-get="/blog/goputer-profiler" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;previous post&lt;/a&gt;. This post can also serve as a more in-depth explanation of one of the goputer examples as well as a tutorial of sorts.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For as long as goputer has existed, I have never made what I would call a complete program or project with it. Any code that has been written for it has been short examples, intended to test functionality rather than showcase what is possible. With this in mind I decided to embark on creating &lt;em&gt;Pong&lt;/em&gt;. While I could have possibly gone for something a bit more advanced, such as &lt;em&gt;Snake&lt;/em&gt; or &lt;em&gt;Tetris&lt;/em&gt;, &lt;em&gt;Pong&lt;/em&gt; was the simplest thing I could make whilst still testing the full feature set of goputer.&lt;/p&gt;
&lt;p&gt;This post is split into four parts with a conclusion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Drawing&lt;/li&gt;
&lt;li&gt;Moving&lt;/li&gt;
&lt;li&gt;Scoring&lt;/li&gt;
&lt;li&gt;Debugging&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even though this post describes the function of each piece of code, I would still recommend skimming over the &lt;a href="https://github.com/sccreeper/goputer/wiki/Syntax" target="_blank"&gt;syntax&lt;/a&gt; page in the documentation beforehand.&lt;/p&gt;
&lt;span class="header" id="drawing"&gt;&lt;a href="#drawing"&gt;#&lt;/a&gt;&lt;h1&gt;Drawing&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;This was the easiest part. I often describe goputer as "an assembly language that controls Microsoft Paint", and it's not far from the truth. A huge part of the feature-set of goputer is drawing stuff to the screen. After all, what's the point of an educational tool if you can't draw things.&lt;/p&gt;
&lt;p&gt;A pong screen is very simple to draw: it contains two paddles, a ball, and a dividing line. I'll get onto the score later.&lt;/p&gt;
&lt;p&gt;Firstly, the initial positions need to be defined so they can be changed later if necessary. It's also very poor practice not to use constants where possible.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#def paddle_offset 5
#def paddle_width 5
#def paddle_height 25
#def paddle_speed 10

#def left_paddle_pos 120
#def right_paddle_pos 120

#def ball_x 160
#def ball_y 120
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first set of definitions define constants that are referenced in later code. The second set define the positions of the paddles and the ball respectively. The reason why everything is multiples of five is because it makes bounds checking ever so slightly easier, i.e. I shouldn't have to worry about &lt;a href="https://en.wikipedia.org/wiki/Arithmetic_underflow" target="_blank"&gt;integer underflow&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These needed to be used in conjunction with a routine that draws the shapes in question onto the screen:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

lda @left_paddle_pos
mov d0 vy0
add vy0 $d:paddle_height
mov a0 vy1

mov $320-d:paddle_offset-d:paddle_width vx0
mov $320-d:paddle_offset vx1

int va

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A quick note on the immediate values prefixed with &lt;code&gt;$&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expressions are evaluated from left-to-right, and don't follow normal operator precedence.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d:xyz&lt;/code&gt; refers to the value of a definition at compile time.&lt;/li&gt;
&lt;/ul&gt;
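&lt;p&gt;To show what left-to-right evaluation means in practice, here is a tiny Go sketch. It is not goputer's actual expression parser: &lt;code&gt;evalLeftToRight&lt;/code&gt; is a hypothetical helper, the input is assumed to be split into tokens already, and the set of supported operators is a guess for the sake of the example.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
)

// evalLeftToRight folds tokens such as ["320", "-", "5", "-", "5"] strictly
// from left to right, so every operator has the same precedence.
func evalLeftToRight(tokens []string) (int, error) {
	if len(tokens)%2 == 0 {
		return 0, fmt.Errorf("want operand (operator operand)*, got %d tokens", len(tokens))
	}
	acc, err := strconv.Atoi(tokens[0])
	if err != nil {
		return 0, err
	}
	for i := 1; i != len(tokens); i += 2 {
		rhs, err := strconv.Atoi(tokens[i+1])
		if err != nil {
			return 0, err
		}
		switch tokens[i] {
		case "+":
			acc += rhs
		case "-":
			acc -= rhs
		case "*":
			acc *= rhs
		default:
			return 0, fmt.Errorf("unsupported operator %q", tokens[i])
		}
	}
	return acc, nil
}

func main() {
	// $320-d:paddle_offset-d:paddle_width with both definitions set to 5.
	fmt.Println(evalLeftToRight([]string{"320", "-", "5", "-", "5"}))
	// No precedence: this is (2+3)*4, not 2+(3*4).
	fmt.Println(evalLeftToRight([]string{"2", "+", "3", "*", "4"}))
}
```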
&lt;p&gt;Drawing the ball and the centre line followed a &lt;a href="https://github.com/sccreeper/goputer/blob/d604504813b27805801260a1b21bc31588053b03/examples/pong.gpasm#L142-L157" target="_blank"&gt;similar routine&lt;/a&gt;.&lt;/p&gt;
&lt;span class="header" id="moving"&gt;&lt;a href="#moving"&gt;#&lt;/a&gt;&lt;h1&gt;Moving&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;A static image is pretty useless if you're trying to make a game, therefore the next thing I focused on was trying to make things move. &lt;/p&gt;
&lt;span class="header" id="moving-the-paddles"&gt;&lt;a href="#moving-the-paddles"&gt;##&lt;/a&gt;&lt;h2&gt;Moving the paddles&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;Firstly, I focused on moving the paddles, as I figured this would be the easiest thing to do because the code should be the same for both of them. The steps that were required were as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Interrupt to register the keypress&lt;/li&gt;
&lt;li&gt;Then decide which key it is, and thus which paddle to move&lt;/li&gt;
&lt;li&gt;Move the respective paddle&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Registering an interrupt listener is simple, however you can only register one per interrupt.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#label keyup

// Do something

iret

#intsub kd keyup
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This means the &lt;code&gt;keyup&lt;/code&gt; label will be called whenever someone presses a key. After an interrupt is called the next step is to preserve any registers we are going to write to.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;push r01
push dp
push dl
push a0
sta @d0_store
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The function of the last line is slightly different because it stores the entire value in &lt;code&gt;d0&lt;/code&gt; on the static stack (populated and allocated at compile time) as opposed to the dynamic stack (populated and allocated at runtime). The &lt;code&gt;sta&lt;/code&gt; instruction differentiates between addressing the static stack with immediate values and main memory with registers by checking if the argument is &lt;a href="https://github.com/sccreeper/goputer/blob/d604504813b27805801260a1b21bc31588053b03/pkg/vm/load_store.go#L40" target="_blank"&gt;greater than the number of registers&lt;/a&gt;. If so the first 4 bytes at that address show how many bytes to store.&lt;/p&gt;
&lt;p&gt;After this there are a series of cascading equality checks to determine which key was pressed:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

neq kc $23
cndjmp @m_lp_down
lda @left_paddle_pos
mov d0 r01
eq r01 $0
cndjmp @move_end
sub r01 $d:paddle_speed
mov a0 d0
sta @left_paddle_pos
jmp @move_end

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above code does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check whether the &lt;code&gt;kc&lt;/code&gt; (current key) register holds any value other than 23. If so, jump to the next equality check; if not, keep going.&lt;/li&gt;
&lt;li&gt;Load the paddle position and see if it is equal to zero. If it is, jump to the cleanup label because we can't move any further.&lt;/li&gt;
&lt;li&gt;Then subtract the paddle speed, because in this case we are moving up the screen.&lt;/li&gt;
&lt;li&gt;Store the paddle position for later use and jump to the cleanup routine.&lt;/li&gt;
&lt;/ol&gt;
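&lt;p&gt;The same four steps, restated in Go as a sketch. The function and constant names are invented for this example, and unsigned 32-bit values are an assumption about the register width.&lt;/p&gt;

```go
package main

import "fmt"

const (
	keyLeftUp   = 23 // key code compared against kc in the listing
	paddleSpeed = 10 // value of the paddle_speed definition
)

// moveLeftPaddleUp returns the new paddle position after a key press.
func moveLeftPaddleUp(key, pos uint32) uint32 {
	if key != keyLeftUp {
		return pos // neq kc $23: some other key, handled by a later check
	}
	if pos == 0 {
		return pos // eq r01 $0: already at the top, nothing to do
	}
	// Like the listing, only an exact zero is guarded against, which is why
	// the positions need to stay multiples of the speed.
	return pos - paddleSpeed
}

func main() {
	fmt.Println(moveLeftPaddleUp(23, 120))
	fmt.Println(moveLeftPaddleUp(23, 0))
	fmt.Println(moveLeftPaddleUp(7, 120))
}
```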
&lt;span class="header" id="moving-the-ball"&gt;&lt;a href="#moving-the-ball"&gt;##&lt;/a&gt;&lt;h2&gt;Moving the ball&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;The ball is slightly different: it moves in two dimensions rather than one, and needs to move every frame/update as opposed to whenever a user presses a specific key. I'm not showing the code that loads values into registers, as you've already seen an example of that, but to help with understanding, the values in each register are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r00&lt;/code&gt; is the ball's X position.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r01&lt;/code&gt; is the Y position.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r02&lt;/code&gt; is the X direction.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r03&lt;/code&gt; is the Y direction.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

// Bounds right
neq r00 $320-d:ball_size
cndjmp @bbc_left
mov $0 r02
push $0
call @increase_score
jmp @check_paddles
// Bounds left
#label bbc_left
neq r00 $0
cndjmp @check_paddles
mov $1 r02
push $1
call @increase_score

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You may notice the call to &lt;code&gt;increase_score&lt;/code&gt;, but I'll get back to that &lt;a href="#scoring"&gt;later&lt;/a&gt;. The bounds checks for the right and left are the same apart from the values involved. However, checking the top and bottom is different because the ball has to bounce rather than reset to the centre of the screen.&lt;/p&gt;
&lt;p&gt;"Bouncing" the ball for both the paddles and the top and bottom of the screen is simple. If the current Y direction value is 1, say when hitting the bottom of the screen, we make it zero and then the ball will go up when the move code is called. The same bouncing logic applies when hitting a paddle, except this time the X direction is changed.&lt;/p&gt;
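&lt;p&gt;As a sketch of the bounds logic in a higher level language, the Go below handles both the scoring edges and the bouncing edges. The names are my own, and the screen height of 240 and ball size of 5 are assumptions, as neither value appears in the listings above.&lt;/p&gt;

```go
package main

import "fmt"

const (
	screenW  = 320
	screenH  = 240 // assumed from the ball starting at y = 120
	ballSize = 5   // assumed value of the ball_size definition
)

type ball struct{ x, y, dirX, dirY int }

// checkBounds updates the ball's direction at the screen edges. It returns
// the value that the listing pushes before calling increase_score, or -1
// when no side wall was hit.
func (b *ball) checkBounds() int {
	scorer := -1
	switch b.x {
	case screenW - ballSize: // right wall
		b.dirX = 0
		scorer = 0
	case 0: // left wall
		b.dirX = 1
		scorer = 1
	}
	switch b.y {
	case screenH - ballSize:
		b.dirY = 0 // hit the bottom: head back up
	case 0:
		b.dirY = 1 // hit the top: head back down
	}
	return scorer
}

func main() {
	b := ball{x: 100, y: screenH - ballSize, dirX: 1, dirY: 1}
	fmt.Println(b.checkBounds(), b.dirX, b.dirY)
}
```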
&lt;p&gt;See the code for checking the right paddle:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

eq r00 $320-d:ball_size-d:paddle_width-d:paddle_offset
mov a0 r15
gt r01 r05
mov a0 r14
add r05 $d:paddle_height
lt r01 a0
mov a0 r13
and r15 r14
and a0 r13
inv a0
cndjmp @check_left_paddle
mov $0 r02
jmp @bbc_bottom

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The process of the routine is as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check whether the ball's X position puts it in line with the left-hand side of the paddle.&lt;/li&gt;
&lt;li&gt;Then see if the ball's Y position is in between the top and bottom of the paddle.&lt;/li&gt;
&lt;li&gt;If both of these conditions are true (successive &lt;code&gt;and&lt;/code&gt; operations), then we invert the ball direction and jump to the bottom bounds check.&lt;/li&gt;
&lt;/ol&gt;
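&lt;p&gt;The three conditions can be restated in Go as below. This is a sketch with invented names; the paddle constants come from the definitions earlier in the post, while the ball size of 5 is an assumption.&lt;/p&gt;

```go
package main

import "fmt"

const (
	screenW      = 320
	ballSize     = 5 // assumed value of the ball_size definition
	paddleWidth  = 5
	paddleOffset = 5
	paddleHeight = 25
)

// hitsRightPaddle reports whether the ball should bounce off the right
// paddle, using the same three comparisons as the listing.
func hitsRightPaddle(ballX, ballY, paddleY int) bool {
	if ballX != screenW-ballSize-paddleWidth-paddleOffset {
		return false // not in line with the paddle's left-hand side
	}
	if ballY > paddleY {
		return paddleY+paddleHeight > ballY // below the top and above the bottom
	}
	return false
}

func main() {
	fmt.Println(hitsRightPaddle(305, 130, 120))
	fmt.Println(hitsRightPaddle(305, 150, 120))
	fmt.Println(hitsRightPaddle(300, 130, 120))
}
```

&lt;p&gt;Note that both Y comparisons are strict, as in the listing, so a ball exactly level with the top edge of the paddle does not count as a hit.&lt;/p&gt;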
&lt;span class="header" id="scoring"&gt;&lt;a href="#scoring"&gt;#&lt;/a&gt;&lt;h1&gt;Scoring&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Now all the movement logic was in place I needed a way to score the game. As mentioned earlier, the &lt;a href="https://github.com/sccreeper/goputer/blob/d604504813b27805801260a1b21bc31588053b03/examples/pong.gpasm#L355" target="_blank"&gt;&lt;code&gt;increase_score&lt;/code&gt; routine&lt;/a&gt; is called whenever the ball hits the left or right side. Before it is called, one value is pushed to the stack - which player's score to increase.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

lda @player_one_score
mov d0 r14
add r14 $1
mov a0 d0
sta @player_one_score
jmp @reset_ball

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Above: Increasing the score for the left player/player one.&lt;/p&gt;
&lt;p&gt;Even though the score was increasing, you still couldn't see what its value was, at least without inspecting registers/memory locations. I still needed to write the assembly to display it.&lt;/p&gt;
&lt;span class="header" id="integers-to-strings"&gt;&lt;a href="#integers-to-strings"&gt;##&lt;/a&gt;&lt;h2&gt;Integers to strings&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;In most other languages you'd simply make use of a standard library function to convert an integer to a string: in Golang you'd use &lt;code&gt;strconv.Itoa(x)&lt;/code&gt;, and in Python the even simpler &lt;code&gt;str(x)&lt;/code&gt;. Unfortunately, working in an assembly language means you don't have access to such luxuries, so you have to make these things yourself.&lt;/p&gt;
&lt;p&gt;The loop that is used to convert an integer into a string is below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

eq r01 $0
cndjmp @draw_zero

#label convert_number
mod r01 $10
add a0 $48
mov $1 dl
mov a0 vt
sr vt $1
div r01 $10
mov a0 r01

neq r01 $0
cndjmp @convert_number

sl vt $1
int vt

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;First we check if the number is already zero, as this is a special case. After that we keep taking the remainder of the number divided by ten, then writing the corresponding character to &lt;code&gt;vt&lt;/code&gt; (the text buffer) - we also shift the text buffer right at this stage. We store the result of the division of the number for use in the next iteration of the loop. This repeats until the result of the division is zero.&lt;/p&gt;
&lt;p&gt;We also have to shift the text buffer left by one byte to cleanup from the last iteration.&lt;/p&gt;
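&lt;p&gt;For comparison, here is the same algorithm written by hand in Go, without reaching for &lt;code&gt;strconv&lt;/code&gt;. It is a sketch rather than a translation: the &lt;code&gt;itoa&lt;/code&gt; name is mine, and prepending to a byte slice stands in for shifting the text buffer.&lt;/p&gt;

```go
package main

import "fmt"

// itoa converts a non-negative integer to decimal text by repeatedly taking
// the remainder modulo ten, the same approach as the convert_number loop.
func itoa(n uint32) string {
	if n == 0 {
		return "0" // special case, like the draw_zero branch
	}
	var buf []byte
	for n != 0 {
		digit := byte(n%10) + 48 // 48 is the character code for '0'
		// The least significant digit comes out first, so each new digit
		// goes in front of the ones already written.
		buf = append([]byte{digit}, buf...)
		n /= 10
	}
	return string(buf)
}

func main() {
	fmt.Println(itoa(0), itoa(7), itoa(1234))
}
```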
&lt;span class="header" id="debugging"&gt;&lt;a href="#debugging"&gt;#&lt;/a&gt;&lt;h1&gt;Debugging&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;In all honesty there weren't actually that many bugs with the assembly I wrote, at least not any major ones that took ages to figure out. That being said, there were some problems with the runtime that I didn't know about or should have known about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The data stack didn't work properly, which meant the ball always teleported to the top of the screen when a key was pressed.&lt;/li&gt;
&lt;li&gt;The possibility of a race condition if a key was pressed in the middle of a ball update.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The stack not working properly was easy enough to fix. The race condition, however, required the addition of two new instructions and the exposure of the previously internal interrupt flag through a new control register. The two new instructions were inspired by x86's &lt;code&gt;sti&lt;/code&gt; and &lt;code&gt;cli&lt;/code&gt; (set and clear interrupt flag) instructions, albeit with different names.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pri

mov r00 d0
sta @ball_x

mov r01 d0
sta @ball_y

mov r02 d0
sta @ball_direction_x

mov r03 d0
sta @ball_direction_y

eni
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this case, &lt;code&gt;pri&lt;/code&gt; prevents interrupts, and &lt;code&gt;eni&lt;/code&gt; re-enables them. The instructions in-between write the updated ball location data to memory.&lt;/p&gt;
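&lt;p&gt;The idea can be modelled in a few lines of Go. This is a toy, not how the goputer VM is implemented: the &lt;code&gt;vm&lt;/code&gt; type and its methods are invented for this example, and it assumes that an interrupt raised while interrupts are prevented is held back until they are re-enabled, which the post does not state.&lt;/p&gt;

```go
package main

import "fmt"

// vm is a toy model holding just the interrupt flag and deferred handlers.
type vm struct {
	interruptsEnabled bool
	pending           []func()
}

// pri prevents interrupts from being serviced.
func (m *vm) pri() { m.interruptsEnabled = false }

// eni re-enables interrupts and runs anything that arrived in the meantime.
func (m *vm) eni() {
	m.interruptsEnabled = true
	for _, handler := range m.pending {
		handler()
	}
	m.pending = nil
}

// raise services an interrupt immediately, or holds it if they are disabled.
func (m *vm) raise(handler func()) {
	if !m.interruptsEnabled {
		m.pending = append(m.pending, handler)
		return
	}
	handler()
}

func main() {
	m := vm{interruptsEnabled: true}
	ballX, ballY := 0, 0

	m.pri()
	ballX = 160
	// A key press arrives halfway through the ball update.
	m.raise(func() { fmt.Println("key handler sees", ballX, ballY) })
	ballY = 120
	m.eni()
}
```

&lt;p&gt;Without the &lt;code&gt;pri&lt;/code&gt;/&lt;code&gt;eni&lt;/code&gt; pair the handler would run in the middle of the update and see a new X position paired with an old Y position.&lt;/p&gt;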
&lt;span class="header" id="conclusion"&gt;&lt;a href="#conclusion"&gt;#&lt;/a&gt;&lt;h1&gt;Conclusion&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/pong.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;The pong program running in the web frontend&lt;/figcaption&gt;&lt;p&gt;The next steps are probably to write a better &lt;em&gt;Pong&lt;/em&gt; clone rather than the quite simple one I've got now, and also to find a way to speed up the VM - which will almost certainly require writing some of the rendering instructions in assembly. The other major improvement I have made since the last post has been to rewrite the expansion system around Lua instead of Go's &lt;code&gt;plugin&lt;/code&gt; library, meaning that it is now cross platform.&lt;/p&gt;
&lt;p&gt;Other than making the VM faster, a proper IDE or even a VSCode extension would be nice.&lt;/p&gt;
&lt;p&gt;If you want to see any other posts linked to Goputer, click &lt;a href="/posts?tags=goputer" hx-get="/posts?tags=goputer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;here&lt;/a&gt; or the repository is &lt;a href="https://github.com/sccreeper/goputer" target="_blank"&gt;here&lt;/a&gt;. The full source code for the &lt;em&gt;Pong&lt;/em&gt; example talked about in this blog post is on GitHub &lt;a href="https://github.com/sccreeper/goputer/blob/d604504813b27805801260a1b21bc31588053b03/examples/pong.gpasm" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>Creating a profiler for goputer</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/goputer-profiler</id><updated>2025-08-09T00:00:00</updated><summary>Sometimes you have to do some measuring</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;&lt;em&gt;Note: This is a continuation from my &lt;a href="/blog/writing-software-renderer" hx-get="/blog/writing-software-renderer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;previous post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For a while now, goputer has had no way of measuring performance. The closest you could get was adding a frame rate counter to a frontend and hoping it would be somewhat accurate - spoiler, it wasn't. Turns out it takes a lot longer to render a frame than to add two numbers together, who would've guessed. This necessitated the creation of a better solution: a profiler.&lt;/p&gt;
&lt;p&gt;I wasn't aiming for anything complex, such as Google's &lt;a href="https://github.com/google/pprof" target="_blank"&gt;pprof&lt;/a&gt;, instead I was just intending to measure how long each instruction took to run, then do a tiny bit of statistics to make the collected values slightly more useful. If anything it would also serve to highlight any outstanding issues with the core runtime in addition to any assembly I wrote.&lt;/p&gt;
&lt;span class="header" id="design"&gt;&lt;a href="#design"&gt;#&lt;/a&gt;&lt;h1&gt;Design&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;My only experience using anything that could be considered a profiler thus far had been using browser dev-tools, which are much more complicated than anything I intended to make and are made for a very different domain. As such, I settled upon a very simple way of measuring performance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Measure the time for each cycle&lt;/li&gt;
&lt;li&gt;Add that data to an entry in a hashmap&lt;/li&gt;
&lt;li&gt;Do some additional stuff to calculate the mean etc. at the end&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As for the design I decided on a &lt;code&gt;Profiler&lt;/code&gt; type which would have a &lt;code&gt;Data&lt;/code&gt; field of type &lt;code&gt;map[uint64]*ProfileEntry&lt;/code&gt; (the pointer was so I could change values in place). &lt;code&gt;ProfileEntry&lt;/code&gt; would contain various data about that &lt;strong&gt;specific&lt;/strong&gt; instruction: total cycle time and times executed.&lt;/p&gt;
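&lt;p&gt;A minimal sketch of that shape might look like the following. Only &lt;code&gt;Profiler&lt;/code&gt;, &lt;code&gt;Data&lt;/code&gt;, and &lt;code&gt;ProfileEntry&lt;/code&gt; come from the description above; the field names, the &lt;code&gt;Record&lt;/code&gt; and &lt;code&gt;Mean&lt;/code&gt; methods, and the choice of nanoseconds are assumptions made for illustration.&lt;/p&gt;

```go
package main

import "fmt"

// ProfileEntry holds running totals for one specific instruction.
// The field names and the nanosecond unit are assumptions.
type ProfileEntry struct {
	TotalCycleTime uint64 // summed cycle time in nanoseconds
	TimesExecuted  uint64
}

// Profiler maps a packed instruction/address key to its entry.
type Profiler struct {
	Data map[uint64]*ProfileEntry
}

// Record (hypothetical) adds one measured cycle to the entry for key,
// creating the entry the first time the key is seen.
func (p *Profiler) Record(key uint64, elapsedNanos uint64) {
	entry, ok := p.Data[key]
	if !ok {
		entry = new(ProfileEntry)
		p.Data[key] = entry
	}
	// entry is a pointer, so these updates change the map value in place.
	entry.TotalCycleTime += elapsedNanos
	entry.TimesExecuted++
}

// Mean (hypothetical) returns the average cycle time for key, or 0 if unseen.
func (p *Profiler) Mean(key uint64) float64 {
	entry, ok := p.Data[key]
	if !ok {
		return 0
	}
	return float64(entry.TotalCycleTime) / float64(entry.TimesExecuted)
}

func main() {
	p := Profiler{Data: map[uint64]*ProfileEntry{}}
	p.Record(1, 100)
	p.Record(1, 300)
	fmt.Println(p.Mean(1))
}
```

&lt;p&gt;Storing pointers in the map is what lets &lt;code&gt;Record&lt;/code&gt; update the totals without writing the struct back after every cycle.&lt;/p&gt;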
&lt;p&gt;The key for the map was created by packing the instruction (5 bytes) and the current program counter value, sort of akin to a &lt;a href="https://en.wikipedia.org/wiki/Composite_key" target="_blank"&gt;composite key&lt;/a&gt;. While this meant that only 3 of the program counter's 4 bytes were available (largest value 16,777,215, so roughly 16MB), the default addressable memory size (not including the video buffer) for goputer is 65KB, so I don't see this as being a problem unless the program counter is overwritten. If I wanted to in the future I could change the key to two uint64's or just 9 bytes.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Address&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;40 bits&lt;/td&gt;
&lt;td&gt;24 bits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
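&lt;p&gt;As a rough illustration of the layout in the table, the packing could be sketched as below. The &lt;code&gt;PackKey&lt;/code&gt; and &lt;code&gt;UnpackKey&lt;/code&gt; names are hypothetical, and putting the instruction in the upper 40 bits is an assumption based on the column order.&lt;/p&gt;

```go
package main

import "fmt"

const (
	addrBits  = 24
	addrLimit = 0x1000000 // 2^24, the number of distinct 24-bit addresses
)

// PackKey (hypothetical) combines a 40-bit instruction and a program counter
// into one uint64. Multiplying by 2^24 is the same as shifting left by 24
// bits, which leaves the low 24 bits free for the address.
func PackKey(instruction uint64, pc uint32) uint64 {
	// Only the low 24 bits of pc fit; anything above them is dropped.
	return instruction*addrLimit + uint64(pc)%addrLimit
}

// UnpackKey reverses PackKey.
func UnpackKey(key uint64) (instruction uint64, pc uint32) {
	return key >> addrBits, uint32(key % addrLimit)
}

func main() {
	key := PackKey(0x0102030405, 0x00ABCD)
	instruction, pc := UnpackKey(key)
	fmt.Printf("%#x %#x %#x\n", key, instruction, pc)
}
```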
&lt;p&gt;Now that I had collected the data, I needed some way to store it. I could have simply dumped the entire thing to a JSON file and been done with it, however this had the major drawback of being slower and producing a larger file. Also, I just needed to store numbers.&lt;/p&gt;
&lt;p&gt;All byte orders are little endian.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Magic number&lt;/th&gt;
&lt;th&gt;Number of entries&lt;/th&gt;
&lt;th&gt;Total cycles&lt;/th&gt;
&lt;th&gt;Entries&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GPPR&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8 bytes (uint64)&lt;/td&gt;
&lt;td&gt;8 bytes (uint64)&lt;/td&gt;
&lt;td&gt;Variable (33 bytes per entry)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each entry was further broken down as follows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Address&lt;/th&gt;
&lt;th&gt;Total cycle time&lt;/th&gt;
&lt;th&gt;Total times executed&lt;/th&gt;
&lt;th&gt;Standard deviation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 bytes&lt;/td&gt;
&lt;td&gt;4 bytes (&lt;code&gt;uint32&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;8 bytes (&lt;code&gt;uint64&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;8 bytes (&lt;code&gt;uint64&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;8 bytes (&lt;code&gt;float64&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Note: The standard deviation is technically stored as a uint64 in little-endian order, see &lt;a href="https://pkg.go.dev/math#Float64bits" target="_blank"&gt;&lt;code&gt;math.Float64bits&lt;/code&gt;&lt;/a&gt; which does some &lt;code&gt;unsafe.Pointer&lt;/code&gt; stuff, specifically &lt;code&gt;*(*uint64)(unsafe.Pointer(&amp;amp;f))&lt;/code&gt;, where &lt;code&gt;f&lt;/code&gt; is the &lt;code&gt;float64&lt;/code&gt; in question.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To make things a bit nicer the &lt;code&gt;Load&lt;/code&gt; and &lt;code&gt;Dump&lt;/code&gt; methods used Go's &lt;code&gt;io.ReadSeeker&lt;/code&gt; and &lt;code&gt;io.WriteSeeker&lt;/code&gt; interfaces respectively.&lt;/p&gt;
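&lt;p&gt;To make the entry layout a bit more concrete, here is a small sketch that writes and reads a single 33 byte entry using &lt;code&gt;encoding/binary&lt;/code&gt;. The &lt;code&gt;Entry&lt;/code&gt; type and the &lt;code&gt;Encode&lt;/code&gt; and &lt;code&gt;Decode&lt;/code&gt; names are made up for this example, the field order follows the table above, and it works on a byte slice rather than the real &lt;code&gt;Load&lt;/code&gt; and &lt;code&gt;Dump&lt;/code&gt; methods.&lt;/p&gt;

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// entrySize follows the table: 5 + 4 + 8 + 8 + 8 bytes.
const entrySize = 33

// Entry is a hypothetical in-memory form of one stored entry.
type Entry struct {
	Instruction    [5]byte
	Address        uint32
	TotalCycleTime uint64
	TimesExecuted  uint64
	StdDev         float64
}

// Encode lays the fields out in table order, little endian throughout.
func (e Entry) Encode() []byte {
	buf := make([]byte, entrySize)
	copy(buf[0:5], e.Instruction[:])
	binary.LittleEndian.PutUint32(buf[5:9], e.Address)
	binary.LittleEndian.PutUint64(buf[9:17], e.TotalCycleTime)
	binary.LittleEndian.PutUint64(buf[17:25], e.TimesExecuted)
	// The float64 is stored as its raw IEEE 754 bit pattern.
	binary.LittleEndian.PutUint64(buf[25:33], math.Float64bits(e.StdDev))
	return buf
}

// Decode reverses Encode; buf must hold at least entrySize bytes.
func Decode(buf []byte) Entry {
	var e Entry
	copy(e.Instruction[:], buf[0:5])
	e.Address = binary.LittleEndian.Uint32(buf[5:9])
	e.TotalCycleTime = binary.LittleEndian.Uint64(buf[9:17])
	e.TimesExecuted = binary.LittleEndian.Uint64(buf[17:25])
	e.StdDev = math.Float64frombits(binary.LittleEndian.Uint64(buf[25:33]))
	return e
}

func main() {
	e := Entry{Instruction: [5]byte{1, 2, 3, 4, 5}, Address: 42, TotalCycleTime: 900, TimesExecuted: 3, StdDev: 1.5}
	buf := e.Encode()
	fmt.Println(len(buf), Decode(buf) == e)
}
```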
&lt;span class="header" id="making-it-useful"&gt;&lt;a href="#making-it-useful"&gt;#&lt;/a&gt;&lt;h1&gt;Making it useful&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Collecting all this information was okay, but it was pretty useless on its own unless you like reading binary file formats or printed structs. I needed some way to display it; for this I chose &lt;a href="https://github.com/rivo/tview" target="_blank"&gt;&lt;code&gt;tview&lt;/code&gt;&lt;/a&gt;, a TUI framework, as while there were already some small GUI apps I had developed for goputer, most notably the launcher, they really should have been console apps instead.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/profiler/profiler.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;The profiler UI, with instructions sorted by times executed&lt;/figcaption&gt;&lt;p&gt;In the above screenshot, the source for the executable being analysed is available &lt;a href="https://github.com/sccreeper/goputer/blob/73942d06998e8cf1fea7e3e2a41acc92bf566660/examples/area_test.gpasm" target="_blank"&gt;here&lt;/a&gt;. As you may have noticed there is also an option for grouping, which simply groups instructions that have the same opcode together and aggregates their data (apart from standard deviation and addresses).&lt;/p&gt;
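&lt;p&gt;The grouping step can be pictured as a simple fold over the rows. The sketch below is only an approximation: the &lt;code&gt;Row&lt;/code&gt; type, its fields, and &lt;code&gt;GroupByOpcode&lt;/code&gt; are hypothetical names, and it assumes the opcode has already been extracted from each instruction.&lt;/p&gt;

```go
package main

import "fmt"

// Row is a hypothetical flattened view of one profiler entry.
type Row struct {
	Opcode         byte
	TotalCycleTime uint64
	TimesExecuted  uint64
}

// GroupByOpcode (hypothetical) sums the totals of every row sharing an opcode.
// Standard deviation and addresses are left out, as they are not aggregated.
func GroupByOpcode(rows []Row) map[byte]Row {
	groups := map[byte]Row{}
	for _, r := range rows {
		g := groups[r.Opcode] // zero value the first time an opcode is seen
		g.Opcode = r.Opcode
		g.TotalCycleTime += r.TotalCycleTime
		g.TimesExecuted += r.TimesExecuted
		groups[r.Opcode] = g
	}
	return groups
}

func main() {
	rows := []Row{
		{Opcode: 1, TotalCycleTime: 100, TimesExecuted: 2},
		{Opcode: 2, TotalCycleTime: 40, TimesExecuted: 1},
		{Opcode: 1, TotalCycleTime: 300, TimesExecuted: 4},
	}
	fmt.Println(GroupByOpcode(rows)[1])
}
```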
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/profiler/profiler%20grouped.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Profiler with grouping enabled&lt;/figcaption&gt;&lt;p&gt;You can also jump to instances of specific instructions using the "instruction" field. The first screenshot is much closer to the original representation of the data.&lt;/p&gt;
&lt;p&gt;As a side-note, the numbers in that screenshot are much higher than they should be; in canned benchmarks they were much lower. My current theory is that for some reason running a windowed application in WSL is slower than a non-windowed one. I'll only find out for sure once I make goputer run natively (on Windows without WSL) though.&lt;/p&gt;
&lt;p&gt;Another useful addition might be to add some other form of visualisation (i.e. barcharts) however I can't really see the utility of it over the existing data view.&lt;/p&gt;
&lt;span class="header" id="what-next"&gt;&lt;a href="#what-next"&gt;#&lt;/a&gt;&lt;h1&gt;What next?&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Now armed with a profiler I can hopefully go about creating more complex programs and maybe even a high-level language, using the profiler to make performance analysis ever so slightly easier.&lt;/p&gt;
&lt;p&gt;This was one of the last major improvements I had planned for goputer. The ones that are most likely to be developed next are moving frontends to standalone executables instead of using Go's &lt;code&gt;plugin&lt;/code&gt; library, which only works on Linux, FreeBSD, and macOS, so no Windows support at all, and moving extensions to embedded Lua for the same reason. Software rendering doesn't show in the Python frontend/the Python frontend hasn't been updated since software rendering was implemented so that'll have to be fixed as well.&lt;/p&gt;
&lt;p&gt;Also at the moment the profiler is just tacked onto the gp32 runtime without any way of turning it off, but that'll come as part of the previously mentioned frontends to standalone executables thing.&lt;/p&gt;
&lt;p&gt;If you want to see any other posts linked to Goputer, click &lt;a href="/posts?tags=goputer" hx-get="/posts?tags=goputer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;here&lt;/a&gt; or the repository itself is &lt;a href="https://github.com/sccreeper/goputer" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>URL Shortener with Flask and HTMX</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/url-shortener</id><updated>2025-07-07T00:00:00</updated><summary>A tiny app that can be recreated in a weekend</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;It's been a while since I've made anything new with HTMX. The last thing was the &lt;a href="/blog/more-blog-updates" hx-get="/blog/more-blog-updates" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;comment section&lt;/a&gt; for this blog, which itself &lt;a href="/blog/making-a-blog-with-flask-and-htmx" hx-get="/blog/making-a-blog-with-flask-and-htmx" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;is made&lt;/a&gt; using Flask and HTMX. TLDR; I like using Flask and HTMX. Something I've always wanted to have is some kind of link shortening service. I'm well aware that services like bit.ly and Tiny URL exist however in my opinion it's much better to have something of your own especially when it's incredibly simple to make.&lt;/p&gt;
&lt;p&gt;While I could've used a fullstack framework for this, e.g. Svelte, I already had some experience using Flask and HTMX together now and I wanted to see how quickly I could create something usable. In the end it turned out that I was indeed able to create something rather quickly. &lt;/p&gt;
&lt;p&gt;Since I had already chosen Flask and HTMX the only remaining decision was what database engine to use. This ended up being SQLite for the same reasons mentioned before, ease of use, familiarity etc. This project was essentially the biggest case of why you shouldn't reinvent the wheel, something I have done too many times already, using new stacks for projects where the scope just isn't worth it, and later coming to regret it as a result.&lt;/p&gt;
&lt;p&gt;The codebase was then hammered together using the same outline I had figured out when I added comments to this blog, with some minor improvements - having data migration from the start and correct use of indexes. In the end the thing I ended up with was rather ugly, because I didn't use any CSS bar some for colouring &lt;span style="color: red;"&gt;error messages&lt;/span&gt;, choosing to stick to the principles of &lt;a href="https://justfuckingusehtml.com/" target="_blank"&gt;justfuckingusehtml.com&lt;/a&gt; instead.&lt;/p&gt;
&lt;p&gt;I also added a JSON API which used the same routes as the HTMX, just expecting and returning JSON instead of form data and HTML respectively. No need for another backend server when you have &lt;code&gt;if&lt;/code&gt; statements and schemas.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/signpost%20admin%20ui.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Zoomed in screenshot of the admin UI&lt;/figcaption&gt;&lt;p&gt;Opaque controls whether to return a 301 redirect response or a small JavaScript snippet that does the redirection. This means that any link previews won't work when it's enabled. Useful for creating rick roll links.&lt;/p&gt;
&lt;p&gt;You can test it for yourself now by clicking on &lt;a href="https://ospe.lol/o7JU" target="_blank"&gt;this link&lt;/a&gt; which redirects to this blogpost.&lt;/p&gt;
&lt;p&gt;The source for the project itself can be found &lt;a href="https://github.com/sccreeper/signpost" target="_blank"&gt;here&lt;/a&gt; on GitHub.&lt;/p&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>DLAPS stack</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/dlaps-stack</id><updated>2025-06-27T00:00:00</updated><summary>How I deploy most of my projects</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;Most of my "online" projects (including this blog) use what I call the DLAPS stack which can be summarised as the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;D&lt;/strong&gt;ocker&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;L&lt;/strong&gt;inux&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A&lt;/strong&gt;pache&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;P&lt;/strong&gt;rogram&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S&lt;/strong&gt;QLite&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Alternatively if deploying straight onto bare metal the SLAPS stack also works and rolls off the tongue much better. However, because most deployments don't use that it'll have to be DLAPS. Essentially it is a variation of the classic &lt;a href="https://en.wikipedia.org/w/index.php?title=LAMP_(software_bundle)&amp;amp;oldid=1295099813" target="_blank"&gt;LAMP stack&lt;/a&gt; but using SQLite instead of MySQL and Docker as a virtualisation layer.&lt;/p&gt;
&lt;p&gt;Why Docker? Because it runs anywhere and you can run anything in it. Not to sound too much like their marketing copy but it's true. Also great for avoiding testing in prod.&lt;/p&gt;
&lt;p&gt;Linux technically occurs in two places in this stack. That is, whatever bare metal machine Docker and Apache are running on and in whatever Docker container(s) are running. The most honest reason for using it is quite frankly why would you ever use anything else on a server (at least definitely not Windows), also the widest adoption, most support etc.&lt;/p&gt;
&lt;p&gt;Any reverse proxy could fulfill the role of Apache (nginx, Caddy, etc.), however I have the most experience with Apache so it'll have to stay that way. Like I said earlier this was based on the LAMP stack so I have to be somewhat faithful to the original when coming up with a formal &lt;em&gt;de jure&lt;/em&gt; version of it.&lt;/p&gt;
&lt;p&gt;As for the program part this could really be anything. It doesn't have to communicate with the internet as such, thus the Apache part would then be redundant. The reason for choosing program over Python or Perl etc. is because I also write Go - &lt;a href="https://github.com/sccreeper/goputer" target="_blank"&gt;goputer&lt;/a&gt; and &lt;a href="https://github.com/sccreeper/chime" target="_blank"&gt;Chime&lt;/a&gt;. While only the latter is internet based I do want to write some more Go based web stuff in the future (probably using &lt;a href="https://github.com/a-h/templ" target="_blank"&gt;templ&lt;/a&gt;), or even any other language.&lt;/p&gt;
&lt;p&gt;Finally, SQLite. Probably the best database out there. If anything I was doing needed maybe a slightly larger feature-set I would probably switch to PostgreSQL seeing as that's what I've been taught to use at university. SQLite is lightweight, easy to deploy, and could probably run on a little Arduino (a &lt;a href="https://github.com/siara-cc/esp32_arduino_sqlite3_lib" target="_blank"&gt;library&lt;/a&gt; exists for running it on an ESP32, which is a slightly beefier micro-controller) with some minor changes. TLDR; it runs literally everywhere and is good for the vast majority of use cases.&lt;/p&gt;
&lt;p&gt;To conclude in a world of "serverless" this and that it's nice to have something that's deployable on a real machine, because lets face it that Cloudflare free tier (which admittedly I use myself for my &lt;a href="https://emc.oscarcp.net/" target="_blank"&gt;EarthMC dashboard&lt;/a&gt;) isn't going to last once your app starts getting traction and you need actual performance, not that that'll ever happen anyway so you might as well stick to a tiny VPS and call it a day.&lt;/p&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>Optimising Go with SIMD assembly</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/optimising-goputer-with-assembly</id><updated>2026-01-21T00:00:00</updated><summary>Rendering is slow, but it can be made a lot faster</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;&lt;em&gt;Note: This is a continuation from my &lt;a href="/blog/making-pong-in-goputer" hx-get="/blog/making-pong-in-goputer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;previous post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Ever since goputer's switch to software rendering - instead of relying on frontends to do rendering operations themselves in hardware - drawing anything to the screen has been a bit of a bottleneck. Turns out doing stuff in hardware is a lot faster than doing it in software, especially when because of some very bad decisions earlier on, none of these operations are using 4 bytes and thus cannot be easily vectorised by a good compiler&lt;sup id="fnref:bytesimd"&gt;&lt;a class="footnote-ref" href="#fn:bytesimd"&gt;[1]&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;The following post details my experience optimising my Go code with assembly. If you are already familiar with SIMD operations then feel free to skip the next section, however it may be helpful for anyone else who isn't. This article focuses on x86 vector instructions, however the principles of SIMD are the same for all CPU architectures.&lt;/p&gt;
&lt;span class="header" id="what-is-vectorisation-and-simd"&gt;&lt;a href="#what-is-vectorisation-and-simd"&gt;#&lt;/a&gt;&lt;h1&gt;What is vectorisation and SIMD?&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Firstly SIMD stands for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;S&lt;/strong&gt;ingle&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;I&lt;/strong&gt;nstruction&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;M&lt;/strong&gt;ultiple&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;D&lt;/strong&gt;ata&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence this means that for a given instruction, which has multiple parameters (registers and/or memory locations), these parameters may themselves contain multiple pieces of data to be operated on in parallel.&lt;/p&gt;
&lt;p&gt;Also, as opposed to scalar instructions which will operate on standard sized registers, vector instructions mostly operate on &lt;strong&gt;vector registers&lt;/strong&gt;. On x86 vector registers can be 128, 256, or 512 bits wide.&lt;/p&gt;
&lt;p&gt;&lt;span style="color: red;"&gt;Vector&lt;/span&gt;isation refers to the process of rewriting a piece of code to make use of &lt;span style="color: red;"&gt;vector&lt;/span&gt; instructions.&lt;/p&gt;
&lt;p&gt;A very simple example of vectorisation would be to say you had a loop which added two (very long) arrays together. Below is some Python&lt;sup id="fnref:pythonvector"&gt;&lt;a class="footnote-ref" href="#fn:pythonvector"&gt;[2]&lt;/a&gt;&lt;/sup&gt; pseudocode to demonstrate this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Traditionally the above code would be executed in a &lt;strong&gt;scalar&lt;/strong&gt; manner. I.e. one addition would have to be executed for every pair of integers, a and b. This means that 1024 additions would have to be performed. While this may execute fairly quickly for small-ish arrays of integers such as this, it will get increasingly slower as the arrays grow.&lt;/p&gt;
&lt;p&gt;A "vectorised" version of the above code would be as follows (for this example, a and b can be thought of as an array of &lt;code&gt;uint32&lt;/code&gt;'s):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vpaddd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: The &lt;code&gt;extend&lt;/code&gt; method is used to concatenate the two lists together.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Compared to before we are now adding 16 integers together at a time, instead of just 1. This means our code will now execute (in theory) 16x faster.&lt;/p&gt;
&lt;p&gt;Our &lt;code&gt;vpaddd&lt;/code&gt; function can be thought of as an instruction intrinsic. An intrinsic is a "function" that tells the compiler that you want this operation to be carried out with this specific instruction. In our case, &lt;code&gt;vpaddd&lt;/code&gt; refers to the "Add Packed Integers" instruction, specifically adding packed doublewords - a packed instruction operates on values of its size (byte/word/double etc.) as opposed to treating the register like one big number. In real code which uses intrinsics this would be &lt;code&gt;_mm_add_epi32&lt;/code&gt;&lt;sup id="fnref:addintrinsic"&gt;&lt;a class="footnote-ref" href="#fn:addintrinsic"&gt;[3]&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;In our case, each integer is 32 bits wide, and we are using 512 bit &lt;code&gt;zmm&lt;/code&gt; registers when performing the add instruction.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/pre%20simd%20add.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Values in ZMM registers before adding, when i is zero&lt;/figcaption&gt;&lt;p&gt;The offset value in the above diagram refers to the number of bits each value is offset from the start of the register. Typically a vector register would be displayed with its most significant bit first, but in this article, the least significant bit is displayed first.&lt;/p&gt;
&lt;p&gt;We then perform the &lt;code&gt;vpaddd&lt;/code&gt; instruction:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;vpaddd zmm2, zmm1, zmm0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This takes each packed (packed, i.e. they are stored next to each other in the register, no gaps) double-word (32 bit integer)&lt;sup id="fnref:byteworddouble"&gt;&lt;a class="footnote-ref" href="#fn:byteworddouble"&gt;[4]&lt;/a&gt;&lt;/sup&gt; in &lt;code&gt;zmm0&lt;/code&gt; and &lt;code&gt;zmm1&lt;/code&gt;, and adds them together, storing the result in &lt;code&gt;zmm2&lt;/code&gt;. In Intel assembly syntax, operands follow the order of &lt;code&gt;destination, source&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;After we perform our &lt;code&gt;vpaddd&lt;/code&gt; instruction, &lt;code&gt;zmm2&lt;/code&gt; now has our result:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/post%20simd%20add.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;p&gt;In reality our &lt;code&gt;.extend&lt;/code&gt; operation at the end would translate to a &lt;code&gt;vmovdqa64&lt;/code&gt; (move double quadword aligned) instruction, in order to move our data back to our result array.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;vmovdqa64 [rax], zmm0 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: The register &lt;code&gt;rax&lt;/code&gt; holds the address of our current position in the result array. In practice any large enough register could contain this value.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Adding isn't the only operation you can perform on a vector register though. Pretty much any operation that can be applied to a scalar register can be applied to vector registers as well, whether it be other arithmetic operations such as multiplication, or logical operations, and shifting.&lt;/p&gt;
&lt;p&gt;There also exists a set of instructions specifically for operating on vector registers, such as broadcasting, shuffling, extraction/insertion, or other permutations. How these work will be detailed in the rest of the post.&lt;/p&gt;
&lt;span class="header" id="using-assembly-with-go"&gt;&lt;a href="#using-assembly-with-go"&gt;#&lt;/a&gt;&lt;h1&gt;Using assembly with Go&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;The first half of this section focuses on writing raw assembly for use with Go, and the second half focuses on making use of the Avo library. I would recommend reading both halves as it is important to understand the assembly that Avo generates, and it is also sometimes easier to debug your code by reading the generated assembly.&lt;/p&gt;
&lt;span class="header" id="raw-assembly"&gt;&lt;a href="#raw-assembly"&gt;##&lt;/a&gt;&lt;h2&gt;Raw assembly&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;&lt;em&gt;Note: From now on, Go's Plan9 inspired assembly syntax is used, the main difference being the order of operands is different (source, destination).&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Go makes it quite easy to link with assembly. It is used a lot in the standard library, especially in the &lt;code&gt;math&lt;/code&gt; package, to make use of hardware specific features for various operations.&lt;/p&gt;
&lt;p&gt;For example &lt;a href="https://github.com/golang/go/blob/455282911aba7512e2ba045ffd9244eb97756247/src/math/exp_amd64.s" target="_blank"&gt;&lt;code&gt;math/exp_amd64.s&lt;/code&gt;&lt;/a&gt; contains optimised code for calculating a base &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; exponential.&lt;/p&gt;
&lt;p&gt;A much simpler example would be something that &lt;a href="https://github.com/mmcloughlin/avo/tree/c096992c06ffcc996c6ebb924d9d5cf1e53e49d4/examples/add" target="_blank"&gt;adds two numbers&lt;/a&gt;&lt;sup id="fnref:avocredit"&gt;&lt;a class="footnote-ref" href="#fn:avocredit"&gt;[5]&lt;/a&gt;&lt;/sup&gt;, i.e:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#include &amp;quot;textflag.h&amp;quot;

// func Add(x uint64, y uint64) uint64
TEXT &#183;Add(SB), NOSPLIT, $0-24
    MOVQ x+0(FP), AX
    MOVQ y+8(FP), CX
    ADDQ AX, CX
    MOVQ CX, ret+16(FP)
    RET
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;An explainer on the syntax used above:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;TEXT &#183;Add(SB), NOSPLIT, $0-24
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Above is the function declaration. &lt;code&gt;SB&lt;/code&gt; is a virtual register used to refer to the static base pointer, which can be thought of as the origin of memory.&lt;/p&gt;
&lt;p&gt;Names and descriptions for the pseudo-registers from the &lt;a href="https://go.dev/doc/asm" target="_blank"&gt;Go documentation&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;FP: Frame pointer: arguments and locals.&lt;/li&gt;
&lt;li&gt;PC: Program counter: jumps and branches.&lt;/li&gt;
&lt;li&gt;SB: Static base pointer: global symbols.&lt;/li&gt;
&lt;li&gt;SP: Stack pointer: the highest address within the local stack frame.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;code&gt;NOSPLIT&lt;/code&gt; tells the compiler not to insert a special piece of code that checks if the stack needs to grow. I don't fully understand this myself, and for our purposes it isn't important, but for those interested there is a detailed explanation in &lt;a href="https://mcyoung.xyz/2025/07/07/nosplit/" target="_blank"&gt;this blogpost&lt;/a&gt; by Miguel Young de la Sota.&lt;/p&gt;
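&lt;p&gt;&lt;code&gt;$0-24&lt;/code&gt; specifies two sizes: the size of the function's local stack frame (0 bytes, as no local stack space is used), followed by the combined size of its arguments and return value (24 bytes, i.e. three 8-byte values).&lt;/p&gt;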
&lt;p&gt;The move instruction, &lt;code&gt;MOVQ x+0(FP), AX&lt;/code&gt;, can be read as: move the parameter &lt;code&gt;x&lt;/code&gt;, offset zero bytes from the frame pointer &lt;code&gt;FP&lt;/code&gt;, to the register &lt;code&gt;AX&lt;/code&gt;. The same applies to the second move instruction, except with an offset of 8 bytes instead of 0 and &lt;code&gt;CX&lt;/code&gt; as the destination.&lt;/p&gt;
&lt;p&gt;The final move instruction can be read as: move the value in &lt;code&gt;CX&lt;/code&gt; to the return parameter, offset 16 bytes from the frame pointer.&lt;/p&gt;
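&lt;p&gt;As a rough illustration of where these offsets come from, the argument area can be modelled as a struct. This is only a sketch: &lt;code&gt;addFrame&lt;/code&gt; and &lt;code&gt;frameOffsets&lt;/code&gt; are hypothetical names, and it assumes the stack-based calling convention used by Go assembly, where arguments and results are laid out one after another starting at &lt;code&gt;FP&lt;/code&gt;.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"unsafe"
)

// addFrame is a hypothetical struct that mirrors how the arguments and the
// result of Add(x, y uint64) uint64 are laid out relative to FP.
type addFrame struct {
	x   uint64 // x+0(FP)
	y   uint64 // y+8(FP)
	ret uint64 // ret+16(FP)
}

// frameOffsets returns the byte offsets of x, y and ret, plus the total
// size of the argument area (the "24" in $0-24).
func frameOffsets() (x, y, ret, total uintptr) {
	var f addFrame
	return unsafe.Offsetof(f.x), unsafe.Offsetof(f.y), unsafe.Offsetof(f.ret), unsafe.Sizeof(f)
}

func main() {
	x, y, ret, total := frameOffsets()
	fmt.Printf("x+%d(FP) y+%d(FP) ret+%d(FP), argument area: %d bytes\n", x, y, ret, total)
}
```

&lt;p&gt;The offsets 0, 8 and 16 match the operands used in the assembly, and the 24 byte total is the second number in &lt;code&gt;$0-24&lt;/code&gt;.&lt;/p&gt;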
&lt;p&gt;This can then be linked to the following method stub:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The method stub can then be used in any normal Go code, just as a normal function would.&lt;/p&gt;
&lt;p&gt;While the above approach is fine for small functions, larger functions make argument sizes, constant data, and register allocation tedious to manage by hand, so it is useful to have some way to take this "boilerplate" work away.&lt;/p&gt;
&lt;span class="header" id="here-comes-avo"&gt;&lt;a href="#here-comes-avo"&gt;##&lt;/a&gt;&lt;h2&gt;Here comes Avo&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;&lt;a href="https://github.com/mmcloughlin/avo" target="_blank"&gt;Avo&lt;/a&gt; is a Go library that can be used to generate assembly by just writing normal Go code. The assembly example used previously was actually from Avo's examples directory. You still have to "write" the assembly yourself, but Avo will take care of all the boring stuff like argument sizes and Plan9 syntax for you.&lt;/p&gt;
&lt;p&gt;Here is the Go code which could have been written instead to generate that assembly - in Avo, as in Go's assembly, operands follow a source, dest order:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;github.com/mmcloughlin/avo/build&amp;quot;&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Add&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;NOSPLIT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;func(x, y uint64) uint64&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP64&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP64&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;ADDQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;Store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ReturnIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;RET&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nx"&gt;Generate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice how &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are now actually variables, in this case virtual registers (64 bit) which have been allocated for us and initialized by Avo.&lt;/p&gt;
&lt;p&gt;We then use the &lt;code&gt;Store&lt;/code&gt; function to generate the instruction which writes the result back to the return value's slot on the stack.&lt;/p&gt;
&lt;p&gt;Avo also generates the Go stub file for us.&lt;/p&gt;
&lt;span class="header" id="using-simd-with-go"&gt;&lt;a href="#using-simd-with-go"&gt;#&lt;/a&gt;&lt;h1&gt;Using SIMD with Go&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;As stated previously, the primary operation I wanted to optimise with SIMD was rendering. This is not only because it was a bottleneck, but also because rendering is an ideal candidate (or in my case slightly less than ideal) for a SIMD workflow.&lt;/p&gt;
&lt;p&gt;In my case, I also couldn't use any AVX-512 instructions as my CPU does not support them.&lt;/p&gt;
&lt;span class="header" id="video-clearing"&gt;&lt;a href="#video-clearing"&gt;##&lt;/a&gt;&lt;h2&gt;Video clearing&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;The first operation I wanted to make use of SIMD was &lt;code&gt;int vc&lt;/code&gt;, or video clear interrupt. While a better approach could have been to make use of the &lt;code&gt;rep movsb&lt;/code&gt; instruction, as I didn't care what was already in the buffer, just replacing it, the buffer format (RGB8), meant that this was not possible.&lt;/p&gt;
&lt;p&gt;The steps I needed to take in order to do this were as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Assemble the colour&lt;/li&gt;
&lt;li&gt;Move the colour data to memory&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;"Assembling" the colour meant loading the RGB values from the parameters, then placing this data in the correct order in a vector register, or actually multiple, in order for it to be ordered properly.&lt;/p&gt;
&lt;p&gt;Loading the parameters then ordering the colour correctly in a 32 bit register is quite simple:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;red&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;green&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;green&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;blue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GP32&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nx"&gt;MOVL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;U32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;MOVB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;As8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;SHLL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ORB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;green&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;As8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nx"&gt;SHLL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ORB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;red&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;As8&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The value in this register can then be moved to an XMM register using a &lt;code&gt;MOVD&lt;/code&gt; (move double word) instruction. Also a quick note, x86 is &lt;strong&gt;little endian&lt;/strong&gt;, which means that the least significant byte of a value is stored first, at the lowest address. This is why we move the blue in first, then green, then red, as opposed to the other way round: after the two shifts, red sits in the least significant byte and so comes first in memory. This is important when we consider the order of bytes in the XMM register.&lt;/p&gt;
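&lt;p&gt;To check the byte order, the same packing can be sketched in plain Go. The helper names &lt;code&gt;packColour&lt;/code&gt; and &lt;code&gt;memoryOrder&lt;/code&gt; are made up for this illustration, and the multiplications by 256 stand in for the 8 bit left shifts in the Avo code.&lt;/p&gt;

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// packColour mirrors the MOVB/SHLL/ORB sequence: blue goes in first and is
// shifted up twice, so red ends up in the least significant byte.
func packColour(red, green, blue uint8) uint32 {
	tmp := uint32(blue)
	tmp = tmp*256 + uint32(green) // equivalent to shifting left by 8, then OR-ing in green
	tmp = tmp*256 + uint32(red)   // equivalent to shifting left by 8, then OR-ing in red
	return tmp
}

// memoryOrder returns the four bytes of v in the order a little endian
// machine such as x86 stores them in memory.
func memoryOrder(v uint32) [4]byte {
	var out [4]byte
	binary.LittleEndian.PutUint32(out[:], v)
	return out
}

func main() {
	v := packColour(0xd6, 0x23, 0xf4)
	fmt.Printf("register value: %#x, bytes in memory: % x\n", v, memoryOrder(v))
}
```

&lt;p&gt;For the colour &lt;code&gt;#d623f4&lt;/code&gt; the register holds &lt;code&gt;0x00f423d6&lt;/code&gt;, which is laid out in memory as red, green, blue, then a zero byte.&lt;/p&gt;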
&lt;p&gt;Once we've constructed the colour, we can move it to the lowest 4 bytes of the XMM register:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;MOVD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once this data was in the XMM register, the rest of the register needed to be filled with it. Because the pattern is 3 bytes long and XMM registers are 16 bytes wide, it does not divide evenly into a single register. The pattern only lines up again after 48 bytes, so it has to be repeated across 3 XMM registers in order to work properly.&lt;/p&gt;
&lt;p&gt;An appropriate instruction for this is the shuffle instruction, specifically &lt;a href="https://www.felixcloutier.com/x86/pshufb" target="_blank"&gt;&lt;code&gt;PSHUFB&lt;/code&gt;&lt;/a&gt; or &lt;em&gt;Packed Shuffle Bytes&lt;/em&gt;. It re-orders the contents of one vector register according to a shuffle mask contained in another. TL;DR: given source &lt;code&gt;x&lt;/code&gt;, destination &lt;code&gt;y&lt;/code&gt;, and mask &lt;code&gt;m&lt;/code&gt;, &lt;code&gt;y[i] = x[m[i]]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/simd%20shuffle.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Values are coloured to show which index each mask entry sources its value from&lt;/figcaption&gt;&lt;p&gt;In this case the register is filled with a &lt;strong&gt;&lt;span style="color: #d623f4;"&gt;purple-ish&lt;/span&gt;&lt;/strong&gt; colour.&lt;/p&gt;
&lt;p&gt;Or in code:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;shuffle_mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GLOBL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;shuffle_mask&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RODATA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;NOPTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;String&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="nx"&gt;VPSHUFB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shuffle_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
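&lt;p&gt;To make the &lt;code&gt;y[i] = x[m[i]]&lt;/code&gt; rule concrete, here is a small model of the 128 bit shuffle in plain Go. It is only a sketch for experimenting with masks, not how the instruction is implemented, and the function name &lt;code&gt;pshufb&lt;/code&gt; is simply chosen to match the instruction. It also models one detail not mentioned above: a mask byte with its high bit set zeroes the output byte.&lt;/p&gt;

```go
package main

import "fmt"

// pshufb is a plain Go model of the 128 bit byte shuffle: every output byte
// is picked from src using the matching mask byte as an index.
func pshufb(src, mask [16]byte) [16]byte {
	var dst [16]byte
	for i, m := range mask {
		if m >= 0x80 {
			continue // high bit set in the mask: the output byte stays zero
		}
		dst[i] = src[m%16] // only the low four bits select the source byte
	}
	return dst
}

func main() {
	// Colour bytes in the low three lanes, the rest zero, as left by MOVD.
	src := [16]byte{0xd6, 0x23, 0xf4}
	mask := [16]byte{0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0}

	fmt.Printf("% x\n", pshufb(src, mask))
}
```

&lt;p&gt;The result repeats the three colour bytes five times, followed by a lone red byte in the last lane.&lt;/p&gt;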

&lt;p&gt;There is a problem here: because 16 is not a multiple of 3, the mask does not end on a complete sequence (it ends in 0 instead of 2). Multiple registers with slightly different masks must therefore be used, each continuing where the previous one left off, in order to form one complete sequence.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;first_mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GLOBL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;first_mask&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RODATA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;NOPTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;String&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="nx"&gt;second_mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GLOBL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;second_mask&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RODATA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;NOPTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;String&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="nx"&gt;third_mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;GLOBL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;third_mask&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RODATA&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;NOPTR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;DATA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;String&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="nx"&gt;MOVD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;VPSHUFB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;first_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;VPSHUFB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;second_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;VPSHUFB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;third_mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
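&lt;p&gt;The three masks do not have to be worked out by hand. As a sketch (the constants and the &lt;code&gt;buildMasks&lt;/code&gt; helper are hypothetical, not part of goputer or Avo), each mask byte is just its overall position in the 48 byte output, modulo the pattern length:&lt;/p&gt;

```go
package main

import "fmt"

const (
	patternLen  = 3  // bytes per RGB8 pixel
	registerLen = 16 // bytes per XMM register
	registers   = 3  // 3 * 16 = 48 bytes, the first length divisible by both 3 and 16
)

// buildMasks derives the shuffle masks: byte i of register k sits at overall
// position k*registerLen+i in the output, so it has to pick colour byte
// (k*registerLen+i) % patternLen.
func buildMasks() [registers][registerLen]byte {
	var masks [registers][registerLen]byte
	for k := range masks {
		for i := range masks[k] {
			masks[k][i] = byte((k*registerLen + i) % patternLen)
		}
	}
	return masks
}

func main() {
	for k, mask := range buildMasks() {
		fmt.Printf("mask %d: %v\n", k, mask)
	}
}
```

&lt;p&gt;This reproduces the three masks above: the second starts at 1 because 16 mod 3 is 1, and the third starts at 2 because 32 mod 3 is 2.&lt;/p&gt;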

&lt;p&gt;Once we have shuffled the contents correctly for each register, we can store that data in memory:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;Label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;loop&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;96&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;112&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;144&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;160&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;xmm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Disp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;176&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nx"&gt;ADDQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;192&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;CMPQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;JBE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LabelRef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;loop&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;max&lt;/code&gt; is a value calculated before this (unrolled&lt;sup id="fnref:unrolling"&gt;&lt;a class="footnote-ref" href="#fn:unrolling"&gt;[6]&lt;/a&gt;&lt;/sup&gt;) loop which stores the bounds of the array. The &lt;code&gt;Mem&lt;/code&gt; struct tells avo to interpret this argument as a memory address, with the start (base) of this address being the value of &lt;code&gt;ptr&lt;/code&gt;, and then offset by the value of &lt;code&gt;Disp&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;VMOVDQA&lt;/code&gt; is the "Move Double Quadword Aligned" instruction, which moves data between a vector register and main memory in either direction. The "Aligned" part means that the memory address must be a multiple of the operand size, in our case 16 bytes because we are storing 128 bit XMM registers; an unaligned address causes a fault. Aligned accesses never straddle a cache line, so they tend to be at least as fast as an &lt;em&gt;unaligned&lt;/em&gt; memory operation.&lt;/p&gt;
&lt;span class="header" id="video-areas"&gt;&lt;a href="#video-areas"&gt;##&lt;/a&gt;&lt;h2&gt;Video areas&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;A "video area" (&lt;code&gt;int va&lt;/code&gt;) is the term used in goputer for drawing a rectangle to the screen. The logic for drawing an opaque video area is broadly the same as video clearing, apart from a different loop, so I have decided not to include it for brevity. Instead, this section focuses on drawing a transparent area and the additional code which is required to do that.&lt;/p&gt;
&lt;p&gt;The main difference between drawing an opaque colour and a transparent colour is that we now need to not only load data from memory beforehand, but also perform arithmetic on it.&lt;/p&gt;
&lt;p&gt;Because goputer's framebuffer doesn't actually store alpha values, blending is done in a sort of fake way:&lt;/p&gt;
&lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#x0003D;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;/msub&gt;&lt;mi&gt;&amp;#x000D7;&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mo&gt;&amp;#x0002B;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mi&gt;d&lt;/mi&gt;&lt;mi&gt;e&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;t&lt;/mi&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;/msub&gt;&lt;mi&gt;&amp;#x000D7;&lt;/mi&gt;&lt;mi&gt;&amp;#x000AC;&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mo&gt;&amp;#x0226B;&lt;/mo&gt;&lt;mn&gt;8&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;
&lt;p&gt;Or in a code form:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="p"&gt;^&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
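&lt;p&gt;As a minimal, self-contained sketch of the formula above, the function below blends a single channel. The name &lt;code&gt;blendChannel&lt;/code&gt; and its parameters are made up for illustration and are not taken from goputer; the only assumption is that the alpha value is a byte, so that &lt;code&gt;^alpha&lt;/code&gt; equals 255 minus alpha.&lt;/p&gt;

```go
package main

import "fmt"

// blendChannel applies the blend formula from the text to one colour channel.
// alpha is the source alpha; ^alpha (bitwise NOT) equals 255-alpha for a byte.
// The name and signature are hypothetical, chosen only for this example.
func blendChannel(src, dest, alpha byte) byte {
	return byte((int(src)*int(alpha) + int(dest)*int(^alpha)) >> 8)
}

func main() {
	// A full-intensity source over an empty destination, alpha 230.
	fmt.Println(blendChannel(255, 0, 230))
	// An empty source over a full-intensity destination, alpha 230.
	fmt.Println(blendChannel(0, 255, 230))
}
```

&lt;p&gt;Because the sum is divided by 256 (the shift by 8) rather than 255, results come out very slightly darker than an exact blend, which is part of what makes this a "sort of fake" blend.&lt;/p&gt;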

&lt;p&gt;Fortunately, this is fairly easy to vectorise. The first part of the operation, &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;R&lt;/mi&gt;&lt;/msub&gt;&lt;mi&gt;&amp;#x000D7;&lt;/mi&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;msub&gt;&lt;mi&gt;c&lt;/mi&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/msub&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;/mrow&gt;&lt;/math&gt;, can be performed ahead of the main loop because it is constant for each channel. However, before we do that, the byte values in the register need to be widened to 16 bit words, in order to avoid overflow and preserve accuracy.&lt;/p&gt;
&lt;p&gt;This can be done with the &lt;code&gt;VPMOVZXBW&lt;/code&gt;, or packed move with zero extend, instruction. In our use case, it takes every byte in an XMM register, zero extends it to a word (that is, pads it with extra zeros instead of leaving garbage data in the upper bits), and packs those words into a YMM register. For the alpha channel, an alpha value of 230 (0xE6) becomes 0x00E6 when widened:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/simd%20widen%20zero%20extend.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;p&gt;In order to fill the alpha registers with the alpha and inverted alpha values before they are widened, the &lt;code&gt;VPBROADCAST&lt;/code&gt; instruction (specifically with the &lt;code&gt;B&lt;/code&gt; suffix, for byte) can be used. This instruction tells the CPU to take the first byte/word/double word etc. of a source register, and then fill another (or the same) register with copies of it.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/simd%20broadcast.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
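&lt;p&gt;For readers who find code easier to follow than diagrams, here is a rough scalar model in Go of what the broadcast and zero extend steps do to a 16 byte register. The function names are invented for this sketch, and plain arrays stand in for the XMM and YMM registers.&lt;/p&gt;

```go
package main

import "fmt"

// broadcastByte models VPBROADCASTB: every byte lane receives the same value.
// (Hypothetical helper; a [16]byte stands in for an XMM register.)
func broadcastByte(b byte) [16]byte {
	var out [16]byte
	for i := range out {
		out[i] = b
	}
	return out
}

// zeroExtend models VPMOVZXBW: each byte becomes a 16 bit word whose upper
// 8 bits are zero. (A [16]uint16 stands in for a YMM register.)
func zeroExtend(in [16]byte) [16]uint16 {
	var out [16]uint16
	for i, b := range in {
		out[i] = uint16(b)
	}
	return out
}

func main() {
	alpha := byte(0xE6)
	widenedAlpha := zeroExtend(broadcastByte(alpha))
	widenedInvAlpha := zeroExtend(broadcastByte(^alpha))
	fmt.Printf("%#06x %#06x\n", widenedAlpha[0], widenedInvAlpha[0])
}
```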
&lt;p&gt;With all the data in place, the final colour result now needs to be calculated. As mentioned previously, there exists a set of instructions for operating on packed values, so calculating the first part of the expression is as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;VPMULLW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;widened_shuffled_colour_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_alpha_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_shuffled_colour_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This instruction multiplies each pair of packed words, and then stores the low 16 bits (bits 0-15) of each result.&lt;/p&gt;
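&lt;p&gt;The following sketch models the multiply on 16 word lanes using Go's wrapping &lt;code&gt;uint16&lt;/code&gt; arithmetic. The helper name &lt;code&gt;mulLow&lt;/code&gt; is an assumption made for this example, not something from avo or goputer.&lt;/p&gt;

```go
package main

import "fmt"

// mulLow models VPMULLW on 16 word lanes: multiply lane by lane and keep only
// the low 16 bits of each product. (Hypothetical helper for illustration.)
func mulLow(a, b [16]uint16) [16]uint16 {
	var out [16]uint16
	for i := range out {
		out[i] = a[i] * b[i] // uint16 arithmetic wraps, discarding the high bits
	}
	return out
}

func main() {
	var colour, alpha [16]uint16
	for i := range colour {
		colour[i] = 255 // largest possible channel value
		alpha[i] = 230
	}
	product := mulLow(colour, alpha)
	fmt.Println(product[0]) // 255*230 = 58650, which still fits in 16 bits
}
```

&lt;p&gt;Keeping only the low half is safe here: the alpha and inverted alpha always add up to 255, so the sum of the two products can never exceed 255 × 255 = 65025, which is below the 16 bit limit of 65535.&lt;/p&gt;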
&lt;p&gt;As for computing the rest of the expression, firstly the data needs to be loaded from memory and widened. Also, unlike before where the data in memory could be discarded, this time the 16th byte needs to be preserved for the next iteration - we are only writing 15 bytes at a time. This can be done with the &lt;code&gt;PEXTRB&lt;/code&gt; and &lt;code&gt;PINSRB&lt;/code&gt;, extract byte and insert byte respectively, instructions. I &lt;strong&gt;wouldn't actually recommend&lt;/strong&gt; using these, as you will see later, but I am including them here so you can learn from my mistakes.&lt;/p&gt;
&lt;p&gt;Load memory data, and extract last byte:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;VMOVDQU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;XORL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;last_byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;last_byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;PEXTRB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;last_byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;Imm(15)&lt;/code&gt; tells avo that this is an immediate value - a value embedded in the instruction itself, as opposed to being stored in a register or memory. In this case avo will automatically determine the size, but similar functions exist for generating an immediate of a specific size.&lt;/p&gt;
&lt;p&gt;Widen memory data, then multiply it with the inverted alpha values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;VPMOVZXBW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// &amp;quot;Multiply Packed Signed Integers and Store Low Result&amp;quot;&lt;/span&gt;
&lt;span class="nx"&gt;VPMULLW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_inverted_alpha_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Finally, add the two values together and perform the packed shift operation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;// &amp;quot;Add Packed Integers&amp;quot;&lt;/span&gt;
&lt;span class="nx"&gt;VPADDW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_shuffled_colour_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// &amp;quot;Shift Packed Data Right Logical&amp;quot;&lt;/span&gt;
&lt;span class="nx"&gt;VPSRLW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now that the values have been calculated, they need to be packed back down into bytes, and stored back into main memory.&lt;/p&gt;
&lt;p&gt;To pack the numbers back down, &lt;code&gt;PACKUSWB&lt;/code&gt; ("Pack With Unsigned Saturation") is used. Saturation refers to how the words (which can be from 0-65535) are compressed back down into the byte range of 0-255. If the value is &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo&gt;&amp;#x0003E;&lt;/mo&gt;&lt;mn&gt;255&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;, then it is clipped at 255. If it is &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo&gt;&amp;#x0003C;&lt;/mo&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;, then it is clipped at zero. If in-between, then it is kept the same.&lt;/p&gt;
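&lt;p&gt;A scalar model of the saturation rule for a single lane might look like the sketch below. The function name is hypothetical; the one detail worth noting is that the instruction reads each source word as a signed value, which is why a result below zero is possible at all.&lt;/p&gt;

```go
package main

import "fmt"

// packSaturate models one lane of PACKUSWB: a signed 16 bit word is clamped
// to the unsigned byte range. (Hypothetical helper for illustration.)
func packSaturate(w int16) byte {
	if w > 255 {
		return 255
	}
	if w < 0 {
		return 0
	}
	return byte(w)
}

func main() {
	for _, w := range []int16{-5, 0, 229, 255, 300} {
		fmt.Println(w, packSaturate(w))
	}
}
```

&lt;p&gt;In the blending routine every word has already been shifted right by 8, so all values are within 0-255 and the clamping never actually changes anything.&lt;/p&gt;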
&lt;p&gt;Because the pack instruction operates on each 128 bit lane of the YMM register independently, the packed colour values end up split between the two lanes. Note that this takes place after the rightwards shift, hence each word starts with 0x00:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/pack%20word%20to%20byte.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Values which will be kept are in bold. The line through the centre marks the lane border.&lt;/figcaption&gt;&lt;p&gt;Luckily this can be rectified with the &lt;code&gt;VPERMQ&lt;/code&gt; (Qwords Element Permutation) instruction. This instruction functions in a similar way to the shuffle instruction from earlier, but this time it operates on quad-words (64 bit) and the mask is only one byte, made up of four two bit values for the locations. The 256 bit YMM register is split up into 4 quad-word slots.&lt;/p&gt;
&lt;p&gt;In this case the mask is &lt;code&gt;0xD8&lt;/code&gt;, or &lt;code&gt;0b11011000&lt;/code&gt; in binary form. The destination is implicit, based on the position of the bits within the mask (the lowest two bits control Q0), while the value of each pair of bits selects the source.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;00&lt;/td&gt;
&lt;td&gt;Q0&lt;/td&gt;
&lt;td&gt;Q0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Q1&lt;/td&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;01&lt;/td&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;td&gt;Q1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Q3&lt;/td&gt;
&lt;td&gt;Q3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The result:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/simd%20optimisation/simd%20permute.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
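&lt;p&gt;The permutation can also be modelled in a few lines of Go. This is only a sketch of the documented behaviour of the instruction: destination slot &lt;code&gt;i&lt;/code&gt; is filled from the source slot selected by bits &lt;code&gt;2i&lt;/code&gt; and &lt;code&gt;2i+1&lt;/code&gt; of the mask. The function name and the placeholder slot values are made up for the example.&lt;/p&gt;

```go
package main

import "fmt"

// permuteQwords models VPERMQ: destination slot i takes the source slot
// selected by the i-th pair of bits in imm, counting from the lowest bits.
// (Hypothetical helper; a [4]uint64 stands in for a YMM register.)
func permuteQwords(src [4]uint64, imm uint8) [4]uint64 {
	var dst [4]uint64
	for i := range dst {
		sel := (imm >> (2 * uint(i))) & 3
		dst[i] = src[sel]
	}
	return dst
}

func main() {
	// Placeholder values so each slot is easy to recognise.
	src := [4]uint64{0xA0, 0xA1, 0xA2, 0xA3}
	fmt.Printf("%#x\n", permuteQwords(src, 0xD8))
}
```

&lt;p&gt;With the mask &lt;code&gt;0xD8&lt;/code&gt; the two middle slots swap places, which brings both halves of the packed bytes together in the lower lane.&lt;/p&gt;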
&lt;p&gt;When all the data is in the correct place in YMM, the lower lane (bits 0-127) can be extracted and placed in XMM. This data can then be placed back into memory, but only after the last byte is inserted.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;VEXTRACTI128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;widened_memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;PINSRB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;last_byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;VMOVDQU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;memory_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: This time the move instruction is the unaligned variant (&lt;code&gt;VMOVDQU&lt;/code&gt;), as the pointer is incremented by 15 bytes each time, which is not a multiple of the required alignment, so most addresses would be misaligned.&lt;/em&gt;&lt;/p&gt;
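&lt;p&gt;To make the alignment problem concrete, the sketch below steps a pointer forward by 15 bytes and checks each address against a 16 byte boundary, which is what an aligned 128 bit move requires. The starting address &lt;code&gt;0x1000&lt;/code&gt; and the helper name are assumptions made purely for illustration.&lt;/p&gt;

```go
package main

import "fmt"

// isAligned reports whether addr is a multiple of boundary.
// (Hypothetical helper for illustration.)
func isAligned(addr, boundary uintptr) bool {
	return addr%boundary == 0
}

func main() {
	const base = uintptr(0x1000) // hypothetical 16 byte aligned start address
	for i := uintptr(0); i < 4; i++ {
		addr := base + i*15 // the pointer advances 15 bytes per iteration
		fmt.Println(i, isAligned(addr, 16))
	}
}
```

&lt;p&gt;Only the very first address is aligned; after that the pointer drifts by one byte per iteration relative to the boundary, so an aligned move would fault.&lt;/p&gt;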
&lt;p&gt;All of the code in this section is extracted from the routine used in goputer, which is available &lt;a href="https://github.com/sccreeper/goputer/blob/668142325cf8a3f1ab1b9b425dbc34162c1820ab/pkg/vm/asm/video_area_alpha.go" target="_blank"&gt;on GitHub&lt;/a&gt;. The next section explains the rest of that routine in a bit more detail. It is not necessary for understanding &lt;a href="#is-it-actually-faster"&gt;the conclusion&lt;/a&gt; sections, so feel free to skip it.&lt;/p&gt;
&lt;span class="header" id="extra-notes-on-the-loop-implementation"&gt;&lt;a href="#extra-notes-on-the-loop-implementation"&gt;###&lt;/a&gt;&lt;h3&gt;Extra notes on the loop implementation&lt;/h3&gt;&lt;/span&gt;&lt;p&gt;The loop itself is a standard &lt;code&gt;for y { for x }&lt;/code&gt; style loop:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;Label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_loop&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;MOVQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;CMPL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nx"&gt;JB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LabelRef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_blit_remaining&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nx"&gt;Label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_loop_x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// &lt;/span&gt;
&lt;span class="c1"&gt;// Loop body&lt;/span&gt;
&lt;span class="c1"&gt;// &lt;/span&gt;

&lt;span class="nx"&gt;Label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_loop_end&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;XORL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;counter_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;counter_x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ADDQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;U32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;960&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;INCL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;counter_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;CMPL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;counter_y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;JB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;LabelRef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;a_loop&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The body of the loop contains an additional branch to check if we are drawing &lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mo&gt;&amp;#x0003C;&lt;/mo&gt;&lt;mn&gt;5&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt; pixels (&lt;math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;mrow&gt;&lt;mn&gt;15&lt;/mn&gt;&lt;mtext&gt;&amp;#x000A0;bytes&lt;/mtext&gt;&lt;mi&gt;&amp;#x000F7;&lt;/mi&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;mo&gt;&amp;#x0003D;&lt;/mo&gt;&lt;mn&gt;5&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;). This contains instructions which draw in a scalar as opposed to vector manner:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nx"&gt;MOVBWZX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;IMULW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inv_alpha_16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;ADDW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;red_16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;SHRW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Imm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;MOVB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tmp_16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;As8L&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;Mem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;RDI&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The instructions are the same for green and blue, but with a &lt;code&gt;Disp&lt;/code&gt; value in the &lt;code&gt;Mem&lt;/code&gt; struct of 1 and 2 respectively.&lt;/p&gt;
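&lt;p&gt;For comparison, a plain Go reference for the whole routine could look roughly like the sketch below. It is not the code goputer uses: the function name, its parameters, and the slice-based framebuffer are assumptions for this example. The two details taken from the assembly are the 960 byte row stride (from the &lt;code&gt;ADDQ&lt;/code&gt; above) and the 3 bytes per pixel implied by 15 bytes being 5 pixels.&lt;/p&gt;

```go
package main

import "fmt"

const stride = 960 // bytes per framebuffer row, matching ADDQ(U32(960), ptr)

// blendArea is a scalar reference for the alpha routine: it blends colour
// over a width x rows rectangle whose top-left pixel begins at byte offset
// start. (Hypothetical signature, assuming 3 bytes per pixel.)
func blendArea(fb []byte, start, width, rows int, colour [3]byte, alpha byte) {
	inv := int(^alpha)
	for y := 0; y < rows; y++ {
		row := fb[start+y*stride:]
		for x := 0; x < width; x++ {
			for c := 0; c < 3; c++ {
				i := x*3 + c
				row[i] = byte((int(colour[c])*int(alpha) + int(row[i])*inv) >> 8)
			}
		}
	}
}

func main() {
	fb := make([]byte, stride*2) // two empty rows
	blendArea(fb, 0, 2, 2, [3]byte{255, 0, 0}, 230)
	fmt.Println(fb[:6], fb[stride:stride+6])
}
```

&lt;p&gt;A reference like this is handy for checking the output of the assembly version byte for byte.&lt;/p&gt;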
&lt;span class="header" id="is-it-actually-faster"&gt;&lt;a href="#is-it-actually-faster"&gt;#&lt;/a&gt;&lt;h1&gt;Is it actually faster?&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Now only one question remained:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;"Is this faster than before?"&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;After all, it would be a bit pointless if it wasn't.&lt;/p&gt;
&lt;p&gt;According to goputer's built-in profiler, the video clear operation was now &lt;strong&gt;~16% faster&lt;/strong&gt; in canned benchmarks&lt;sup id="fnref:benchmarks"&gt;&lt;a class="footnote-ref" href="#fn:benchmarks"&gt;[7]&lt;/a&gt;&lt;/sup&gt;. Not exactly a huge improvement, but still something.&lt;/p&gt;
&lt;p&gt;The video area operation with no alpha was actually on par with the scalar code from before. I tried optimising it further with cache prefetch instructions but these had no measurable effect.&lt;/p&gt;
&lt;p&gt;Initially, the alpha area operation was &lt;strong&gt;slower&lt;/strong&gt;, but after removing the extract/insert byte instructions and replacing them with ordinary move instructions between memory and registers, there was a &lt;strong&gt;~60%&lt;/strong&gt; improvement.&lt;/p&gt;
&lt;p&gt;The final versions of the code used in this article, after these improvements, are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/pkg/vm/asm/video_area.go" target="_blank"&gt;pkg/vm/asm/video_area.go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/pkg/vm/asm/video_area_alpha.go" target="_blank"&gt;pkg/vm/asm/video_area_alpha.go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/pkg/vm/asm/video_clear.go" target="_blank"&gt;pkg/vm/asm/video_clear.go&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;span class="header" id="conclusion"&gt;&lt;a href="#conclusion"&gt;#&lt;/a&gt;&lt;h1&gt;Conclusion&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;In conclusion, while some of the improvements fall short of what is theoretically possible with SIMD, they are still something. At the end of the day computers are very fast, and as such will often run un-optimised code almost as well as optimised code.&lt;/p&gt;
&lt;p&gt;As I write this, Go is planning to add experimental support for &lt;a href="https://antonz.org/go-1-26/#simd" target="_blank"&gt;vector operations&lt;/a&gt; as a builtin feature in the standard library in the next version, 1.26. So if I had done this a few months later it might have been a lot easier, and in the future it'll hopefully support more architectures (even including WASM), so I won't have to touch assembly at all. On the other hand, it has certainly been very interesting learning the ins and outs of x86 assembly and SIMD in particular.&lt;/p&gt;
&lt;p&gt;Finally, apart from the addition of a high level language, the rest of the improvements I have &lt;a href="https://github.com/sccreeper/goputer/issues" target="_blank"&gt;planned&lt;/a&gt; for goputer are fairly minor things, so there &lt;em&gt;probably&lt;/em&gt; won't be another post like this for a while. For now, look at the other posts on &lt;a href="/posts?tags=goputer" hx-get="/posts?tags=goputer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;goputer&lt;/a&gt; or the repository &lt;a href="https://github.com/sccreeper/goputer" target="_blank"&gt;itself&lt;/a&gt;.&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn:bytesimd"&gt;
&lt;p&gt;SIMD instructions work better when using multiples of 4 elements, see this &lt;a href="https://learn.arm.com/learning-paths/cross-platform/vectorization-friendly-data-layout/data-layout-basics-2/" target="_blank"&gt;ARM documentation&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:bytesimd" title="Jump back to footnote 1 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:pythonvector"&gt;
&lt;p&gt;In reality the reference Python interpreter, CPython, does not vectorise code. Vectorisation in Python can be achieved by using libraries such as &lt;a href="https://numpy.org/" target="_blank"&gt;numpy&lt;/a&gt;.&amp;#160;&lt;a class="footnote-backref" href="#fnref:pythonvector" title="Jump back to footnote 2 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:addintrinsic"&gt;
&lt;p&gt;See &lt;a href="https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq#intel-c-c++-compiler-intrinsic-equivalents" target="_blank"&gt;paddd&lt;/a&gt; on the x86 reference.&amp;#160;&lt;a class="footnote-backref" href="#fnref:addintrinsic" title="Jump back to footnote 3 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:byteworddouble"&gt;
&lt;p&gt;On x86 the terms byte (8 bits), word (16 bits), double-word (32 bits), and quad-word (64 bits) are sometimes used to denote integer sizes.&amp;#160;&lt;a class="footnote-backref" href="#fnref:byteworddouble" title="Jump back to footnote 4 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:avocredit"&gt;
&lt;p&gt;This example is from the &lt;code&gt;avo&lt;/code&gt; library.&amp;#160;&lt;a class="footnote-backref" href="#fnref:avocredit" title="Jump back to footnote 5 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:unrolling"&gt;
&lt;p&gt;"Unrolling" refers to the practice of including multiple iterations of a loop in it's body, in order to reduce the total number of branches which occur.&amp;#160;&lt;a class="footnote-backref" href="#fnref:unrolling" title="Jump back to footnote 6 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:benchmarks"&gt;
&lt;p&gt;Benchmarks were conducted using these examples and this hardware.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Video clearing:&lt;/strong&gt; &lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/examples/video_clear_bench.gpasm" target="_blank"&gt;examples/video_clear_bench.gpasm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Video area, no alpha:&lt;/strong&gt; &lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/examples/area_test.gpasm" target="_blank"&gt;examples/area_test.gpasm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Video area, with alpha:&lt;/strong&gt; &lt;a href="https://github.com/sccreeper/goputer/blob/850f3c482ba84b935bb27c9111bfd322edc5af16/examples/alpha_area_test.gpasm" target="_blank"&gt;examples/alpha_area_test.gpasm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;System specification:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU:&lt;/strong&gt; Ryzen 7 7730U (8C, 16T)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory:&lt;/strong&gt; 16GB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WSL OS:&lt;/strong&gt; Fedora 42, 6.6.87.2-microsoft-standard-WSL2&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host OS:&lt;/strong&gt; Windows 11, 10.0.26200&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a class="footnote-backref" href="#fnref:benchmarks" title="Jump back to footnote 7 in the text"&gt;^&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>Making pong with my fantasy VM</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/making-pong-in-goputer</id><updated>2025-10-28T00:00:00</updated><summary>Finally, a real project</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;&lt;em&gt;Note: This is a continuation from my &lt;a href="/blog/goputer-profiler" hx-get="/blog/goputer-profiler" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;previous post&lt;/a&gt;. This post can also serve as a more in-depth explanation of one of the goputer examples as well as a tutorial of sorts.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For as long as goputer has existed, I have never made what I would call a complete program or project with it. Any code that has been written for it has been short examples, intended to test functionality rather than showcase what is possible. With this in mind I decided to embark on creating &lt;em&gt;Pong&lt;/em&gt;. While I could have possibly gone for something a bit more advanced, such as &lt;em&gt;Snake&lt;/em&gt; or &lt;em&gt;Tetris&lt;/em&gt;, &lt;em&gt;Pong&lt;/em&gt; was the simplest thing I could make whilst still testing the full feature set of goputer.&lt;/p&gt;
&lt;p&gt;This post is split into four parts with a conclusion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Drawing&lt;/li&gt;
&lt;li&gt;Moving&lt;/li&gt;
&lt;li&gt;Scoring&lt;/li&gt;
&lt;li&gt;Debugging&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even though this post describes the function of each piece of code, I would still recommend skimming over the &lt;a href="https://github.com/sccreeper/goputer/wiki/Syntax" target="_blank"&gt;syntax&lt;/a&gt; page in the documentation beforehand.&lt;/p&gt;
&lt;span class="header" id="drawing"&gt;&lt;a href="#drawing"&gt;#&lt;/a&gt;&lt;h1&gt;Drawing&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;This was the easiest part. I often describe goputer as "an assembly language that controls Microsoft Paint", and it's not far from the truth. A huge part of the feature-set of goputer is drawing stuff to the screen. After all, what's the point of an educational tool if you can't draw things?&lt;/p&gt;
&lt;p&gt;A pong screen is very simple to draw: it contains two paddles, a ball, and a dividing line. I'll get onto the score later.&lt;/p&gt;
&lt;p&gt;Firstly, the initial positions need to be defined so they can be changed later if necessary. It's also very poor practice not to use constants where possible.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#def paddle_offset 5
#def paddle_width 5
#def paddle_height 25
#def paddle_speed 10

#def left_paddle_pos 120
#def right_paddle_pos 120

#def ball_x 160
#def ball_y 120
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first set of definitions defines constants that are referenced in later code. The second set defines the positions of the paddles and the ball respectively. Everything is a multiple of five because it makes bounds checking ever so slightly easier, i.e. I shouldn't have to worry about &lt;a href="https://en.wikipedia.org/wiki/Arithmetic_underflow" target="_blank"&gt;integer underflow&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;These needed to be used in conjunction with a routine that draws the shapes in question onto the screen:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

lda @left_paddle_pos
mov d0 vy0
add vy0 $d:paddle_height
mov a0 vy1

mov $320-d:paddle_offset-d:paddle_width vx0
mov $320-d:paddle_offset vx1

int va

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A quick note on the immediate values prefixed with &lt;code&gt;$&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expressions are evaluated from left-to-right, and don't follow normal operator precedence.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d:xyz&lt;/code&gt; refers to the value of a definition at compile time.&lt;/li&gt;
&lt;/ul&gt;
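&lt;p&gt;To make the left-to-right rule concrete, here is a minimal Go sketch of such an evaluator. It is not goputer's actual compiler code: &lt;code&gt;evalLeftToRight&lt;/code&gt; is a hypothetical helper, and it assumes any &lt;code&gt;d:xyz&lt;/code&gt; references have already been replaced with their numeric values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;package main

import (
    "fmt"
    "strconv"
    "strings"
)

// evalLeftToRight evaluates an integer expression such as "320-5-5" strictly
// from left to right, ignoring the usual operator precedence.
// Hypothetical helper for illustration only, not goputer's compiler code.
func evalLeftToRight(expr string) (int, error) {
    result := 0
    op := '+'    // operator waiting to be applied to the next number
    digits := "" // digits of the number currently being read

    // apply folds the number read so far into the running result.
    apply := func() error {
        n, err := strconv.Atoi(digits)
        if err != nil {
            return fmt.Errorf("missing number in %q", expr)
        }
        switch op {
        case '+':
            result += n
        case '-':
            result -= n
        case '*':
            result *= n
        case '/':
            if n == 0 {
                return fmt.Errorf("division by zero in %q", expr)
            }
            result /= n
        }
        digits = ""
        return nil
    }

    for _, r := range expr {
        switch {
        case strings.ContainsRune("0123456789", r):
            digits += string(r)
        case strings.ContainsRune("+-*/", r):
            if err := apply(); err != nil {
                return 0, err
            }
            op = r
        default:
            return 0, fmt.Errorf("unexpected character %q", r)
        }
    }
    if err := apply(); err != nil {
        return 0, err
    }
    return result, nil
}

func main() {
    for _, expr := range []string{"320-5-5", "2+3*4"} {
        v, err := evalLeftToRight(expr)
        if err != nil {
            fmt.Println("error:", err)
            continue
        }
        fmt.Println(expr, "=", v)
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;paddle_offset&lt;/code&gt; and &lt;code&gt;paddle_width&lt;/code&gt; both set to 5, the immediate from the drawing routine becomes &lt;code&gt;320-5-5&lt;/code&gt;, which evaluates to 310. The second expression shows where the rule matters: &lt;code&gt;2+3*4&lt;/code&gt; gives 20 here, not the 14 that normal precedence would give.&lt;/p&gt;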
&lt;p&gt;Drawing the ball and the centre line followed a &lt;a href="https://github.com/sccreeper/goputer/blob/d604504813b27805801260a1b21bc31588053b03/examples/pong.gpasm#L142-L157" target="_blank"&gt;similar routine&lt;/a&gt;.&lt;/p&gt;
&lt;span class="header" id="moving"&gt;&lt;a href="#moving"&gt;#&lt;/a&gt;&lt;h1&gt;Moving&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;A static image is pretty useless if you're trying to make a game, therefore the next thing I focused on was trying to make things move. &lt;/p&gt;
&lt;span class="header" id="moving-the-paddles"&gt;&lt;a href="#moving-the-paddles"&gt;##&lt;/a&gt;&lt;h2&gt;Moving the paddles&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;Firstly, I focused on moving the paddles, as I figured this would be the easiest thing to do because the code should be the same for both of them. The steps that were required were as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Interrupt to register the keypress&lt;/li&gt;
&lt;li&gt;Then decide which key it is, and thus which paddle to move&lt;/li&gt;
&lt;li&gt;Move the respective paddle&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Registering an interrupt listener is simple; however, you can only register one per interrupt.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#label keyup

// Do something

iret

#intsub kd keyup
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This means the &lt;code&gt;keyup&lt;/code&gt; label will be called whenever someone presses a key. After an interrupt is called the next step is to preserve any registers we are going to write to.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;push r01
push dp
push dl
push a0
sta @d0_store
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The last line works slightly differently: it stores the entire value in &lt;code&gt;d0&lt;/code&gt; on the static stack (populated and allocated at compile time) as opposed to the dynamic stack (populated and allocated at runtime). The &lt;code&gt;sta&lt;/code&gt; instruction differentiates between addressing the static stack with immediate values and main memory with registers by checking if the argument is &lt;a href="https://github.com/sccreeper/goputer/blob/d604504813b27805801260a1b21bc31588053b03/pkg/vm/load_store.go#L40" target="_blank"&gt;greater than the number of registers&lt;/a&gt;. If so, the first 4 bytes at that address give the number of bytes to store.&lt;/p&gt;
&lt;p&gt;After this there are a series of cascading equality checks to determine which key was pressed:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

neq kc $23
cndjmp @m_lp_down
lda @left_paddle_pos
mov d0 r01
eq r01 $0
cndjmp @move_end
sub r01 $d:paddle_speed
mov a0 d0
sta @left_paddle_pos
jmp @move_end

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above code does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check whether the &lt;code&gt;kc&lt;/code&gt; (current key) register holds any value other than 23. If so, jump to the next equality check; if not, keep going.&lt;/li&gt;
&lt;li&gt;Load the paddle position and see if it is equal to zero. If it is, jump to the cleanup label because we can't move any further.&lt;/li&gt;
&lt;li&gt;Then subtract the paddle speed, because in this case we are moving up the screen.&lt;/li&gt;
&lt;li&gt;Store the paddle position for later use and jump to the cleanup routine.&lt;/li&gt;
&lt;/ol&gt;
&lt;span class="header" id="moving-the-ball"&gt;&lt;a href="#moving-the-ball"&gt;##&lt;/a&gt;&lt;h2&gt;Moving the ball&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;The ball is slightly different because it moves in two dimensions rather than one, and needs to move every frame/update as opposed to whenever a user presses a specific key. I'm not showing the code that loads values into registers, as you've already seen an example of that, but to help with understanding, the values in each register are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r00&lt;/code&gt; is the ball's X position.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r01&lt;/code&gt; is the Y position.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r02&lt;/code&gt; is the X direction.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r03&lt;/code&gt; is the Y direction.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

// Bounds right
neq r00 $320-d:ball_size
cndjmp @bbc_left
mov $0 r02
push $0
call @increase_score
jmp @check_paddles
// Bounds left
#label bbc_left
neq r00 $0
cndjmp @check_paddles
mov $1 r02
push $1
call @increase_score

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You may notice the call to &lt;code&gt;increase_score&lt;/code&gt;, but I'll get back to that &lt;a href="#scoring"&gt;later&lt;/a&gt;. The left and right bounds checks follow the same pattern. However, checking the top and bottom is different because the ball has to bounce rather than reset to the centre of the screen.&lt;/p&gt;
&lt;p&gt;"Bouncing" the ball for both the paddles and the top and bottom of the screen is simple. If the current Y direction value is 1, say when hitting the bottom of the screen, we make it zero and then the ball will go up when the move code is called. The same bouncing logic applies when hitting a paddle, except this time the X direction is changed.&lt;/p&gt;
&lt;p&gt;See the code for checking the right paddle:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

eq r00 $320-d:ball_size-d:paddle_width-d:paddle_offset
mov a0 r15
gt r01 r05
mov a0 r14
add r05 $d:paddle_height
lt r01 a0
mov a0 r13
and r15 r14
and a0 r13
inv a0
cndjmp @check_left_paddle
mov $0 r02
jmp @bbc_bottom

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The process of the routine is as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Check whether the ball's X position is in line with the left-hand side of the paddle.&lt;/li&gt;
&lt;li&gt;Then see if the ball's Y position is in between the top and bottom of the paddle.&lt;/li&gt;
&lt;li&gt;If both of these conditions are true (successive &lt;code&gt;and&lt;/code&gt; operations), then we invert the ball's X direction and jump to the bottom bounds check.&lt;/li&gt;
&lt;/ol&gt;
&lt;span class="header" id="scoring"&gt;&lt;a href="#scoring"&gt;#&lt;/a&gt;&lt;h1&gt;Scoring&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Now that all the movement logic was in place, I needed a way to score the game. As mentioned earlier, the &lt;a href="https://github.com/sccreeper/goputer/blob/d604504813b27805801260a1b21bc31588053b03/examples/pong.gpasm#L355" target="_blank"&gt;&lt;code&gt;increase_score&lt;/code&gt; routine&lt;/a&gt; is called whenever the ball hits the left or right side. Before it is called, one value is pushed to the stack - which player's score to increase.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

lda @player_one_score
mov d0 r14
add r14 $1
mov a0 d0
sta @player_one_score
jmp @reset_ball

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Above: Increasing the score for the left player/player one.&lt;/p&gt;
&lt;p&gt;Even though the score was increasing, you still couldn't see what its value was, at least not without inspecting registers/memory locations. I still needed to write the assembly to display it.&lt;/p&gt;
&lt;span class="header" id="integers-to-strings"&gt;&lt;a href="#integers-to-strings"&gt;##&lt;/a&gt;&lt;h2&gt;Integers to strings&lt;/h2&gt;&lt;/span&gt;&lt;p&gt;In most high-level languages you'd simply make use of a standard library function to convert an integer to a string: in Go you'd use &lt;code&gt;strconv.Itoa(x)&lt;/code&gt;, and in Python the even simpler &lt;code&gt;str(x)&lt;/code&gt;. Unfortunately, working in an assembly language means you don't have access to such luxuries, so you have to make these things yourself.&lt;/p&gt;
&lt;p&gt;The loop that is used to convert an integer into a string is below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;...

eq r01 $0
cndjmp @draw_zero

#label convert_number
mod r01 $10
add a0 $48
mov $1 dl
mov a0 vt
sr vt $1
div r01 $10
mov a0 r01

neq r01 $0
cndjmp @convert_number

sl vt $1
int vt

...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;First we check if the number is already zero, as this is a special case. After that we keep taking the remainder of the number divided by ten, then writing the corresponding character (the digit plus 48, the ASCII code for &lt;code&gt;0&lt;/code&gt;) to &lt;code&gt;vt&lt;/code&gt; (the text buffer) - we also shift the text buffer right at this stage. We store the result of the division of the number for use in the next iteration of the loop. This repeats until the result of the division is zero.&lt;/p&gt;
&lt;p&gt;We also have to shift the text buffer left by one byte to clean up after the last iteration.&lt;/p&gt;
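&lt;p&gt;For comparison, the same digit-by-digit idea can be sketched in Go. This is only an illustration, not a translation of the routine above: the hypothetical &lt;code&gt;itoa&lt;/code&gt; function fills a fixed buffer from the right instead of shifting a text buffer, but the remainder, add-48 and divide steps are the same.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;package main

import "fmt"

// itoa converts n to its decimal string by repeatedly taking n % 10 for the
// lowest digit and dividing n by 10, mirroring the mod/div loop above.
// Hypothetical helper written for this example.
func itoa(n uint32) string {
    if n == 0 {
        return "0" // zero is the special case: the loop below would emit nothing
    }

    var buf [10]byte // the largest uint32 has 10 decimal digits
    pos := len(buf)
    for n != 0 {
        pos--
        buf[pos] = byte('0' + n%10) // '0' is 48, the same offset as "add a0 $48"
        n /= 10
    }
    return string(buf[pos:])
}

func main() {
    fmt.Println(itoa(0), itoa(7), itoa(1234))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Because the lowest digit is produced first, the digits have to be placed from right to left, which is the job the text buffer shift does in the assembly version.&lt;/p&gt;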
&lt;span class="header" id="debugging"&gt;&lt;a href="#debugging"&gt;#&lt;/a&gt;&lt;h1&gt;Debugging&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;In all honesty there weren't actually that many bugs with the assembly I wrote, at least not any major ones that took ages to figure out. That being said, there were some problems with the runtime that I didn't know about or should have known about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The data stack didn't work properly, which meant the ball always teleported to the top of the screen when a key was pressed.&lt;/li&gt;
&lt;li&gt;The possibility of a race condition if a key was pressed in the middle of a ball update.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The stack not working properly was easy enough to fix. The race condition, however, required the addition of two new instructions and the exposure of the previously internal interrupt flag through a new control register. The two new instructions were inspired by x86's &lt;code&gt;sti&lt;/code&gt; and &lt;code&gt;cli&lt;/code&gt; (set and clear interrupt flag) instructions, albeit with different names.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pri

mov r00 d0
sta @ball_x

mov r01 d0
sta @ball_y

mov r02 d0
sta @ball_direction_x

mov r03 d0
sta @ball_direction_y

eni
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this case, &lt;code&gt;pri&lt;/code&gt; prevents interrupts, and &lt;code&gt;eni&lt;/code&gt; re-enables them. The instructions in-between write the updated ball location data to memory.&lt;/p&gt;
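&lt;p&gt;The general technique, masking interrupts around a multi-step update, can be sketched in Go as follows. This is a toy model rather than goputer's runtime: the &lt;code&gt;vm&lt;/code&gt; type and its methods are hypothetical, and it assumes that interrupts raised while disabled are queued and delivered once they are re-enabled, which this post does not confirm.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;package main

import "fmt"

// vm is a hypothetical, heavily simplified machine: an interrupt flag, a
// queue of deferred interrupts and the ball position held in "memory".
type vm struct {
    interruptsEnabled bool
    pending           []string
    ballX, ballY      int
    log               []string
}

// pri prevents interrupt handlers from running.
func (m *vm) pri() { m.interruptsEnabled = false }

// eni re-enables interrupts and delivers anything raised in the meantime.
// Queueing is an assumption of this sketch.
func (m *vm) eni() {
    m.interruptsEnabled = true
    for _, name := range m.pending {
        m.handle(name)
    }
    m.pending = nil
}

// raise runs the handler immediately, or defers it while interrupts are off.
func (m *vm) raise(name string) {
    if !m.interruptsEnabled {
        m.pending = append(m.pending, name)
        return
    }
    m.handle(name)
}

// handle records the ball position the handler observed.
func (m *vm) handle(name string) {
    m.log = append(m.log, fmt.Sprintf("%s saw ball at (%d, %d)", name, m.ballX, m.ballY))
}

func main() {
    m := vm{interruptsEnabled: true, ballX: 160, ballY: 120}

    m.pri()
    m.ballX = 165
    m.raise("kd") // a key press arrives halfway through the update
    m.ballY = 125
    m.eni()

    for _, line := range m.log {
        fmt.Println(line)
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The handler only ever observes the fully updated position. Without the &lt;code&gt;pri&lt;/code&gt; and &lt;code&gt;eni&lt;/code&gt; pair it would run between the two writes and see the new X value paired with the old Y value.&lt;/p&gt;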
&lt;span class="header" id="conclusion"&gt;&lt;a href="#conclusion"&gt;#&lt;/a&gt;&lt;h1&gt;Conclusion&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/pong.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;The pong program running in the web frontend&lt;/figcaption&gt;&lt;p&gt;The next steps are probably to write a better &lt;em&gt;Pong&lt;/em&gt; clone rather than the quite simple one I've got now, and also finding a way to speed up the VM - which will almost certainly require writing some of the rendering instructions in assembly. The other major improvement I have made since the last post has been to rewrite the expansion system around Lua instead of Go's &lt;code&gt;plugin&lt;/code&gt; library, meaning that it is now cross platform.&lt;/p&gt;
&lt;p&gt;Other than making the VM faster, a proper IDE or even a VSCode extension would be nice.&lt;/p&gt;
&lt;p&gt;If you want to see any other posts linked to Goputer, click &lt;a href="/posts?tags=goputer" hx-get="/posts?tags=goputer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;here&lt;/a&gt; or the repository is &lt;a href="https://github.com/sccreeper/goputer" target="_blank"&gt;here&lt;/a&gt;. The full source code for the &lt;em&gt;Pong&lt;/em&gt; example talked about in this blog post is on GitHub &lt;a href="https://github.com/sccreeper/goputer/blob/d604504813b27805801260a1b21bc31588053b03/examples/pong.gpasm" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>Creating a profiler for goputer</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/goputer-profiler</id><updated>2025-08-09T00:00:00</updated><summary>Sometimes you have to do some measuring</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;&lt;em&gt;Note: This is a continuation from my &lt;a href="/blog/writing-software-renderer" hx-get="/blog/writing-software-renderer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;previous post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For a while now, goputer has had no way of measuring performance. The closest you could get was adding a frame rate counter to a frontend and hoping it would be somewhat accurate - spoiler, it wasn't. Turns out it takes a lot longer to render a frame than to add two numbers together, who would've guessed. This necessitated the creation of a better solution: a profiler.&lt;/p&gt;
&lt;p&gt;I wasn't aiming for anything complex, such as Google's &lt;a href="https://github.com/google/pprof" target="_blank"&gt;pprof&lt;/a&gt;; instead I just intended to measure how long each instruction took to run, then do a tiny bit of statistics to make the collected values slightly more useful. If anything, it would also serve to highlight any outstanding issues with the core runtime in addition to any assembly I wrote.&lt;/p&gt;
&lt;span class="header" id="design"&gt;&lt;a href="#design"&gt;#&lt;/a&gt;&lt;h1&gt;Design&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;My only experience using anything that could be considered a profiler thus far had been using browser dev-tools, which are much more complicated than anything I intended to make and are made for a very different domain. As such, I settled upon a very simple way of measuring performance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Measure the time for each cycle&lt;/li&gt;
&lt;li&gt;Add that data to an entry in a hashmap&lt;/li&gt;
&lt;li&gt;Do some additional stuff to calculate the mean etc. at the end&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As for the design, I decided on a &lt;code&gt;Profiler&lt;/code&gt; type which would have a &lt;code&gt;Data&lt;/code&gt; field of type &lt;code&gt;map[uint64]*ProfileEntry&lt;/code&gt; (the pointer was so I could change values in place). &lt;code&gt;ProfileEntry&lt;/code&gt; would contain various data about that &lt;strong&gt;specific&lt;/strong&gt; instruction - total cycle time, times executed.&lt;/p&gt;
&lt;p&gt;The key for the map was created by packing the instruction (5 bytes) and the current program counter value, sort of akin to a &lt;a href="https://en.wikipedia.org/wiki/Composite_key" target="_blank"&gt;composite key&lt;/a&gt;. While this meant that only 3 of the program counter's 4 bytes were available (largest value 16,777,215, so 16MB), the default addressable memory size (not including the video buffer) for goputer is 65KB, so I don't see this as being a problem unless the program counter is overwritten. If I wanted to in the future, I could change the key to two uint64s or just 9 bytes.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Address&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;40 bits&lt;/td&gt;
&lt;td&gt;24 bits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
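&lt;p&gt;The key layout in the table above can be sketched roughly as follows. This is not the actual goputer source: &lt;code&gt;packKey&lt;/code&gt;, &lt;code&gt;unpackKey&lt;/code&gt;, &lt;code&gt;Record&lt;/code&gt; and the &lt;code&gt;ProfileEntry&lt;/code&gt; field names are hypothetical, and placing the instruction in the upper 40 bits is an assumption. The packing is written with multiplication and remainder by 2^24, which for unsigned integers is equivalent to shifting by 24 bits and masking.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;package main

import "fmt"

// addrSpan is 2^24, the number of distinct values a 24-bit address can take.
const addrSpan = 16777216

// packKey puts a 40-bit instruction in the upper bits of the key and the low
// 24 bits of the program counter in the lower bits. The ordering is an
// assumption; the instruction must fit in 40 bits or the key will wrap.
func packKey(instruction, pc uint64) uint64 {
    return instruction*addrSpan + pc%addrSpan
}

// unpackKey reverses packKey.
func unpackKey(key uint64) (instruction, pc uint64) {
    return key / addrSpan, key % addrSpan
}

// ProfileEntry holds the totals for one instruction at one address.
// Field names are hypothetical.
type ProfileEntry struct {
    TotalCycleTime uint64
    TimesExecuted  uint64
}

// Profiler maps packed keys to entries; the pointer values let Record
// update an entry in place.
type Profiler struct {
    Data map[uint64]*ProfileEntry
}

// Record adds one measured cycle to the entry for this instruction and address.
func (p *Profiler) Record(instruction, pc, cycleTime uint64) {
    key := packKey(instruction, pc)
    e, ok := p.Data[key]
    if !ok {
        e = new(ProfileEntry)
        p.Data[key] = e
    }
    e.TotalCycleTime += cycleTime
    e.TimesExecuted++
}

func main() {
    p := Profiler{Data: map[uint64]*ProfileEntry{}}
    p.Record(0x0102030405, 4096, 120)
    p.Record(0x0102030405, 4096, 80)

    for key, e := range p.Data {
        instruction, pc := unpackKey(key)
        mean := e.TotalCycleTime / e.TimesExecuted
        fmt.Printf("instruction %#x at %d: %d runs, mean %d\n",
            instruction, pc, e.TimesExecuted, mean)
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Both calls to &lt;code&gt;Record&lt;/code&gt; land on the same map entry because the instruction and address are identical, so the mean can be derived at the end from the two totals.&lt;/p&gt;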
&lt;p&gt;Now that I had collected the data, I needed some way to store it. While I could have simply dumped the entire thing to a JSON file and been done with it, this had the major drawbacks of being slower and producing a larger file. I also only needed to store numbers.&lt;/p&gt;
&lt;p&gt;All byte orders are little endian.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Magic number&lt;/th&gt;
&lt;th&gt;Number of entries&lt;/th&gt;
&lt;th&gt;Total cycles&lt;/th&gt;
&lt;th&gt;Entries&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GPPR&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8 bytes (uint64)&lt;/td&gt;
&lt;td&gt;8 bytes (uint64)&lt;/td&gt;
&lt;td&gt;Variable (33 bytes per entry)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each entry was further broken down as follows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Address&lt;/th&gt;
&lt;th&gt;Total cycle time&lt;/th&gt;
&lt;th&gt;Total times executed&lt;/th&gt;
&lt;th&gt;Standard deviation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 bytes&lt;/td&gt;
&lt;td&gt;4 bytes (&lt;code&gt;uint32&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;8 bytes (&lt;code&gt;uint64&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;8 bytes (&lt;code&gt;uint64&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;8 bytes (&lt;code&gt;float64&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;em&gt;Note: The standard deviation is technically stored as a uint64 in little-endian order, see &lt;a href="https://pkg.go.dev/math#Float64bits" target="_blank"&gt;&lt;code&gt;math.Float64bits&lt;/code&gt;&lt;/a&gt; which does some &lt;code&gt;unsafe.Pointer&lt;/code&gt; stuff, specifically &lt;code&gt;*(*uint64)(unsafe.Pointer(&amp;amp;f))&lt;/code&gt;, where &lt;code&gt;f&lt;/code&gt; is the &lt;code&gt;float64&lt;/code&gt; in question.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To make things a bit nicer the &lt;code&gt;Load&lt;/code&gt; and &lt;code&gt;Dump&lt;/code&gt; methods used Go's &lt;code&gt;io.ReadSeeker&lt;/code&gt; and &lt;code&gt;io.WriteSeeker&lt;/code&gt; interfaces respectively.&lt;/p&gt;
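&lt;p&gt;As a rough sketch of the 33-byte entry layout described above, the following Go code encodes and decodes a single entry with &lt;code&gt;encoding/binary&lt;/code&gt;. The &lt;code&gt;entry&lt;/code&gt; type, its field names and the &lt;code&gt;encode&lt;/code&gt; and &lt;code&gt;decode&lt;/code&gt; helpers are hypothetical and are not taken from the goputer source; the two 8-byte counters are encoded as &lt;code&gt;uint64&lt;/code&gt; values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;package main

import (
    "encoding/binary"
    "fmt"
    "math"
)

// entrySize is 5 (instruction) + 4 (address) + 8 + 8 + 8 bytes.
const entrySize = 33

// entry mirrors the columns of the entry table. Names are hypothetical.
type entry struct {
    Instruction    [5]byte
    Address        uint32
    TotalCycleTime uint64
    TimesExecuted  uint64
    StdDev         float64
}

// encode lays the fields out in table order, little endian throughout.
func (e entry) encode() []byte {
    buf := make([]byte, entrySize)
    copy(buf[0:5], e.Instruction[:])
    binary.LittleEndian.PutUint32(buf[5:9], e.Address)
    binary.LittleEndian.PutUint64(buf[9:17], e.TotalCycleTime)
    binary.LittleEndian.PutUint64(buf[17:25], e.TimesExecuted)
    // The float64 is stored as its raw IEEE 754 bit pattern.
    binary.LittleEndian.PutUint64(buf[25:33], math.Float64bits(e.StdDev))
    return buf
}

// decode reverses encode. buf must hold at least entrySize bytes.
func decode(buf []byte) entry {
    var e entry
    copy(e.Instruction[:], buf[0:5])
    e.Address = binary.LittleEndian.Uint32(buf[5:9])
    e.TotalCycleTime = binary.LittleEndian.Uint64(buf[9:17])
    e.TimesExecuted = binary.LittleEndian.Uint64(buf[17:25])
    e.StdDev = math.Float64frombits(binary.LittleEndian.Uint64(buf[25:33]))
    return e
}

func main() {
    e := entry{
        Instruction:    [5]byte{1, 2, 3, 4, 5},
        Address:        4096,
        TotalCycleTime: 200,
        TimesExecuted:  2,
        StdDev:         1.5,
    }
    buf := e.encode()
    fmt.Println(len(buf), decode(buf) == e)
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A fixed-size record like this means the whole entries section is simply the number of entries multiplied by 33 bytes, so a reader can work out where any entry starts without parsing the ones before it.&lt;/p&gt;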
&lt;span class="header" id="making-it-useful"&gt;&lt;a href="#making-it-useful"&gt;#&lt;/a&gt;&lt;h1&gt;Making it useful&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Collecting all this information was okay, but it was pretty useless on its own unless you like reading binary file formats or printed structs. I needed some way to display it. For this I chose &lt;a href="https://github.com/rivo/tview" target="_blank"&gt;&lt;code&gt;tview&lt;/code&gt;&lt;/a&gt;, a TUI framework: while there were already some small GUI apps I had developed for goputer, most notably the launcher, they really should have been console apps instead.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/profiler/profiler.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;The profiler UI, with instructions sorted by times executed&lt;/figcaption&gt;&lt;p&gt;In the above screenshot, the source for the executable being analysed is available &lt;a href="https://github.com/sccreeper/goputer/blob/73942d06998e8cf1fea7e3e2a41acc92bf566660/examples/area_test.gpasm" target="_blank"&gt;here&lt;/a&gt;. As you may have noticed, there is also an option for grouping; this simply groups instructions which have the same opcode together and aggregates their data (apart from standard deviation and addresses).&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/goputer/profiler/profiler%20grouped.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Profiler with grouping enabled&lt;/figcaption&gt;&lt;p&gt;You can also jump to instances of specific instructions using the "instruction" field. The first screenshot is much closer to the original representation of the data.&lt;/p&gt;
&lt;p&gt;As a side-note, the numbers in that screenshot are much higher than they should be; in canned benchmarks they were much lower. My current theory is that for some reason running a windowed application in WSL is slower than a non-windowed one. I'll only find out for sure once I make goputer run natively (on Windows without WSL) though.&lt;/p&gt;
&lt;p&gt;Another useful addition might be some other form of visualisation (e.g. bar charts); however, I can't really see the utility of it over the existing data view.&lt;/p&gt;
&lt;span class="header" id="what-next"&gt;&lt;a href="#what-next"&gt;#&lt;/a&gt;&lt;h1&gt;What next?&lt;/h1&gt;&lt;/span&gt;&lt;p&gt;Now armed with a profiler I can hopefully go about creating more complex programs and maybe even a high-level language, using the profiler to make performance analysis ever so slightly easier.&lt;/p&gt;
&lt;p&gt;This was one of the last major improvements I had planned for goputer. The ones that are most likely to be developed next are moving frontends to standalone executables instead of using Go's &lt;code&gt;plugin&lt;/code&gt; library (which only works on Linux, FreeBSD, and macOS, so no Windows support at all), and moving extensions to embedded Lua for the same reason. Software rendering also doesn't show in the Python frontend, which hasn't been updated since software rendering was implemented, so that'll have to be fixed as well.&lt;/p&gt;
&lt;p&gt;Also, at the moment the profiler is just tacked onto the gp32 runtime without any way of turning it off, but that'll come as part of the previously mentioned move of frontends to standalone executables.&lt;/p&gt;
&lt;p&gt;If you want to see any other posts linked to Goputer, click &lt;a href="/posts?tags=goputer" hx-get="/posts?tags=goputer" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;here&lt;/a&gt;; the repository itself is &lt;a href="https://github.com/sccreeper/goputer" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>URL Shortener with Flask and HTMX</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/url-shortener</id><updated>2025-07-07T00:00:00</updated><summary>A tiny app that can be recreated in a weekend</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;It's been a while since I've made anything new with HTMX. The last thing was the &lt;a href="/blog/more-blog-updates" hx-get="/blog/more-blog-updates" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;comment section&lt;/a&gt; for this blog, which itself &lt;a href="/blog/making-a-blog-with-flask-and-htmx" hx-get="/blog/making-a-blog-with-flask-and-htmx" hx-push-url="true" hx-replace="innerHTML" hx-target="#content-block" hx-trigger="click"&gt;is made&lt;/a&gt; using Flask and HTMX. TL;DR: I like using Flask and HTMX. Something I've always wanted to have is some kind of link shortening service. I'm well aware that services like bit.ly and TinyURL exist; however, in my opinion it's much better to have something of your own, especially when it's incredibly simple to make.&lt;/p&gt;
&lt;p&gt;While I could've used a fullstack framework for this, e.g. Svelte, I already had some experience using Flask and HTMX together and I wanted to see how quickly I could create something usable. In the end it turned out that I was indeed able to create something rather quickly.&lt;/p&gt;
&lt;p&gt;Since I had already chosen Flask and HTMX, the only remaining decision was what database engine to use. This ended up being SQLite for the same reasons mentioned before: ease of use, familiarity etc. This project was essentially the clearest case for why you shouldn't reinvent the wheel. I have done that too many times already, using new stacks for projects where the scope just isn't worth it, and later coming to regret it as a result.&lt;/p&gt;
&lt;p&gt;The codebase was then hammered together using the same outline I had figured out when I added comments to this blog, with some minor improvements - having data migration from the start and correct use of indexes. In the end, the thing I ended up with was rather ugly because I didn't use any CSS, bar some for colouring &lt;span style="color: red;"&gt;error messages&lt;/span&gt;, choosing to stick to the principles of &lt;a href="https://justfuckingusehtml.com/" target="_blank"&gt;justfuckingusehtml.com&lt;/a&gt; instead.&lt;/p&gt;
&lt;p&gt;I also added a JSON API which used the same routes as the HTMX, just expecting and returning JSON instead of form data and HTML respectively. No need for another backend server when you have &lt;code&gt;if&lt;/code&gt; statements and schemas.&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;&lt;source srcset="/content/assets/signpost%20admin%20ui.avif" /&gt;&lt;img /&gt;&lt;/picture&gt;&lt;/p&gt;
&lt;figcaption&gt;Zoomed in screenshot of the admin UI&lt;/figcaption&gt;&lt;p&gt;Opaque controls whether to return a 301 redirect response or a small JavaScript snippet that does the redirection. This means that any link previews won't work when it's enabled. Useful for creating rick roll links.&lt;/p&gt;
&lt;p&gt;You can test it for yourself now by clicking on &lt;a href="https://ospe.lol/o7JU" target="_blank"&gt;this link&lt;/a&gt;, which redirects to this blog post.&lt;/p&gt;
&lt;p&gt;The source for the project itself can be found &lt;a href="https://github.com/sccreeper/signpost" target="_blank"&gt;here&lt;/a&gt; on GitHub.&lt;/p&gt;</div></content><author><name>Oscar Peace</name></author></entry><entry><title>DLAPS stack</title><link href="https://www.oscarcp.net/" /><id>https://www.oscarcp.net/blog/dlaps-stack</id><updated>2025-06-27T00:00:00</updated><summary>How I deploy most of my projects</summary><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">&lt;p&gt;Most of my "online" projects (including this blog) use what I call the DLAPS stack, which can be summarised as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;D&lt;/strong&gt;ocker&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;L&lt;/strong&gt;inux&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A&lt;/strong&gt;pache&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;P&lt;/strong&gt;rogram&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S&lt;/strong&gt;QLite&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Alternatively, if deploying straight onto bare metal, the SLAPS stack also works and rolls off the tongue much better. However, because most deployments don't do that, it'll have to be DLAPS. Essentially it is a variation of the classic &lt;a href="https://en.wikipedia.org/w/index.php?title=LAMP_(software_bundle)&amp;amp;oldid=1295099813" target="_blank"&gt;LAMP stack&lt;/a&gt;, but using SQLite instead of MySQL and with Docker as a containerisation layer.&lt;/p&gt;
&lt;p&gt;Why Docker? Because it runs anywhere and you can run anything in it. Not to sound too much like their marketing copy, but it's true. It's also great for avoiding testing in prod.&lt;/p&gt;
&lt;p&gt;Linux technically occurs in two places in this stack: on whatever bare metal machine Docker and Apache are running on, and inside whatever Docker container(s) are running. The most honest reason for using it is, quite frankly, why would you ever use anything else on a server (definitely not Windows, at least)? It also has the widest adoption, the most support etc.&lt;/p&gt;
&lt;p&gt;Any reverse proxy could fulfil the role of Apache (nginx, Caddy, etc.), but I have the most experience with Apache so it'll have to stay that way. Like I said earlier, this was based on the LAMP stack, so I have to be somewhat faithful to the original when coming up with a formal &lt;em&gt;de jure&lt;/em&gt; version of it.&lt;/p&gt;
&lt;p&gt;As for the program part, this could really be anything. It doesn't even have to communicate with the internet, in which case the Apache part would be redundant. The reason for choosing "program" over Python or Perl etc. is that I also write Go - &lt;a href="https://github.com/sccreeper/goputer" target="_blank"&gt;goputer&lt;/a&gt; and &lt;a href="https://github.com/sccreeper/chime" target="_blank"&gt;Chime&lt;/a&gt;. While only the latter is internet based, I do want to write some more Go based web stuff in the future (probably using &lt;a href="https://github.com/a-h/templ" target="_blank"&gt;templ&lt;/a&gt;), or even something in another language.&lt;/p&gt;
&lt;p&gt;Finally, SQLite. Probably the best database out there. If anything I was doing needed a slightly larger feature set I would probably switch to PostgreSQL, seeing as that's what I've been taught to use at university. SQLite is lightweight, easy to deploy, and could probably run on a little Arduino with some minor changes (a &lt;a href="https://github.com/siara-cc/esp32_arduino_sqlite3_lib" target="_blank"&gt;library&lt;/a&gt; exists for running it on an ESP32, which is a slightly beefier micro-controller). TL;DR: it runs literally everywhere and is good for the vast majority of use cases.&lt;/p&gt;
&lt;p&gt;To conclude, in a world of "serverless" this and that, it's nice to have something that's deployable on a real machine. Because let's face it, that Cloudflare free tier (which admittedly I use myself for my &lt;a href="https://emc.oscarcp.net/" target="_blank"&gt;EarthMC dashboard&lt;/a&gt;) isn't going to last once your app starts getting traction and you need actual performance. Not that that'll ever happen anyway, so you might as well stick to a tiny VPS and call it a day.&lt;/p&gt;</div></content><author><name>Oscar Peace</name></author></entry></feed>