Oscar Peace's blog

Optimising Go with SIMD assembly

2026-01-21T00:00:00

Note: This is a continuation from my previous post.

Ever since goputer's switch to software rendering - instead of relying on frontends to do rendering operations themselves in hardware - drawing anything to the screen has been a bit of a bottleneck. Turns out doing stuff in hardware is a lot faster than doing it in software, especially when, because of some very bad decisions earlier on, none of these operations work on 4 byte values (pixels are 3 bytes wide) and thus cannot be easily vectorised by a good compiler^[1].

The following post details my experience optimising my Go code with assembly. If you are already familiar with SIMD operations then feel free to skip the next section, however it may be helpful for anyone else who isn't. This article focuses on x86 vector instructions, however the principles of SIMD are the same for all CPU architectures.

Firstly SIMD stands for:

Single
Instruction
Multiple
Data

In essence this means that for a given instruction, which has multiple parameters (registers and/or memory locations), these parameters may themselves contain multiple pieces of data to be operated on in parallel.

Also, as opposed to scalar instructions which will operate on standard sized registers, vector instructions mostly operate on vector registers. On x86 vector registers can be 128, 256, or 512 bits wide.

Vectorisation refers to the process of rewriting a piece of code to make use of vector instructions.

A very simple example of vectorisation would be to say you had a loop which added two (very long) arrays together. Below is some Python^[2] pseudocode to demonstrate this:

a = [x for x in range(1024)]
b = [x for x in range(1024)]
result = []

for i in range(len(a)):
    result.append(a[i] + b[i])

Traditionally the above code would be executed in a scalar manner. I.e. one addition would have to be executed for every pair of integers, a and b. This means that 1024 additions would have to be performed. While this may execute fairly quickly for small-ish arrays of integers such as this, it will get increasingly slower as the arrays grow.

A "vectorised" version of the above code would be as follows (for this example, a and b can be thought of as an array of uint32's):

a = [x for x in range(1024)]
b = [x for x in range(1024)]
result = []

for i in range(0, len(a), 16):
    c = vpaddd(a[i:i+16], b[i:i+16])

    result.extend(c)

Note: The extend method is used to concatenate the two lists together.

Compared to before we are now adding 16 integers together at a time, instead of just 1. This means our code will now execute (in theory) 16x faster.
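Since the rest of this post is about Go, here is the same idea as a small Go sketch. It is only a scalar model: addLanes stands in for a single vpaddd on 512 bit registers, and the names addLanes and addSlices are made up for this illustration. It also assumes the input length is a multiple of 16.

```go
package main

import "fmt"

// lanes is the number of 32-bit values that fit in one 512-bit register.
const lanes = 16

// addLanes is a scalar stand-in for a single vpaddd on 512-bit registers:
// it adds sixteen pairs of uint32 values. On real hardware this whole
// function would be one instruction.
func addLanes(a, b [lanes]uint32) [lanes]uint32 {
	var out [lanes]uint32
	for i := range out {
		out[i] = a[i] + b[i]
	}
	return out
}

// addSlices adds a and b element-wise, sixteen elements per step.
// It assumes len(a) == len(b) and that the length is a multiple of lanes.
func addSlices(a, b []uint32) []uint32 {
	result := make([]uint32, 0, len(a))
	for i := 0; i < len(a); i += lanes {
		var x, y [lanes]uint32
		copy(x[:], a[i:i+lanes])
		copy(y[:], b[i:i+lanes])
		sum := addLanes(x, y)
		result = append(result, sum[:]...)
	}
	return result
}

func main() {
	a := make([]uint32, 1024)
	b := make([]uint32, 1024)
	for i := range a {
		a[i] = uint32(i)
		b[i] = uint32(i)
	}
	result := addSlices(a, b)
	fmt.Println(len(result), result[0], result[1023]) // 1024 0 2046
}
```

The outer loop runs 64 times instead of 1024, which is where the theoretical 16x comes from.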

Our vpaddd function can be thought of as an instruction intrinsic. An intrinsic is a "function" that tells the compiler that you want this operation to be carried out with this specific instruction. In our case, vpaddd refers to the "Add Packed Integers" instruction, specifically adding packed double-words - a packed instruction operates on values of its size (byte/word/double-word etc.) as opposed to treating the register like one big number. In real code which uses intrinsics this would be _mm512_add_epi32 for 512 bit registers (_mm_add_epi32 is the 128 bit equivalent)^[3].

In our case, each integer is 32 bits wide, and we are using 512 bit zmm registers when performing the add instruction.

Values in ZMM registers before adding, when i is zero

The offset value in the above diagram refers to the number of bits each value is offset from the start of the register. Typically a vector register would be displayed with its most significant bit first, but in this article, the least significant bit is displayed first.

We then perform the vpaddd instruction:

vpaddd zmm2, zmm1, zmm0

This takes each packed (packed, i.e. they are stored next to each other in the register, no gaps) double-word (32 bit integer)^[4] in zmm0 and zmm1, and adds them together, storing the result in zmm2. In Intel assembly syntax, operands follow the order of destination, source.

After we perform our vpaddd instruction, zmm2 now has our result:

In reality our .extend operation at the end would translate to a vmovdqa64 (move double quadword aligned) instruction, in order to move our data back to our result array.

vmovdqa64 [rax], zmm2

Note: The register rax contains the address of our current position in the result array. In practice any large enough register could contain this value.

Adding isn't the only operation you can perform on a vector register though. Pretty much any operation that can be applied to a scalar register can be applied to vector registers as well, whether it be other arithmetic operations such as multiplication, or logical operations, and shifting.

There also exists a set of instructions specifically for operating on vector registers, such as broadcasting, shuffling, extraction/insertion, or other permutations. How these work will be detailed in the rest of the post.

The first half of this section focuses on writing raw assembly for use with Go, and the second half focuses on making use of the Avo library. I would recommend reading both halves as it is important to understand the assembly that Avo generates, and it is also sometimes easier to debug your code by reading the generated assembly.

Note: From now on, Go's Plan9 inspired assembly syntax is used, the main difference being the order of operands is different (source, destination).

Go makes it quite easy to link with assembly. It is used a lot in the standard library, especially in the math package, to make use of hardware specific features for various operations.

For example math/exp_amd64.s contains optimised code for calculating a base $e$ exponential.

A much simpler example would be something that adds two numbers^[5], i.e:

#include "textflag.h"

// func Add(x uint64, y uint64) uint64
TEXT ·Add(SB), NOSPLIT, $0-24
    MOVQ x+0(FP), AX
    MOVQ y+8(FP), CX
    ADDQ AX, CX
    MOVQ CX, ret+16(FP)
    RET

An explainer on the syntax used above:

TEXT ·Add(SB), NOSPLIT, $0-24

Above is the function declaration. SB is a virtual register used to refer to the static base pointer, which can be thought of as the origin of memory.

Names and descriptions for the pseudo-registers from the Go documentation:

FP: Frame pointer: arguments and locals.

PC: Program counter: jumps and branches.

SB: Static base pointer: global symbols.

SP: Stack pointer: the highest address within the local stack frame.

NOSPLIT tells the compiler not to insert a special piece of code that checks if the stack needs to grow. I don't fully understand this myself, and for our purposes it isn't important, but for those interested there is a detailed explanation in this blogpost by Miguel Young de la Sota.

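$0-24 specifies the size of the stack frame for this function (0 bytes), followed by the size of its arguments and return values (24 bytes: two 8 byte parameters and one 8 byte result).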

The move instruction, MOVQ x+0(FP), AX, can be read as: move the parameter x, offset zero bytes from the frame pointer FP, to the register AX. The same applies to the second move instruction, except an offset of 8 bytes instead of 0.

The final move instruction can be read as: move the value in CX to the return parameter, offset 16 bytes from the frame pointer.

This can then be linked to the following method stub:

func Add(x uint64, y uint64) uint64

The method stub can then be used in any normal Go code, just as a normal function would.

While the above approach might be okay for writing small functions, as functions and their requirements get larger, and worrying about argument sizes, constant data, and register allocation becomes a problem, it might be useful to have some way to take this "boilerplate" work away.

Avo is a Go library that can be used to generate assembly by just writing normal Go code. The assembly example used previously was actually from Avo's examples directory. You still have to "write" the assembly yourself, but Avo will take care of all the boring stuff like argument sizes and Plan9 syntax for you.

Here is the Go code which could have been written instead to generate that assembly - in Avo, as in Go's assembly, operands follow a source, dest order:

package main

import . "github.com/mmcloughlin/avo/build"

func main() {
    TEXT("Add", NOSPLIT, "func(x, y uint64) uint64")

    x := Load(Param("x"), GP64())
    y := Load(Param("y"), GP64())

    ADDQ(x, y)

    Store(y, ReturnIndex(0))
    RET()

    Generate()
}

Notice how x and y are now actually variables, in this case virtual registers (64 bit) which have been allocated for us and initialized by Avo.

We then use the Store method to generate the instruction which pushes the result back onto the stack.

Avo also generates the stub file.

As stated previously, the primary operation I wanted to optimise with SIMD was rendering. This is not only because it was a bottleneck, but also because rendering is an ideal candidate (or in my case slightly less than ideal) for a SIMD workflow.

In my case, I also couldn't use any AVX-512 instructions as my CPU does not support it.

The first operation I wanted to make use of SIMD for was int vc, or the video clear interrupt. A better approach could have been to make use of the rep stosb instruction, as I didn't care what was already in the buffer and just wanted to replace it. However, the buffer format (RGB8) meant that this was not possible: rep stos can only repeat a 1, 2, 4, or 8 byte value, not a 3 byte pixel.

The steps I needed to take in order to do this were as follows:

Assemble the colour
Move the colour data to memory

"Assembling" the colour meant loading the RGB values from the parameters, then placing this data in the correct order in a vector register, or actually multiple, in order for it to be ordered properly.

Loading the parameters then ordering the colour correctly in a 32 bit register is quite simple:

red := Load(Param("red"), GP8())
green := Load(Param("green"), GP8())
blue := Load(Param("blue"), GP8())
tmp := GP32()

MOVL(U32(0), tmp)
MOVB(blue, tmp.As8())
SHLL(Imm(8), tmp)
ORB(green, tmp.As8())
SHLL(Imm(8), tmp)
ORB(red, tmp.As8())

The value in this register can then be moved to an XMM register using a MOVD (move double word) instruction. Also a quick note: x86 is little endian, which means that the least significant byte of a value is stored first, at the lowest address. This is why we move the blue in first, then green, then red, as opposed to the other way round, so that red ends up in the lowest byte. This is important when we consider the order of bytes in the XMM register.
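To make the byte order concrete, here is a plain Go sketch of the same shift-and-OR sequence. packColour and memoryOrder are names invented for this illustration; they are not part of goputer or Avo.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// packColour mirrors the shift-and-OR sequence: blue goes in first and ends
// up in the highest of the three used bytes, red goes in last and ends up lowest.
func packColour(red, green, blue uint8) uint32 {
	tmp := uint32(blue)
	tmp <<= 8
	tmp |= uint32(green)
	tmp <<= 8
	tmp |= uint32(red)
	return tmp
}

// memoryOrder returns the four bytes of v as a little-endian machine
// would lay them out in memory, lowest address first.
func memoryOrder(v uint32) [4]byte {
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], v)
	return buf
}

func main() {
	v := packColour(0x11, 0x22, 0x33)
	fmt.Printf("%#x %v\n", v, memoryOrder(v))
}
```

For red 0x11, green 0x22, and blue 0x33 the register value is 0x00332211, but the bytes in memory (and in the XMM register) are 11 22 33 00, which is the R, G, B order the framebuffer wants.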

Once we've constructed the colour, we can move it to the lowest 4 bytes of the XMM register:

MOVD(tmp, xmm0)

Once this data was in XMM, the rest of XMM needed to be filled with it. Also because it was a 3 byte pattern, and XMM registers are 16 bytes wide, the pattern would have to be repeated across 3 XMM registers in order to work properly.

An appropriate instruction for this is the shuffle instruction, specifically PSHUFB or Packed Shuffle Bytes. It will re-order the contents of one vector register, when given a shuffle mask contained in another. TLDR; When given source x, destination y, and mask m, y[i] = x[m[i]].

Values are coloured to show which index the mask sources its value from

In this case the register is filled with a purple-ish colour.

Or in code:

shuffle_mask := GLOBL("shuffle_mask", RODATA|NOPTR)
DATA(0, String([]byte{0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0}))

VPSHUFB(shuffle_mask, xmm0, xmm0)

There is a problem here: because the mask is not a complete sequence (it ends in 0 instead of 2), a single register cannot simply be repeated. Multiple registers with slightly different masks must be used in order to form one complete sequence.

first_mask := GLOBL("first_mask", RODATA|NOPTR)
DATA(0, String([]byte{0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0}))
second_mask := GLOBL("second_mask", RODATA|NOPTR)
DATA(0, String([]byte{1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1}))
third_mask := GLOBL("third_mask", RODATA|NOPTR)
DATA(0, String([]byte{2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2}))

MOVD(tmp, xmm0)
VPSHUFB(first_mask, xmm0, xmm0)
VPSHUFB(second_mask, xmm0, xmm1)
VPSHUFB(third_mask, xmm0, xmm2)
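The effect of the three masks can be checked with a scalar model in plain Go. This is only a sketch: pshufb and fillPattern are illustrative names, and the model follows the y[i] = x[m[i]] rule described earlier rather than anything specific to Avo.

```go
package main

import "fmt"

// pshufb is a scalar model of the byte shuffle: dst[i] = src[mask[i]].
// As on real hardware, only the low four bits of each mask byte select
// an index, and a mask byte with its top bit set produces zero.
func pshufb(src, mask [16]byte) [16]byte {
	var dst [16]byte
	for i, m := range mask {
		if m&0x80 != 0 {
			continue
		}
		dst[i] = src[m&0x0F]
	}
	return dst
}

// The three masks from above, each continuing where the previous one stopped.
var (
	firstMask  = [16]byte{0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0}
	secondMask = [16]byte{1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1}
	thirdMask  = [16]byte{2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2}
)

// fillPattern builds the 48 bytes that the three registers would hold.
func fillPattern(red, green, blue byte) []byte {
	colour := [16]byte{red, green, blue} // colour in the lowest bytes, rest zero
	var out []byte
	for _, mask := range [][16]byte{firstMask, secondMask, thirdMask} {
		reg := pshufb(colour, mask)
		out = append(out, reg[:]...)
	}
	return out
}

func main() {
	fmt.Println(fillPattern(0x80, 0x00, 0xFF))
}
```

Laid end to end, the three registers give 48 bytes, which is exactly 16 whole pixels, so the pattern can be repeated without a seam.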

Once we have shuffled the contents correctly for each register, we can store that data in memory:

Label("loop")

VMOVDQA(xmm0, Mem{Base: ptr})
VMOVDQA(xmm1, Mem{Base: ptr, Disp: 16})
VMOVDQA(xmm2, Mem{Base: ptr, Disp: 32})

VMOVDQA(xmm0, Mem{Base: ptr, Disp: 48})
VMOVDQA(xmm1, Mem{Base: ptr, Disp: 64})
VMOVDQA(xmm2, Mem{Base: ptr, Disp: 80})

VMOVDQA(xmm0, Mem{Base: ptr, Disp: 96})
VMOVDQA(xmm1, Mem{Base: ptr, Disp: 112})
VMOVDQA(xmm2, Mem{Base: ptr, Disp: 128})

VMOVDQA(xmm0, Mem{Base: ptr, Disp: 144})
VMOVDQA(xmm1, Mem{Base: ptr, Disp: 160})
VMOVDQA(xmm2, Mem{Base: ptr, Disp: 176})

ADDQ(Imm(192), ptr)
CMPQ(ptr, max)
JBE(LabelRef("loop"))

max is a value calculated before this (unrolled^[6]) loop which stores the bounds of the array. The Mem struct tells avo to interpret this argument as a memory address, with the start (base) of this address being the value of ptr, and then offset by the value of Disp.

VMOVDQA is the "Move Double Quadword Aligned" instruction, which moves the contents of a vector register to main memory and vice versa. The "Aligned" part means that the memory address must be a multiple of the operand size, in our case 16 bytes as we are storing 128 bit XMM registers (it would be 32 bytes for YMM). An unaligned address causes a fault with this instruction. The unaligned variant, VMOVDQU, accepts any address, but aligned accesses never straddle a cache line, so they tend to be at least as fast.
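How goputer aligns its framebuffer isn't covered here, but as a rough sketch, one way to check for (or obtain) a 16 byte aligned buffer from Go is shown below. isAligned and alignedBytes are hypothetical helpers written for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// isAligned reports whether the first byte of buf sits on an n-byte boundary.
// buf must be non-empty.
func isAligned(buf []byte, n uintptr) bool {
	return uintptr(unsafe.Pointer(&buf[0]))%n == 0
}

// alignedBytes returns a slice of the requested size whose first byte is
// aligned to n bytes, by over-allocating and skipping the unaligned prefix.
func alignedBytes(size int, n uintptr) []byte {
	raw := make([]byte, size+int(n))
	offset := 0
	if rem := uintptr(unsafe.Pointer(&raw[0])) % n; rem != 0 {
		offset = int(n - rem)
	}
	return raw[offset : offset+size]
}

func main() {
	buf := alignedBytes(192, 16)
	fmt.Println(len(buf), isAligned(buf, 16)) // 192 true
}
```

Over-allocating wastes at most n bytes, which is a small price compared to a faulting store.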

A "video area" (int va) is the term used in goputer for drawing a rectangle to the screen. The logic for drawing an opaque video area is broadly the same as video clearing, apart from a different loop, so I have decided not to include it for brevity. Instead, this section focuses on drawing a transparent area and the additional code which is required to do that.

The main difference between drawing an opaque colour and a transparent colour is that we now need to not only load data from memory beforehand, but also perform arithmetic on it.

Because goputer's framebuffer doesn't actually store alpha values, blending is done in a sort of fake way:

$$dest_{R} = ((src_{R} \times src_{A}) + (dest_{R} \times \neg src_{A})) \gg 8$$

Or in a code form:

dest[0] = byte((int(src[0])*int(src[3]) + int(dest[0])*int(^src[3])) >> 8)
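As a self-contained sketch, the formula for a single channel can be written as a small Go function. blendChannel is a name chosen for this example; the arithmetic is the same as in the line above, just done in 16 bits.

```go
package main

import "fmt"

// blendChannel applies the formula to one colour channel.
// ^alpha on a uint8 is the bitwise complement, i.e. 255 - alpha.
// The sum is at most 255*255 = 65025, so it always fits in 16 bits.
func blendChannel(src, dest, alpha uint8) uint8 {
	sum := uint16(src)*uint16(alpha) + uint16(dest)*uint16(^alpha)
	return uint8(sum >> 8)
}

func main() {
	fmt.Println(blendChannel(255, 0, 230))   // 229
	fmt.Println(blendChannel(255, 100, 230)) // 238
	fmt.Println(blendChannel(255, 255, 255)) // 254
}
```

The last line shows the "fake" part: shifting right by 8 divides by 256 rather than 255, so a fully opaque white comes out as 254.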

Fortunately, this is fairly easy to vectorise. The first part of the operation, $(src_{R} \times src_{A})$, can be performed ahead of the main loop because it is constant for each channel. However, before we do that, the byte values in the register need to be widened to 16 bit words, in order to avoid overflow and preserve accuracy.

This can be done with the VPMOVZXBW, or packed move with zero extend, instruction. In our use case, it will take every byte in an XMM register, zero extend it to a word (which is to say it pads it with extra zeros instead of leaving garbage data in the upper bits), then pack those words into the YMM register. For example, in the alpha channel, an alpha value of 230 (0xE6) will become 0x00E6 when widened.

In order to fill the alpha registers with the alpha and inverted alpha values before they are widened, the VPBROADCASTB instruction (VPBROADCAST with the B, for byte, suffix) can be used. This instruction tells the CPU to take the first byte (or word, double-word etc. for the other suffixes) of one register, and then fill another (or the same) register with it.

With all the data in place, now the final colour result needs to be calculated. As mentioned previously, there exists a set of instructions for operating on packed values. So calculating the first part of the expression is as follows:

VPMULLW(widened_shuffled_colour_bytes, widened_alpha_values, widened_shuffled_colour_bytes)

This instruction multiplies each pair of packed words, and then stores the low 16 bits (bits 0-15) of each result.
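The widen, broadcast, and multiply steps can be modelled lane by lane in plain Go. This is a sketch of what the instructions do, not the goputer routine itself; widen, broadcast, and mullw are illustrative names, and the model uses 16 lanes for simplicity.

```go
package main

import "fmt"

// widen models VPMOVZXBW: each byte becomes a 16-bit word with a zero upper half.
func widen(in [16]byte) [16]uint16 {
	var out [16]uint16
	for i, b := range in {
		out[i] = uint16(b)
	}
	return out
}

// broadcast models VPBROADCASTB: every lane receives the same byte.
func broadcast(b byte) [16]byte {
	var out [16]byte
	for i := range out {
		out[i] = b
	}
	return out
}

// mullw models VPMULLW: lane-wise multiply keeping only the low 16 bits.
// Go's uint16 multiplication wraps in the same way.
func mullw(a, b [16]uint16) [16]uint16 {
	var out [16]uint16
	for i := range out {
		out[i] = a[i] * b[i]
	}
	return out
}

func main() {
	alpha := widen(broadcast(0xE6))
	colour := widen([16]byte{0xFF, 0x80, 0x00}) // remaining lanes are zero
	product := mullw(colour, alpha)
	fmt.Println(alpha[0], product[0], product[1])
}
```

Because a byte times a byte is at most 255 * 255 = 65025, the low 16 bits are the whole product here and nothing is lost.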

As for computing the rest of the expression, firstly the data needs to be loaded from memory and widened. Also, unlike before where the data in memory could be discarded, this time the 16th byte needs to be preserved for the next iteration - we are only writing 15 bytes at a time. This can be done with the PEXTRB and PINSRB, extract byte and insert byte respectively, instructions. I wouldn't actually recommend using these, as you will see later, but I am including them here so you can learn from my mistakes.

Load memory data, and extract last byte:

VMOVDQU(Mem{Base: RDI}, memory_data)
XORL(last_byte, last_byte)
PEXTRB(Imm(15), memory_data, last_byte)

Imm(15) tells Avo that this is an immediate value - a value embedded in the instruction operands, as opposed to being stored in a register or memory. In this case it will automatically determine the size, but similar functions exist for generating an immediate of a specific size.

Widen memory data, then multiply it with the inverted alpha values.

VPMOVZXBW(memory_data, widened_memory_data)

// "Multiply Packed Signed Integers and Store Low Result"
VPMULLW(widened_memory_data, widened_inverted_alpha_values, widened_memory_data)

Finally, add the two values together and perform the packed shift operation:

// "Add Packed Integers"
VPADDW(widened_memory_data, widened_shuffled_colour_bytes, widened_memory_data)

// "Shift Packed Data Right Logical"
VPSRLW(Imm(8), widened_memory_data, widened_memory_data)

Now that the values have been calculated, they need to be packed back down into bytes, and stored back into main memory.

To pack the numbers back down, PACKUSWB ("Pack With Unsigned Saturation") is used. Saturation refers to how the words (which the instruction treats as signed, so from -32768 to 32767) are compressed back down into the byte range of 0-255. If the value is $> 255$, then it is clipped at 255. If it is $< 0$, then it is clipped at zero. If in-between, then it is kept the same. After the shift every value here is already within 0-255, so nothing is actually clipped.
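The clamping rule for a single lane is short enough to sketch in Go. packus is a made-up name for this example, and it models one word of the instruction rather than a whole register.

```go
package main

import "fmt"

// packus models one lane of PACKUSWB: a signed 16-bit word is clamped
// to the unsigned byte range 0-255.
func packus(w int16) uint8 {
	switch {
	case w < 0:
		return 0
	case w > 255:
		return 255
	default:
		return uint8(w)
	}
}

func main() {
	for _, w := range []int16{-5, 0, 200, 255, 300} {
		fmt.Println(w, "->", packus(w))
	}
}
```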

Because the pack instruction operates on each 128 bit lane of the YMM register separately, the colour values are now split between the two lanes. Note that this takes place after the rightwards shift, hence each value starts with 0x00:

Values which will be kept are in bold. The line through the centre is at the lane border.

Luckily this can be rectified with the VPERMQ (Qwords Element Permutation) instruction. This instruction functions in a similar way to the shuffle instruction from earlier, but this time it operates on quad-words (64 bit) and the mask is only one byte, with four two bit values for the locations. The 256 bit YMM register is split up into 4 quad-word slots.

In this case the mask is 0xD8, or 0b11011000 in binary form. The destination is implicit, based on the location of the bits within the mask (the lowest two bits are for Q0), and the value of those bits selects the source.

Bits	Source	Destination
00	Q0	Q0
10	Q2	Q1
01	Q1	Q2
11	Q3	Q3
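Decoding the mask by hand is error-prone, so here is a scalar Go model of the permutation. permq is an illustrative name; the model treats the YMM register as four 64 bit slots, Q0 to Q3.

```go
package main

import "fmt"

// permq models VPERMQ on a 256-bit register held as four 64-bit slots.
// Destination slot i takes the source slot selected by the two mask bits
// at position 2i (so the lowest two bits of imm choose the source for Q0).
func permq(src [4]uint64, imm uint8) [4]uint64 {
	var dst [4]uint64
	for i := range dst {
		dst[i] = src[(imm>>(2*uint(i)))&3]
	}
	return dst
}

func main() {
	src := [4]uint64{0xA0, 0xA1, 0xA2, 0xA3} // Q0..Q3
	fmt.Printf("%#x\n", permq(src, 0xD8))
}
```

With 0xD8 the two middle slots swap places and the outer two stay put, which moves the packed bytes from both lanes next to each other in the lower half of the register.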

The result:

When all the data is in the correct place in YMM, the lower lane (bits 0-127) can be extracted and placed in XMM. This data can then be placed back into memory, but only after the last byte is inserted.

VEXTRACTI128(Imm(0), widened_memory, memory_data)

PINSRB(Imm(15), last_byte, memory_data)
VMOVDQU(memory_data, Mem{Base: RDI})

Note: This time the move instruction is the unaligned variant, as the pointer is incremented by 15 bytes each time, which is not a multiple of 16 and so breaks alignment.

All of the code in this section has been extracted from the routine used in goputer, which is available on GitHub. The next section explains the rest of that routine in a bit more detail. It is not necessary for understanding the conclusion, so feel free to skip it.

The loop itself is a standard for y { for x } style loop:

Label("a_loop")

MOVQ(ptr, RDI)

CMPL(width, Imm(5))
JB(LabelRef("a_blit_remaining"))

Label("a_loop_x")

// 
// Loop body
// 

Label("a_loop_end")

XORL(counter_x, counter_x)
ADDQ(U32(960), ptr)
INCL(counter_y)
CMPL(counter_y, rows)
JB(LabelRef("a_loop"))

The body of the loop contains an additional branch to check if we are drawing $< 5$ pixels ( $15 bytes \div 3 = 5$ ). This contains instructions which draw in a scalar as opposed to vector manner:

MOVBWZX(Mem{Base: RDI}, tmp_16)
IMULW(inv_alpha_16, tmp_16)
ADDW(red_16, tmp_16)
SHRW(Imm(8), tmp_16)
MOVB(tmp_16.As8L(), Mem{Base: RDI})

The instructions are the same for green and blue, but with a Disp value in the Mem struct of 1 and 2 respectively, and the corresponding colour register in place of red_16.

Now only one question remained:

"Is this faster than before?"

After all, it would be a bit pointless if it wasn't.

According to goputer's builtin profiler the video clear operation was now ~16% faster in canned benchmarks^[7]. Not exactly a huge improvement but still something.

The video area operation with no alpha was actually on par with the scalar code from before. I tried optimising it further with cache prefetch instructions but these had no measurable effect.

Initially, the alpha area operation was slower, but after removing the extract/insert byte instructions and replacing them with normal memory -> register and vice versa move instructions, there was a ~60% improvement.

The final code examples used in this article, after improvements are:

In conclusion, while some of the improvements made aren't exactly as much as what is theoretically possible with SIMD, they are still something. At the end of the day computers are very fast, and as such will run un-optimised code just as well as optimised code.

As I write this, Go is planning to add experimental support for vector operations as a builtin feature in the standard library in the next version, 1.26. So if I had done this a few months later it might have been a lot easier, and in the future it'll hopefully support more architectures (even including WASM), so I won't have to touch assembly at all. On the other hand, it has certainly been very interesting learning the ins and outs of x86 assembly and SIMD in particular.

Finally, apart from the addition of a high level language, the rest of the improvements I have planned for goputer are fairly minor things, so there probably won't be another post like this for a while. For now, look at the other posts on goputer or the repository itself.

[1] SIMD instructions work better when using multiples of 4 elements, see this ARM documentation.
[2] In reality the reference Python interpreter, CPython, does not vectorise code. Vectorisation in Python can be achieved by using libraries such as numpy.
[3] See paddd on the x86 reference.
[4] On x86 the terms byte (8 bits), word (16 bits), double-word (32 bits), and quad-word (64 bits) are sometimes used to denote integer sizes.
[5] This example is from the avo library.
[6] "Unrolling" refers to the practice of including multiple iterations of a loop in its body, in order to reduce the total number of branches which occur.
[7] Benchmarks were conducted using these examples and this hardware.
- Video clearing: examples/video_clear_bench.gpasm
- Video area, no alpha: examples/area_test.gpasm
- Video area, with alpha: examples/alpha_area_test.gpasm
System specification:
- CPU: Ryzen 7 7730U (8C, 16T)
- Memory: 16GB
- WSL OS: Fedora 42, 6.6.87.2-microsoft-standard-WSL2
- Host OS: Windows 11, 10.0.26200

Making pong with my fantasy VM

2025-10-28T00:00:00

Note: This is a continuation from my previous post. This post can also serve as a more in-depth explanation of one of the goputer examples as well as a tutorial of sorts.

For as long as goputer has existed, I have never made what I would call a complete program or project with it. Any code that has been written for it has been short examples, intended to test functionality rather than showcase what is possible. With this in mind I decided to embark on creating Pong. While I could have possibly gone for something a bit more advanced, such as Snake or Tetris, Pong was the simplest thing I could make whilst still testing the full feature set of goputer.

This post is split into four parts with a conclusion:

Drawing
Moving
Scoring
Debugging

Even though this post describes the function of each piece of code, I would still recommend skimming over the syntax page in the documentation beforehand.

This was the easiest part. I often describe goputer as "an assembly language that controls Microsoft Paint", and it's not far from the truth. A huge part of the feature-set of goputer is drawing stuff to the screen. After all, what's the point of an educational tool if you can't draw things.

A pong screen is very simple to draw, it contains two paddles, a ball, and a dividing line. I'll get onto the score later.

Firstly the initial positions need to be defined so they can be changed later if necessary. It's also very poor practice to not use constants where possible.

#def paddle_offset 5
#def paddle_width 5
#def paddle_height 25
#def paddle_speed 10

#def left_paddle_pos 120
#def right_paddle_pos 120

#def ball_x 160
#def ball_y 120

The first set of definitions define constants that are referenced in later code. The second set define the positions of the paddles and the ball respectively. The reason why everything is multiples of five is because it makes bounds checking ever so slightly easier, i.e. I shouldn't have to worry about integer underflow.

These needed to be used in conjunction with a routine that draws the shapes in question onto the screen:

...

lda @left_paddle_pos
mov d0 vy0
add vy0 $d:paddle_height
mov a0 vy1

mov $320-d:paddle_offset-d:paddle_width vx0
mov $320-d:paddle_offset vx1

int va

...

A quick note on the immediate values prefixed with $:

Expressions are evaluated from left-to-right, and don't follow normal operator precedence.
d:xyz refers to the value of a definition at compile time.
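To show what strict left-to-right evaluation means in practice, here is a small Go sketch. It is not goputer's compiler: evalLeftToRight is a hypothetical function, it assumes the d: definitions have already been replaced with their values, and the set of operators (+ - * /) is an assumption for the example.

```go
package main

import (
	"fmt"
	"strconv"
)

// evalLeftToRight evaluates an expression such as "320-5-5" strictly from
// left to right, ignoring the usual operator precedence.
// Only non-negative integer literals and + - * / are handled here.
func evalLeftToRight(expr string) (int, error) {
	result := 0
	op := byte('+')
	start := 0
	for i := 0; i <= len(expr); i++ {
		if i < len(expr) && expr[i] >= '0' && expr[i] <= '9' {
			continue // still inside a number
		}
		n, err := strconv.Atoi(expr[start:i])
		if err != nil {
			return 0, fmt.Errorf("bad number %q", expr[start:i])
		}
		switch op {
		case '+':
			result += n
		case '-':
			result -= n
		case '*':
			result *= n
		case '/':
			if n == 0 {
				return 0, fmt.Errorf("division by zero")
			}
			result /= n
		default:
			return 0, fmt.Errorf("unknown operator %q", op)
		}
		if i < len(expr) {
			op = expr[i]
			start = i + 1
		}
	}
	return result, nil
}

func main() {
	fmt.Println(evalLeftToRight("320-5-5")) // 310 <nil>
	fmt.Println(evalLeftToRight("2+3*4"))   // 20 <nil>
}
```

The second call is where the difference shows: normal precedence would give 14, whereas going left to right gives 20.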

Drawing the ball and the centre line followed a similar routine.

A static image is pretty useless if you're trying to make a game, therefore the next thing I focused on was trying to make things move.

Firstly, I focused on moving the paddles, as I figured this would be the easiest thing to do because the code should be the same for both of them. The steps that were required were as follows:

Interrupt to register the keypress
Then decide which key it is, and thus which paddle to move
Move the respective paddle

Registering an interrupt listener is simple, however you can only register one per interrupt.

#label keyup

// Do something

iret

#intsub kd keyup

This means the keyup label will be called whenever someone presses a key. After an interrupt is called the next step is to preserve any registers we are going to write to.

push r01
push dp
push dl
push a0
sta @d0_store

The function of the last line is slightly different because it stores the entire value in d0 on the static stack (populated and allocated at compile time) as opposed to the dynamic stack (populated and allocated at runtime). The sta instruction differentiates between addressing the static stack with immediate values and main memory with registers by checking if the argument is greater than the number of registers. If so the first 4 bytes at that address show how many bytes to store.

After this there are a series of cascading equality checks to determine which key was pressed:

...

neq kc $23
cndjmp @m_lp_down
lda @left_paddle_pos
mov d0 r01
eq r01 $0
cndjmp @move_end
sub r01 $d:paddle_speed
mov a0 d0
sta @left_paddle_pos
jmp @move_end

...

The above code does the following:

Check whether the kc (current key) register holds any value other than 23. If so, jump to the next equality check; if not, keep going.
Load the paddle position and see if it is equal to zero. If it is, jump to the cleanup label because we can't move any further.
Then subtract the paddle speed, because in this case we are moving up the screen.
Store the paddle position for later use and jump to the cleanup routine.

The ball is slightly different given the fact it moves in two dimensions, not one and needs to move every frame/update as opposed to whenever a user presses a specific key. I'm not showing the code that loads values into registers as you've already seen an example of that but in order to help with understanding the values in each register are as follows:

r00 is the ball's X position.
r01 is the Y position.
r02 is the X direction.
r03 is the Y direction.

...

// Bounds right
neq r00 $320-d:ball_size
cndjmp @bbc_left
mov $0 r02
push $0
call @increase_score
jmp @check_paddles
// Bounds left
#label bbc_left
neq r00 $0
cndjmp @check_paddles
mov $1 r02
push $1
call @increase_score

...

You may notice the call to increase_score but I'll get back to that later. The bounds checking code for the right is the same. However, checking the top and bottom is different because the ball has to bounce, not resetting back to the centre of the screen.

"Bouncing" the ball for both the paddles and the top and bottom of the screen is simple. If the current Y direction value is 1, say when hitting the bottom of the screen, we make it zero and then the ball will go up when the move code is called. This same bouncing logic applies when hitting a paddle, yet this time the X direction is changed.

See the code for checking the right paddle:

...

eq r00 $320-d:ball_size-d:paddle_width-d:paddle_offset
mov a0 r15
gt r01 r05
mov a0 r14
add r05 $d:paddle_height
lt r01 a0
mov a0 r13
and r15 r14
and a0 r13
inv a0
cndjmp @check_left_paddle
mov $0 r02
jmp @bbc_bottom

...

The process of the routine is as follows:

Check if the X position of the ball is such that it is in-line with the left hand side of the paddle.
Then see if the ball's Y position is in-between the top and bottom of the paddle.
If both of these conditions are true (successive and operations), then we invert the ball direction and jump to the bottom bounds check.
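The same check reads a little more easily in a high level language, so here is a Go sketch of it. The constants come from the definitions earlier in the post, except ballSize, which is an assumed value, and the sketch also assumes that r05 holds the Y position of the right paddle. hitsRightPaddle is a name invented for this example.

```go
package main

import "fmt"

const (
	screenWidth  = 320
	paddleOffset = 5
	paddleWidth  = 5
	paddleHeight = 25
	ballSize     = 5 // assumed: the value of ball_size is not shown in the post
)

// hitsRightPaddle mirrors the three comparisons in the assembly:
// an equality check on X, then two strict comparisons on Y.
func hitsRightPaddle(ballX, ballY, paddleY int) bool {
	inLine := ballX == screenWidth-ballSize-paddleWidth-paddleOffset
	belowTop := ballY > paddleY
	aboveBottom := ballY < paddleY+paddleHeight
	return inLine && belowTop && aboveBottom
}

func main() {
	fmt.Println(hitsRightPaddle(305, 130, 120)) // true
	fmt.Println(hitsRightPaddle(305, 150, 120)) // false
}
```

In the assembly the combined result is inverted before the conditional jump, so the jump to check_left_paddle is taken when this function would return false.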

Now all the movement logic was in place I needed a way to score the game. As mentioned earlier, the increase_score routine is called whenever the ball hits the left or right side. Before it is called, one value is pushed to the stack - which player's score to increase.

...

lda @player_one_score
mov d0 r14
add r14 $1
mov a0 d0
sta @player_one_score
jmp @reset_ball

...

Above: Increasing the score for the left player/player one.

Even though the score was increasing, you still couldn't see what its value was, at least without inspecting registers/memory locations. I still needed to write the assembly to display it.

In any other language you'd simply make use of a standard library function to convert an integer to a string: in Go you'd use strconv.Itoa(x), and in Python the even simpler str(x). Unfortunately, working in an assembly language means you don't have access to such luxuries so you have to make these things yourself.

The loop that is used to convert a number into a string is below:

...

eq r01 $0
cndjmp @draw_zero

#label convert_number
mod r01 $10
add a0 $48
mov $1 dl
mov a0 vt
sr vt $1
div r01 $10
mov a0 r01

neq r01 $0
cndjmp @convert_number

sl vt $1
int vt

...

First we check if the number is already zero, as this is a special case. After that we keep taking the remainder of the number divided by ten, then writing the corresponding character to vt (the text buffer) - we also shift the text buffer right at this stage. We store the result of the division of the number for use in the next iteration of the loop. This repeats until the result of the division is zero.

We also have to shift the text buffer left by one byte to cleanup from the last iteration.
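For comparison, here is roughly the same algorithm in Go. It is a sketch of the technique rather than a translation of the text buffer handling: prepending to a byte slice stands in for shifting vt to the right, and itoa is just a local helper name.

```go
package main

import "fmt"

// itoa converts a non-negative integer to its decimal string by repeatedly
// taking the remainder modulo ten, as the goputer loop does.
func itoa(n uint32) string {
	if n == 0 {
		return "0" // special case: the loop below would produce nothing
	}
	var digits []byte
	for n != 0 {
		digit := byte(n%10) + 48                  // 48 is the ASCII code for '0'
		digits = append([]byte{digit}, digits...) // prepend, like shifting the buffer right
		n /= 10
	}
	return string(digits)
}

func main() {
	fmt.Println(itoa(0), itoa(7), itoa(1234)) // 0 7 1234
}
```

The digits come out least significant first, which is why each new one has to go in front of the ones already written.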

In all honesty there weren't actually that many bugs with the assembly I wrote, at least not any major ones that took ages to figure out. That being said, there were some problems with the runtime that I didn't know about or should have known about:

The data stack didn't work properly, which meant the ball always teleported to the top of the screen when a key was pressed.
The possibility of a race condition if a key was pressed in the middle of a ball update.

The stack not working properly was easy enough to fix, the race condition however required the addition of two new instructions and the exposure of the previously internal interrupt flag through a new control register. The two new instructions were inspired by x86's sti and cli (set and clear interrupt bit) instructions, albeit with different names.

pri

mov r00 d0
sta @ball_x

mov r01 d0
sta @ball_y

mov r02 d0
sta @ball_direction_x

mov r03 d0
sta @ball_direction_y

eni

In this case, pri prevents interrupts, and eni re-enables them. The instructions in-between write the updated ball location data to memory.

The pong program running in the web frontend

The next steps are probably to write a better Pong clone rather than the quite simple one I've got now, and to find a way to speed up the VM - which will almost certainly require writing some of the rendering instructions in assembly. The other major improvement I have made since the last post has been to rewrite the expansion system around Lua instead of Go's plugin library, meaning that it is now cross-platform.

Other than making the VM faster, a proper IDE or even a VSCode extension would be nice.

If you want to see any other posts linked to Goputer, click here or the repository is here. The full source code for the Pong example talked about in this blog post is on GitHub here.

Creating a profiler for goputer

2025-08-09T00:00:00

Note: This is a continuation from my previous post.

For a while now, goputer has had no way of measuring performance. The closest you could get was adding a frame rate counter to a frontend and hoping it would be somewhat accurate - spoiler, it wasn't. Turns out it takes a lot longer to render a frame than to add two numbers together, who would've guessed. This necessitated the creation of a better solution: a profiler.

I wasn't aiming for anything as complex as Google's pprof. Instead I was just intending to measure how long each instruction took to run, then do a tiny bit of statistics to make the collected values slightly more useful. If anything it would also serve to highlight any outstanding issues with the core runtime in addition to any assembly I wrote.

My only experience using anything that could be considered a profiler thus far had been using browser dev-tools, which are much more complicated than anything I intended to make and are made for a very different domain. As such, I settled upon a very simple way of measuring performance:

Measure the time for each cycle
Add that data to an entry in a hashmap
Do some additional stuff to calculate the mean etc. at the end

As for the design, I decided on a Profiler type which would have a Data field of type map[uint64]*ProfileEntry (the pointer was so I could change values in place). ProfileEntry would contain various data about that specific instruction - total cycle time, times executed.
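A minimal sketch of that design might look like the following. Only the Profiler and ProfileEntry type names and the Data field come from the description above; the entry's field names and the Record and Mean methods are assumptions added for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// ProfileEntry holds the running totals for one instruction at one address.
// The field names are assumptions; only the kinds of data are known.
type ProfileEntry struct {
	TotalCycleTime uint64 // summed cycle time, in nanoseconds
	TimesExecuted  uint64
}

// Profiler maps a packed instruction/address key to its entry.
type Profiler struct {
	Data map[uint64]*ProfileEntry
}

// Record adds one measured cycle to the entry for key, creating the entry
// if needed. Record is a hypothetical method name.
func (p *Profiler) Record(key uint64, elapsed time.Duration) {
	entry, ok := p.Data[key]
	if !ok {
		entry = &ProfileEntry{}
		p.Data[key] = entry
	}
	// The map stores pointers, so the entry is updated in place.
	entry.TotalCycleTime += uint64(elapsed.Nanoseconds())
	entry.TimesExecuted++
}

// Mean returns the average cycle time in nanoseconds (hypothetical helper).
func (e *ProfileEntry) Mean() float64 {
	if e.TimesExecuted == 0 {
		return 0
	}
	return float64(e.TotalCycleTime) / float64(e.TimesExecuted)
}

func main() {
	p := &Profiler{Data: make(map[uint64]*ProfileEntry)}
	p.Record(1, 100*time.Nanosecond)
	p.Record(1, 300*time.Nanosecond)
	fmt.Println(p.Data[1].Mean())
}
```

With a map of plain structs each update would need a read, a modify, and a write back into the map, which is what storing pointers avoids.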

The key for the map was created by packing the instruction (5 bytes) and the current program counter value into a single uint64, sort of akin to a composite key. While this meant that only 3 of the program counter's 4 bytes were available (largest value 16,777,215, so 16MB), the default addressable memory size for goputer (not including the video buffer) is 65KB, so I don't see this as being a problem unless the program counter is overwritten. If I wanted to in the future, I could change the key to two uint64s or just 9 bytes.

Instruction	Address
40 bits	24 bits
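The packing can be sketched with a couple of shifts and masks. The function names are hypothetical, and putting the instruction in the high 40 bits with the address in the low 24 bits is an assumption based on the column order of the table above.

```go
package main

import "fmt"

const (
	addressBits = 24
	addressMask = 1<<addressBits - 1 // 0xFFFFFF, i.e. 16,777,215
	instrMask   = 1<<40 - 1          // 5 bytes of instruction
)

// packKey places the 40-bit instruction in the high bits and the low
// 24 bits of the program counter in the low bits.
// The bit order is an assumption made for this sketch.
func packKey(instruction uint64, pc uint32) uint64 {
	return (instruction&instrMask)<<addressBits | uint64(pc)&addressMask
}

// unpackKey reverses packKey.
func unpackKey(key uint64) (instruction uint64, pc uint32) {
	return key >> addressBits, uint32(key & addressMask)
}

func main() {
	key := packKey(0x0102030405, 0x00ABCD)
	instr, pc := unpackKey(key)
	fmt.Printf("%#x %#x %#x\n", key, instr, pc)
}
```

Any program counter bits above the 24th are silently dropped by the mask, which is the truncation the text refers to.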

Now that I had collected the data, I needed some way to store it. I could have simply dumped the entire thing to a JSON file and been done with it, but this had the major drawbacks of being slower and producing a larger file. Also, I just needed to store numbers.

All byte orders are little endian.

Magic number	Number of entries	Total cycles	Entries
`GPPR`	8 bytes (uint64)	8 bytes (uint64)	Variable (33 bytes per entry)

Each entry was further broken down as follows:

Instruction	Address	Total cycle time	Total times executed	Standard deviation
5 bytes	4 bytes (`uint32`)	8 bytes (`uint64`)	8 bytes (`uint64`)	8 bytes (`float64`)

Note: The standard deviation is technically stored as a uint64 in little-endian order, see math.Float64bits which does some unsafe.Pointer stuff, specifically *(*uint64)(unsafe.Pointer(&f)), where f is the float64 in question.

To make things a bit nicer the Load and Dump methods both used Go's io.ReadSeeker and io.WriteSeeker interfaces respectively.
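As a rough sketch of how one 33-byte entry could be written and read back, the following uses encoding/binary and math.Float64bits from the standard library. The entry type, its field names, and the encodeEntry and decodeEntry helpers are hypothetical, and the two 8-byte integer fields are treated as uint64 so that the sizes add up to 33 bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

const entrySize = 33 // 5 + 4 + 8 + 8 + 8

// entry mirrors the on-disk layout. Field names are assumptions.
type entry struct {
	Instruction    [5]byte
	Address        uint32
	TotalCycleTime uint64
	TimesExecuted  uint64
	StdDev         float64
}

// encodeEntry writes every field in little-endian order.
func encodeEntry(e entry) [entrySize]byte {
	var buf [entrySize]byte
	copy(buf[0:5], e.Instruction[:])
	binary.LittleEndian.PutUint32(buf[5:9], e.Address)
	binary.LittleEndian.PutUint64(buf[9:17], e.TotalCycleTime)
	binary.LittleEndian.PutUint64(buf[17:25], e.TimesExecuted)
	// The float is stored as its raw IEEE 754 bit pattern.
	binary.LittleEndian.PutUint64(buf[25:33], math.Float64bits(e.StdDev))
	return buf
}

// decodeEntry reverses encodeEntry.
func decodeEntry(buf [entrySize]byte) entry {
	var e entry
	copy(e.Instruction[:], buf[0:5])
	e.Address = binary.LittleEndian.Uint32(buf[5:9])
	e.TotalCycleTime = binary.LittleEndian.Uint64(buf[9:17])
	e.TimesExecuted = binary.LittleEndian.Uint64(buf[17:25])
	e.StdDev = math.Float64frombits(binary.LittleEndian.Uint64(buf[25:33]))
	return e
}

func main() {
	e := entry{
		Instruction:    [5]byte{1, 2, 3, 4, 5},
		Address:        0xABCD,
		TotalCycleTime: 1000,
		TimesExecuted:  4,
		StdDev:         1.5,
	}
	buf := encodeEntry(e)
	fmt.Println(len(buf), decodeEntry(buf) == e)
}
```

Storing the bit pattern rather than a formatted number means the standard deviation survives the round trip exactly.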

Collecting all this information was okay, but it was pretty useless on its own unless you like reading binary file formats or printed structs. I needed some way to display it, and for this I chose tview, a TUI framework. While there were already some small GUI apps I had developed for goputer, most notably the launcher, they really should have been console apps instead.

The profiler UI, with instructions sorted by times executed

In the above screenshot, the source for the executable being analysed is available here. As you may have noticed, there is also an option for grouping. This simply groups instructions which have the same opcode together and aggregates their data (apart from standard deviation and addresses).
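Grouping of this kind amounts to summing the totals per opcode. The sketch below is an assumption about how that could be done, not the profiler's actual code: the row and group types, the groupByOpcode function, and the use of a mnemonic string as the opcode are all hypothetical.

```go
package main

import "fmt"

// row is a hypothetical flattened view of one profiler entry.
type row struct {
	Opcode         string // instruction mnemonic, e.g. "mov"
	Address        uint32
	TotalCycleTime uint64
	TimesExecuted  uint64
}

// group holds the aggregated totals for every entry sharing an opcode.
// Addresses and standard deviations are deliberately not carried over.
type group struct {
	TotalCycleTime uint64
	TimesExecuted  uint64
}

// groupByOpcode sums cycle time and execution counts per opcode.
func groupByOpcode(rows []row) map[string]group {
	groups := make(map[string]group)
	for _, r := range rows {
		g := groups[r.Opcode] // zero value if the opcode is new
		g.TotalCycleTime += r.TotalCycleTime
		g.TimesExecuted += r.TimesExecuted
		groups[r.Opcode] = g
	}
	return groups
}

func main() {
	rows := []row{
		{"mov", 0x10, 400, 4},
		{"mov", 0x20, 600, 6},
		{"add", 0x30, 50, 1},
	}
	fmt.Println(groupByOpcode(rows)["mov"])
}
```

Standard deviations are left out because they cannot simply be added together across entries.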

Profiler with grouping enabled

You can also jump to instances of specific instructions using the "instruction" field. The first screenshot is much closer to the original representation of the data.

As a side note, the numbers in that screenshot are much higher than they should be; in canned benchmarks they were much lower. My current theory is that for some reason running a windowed application in WSL is slower than a non-windowed one. I'll only find out for sure once I make goputer run natively (on Windows without WSL) though.

Another useful addition might be to add some other form of visualisation (e.g. bar charts), however I can't really see the utility of it over the existing data view.

Now armed with a profiler I can hopefully go about creating more complex programs and maybe even a high-level language, using the profiler to make performance analysis ever so slightly easier.

This was one of the last major improvements I had planned for goputer. The ones most likely to be developed next are moving frontends to standalone executables instead of using Go's plugin library (which only works on Linux, FreeBSD, and macOS, so no Windows support at all) and moving extensions to embedded Lua for the same reason. Software rendering also doesn't show in the Python frontend, which hasn't been updated since software rendering was implemented, so that'll have to be fixed as well.

Also, at the moment the profiler is just tacked onto the gp32 runtime without any way of turning it off, but that'll come as part of the previously mentioned move to standalone frontend executables.

If you want to see any other posts linked to Goputer, click here or the repository itself is here.

URL Shortener with Flask and HTMX

2025-07-07T00:00:00

It's been a while since I've made anything new with HTMX. The last thing was the comment section for this blog, which itself is made using Flask and HTMX. TLDR; I like using Flask and HTMX. Something I've always wanted to have is some kind of link shortening service. I'm well aware that services like bit.ly and TinyURL exist; however, in my opinion it's much better to have something of your own, especially when it's incredibly simple to make.

While I could've used a fullstack framework for this, e.g. Svelte, I already had some experience using Flask and HTMX together now and I wanted to see how quickly I could create something usable. In the end it turned out that I was indeed able to create something rather quickly.

Since I had already chosen Flask and HTMX the only remaining decision was what database engine to use. This ended up being SQLite for the same reasons mentioned before, ease of use, familiarity etc. This project was essentially the biggest case of why you shouldn't reinvent the wheel, something I have done too many times already, using new stacks for projects where the scope just isn't worth it, and later coming to regret it as a result.

The codebase was then hammered together using the same outline I had figured out when I added comments to this blog, with some minor improvements - having data migration from the start and correct use of indexes. In the end the thing I ended up with was rather ugly, because I didn't use any CSS bar some for colouring error messages, choosing to stick to the principles of justfuckingusehtml.com instead.

I also added a JSON API which used the same routes as the HTMX, just expecting and returning JSON instead of form data and HTML respectively. No need for another backend server when you have if statements and schemas.

Zoomed in screenshot of the admin UI

Opaque decides whether to return a 301 redirect response or a small JavaScript snippet that does the redirection. This means that link previews won't work when it's enabled. Useful for creating rick roll links.

You can test it for yourself now by clicking on this link which redirects to this blogpost.

The source for the project itself can be found here on GitHub.

DLAPS stack

2025-06-27T00:00:00

Most of my "online" projects (including this blog) use what I call the DLAPS stack which can be summarised as the following:

Docker
Linux
Apache
Program
SQLite

Alternatively if deploying straight onto bare metal the SLAPS stack also works and rolls off the tongue much better. However, because most deployments don't use that it'll have to be DLAPS. Essentially it is a variation of the classic LAMP stack but using SQLite instead of MySQL and Docker as a virtualisation layer.

Why Docker? Because it runs anywhere and you can run anything in it. Not to sound too much like their marketing copy but it's true. Also great for avoiding testing in prod.

Linux technically occurs in two places in this stack. That is, whatever bare metal machine Docker and Apache are running on and in whatever Docker container(s) are running. The most honest reason for using it is quite frankly why would you ever use anything else on a server (at least definitely not Windows), also the widest adoption, most support etc.

Any reverse proxy could fulfill the role of Apache (nginx, Caddy, etc.), however I have the most experience with Apache so it'll have to stay that way. Like I said earlier this was based on the LAMP stack so I have to be somewhat faithful to the original when coming up with a formal de jure version of it.

As for the program part this could really be anything. It doesn't have to communicate with the internet as such, thus the Apache part would then be redundant. The reason for choosing program over Python or Perl etc. is because I also write Go - goputer and Chime. While only the latter is internet based I do want to write some more Go based web stuff in the future (probably using templ), or even any other language.

Finally, SQLite. Probably the best database out there. If anything I was doing needed maybe a slightly larger feature-set I would probably switch to PostgreSQL, seeing as that's what I've been taught to use at university. SQLite is lightweight, easy to deploy, and could probably run on a little Arduino (a library exists for running it on an ESP32, which is a slightly beefier micro-controller) with some minor changes. TLDR; it runs literally everywhere and is good for the vast majority of use cases.

To conclude, in a world of "serverless" this and that it's nice to have something that's deployable on a real machine. Let's face it, that Cloudflare free tier (which admittedly I use myself for my EarthMC dashboard) isn't going to last once your app starts getting traction and you need actual performance. Not that that'll ever happen anyway, so you might as well stick to a tiny VPS and call it a day.
