Boosting Bytecode Efficiency: The Power of GCC’s Labels as Values

GEMscript and Virtual Machines 

If you’ve been using GEMstudio, you’re probably familiar with our programming language, GEMscript. We designed GEMscript to be a user-friendly, C-like language with the intention of enabling a “write once, run anywhere” approach. This means it can be used seamlessly across all our platforms, including GEMplayer on PC and various hardware devices. 

GEMscript is a virtual machine (VM) based language: your code is compiled into “bytecode” that runs in a VM interpreter rather than being compiled down to native machine code. This lets us run the same compiled code across multiple platforms. Additionally, by sandboxing GEMscript from our OS in a VM, we avoid some pitfalls of writing in native C, such as unsafe memory access and arbitrary code execution. However, there’s a big trade-off when using VMs: speed. 

It’s no secret that VMs are generally slower than native machine code. Optimizing the VM is therefore crucial, particularly on limited hardware where every bit of speed counts. So, I rolled up my sleeves and started exploring ways to optimize our code. And guess what? I found a neat and easy trick for speeding up opcode dispatch. This article walks through how I optimized opcode dispatch in GEMscript using GCC’s “Labels as Values” feature, and the significant speed increase that resulted. 

Enhancing VM Performance: Speeding Up Opcode Dispatch 

When I started looking at our VM, I realized that focusing on opcode dispatch could yield significant performance gains. Efficient opcode dispatch is key to faster execution because it reduces the overhead of interpreting each instruction in the VM. Let me put this in simpler terms: 

Imagine you’ve got a list of vocabulary words to study. One way to do this is by using the index at the back of a dictionary. For each word, you: 

  1. Look up the page number in the index. 
  2. Flip to the correct page to read about the word. 
  3. Return to the index for the next word. 

Doing this for each word is straightforward but slow and tedious. 

Now, imagine using index cards with all the vocabulary words printed in order. For each word, you: 

  1. Read the information on the top index card. 
  2. Move instantly to the next card. 

No more flipping back and forth! This method is much faster and more efficient, even though it takes a bit of effort upfront to set up the index cards. Similarly, by optimizing our opcode dispatch, we can make our VM run instructions more quickly and efficiently. 

The Basics of VM Interpreting 

Let’s start with a basic interpreter loop for a VM, which is similar to the dictionary index method. Instead of a list of vocabulary words, we have bytecode: a list of opcodes in memory. Instead of searching an index for the correct page, we use a switch statement, with each opcode case representing a different operation. The switch statement sits inside a loop, so we repeatedly fetch the current opcode and execute it until we run out of opcodes. Here’s a simple example in C: 

void runVM(){
    uint32_t pc = 0; // program counter
    while (1) {
        uint32_t opcode = memory[pc++];

        switch (opcode) {
            case OP_ADD:
                // do addition operations here
                break;
            case OP_SUB:
                // do subtraction operations here
                break;
            ...
        }
    }
}

Threaded Code Execution: A More Efficient Approach 

We can do better! Remember the index card approach? The programming concept that mirrors this is called threaded code. It lets us execute an opcode and then move directly (or indirectly) to the next opcode, creating one continuous thread of execution. The previous example isn’t a continuous thread because it must jump back to the start of the loop and search again for every opcode. 

Using labels in C, we can replicate this approach by using `goto` to jump between instructions instead of searching for them in a switch statement. However, there’s a challenge: how can we know which label to go to from a list of opcode values? We need to map the opcodes to the corresponding labels in our function so we can easily jump to the right place. 

void runVM(){
    uint32_t pc = 0; // program counter
    uint32_t opcode = memory[pc++];

    // FIX: find first label from opcode?
    goto FIRST_LABEL;

OP_ADD:
    // do addition operations here

    // FIX: find next label from opcode?
    goto NEXT_LABEL;
OP_SUB:
    // do subtraction operations here

    // FIX: find next label from opcode?
    goto NEXT_LABEL;
    ...
}

Introducing GCC Label Values: A Game Changer 

While digging around for ways to speed up our VM, I discovered that a very common way to implement threaded code in C relies on a GCC feature called “Labels as Values”. Even after eight years of working with C and C++, I didn’t know about this! The official docs say that by using the `&&` operator, you can take the address of a label within a function. The GCC docs even describe the exact scenario we’re in: 

“Another use of label values is in an interpreter for threaded code. The labels within the interpreter function can be stored in the threaded code for super-fast dispatching.” 
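
To see the feature in isolation before applying it to the VM, here’s a minimal standalone sketch (not GEMscript code, just plain C compiled with GCC): `&&label` yields the label’s address as a `void *`, and `goto *ptr` jumps to it:

#include <stdio.h>

int main(void){
    // &&done yields the address of the label below (GCC extension)
    void *target = &&done;

    printf("jumping...\n");
    goto *target; // computed goto: jump to the stored address

    printf("this line is skipped\n");
done:
    printf("arrived at the label\n");
    return 0;
}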

Optimizing Our VM with GCC Label Values 

Armed with this new knowledge, I set out to optimize our VM by storing label addresses and mapping opcodes to these labels. Here’s a revised example where our bytecode is stored in `memory[]`, and we have a macro called `NEXT()` to jump to the next label based on the current opcode. 

#define NEXT(pc) // explained later...

void runVM(){
    // pointer to first opcode loaded into memory
    uint32_t* opcode = &memory[0];

    // array of opcode labels
    // (GCC Label as Value)
    static const void * const opcodelist[] = {
        &&OP_ADD,
        &&OP_SUB,
        ...
    };

    // find first label and jump
    NEXT(opcode);

OP_ADD:
    // do addition operations here
    NEXT(opcode);
OP_SUB:
    // do subtraction operations here
    NEXT(opcode);
    ...
}

Now, we have labels for each operation and a table of addresses to these labels. The `NEXT()` macro lets us jump to the next opcode efficiently. Depending on our choice of indirect or direct threading, this macro can be implemented in different ways.  

Indirect Threading: Taking an Extra Step 

First, I tried using indirect threading. This approach sticks to the logic we talked about earlier: when the interpreter jumps to a new opcode, it first looks up the mapped opcode-to-label address indirectly. We store our label addresses in `opcodelist`, so by using the opcode as an index into `opcodelist`, then dereferencing and jumping to the address, we’re using indirect threading. 

Here’s how the `NEXT()` macro looks with indirect threading:  

#define NEXT(pc) goto *opcodelist[*pc++]

// Example: each value in memory[] is an index into opcodelist
uint32_t* opcode = &memory[0];
NEXT(opcode); // expands to `goto *opcodelist[*opcode++]`

Direct Threading: Streamlining Execution 

While indirect threading is simple and fast, it’s not quite like our index card analogy. The index cards were pre-made and set up in order; you just had to move to the next card, which already had the vocab word on it. We can achieve something similar with direct threading. If we fix up our opcodes in memory before executing them, then we don’t need to use the opcodes as a map to label addresses; the opcodes themselves can be the addresses.  

Here’s how the `NEXT()` macro looks with direct threading:  

#define NEXT(pc) goto **pc++

// Example: after fixup, memory[] holds the label addresses themselves
void** opcode = (void**)&memory[0];
NEXT(opcode); // expands to `goto **opcode++`

Of course, this requires the opcode size to match the address size of your platform (32-bit, 64-bit, etc.), but for us it works perfectly. It also requires the VM to fix up the bytecode when it’s first loaded, but that’s a one-time cost that barely affects overall execution speed. Let’s assume in our examples that this fixup has already been applied when using direct threading. 
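
As a rough illustration, that load-time fixup might look something like the sketch below. The helper name and signature are hypothetical (GEMscript’s actual loader isn’t shown here), and it assumes the label table is made available to the loader, that label addresses fit in a 32-bit slot, and that every slot holds an opcode; real bytecode with operands would need to skip over them.

#include <stdint.h>
#include <stddef.h>

// Hypothetical one-time fixup for direct threading: overwrite each
// opcode number in the bytecode with the address of its handler label.
void fixupBytecode(uint32_t* memory, size_t count,
                   const void* const* opcodelist){
    for (size_t i = 0; i < count; i++) {
        uint32_t op = memory[i];                         // opcode number
        memory[i] = (uint32_t)(uintptr_t)opcodelist[op]; // label address
    }
}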

Measuring the Speed Boost 

To test these optimizations, I wrote a new, simple VM specifically for benchmarking purposes, separate from GEMscript. This test VM can handle basic arithmetic and conditional jumps. To evaluate its performance, I used a prime number algorithm to find the 65,535th prime number and tested four scenarios: 

  • Native C 
  • VM with a Switch Statement 
  • VM with Indirect Threading (GCC Labels as Values) 
  • VM with Direct Threading (GCC Labels as Values) 
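
For context, here’s a minimal sketch of the kind of trial-division prime search described above (the exact benchmark code is an assumption and isn’t shown here); it needs only arithmetic and conditional branches, which is all the test VM supports:

#include <stdint.h>

// Find the nth prime by trial division (illustrative workload only).
uint32_t nthPrime(uint32_t n){
    uint32_t count = 0, candidate = 1;
    while (count < n) {
        candidate++;
        uint32_t isPrime = 1;
        for (uint32_t d = 2; d * d <= candidate; d++) {
            if (candidate % d == 0) { isPrime = 0; break; }
        }
        if (isPrime) count++;
    }
    return candidate; // e.g. nthPrime(65535)
}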

All options were compiled with GCC and -O3 optimizations. Here are the results: 

  • Native: 70ms  
  • Switch Statement: 1140ms  
  • Indirect Threading: 400ms  
  • Direct Threading: 380ms  

Normalized to the switch statement speed: 

  • Native: 16.29x faster  
  • Switch Statement: 1x (reference speed) 
  • Indirect Threading: 2.85x faster  
  • Direct Threading: 3x faster 

Wrapping Up 

These optimizations didn’t match the speed of native C execution, but that was expected. Still, we achieved a 2.85x to 3x performance boost just by converting our interpreter loop to threaded code with GCC’s Labels as Values! For such a simple change, that’s a very welcome speed increase. 

We’re excited to roll out these optimizations in an upcoming release of GEMstudio. These changes will make a real difference for any code-heavy applications that may have slowed down in the past. To stay updated on the latest features and enhancements to GEMstudio (like this one), sign up for our newsletter: [link for newsletter sign-up goes here] 

Author Bio: 

Ian Klask is a firmware engineer passionate about C and C++ programming. He began his journey through video game development and DIY electronics projects. Ian now channels his creativity and technical skills into optimizing embedded systems. Follow his posts for insights and projects in software and embedded firmware engineering.