Compiling a Go program into a native binary for Nintendo Switch™

Hajime Hoshi
2022-01-03

This is an English translation of my article in Japanese.

tl;dr

Previously, we compiled a Go program into WebAssembly and then converted it into C++ files to make it run on Nintendo Switch. Now, I have succeeded in compiling a Go program into a native binary for Nintendo Switch, and in running a game on it. I replaced system calls with C function calls using the -overlay option. I have also developed a new package, Hitsumabushi, to generate the JSON content that -overlay needs.

Caution

This article and the open-source projects in this article are based only on publicly available information. Hajime is responsible for this article's content. Please do not ask Nintendo about this article.

Background

I have been developing a 2D game engine called Ebiten in my spare time. I have succeeded in porting this to Nintendo Switch and the Nintendo Switch version of "Bear's Restaurant" was released in 2021.

Bear's Restaurant

Copyright 2021 Odencat Inc.

The method was to compile a Go program into a WebAssembly (Wasm) binary and then convert it to C++ files. See the presentation slides from Go Conference 2021 Autumn for more details. The advantages were low uncertainty, low maintenance cost, and high portability. Once I had developed the tool, its maintenance cost was pretty small, as Wasm's specification is stable. On the other hand, the disadvantages were poor performance and long compile times. Not only was performance worse than native, but GC also suspended the game because everything ran on a single thread.

Compiling a Go program into a native binary for Nintendo Switch without using Wasm was quite uncertain and a rocky road. Of course, Go doesn't support Nintendo Switch officially. And naturally, Nintendo Switch's source code and binary formats are not open. If I hit an issue, there might not be any clues to help me solve it. However, if I succeeded, performance would be better than ever, and compiling would be as fast as a regular Go build. So I thought it was worth a shot and have been doing experiments intermittently for one year.

Strategy

The strategy is basically to replace system calls with C function calls in the runtime and the standard library. The system calls part is OS-dependent, and if I replace it with something portable, Go should work everywhere in theory. It seems pretty easy, doesn't it? Well, it was a lot more challenging than I expected...

The graphic below describes what I had to do. The left side is a structural overview of standard Go compiling. System calls work only on specific systems, and of course they don't work on Nintendo Switch. So I had to replace them with standard C function calls, as shown on the right side.

Replacing system calls with C function calls

There was one more action item: adjusting the binary format that the Go compiler generates to fit Nintendo Switch. In summary, the action items were as follows:

  1. Replace system calls with standard C function calls and/or pthread function calls
  2. Adjust the ELF format that the Go compiler generates

As for item 1, system calls of course do not correspond one-to-one with C functions, and there are too many system calls to implement them all. So I replaced them one by one, finding out which ones failed on an actual Nintendo Switch device.

The Go compiler can generate only the formats that it officially supports. For example, when the target is Linux, the format is ELF. Can Nintendo Switch support ELF? To make a long story short, yes, I managed it. I won't describe the details of item 2 here*1.

What I have to do is create a .a file with the Go compiler using GOOS=linux GOARCH=arm64 and -buildmode=c-archive, and then link it with other object files and libraries using the Nintendo Switch compiler. The reason why I don't use -buildmode=default is that there is some work I have to do around the entry point. IMO, it is generally more portable to let the platform provide the entry point.

System calls are defined basically in the standard library, especially runtime and syscall packages. So, how did I rewrite them? In this project, I adopted the -overlay option.

Hitsumabushi - rewriting the runtime with the -overlay option

go build's -overlay is an option that overwrites Go files to be compiled. I overwrote Go files in the runtime with this option. This is the official document's explanation:

-overlay file
    read a JSON config file that provides an overlay for build operations.
    The file is a JSON struct with a single field, named 'Replace', that
    maps each disk file path (a string) to its backing file path, so that
    a build will run as if the disk file path exists with the contents
    given by the backing file paths, or as if the disk file path does not
    exist if its backing file path is empty. Support for the -overlay flag
    has some limitations: importantly, cgo files included from outside the
    include path must be in the same directory as the Go package they are
    included from, and overlays will not appear when binaries and tests are
    run through go run and go test respectively.

This is the format to give -overlay:

{
  "Replace": {
    "/usr/local/go/src/runtime/os_linux.go": "/home/hajimehoshi/my_os_linux.go"
  }
}

If you build a Go program with this, os_linux.go's content in runtime is replaced with my_os_linux.go's. Pretty handy, isn’t it?
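As a minimal sketch, such a file can also be produced from Go with encoding/json instead of being written by hand. The helper name genOverlay and its parameters are made up for illustration; only the "Replace" field comes from the documentation quoted above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
)

// overlay mirrors the JSON struct that -overlay expects:
// a single field named "Replace".
type overlay struct {
	Replace map[string]string
}

// genOverlay (a hypothetical helper) builds the overlay JSON for one
// replaced runtime file. goroot is where Go is installed and backing is
// the path of the replacement file.
func genOverlay(goroot, backing string) ([]byte, error) {
	target := filepath.Join(goroot, "src", "runtime", "os_linux.go")
	return json.MarshalIndent(overlay{
		Replace: map[string]string{target: backing},
	}, "", "  ")
}

func main() {
	out, err := genOverlay("/usr/local/go", "/home/hajimehoshi/my_os_linux.go")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	// The output can be saved to a file, e.g. overlay.json, and passed as:
	//   go build -overlay=overlay.json
}
```

Taking the Go installation directory as a parameter keeps the environment-dependent part of the path in one place.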

Managing this JSON file as is would not be portable. The location where Go is installed depends on the environment, so the target files' locations vary. Plus, you very rarely have to replace the full contents of a file; in most cases, it is enough to replace some functions. If you keep full copies of the files, updating them to match each Go version update is troublesome.

So, I developed a new package to generate a JSON for this project. This is Hitsumabushi (ひつまぶし)*2. I adopted this name because I wanted a name ending with 'bushi' as a play on libc (ree-boo-shee (りぶしー) in Japanese pronunciation), because this is one of the primary things that Hitsumabushi deals with. There was another candidate I was considering, Katsuobushi (かつおぶし)*3, but I won’t get into that...

Hitsumabushi is a very simple package defining an API like this:

// GenOverlayJSON generates JSON content that can be passed
// to -overlay based on the given options, or returns an error
// when an error occurs.
//
// There are some options like specifying command arguments
// and specifying the number of CPU.
func GenOverlayJSON(options ...Option) ([]byte, error)
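The following is a rough sketch of how the returned bytes might be consumed: write them to a temporary file and construct a go build command that points -overlay at it. genOverlayJSON here is a hypothetical stand-in so that the example runs on its own, and the output name game.a and the CGO_ENABLED setting are assumptions, not something Hitsumabushi prescribes.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// genOverlayJSON is a hypothetical stand-in for a generator such as
// GenOverlayJSON. It returns an empty overlay so the sketch runs alone.
func genOverlayJSON() ([]byte, error) {
	return []byte(`{"Replace": {}}`), nil
}

// buildCommand writes the overlay JSON into dir and returns the go build
// command that would consume it. The command is only constructed, not run.
func buildCommand(dir string, overlayJSON []byte) (*exec.Cmd, error) {
	p := filepath.Join(dir, "overlay.json")
	if err := os.WriteFile(p, overlayJSON, 0o644); err != nil {
		return nil, err
	}
	cmd := exec.Command("go", "build", "-buildmode=c-archive", "-overlay="+p, "-o", "game.a", ".")
	// c-archive needs cgo, which is disabled by default when cross-compiling.
	// A suitable C cross compiler (CC) is also assumed to be configured.
	cmd.Env = append(os.Environ(), "GOOS=linux", "GOARCH=arm64", "CGO_ENABLED=1")
	return cmd, nil
}

// prepare generates the overlay in a temporary directory and returns the
// arguments of the resulting command.
func prepare() ([]string, error) {
	dir, err := os.MkdirTemp("", "overlay")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	overlayJSON, err := genOverlayJSON()
	if err != nil {
		return nil, err
	}
	cmd, err := buildCommand(dir, overlayJSON)
	if err != nil {
		return nil, err
	}
	return cmd.Args, nil
}

func main() {
	args, err := prepare()
	if err != nil {
		panic(err)
	}
	fmt.Println(args)
}
```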

Implementation of Hitsumabushi

I have created an original patch format for Hitsumabushi that looks like this:

//--from
func getRandomData(r []byte) {
    if startupRandomData != nil {
        n := copy(r, startupRandomData)
        extendRandom(r, n)
        return
    }
    fd := open(&urandom_dev[0], 0 /* O_RDONLY */, 0)
    n := read(fd, unsafe.Pointer(&r[0]), int32(len(r)))
    closefd(fd)
    extendRandom(r, int(n))
}
//--to
// Use getRandomData in os_plan9.go.

//go:nosplit
func getRandomData(r []byte) {
    // inspired by wyrand see hash32.go for detail
    t := nanotime()
    v := getg().m.procid ^ uint64(t)

    for len(r) > 0 {
        v ^= 0xa0761d6478bd642f
        v *= 0xe7037ed1a0b428db
        size := 8
        if len(r) < 8 {
            size = len(r)
        }
        for i := 0; i < size; i++ {
            r[i] = byte(v >> (8 * i))
        }
        r = r[size:]
        v = v>>32 | v<<32
    }
}

The part after //--from is the original code to be replaced, and the part after //--to is its replacement. The reason why I invented my own simple format is that the existing patch formats are not designed to be edited by hand. In the above example, Linux's getRandomData implementation is replaced with Plan 9's. Linux's getRandomData uses /dev/urandom, which is not portable*4. This patch format reduces the work of managing the differences I want to apply. Of course, the cost of keeping up with Go version updates doesn't become zero even with this, but it should help a lot.

Hitsumabushi applies patches in this format, writes the modified files to a temporary directory, and uses their paths as the backing file paths in the JSON.
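To make the idea concrete, here is a simplified sketch of applying one patch in this format to a source text. It is not Hitsumabushi's actual implementation: the function name applyPatch is hypothetical, it handles exactly one from/to pair, and it assumes the markers sit on their own lines.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// applyPatch (a hypothetical helper) replaces the text found between
// "//--from" and "//--to" in patch with the text after "//--to".
func applyPatch(src, patch string) (string, error) {
	const fromMark, toMark = "//--from\n", "//--to\n"
	if !strings.HasPrefix(patch, fromMark) {
		return "", errors.New("patch must start with //--from")
	}
	from, to, ok := strings.Cut(strings.TrimPrefix(patch, fromMark), toMark)
	if !ok {
		return "", errors.New("patch has no //--to")
	}
	// Fail loudly if the original code is missing or ambiguous, for
	// example after a Go version update changed the file.
	if strings.Count(src, from) != 1 {
		return "", errors.New("the from part must match exactly once")
	}
	return strings.Replace(src, from, to, 1), nil
}

func main() {
	// A toy source and patch, made up for this example.
	src := "package toy\n\nfunc name() string {\n\treturn \"linux\"\n}\n"
	patch := "//--from\nfunc name() string {\n\treturn \"linux\"\n}\n" +
		"//--to\nfunc name() string {\n\treturn \"portable\"\n}\n"
	out, err := applyPatch(src, patch)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Requiring exactly one match is a design choice of this sketch: when a new Go release changes the original function, the patch stops applying instead of being silently skipped.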

Note that Hitsumabushi rewrites only the standard library and the runtime; it does not rewrite the Go compiler. In other words, the regular Go compiler is used as is.

Hitsumabushi's replacements use only standard C function calls and pthread function calls. It never deals with platform-specific APIs*5. So, ideally, Hitsumabushi should enable a Go program to run on any platform, regardless of whether or not the Go compiler originally supports it.

Replacements

Calling C functions from runtime

It is not an easy task to call a C function from runtime. In a usual Go program, you can call a C function easily with Cgo. However, runtime cannot use Cgo. Using Cgo means to depend on runtime/cgo, and runtime/cgo depends on runtime, so this would be a circular dependency.

To get straight to the point, libcCall makes it possible to call a C function from runtime. Some environments like GOOS=darwin already do this.

In addition, various compiler directives are required.

  • //go:nosplit: Skips the stack overflow check.
  • //go:cgo_unsafe_args: Treats Go arguments as C arguments.
  • //go:linkname: Treats something defined in another package as if it was defined in this package. Or, it treats something defined in this package as if it was defined in another package. It ignores whether the symbol is exported or not. Very useful!
  • //go:cgo_import_static: Statically links a C function and makes its symbol available in Go.

Let's see an actual example. To call the write system call from runtime, a function called write1 is defined on the Go side.

// An excerpt from runtime/stubs2.go in Go 1.17.5

//go:noescape
func write1(fd uintptr, p unsafe.Pointer, n int32) int32

// An excerpt from runtime/sys_linux_arm64.s in Go 1.17.5

TEXT runtime·write1(SB),NOSPLIT|NOFRAME,$0-28
    MOVD    fd+0(FP), R0
    MOVD    p+8(FP), R1
    MOVW    n+16(FP), R2
    MOVD    $SYS_write, R8
    SVC
    MOVW    R0, ret+24(FP)
    RET

In the case of 64bit ARM, SVC is used to invoke a system call.

Let's replace this with a C function call by libcCall and compiler directives.

// An excerpt from runtime/stubs2.go after Hitsumabushi's replacement

//go:nosplit
//go:cgo_unsafe_args
func write1(fd uintptr, p unsafe.Pointer, n int32) int32 {
    return libcCall(unsafe.Pointer(abi.FuncPCABI0(write1_trampoline)), unsafe.Pointer(&fd))
}
func write1_trampoline(fd uintptr, p unsafe.Pointer, n int32) int32

// An excerpt from runtime/os_linux.go after Hitsumabushi's replacement

//go:linkname c_write1 c_write1
//go:cgo_import_static c_write1
var c_write1 byte

// An excerpt from runtime/sys_linux_arm64.s after Hitsumabushi's replacement

TEXT runtime·write1_trampoline(SB),NOSPLIT,$0-28
    MOVD    8(R0), R1   // p
    MOVW    16(R0), R2  // n
    MOVD    0(R0), R0   // fd
    BL  c_write1(SB)
    RET

// An excerpt from runtime/cgo/gcc_linux_arm64.c after Hitsumabushi's replacement

int32_t c_write1(uintptr_t fd, void *p, int32_t n) {
  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  int32_t ret = 0;
  pthread_mutex_lock(&m);
  switch (fd) {
  case 1:
    ret = fwrite(p, 1, n, stdout);
    fflush(stdout);
    break;
  case 2:
    ret = fwrite(p, 1, n, stderr);
    fflush(stderr);
    break;
  default:
    fprintf(stderr, "syscall write(%lu, %p, %d) is not implemented\n", fd, p, n);
    break;
  }
  pthread_mutex_unlock(&m);
  return ret;
}

By the way, libcCall is not defined on GOOS=linux, so I had to adjust the //go:build constraint in runtime/sys_libc.go.

If you forcibly call a C function from assembly without libcCall, the C code will run on the current goroutine's stack, whereas libcCall switches to the system stack first. Then you might run into very mysterious errors. I don't recommend invoking a C function without libcCall.
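The trampoline in the excerpt above reads its arguments at offsets 0, 8, and 16 from the pointer it receives in R0, which is the address of fd. The small sketch below shows where those numbers come from, assuming the arguments sit contiguously in memory like the fields of a struct. The struct write1Args is made up for illustration and is not part of the runtime.

```go
package main

import (
	"fmt"
	"unsafe"
)

// write1Args (hypothetical) mirrors the parameter list of write1. If the
// parameters are laid out contiguously, a pointer to the first one can be
// read like a pointer to this struct.
type write1Args struct {
	fd uintptr
	p  unsafe.Pointer
	n  int32
}

// offsets returns the byte offset of each field from the start of the struct.
func offsets() (fd, p, n uintptr) {
	var a write1Args
	return unsafe.Offsetof(a.fd), unsafe.Offsetof(a.p), unsafe.Offsetof(a.n)
}

func main() {
	fd, p, n := offsets()
	// On a 64-bit target this prints: 0 8 16
	fmt.Println(fd, p, n)
}
```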

Ignoring signals

Hitsumabushi ignores all signals. For example, sigaltstack and sigprocmask in runtime are empty. There are standard C functions that deal with signals, but they are not implemented in some environments.

As a side effect, accessing a nil pointer causes SEGV, and recovering from it is impossible. The program even dies without a panic message. This is inconvenient to some extent, so we have to make an effort to avoid such bugs in production environments.
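One way to make that effort, shown here only as a sketch of a defensive style, is to check for nil explicitly and return an error rather than relying on recover. The player type and the playerName function are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// player is a toy type for this example.
type player struct{ name string }

var errNoPlayer = errors.New("player is nil")

// playerName checks for nil explicitly instead of relying on recover,
// which cannot catch the fault when signals are ignored.
func playerName(p *player) (string, error) {
	if p == nil {
		return "", errNoPlayer
	}
	return p.name, nil
}

func main() {
	if _, err := playerName(nil); err != nil {
		fmt.Println("error:", err)
	}
	name, _ := playerName(&player{name: "bear"})
	fmt.Println(name)
}
```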

Implementing a pseudo file system

Even when a Go program does nothing, the runtime might access the file system. On Linux, the runtime apparently reads these files:

  • /proc/self/auxv (information such as the page size)
  • /sys/kernel/mm/transparent_hugepage/hpage_pmd_size (the huge page size)

I hand-crafted some content for both. For example, I used 0 for Huge Page Size since it worked. For the implementation, see Hitsumabushi's c_open.

For writing files, I implemented only the standard output and the standard error. Both just use fwrite. Without them, even println doesn't work. I decided not to implement reading and writing other files for now. For the implementation, see Hitsumabushi's c_write1.
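The idea of a pseudo file system can be sketched in a few lines. The real implementation is C code called from the runtime, so the Go version below only illustrates the concept. The type pseudoFS, its methods, and the exact file content "0\n" are assumptions made for this example.

```go
package main

import "fmt"

// pseudoFS (hypothetical) serves a fixed set of in-memory files.
type pseudoFS struct {
	files map[string][]byte
	open  map[int32]*openFile
	next  int32
}

type openFile struct {
	data []byte
	pos  int
}

func newPseudoFS() *pseudoFS {
	return &pseudoFS{
		files: map[string][]byte{
			// Placeholder content; the byte format is an assumption.
			"/sys/kernel/mm/transparent_hugepage/hpage_pmd_size": []byte("0\n"),
		},
		open: map[int32]*openFile{},
		next: 3, // 0, 1 and 2 are stdin, stdout and stderr
	}
}

// Open returns a file descriptor, or -1 if the path is unknown.
func (fs *pseudoFS) Open(path string) int32 {
	data, ok := fs.files[path]
	if !ok {
		return -1
	}
	fd := fs.next
	fs.next++
	fs.open[fd] = &openFile{data: data}
	return fd
}

// Read copies the next bytes into buf and returns the count,
// or -1 if fd is not open.
func (fs *pseudoFS) Read(fd int32, buf []byte) int32 {
	f, ok := fs.open[fd]
	if !ok {
		return -1
	}
	n := copy(buf, f.data[f.pos:])
	f.pos += n
	return int32(n)
}

// Close forgets the descriptor.
func (fs *pseudoFS) Close(fd int32) { delete(fs.open, fd) }

func main() {
	fs := newPseudoFS()
	fd := fs.Open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size")
	buf := make([]byte, 16)
	n := fs.Read(fd, buf)
	fmt.Printf("fd=%d content=%q\n", fd, buf[:n])
	fs.Close(fd)
}
```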

Implementing a pseudo memory system

In Go's heap memory management, the mmap system call is the bottom layer on Linux. Go manages the virtual memory allocated there and calls munmap for unused regions.

There are 4 states of a heap memory region and these states transition as in the diagram below. When the state is 'Ready', the region is available.

The state transition diagram of Go's memory

Go specifies an address in virtual memory and uses the memory region allocated at that address. However, there is no standard C function to allocate memory at a specific address. That's unfortunate.

There are some platforms where it is impossible to allocate memory at a specific address: Plan 9 and Wasm. Hitsumabushi follows their example and implements a 'corner-cutting' memory system, modeled in particular on the Wasm version, which is the simplest implementation. I won't describe the details here, but basically, the implementation is as shown in the following list. For the actual source, see Hitsumabushi's mem_linux.go.

  • sysAlloc: Calls sysReserve and sysMap.
  • sysMap: Increments the total size record of heap memory.
  • sysFree: Decrements the total size record of heap memory.
  • sysReserve: Calls calloc.
  • The other functions do nothing.

As you can see, there is a call to calloc but no call to free, because it is impossible to free only part of a region allocated by calloc. This means that memory usage increases monotonically. The original method of making an Ebiten application work on Nintendo Switch, converting Go to C++ via Wasm, also had monotonically increasing memory usage*6. At the very least, this didn’t make things worse, so I’ve compromised with this solution so far, but I would like to fix it in the future...
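The bookkeeping in the list above can be modeled with a small sketch. This is not the runtime code: the type pseudoMem is hypothetical, make stands in for calloc, and the real functions work on raw addresses rather than slices.

```go
package main

import "fmt"

// pseudoMem (hypothetical) models the corner-cutting memory system:
// regions are only ever handed out, never returned to the system.
type pseudoMem struct {
	regions  [][]byte // stands in for calloc'ed blocks, kept alive forever
	reserved uintptr  // bytes obtained from the system; never shrinks
	mapped   uintptr  // the total size record of heap memory
}

// sysReserve obtains a zeroed region, like calloc.
func (m *pseudoMem) sysReserve(n uintptr) []byte {
	r := make([]byte, n)
	m.regions = append(m.regions, r)
	m.reserved += n
	return r
}

// sysMap only increments the total size record.
func (m *pseudoMem) sysMap(n uintptr) { m.mapped += n }

// sysFree only decrements the total size record; nothing is released.
func (m *pseudoMem) sysFree(n uintptr) { m.mapped -= n }

// sysAlloc calls sysReserve and sysMap.
func (m *pseudoMem) sysAlloc(n uintptr) []byte {
	r := m.sysReserve(n)
	m.sysMap(n)
	return r
}

func main() {
	var m pseudoMem
	m.sysAlloc(4096)
	m.sysAlloc(8192)
	m.sysFree(4096)
	// The record shrinks, but the memory taken from the system does not.
	fmt.Println(m.mapped, m.reserved) // prints: 8192 12288
}
```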

Implementing pseudo futex

futex is the bottom layer of the part that handles sleeping and waking up threads. Of course, the standard C functions and pthread functions cannot invoke futex directly. So, I had to mimic the behavior of futex with pthread. On Linux, pthread itself is typically implemented with futex, so I had to do the opposite.

There are two ways to use futex via Go.

  • futexsleep(uint32 *addr, uint32 val): Makes the thread sleep if *addr is val.
  • futexwake(uint32 *addr): Wakes up a thread sleeping on addr.

In Hitsumabushi, I added a simple implementation like this. For an actual source, see Hitsumabushi's pseudo_futex.

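// A simplified sketch, not the actual source. The mode parameter and its
// enum are written out here only for illustration.
#include <pthread.h>
#include <stdint.h>

enum mode { MODE_SLEEP, MODE_WAKE };

void pseudo_futex(uint32_t *uaddr, enum mode mode, uint32_t val) {
  static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; // A condition variable

  pthread_mutex_lock(&mutex);
  switch (mode) {
  case MODE_SLEEP:
    if (*uaddr == val) {
      pthread_cond_wait(&cond, &mutex); // Sleep
    }
    break;
  case MODE_WAKE:
    pthread_cond_broadcast(&cond); // Wake up all the threads sleeping with cond.
    break;
  }
  pthread_mutex_unlock(&mutex);
}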

When wake is called, it wakes up not only the necessary threads but all the sleeping threads. To wake up only the necessary threads, you would need to manage a separate condition variable for each uaddr, which would be cumbersome. Such an unnecessary wake-up is called a spurious wakeup. Go's source code explicitly expects this, so it is not a correctness problem. However, performance might be degraded.
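The same broadcast-based idea can be tried out in plain Go with sync.Cond in place of pthread condition variables. This is only an illustration of the technique, not the code Hitsumabushi uses: the function names mirror the list above, and waitAndSignal is a made-up driver. Because every wake-up reaches every sleeper, the caller re-checks its condition in a loop.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var (
	mu   sync.Mutex
	cond = sync.NewCond(&mu) // one condition variable shared by all addresses
)

// futexsleep blocks if *addr still equals val. It may return without a
// matching wake-up for addr, so callers re-check their condition.
func futexsleep(addr *uint32, val uint32) {
	mu.Lock()
	if atomic.LoadUint32(addr) == val {
		cond.Wait() // releases mu while sleeping
	}
	mu.Unlock()
}

// futexwake wakes every sleeper, whichever address it was waiting on.
func futexwake(addr *uint32) {
	mu.Lock()
	cond.Broadcast()
	mu.Unlock()
}

// waitAndSignal runs one sleeper and one waker and returns the final state.
func waitAndSignal() uint32 {
	var state uint32
	done := make(chan struct{})
	go func() {
		// The loop absorbs spurious wake-ups.
		for atomic.LoadUint32(&state) == 0 {
			futexsleep(&state, 0)
		}
		close(done)
	}()
	atomic.StoreUint32(&state, 1)
	futexwake(&state)
	<-done
	return atomic.LoadUint32(&state)
}

func main() {
	fmt.Println(waitAndSignal()) // prints: 1
}
```

Checking the value while holding the mutex, and broadcasting under the same mutex, is what prevents a wake-up from being lost between the check and the wait.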

Adjusting the number of CPU cores

The number of CPU cores is determined by the result of the sched_getaffinity system call. There is no corresponding standard C function, so I gave Hitsumabushi an option to specify the number of cores to GenOverlayJSON. For the actual source, see Hitsumabushi's c_sched_getaffinity.

In some environments, an application froze when 2 or more CPU cores were specified. This was because a thread could use only one core by default, so I had to call pthread_setaffinity_np explicitly. In Hitsumabushi, I added a hack that calls pthread_setaffinity_np just after pthread_create. For the actual source, see Hitsumabushi's overlay.go. As an aside, it was quite hard to find this solution. I can’t tell you how happy I was to finally solve this conundrum.

Entry point

Hitsumabushi is assumed to be used with -buildmode=c-archive. The generated file is a C library, so Go's main is not called automatically. If you want to call main, you have to export a function that calls main explicitly and call it from C. Calling main explicitly does not usually make sense, but I think it is practical for c-archive.

package main

import "C"

//export GoMain
func GoMain() {
    main()
}

// Call the entry point in Go in the entry point in C.
int main() {
  GoMain();
  return 0;
}

Results

  • I managed to get a game called "Innovation 2007" working on an actual Nintendo Switch device. Controller support, touch input, and audio all work perfectly. Innovation 2007 uses most of Ebiten's features, so I'm sure other games would work as well.
  • Compiling speed became much faster. Before this solution, it took 5 to 10 minutes to full-build a C++ project, but now it takes less than 10 seconds. This is awesome!
  • Suspensions by GC seem to have disappeared.
  • I now have to update whenever a new version of Go is released. This is an acceptable compromise to me. From my past experiments, I don’t expect any major changes anyway.

Remarks

This is a side note, but the implementation of Go's runtime embodies a great deal of knowledge about modern OSes and is very insightful. I think it can teach you a significant amount about computer science. That said, it can be quite daunting to read it without a purpose, so I recommend doing so with some sort of modification project in mind.

Thanks to the near-success of this project, the method I presented in the Go Conference is now becoming outdated. This was inevitable, obviously, but it still makes me feel a little sad to see that hard work go obsolete.

Future work

I'll continue polishing this so that a proper game can be released for Nintendo Switch. As I described first, there is a high level of uncertainty in this project. Until a game is released, I cannot anticipate what kind of issues will occur, and I always have to be on high alert. Even in the worst case scenario, however, I know we can continue to release the game with the help of go2cpp, which is reassuring. Still, with all the hard work I’ve put into this already, I really want to release a game with Hitsumabushi and see it achieve some actual results.

Acknowledgments

Thanks to the kind folks over in the PySpa community for all their technical advice. I’d also like to express my gratitude to Daigo, President of Odencat Inc., who kindly uses Ebiten for Nintendo Switch. Thank you very much.

Happy new year!

  • *1 It's due to complicated business reasons.
  • *2 Hitsumabushi is Japanese food.
  • *3 Katsuobushi is yet another Japanese food.
  • *4 Another solution would be to provide a pseudo /dev/urandom file, but I didn't adopt it: there is no good way to implement it without a platform-specific API.
  • *5 The main reason is portability, but there is also another compelling reason: I wouldn't be able to make it open-source if it used a platform-specific API.
  • *6 To be exact, about 2G of memory was allocated first and was used without additional allocations.