IR types and transformation passes

This section explains the various IR types in Asterius and aims to give a clear picture of how information flows from Haskell to WebAssembly. (A similar section in jsffi.md explains the implementation details of JSFFI.)

Cmm IR

Everything starts from Cmm, or more specifically, "raw" Cmm which satisfies:

All calls are tail calls; parameters are passed in global registers like R1 or on the stack.
All info tables are converted to binary data segments.

See the Cmm module in the ghc package to get started with Cmm.

Asterius obtains in-memory raw Cmm via:

cmmToRawCmmHook in our custom GHC fork. This allows us to get our hands on the Cmm generated by compiling either Haskell modules or .cmm files (which are in rts).
There is some abstraction in ghc-toolkit: the compiler logic actually lives in the Compiler datatype as a set of callbacks, and ghc-toolkit converts them to hooks, frontend plugins and ghc executable wrappers.

There is one minor annoyance with the Cmm types in GHC (or any other GHC IR type): it's very hard to serialize/deserialize them without setting up complicated contexts related to package databases, etc. To experiment with new backends, it's reasonable to marshal to a custom serializable IR first.

Pre-linking expression IR

We then marshal raw Cmm to an expression IR defined in Asterius.Types. Each compilation unit (Haskell module or .cmm file) maps to one AsteriusModule, and each AsteriusModule is serialized to a .asterius_o object file which will be deserialized at link time. Since we serialize/deserialize a structured expression IR faithfully, it's possible to perform aggressive LTO by traversing/rewriting IR at link time, and that's what we're doing right now.

The expression IR is mostly a Haskell modeling of a subset of binaryen's expression IR, with some additions:

Unresolved-related variants, which allow us to use a symbol as an expression. At link time, the symbols are rewritten to absolute addresses.
Unresolved locals/globals. At link time, unresolved locals are laid out to Wasm locals, and unresolved globals (which are really just Cmm global regs) become fields in the global Capability's StgRegTable.
EmitErrorMessage, as a placeholder for emitting a string error message and then trapping. At link time, such error messages are collected into an "error message pool", and the Wasm code just calls an error message reporting function with an array index.
Null. We're civilized, educated functional programmers and should really be using Maybe Expression in some fields instead of adding a Null constructor, but this is just handy. Blame me.
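
To make the additions above concrete, here is a minimal sketch of what such an IR could look like, together with the link-time rewrite of symbols to absolute addresses. The constructor names and the resolveSymbols function are hypothetical and heavily simplified; the real definitions live in Asterius.Types and differ in detail.

```haskell
import Data.Int (Int64)
import qualified Data.Map.Strict as Map

-- Hypothetical, simplified stand-ins; not the real Asterius.Types definitions.
newtype Symbol = Symbol String
  deriving (Eq, Ord, Show)

data Expression
  = ConstI64 Int64
  | Add Expression Expression
  | Block [Expression]
  | UnresolvedSymbol Symbol  -- a symbol used as an expression
  | EmitErrorMessage String  -- placeholder: report a message, then trap
  | Null                     -- "no expression here"
  deriving (Eq, Show)

-- Rewrite every symbol to its absolute address. Returns the first symbol
-- that is missing from the address table, if there is one.
resolveSymbols :: Map.Map Symbol Int64 -> Expression -> Either Symbol Expression
resolveSymbols addrs = go
  where
    go (UnresolvedSymbol s) = maybe (Left s) (Right . ConstI64) (Map.lookup s addrs)
    go (Add a b) = Add <$> go a <*> go b
    go (Block es) = Block <$> traverse go es
    go e = Right e
```

Because the IR is a plain algebraic datatype, a link-time pass like this is an ordinary recursive traversal.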

It's possible to encounter things in Cmm that we can't handle (unsupported primops, etc). So AsteriusModule also contains compile-time error messages for anything that isn't supported. These errors are not reported at compile time; instead, they are deferred and become runtime error messages. (Ideally they would be link-time errors, but that turns out to be hard.)
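
The "error message pool" idea can be sketched as follows. This is only an illustration: the Expr type, the ReportError constructor and the poolErrors function are hypothetical names, and deduplicating identical messages is an assumption of this sketch rather than something the linker is known to do.

```haskell
import Data.List (elemIndex, mapAccumL)

-- Hypothetical mini IR, just enough to show the rewrite.
data Expr
  = EmitErrorMessage String  -- pre-linking placeholder
  | ReportError Int          -- post-linking: call the reporter with a pool index
  | Block [Expr]
  | Nop
  deriving (Eq, Show)

-- Return the index of a message in the pool, appending it if it is new.
intern :: [String] -> String -> ([String], Int)
intern pool msg = case elemIndex msg pool of
  Just i -> (pool, i)
  Nothing -> (pool ++ [msg], length pool)

-- Replace every EmitErrorMessage with an indexed ReportError, threading
-- the growing pool through the traversal.
poolErrors :: [String] -> Expr -> ([String], Expr)
poolErrors pool (EmitErrorMessage msg) =
  let (pool', i) = intern pool msg in (pool', ReportError i)
poolErrors pool (Block es) =
  let (pool', es') = mapAccumL poolErrors pool es in (pool', Block es')
poolErrors pool e = (pool, e)
```

The resulting pool is the kind of module-specific data that can be shipped outside the Wasm code, which then only needs to pass an integer index.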

The symbols are simply converted to Z-encoded strings that also contain module prefixes, and they are assumed to be unique across different compilation units.
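
As a rough illustration of the encoding, the sketch below implements a small subset of GHC's Z-encoding table and joins a module name, an entity name and a suffix with underscores. The mkSymbol helper and its three-part layout are assumptions made for this example, and characters outside the listed subset are passed through unchanged here, which the full encoding does not do.

```haskell
-- A subset of GHC's Z-encoding table. Characters not listed here are
-- passed through unchanged in this sketch; the full encoding has more cases.
zEncodeChar :: Char -> String
zEncodeChar c = case c of
  'z' -> "zz"
  'Z' -> "ZZ"
  '.' -> "zi"
  '_' -> "zu"
  '$' -> "zd"
  '\'' -> "zq"
  '#' -> "zh"
  _ -> [c]

zEncode :: String -> String
zEncode = concatMap zEncodeChar

-- Hypothetical helper: module prefix, entity name and suffix joined by '_'.
mkSymbol :: String -> String -> String -> String
mkSymbol modName name suffix =
  zEncode modName ++ "_" ++ zEncode name ++ "_" ++ suffix
```

With this sketch, mkSymbol "Main" "main" "closure" yields Main_main_closure. Since the underscore is itself escaped, the separators stay unambiguous.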

The store

There's an AsteriusStore type in Asterius.Types. It's an immutable data structure that maps symbols to underlying entities in the expression IR for every single module, and is a critical component of the linker.

Modeling the store as a self-contained data structure makes it pleasant to write linker logic, at the cost of exploding RAM usage. So we implemented a poor man's KV store in Asterius.Store which loads modules lazily: when initializing the store, we load only the symbols, not the actual modules. A module is deserialized only when it is "requested" for the first time.
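
One way to get this behaviour in Haskell is to defer each module's load action behind a lazy thunk. The sketch below does that with unsafeInterleaveIO; it illustrates the idea only and is not the actual Asterius.Store code. The Store record, mkStore and lookupEntity are hypothetical names, and a module is modeled as a plain map from symbols to string payloads.

```haskell
import qualified Data.Map.Lazy as Map
import System.IO.Unsafe (unsafeInterleaveIO)

type Symbol = String
type ModuleName = String

-- Hypothetical stand-in for a deserialized module: symbol -> entity payload.
type Module = Map.Map Symbol String

data Store = Store
  { symbolIndex :: Map.Map Symbol ModuleName   -- loaded eagerly
  , moduleThunks :: Map.Map ModuleName Module  -- values are unevaluated thunks
  }

-- Build a store from (module name, exported symbols, load action) triples.
-- Each load action runs only when its module is first forced.
mkStore :: [(ModuleName, [Symbol], IO Module)] -> IO Store
mkStore units = do
  mods <- mapM defer units
  pure Store
    { symbolIndex = Map.fromList [(sym, name) | (name, syms, _) <- units, sym <- syms]
    , moduleThunks = Map.fromList mods
    }
  where
    defer (name, _, load) = do
      m <- unsafeInterleaveIO load
      pure (name, m)

-- Looking up an entity forces (and thereby loads) only the owning module.
lookupEntity :: Symbol -> Store -> Maybe String
lookupEntity sym store = do
  name <- Map.lookup sym (symbolIndex store)
  m <- Map.lookup name (moduleThunks store)
  Map.lookup sym m
```

The store stays an immutable value from the linker's point of view, while the cost of deserialization is paid only for modules that are actually reached.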

AsteriusStore supports merging. It's a handy operation: we can first initialize a "global" store that represents the standard libraries, then make another store by compiling user input, merge the two, and start linking from the resulting store.
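
If the store is modeled as a pair of immutable maps, merging can be sketched as a map union wrapped in a Semigroup instance. The Store record and its fields below are hypothetical; the sketch also relies on the assumption stated earlier that symbols are unique across compilation units, so the left bias of Map.union never matters in practice.

```haskell
import qualified Data.Map.Lazy as Map

type Symbol = String
type ModuleName = String

-- Hypothetical store: a symbol index plus per-module contents
-- (modeled here as just the list of symbols a module defines).
data Store = Store
  { symbolIndex :: Map.Map Symbol ModuleName
  , moduleBodies :: Map.Map ModuleName [Symbol]
  } deriving (Eq, Show)

-- Merging is a union of both maps. Map.union is left-biased on duplicate
-- keys, which is harmless if symbols really are globally unique.
instance Semigroup Store where
  Store s1 m1 <> Store s2 m2 = Store (Map.union s1 s2) (Map.union m1 m2)

instance Monoid Store where
  mempty = Store Map.empty Map.empty
```

With these instances, the store to link from would be written as libStore <> userStore, where both names are placeholders for the two stores described above.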

Post-linking expression IR

At link time, we take the AsteriusStore which contains everything (standard libraries and user input code), then perform live-code discovery: starting from a "root symbol set" (something like Main_main_closure), we iteratively fetch each entity from the store, traverse its AST and collect new symbols. When we reach a fixpoint, that fixpoint is the outcome of dependency analysis, representing a self-contained Wasm module.
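
The fixpoint iteration can be sketched as a worklist over sets of symbols. To keep the example short, the store is modeled as a map from each symbol to the symbols its entity references, which stands in for "fetch the entity and traverse its AST". The names Store and reachable are hypothetical.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Symbol = String

-- Hypothetical store: each symbol maps to the symbols its entity mentions.
type Store = Map.Map Symbol [Symbol]

-- Compute the set of symbols reachable from the root set. Each round visits
-- the current frontier and collects symbols that have not been seen yet; the
-- iteration stops when the frontier is empty, i.e. at the fixpoint.
-- Symbols missing from the store are kept but contribute no dependencies.
reachable :: Store -> Set.Set Symbol -> Set.Set Symbol
reachable store = go Set.empty
  where
    go seen frontier
      | Set.null frontier = seen
      | otherwise =
          let seen' = Set.union seen frontier
              next =
                Set.fromList
                  [ dep
                  | sym <- Set.toList frontier
                  , dep <- Map.findWithDefault [] sym store
                  ]
           in go seen' (next `Set.difference` seen')
```

Everything outside the returned set is dead code and can be left out of the output module, which is what makes this kind of link-time pruning effective.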

We then do some rewriting work on the self-contained module: making symbol tables, rewriting symbols to absolute addresses, using our own relooper to convert from control-flow graphs to structured control flow, etc. Most of the logic is in Asterius.Resolve.
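
As an illustration of the symbol table step, the sketch below assigns consecutive absolute addresses to a list of data entities. It is not the logic in Asterius.Resolve: the layout function, the base address parameter and the 8-byte alignment are all assumptions made for the example.

```haskell
import Data.Int (Int64)
import Data.List (mapAccumL)
import qualified Data.Map.Strict as Map

type Symbol = String

-- Assign each (symbol, size in bytes) entry a consecutive absolute address,
-- starting at the given base. Sizes are rounded up to a multiple of 8 so
-- that every entity starts on an 8-byte boundary (an assumption of this sketch).
layout :: Int64 -> [(Symbol, Int64)] -> Map.Map Symbol Int64
layout base = Map.fromList . snd . mapAccumL place base
  where
    place addr (sym, size) = (addr + align8 size, (sym, addr))
    align8 n = (n + 7) `div` 8 * 8
```

The resulting map is the kind of table a later pass can consult when replacing each symbol in the IR with its address.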

The output of the linker is a Module. It differs from AsteriusModule: although it shares quite a few datatypes with AsteriusModule (for example, Expression), it guarantees that some variants will not appear (for example, Unresolved*). A Module is ready to be fed to a backend which emits real Wasm binary code.

There are some useful linker byproducts. For example, there's LinkReport, which contains mappings from symbols to addresses. These mappings are lost in the Wasm binary code but are still useful for debugging.

Generating binary code via binaryen

Once we have a Module (which is essentially just a Haskell model of the binaryen C API), we can invoke binaryen to validate it and generate Wasm binary code. The low-level bindings are maintained in the binaryen package, and Asterius.Marshal contains the logic that calls the imported functions to do the actual work.

Generating binary code via wasm-toolkit

We can also convert a Module to the IR types of wasm-toolkit, which is our native Haskell Wasm engine. It's now the default backend of ahc-link, but the binaryen backend can still be selected with ahc-link --binaryen.

Generating JavaScript stub script

To make it actually run in Node.js/Chrome, we need two pieces of JavaScript code:

A common runtime which can be reused across different Asterius-compiled modules. It's in asterius/rts/rts.js.
Stub code which contains module-specific information like error messages, etc.

The linker generates the stub script along with the Wasm binary code, and concatenates the runtime and the stub script into a self-contained JavaScript file which can be run or embedded. The JavaScript "target" can be set to either Node.js or Chrome via ahc-link flags.