IR types and transformation passes
This section explains the various IR types in Asterius and aims to present a
clear picture of how information flows from Haskell to WebAssembly. (There's a
similar section in jsffi.md which explains the implementation details of JSFFI.)
Cmm IR
Everything starts from Cmm, or more specifically, "raw" Cmm which satisfies:
-
All calls are tail calls; parameters are passed in global registers like R1 or on the stack.
-
All info tables are converted to binary data segments.
Check the Cmm module in the ghc package to get started on Cmm.
Asterius obtains in-memory raw Cmm via:
-
cmmToRawCmmHook in our custom GHC fork. This allows us to get our hands on Cmm generated by compiling either Haskell modules or .cmm files (which are in rts).
-
There is some abstraction in ghc-toolkit: the compiler logic actually lives in the Compiler datatype as a set of callbacks, and ghc-toolkit converts them to hooks, frontend plugins and ghc executable wrappers.
There is one minor annoyance with the Cmm types in GHC (or any other GHC IR type): it's very hard to serialize/deserialize them without setting up complicated contexts related to package databases, etc. To experiment with new backends, it's reasonable to marshal to a custom serializable IR first.
Pre-linking expression IR
We then marshal raw Cmm to an expression IR defined in Asterius.Types. Each
compilation unit (Haskell module or .cmm file) maps to one AsteriusModule, and
each AsteriusModule is serialized to a .asterius_o object file which will be
deserialized at link time. Since we serialize/deserialize a structured
expression IR faithfully, it's possible to perform aggressive LTO by
traversing/rewriting IR at link time, and that's what we're doing right now.
The expression IR is mostly a Haskell modeling of a subset of binaryen's
expression IR, with some additions:
-
Unresolved-related variants, which allow us to use a symbol as an expression. At link time, the symbols are rewritten to absolute addresses.
-
Unresolved locals/globals. At link time, unresolved locals are laid out to Wasm locals, and unresolved globals (which are really just Cmm global regs) become fields in the global Capability's StgRegTable.
-
EmitErrorMessage, a placeholder for emitting a string error message and then trapping. At link time, such error messages are collected into an "error message pool", and the Wasm code just calls an error message reporting function with an array index.
-
Null. We're civilized, educated functional programmers and should really be using Maybe Expression in some fields instead of adding a Null constructor, but this is just handy. Blame me.
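To make the additions above concrete, here is a minimal sketch of what such an expression type could look like. The constructor names and fields below are simplified assumptions made for illustration; they are not the actual definitions in Asterius.Types.

```haskell
import Data.List (nub)

-- Hypothetical, heavily simplified expression IR (not the real Asterius.Types).
data Expression
  = ConstI64 Integer           -- a plain constant
  | Add Expression Expression  -- stand-in for ordinary binaryen-style nodes
  | UnresolvedSymbol String    -- a symbol used as an expression
  | UnresolvedGetLocal String  -- a named local, laid out at link time
  | EmitErrorMessage String    -- placeholder: report a message, then trap
  | Null                       -- "no expression here"
  deriving (Eq, Show)

-- Collect every symbol an expression refers to, without duplicates.
symbolsOf :: Expression -> [String]
symbolsOf = nub . go
  where
    go (Add l r) = go l ++ go r
    go (UnresolvedSymbol s) = [s]
    go _ = []

-- Collect the messages that would end up in the error message pool.
errorMessagesOf :: Expression -> [String]
errorMessagesOf (Add l r) = errorMessagesOf l ++ errorMessagesOf r
errorMessagesOf (EmitErrorMessage m) = [m]
errorMessagesOf _ = []
```

Traversals like these are the kind of work a serializable, structured IR makes cheap at link time.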
It's possible to encounter things we can't handle in Cmm (unsupported primops,
etc.), so AsteriusModule also contains compile-time error messages for
anything that isn't supported. These errors are not reported at compile time;
instead, they are deferred to runtime error messages. (Ideally they would be
reported at link time, but that turns out to be hard.)
The symbols are simply converted to Z-encoded strings that also contain module prefixes, and they are assumed to be unique across different compilation units.
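As a rough illustration of the naming scheme, here is a partial sketch of GHC-style Z-encoding. It handles only a handful of characters, and the mkSymbol helper and its argument order are assumptions made for this example (real symbols may carry further prefixes).

```haskell
-- Partial sketch of GHC-style Z-encoding; only a few characters are handled.
zEncodeChar :: Char -> String
zEncodeChar 'z' = "zz"
zEncodeChar 'Z' = "ZZ"
zEncodeChar '.' = "zi"
zEncodeChar '_' = "zu"
zEncodeChar '$' = "zd"
zEncodeChar '#' = "zh"
zEncodeChar c = [c] -- assumption: every other character is left unchanged here

zEncode :: String -> String
zEncode = concatMap zEncodeChar

-- Hypothetical symbol layout: encoded module, encoded entity, then a suffix.
mkSymbol :: String -> String -> String -> String
mkSymbol modName entity suffix =
  zEncode modName ++ "_" ++ zEncode entity ++ "_" ++ suffix
```

Under these assumptions, mkSymbol "Main" "main" "closure" gives Main_main_closure, and the module name GHC.Base encodes to GHCziBase.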
The store
There's an AsteriusStore type in Asterius.Types. It's an immutable data
structure that maps symbols to underlying entities in the expression IR for
every single module, and is a critical component of the linker.
Modeling the store as a self-contained data structure makes it pleasant to
write linker logic, at the cost of exploding RAM usage. So we implemented a
poor man's KV store in Asterius.Store which loads modules lazily: when
initializing the store, we only load the symbols, not the actual modules; only
when a module is "requested" for the first time do we deserialize it.
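The lazy-loading idea can be sketched as follows. This is not the code in Asterius.Store: the Store record, requestSymbol, and the load callback are all hypothetical names, and the store is threaded explicitly through IO to keep the example small.

```haskell
import qualified Data.Map.Strict as Map

type Symbol = String
type ModuleName = String

-- Hypothetical stand-in for a deserialized module: symbol -> entity payload.
type ModuleBody = Map.Map Symbol String

data Store = Store
  { symbolIndex :: Map.Map Symbol ModuleName     -- loaded eagerly
  , loaded      :: Map.Map ModuleName ModuleBody -- filled on first request
  }

-- Build a store that knows where each symbol lives but holds no module bodies.
initStore :: [(Symbol, ModuleName)] -> Store
initStore syms = Store (Map.fromList syms) Map.empty

-- Look up a symbol, deserializing its module only on the first request.
-- `load` is an assumed callback that reads and decodes one object file.
requestSymbol
  :: (ModuleName -> IO ModuleBody) -> Store -> Symbol -> IO (Maybe String, Store)
requestSymbol load store sym =
  case Map.lookup sym (symbolIndex store) of
    Nothing -> pure (Nothing, store)
    Just m -> case Map.lookup m (loaded store) of
      Just body -> pure (Map.lookup sym body, store)
      Nothing -> do
        body <- load m
        let store' = store { loaded = Map.insert m body (loaded store) }
        pure (Map.lookup sym body, store')
```

A second request for any symbol in the same module hits the loaded map and never calls load again.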
AsteriusStore supports merging. It's a handy operation: we can first
initialize a "global" store that represents the standard libraries, then make
another store from compiling user input, merge the two, and start linking from
the resulting store.
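One way to picture merging is as a Monoid instance over a map keyed by symbol. The Store newtype below is a hypothetical stand-in, not the real AsteriusStore, and the symbol names in the usage note are only illustrative.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical store: symbol -> name of the module that defines it.
newtype Store = Store (Map.Map String String)
  deriving (Eq, Show)

-- Merging is a map union. This is only sound because symbols are assumed to
-- be unique across compilation units (Map.union is left-biased on clashes).
instance Semigroup Store where
  Store a <> Store b = Store (Map.union a b)

instance Monoid Store where
  mempty = Store Map.empty
```

With this shape, a library store and a user store combine as libStore <> userStore, and mconcat handles any number of stores.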
Post-linking expression IR
At link time, we take the AsteriusStore which contains everything (standard
libraries and user input code), then perform live-code discovery: starting
from a "root symbol set" (something like Main_main_closure), we iteratively
fetch each entity from the store, traverse its AST and collect new symbols.
When we reach a fixpoint, that fixpoint is the outcome of dependency analysis,
representing a self-contained Wasm module.
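The discovery loop described above can be sketched as a simple worklist computation. The DepGraph type is an assumed simplification: it maps each symbol directly to the symbols its entity mentions, where the real linker would fetch the entity from the store and traverse its AST.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Symbol = String

-- Hypothetical store view: each symbol maps to the symbols its entity mentions.
type DepGraph = Map.Map Symbol [Symbol]

-- Grow the live set from the roots until no new symbols appear (a fixpoint).
-- Symbols missing from the graph are treated as leaves in this sketch.
liveSymbols :: DepGraph -> Set.Set Symbol -> Set.Set Symbol
liveSymbols graph = go Set.empty
  where
    go seen frontier
      | Set.null frontier = seen
      | otherwise =
          let seen' = Set.union seen frontier
              next =
                Set.fromList
                  [ d
                  | s <- Set.toList frontier
                  , d <- Map.findWithDefault [] s graph
                  ]
           in go seen' (Set.difference next seen')
```

Each round only visits symbols not seen before, so the loop terminates on any finite graph, including cyclic ones.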
We then do some rewriting work on the self-contained module: making symbol
tables, rewriting symbols to absolute addresses, using our own relooper to
convert from control-flow graphs to structured control flow, etc. Most of the
logic is in Asterius.Resolve.
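As an illustration of the symbol table step only, here is a minimal sketch that lays entities out back to back and resolves symbols against the result. The function names, the (symbol, size) input shape, and the absence of alignment handling are all assumptions; the actual logic in Asterius.Resolve may differ.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical layout: place entities back to back, starting at a base address.
-- Each input pair is (symbol, size in bytes); alignment is ignored here.
makeSymbolTable :: Int -> [(String, Int)] -> Map.Map String Int
makeSymbolTable base entities = Map.fromList (zip names addrs)
  where
    names = map fst entities
    addrs = scanl (+) base (map snd entities) -- running start offsets

-- Rewrite a symbol to its absolute address; Left reports a missing symbol.
resolveSymbol :: Map.Map String Int -> String -> Either String Int
resolveSymbol table sym =
  maybe (Left ("unresolved symbol: " ++ sym)) Right (Map.lookup sym table)
```

Returning Either keeps a missing symbol visible to the caller instead of silently producing a bogus address.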
The output of the linker is a Module. It differs from AsteriusModule:
although it shares quite a few datatypes with AsteriusModule (for example,
Expression), it guarantees that some variants will not appear (for example,
Unresolved*). A Module is ready to be fed to a backend which emits real Wasm
binary code.
There are some useful linker byproducts. For example, there's LinkReport,
which contains mappings from symbols to addresses. These mappings are lost in
the Wasm binary code, but are still useful for debugging.
Generating binary code via binaryen
Once we have a Module (which is essentially just a Haskell modeling of the
binaryen C API), we can invoke binaryen to validate it and generate Wasm
binary code. The low-level bindings are maintained in the binaryen package,
and Asterius.Marshal contains the logic to call the imported functions to do
the actual work.
Generating binary code via wasm-toolkit
We can also convert Module to the IR types of wasm-toolkit, which is our
native Haskell Wasm engine. It's now the default backend of ahc-link, but the
binaryen backend can still be chosen by ahc-link --binaryen.
Generating JavaScript stub script
To make it actually run in Node.js/Chrome, we need two pieces of JavaScript code:
-
Common runtime which can be reused across different Asterius-compiled modules. It's in asterius/rts/rts.js.
-
Stub code which contains specific information like error messages, etc.
The linker generates the stub script along with the Wasm binary code, and
concatenates the runtime and the stub script into a self-contained JavaScript
file which can be run or embedded. It's possible to set the JavaScript
"target" to either Node.js or Chrome via ahc-link flags.