April 28th 2012

TAGS:
Open Source

OSS Grant Roundup: JRuby’s New Intermediate Representation

Between December 2011 and March 2012, Engine Yard provided me with a 3-month grant to put in focused time on JRuby’s new runtime system based on a new intermediate representation, which will be driving performance optimizations in JRuby in the coming years. In this blog post, I will outline the state of this project so far and what lies ahead. While I am the one writing this blog post, a number of ideas here have come out of discussions with Tom and Charlie.

I started work on this project in 2009 (Charlie roped me in when I thought compilers were behind me) and did a first pass on building an intermediate representation (IR) from the AST, and outlines of a couple optimizations. Through 2010, Tom got the first version of an IR-based interpreter implemented. In 2011, Tom and I ironed out a lot of the bugs in the IR, refactored and cleaned up code. But, thanks to the grant, I was able to work a little more intensively with Tom to debug the IR in both 1.8 and 1.9 modes (using the new interpreter as a vehicle for this), with and without some preliminary optimizations in place. As of now, RubySpec runs are “green” (more about the air-quotes later), MRI tests are “green”, Rails boots up and runs (yet to be tested extensively), gem functionality works, and IRB runs.

This project will be part of the 1.7 release in experimental form. We hope to have some version of the JIT (even if not functionally complete) as part of the release as well. But, with a default functionally complete and correct interpreter in place, we will now be ready to embark more seriously on performance work in the coming months and years. Thanks to Engine Yard for the grant and kudos for putting money into this project where the returns are not going to be immediate.

But, first things first …

Why a new IR (intermediate representation)?

The AST captures higher level structural semantics compactly, but its non-linear tree-structure makes it harder to perform standard dataflow analyses and code transformations on it, especially in the presence of blocks, exceptions, non-local breaks, non-local returns, and retries. There is a theoretical framework for dataflow and control-flow analysis based on control flow graphs and basic blocks with instruction sequences that can be used to analyze this IR, and implement performance improving code transformations on it. This would be harder to do with an AST. Inlining transformations are easier in this IR form than with an AST.
This IR attempts to capture “operational semantics” (to use the term somewhat loosely) by enabling primitive operations on the JRuby runtime to be specified, whereas the AST captures “structural semantics” (once again to use the term loosely). For example, the AST captures a call via different flavors of call AST nodes, but looking at it, you don't know what happens under the covers. But the IR can operationally specify what happens on a call: arguments are prepared, method is looked up, bindings are allocated, other values are pushed on the stack, arities are checked, arguments are received. Similarly, in the AST what are Colon2 and Colon3 nodes transform to identical InheritanceSearchConst, test for successful lookup, followed by ConstMissing instructions. All of these primitive IR operations are then subject to analysis and optimization which is harder in the AST since they are not surfaced explicitly.
It is also a lot more straightforward to compile this IR to JVM bytecode when compared to the existing JIT. In addition, it is also possible to target new backends (ex: Dalvik) fairly easily while taking advantage of all optimization passes.
An additional benefit of this instruction-based representation is that it enables JRuby to persist the IR to disk (similar to .s output for C) across runs.
We also expect to be able to simplify pieces of the existing JRuby runtime once we switch to IR mode. For example, the IR-based interpreter localizes all non-local break and return handling to the interpreter loop and the break and return instructions. Implementation details of the VM dont leak out into JRuby code that implements Ruby libraries, unlike the existing runtime.

So, now that you know why we are going this direction, let us look at some IR output. The figure below shows a simple recursive fibonacci function and the corresponding textual AST and IR representations. The textual output of IR is far easier to read and understand as well.

Ruby Source code:

def fib_ruby(n)
 (n < 2) ? n : fib_ruby(n - 2) + fib_ruby(n - 1)
end

JRuby AST:

ArgumentNode:fib_ruby 0
ArgsPreOneArgNode 0
  ArrayNode 0
    ArgumentNode:n 0
NewlineNode 1
  IfNode 1
    NewlineNode 1
      CallOneArgFixnumNode:< 1
        LocalVarNode:n 1
        ArrayNode 1
          FixnumNode 1
    LocalVarNode:n 1
    CallOneArgNode:+ 1
      FCallOneArgNode:fib_ruby 1
        ArrayNode 1
          CallOneArgFixnumNode:- 1
            LocalVarNode:n 1
            ArrayNode 1
              FixnumNode 1
      ArrayNode 1
        FCallOneArgNode:fib_ruby 1
          ArrayNode 1
            CallOneArgFixnumNode:- 1
              LocalVarNode:n 1
              ArrayNode 1
                FixnumNode 1

JRuby IR:

LBL_2:
LBL_3:
 	check_arity(1, 0, -1)
 	n(0:0) = recv_pre_reqd_arg(0)
 	%block(0:1) = recv_closure
 	thread_poll
 	line_num(1)
 	%v_0 = call(< , n(0:0), [2:fixnum])
 	b_false(%v_0, LBL_0)
LBL_4:
 	%v_0 = copy(n(0:0))
 	jump LBL_1
LBL_0:
 	%v_1 = call(-, n(0:0), [2:fixnum])
 	%v_2 = call(fib_ruby, %self, [%v_1])
 	%v_1 = call(-, n(0:0), [1:fixnum])
 	%v_3 = call(fib_ruby, %self, [%v_1])
 	%v_1 = call(+, %v_2, [%v_3])
 	%v_0 = copy(%v_1)
LBL_1:
 	return(%v_0)
LBL_5:

Playing around in IR mode

All this code is on JRuby master. If you want to play around with this, the notes below should get you going:

-X-CIR enables interpretation in IR mode. This mode has been extensively tested in both 1.8 and 1.9 modes. Please report correctness bugs if your Ruby code yields incorrect output in IR mode. Note that some incorrect uses of break that ought to yield LocalJumpErrors dont always raise these errors (hence the air-quotes earlier) -- something that will be fixed in the coming months. As far as performance goes, the IR-based interpreter is slower than the AST-based interpreter, from 30% - 3X, depending on the Ruby code.
-Xir.debug=true enables debug mode which enables verbose output of IR instructions. In addition, just to get clean debug output, you will also want to enable -Xlogger.class=org.jruby.util.log.StandardErrorLogger
If you are going to be running RubySpec tests, please remember to pass in -Xlaunch.inproc=true so that the tests run with -X-CIR mode in the same process. If not, new spawned processes will run with default JRuby mode (-X+C) which doesn't use IR yet. Alternatively, you can set env var JRUBY_OPTS=-X-CIR, and that should work as well.
-X+CIR enables JIT-ting in IR mode. This is not yet functionally complete. Charlie is working on this as you are reading this. But, for starters, we are going to target Java 7 (using invokedynamic functionality). While it is not hard to add support for Java 6, we'll only add that when it becomes necessary. As this work moves along, we expect -X+CIR will perform better relative to -X+C than -X-CIR does relative to -X-C, since the IR is designed to help the JIT generate better code. But don't hold your breath -- this is all work in progress. But, I would be surprised if -X+CIR doesn't match up with and eventually do better than -X+C.

Besides these basic flags, you can also play around with a few IR passes from the command-line. The following passes are currently operational:

OptimizeTempVarsPass: As part of generating IR, a lot of temporary variables are generated by the IR builder (%v_0, %v_1, %v_2, %v_3 in the IR example above). This pass eliminates unused temporaries, does simple constant and copy propagation of temporaries, and heavily reuses temporaries to minimize the number of temporary variables used. This compaction reduces memory allocation in the interpreter and local variables in the JIT.
LocalOptimizationPass: Does simple copy/constant propagation within a basic block. Nothing fancy.
DeadCodeElimination: As the name says, it gets rid of code that doesn't need to be executed. This pass is fairly conservative right now -- caused by the use of blocks-as-binding which means all writes to local variables would have to be preserved for the worst-case scenario. Even so, it does yield 10% or more speedup in certain cases. In future, we'll explore ways to relax these constraints. This code has been debugged and tested for correctness but this needs additional testing. Please report bugs where you run into them.
AddLocalVarLoadStoreInstructions: In Ruby, local variables are not like locals in languages like C/Java because of closures. They reside in a heap structure (binding). The current JRuby runtime is smart enough to detect when a method doesn't require a binding, but cannot do much more than that. This pass lets JRuby try to be smart about when to store the variable to the binding and load from it. This is experimental and not yet well-tested for performance (will be once JIT is in place). This is primarily targeted at the JIT, but in certain scenarios, benefits the interpreter as well (up to 15% speedup in certain cases). More details in a future post.

You can enable passes on the commandline via -Xir.passes flag. A few valid options:

`-Xir.passes=LinearizeCFG`
`-Xir.passes=OptimizeTempVarsPass,LocalOptimizationsPass,LinearizeCFG`
`-Xir.passes=DeadCodeElimination,LinearizeCFG`
`-Xir.passes=DeadCodeElimination,AddLocalVarLoadStoreInstructions,LinearizeCFG`

This CLI is still quite basic at this time, and a lot of functionality is missing, but it lets you mix and match passes which is especially good for testing the IR code in different ways. But, given the preliminary state of this CLI flag, you need to always add LinearizeCFG as the last pass whenever you provide an explicit passes flag, and the ordering of passes on the CLI matters as well. As long as you preserve the relative order of passes as above, you should be good.

We also have preliminary profiling and inlining (methods and blocks) in place, but it is not fully enabled on master (requires a non-committed patch at this time). In the interest of keeping length under control, I am skipping the details here since you cannot really play with it right now.

Now that we got the preliminaries out of the way, let us look at where we are at right now, and what lies in the future.

Current status

As should have been clear from the previous sections, a lot of the work so far has been to try out different things with the IR (interpret in 1.8 and 1.9 modes, compile to bytecode, run analyses and code transformations, profile, inline) with the goal to stabilize the IR and the code around it so we are increasingly confident that it accurately captures Ruby semantics in a clean way, and that we can use it for code optimizations. While we have dabbled in some performance experiments, most of these have been preliminary stabs. Once the JIT is complete, we will have a solid interpreter and JIT that are functionally correct, and we can more confidently experiment with code transformations to improve performance.

Overall, it is best to think of this project as moving towards a newer JRuby-level VM that is based on this new IR. As it stands now, we have a high degree of confidence that the IR, as it exists now, accurately captures semantics of Ruby, and that the IR-based interpreter, in both 1.8 and 1.9 modes, is as functionally correct as the default JRuby implementation. Based on this confidence, over the past few months, Tom has proceeded to do some much needed code cleanup and refactoring.

To summarize, here is where things are with this project and the pieces that are already in place:

IR builder to transform AST to a sequence of IR instructions.
Local peephole optimizations (post-IR construction). - Minimizes use of temporary variables. - Simple copy and constant propagation within a basic block.
A generic abstract-interpretation based dataflow analysis framework.
Dead code elimination (based on the dataflow framework).
Preliminary profiling support. - Implements scope hotness profiling by embedding profiling on thread poll instructions placed on scope entries and loop back-edges. Thread-poll counts act as proxies for a hot scope.
Method and block inlining transformation (with some known bugs and gaps).
Local variable access opts (based on the dataflow framework).
IR -> bytecode JIT is coming along. - For now, this targets JDK 7.
Beginnings of work to optimize the IR stream for interpretation. - There is some tension between what the interpreter wants from the IR and what opt passes and JIT want from the IR. The latter want more explicit information and simpler instructions, whereas interpreter wants fewer complex instructions (to eliminate dispatching and interpretation costs). So, one experiment we started is to transform instruction chains in a basic block into an expression tree which eliminates dispatch costs, and temporary variable read/write overheads from the interpreter loop.

Future work

The first thing we want to accomplish is get the IR -> bytecode JIT (-X+CIR) functionally complete and correct.

Going forward, there are several things we want to address:

Additional IR cleanup: There are still some oddities left in the instruction set (GetClassVarContainerModuleInstr, InstanceofInstr) that we want to get rid of. That just reflects a gap in my understanding of Ruby semantics as I worked on this.
Improving interpreter performance: The 3X slowdown that I reported up there was to dampen your expectations :). The 3X slowdown is on micro-benchmarks like fib_recursive. “Real” programs perform much better. But, this also highlights another issue with performance testing: micro-benchmarks aside, is there a good set of benchmarks on which we can evaluate performance that is meaningful? That apart, we still have some gap to close with the AST interpreter before it can become the default interpreter.
Expose more JRuby state in the IR so we can more explicitly analyze and manipulate it - where is a piece of state really required, and where can it be removed. For example, on a typical call, JRuby passes a block, sets up a call frame with many fields, sets up a "binding" for local variables, all of which adds to the cost of Ruby calls. Not all of this is necessary at all call sites. We have brainstormed several ideas about how to tackle this and this is something that we'll be working on. No immediate results here since some of this state winds its way through the existing runtime, and we have to carefully unwind it.
Additional analyses/optimizations (non-exhaustive): - Optimize allocation of bindings: Right now, bindings/dyn-scopes are always allocated -- we will import some heuristics from the AST JIT, but more work to be done here. - Migrate call setup code out of JRuby runtime into the IR (split between the call site and callee entry/exit). - Method and block inlining: As discussed earlier, beginnings of this are in place which you can play with via the -Xir.profile=true option. This still needs work to iron out bugs and figuring out good inlining strategies which is different from the transformation (which is needed when the strategy decides scope A should be inlined into scope B). - Unboxing of primitives: Rather than create first-class Ruby/Java objects for fixnums, booleans, can we get away using JVM primitives instead? - Profiling: What is the kind of profiling we need and how do we do it effectively with low overheads? - VM manager: We need a manager process that orchestrates this whole thing. How much do you optimize before interpretation? When do you profile and how long? When do you JIT? When do you inline and how much do you inline? How much work should JRuby do and how much should we let the JVM deal with? Our strategy will be one where we always let the JVM do its magic and find areas where we can help it along. But, to me, it is clear that we cannot expect the JVM to optimize everything that is thrown at it which is where this VM will play a role.

Some of these tasks will take months and others will happen over a few years. So, lots to be done.

Share your thoughts with @engineyard on Twitter