Making Ruby Fast: The Rubinius JIT

In order to execute Ruby code as fast as possible, Rubinius can compile Ruby code all the way down to machine code when it detects that a method is heavily used. The system that manages this process is the JIT.

In today’s post, I’ll be giving an overview of the various players involved in the path that code takes to get from source to machine code. Without further ado, I’ll jump right in.

### Melbourne Parser

This is the first step. The parser takes Ruby source code as input and, as it recognizes each element, builds up an internal representation of the code: the AST. (lib/ext/melbourne)
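To make that concrete, here is a minimal C++ sketch of the kind of tree the parser might produce for `1 + 2`. The node types are purely illustrative; Melbourne's actual node classes differ in detail.

```cpp
#include <memory>
#include <string>

// Illustrative AST node types; the real node classes in
// lib/ext/melbourne are richer than this.
struct Node {
  virtual ~Node() = default;
};

struct FixnumLiteral : Node {
  long value;
  explicit FixnumLiteral(long v) : value(v) {}
};

struct SendNode : Node {             // a method call, e.g. 1.+(2)
  std::unique_ptr<Node> receiver;
  std::string method_name;
  std::unique_ptr<Node> argument;
};

int main() {
  // Roughly the tree handed to the compiler for "1 + 2".
  auto send = std::make_unique<SendNode>();
  send->receiver = std::make_unique<FixnumLiteral>(1);
  send->method_name = "+";
  send->argument = std::make_unique<FixnumLiteral>(2);
}
```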

### Compiler

The compiler takes the AST that the parser created and analyzes it, producing bytecode in the form of a CompiledMethod and an InstructionSequence. (lib/compiler)

### Bytecode

A CompiledMethod object contains an InstructionSequence object, which is the raw bytecode that performs the semantic actions of the Ruby source code. (lib/compiler/iseq.rb)
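As a rough picture of that relationship, here is an illustrative sketch; the real classes carry much more state (literals, line information, argument counts, and so on) than shown here.

```cpp
#include <cstdint>
#include <vector>

using opcode = uint32_t;

// Illustrative shapes only, not the actual class layouts.
struct InstructionSequence {
  std::vector<opcode> opcodes;   // the raw bytecode stream
};

struct CompiledMethod {
  InstructionSequence iseq;      // what the interpreter and JIT consume
};
```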

### Virtual Machine

The VM itself then executes the bytecode using a simple interpreter. A key data structure used in this evaluation is the VMMethod, which is an internal mirror of a CompiledMethod, but translated into constructs that are easier to interpret. As the VM interprets a VMMethod, it uses InlineCache objects to speed up method dispatch. In addition, these InlineCache objects remember profiling information about what methods they have seen. This information is later used by the JIT.
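Here is a simplified sketch of a monomorphic inline cache of the kind described above. The names and layout are assumptions for illustration, not Rubinius's actual InlineCache.

```cpp
struct Class;
struct Executable;

// Stub standing in for the full method lookup (the slow path).
Executable* full_lookup(Class*, const char*) { return nullptr; }

// Remember the receiver class seen at this call site and the method
// it resolved to, so the next dispatch can skip the full lookup.
// The hit count is the profiling data the JIT later reads.
struct InlineCache {
  Class* seen_class = nullptr;
  Executable* method = nullptr;
  int hits = 0;

  Executable* dispatch(Class* klass, const char* name) {
    if (klass == seen_class) {       // fast path: cache hit
      ++hits;
      return method;
    }
    seen_class = klass;              // miss: look up and remember
    method = full_lookup(klass, name);
    return method;
  }
};
```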

The VM also increments a call counter on the VMMethod at a few critical points (on method entry and at backward branches). This call counter controls when the JIT kicks in: when it reaches a predetermined value (controlled via -Xjit.call_til_compile), the first stage of the JIT begins.

### Method Chooser

Now that a call counter has reached the proper level, the JIT is ready to kick in. It could simply take the method whose counter hit the threshold, but instead it starts a search: it looks up the call stack for a good method to JIT.
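As a concrete picture of that trigger, here is a minimal sketch, assuming a simple per-method counter and a threshold taken from -Xjit.call_til_compile; the names are placeholders for the real logic in the interpreter and vm/llvm/jit.

```cpp
struct VMMethod {
  int call_count = 0;
  bool jit_requested = false;
};

// Bump the counter at method entry and backward branches; once it
// crosses the threshold, queue the method for the JIT exactly once.
void on_call_or_backward_branch(VMMethod* vmm, int call_til_compile) {
  if (++vmm->call_count >= call_til_compile && !vmm->jit_requested) {
    vmm->jit_requested = true;
    // ...hand the method to the method chooser / compile queue...
  }
}
```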

The reason for the stack search is that the JIT can inline methods into the methods that call them. We’re exploiting the fact that the call stack shows not just one method heating up, but a whole chain of them (vm/llvm/jit:compile_callframe). So we walk up the call stack, looking at each method along the way and asking: could this method be inlined into the one that called it? If the answer is yes, we move up to the next method. By doing this, we’re able to inline methods along hot paths in the code, which yields better performance.
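A minimal sketch of that walk might look like the following; the names and the inlinability check are placeholders (compare vm/llvm/jit:compile_callframe).

```cpp
struct VMMethod;

struct CallFrame {
  CallFrame* previous;   // the caller's frame
  VMMethod* method;
};

// Stub for the real check, which weighs the method's size and shape.
bool inlinable_into_caller(CallFrame*) { return true; }

// Climb from the hot frame while each method could be inlined into
// its caller; the frame we stop at becomes the root of compilation,
// so the whole hot chain is compiled (and inlined) together.
CallFrame* choose_compile_root(CallFrame* hot) {
  CallFrame* root = hot;
  while (root->previous && inlinable_into_caller(root)) {
    root = root->previous;
  }
  return root;
}
```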

### Compiler Thread

Now that we’ve picked a good method at which to start the JIT process, that method is placed into a queue. The queue is emptied by a separate native thread that is always running, and it’s in this background thread that the rest of the JIT process takes place.

Using a background thread means that Ruby code is free to continue running while the JIT does its work. The JIT therefore imposes virtually no pause on the program, because it never stands in the way of running Ruby code (vm/llvm/jit:compile_soon).
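A stripped-down version of such a queue and worker might look like this, written with C++11 threads; the real JIT thread is a native VM thread, and this sketch omits shutdown handling.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

struct VMMethod;  // the unit of compilation

class CompileQueue {
  std::deque<VMMethod*> queue_;
  std::mutex mutex_;
  std::condition_variable ready_;

public:
  // Called from the running Ruby thread; returns immediately, so
  // Ruby code never waits on the JIT (compare compile_soon).
  void compile_soon(VMMethod* vmm) {
    { std::lock_guard<std::mutex> lock(mutex_); queue_.push_back(vmm); }
    ready_.notify_one();
  }

  // Runs forever on the JIT thread, draining the queue.
  void worker_loop() {
    for (;;) {
      VMMethod* vmm;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        vmm = queue_.front();
        queue_.pop_front();
      }
      // ...translate vmm to LLVM IR, optimize, emit machine code...
    }
  }
};
```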

The JIT thread pops an entry off the queue and begins compiling it. Because the JIT uses LLVM to perform low-level optimization and machine code generation, we need to translate the method into a structure that LLVM understands: the LLVM IR.

To convert the method, we walk through the bytecode and call a method on a JITVisit object for each kind of bytecode instruction. The JITVisit class uses LLVM’s IRBuilder class to build a big tree data structure that represents the actions that should be taken.

A simple example: a goto bytecode instruction is translated into a Branch object and inserted into the IR. For most bytecodes the process is fairly straightforward; there is a simple set of IR objects to generate for each one.
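For a feel of what building IR looks like, here is a small standalone example using LLVM’s IRBuilder to emit an unconditional branch, the IR-level equivalent of a goto. It is written against the modern LLVM C++ API, whose header paths differ from the LLVM version Rubinius used.

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  LLVMContext ctx;
  Module mod("sketch", ctx);
  IRBuilder<> builder(ctx);

  // A function with two basic blocks: entry jumps to target.
  FunctionType* fty = FunctionType::get(builder.getInt32Ty(), false);
  Function* fn = Function::Create(fty, Function::ExternalLinkage, "f", &mod);
  BasicBlock* entry = BasicBlock::Create(ctx, "entry", fn);
  BasicBlock* target = BasicBlock::Create(ctx, "target", fn);

  builder.SetInsertPoint(entry);
  builder.CreateBr(target);            // the goto: an unconditional branch

  builder.SetInsertPoint(target);
  builder.CreateRet(builder.getInt32(0));

  mod.print(outs(), nullptr);          // dump the textual IR
}
```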

### Method Inlining

The most complicated bytecodes to generate IR for are the send instructions. This is because it’s at these points that we have the opportunity to inline a method. When a method is inlined, the code to perform the method is inserted where the send instruction would normally be. This eliminates any calling overhead and allows LLVM to optimize more.

At this stage, control is handed to an Inliner object. The Inliner will only inline a method if it can see that the method was the primary method called at a particular send instruction. Because Ruby is a dynamic language, any method call can suddenly invoke a brand new method, but in reality that happens rarely. Instead, most call sites end up calling the same method over and over again. The profiling information the InlineCaches have been gathering allows us to see when this is the case and perform the inlining.

One constraint on the Inliner is that it must avoid over-inlining, which causes the generated function to become extremely large and slower than it would be with no inlining at all. To enforce this, the Inliner keeps track of the cost of inlining: every inlined method increases the cost, and when the cost reaches a threshold, no more inlining takes place.
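A minimal sketch of such a cost budget follows; the threshold and the notion of `method_size` are placeholders, not the Inliner’s actual accounting.

```cpp
// Illustrative cost budget; the threshold is made up.
struct InlinePolicy {
  int cost = 0;
  static const int max_cost = 100;

  // Charge for each inlined method; refuse once over budget.
  bool should_inline(int method_size) {
    if (cost + method_size > max_cost) return false;
    cost += method_size;
    return true;
  }
};
```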

Now that the Inliner has decided to go ahead and inline the method, it must insert a guard before the inlined code. This guard makes sure that the object is still of the type seen in the profiling information, and that therefore the inlined method code is the proper code to run. The generated IR for this looks something like:

```
if (obj->class == profiled_class) {
  // fast path: the inlined body, no dispatch overhead
  result = inlined_code_for_method_name;
} else {
  // slow path: fall back to a normal dynamic send
  result = obj->send(method_name, …);
}
```

This allows Ruby code to continue to be dynamic, but exploits those points in the code where the dispatch is actually static.

After the JITVisit class has finished, control is handed off to LLVM, which optimizes the IR and generates machine code.

### LLVM Optimization

Up to now, we’ve simply been constructing information to feed to LLVM. Now we actually hand over the IR. The first thing LLVM does is run a number of optimization passes over it, cleaning it up and making it quite a bit more efficient. At this stage, the IR can shrink by a factor of five to ten as LLVM removes redundancies and reorders operations.
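As an illustration of what this stage involves, here is a sketch of a per-function pipeline using LLVM’s legacy pass manager. The pass selection is illustrative rather than Rubinius’s exact list, and header and pass-factory locations have moved around between LLVM versions.

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/InstCombine/InstCombine.h"
#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Scalar/GVN.h"
#include "llvm/Transforms/Utils.h"

// Run a handful of classic cleanup passes over a single function.
void optimize(llvm::Function* fn) {
  llvm::legacy::FunctionPassManager fpm(fn->getParent());
  fpm.add(llvm::createPromoteMemoryToRegisterPass()); // mem2reg
  fpm.add(llvm::createInstructionCombiningPass());    // peephole combining
  fpm.add(llvm::createCFGSimplificationPass());       // prune dead blocks
  fpm.add(llvm::createGVNPass());                     // remove redundancies
  fpm.doInitialization();
  fpm.run(*fn);
  fpm.doFinalization();
}
```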

### LLVM Code Generation

Finally, the optimized IR is run through LLVM’s code generator. The code generator itself is fairly complex, but its API is extremely simple: when it finishes, it returns a function pointer that can be called to execute the code. We put this function pointer into a special slot on the original CompiledMethod.

By default, this slot holds a pointer to the function that implements the interpreter. By swapping in the new pointer, any future call of the method automatically uses the JIT’d version.
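Conceptually, the swap is a single pointer store, something like the following sketch; the types and signatures here are simplified placeholders.

```cpp
struct State;
struct CallFrame;
struct Object;

// Every method is entered through a function pointer of this shape.
typedef Object* (*Executor)(State*, CallFrame*);

// Stub standing in for the interpreter's entry point.
Object* interpreter_entry(State*, CallFrame*) { return nullptr; }

struct CompiledMethod {
  Executor execute = interpreter_entry;  // default: interpret
};

// Installing JIT'd code is one pointer store; every future call of
// the method then runs the machine code instead of the interpreter.
void install_jitted(CompiledMethod* cm, Executor jitted) {
  cm->execute = jitted;
}
```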

### Deoptimization

Because Ruby is so dynamic, there are cases where JIT’d code must be discarded. The primary example is when a method that was inlined somewhere is redefined. For this case, the VM keeps a table of all the methods that inlined the method being redefined. It resets those methods back to the interpreter and tags the JIT’d code to be discarded. Because the methods have been reset, they can be JIT’d again later, incorporating the newly redefined method.
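A sketch of such a dependency table might look like this; keying on a method name and passing in a reset callback are simplifications for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct CompiledMethod;

// For each method name, the JIT'd methods that inlined it.
struct DeoptTable {
  std::unordered_map<std::string, std::vector<CompiledMethod*>> inliners;

  void record_inline(const std::string& callee, CompiledMethod* host) {
    inliners[callee].push_back(host);
  }

  // Called when `callee` is redefined: every host that baked in the
  // old definition falls back to the interpreter and can re-JIT later.
  void method_redefined(const std::string& callee,
                        void (*reset_to_interpreter)(CompiledMethod*)) {
    for (CompiledMethod* host : inliners[callee]) {
      reset_to_interpreter(host);
    }
    inliners.erase(callee);
  }
};
```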

So those are the systems that interact to speed up Ruby. The process can achieve speedups of as much as 10x over non-JIT’d code, and we’ve only begun to scratch the surface of the techniques we can use to strip away more of the code’s dynamic overhead and make it faster. 2010 is going to be an exciting year.