Concurrency, Real and Imagined, in MRI; Threads
In computer science, concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other. The computations may be executing on multiple cores in the same chip, preemptively time-shared threads on the same processor, or executed on physically separated processors.
– Wikipedia Concurrency article
Simply put, concurrency is when you have more than one logical thread of execution occurring simultaneously, or at least appearing to occur simultaneously. When you write software that makes use of concurrency, you want your software to do two or more things at once.
The motivations for using concurrency are varied. Sometimes you may have architectural reasons for using concurrency – your code makes more sense to you or is easier to write if you conceive it in more than one discretely executing unit. In other cases you may want to employ concurrency in order to make better use of the multiple cores that many modern computers have, enabling you to get better total throughput out of your code than you would have from a non-concurrent implementation.
Whatever the motivation for employing concurrency, the reality is that concurrency is a complex subject. There are many different ways to achieve concurrency in software, and they each have their own set of tradeoffs. Furthermore, if your platform is Ruby, your decisions about what kind of concurrency to employ will be influenced by the specific Ruby implementation you are targeting. Each provides a different set of concurrency options for you to consider.
This is the first installment in a new series of articles focusing on introducing and exploring the variety of concurrency options available in the Ruby ecosystem. Advantages and disadvantages will be discussed for each, and I’ll leave you with a few examples of how you can leverage these different options in your code. It should be a fun subject to explore!
Concurrency is all about multitasking – doing more than one thing at once. The building blocks of multitasking are processes, threads, and fibers. Each of these components is complex in itself, both because of the nuances in how they interact and can be combined, and because different platforms have variations in which capabilities they implement and in how they are implemented. Luckily, their overall description can be summarized in a useful way.
Processes are independent units of execution that generally share nothing with other processes, except for resources which are intended to be shared (such as shared memory segments, shared IO resources, or memory mapped files). Processes carry a lot of state information with them and have their own address spaces. Communication between them has to be through an interprocess communication mechanism provided by the platform that the processes are running in. Processes running on the same machine will be scheduled by the kernel, which will typically use some sort of time slicing algorithm to spread CPU usage of all running processes across the available cores.
Threads come in several different flavors, including kernel, user space, and green threads. On some platforms there are entities called light-weight processes that bring kernel threads into user space so they look somewhat like processes, but are less expensive. For our purposes, threads are contained within a process, and share the memory space and process state of the process with each other. Green threads differ in that they are not controlled or scheduled by the operating system. Rather, they are provided by the process itself. This has a portability advantage because it means that the threads will be available on every platform that the process can run on, and will work the same on each. The main disadvantage is that green threads, being managed by the process itself, are generally confined to sharing a single core, and are limited to the peculiarities of the process’s threading implementation (which may vary substantially from the platform’s own threading implementation). Regardless of the type of threading, context switching with threads is generally faster than it is with processes.
Fibers are like user space threads, except the operating system doesn’t handle scheduling for them. Instead, fibers must be explicitly yielded to allow other fibers to run. This can have performance advantages like the reduction of system scheduling overhead. Since multitasking with fibers is cooperative, the need to use locks on shared resources is reduced or eliminated. Programmers can also leverage fibers to their advantage with IO operations by allowing other things to run while waiting for a slow or blocking IO operation.
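MRI 1.8 does not itself provide fibers (the Fiber class arrived with Ruby 1.9), but a minimal sketch shows what explicit, cooperative yielding looks like:

# Cooperative multitasking with fibers (Ruby 1.9+): nothing runs until
# resumed, and the fiber must explicitly yield control back to its caller.
fiber = Fiber.new do
  puts "step 1"
  Fiber.yield        # hand control back to the caller
  puts "step 2"
end

fiber.resume         # prints "step 1", then returns at the yield
puts "caller is in control"
fiber.resume         # prints "step 2"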
Ruby concurrency isn’t quite as simple as selecting one of the above and using it, however. In the beginning, there was just Ruby, a single implementation that everyone used. This implementation, now commonly called the Matz Ruby Implementation (MRI), saw a widespread usage explosion with the 1.8.x version. It’s pretty old now. This is from the ftp://ruby-lang.org FTP server:
carbon:/home/ftp/pub/ruby/1.8$ ls -la | grep ruby-1.8.0
-rw-rw-r-- 1 root ftp 1979070 Aug 4 2003 ruby-1.8.0.tar.gz
So, it has been around for a while, and offers a good starting point for discussing concurrency in Ruby.
MRI Ruby 1.8.x supports concurrency in a few ways. One of the first things newcomers to Ruby reach for is its threads. Depending on the language these newcomers were familiar with before arriving at Ruby, they may be in for a surprise. MRI Ruby 1.8.x provides a green thread implementation. As mentioned above, green threads do not make use of any threading system native to the platform; 1.8.x’s threads are implemented within the interpreter itself. This means threads behave consistently across any platform the interpreter runs on. Because they are green threads, however, they offer no advantage for CPU bound tasks.
cpu_bound_threads.rb
require 'benchmark'

threads = []
thread_count = ARGV[0].to_i
iterations = ARGV[1].to_i
increment = iterations / thread_count.to_f
sum = 0

Benchmark.bm do |bm|
  bm.report do
    thread_count.times do |counter|
      threads << Thread.new do
        # Each thread sums its own slice of the full range.
        my_sum = 0
        queue = (1 + (increment * counter).to_i)..((increment * (counter + 1)).to_i)
        queue.each do |x|
          my_sum += x
        end
        # Stash the partial sum in thread-local storage for the parent to read.
        Thread.current[:sum] = my_sum
      end
    end
    threads.each {|thread| thread.join; sum += thread[:sum]}
    puts "The sum of #{iterations} is #{sum}"
  end
end
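If you want to reproduce a cell from the table below, the script takes the thread count and the iteration count as its arguments. For example, the 4 thread, 5000000 iteration run is invoked like this:

ruby cpu_bound_threads.rb 4 5000000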
This is a simple program that takes a large range of numbers, divides them into smaller ranges, and hands each smaller range to a thread that calculates the sum of the range it was given. The results from each individual thread are then added together to arrive at a final answer.
All examples ran on an 8 core Linux machine. The numbers below are an average of the results of 100 runs for each set of inputs; all times are in seconds.
Threads | 50000 iterations | 500000 iterations | 5000000 iterations |
---|---|---|---|
1 | 0.01730298 | 0.17149276 | 1.70610744 |
2 | 0.01724724 | 0.17179465 | 1.70557474 |
4 | 0.01729293 | 0.17181384 | 1.70570264 |
8 | 0.01741591 | 0.17210276 | 1.71201153 |
As the numbers demonstrate, MRI 1.8 threads are no help at all for a CPU bound application. In fact, there is a small but measurable cost to the overhead of managing them: as thread count increased, timing consistently slowed. If you are an MRI 1.8 user, do not despair; threads are but one concurrency option available to you.
An option that will better serve you for CPU bound tasks is process based concurrency. The idea is simple: in order to leverage multiple cores/CPUs, just create more than one process to handle the workload. Ruby provides a fork() method which, on platforms that support it, uses the underlying fork() call from the C standard library. It creates a new process, with a new process ID, that can be considered an exact copy of the parent process, except that its resource usage statistics are reset to zero.
Since processes do not share memory spaces, you must utilize another system provided communication mechanism in order to pass work to or from processes; this avoids the potential pitfalls that arise when trying to correctly manage locks on shared resources, but it does force one to think more specifically about exactly how to achieve communication.
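Before the full benchmark below, here is a minimal sketch of that pattern, reduced to one child and one pipe; the child’s computation is just filler:

# One child, one pipe: the child writes its result; the parent reads it back.
reader, writer = IO.pipe

pid = fork do
  reader.close                                      # the child only writes
  writer.puts((1..100).inject(0) {|s, n| s + n })   # stand-in for real work
  exit!
end

writer.close                                        # the parent only reads
puts "child says: #{reader.read.strip}"
Process.waitpid(pid)                                # reap the child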
cpu_bound_processes.rb
require 'benchmark'

processes = []
process_count = ARGV[0].to_i
iterations = ARGV[1].to_i
increment = iterations / process_count.to_f
sum = 0

# Fork a child to run the given block; the child serializes the block's
# result and writes it to a pipe before exiting.
def in_subprocess
  from_subprocess, to_parent = IO.pipe
  pid = fork do
    from_subprocess.close
    r = yield
    to_parent.puts [Marshal.dump(r)].pack("m")
    exit!
  end
  to_parent.close
  [pid, from_subprocess]
end

# Read the child's serialized result from the pipe, reap the child,
# and deserialize the result.
def get_result_from_subprocess(pid, from_subprocess)
  r = from_subprocess.read
  from_subprocess.close
  Process.waitpid(pid)
  Marshal.load(r.unpack("m")[0])
end

Benchmark.bm do |bm|
  bm.report do
    process_count.times do |counter|
      processes << in_subprocess do
        # Each child sums its own slice of the full range.
        my_sum = 0
        queue = (1 + (increment * counter).to_i)..((increment * (counter + 1)).to_i)
        queue.each do |x|
          my_sum += x
        end
        my_sum
      end
    end
    processes.each {|process| sum += get_result_from_subprocess(*process)}
    puts "The sum of #{iterations} is #{sum}"
  end
end
In this example I used IO pipes to get results from the children back into the master; the work itself is handed to each child through the block it inherits at fork time.
As earlier, testing was done on an 8 core Linux machine, with 100 runs of each test. The program is equivalent to the threaded version, changed only as necessary to use a multiprocess model instead of a multithreaded one.
Processes | 50000 iterations | 500000 iterations | 5000000 iterations |
---|---|---|---|
1 | 0.01805432 | 0.17199047 | 1.70812685 |
2 | 0.0098329 | 0.08675517 | 0.85509328 |
4 | 0.00609409 | 0.0446612 | 0.43100698 |
8 | 0.00847991 | 0.05346145 | 0.25621009 |
Take a good look at these numbers. Everything moves in the right direction until you get to the 8 process row, where the times for both the 50000 and 500000 iteration columns are slower than they were with 4 processes. Do you have any theories as to why?
Processes are, in many ways, a great way to handle concurrency. One of their drawbacks, though, is that they are heavy structures that can take significant time and resources to create. Linux uses copy-on-write semantics when forking: it doesn’t actually duplicate the address space of the forked process until pages in that space start changing, and then it duplicates only what changes. This means that forked processes on Linux can be created fairly quickly. However, MRI 1.8 is not very friendly to copy-on-write semantics.
If you are unfamiliar with the way memory is managed and garbage is collected in MRI 1.8, you should check out my article on MRI Memory Allocation. One key aspect is that objects carry all of their status bits with them. This means that when the garbage collector scans the object space to find objects it can collect, it touches every object in the address space. For a process forked with copy-on-write semantics, this forces the kernel to make copies of all of those pages. This takes time, and largely negates the fast-creation benefit of copy-on-write forked processes.
The times for the lower iteration counts on the 8 process test reveal a cost to this form of concurrency. When the work to be done is brief enough, the overhead of creating the forked processes overwhelms the performance gain from the division of labor. This is a reality for any form of concurrency – there is always a performance tax from some amount of overhead. That tax is just higher when spawning something heavy like a process. Keep this in mind when you explore concurrency options for your task.
These first two examples both represent CPU bound problems. Many real world problems are not CPU bound, though. Rather, they are IO bound issues. Because an IO bound problem has latencies imposed on it by something outside of the program itself, IO bound problems can provide an excellent case for using MRI 1.8’s green threads to improve performance.
io_bound_threads.rb
require 'net/http'
require 'thread'
require 'benchmark'

# Fetch a URL, retrying up to five times on error, and return the response body.
def get_data(url)
  tries = 0
  response = nil
  if /^http/.match(url)
    m = /^http:\/\/([^\/]*)(.*)/.match(url)
    site = m[1]
    path = m[2]
    begin
      http = Net::HTTP.new(site)
      http.open_timeout = 30
      http.start {|h| response = h.get(path)}
    rescue Exception
      tries += 1
      retry if tries < 5
    end
  end
  response.kind_of?(Array) ? response[1] : response.respond_to?(:body) ? response.body : ''
end

mutex = Mutex.new
signal = ConditionVariable.new
thread_count = ARGV[0].to_i
fetches = ARGV[1].to_i
url = ARGV[2]
threads = []
count = 0
active_threads = 0

Benchmark.bm do |bm|
  bm.report do
    while count < fetches
      # Spawn fetchers until all work is assigned or the pool is full.
      while count < fetches && active_threads < thread_count
        mutex.synchronize do
          active_threads += 1
          count += 1
        end
        Thread.new do
          get_data(url)
          mutex.synchronize do
            active_threads -= 1
            threads << Thread.current
            signal.signal
          end
        end
      end
      # If the pool is still full, sleep until a fetcher finishes. Checking
      # active_threads under the mutex avoids waiting for a signal that has
      # already been sent.
      mutex.synchronize do
        signal.wait(mutex) if active_threads == thread_count
      end
      # Join any fetchers that have finished.
      while th = threads.shift
        th.join
      end
    end
    # Wait out any fetchers still in flight after the last one was spawned,
    # so the benchmark accounts for all of the work.
    mutex.synchronize do
      signal.wait(mutex) while active_threads > 0
    end
    threads.each {|th| th.join}
  end
end
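A run for the 20 thread row of the table below looks like this (the URL here is only a placeholder; the actual test URLs were not published):

ruby io_bound_threads.rb 20 400 http://example.com/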
This script makes many HTTP requests. For simplicity’s sake, it just makes the same request over and over again, but it could easily be expanded to take a list of URLs and to do something useful with the returned data. The script uses threads much like the CPU bound example did, except that it is a bit more sophisticated in how it counts the work it assigns to the threads it spawns, and in how it waits for them all to complete.
This table shows timing from the script in action. The target URL was not local to the testing machine. Each run used the indicated number of threads to gather a URL 400 times: either a “fast” URL, with an over-the-net response speed of about 35 requests per second, or a “slow” URL, with an over-the-net response speed of about 3 requests per second. The numbers below are an average, in seconds, of 100 runs.
Threads | fast URL (35/sec) | slow URL (3/sec) |
---|---|---|
1 | 6.53462668 | 61.1016239 |
2 | 3.34861606 | 30.4514539 |
5 | 1.38942396 | 12.1620945 |
10 | 0.72804622 | 6.0968646 |
20 | 0.47964698 | 3.0411382 |
Just a glance at these numbers shows that Ruby threads are a big help with an IO bound activity like this. The relationship between thread count and time to complete the task is not linear, but even at 20 threads, additional threads still bring a significant benefit. The benefit is closer to linear, and more evident, for the slower requests because those requests spend more time waiting on IO and less on CPU bound activity.
There are some caveats to be aware of with regard to Ruby threads. First, even though they are green threads, as soon as you start sharing resources between threads, threading becomes hard to get right. Share as little as possible, think your code through thoroughly, and use tests to support your reasoning, because threading problems can be hard to diagnose and solve.
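To make the sharing hazard concrete, here is a minimal sketch (not from the benchmarks above) of the canonical case: several threads incrementing one shared counter, with a Mutex guarding the read-modify-write:

require 'thread'

counter = 0
lock = Mutex.new

threads = (1..10).map do
  Thread.new do
    1000.times do
      # Without the lock, the read-modify-write in counter += 1 can be
      # interleaved across a context switch and updates can be lost.
      lock.synchronize { counter += 1 }
    end
  end
end
threads.each {|t| t.join}
puts counter   # reliably 10000 with the lock in place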
Second, MRI 1.8 has a limit on the number of threads that it will manage. As a consequence of how the internals are implemented, total thread count on most systems (notably excluding win32 systems) is limited to 1024. Also, because of the way it is implemented, the overhead of managing threads grows with their number, and each thread consumes a significant amount of memory, so do not go crazy with threads or it will backfire on you.
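If you want a rough, Linux-specific sense of that per-thread memory cost, a sketch like this parks a batch of sleeping threads (staying under the 1024 cap) and reads the process’s resident set size from /proc; exact numbers will vary by platform and Ruby build:

# Park sleeping threads, then check resident memory (Linux-only sketch).
threads = (1..1000).map { Thread.new { sleep } }
puts File.read("/proc/#{Process.pid}/status")[/VmRSS:.*/]
threads.each {|t| t.kill}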
Third, because of the way that Ruby threading is implemented, it is possible for a C extension to take control of the process and prevent Ruby from allowing context switches to other threads. Extensions can be written so that they do not do this, but many are not. Where this bites most people is code that interacts with a database. One can reasonably view a database query as an IO bound activity – all the Ruby process is really doing is sending a request to the DB and waiting for a response. However, most DB interaction libraries are implemented as C extensions, and some of them do not play well with Ruby threads. One of the most common offenders is Mysql-Ruby: it blocks all of Ruby while waiting for the result of a query, so a long running query stalls the whole process until it returns. On the other hand, Ruby-PG, the driver for Postgres, will context switch within pgconn_block(), the function that makes blocking calls to the database, thus permitting other MRI 1.8 threads to run even during a long running query.
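A quick way to see the difference is to watch a heartbeat thread while a slow query runs. This sketch assumes the Mysql-Ruby driver, a reachable MySQL server, and hypothetical connection details:

require 'mysql'

# Print a tick every second from a background thread.
heartbeat = Thread.new do
  loop { puts "tick #{Time.now}"; sleep 1 }
end

db = Mysql.real_connect('localhost', 'user', 'password', 'test')  # hypothetical credentials
# With Mysql-Ruby, this C-level call blocks the whole interpreter, so the
# ticks stop until the query returns. With Ruby-PG, they would keep coming.
db.query("SELECT SLEEP(10)")
heartbeat.kill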
Fourth, because MRI 1.8 threads are green threads, they all run inside the context of a single process and a single system thread. While they give the appearance of concurrency, only one thread is actually running at any given moment. This is okay, because it is the appearance of concurrency that matters. If you run top on your laptop or VM shell, you will see a large number of processes on your system, far more than the number of cores you have, yet you rarely have to worry about which processes are actually running on a core at any given time. The kernel slices access to the CPU into fine enough grains that all the running processes appear to execute at the same time, even though most of them are not running at any given instant.

Concurrency in computing doesn’t strictly mean that two or more things are actually running at the same time. Rather, it means that they appear to be, that one works with them on the assumption that they are, and that the underlying scheduler deals with making reality fit that appearance.
An entire book could be written about concurrency in Ruby. I’ve just scratched the surface with this overview of process and thread based concurrency in Ruby. Hopefully this helped answer a few questions or suggested some techniques to consider.
Future installments in this series will cover Ruby 1.9.x (which uses system threads as opposed to green threads), JRuby, Rubinius, and using event systems like EventMachine to handle concurrency. So stay tuned! There is a lot more coming soon!