Performance Anxiety

Posted by Rick DeNatale Tue, 26 Sep 2006 19:17:00 GMT

I've been meaning to write about Ruby performance for a while, and a recent blog post by an old friend and colleague, got me off my proverbial.

The old friend is John Duimovich, who wrote about the relative performance of C++ and Smalltalk and what that could mean for ruby.

John's message is important for those who bemoan the performance of Ruby, and I plan to expand on that message in this and later posts to this blog, but first a few words about Mr. Duimovich.

Consider the source

In his day job, paraphrasing his self description John "works for IBM on Java virtual machines and is the lead on the Eclipse tools project management commitee."

But some of my readers might be interested in John's background. John was for a very long time, the lead of the Smalltalk and Java virtual machine team at Object Technology International (OTI) dating from before the time it was acquired by IBM. Among other things John was responsible for the development of embedded Smalltalk virtual machines from OTI, which spawned the VM used in Smalltalk/V Mac, IBM Smalltalk (used in IBM/VisualAge), the 'Universal' Virtual machine which implemented Java on an extended Smalltalk VM, and which was used for the early releases of IBM/VisualAge for Java, and the J9 Java VM. A good deal of what I know about implementing VMs comes from working, lunching, and bar-hopping with John.

John had become OTI's Chief Technology Officer before OTI got assimilated into the IBMborg.

John is a brilliant guy, with a great sense of humor. Two characteristics which seem to have been requirements for a job at OTI. I'm still not sure how I ended up spending several years there.

Dynamically Typed Doesn't Need to Mean Slow

I encourage you to read John's blog post yourself, but to summarize; John ran across another blog item which gave a benchmark written in C++, Ruby and Python. The C++ version runs in under 1/10 of the time needed for either the Ruby or Python versions.

John duplicated the results on his machine, then decided to port the Ruby version of the benchmark to Smalltalk. He then ran it using VisualAge Smalltalk.

And the Smalltalk version runs in the same time as the optimized C++ version!

How can this be?

The Value of Pole Vaulting

Languages like Smalltalk and Self started from the position that a clean object-oriented language was more important than one which makes compromises to make efficient implementation obvious.

Early implementations of Smalltalk used obvious implementations of some features, which were 'fast enough' in many cases, but by no means fast. Two areas which cried for improvement were method dispatch and garbage colection. The obvious techniques were walking up the class hierarchy each time a method was needed, and relatively easy to implement GC techniques like reference counting, and mark-and-sweep. Reference counting has a fairly high cost for each change of an object reference, and also has the drawback of leaking memory because cyclical references lead to garbage which is uncollectable. Mark and sweep delays the overhead until storage is exhausted, but leads to more perceptible pauses when the application gets paused so that the housemaid cleans the room.

Encountering (or having set) this high bar, various implementors of these languages found very clever techniques for both problems. Dave Ungar made measurements of the lifespans of Smalltalk objects and observed that most objects died very shortly after being instantiated, with few living a long life. This led to the invention of generational GC techniques, which quickly dispatched young dead objects, which are the vast majority.

Method dispatching techniques of efficiently implemented dynamically typed languages tend to use clever caching algorithms which can get to what is probably the right method quickly, with a quick test to make sure that the right method was found.

These dispatching techniques turn out to be faster than the virtual function pointer dispatching made possible by strongly-typed languages like C++. In fact, I've heard that more modern implementations of these languages have actually used a more dynamic method dispatch mechanism internally in order to increase performance.

Anoher implementation choice is how to represent executable code. Most efficient implementations use a combination of byte-code representation, and some form of just-in-time translation of byte-codes to machine code. Just how to divide execution between byte-code and machine code is a complicated decision. Back when DIgitalk first produced a version of Smalltalk/V for OS/2, they decided to eschew byte-codes entirely and generate 80286 machine code. The reason was that they were tired of hearing complaints about Smalltalk being an 'interpreted' language.

The surprising result of this experiment was that the resulting implementation was slower. Machine code was bigger, so it took longer to load, and caused more swapping. These costs were paid whether the code in question was executed once or a million times.

Again caching was the basis for getting the best of both worlds. Peter Deutsch of Xerox, later ParcPlace, had introduced the notion of translating byte-codes to machine code into a cache during execution, David Ungar's implementation of Self introduced the notion of using light-weight profiling techniques to avoid the overhhead of translating byte-coded methods which were infrequently executed.

Another area which posed difficulties in implementation was control flow. Smalltalk-80 defines all control flow as methods. Even primitive control flow constructs such as if (ifTrue: in Smalltalk) were implemented as methods on Boolean classes. This is one area where Smalltalk implementations cheated compiling such methods in to testing and branching byte-codes, and requiring the receivers to be boolean instances.

Self eschewed this early optimization. Ungar's team instead relied on run-time type inference in order to dynamicaly generate code which achieved the same or better performance when such a message was sent to a boolean without restricting other cases.

The Current State of Ruby Implementation

Ruby performance today is surprisingly acceptable for a wide range of uses.

This is despite the fact that the implementation is relatively straightforward, almost to the point of being naive. In the current standard implementation of Ruby:

  • Method dispatch is done by walking up the 'class' hierarchy looking for methods in a hash table in each class/module.
  • Garbage Collection is done by a simple mark and sweep algorithm.
  • Executable code is represented by a parse tree which is executed by traversal.

This is not meant to understate the achievements of Matz and the ruby developers. Ruby as it is definitely usable for many production uses.

The point is how much better Ruby performance can get as the implementation matures. A virtual machine, with byte-codes, and better GC is on the roadmap. Ruby virtual machines such as YARV, and JRuby are showing glimmers of the value of implementing Ruby as a virtual machine. If Ruby continues to grow in acceptance, I've no doubt that other clever implementers with experience in efficently implementing dynamically typed languages will provide more implementations.

My prediction is that the future will be so bright that we're going to have to wear (ruby colored) shades!


Trackbacks

Use the following link to trackback from your own site:
http://talklikeaduck.denhaven2.com/trackbacks?article_id=39

Comments

  1. christopher baus 2 days later:
    If you look closely at my C++ code, you'll notice that virtual function dispatch isn't used. Much C++ code these days uses compile time polymorphism (ie templates) which results in direct function calls which can even be inlined by optimizing compiliers. I suspect the day will come when Ruby and Python will perform as well as C++, but we aren't close yet. Even IronPython which runs on the Microsoft CLR, which is about the most modern VM available is still a magnitude slower than C++. But just as Ruby is getting faster, so is C++.
  2. Rick DeNatale 3 days later:
    But as John Duimovich points out, Smalltalk is already closer than spitting distance. The techniques for making dynamically typed languages run fast are well known, albeit by a rather small cadre of folks who have done it. There are a lot of other issues as well. Years ago I was involved with an IBM team which looked at Apple's Pink project about the time IBM bought into Apple and founded Taligent. The Pink team had all kinds of performance issues with C++. A lot of these had to do with memory management. Because C++ requires the programming team to manage deallocation, they built rules and patterns for who was responsible for freeing things. The result was that all that hand-coded memory management was a significant drag on system performance since it was slower than garbage collection, and made the code harder to write, understand, and maintain. And the tight binding of interface implementation between C++ components leads to long build times, because those dependencies mean that more needs to be re-built on a change, particularly when the build system only understands dependencies at the granularity of files. And templates don't help that. I'm happy for you that you seem to be happy with C++, I never was and never could be. I hope that in your case it's not an example of the Stockholm syndrome.
  3. christopher baus 5 days later:
    C++ does give the developer choice over which memory strategy to use. That was my point in my original list performance tests. To show how pool allocation could improve linked list performance. But I am no means a C++ bigot, and I am one of the people who tends to believe that dynamic languages will outperform static not only because of memory issues, but because of runtime cache optimizations.
  4. Wog 11 days later:
    Chris, if your no C++ bigot, then why is every post I read of yours in defence of C++ with the same disclaimer at the end? How about more explanatory comments in line with the last paragraph in your last post. I would find them far more helpful.
  5. Isaac Gouy 2 months later:
    The old friend is John Duimovich, who wrote about the relative performance... I noticed this in the comments: "This is a rather unfair comparison though, as it is comparing completely different data structures with a test that caters to the strengths of one. Python and Ruby's lists are implemented via vectors, not linked lists, so have O(n) time to remove the first element. Linked lists (with a reference to their end) are O(1) for both insert at start, and delete from end."
  6. Isaac Gouy 2 months later:
    The Computer Language Shootout Benchmarks includes several Smalltalk implementations as-well-as Ruby and YARV.
  7. Bob Bane 5 months later:
    Be careful what you wish for, guys. All of the clever, aggressive optimization techniques mentioned above have the side effect of making it more difficult to drop out of the optimized implementation into C libraries - better GC means more ways to screw yourself passing data up and down, compilation to native code means dynamic linking complications, method dispatch cacheing means potential problems with cache invalidation from the C side. How many of these optimizations will be accepted if the price for doing so is flushing/reimplementing popular libraries? One major reason Smalltalk and Lisp are non-mainstream languages is the dreaded phrase "foreign function call".
  8. Alan Friendly 5 months later:
    Don't forget to credit Allan Schiffman for the Smalltalk dynamic translation work. Allan and Peter Deutsch worked together to make Smalltalk one of the fastest interactive language/environments.
  9. Rick DeNatale 6 months later:

    To Bob Bane,

    No Smalltalk was for a time being used for lots of enterprise level programming. What took the wind out of its sails was the huge groundswell of Java popularity, not a lack of the ability to call C, which is something which both languages have.

    To Alan Friendly,

    Yes Alan and Peter were both early contributors to Smalltalk implementation. And the implementation techniques were improved over time by many folks including (off the top of my head) Dave Ungar at UC Berkeley, George Bosworth and his team at Digitalk, and John Duimovich and the OTI VM team.

  10. christopher baus 6 months later:

    This is now a very old post, but I want to respond to wog regarding C++ bigotry.

    I think most folks who know me would agree that I by no means a C++ bigot. It is possible to avoid virtual function dispatch in C++ and at times get increased performance. But that comes at a cost.

    • Difficult to read errors
    • Bigger object code (which can be slower)
    • Longer compiles
    • Type explosion

    C++ give you a lot of choice, but that isn’t always what developers want. I think languages like python are easy to develop in, but currently at a cost of performance.