Safely dividing a UTF-8 String in Ruby

Posted by Rick DeNatale Thu, 28 May 2009 16:17:00 GMT

The other day, someone brought up a UTF-8 related issue with RiCal.

RFC 2445 specifies that each line of an iCalendar data stream should be no longer than 75 bytes (octets), excluding the line break. Longer lines need to be folded: the line is broken into sections, and the second and following sections go on lines of their own, each starting with a space character that marks it as a continuation line. As was pointed out to me, simply breaking a UTF-8 string at a byte offset in Ruby runs the risk of splitting a multi-byte character.
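
To make the risk concrete, here is a small sketch of what a naive byte-level cut does. It assumes Ruby 1.9.3 or later, where String#bytesize, String#byteslice and String#valid_encoding? are available; the variable names are only for illustration.

```ruby
# "Café" is 4 characters but 5 bytes in UTF-8, because "é" takes 2 bytes.
word = "Café"

# Cutting after 4 bytes lands between the two bytes of "é".
naive = word.byteslice(0, 4)

puts word.bytesize                          # 5
puts naive.valid_encoding?                  # false: ends with half a character
puts word.byteslice(0, 3).valid_encoding?   # true: "Caf" ends on a boundary
```

The first piece of the naive cut ends with a lone lead byte, which is exactly the situation the folding code has to avoid.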

Here's a spec to show what I needed:

describe "String#utf8_safe_split" do
  context "For an all-ascii string" do
    before(:each) do
      @it = "abcdef"
    end

    it "should properly split an ascii string when n leaves 1 character" do
      @it.utf8_safe_split(5).should == ["abcde", "f"]
    end

    it "should return a nil remainder if the string has fewer than n characters" do
      @it.utf8_safe_split(7).should == ["abcdef", nil]
    end
    
    it "should return a nil remainder if the string has exactly n characters" do
      @it.utf8_safe_split(6).should == ["abcdef", nil]
    end
  end
  
  context "For a string containing a 2-byte UTF-8 character" do
    before(:each) do
      @it = "Café"
    end


    it "should split properly just before the 2-byte character" do
      @it.utf8_safe_split(3).should == ["Caf", "é"]
    end

    it "should split before when n is at the start of the 2-byte character" do
      @it.utf8_safe_split(4).should == ["Caf", "é"]
    end

    it "should split after when n is at the second byte of a 2-byte character" do
      @it.utf8_safe_split(5).should == ["Café", nil]
    end
  end
  
  context "For a string containing a 3-byte UTF-8 character" do
    before(:each) do
      @it = "Prix €200"
    end


    it "should split properly just before the 3-byte character" do
      @it.utf8_safe_split(5).should == ["Prix ", "€200"]
    end

    it "should split before when n is at the start of the 3-byte character" do
      @it.utf8_safe_split(6).should == ["Prix ", "€200"]
    end

    it "should split before when n is at the second byte of a 3-byte character" do
      @it.utf8_safe_split(7).should == ["Prix ", "€200"]
    end

    it "should split after when n is at the third byte of a 3-byte character" do
      @it.utf8_safe_split(8).should == ["Prix €", "200"]
    end
  end
  
end

To fix this I came up with a pretty simple idea: split the string, check whether the second part is valid UTF-8, and back up one byte at a time until it is:

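class String
  # Truthy if the string starts with a well-formed UTF-8 character.
  # Relies on unpack("U") raising on a malformed sequence.
  def valid_utf8?
    unpack("U") rescue nil
  end

  # Returns [before, after], where before is at most n bytes long and
  # no multi-byte character is split. after is nil if nothing is left.
  # Written for Ruby 1.8, where length and [] work in bytes.
  def utf8_safe_split(n)
    if length <= n
      [self, nil]
    else
      before = self[0, n]
      after = self[n..-1]
      # Move the split point back until after starts on a character boundary.
      until after.valid_utf8?
        n -= 1
        before = self[0, n]
        after = self[n..-1]
      end
      [before, after.empty? ? nil : after]
    end
  end
end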

In RiCal, I actually implemented this as functional methods on another object, since I didn't want to 'pollute' String's instance methods, but the code here illustrates the basic idea.
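
For readers on Ruby 1.9.3 or later, where String#length counts characters rather than bytes, the same idea might be sketched as module functions that work on byte offsets. The module and method names below are hypothetical and are not RiCal's actual API, the input is assumed to be valid UTF-8, and the way the leading space is counted against the 75-byte limit is my reading of the folding rule rather than something taken from RiCal.

```ruby
# Sketch only: Utf8Folding, safe_split and fold are hypothetical names.
module Utf8Folding
  module_function

  # Split string into [head, tail] so that head is at most n bytes long and
  # no multi-byte character is cut in half. tail is nil when nothing is left.
  def safe_split(string, n)
    return [string, nil] if string.bytesize <= n

    # Back up until the tail no longer starts in the middle of a character.
    n -= 1 until string.byteslice(n..-1).valid_encoding?

    [string.byteslice(0, n), string.byteslice(n..-1)]
  end

  # Fold line into physical lines of at most limit bytes each.
  # Continuation lines start with one space, which counts toward the limit.
  def fold(line, limit = 75)
    head, rest = safe_split(line, limit)
    folded = [head]
    while rest
      head, rest = safe_split(rest, limit - 1)
      folded << " #{head}"
    end
    folded
  end
end

p Utf8Folding.safe_split("Prix €200", 7)   # ["Prix ", "€200"]
```

Keeping the methods in a module leaves String untouched, which is the same motivation as above.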


Comments

  1. Jan M about 1 hour later:

    nice solution. Here is what I came up with (I couldn’t get markdown to display ruby code here)

  2. Laurens Holst about 6 hours later:

    Maybe a better approach is to check the first byte of the string after the split:

    if (first_byte < 0x80 || first_byte >= 0xC0) return ok else return bad

    In other words, don’t break the line before a byte in the range 0x80-0xBF (a UTF-8 continuation byte).

    ~Laurens

  3. Laurens Holst about 6 hours later:

    I don’t know Ruby, but something like this:

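      def utf8_safe_split(n)
        if length <= n
          [self, nil]
        else
          # Back up until position n is not in the middle of a character.
          until utf8_safe_split_at_character?(n)
            n = n - 1
          end
          before = self[0, n]
          after = self[n..-1]
          [before, after.empty? ? nil : after]
        end
      end

      # True unless byte n is a UTF-8 continuation byte (0x80..0xBF).
      # Relies on Ruby 1.8, where self[n] returns the byte value as an integer.
      def utf8_safe_split_at_character?(n)
        self[n] < 0x80 || self[n] >= 0xC0
      end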
    

    Much more efficient than repeatedly splitting and looping over the entire string.

  4. Laurens Holst about 17 hours later:

    I blogged my previous comments:

    http://www.grauw.nl/blog/entry/521

    I really should make a trackback script btw.

  5. Rick DeNatale 1 day later:

    Laurens,

    Actually my code isn’t as inefficient as you think. Ruby uses copy-on-write semantics for strings, so using the slice method (a.k.a. []) doesn’t do anything but make a string pointing to the right bytes in the original string. Nothing is copied.

    Now, as it turns out, checking the character value is a bit more efficient than calling unpack. I’d actually considered that, but I don’t like the ‘magic’ number, and I think that unpack is clearer and puts the burden of understanding UTF-8 on the standard library.

    That said, I realize that I’ve introduced a Ruby 1.9 incompatibility here. Ruby 1.9 has nicer support for Unicode, but that nicer support actually makes it harder to meet the requirements of the RFC 2445 spec, which limit the maximum line length to a certain number of octets, NOT characters. So in order to check the limits I guess I’ll have to somehow unmask the underlying representation which 1.9 is hiding. Hmmmmmm.

  6. Laurens Holst 2 days later:

    Maybe Ruby 1.9 allows you to ‘encode’ its internal Unicode representation to UTF-8 bytes? That doesn’t seem unlikely; if you want to, e.g., send data over a raw TCP socket, you need that kind of functionality. Worst case, you could look at the code point value at each index and add to a counter (>= 128 is 2 bytes, >= 2048 is 3 bytes, etc.).
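
The byte-level ideas from this thread can be sketched in later Ruby versions, which expose the underlying bytes through String#getbyte, String#bytesize and String#byteslice. This is only an illustration, assuming Ruby 1.9.3 or later and valid UTF-8 input; the method names are hypothetical and come from neither RiCal nor the comments above.

```ruby
# Sketch only: all three method names are made up for this example.

# True when byte offset n does not fall on a UTF-8 continuation byte.
# Continuation bytes are 0x80..0xBF, i.e. bit pattern 10xxxxxx.
def utf8_boundary?(string, n)
  byte = string.getbyte(n)
  byte.nil? || byte < 0x80 || byte >= 0xC0
end

# Split at the last character boundary at or before byte offset n.
def byte_safe_split(string, n)
  return [string, nil] if string.bytesize <= n

  n -= 1 until utf8_boundary?(string, n)
  [string.byteslice(0, n), string.byteslice(n..-1)]
end

# UTF-8 length computed from code point values, as a cross-check
# against String#bytesize.
def utf8_length(string)
  string.each_codepoint.inject(0) do |total, cp|
    width = if cp < 0x80 then 1
            elsif cp < 0x800 then 2
            elsif cp < 0x10000 then 3
            else 4
            end
    total + width
  end
end

p byte_safe_split("Café", 4)       # ["Caf", "é"]
p utf8_length("Prix €200")         # 11
```

In practice String#bytesize already gives the octet count directly, so the code point counter is mainly useful for checking one's understanding of the encoding.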