This guide largely covers versions of Rails before 1.2. To see how the problem is handled now, see HowToUseUnicodeStrings.
Background
While Ruby has no native Unicode strings, you can store UTF-8 encoded data in its 8-bit strings. However, some of the String methods assume a single-byte encoding and therefore return wrong results. Besides, without the proper settings, input and output are handled as plain sequences of bytes, and you might get parse errors in literals.
Please note that right now Rails knows essentially nothing about Unicode and pretends everything is just bytes. This means that validates_length_of will trigger errors in the wrong places for multibyte characters, the various kinds of Unicode whitespace are not going to get trimmed, and sometimes Rails will cut right into your characters. The vast majority of Rails internals take no notice that multibyte text even exists; Rails just delegates everything to the Ruby string handling code (which in current Ruby is all single-byte).
This is being looked at, but in the meantime you use UTF-8 encoded strings at your own risk, and you can expect (and will get) bugs and problems :-)
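To make the byte-versus-character mismatch concrete, here is a minimal sketch. It relies only on String#unpack, and the variable names are illustrative rather than taken from Rails.

```ruby
# "Café" spelled with explicit UTF-8 bytes, so this file itself stays pure ASCII.
text = "Caf\xC3\xA9"

byte_count = text.unpack('C*').length  # number of bytes
char_count = text.unpack('U*').length  # number of Unicode codepoints

puts "bytes: #{byte_count}, characters: #{char_count}"

# Keeping only the first 4 bytes cuts the two-byte "é" in half and leaves
# a dangling lead byte (0xC3) at the end of the data.
first_four_bytes = text.unpack('C*')[0, 4]
puts first_four_bytes.inspect
```

Any code that counts or cuts by bytes (length validations, truncation) runs into exactly this difference.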
Fixing internals
To tune standard input and output, and to be able to enter Unicode literals in scripts, add the following code at the start of config/environment.rb (and don’t forget to restart the server for the change to take effect):
$KCODE = 'u'
require 'jcode'
This will give you correct input and output and correct UTF-8 “general” sorting. Alternatively you can change the “shebang line” in your dispatcher file (dispatch.cgi or dispatch.fcgi) to the following:
#!/path/to/ruby -Ku -rjcode
When using mod_ruby you can set KCODE for all scripts activated within a specific location by using the RubyKanjiCode directive:
<IfModule mod_ruby.c>
RubyKanjiCode u
</IfModule>
When using string functions, use the versions that jcode.rb (which is part of your Ruby installation) adds to the String class. For some functions the name differs from the original version, e.g. instead of foo.length() you now have to use foo.jlength().
Note, though, that most of the String methods still return wrong results. The methods not covered by jcode.rb include the following (every use of these methods in your application introduces a bug, sometimes including irreversible damage to strings):
String#reverse
String#size
String#index
String#[]
String#downcase
String#capitalize
String#strip, String#rstrip and String#lstrip
String#slice
Note that all of these are heavily used internally by Rails.
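As a rough illustration of what a character-aware replacement for two of these methods can look like, the sketch below works on codepoints instead of bytes. The helper names utf8_reverse and utf8_slice are hypothetical; they are not part of jcode.rb or Rails.

```ruby
# Hypothetical helpers: decode to codepoints, operate, encode back to UTF-8.
def utf8_reverse(str)
  str.unpack('U*').reverse.pack('U*')
end

def utf8_slice(str, start, length)
  (str.unpack('U*')[start, length] || []).pack('U*')
end

word = "Caf\xC3\xA9"             # "Café" as UTF-8 bytes

reversed = utf8_reverse(word)    # the "é" stays intact at the front
tail     = utf8_slice(word, 2, 2)

puts reversed
puts tail
```

A byte-wise reverse of the same string would swap the two bytes of "é" and produce invalid UTF-8, which is the kind of irreversible damage mentioned above.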
You can partially remedy these problems by using the routines from the Unicode library by Yoshida Masato, available as a gem.
Before you install it, make sure you have the Ruby development libraries for your platform (e.g. Debian needs apt-get install ruby1.8-dev).
Windows users might need unicode-0.1-mswin32.gem.
Then install the gem:
gem install unicode
After that the following functions will be available:
Unicode::downcase(string)
Unicode::upcase(string)
Unicode::normalize etc.
You can also try installing the unicode_hacks gem, which is available at http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/. (manfred said on Sep 23, 2006: We’ve moved the code from Julik’s repository over to the Fingertips servers since we refactored the code. You can find it at https://fngtps.com/svn/multibyte_for_rails/)
It gives you an accessor to address strings as characters explicitly:
"Some string".chars[0..2] # => "Som" but works for UTF8 too
"Cafe".chars.reverse # => "efaC" - works for UTF8 too
If you have the Unicode gem installed, it will also help you with tasks such as normalizing all the incoming data to form KC (so that you don’t store ligatures in your database and all the Unicode whitespace is stripped).
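To show what KC normalization does to a ligature, here is a small sketch. Note the assumption: it uses String#unicode_normalize, which ships with Ruby 2.2 and later, as a stand-in for the Unicode gem, so it does not run on the Ruby 1.8 setups this page describes.

```ruby
# Stand-in for the Unicode gem: String#unicode_normalize (Ruby 2.2+).
# U+FB01 is the single-codepoint "fi" ligature; 0x6E is a plain "n".
ligature_word = [0xFB01, 0x6E].pack('U*')

normalized = ligature_word.unicode_normalize(:nfkc)

puts ligature_word.unpack('U*').length  # 2 codepoints before normalization
puts normalized.unpack('U*').length     # 3 codepoints after: "f", "i", "n"
puts normalized
```

After normalization the stored text matches a plain-ASCII search for "fin", which is the practical reason to normalize before saving.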
Configuring the database connection
In Rails before 1.0
If you are using MySQL 4.1 or above, it is essential that you make Rails tell MySQL to SET NAMES UTF8. You should therefore use a modified version of the before_filter shown under “Configuring output correctly”, as follows:
class ApplicationController < ActionController::Base
  before_filter :configure_charsets

  def configure_charsets
    @headers["Content-Type"] = "text/html; charset=utf-8"
    suppress(ActiveRecord::StatementInvalid) do
      ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
    end
  end
end
For PostgreSQL this will be:
class ApplicationController < ActionController::Base
  before_filter :configure_charsets

  def configure_charsets
    @headers["Content-Type"] = "text/html; charset=utf-8"
    ActiveRecord::Base.connection.execute 'SET CLIENT_ENCODING TO UNICODE'
  end
end
In Rails 1.0 and above
You have a special setting in your database.yml at your disposal for setting the charset used by your database connection. It is called encoding. Example:
development:
  adapter: mysql
  database: example_development
  encoding: utf8
  username: root
  password:
This will be “utf8” for MySQL or “unicode” for PostgreSQL respectively (for names of character sets that you need to use see MySQL and PostgreSQL manuals – Rails just forwards this name to the database driver).
To use SQLite with UTF8 you need to compile your SQLite with UTF8 support built in; after that you don’t need any special settings in database.yml.
Being careful with the schema
When using Migrations and schema.rb, be careful and set up your database to create Unicode tables by default! The Ruby dump format does not preserve any charset specifics.
ALTER DATABASE foo_development CHARSET=utf8;
ALTER DATABASE foo_test CHARSET=utf8;
You have been hit by this problem if you are running your tests and all characters from the database come out as ”????”, while in development mode they look and work OK.
Running tests and storing fixtures
Older versions of Rails had some problems when loading Unicode-encoded fixtures from YAML; these problems have been fixed. Apart from that, use your normal workflow.
Note, however, that YAML is broken when you actually want to store your Unicode text back to a file: it is going to be converted into a binary base64-encoded string (but will be perfectly usable when you read the YAML file back in). This is a known missing feature of Syck, the Ruby YAML parser-serializer. Funnily enough, Syck supports UTF-8 on the Perl side.
Update: If you are having problems with YAML and Ruby 1.8.4 (you can read it in, but can’t read it out), get the latest version of Ruby (as in, the latest stable version of 1.8.4 out of CVS). There were bugs in YAML which have since been fixed. It works for us in the version marked 4-28-2006, but did not work for us with a December version of 1.8.4. Run ruby -v to check which date you have. -- crispynews
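The round trip described in this section can be checked with a few lines. This is only a sketch: what the dumped text looks like (readable characters or a base64 blob) depends on the YAML backend of your Ruby version, so the example compares bytes after reloading rather than inspecting the dump.

```ruby
require 'yaml'

original = "Caf\xC3\xA9"          # "Café" as UTF-8 bytes

dumped   = YAML.dump(original)     # may be plain text or a binary/base64 form
restored = YAML.load(dumped)

puts dumped
puts restored.unpack('C*') == original.unpack('C*')
```

If the last line prints false on your installation, you are most likely hitting one of the YAML bugs mentioned in the update above.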
Using helpers and ActionMailer
Some helper methods that work with text (such as truncate) will use Unicode-aware truncation if you have your $KCODE set to “UTF8”. Others will just damage your strings or give no result. Please file a bug in Trac if you have problems with helpers. ActionMailer should be set up using the charset option.
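For helpers that are not Unicode-aware, a character-based truncation can be sketched as follows. truncate_chars is a hypothetical helper written for this page, not the Rails truncate implementation; it splits the text with a UTF-8 regular expression so that a multibyte character is never cut in half.

```ruby
# Hypothetical helper: truncate by characters instead of bytes.
def truncate_chars(text, max_chars, omission = '...')
  chars = text.scan(/./mu)   # /u reads the string as UTF-8, /m lets . match newlines
  return text if chars.length <= max_chars
  chars[0, max_chars].join + omission
end

short = truncate_chars("Caf\xC3\xA9 au lait", 4)
puts short   # the "é" survives intact, followed by the omission marker
```

Cutting the same string after 4 bytes instead would end in the middle of "é".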
Configuring output correctly
Further you have to set the charset of your pages to UTF-8. You can do this by adding a before_filter to your ApplicationController:
class ApplicationController < ActionController::Base
  before_filter :set_charset

  def set_charset
    @headers["Content-Type"] = "text/html; charset=utf-8"
  end
end
NOTE: You do not want to do this if you’re using RJS templates. RJS templates need to use text/javascript not text/html as the response type. Read the HowtoSetDefaultEncoding section for more details and workarounds.
Apache Users: Tell Apache to return ‘utf-8’ using the AddDefaultCharset directive either in your .htaccess or httpd.conf file. Both Rails and Apache will return a Content-Type, but Apache’s will be first and most user agents seem to take the first over all others.
Also, don’t forget to include the proper meta tag in your rhtml master template, otherwise older browsers may not display UTF-8 encoded characters correctly (for example Safari 1.0 on Mac OS 10.2):
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
To make Unicode support work with the Ajax helpers and versions of Safari prior to 2.0 (which have a bug), you have to take care of it in an after_filter in your ApplicationController:
class ApplicationController < ActionController::Base
  after_filter :fix_unicode_for_safari

  # automatically and transparently fixes utf-8 bug
  # with Safari when using xmlhttp
  def fix_unicode_for_safari
    if @headers["Content-Type"] == "text/html; charset=utf-8" and
      @request.env['HTTP_USER_AGENT'].to_s.include? 'AppleWebKit' and request.xhr?
      # replace every character outside the \x00-\xa0 range with a hexadecimal entity
      @response.body = @response.body.gsub(/([^\x00-\xa0])/u) { |s| "&#x%x;" % $1.unpack('U')[0] }
    end
  end
end
This will convert all Unicode characters outside that range in the response body to hexadecimal numeric entities. Please note that if you send Javascript encoded like this it won’t be evaluated, so your best bet would be to advise your Safari users to upgrade.
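To see the escaping step in isolation, here is a sketch without the Rails filter around it. escape_non_ascii is a hypothetical name, and unlike the filter it escapes every codepoint above 127, which keeps the example independent of how the regular expression engine treats byte ranges.

```ruby
# Hypothetical standalone version of the escaping step.
def escape_non_ascii(body)
  body.unpack('U*').map { |codepoint|
    # ASCII passes through unchanged; everything else becomes &#x...;
    codepoint < 128 ? codepoint.chr : "&#x%x;" % codepoint
  }.join
end

puts escape_non_ascii("Caf\xC3\xA9")   # prints "Caf&#xe9;"
```

The result is pure ASCII, so it survives any charset confusion between server and browser, at the price of no longer being valid Javascript if the body was script code.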
—
I’m using Safari 2.0.3, and I’m facing this UTF-8/AJAX helper problem. I tried the fix_unicode_for_safari solution, but it didn’t work. The solution I found was to change the test
if @headers["Content-Type"] == "text/html; charset=utf-8"
to
if @headers["Content-Type"] == "text/javascript"
Another way to fix UTF-8 handling for AJAX in Safari is to prepend an XML prolog to your AJAX output (this will also prevent any JS code that was sent from being run):
def add_item
  render_text "<?xml version='1.0' encoding='utf-8'?>" + "<li>" + params[:newitem] + "</li>"
end
Configuring input
If you send your headers properly (and let the browser know that your site outputs pages in UTF8) all modern browsers will send you forms in UTF-8 automatically, so you don’t have to do any input conversions or define “accept-charset” on forms.
Converting between charsets
Use iconv.
require 'iconv'
# will convert from UTF8 to UTF16
Iconv.new('utf-16', 'utf-8').iconv(person.name)
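Iconv was later deprecated and then removed from Ruby's standard library (as of Ruby 2.0), so the snippet above only runs on older interpreters. As a sketch for newer Rubies, the same conversion can be written with String#encode, which exists from Ruby 1.9 on. The variable names are illustrative.

```ruby
# Ruby 1.9+ only: String#encode takes the place of Iconv here.
utf8_name = "Caf\xC3\xA9"                   # "Café" as UTF-8

utf16_name = utf8_name.encode('UTF-16BE')   # big-endian UTF-16, no byte order mark
round_trip = utf16_name.encode('UTF-8')

puts utf16_name.bytesize                    # 4 characters, 2 bytes each
puts round_trip == utf8_name
```

Converting back and comparing, as in the last line, is a cheap way to confirm that nothing was lost along the way.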
Configuring ActionWebService
Just let AWS know that you have iconv (otherwise the character detection in SOAP won’t work, and it will just presume it’s a dirty string and Base64-encode your strings):
require 'iconv'
Also ensure that $KCODE is properly set to “UTF8”.
If you have a MySQL setup that already contains data and you don’t want to drop your tables, you can instead use the mysql client to change the character sets.
From the command prompt:
baci:~/rails-app/mysql -u root -p
Enter password: *********
mysql>ALTER TABLE myTable CHARACTER SET utf8;
Query OK, 2404 rows affected (0.77 sec)
Records: 2404 Duplicates: 0 Warnings: 0
If you only want to convert a single column:
mysql>ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8;
This was taken from the MySQL Support Site.
This pertains to locales and is irrelevant here. -- Julik
If all String functions are broken, how reasonable is it to use unicode?
Just as reasonable as it is to work for someone who doesn’t have the luck of writing in Latin1—Julik
I just encountered a problem with a String containing German umlauts (ä, ö, ü) and the scan method of the String class. I wanted to tokenize the string on word boundaries and tried the following:
firstname, lastname = s.scan(/\w+/)
This does not work; scan generated a new token each time it came across an umlaut. So I changed the call slightly:
firstname, lastname = s.scan(/[^ ]+/)
This works as expected.
String#scan is Umlaut-aware for me on Ruby 1.8.4 when $KCODE is properly set to “u”.
Additional community links
For Nuby on Unicode: Unicode video presentation by Julik
If you want to know what exactly you can expect.