Adding Support for Foreign Characters to Django

This post explains the changes I needed to make to a Python Django 1.8 website in order to add Unicode (special character) support.

The Situation

I’ve been helping run a Django 1.8 website, and we’ve been getting tired of errors every time a user enters accents, foreign language characters, or curly quotes. Errors like
DjangoUnicodeDecodeError at /admin/registration/student/18004/change/
or
'ascii' codec can't decode byte 0xf0 in position 30: ordinal not in range(128). You passed in (<class 'website.models.SentEmail'>)
The root of the problem was that our site wasn’t set up to handle Unicode characters; it only handled ASCII ones. Adding support for Unicode, specifically UTF-8, was a bit complicated, so here are my notes.
If you want to know what I did, read on. But chances are you should take this opportunity to actually come to understand what Unicode and character sets are all about by reading this. It was a topic that I found daunting for a long time, but it helps to finally straighten it out.
Anyway, here’s how I got my Django 1.8 project ready for Unicode…

Make Sure Software is Current Enough

Unicode support seems to have become more common in the last 10 years or so, and older software doesn’t support it terribly well. So you’ll need to be on Django version 1.5 or higher (that’s what I understood to be the cutoff for solid Unicode support), and MySQL 5.5.3 or higher (that’s when they added the utf8mb4 character set and the collation you’ll want). I was on Python 2, which has a separate unicode type but treats plain string literals as byte strings by default. If you’re using Python 3, strings are Unicode by default, so chances are you won’t need most of the code changes below.

Database Changes

This will get your database ready for all sorts of crazy Unicode, including emojis.😎 Mostly this involves setting the collation.

Backup

Create a database backup, in case you totally break everything somehow.

MySQL Collation

Change the database collation to utf8mb4_unicode_ci. That supports all foreign characters, and even emojis, does case-insensitive comparisons, and knows how to handle comparisons in foreign languages. Notice that wasn’t utf8_unicode_ci (because MySQL’s utf8 character set stores at most 3 bytes per character, so it can’t hold some Unicode characters, like emojis), nor was it utf8mb4_general_ci (because although it’s a bit faster, it doesn’t order and compare some characters properly), nor was it utf8mb4_bin (because that makes database comparisons case-sensitive, which is usually annoying and not what you want).
Here are the queries I ran on my database to make the change:
ALTER DATABASE {database} CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
That sets the default for any tables created from then on, but existing tables keep their old character set and collation, so I still had to run
ALTER TABLE {database}.{table} CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
on each table. That was a bit tedious, but it really only took 5 minutes.
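
If typing the per-table statement by hand gets old, the statements can be generated instead. The following is only a sketch of that idea: the function name, the database name, and the table names are made up for illustration, and you would supply your own table list (for example from SHOW TABLES). It only builds the SQL text; it does not connect to or modify a database.

```python
def build_conversion_statements(database, tables,
                                charset="utf8mb4",
                                collation="utf8mb4_unicode_ci"):
    """Return the ALTER statements for one database and its tables.

    Hypothetical helper: it only formats SQL strings, so review the
    output before running it against a real (backed-up) database.
    """
    statements = [
        "ALTER DATABASE `{db}` CHARACTER SET {cs} COLLATE {co};".format(
            db=database, cs=charset, co=collation)
    ]
    for table in tables:
        statements.append(
            "ALTER TABLE `{db}`.`{tbl}` CONVERT TO CHARACTER SET {cs} "
            "COLLATE {co};".format(
                db=database, tbl=table, cs=charset, co=collation))
    return statements


if __name__ == "__main__":
    # Example database and table names are placeholders.
    for sql in build_conversion_statements(
            "mysite", ["registration_student", "website_sentemail"]):
        print(sql)
```

The printed statements can be pasted into the MySQL client, which keeps the actual schema change a deliberate manual step.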

Django Database Config

You also need to tell Django what character set the database is using, so it knows what to pass into it. So in my settings.py file, I needed to change the DATABASES declaration from

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'database-name',
        'USER': 'database-user',
        'PASSWORD': 'database-password',
        'HOST': '127.0.0.1',
        'PORT': '3306',
    }
}

to

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'database-name',
        'USER': 'database-user',
        'PASSWORD': 'database-password',
        'HOST': '127.0.0.1',
        'PORT': '3306',
        'OPTIONS': {'charset': 'utf8mb4'},
    }
}

Notice the addition of “OPTIONS” at the very end there, instructing Django that the database uses the utf8mb4 character set.

Code Changes

It seems support for Unicode and foreign language characters is one of the major improvements in Python 3. However, our site is on Python 2.7, and so it took a bit of work to get it into shape.

Use Unicode Strings, not Bytestrings

In Django, there are basically two types of strings: unicode strings, and bytestrings. In Python 2 unicode strings look like u"unicode string" and bytestrings look like "byte string". The Django docs explain the difference pretty succinctly. But the gist is this: you should be adding those annoying little “u”s basically everywhere to make the strings unicode strings, which makes them support Unicode characters. Pretty annoying. But there’s a shortcut you can do instead…
If you’re on Python 2, add
from __future__ import unicode_literals
to the top of all .py files in your project, if it’s not already there. This makes it so all the strings you use in that file will be Unicode strings (which support foreign characters) instead of byte strings (which don’t).
It’s possible this will introduce some errors if there were functions that specifically expected a byte string instead of a Unicode string. If so, you can add a “b” onto the front of the problem strings to make them byte strings again (eg b"byte string again").
And yes, you need to add it to the top of ALL .py files. Sorry. But that’s better than needing to rewrite all strings to begin with a “u” to make them Unicode strings.
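
To find out which files still need the import, a small script can walk the project and list them. This is a sketch rather than anything from my actual project: the function names are invented for illustration, it assumes the source files are UTF-8, and it only reports files without editing them.

```python
import io
import os


def has_unicode_literals(source):
    """Return True if some line imports unicode_literals from __future__."""
    for line in source.splitlines():
        stripped = line.strip()
        if (stripped.startswith("from __future__ import")
                and "unicode_literals" in stripped):
            return True
    return False


def files_missing_import(project_root):
    """Yield paths of non-empty .py files under project_root lacking the import."""
    for dirpath, _dirnames, filenames in os.walk(project_root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            # Assumption: source files are UTF-8 encoded.
            with io.open(path, encoding="utf-8") as handle:
                source = handle.read()
            # Empty files (such as bare __init__.py) have no strings to fix.
            if source.strip() and not has_unicode_literals(source):
                yield path


if __name__ == "__main__":
    for path in files_missing_import("."):
        print(path)
```

Run from the project root, it prints one path per file that still needs attention.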

Cast to Unicode, Not to ByteStrings

I also found I had to remove uses of the str() function, as that specifically creates byte strings. Its unicode-safe alternative is unicode(), so that was mostly just a search-and-replace.
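
Where a plain search-and-replace isn’t enough (for example when a value might already be a byte string), a tiny helper can do the cast. This is a sketch under my own assumptions: the to_text name is made up, and it assumes byte strings are UTF-8 encoded. Django’s own django.utils.encoding.force_text covers similar ground.

```python
try:
    text_type = unicode  # Python 2: the separate Unicode string type
except NameError:
    text_type = str  # Python 3: str is already Unicode


def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string.

    Byte strings are decoded (assumed to be UTF-8 unless told otherwise);
    anything else goes through the Unicode constructor instead of str().
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)
```

For instance, to_text(18004) gives the text "18004", and to_text(b"caf\xc3\xa9") gives "café" as a Unicode string rather than raising a decode error later on.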

Keep Unicode Out of Filenames

Maybe someone can correct me here, but I found I needed to avoid having Unicode characters in filenames, as that would create errors. I did that by running filenames through django.utils.text.slugify.
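
To show roughly what that does to a filename, here is a standard-library sketch that approximates slugify-style ASCII folding while keeping the file extension. It is a stand-in, not Django’s implementation: the ascii_slug and safe_filename names and the "file" fallback are my own inventions for illustration.

```python
import os
import re
import unicodedata


def ascii_slug(text):
    """Fold text to lowercase ASCII, with hyphens in place of whitespace."""
    # Split accented letters into base letter + accent, then drop non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s-]", "", text).strip().lower()
    return re.sub(r"[-\s]+", "-", text)


def safe_filename(filename):
    """Return an ASCII-only version of filename, preserving its extension."""
    stem, ext = os.path.splitext(filename)
    # Fall back to a placeholder if nothing survives (e.g. an all-emoji name).
    safe_stem = ascii_slug(stem) or "file"
    safe_ext = ascii_slug(ext[1:])
    return safe_stem + ("." + safe_ext if safe_ext else "")
```

With this, a name like "Résumé Final.PDF" becomes "resume-final.pdf". Slugifying the whole name in one go would also strip the dot, which is why the extension is handled separately here.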

Writing To Unicode Files

In a few spots we write to some files. Well, the standard open function used for opening files automatically casts it to a bytestring, which will have an error if the string has Unicode in it. You’re better off using io.open. How to write to unicode files is mentioned here, and the details of io.open are explained here.
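
A minimal sketch of the io.open approach, assuming you want the files stored as UTF-8 (the helper names are made up for illustration):

```python
import io


def write_text(path, text):
    """Write a Unicode string to path, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text(path):
    """Read path back as a Unicode string, decoding it as UTF-8."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

Because the encoding is stated explicitly, the text is encoded on the way out and decoded on the way in, instead of Python 2 trying (and failing) to squeeze it through ASCII. On Python 2, io.open in text mode expects unicode strings, which is another reason to have unicode_literals in place.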

Send the Right Content Type

If you’re sending the Content-Type HTTP response header, or the HTML http-equiv meta tag to simulate the HTTP response header, make sure it’s setting the content type to text/html; charset=utf-8. For some HTML files that we were converting to PDF files using wkhtmltopdf, we were sending the wrong content type, which caused the non-ASCII characters to look like random characters. Once I set the right content type, the characters appeared correctly.
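
For HTML that gets handed to a converter rather than served over HTTP, the meta tag is the only place to declare the charset. Here is a sketch of wrapping a rendered fragment in a document that declares UTF-8; the wrap_html name and the page skeleton are assumptions for illustration, not our real templates.

```python
CONTENT_TYPE = "text/html; charset=utf-8"


def wrap_html(body, title=u""):
    """Wrap an HTML fragment in a minimal document that declares UTF-8."""
    return (
        u"<!DOCTYPE html>\n"
        u"<html>\n<head>\n"
        u'<meta http-equiv="Content-Type" content="{content_type}">\n'
        u"<title>{title}</title>\n"
        u"</head>\n<body>\n{body}\n</body>\n</html>\n"
    ).format(content_type=CONTENT_TYPE, title=title, body=body)
```

When the same HTML is returned from a Django view, the matching step is passing content_type="text/html; charset=utf-8" to HttpResponse. In both cases the bytes actually written or sent need to be UTF-8 encoded so they agree with the declaration.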

Emails Generated using Django’s Template Rendering System Don’t Need to be Escaped

Our system uses Django’s template rendering system to create dynamic emails. After adding support for Unicode everywhere else, I noticed that some characters were appearing strangely in the emails we were generating.
It turned out this was because our emails were being sent as plaintext (i.e., not HTML), but the Django template system assumes its output is HTML and so it escapes it. So an email body like “Mike's Email Body” got turned into “Mike&#39;s Email Body”, and displayed like that to its recipients.
The fix was: after we used Django’s template system to render the email body, we then just needed to unescape it.
Ie, I took

from django.template import Template, Context
template = Template(self.template)
context = Context(context)
html_body = template.render(context)

which creates escaped HTML content, and added

from six.moves.html_parser import HTMLParser
h = HTMLParser()
plaintext_content = h.unescape(html_body)

to create the plaintext email body, where special characters weren’t escaped.
This was way less work than marking each template variable as “safe”. Unescaping would be a bad idea if the content were being rendered as HTML, because it would allow template variables to inject harmful JavaScript or HTML tags. But because email clients were treating the content as plaintext, they weren’t going to interpret it as HTML or JavaScript. So, no problem.
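
For anyone without six installed, or on a newer Python where the parser’s unescape method is no longer available, here is a sketch of the same unescaping step using only the standard library. The sample string is just an illustration.

```python
try:
    from html import unescape  # Python 3.4 and later
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape


def to_plaintext(rendered_body):
    """Turn HTML entities in a rendered template back into characters.

    Only appropriate when the result is sent as plaintext, never as HTML.
    """
    return unescape(rendered_body)


if __name__ == "__main__":
    print(to_plaintext(u"Mike&#39;s Email Body"))
```

The to_plaintext wrapper is a made-up name; it exists only to give the step a label at the point where the email body is built.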

Test Everything

After your database and code are theoretically ready, it’s time to retest basically everything.
Tell me your questions or suggestions!
