ASCII Characters

Many years ago, long before Snapchat, email was limited to English letters and a few extra characters. Now you can throw in characters from Chinese, Japanese, Korean, and practically any other writing system, plus emoji. What's going on behind the scenes is a shift from ASCII to Unicode and the internationalization of the internet.

ASCII was, more or less, the original way text was represented online. Because computers store binary data and turn it into human-readable text, a standard was needed to say which binary values correspond to which characters. That standard was ASCII (American Standard Code for Information Interchange), which dates to the 1960s and so predates the internet itself. In ASCII, each character is represented with 7 bits. That means there are 128 possibilities - upper case, lower case, digits, basic symbols, and a few other things such as control codes. Back in the day, data interchange was more constrained, so ASCII was a reasonable way to send representations of text. This was especially true because the early internet was largely an American and Western European effort.

[Figure: ASCII code chart]
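
To make the 7-bit idea concrete, here is a minimal Python sketch. The helper name ascii_bits is made up for this illustration; it simply prints the 7-bit pattern behind each character of an ASCII-only string.

```python
def ascii_bits(text: str) -> list[str]:
    """Return the 7-bit binary pattern for each character of an ASCII-only string.

    ascii_bits is a hypothetical helper written for this example.
    """
    data = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    # Every ASCII value is between 0 and 127, so 7 binary digits are enough.
    return [format(byte, "07b") for byte in data]


print(ascii_bits("Hi!"))  # ['1001000', '1101001', '0100001']
print(2 ** 7)             # 128 possible characters
```

Anything outside those 128 values, such as ç or 日, makes the encode call fail, which is exactly the limitation discussed next.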

ASCII couldn't handle much of the complexity required by a more sophisticated internet. UTF-8 (Unicode Transformation Format - 8 bit) was invented as a way to preserve interoperability with ASCII while adding a much more robust character representation. UTF-8 can encode 1,112,064 different code points - enough for basically every character in every language. The difference between the two is that ASCII is a simple mapping of 7 bits to one character, whereas UTF-8 is more complicated: each character takes a variable number of bytes (one to four, each byte being 8 bits), and the first byte of a character tells the decoder how many more bytes belong to it. The 128 ASCII characters are encoded as the same single bytes in UTF-8, which is what preserves the interoperability.
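
The variable-length scheme can be sketched in a few lines of Python. The function utf8_sequence_length is a hypothetical helper written for this example, and it is deliberately simplified: it only inspects the leading bits of the first byte and does not reject every invalid byte value the way a real decoder would.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged from its first byte.

    Simplified sketch: checks only the leading bit pattern of the first byte.
    """
    if lead_byte & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: same as ASCII
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:  # 110xxxxx
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:  # 1110xxxx
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError("not a valid UTF-8 lead byte")


for ch in ["A", "ç", "日", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), utf8_sequence_length(encoded[0]))
```

The loop shows 1, 2, 3, and 4 bytes respectively, and in each case the length read from the first byte matches the actual encoded length.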

So when your browser sends a request for a web page, the server will respond with a header including something like the following:

Content-Type:text/html; charset=UTF-8

This means that after the headers, there will be a stream of bytes that the browser must decode as UTF-8, instead of ASCII or some other encoding. All HTTP headers, ironically, are in ASCII. UTF-8 now accounts for about 88% of all web page encodings, whereas 10 years ago ASCII had a similar share of the web.
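
As a rough illustration of what the receiving side does with that header, the Python sketch below pulls the charset out of a Content-Type value and uses it to decode the body. The function names and the UTF-8 fallback are assumptions made for this example, not a description of how any particular browser works.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    The default of "utf-8" is an assumption chosen for this sketch.
    """
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None if absent.
    return message.get_content_charset() or default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode the response body using the charset named in the header."""
    return body.decode(charset_from_content_type(content_type))


body = "こんにちは".encode("utf-8")  # the bytes that follow the headers
print(decode_body(body, "text/html; charset=UTF-8"))  # こんにちは
```

Decoding the same bytes as ASCII would raise a UnicodeDecodeError, because every byte of the Japanese text is above 127.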

And as the web gets internationalized, so too do ads. Even French with its ç's and German with its ü's need something more robust than ASCII. The transition from ASCII to UTF-8 has been very important to the internationalization of the web and of ads, and it's the same UTF-8 that underlies our support for internationalization.

When you ask whether TripleLift ads support, for example, Japanese characters, you're actually asking a number of questions:

  1. Does our implementation of the OpenRTB protocol support UTF-8? Yes.
  2. Do our databases store text as UTF-8? Yes, where necessary. But if we know 100% that we will only need English characters - and there will be a lot of text - then ASCII is more efficient.
  3. Do we respond with a UTF-8 encoding for our ads? Yes.
  4. Do browsers support rendering Japanese characters - meaning do they support UTF-8? Yes: all modern browsers natively support UTF-8.

So yes, end-to-end we support all the characters supported by UTF-8. One of the challenges has been ensuring that we use UTF-8 at every step of the way - and that no step supports only ASCII.
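
The "every step of the way" point can be sketched with a toy pipeline in Python. The helper passes_through and the three-step lists are purely illustrative assumptions and do not describe the actual TripleLift systems; they just show how one ASCII-only step breaks text that the other steps handle fine.

```python
def passes_through(text: str, encodings: list[str]) -> bool:
    """Return True if text survives an encode/decode round trip at every step.

    passes_through is a hypothetical helper; each entry in encodings stands
    for the encoding used by one step (e.g. request, storage, response).
    """
    for encoding in encodings:
        try:
            text = text.encode(encoding).decode(encoding)
        except UnicodeError:
            return False  # this step cannot represent the text
    return True


print(passes_through("日本語", ["utf-8", "utf-8", "utf-8"]))  # True
print(passes_through("日本語", ["utf-8", "ascii", "utf-8"]))  # False
print(passes_through("hello", ["utf-8", "ascii", "utf-8"]))   # True
```

The last line is why this kind of problem is easy to miss: English-only test data passes through an ASCII-only step without complaint.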