I was recently on the ASP.NET Forums and a member was asking, "How can I figure out the encoding of text?" and that got me thinking. There should be a reasonable way to do this, right? It's a useful thing to know. First, we need a little background on how text is encoded into bytes.
Long ago, back when 64K of memory was a big deal, characters took up a single byte. A byte ranges from 0 - 255, which allows us to support a total of 256 characters. Seems like plenty, no? English has 26, 52 for both cases, 62 with numbers, 92 with punctuation, and a few extra for line breaks, carriage returns, and tabs. So about 100, give or take a few. So what's the problem?
Well, this worked great and all, but other languages use different characters. The Russian alphabet alone, written in Cyrillic, has 33 letters. This is where encodings came in. To support multiple character sets, the meaning of each byte was determined by the encoding in use, so the same byte value could stand for different characters. Reading the text correctly simply meant knowing which encoding had been used.
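To make that concrete, here is a small sketch that decodes the same single byte under two different code pages. It assumes a modern .NET runtime (.NET Core 3.0 or later), where the Windows code pages have to be registered through CodePagesEncodingProvider before they can be used; the variable names are just for illustration.

```csharp
using System;
using System.Text;

// Assumption: on .NET Core / .NET 5+, code pages such as 1251 and 1252 are
// only available after registering this provider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

byte[] raw = { 0xE0 };

// The same byte decodes to a different character under each encoding.
string asWestern = Encoding.GetEncoding(1252).GetString(raw);   // Latin "à" (U+00E0)
string asCyrillic = Encoding.GetEncoding(1251).GetString(raw);  // Cyrillic "а" (U+0430)

// Prints: 1252: U+00E0, 1251: U+0430
Console.WriteLine($"1252: U+{(int)asWestern[0]:X4}, 1251: U+{(int)asCyrillic[0]:X4}");
```

Nothing in the byte itself says which reading is right, which is exactly the problem the rest of this post is about.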
In today's world, where the average calculator has more memory than PCs did long ago, we now also use 2-byte encodings. That means that we can support 256 to the second power characters, or 65,536. That is enough to support all languages in a single encoding, even though it takes up double the space. Problem solved, right? Not exactly.
While in this day and age we support double-byte encodings, there are still other factors involved, such as endianness: the order in which the two bytes of each character are stored. Big-endian puts the most significant byte first; little-endian puts it last. Even then, there is still a lot of legacy single-byte data to support.
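Here is a quick sketch of what endianness looks like in practice. It encodes one character with .NET's two UTF-16 encodings, Encoding.Unicode (little-endian) and Encoding.BigEndianUnicode, and prints the resulting bytes. The character is an arbitrary one I picked because both of its bytes are non-zero.

```csharp
using System;
using System.Text;

// U+4E2D is made up of the two bytes 0x4E and 0x2D.
string text = "\u4E2D";

byte[] littleEndian = Encoding.Unicode.GetBytes(text);        // UTF-16 LE
byte[] bigEndian = Encoding.BigEndianUnicode.GetBytes(text);  // UTF-16 BE

Console.WriteLine(BitConverter.ToString(littleEndian)); // 2D-4E
Console.WriteLine(BitConverter.ToString(bigEndian));    // 4E-2D
```

Same character, same two bytes, opposite order. Decode with the wrong byte order and you get a completely different character.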
Say I give you a big binary chunk of data, and I tell you to convert it to text. How do you know which encoding is used? How do you even know which language it is in? I could be giving you a chunk of data using IBM-Latin. So how do we figure this out? Some smarts and process of elimination. Let's start with things we know.
The Unicode encodings (UTF-8, UTF-16, and UTF-32) can carry what's called a Byte Order Mark, or BOM for short. This is a small amount of binary data prepended to the rest of the data that identifies which encoding it is. In the .NET world, this is called the preamble. Since the BOM is defined by the Unicode standard, it is always the same for a given encoding regardless of whether you are using .NET, Python, Ruby on Rails, etc. The BOM is optional, so it won't always be there, but when it is, we can look at our data and let the BOM tell us.
To achieve this in .NET, we will be using classes in the System.Text namespace, specifically the Encoding class. An instance of the Encoding class has a method called GetPreamble(), which gives us the BOM for that encoding. A BOM is 2 to 4 bytes long, depending on the encoding: 2 for UTF-16, 3 for UTF-8, and 4 for UTF-32. Remember when I said two bytes would be plenty? Well, I fibbed: UTF-32 uses 4 bytes per character, which leaves room for a whopping 4.3 billion values (far more than the 1,114,112 code points Unicode actually defines).
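As a quick illustration, this sketch prints the preamble of a few of the built-in encodings. The list of encodings is my own selection, not anything special to .NET.

```csharp
using System;
using System.Text;

Encoding[] unicodeEncodings =
{
    Encoding.UTF8,             // BOM: EF-BB-BF
    Encoding.Unicode,          // UTF-16 little-endian, BOM: FF-FE
    Encoding.BigEndianUnicode, // UTF-16 big-endian,    BOM: FE-FF
    Encoding.UTF32,            // UTF-32 little-endian, BOM: FF-FE-00-00
};

foreach (Encoding encoding in unicodeEncodings)
{
    byte[] bom = encoding.GetPreamble();
    Console.WriteLine($"{encoding.WebName}: {BitConverter.ToString(bom)}");
}

// Encodings without a BOM return an empty array rather than null.
Console.WriteLine(Encoding.ASCII.GetPreamble().Length); // 0
```

Notice that the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one. That overlap matters once you start matching BOMs against real data.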
We can then check our data to see if it starts with the BOM.
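private static bool DataStartsWithBom(byte[] data, byte[] bom)
{
    // No match if the encoding has no BOM or the data is shorter than the BOM.
    bool success = data.Length >= bom.Length && bom.Length > 0;
    for (int j = 0; success && j < bom.Length; j++)
        success = data[j] == bom[j];
    return success;
}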
So let's look at this method. It takes our data and a BOM, and determines whether the data starts with the BOM. There are a few assumptions:
- The data length is always greater than or equal to the BOM's length. If it is not, then there is no BOM at all, and we'll cover that in a bit.
- The BOM's length is always greater than zero.
So let's put it to use (assume the local data is a byte[]):
foreach (EncodingInfo encodingInfo in Encoding.GetEncodings())
{
    Encoding encoding = encodingInfo.GetEncoding();
    byte[] bom = encoding.GetPreamble();
    if (!DataStartsWithBom(data, bom))
        continue;
    // This encoding's BOM matches the start of the data.
}
Here, we get all of the encodings that .NET knows of, and look to see if our data byte array starts with that encoding's BOM. If the encoding has no BOM, the DataStartsWithBom method handles that with its bom.Length > 0 check. Once we know the encoding, we can decode the data. You have to ensure that you don't actually try to decode the BOM itself:
encoding.GetString(data, bom.Length, data.Length - bom.Length);
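One wrinkle worth sketching: the UTF-32 little-endian BOM (FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE), so whichever encoding happens to be checked first wins. The sketch below is one way to deal with that, not the only one: it checks a fixed list of Unicode encodings and keeps the longest matching BOM. DetectEncodingFromBom and DecodeWithBom are hypothetical helper names of mine, and limiting the candidates to five encodings is an assumption to keep the example short.

```csharp
using System;
using System.Text;

// Same check as the method shown earlier in the post.
static bool DataStartsWithBom(byte[] data, byte[] bom)
{
    bool success = data.Length >= bom.Length && bom.Length > 0;
    for (int j = 0; success && j < bom.Length; j++)
        success = data[j] == bom[j];
    return success;
}

// Hypothetical helper: returns the candidate whose BOM matches the start of
// the data, preferring the longest BOM, or null when none matches.
static Encoding DetectEncodingFromBom(byte[] data)
{
    Encoding[] candidates =
    {
        Encoding.UTF8,
        Encoding.Unicode,                                          // UTF-16 LE
        Encoding.BigEndianUnicode,                                 // UTF-16 BE
        new UTF32Encoding(bigEndian: false, byteOrderMark: true),  // UTF-32 LE
        new UTF32Encoding(bigEndian: true, byteOrderMark: true),   // UTF-32 BE
    };

    Encoding best = null;
    foreach (Encoding candidate in candidates)
    {
        byte[] bom = candidate.GetPreamble();
        bool longer = best == null || bom.Length > best.GetPreamble().Length;
        if (longer && DataStartsWithBom(data, bom))
            best = candidate;
    }
    return best;
}

// Hypothetical helper: decodes the data, skipping over the BOM itself.
static string DecodeWithBom(byte[] data)
{
    Encoding encoding = DetectEncodingFromBom(data);
    if (encoding == null)
        throw new ArgumentException("No recognizable BOM.", nameof(data));
    int bomLength = encoding.GetPreamble().Length;
    return encoding.GetString(data, bomLength, data.Length - bomLength);
}

// "A" in UTF-32 LE, preceded by the UTF-32 LE BOM.
byte[] sample = { 0xFF, 0xFE, 0x00, 0x00, 0x41, 0x00, 0x00, 0x00 };
Console.WriteLine(DecodeWithBom(sample)); // A
```

Keep in mind that this is still a guess. The same bytes could in theory be UTF-16 text that begins with a null character; preferring the longer BOM is just the more likely reading.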
Pretty straightforward so far, right?
Yes? OK, let's move on. What about the case where we can't figure it out by the BOM? Most encodings don't have a BOM; only the UTF encodings do. ISO and OEM encodings do not.
This is where it gets tricky, and where some pretty complex algorithms can come into play. The most important piece of information that you can have at this point is knowing which language the text is in. With that, we can take a reasonable stab at which encoding it is.
.NET supports languages through the System.Globalization.CultureInfo class. This class will be very useful from here on forward. Let's take baby steps on attacking this problem, and while we don't know everything, we can use clues.
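Each language has what Windows calls an ANSI encoding, or ANSI code page. This is the default legacy encoding used for that language. Despite the name, these code pages were never actually standardized by the American National Standards Institute. The ANSI encoding is usually a single-byte encoding, although East Asian languages such as Japanese (code page 932) use double-byte code pages. This seems like a reasonable place to start.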
We can get this Encoding by calling cultureInfoInstance.TextInfo.ANSICodePage. This only gives us the numeric code page (an identifier), but it's simple enough to create an instance of the Encoding class with the code page by calling Encoding.GetEncoding(int codePage).
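Here is a minimal sketch of that lookup. GetAnsiEncoding is a hypothetical helper name, and the sample bytes are ones I encoded by hand as Windows-1251. It assumes a modern .NET runtime, where the Windows code pages have to be registered through CodePagesEncodingProvider first, and a runtime with culture data available (not running in invariant globalization mode).

```csharp
using System;
using System.Globalization;
using System.Text;

// Assumption: on .NET Core / .NET 5+, Windows code pages such as 1251 only
// resolve after registering this provider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: the ANSI encoding that .NET associates with a culture.
static Encoding GetAnsiEncoding(CultureInfo culture)
{
    return Encoding.GetEncoding(culture.TextInfo.ANSICodePage);
}

// The Russian word "Привет" stored as Windows-1251 bytes.
byte[] data = { 0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2 };

Encoding ansi = GetAnsiEncoding(new CultureInfo("ru-RU"));
Console.WriteLine($"{ansi.WebName}: {ansi.GetString(data)}");
```

If the culture's ANSI code page comes back as 1251, the bytes decode to the original word. Some cultures report no ANSI code page at all, so treat the result as a starting guess rather than an answer.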
How do I figure out the language? Chances are you know what language your users are using, or at least most of them. A case where you wouldn't know is screen-scraping. There, the server often declares the encoding for you: check the charset in the response's Content-Type header, which the HttpWebResponse class exposes through its CharacterSet property. (Its ContentEncoding property reports compression such as gzip, not the text encoding.)
In most cases, this will probably work. By no means am I saying, "this will always work." In fact, there are a lot of bases that I haven't covered that I hope to in future blog posts. There are other code bits out there that do this already, and do a good job, but it's always good to know how it actually works, and fully understand the problem you are trying to solve.
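For the screen-scraping case, the declared encoding arrives as the charset parameter of the Content-Type header. The sketch below parses a header value with MediaTypeHeaderValue and maps the charset to an Encoding. EncodingFromContentType is a hypothetical helper, the header strings are made-up samples, and the code assumes a modern .NET runtime with the code pages provider registered.

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

// Assumption: on .NET Core / .NET 5+, names such as "windows-1251" only
// resolve after registering this provider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: maps a Content-Type header value to an Encoding,
// or returns null when the header declares no charset.
static Encoding EncodingFromContentType(string contentType)
{
    string charset = MediaTypeHeaderValue.Parse(contentType).CharSet;
    if (string.IsNullOrEmpty(charset))
        return null;
    // Some servers quote the value, e.g. charset="utf-8".
    // GetEncoding throws ArgumentException for names it does not know.
    return Encoding.GetEncoding(charset.Trim('"'));
}

Encoding declared = EncodingFromContentType("text/html; charset=windows-1251");
Console.WriteLine(declared.CodePage); // 1251
```

With HttpClient, the same value should be reachable through response.Content.Headers.ContentType?.CharSet. Servers do get this header wrong, so a BOM in the body is usually the stronger signal.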
So what'll be in part 2? How to decode text without knowing the language, and maybe in part two (part 3?) lossy decoding.