[INFO] How to avoid unicode decoding related runtime errors in your scripts/addons.
#1
If I had to name the most common cause of python script failures in XBMC, I would without a doubt put string encoding issues at the top of the list.

The issue is not a trivial one: most of the time it comes from programmers not understanding what string encoding is or how to handle it in their programs. In fact, most programmers (even very experienced ones) don't know this should be their concern until users start complaining about a program that, for them, works perfectly.

As a Spanish-born Computer Engineer who has dedicated most of his life to creating and maintaining data interfaces between heterogeneous computer systems, I have dealt with encoding issues all my life.

In this thread I will try to give a short explanation (very simplified and possibly not 100% accurate, for the sake of not boring any possible readers, and because I don't know any better) of what string encodings are and how to write python scripts that handle them safely.


THIS IS A PERMANENT WORK IN PROGRESS SO ALL CONSTRUCTIVE COMMENTS OR CORRECTIONS ARE WELCOME.


A little bit of History: (if you get tired, just jump to the next section for python issue solving)

"In the beginning some guys in the US created the computer. But it was formless and all about numbers. And ANSI said, “Let there be characters,” and there were characters. Each one associated with a number, the only thing computers were able to store. And "A" was called "65", his little brother "a" was called "97". Even numbers got new numbers (and odd numbers too), so "1" was called "49", "2" was "50" and so on. And each known character got a number. Even non-characters like the tabulator or the space got numbers. And ANSI saw that ASCII was good..."

What ANSI didn't take into account was that the rest of the world would also like to use computers. And those foreign people didn't even speak English, and they needed new and strange characters that didn't exist in the ANSI creation. As always, things could have been done right, but they weren't. Instead of getting everybody together and agreeing on a common extension to ASCII, each country, OS vendor or even developer created a new and different extension that suited their particular needs.

At that time, most computer systems had 8 bit microprocessors at their hearts. Those microprocessors were able to store in each memory cell a single number between 0 and 255 (the famous byte, composed of 8 bits, each capable of storing 0 or 1, which gives 256 possible combinations). Initially ASCII was limited to 128 characters (half the possible values of one byte). This left the last bit of each byte free for basic data verification purposes (needed on the primitive data transmission media of the time).

That allowed the foreigners to use the second half of available char numbers (from 128 to 255) to include extra chars that weren't present in the ASCII standard. As we already said each new encoding used those codes according to their preferences (maintaining the first half untouched). So, for example, the "á" character used in many Western European Languages would be coded as "225" in latin1, as "160" in IBM437 or as "135" in Macintosh encoding.
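
To make those numbers concrete, here is a minimal sketch that encodes the same character with three legacy codecs. It assumes the codec names "latin-1", "cp437" and "mac_roman" from Python's standard codecs for the three encodings mentioned above, and it is written so that it should behave the same on Python 2.7 and Python 3.

```python
# One character, three legacy encodings, three different byte values.
text = u"\u00e1"  # the letter a with acute accent, written as an escape

codes = {}
for codec in ("latin-1", "cp437", "mac_roman"):
    encoded = text.encode(codec)          # exactly one byte in each of these encodings
    codes[codec] = bytearray(encoded)[0]  # numeric value of that byte

print(sorted(codes.items()))
```

The same character ends up as a different number in each encoding, which is why exchanging the raw bytes without agreeing on the encoding goes wrong.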

As you can imagine this caused many troubles when systems with different encodings tried to exchange text information.

So finally, after many many years of trouble, some smart folks sat down and decided to create a single standard encoding for everybody. When they started working they faced one (among many others that we'll skip here) very big issue:

1. Since they wanted all characters in the world to be included in this new standard encoding, they realized that 256 possibilities weren't nearly enough. Even joining 2 bytes (which gives 65536 different possible codes) wasn't enough. This meant that if everybody converted all their text strings (each encoded in one of the existing limited encodings) to the new encoding, they would need 2, 3 or 4 times the original amount of space.

So they created this new encoding (in fact several) called UNICODE. But because of the space requirements its adoption in the real world is not happening as fast as everybody would have liked. In fact, as of today there are very few systems that can say they are 100% unicode exclusive (are there any?)

To address this issue some even smarter guys created utf-8. What is that? Well, the idea is simple and the implementation is brilliant. utf-8 is a new encoding that, like UNICODE, encodes all existing characters in the world, but instead of using the same length for every character it uses a variable length of 1 to 4 bytes per character. It assigns 1 byte codes to the most common characters in the world (at least in the western world ;) ) and 2 to 4 bytes to the rest (again giving fewer bytes to the next most common ones). And it does that while achieving some other very interesting features:
1. It's easy to discover the beginning of the next full character in a stream (very important for data integrity in transmissions).
2. utf-8 is 100% backwards compatible with ASCII (any ASCII string is a valid utf-8 string). This, of course, is very interesting for English based systems. Not so much for the rest...
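
The properties above can be checked with a few lines. This is only a sketch: the sample characters are arbitrary choices for illustration, and the code is meant to run unchanged on Python 2.7 and Python 3.

```python
# utf-8 uses between 1 and 4 bytes per character.
samples = [
    (u"A", "ASCII letter"),
    (u"\u00e1", "a with acute accent"),
    (u"\u20ac", "euro sign"),
    (u"\U0001F600", "emoji outside the 16 bit range"),
]
lengths = [len(char.encode("utf-8")) for char, _ in samples]
print(lengths)

# Any pure ASCII text produces exactly the same bytes in ascii and in utf-8.
ascii_text = u"Hi world!"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Bytes that continue a multi-byte character always have the bit pattern 10xxxxxx,
# which is what makes it easy to find the start of the next character.
euro = bytearray(u"\u20ac".encode("utf-8"))
continuation = [(b & 0xC0) == 0x80 for b in euro]
print(continuation)
```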

So, what's the picture right now?

Well. In reality it's not a very nice one. The coexistence of so many text encodings, even inside a single system (or even inside the same program!), means we have to take care of the issue even if we don't plan to exchange that much information (although nowadays, in the Internet era, almost any program has some kind of data interchange with other modules, web services, etc.)

So if you are planning to create a program that handles text, be ready to deal with encoding issues. Depending on your choice of programming language, runtime environment and connections, this can range from transparent to a nightmare.

How does python handle all this stuff?

If you ask for my personal opinion I would have to say "not very well". It is true that python can work with unicode strings as well as with text encoded in many other encodings. It even does automatic re-encoding in some cases (and that in the end is more of a source of trouble). But let's go to the useful stuff:

Python works with 2 separate types for Strings:
  • unicode strings (unicode).
  • single byte strings (str). (Well more or less)

Both can coexist in a python program. You can even mix them in operations. But this only contributes to the mess. From now on I'm going to use some speaking conventions that are my personal choice. I don't know if they can be considered standard, but they have proven to help clarify how to proceed with strings in python:

I consider "unicode" strings to be "un-encoded (decoded) text". I know that all strings are encoded one way or another, but it's practical to consider UNICODE strings as decoded.
The other type, "str" (which I call single byte strings), I consider "encoded strings". These strings are always sequences of bytes produced by some encoding, be it "ascii", "latin", "utf-8" (which can need several bytes per character) or whatever other codec exists around the world.
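
A small sketch of this convention, under the assumption that it has to run on both Python 2.7 and Python 3 (where the same two types are named str and bytes instead of unicode and str). The names text_type and bytes_type are only illustrative.

```python
text_type = type(u"")    # "unicode" on Python 2, "str" on Python 3
bytes_type = type(b"")   # "str" on Python 2, "bytes" on Python 3

decoded = u"Fant\u00e1stico"           # decoded text: 10 characters
as_utf8 = decoded.encode("utf-8")      # encoded: the accented letter takes 2 bytes
as_latin = decoded.encode("latin-1")   # encoded: every character takes 1 byte

print("%d characters, %d utf-8 bytes, %d latin-1 bytes"
      % (len(decoded), len(as_utf8), len(as_latin)))
```

The decoded text has one length; each encoded version has its own, depending on the codec used.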

To convert a str string into a unicode string you have to DECODE it and there is a method in the str type in python to achieve this. To go from str to unicode you must indicate the "encoding" used in the str variable:
Code:
>>> str_string = "Hi world!"
>>> unicode_string = str_string.decode ("ascii")
>>> unicode_string
u'Hi world!'

The inverse step, going from unicode to str is called (by me) ENCODING and there is also a method in the unicode class to do that. In this case you have to indicate the "encoding" you want the new variable to be in:
Code:
>>> new_str_string = unicode_string.encode("utf-8")
>>> new_str_string
'Hi world!'

To remember it more easily I like to think of str strings as the "zipped" (compressed) versions of unicode strings. The unicode version will, most of the time, occupy more space in memory than any encoded (zipped or compressed) version.

The BIG ISSUE (and this again is my personal opinion) with str strings is that the object model does not store anywhere the encoding used to create that variable. That's why it's so easy to do this:
Code:
>>> my_utf8_str = u"This unicode string is going to be encoded with utf-8. ¡Fantástico!".encode("utf-8")
>>> my_unicode_back = my_utf8_str.decode("ascii")
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 55: ordinal not in range(128)

Sound familiar?

This wouldn't be the case if str strings stored internally the encoding used when creating them. In fact, there would not be any need to indicate any parameter in the decode method. But this is not the case, so programmers must keep track of the actual encoding used in each str variable to avoid conflicts.
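
An exception is actually the friendly outcome. As a sketch of why keeping track matters: a wrong codec that happens to accept every byte value (latin-1 does) produces no error at all, just wrong text. The sample text is the one used above; the code should run on both Python 2.7 and Python 3.

```python
original = u"\u00a1Fant\u00e1stico!"   # the accented sample text from above
data = original.encode("utf-8")        # from here on, nothing records that this is utf-8

right = data.decode("utf-8")           # correct codec: the round trip succeeds
wrong = data.decode("latin-1")         # wrong codec: no exception, just wrong characters

print(right == original)
print(wrong == original)
print(len(original), len(wrong))
```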

Before going to the practical stuff I will go a little deeper into the python str fiasco. If we look at the methods supported by the str class we find that it has an encode method. One might think that a possible use for it is to change the encoding of a text. For example, to change the encoding of my previous utf-8 string I would like to do the following. Unfortunately it doesn't work:
Code:
>>> my_latin_str = my_utf8_str.encode("latin")
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 55: ordinal not in range(128)

Why doesn't this work? Because python assumes that "any str string is encoded with ascii unless specified otherwise" (ascii is the default encoding in Python 2). There is also an implicit decoding going on as part of the re-encoding process: the encode method of a str first decodes it to unicode using that default encoding and only then encodes the result to the indicated encoding. That is why a call to encode ends in a UnicodeDecodeError. So if we make the process explicit, what python is trying is:
Code:
>>> my_latin_str = my_utf8_str.decode("ascii").encode("latin")
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 55: ordinal not in range(128)

Same error, as you can see. The right way to do what I wanted is this:
Code:
>>> my_latin_str = my_utf8_str.decode("utf-8").encode("latin")
>>> print my_latin_str
This unicode string is going to be encoded with utf-8. ¡Fantástico!
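
The decode-then-encode step can be wrapped in a small helper. This is a sketch; transcode is a hypothetical name, not a function from the standard library or from xbmc, and the caller still has to know the source encoding.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string: decode with the codec it was written in, then encode with the new one."""
    text = data.decode(source_encoding)   # encoded bytes -> decoded text
    return text.encode(target_encoding)   # decoded text -> newly encoded bytes


utf8_bytes = u"\u00a1Fant\u00e1stico!".encode("utf-8")
latin_bytes = transcode(utf8_bytes, "utf-8", "latin-1")
print(len(utf8_bytes), len(latin_bytes))
```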

Exercise:
Predict whether, with the sample variables used before, the following statements will succeed or fail: (open question: does anybody know if there is the equivalent of
in xbmc forum?)

Code:
>>> my_unicode_back = my_utf8_str.encode("unicode")  # (S/F)?
>>> my_unicode_back = my_utf8_str.decode("latin")  # (S/F)?
>>> my_unicode_back = my_utf8_str.encode("utf-8")  # (S/F)?
>>> my_other_unicode = my_latin_str.decode("latin")  # (S/F)?
>>> my_unicode_back = my_utf8_str.decode("utf-8")  # (S/F)?
>>> my_other_unicode_back = my_latin_str.decode("utf-8")  # (S/F)?

FINALLY WHAT YOU WANTED: PLAIN ADVICE ON HOW TO HANDLE THIS IN PYTHON FOR XBMC

Well, after all this I have to say: if you didn't read it all, or you didn't understand it, it's quite possible that the following won't help you much. But I might be wrong.

TIPS AND TRICKS:

Tip 1. Decode all text variables related to files and file paths to unicode. In fact it would be best if all your internal text variables were unicode and you encoded to whatever you need when interfacing with modules, services, etc. Real life samples:

All system calls (sys, os.path, etc.) admit unicode strings (and will return them too) so there is nothing to fear when calling them.
Also, most of the xbmc python calls admit both unicode and "utf-8" encoded strings. Some show issues if you pass them a unicode string (xbmc.executebuiltin being one), so my advice is: "try to call with unicode and, if it doesn't work, encode to utf-8". That will work nearly all the time.

Also, when calls from xbmc python modules (xbmc, xbmcaddon) return str strings, these are always "utf-8" encoded. So it's easy to decode them into unicode for later use with the system functions (sys, os.path) and avoid trouble on problematic systems (like windows).
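
The "decode on the way in, encode on the way out" idea can be sketched with two small helpers. The names to_unicode and to_utf8 are hypothetical (they are not part of any xbmc module), the sample path is made up, and a plain byte string stands in for a value returned by an xbmc call so the snippet runs outside XBMC on both Python 2.7 and Python 3.

```python
text_type = type(u"")    # unicode on Python 2, str on Python 3
bytes_type = type(b"")   # str on Python 2, bytes on Python 3


def to_unicode(value, encoding="utf-8"):
    """Return value as decoded text, assuming byte strings are utf-8 unless told otherwise."""
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    return value


def to_utf8(value):
    """Return value as a utf-8 encoded byte string, for calls that have trouble with unicode."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value


# Stand-in for a utf-8 encoded value returned by an xbmc call (made-up path).
returned_path = u"/home/user/m\u00fasica".encode("utf-8")
path = to_unicode(returned_path)   # work with unicode internally
outgoing = to_utf8(path)           # encode again only at the boundary
```

Both helpers leave values that are already in the right type untouched, so calling them twice is harmless.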

Real life (real scripts) examples:
Code:
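import os
import xbmc
import xbmcaddon

__addon__      = xbmcaddon.Addon()
# The xbmc calls return a "utf-8" encoded str, so decode it to unicode once, here.
__cwd__        = xbmc.translatePath( __addon__.getAddonInfo('path') ).decode("utf-8")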

Here we take the "utf-8" encoded str string returned by the xbmc calls and decode it to unicode. So when, in the future, we use it to call something like
Code:
os.path.join( __cwd__, 'resources', 'lib' )
We will also get unicode and avoid trouble.

It's worth noting that when we mix unicode and str strings (like before with __cwd__ and the 'resources' constant) python does an implicit decoding of the str string to unicode. Check that, in this example, these implicit decodings can't cause a failure! (in the previous example 'resources'.decode("ascii") won't fail because 'resources' is a correct ascii string). Implicit decoding makes code more readable, but we need to be aware of what is really going on. If in doubt, use unicode constants.
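
To see when the implicit decoding bites, it helps to write it out by hand. The sketch below spells out the ascii decode that Python 2 performs silently when a str meets a unicode; implicit_decode is a hypothetical name used only for illustration, and because the step is explicit the snippet behaves the same on Python 2.7 and Python 3.

```python
def implicit_decode(data):
    """Mimic what Python 2 does behind the scenes when str meets unicode: decode as ascii."""
    return data.decode("ascii")


ok = implicit_decode(b"resources")   # pure ASCII bytes: works

try:
    implicit_decode(u"m\u00fasica".encode("utf-8"))   # non-ASCII bytes: fails
    failed = False
except UnicodeDecodeError:
    failed = True

print(ok, failed)
```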

Tip 2. When we can't be sure whether a variable is unicode or str, we can always check with "isinstance". This is better explained with a real life example:

Code:
def log(txt):
    # Log admits both unicode strings and str encoded with "utf-8" (or ascii).
    # It will fail with other str encodings.
    if isinstance(txt, str):
        # If it is str we assume it's "utf-8" encoded.
        # Will fail if called with other encodings (latin, etc) BE ADVISED!
        txt = txt.decode("utf-8")
    # At this point we are sure txt is a unicode string.
    message = u'%s: %s' % (__addonid__, txt)
    # I reencode to utf-8 because in many xbmc versions log doesn't admit unicode.
    xbmc.log(msg=message.encode("utf-8"), level=xbmc.LOGDEBUG)

This check can be useful if we have an addon that must work for several xbmc versions that change the interface of some modules.
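
If a logging helper must never raise, one possible variation is to decode with the "replace" error handler, which swaps undecodable bytes for the U+FFFD replacement character instead of failing. This is only a sketch under that assumption: safe_unicode is a hypothetical name, and the price is that text in an unexpected encoding is logged with some characters lost rather than reported as an error.

```python
bytes_type = type(b"")   # str on Python 2, bytes on Python 3


def safe_unicode(value, encoding="utf-8"):
    """Decode byte strings without ever raising; undecodable bytes become U+FFFD."""
    if isinstance(value, bytes_type):
        return value.decode(encoding, "replace")
    return value


good = safe_unicode(u"caf\u00e9".encode("utf-8"))    # valid utf-8: decoded normally
bad = safe_unicode(u"caf\u00e9".encode("latin-1"))   # not valid utf-8: no exception
print(good == u"caf\u00e9", bad == u"caf\ufffd")
```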


FINAL THOUGHTS

Ok. Although I wrote a lot and hope it can be useful for some of you I would like this thread to be an open debate for unicode encoding related issues in your real life.

First I would like to get your feedback on whether you find this post useful or not. If you got this far I guess you won't mind voting in the poll.

Then, I accept any comments, fixes, clarifications or whatever you want to say to help others improve their code. Even if you want to ask me (politely please) to remove the thread...

As I said in the beginning (remember?) this is a Work in Progress thread so don't be shy and participate!

Thanks for your time and see you around.

#2
Please don't hesitate to make a page on our wiki for this :)
This has been a pain in the **** for everyone
#3
(2012-11-07, 21:07)Martijn Wrote: Please don't hesitate to make a page on our wiki for this :)
This has been a pain in the **** for everyone

Can't believe you already read this. It took me forever to write it!

Maybe in a while. When it's a little more debated and complete...
#4
If I see a title:
Quote: "How to avoid unicode decoding related runtime errors in your scripts/addons."

I don't need to know more :D


I should be mad at you because it took you so long to post this ;)
#5
I'd suggest that at the top you write the bit on "what does xbmc expect, and what does it return"

1. It expects you to pass it either unicode or utf-8 str.
2. It almost always returns a utf-8 str (I think there might be the odd one that returns unicode?)

Cheers,
Jonathan


#6
* ronie presses print and sticks it above his bed
#7
(2012-11-08, 00:25)jmarshall Wrote: I'd suggest that at the top you write the bit on "what does xbmc expect, and what does it return"

1. It expects you to pass it either unicode or utf-8 str.
2. It almost always returns a utf-8 str (I think there might be the odd one that returns unicode?)

Noted.

I'll have to cut the whole thing down. The forum let me create it, but now I can't edit it because it exceeds 5000 chars, the maximum allowed length for a post :p
#8
if you're nearing the 5000 char limit, take a hint ;D
#9
(2012-11-08, 14:30)spiff Wrote: if you're nearing the 5000 char limit, take a hint ;D

I was thinking of you when I was writing it ;) I know how much you like this kind of stuff...
#10
It would be nice to include a section on extracting the encoding from a web page, both from the headers and from the meta keys
#11
I'd also recommend this talk for general unicode/python info, http://nedbatchelder.com/text/unipain.html