Unicode/utf encoding python scripts
#1
i know it's not really xbmc related but i'm having a hard time understanding how python handles "special" (non-ascii) chars..

i've found a couple of pages explaining how to handle unicode strings so they can be parsed by minidom's "parseString", and that seems to work when i put a unicode string manually in a script, but i can't find a way to "encode" strings imported from the web so they can be processed by the minidom "toxml()" method..

i can't manage to parse a file with special chars either..

if someone has some working sequence to parse an xml string from file, properly convert strings from the web (with special chars) to add them to my dom object and then use some way to put that dom object content back into a file, it would really help

thx,

.dvbm
#2
well it seems dealing with unicode and encodings is quite a nightmare..

i've managed to use utf-8 encoded unicode strings, read from file, fetch from urls, write to file and parse to xml dom

but despite all the encoding stuff, some very basic chars still break the various encoding/decoding steps, like the char "à", whereas "éè" work fine..

on top of that, when using utf-8 unicode strings, any string comparison or manipulation seems to fail..

like doing a lower() or == between strings seems to trigger hidden encodings/decodings which can also fail for many reasons..

any help would be very appreciated, especially about when and how to convert things before calling toxml() in order to save an xml tree to file..

thx

.dvbm
#3
hmmm... i know i've seen a script somewhere that did things with unicode encoding
#4
encodings can be a serious problem to get a grip on (i certainly don't quite get it yet). when you have a bytestring and want to have it in unicode you have to use the decode function. the problem is that you need to know what encoding the site you just downloaded uses, but this should be in its header. then you can do yourstringobject.decode('utf-8') or similar. minidom should handle unicode without a problem afaik. to convert from unicode back to a bytestring use encode() (similar to decode() above). i guess you might know all this already though...
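
a minimal sketch of that decode/encode round trip, in case it helps. the sample bytes and variable names are made up for illustration, and the snippet is written so that it also runs on a current python 3:

```python
# raw bytes as they might arrive from a site that serves iso-8859-1
raw = b"caf\xe9 \xe0 la carte"

# bytes -> unicode: decode with the encoding the site declares
text = raw.decode("iso-8859-1")

# unicode -> bytes: encode with whatever encoding the target needs
utf8_bytes = text.encode("utf-8")

# decoding with the wrong encoding fails (or, in other cases, gives garbage)
try:
    raw.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
```

the point is only that decode() needs the encoding the bytes were really written in, while encode() can target any encoding that can represent the text.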
xbmcscripts.com administrator
#5
yeah i finally managed to get unicode working with my script..

so i'll try to explain a few things that might be useful to other scripters out there :

the basic idea is indeed to decode any incoming bytestring text using the proper encoding parameter in order to store it in unicode format, then unicode versus unicode manipulation seems to work "fine" throughout your script..

as i'm using some libs from ooba script, to get a web page you can do the following :

html = httpfetcher.urlopen(url)
html = html.decode('iso-8859-1')

iso-8859-1 should be stated somewhere in the meta tags of the html document.. in fact i don't look for it as this particular website has every page encoded in iso-8859-1.. but the proper way to do it would be to look for the meta tag and use its encoding value to decode your bytestring..
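
a rough sketch of what looking for the meta tag could look like. decode_html and CHARSET_RE are hypothetical names (they are not part of httpfetcher or any lib), the regex only covers the common forms of the charset declaration, and the iso-8859-1 fallback is just an assumption that fits this particular site:

```python
import re

# matches charset=... inside a <meta> tag, searched on the raw bytes
# (the declaration itself is plain ascii, so this is safe before decoding)
CHARSET_RE = re.compile(br'<meta[^>]+charset=["\']?([A-Za-z0-9_-]+)',
                        re.IGNORECASE)

def decode_html(raw, fallback="iso-8859-1"):
    """hypothetical helper: decode page bytes using the declared charset."""
    match = CHARSET_RE.search(raw)
    if match:
        encoding = match.group(1).decode("ascii")
    else:
        encoding = fallback
    try:
        return raw.decode(encoding)
    except LookupError:
        # the page declared an encoding name python does not know
        return raw.decode(fallback)
```

usage would be something like html = decode_html(httpfetcher.urlopen(url)). a page that declares one encoding but is actually stored in another will still raise UnicodeDecodeError or come out garbled, so this is not bulletproof.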

in case you wonder how httpfetcher returns the bytestring, it simply opens the local cached file in "rb" mode and performs a read() on the file handle.. so no codecs lib is used here..

the html var contains the unicode form of the web page content and you can do whatever you need to do with it, including regex and so on.. for regex i've added the re.UNICODE flag although it seems to work without it.. and of course every match.group() from the regex output is still in unicode..
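
a small sketch of a regex over decoded text. the html snippet and the pattern are invented for illustration. on python 2 the re.UNICODE flag is what makes \w, \b and friends treat accented letters as word characters, so a simple pattern can appear to work without it until one of those classes is involved:

```python
import re

# a decoded (unicode) page fragment, made up for this example
html = u"<title>Cin\xe9ma \xe0 Paris</title>"

# re.UNICODE makes \w match accented letters too
# (python 3 text patterns behave this way by default)
pattern = re.compile(u"<title>(\\w+)", re.UNICODE)
match = pattern.search(html)

# the captured group is still a unicode object
word = match.group(1)
```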

now for minidom the thing is quite simple too..
i use two different xml documents in my script, one for the settings, called "settings.xml" encoded in "iso-8859-1" which is the default encoding of my text editor, so you can manipulate it and insert non us-ascii chars without breaking your xml document..
and a second xml file used to store additional data used by the script, which is encoded in utf-8, just for the fun of it, and which should not be tampered with if you don't want your minidom parsing to fail gloriously..

so to parse the xml documents you just need to set your xml document's header properly like this:
<?xml version="1.0" encoding="utf-8"?>
or
<?xml version="1.0" encoding="iso-8859-1"?>

and call : parse(pathtofile)
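
to illustrate, here is a sketch that parses the same text stored in both encodings. it uses parseString on in-memory bytes so it is self-contained (parse(pathtofile) reads the declaration from the file the same way), and the element names "settings" and "name" are invented:

```python
from xml.dom import minidom

body = u"<settings><name>d\xe9j\xe0</name></settings>"

# the same document stored twice, each with a matching declaration
latin1_doc = (u'<?xml version="1.0" encoding="iso-8859-1"?>' + body).encode("iso-8859-1")
utf8_doc = (u'<?xml version="1.0" encoding="utf-8"?>' + body).encode("utf-8")

def read_name(xml_bytes):
    # the parser picks the encoding from the xml declaration
    dom = minidom.parseString(xml_bytes)
    return dom.getElementsByTagName("name")[0].firstChild.data

name_from_latin1 = read_name(latin1_doc)
name_from_utf8 = read_name(utf8_doc)
```

both calls give back the same unicode text. if the declaration does not match the actual bytes of the file, the parse either fails or produces wrong characters.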

now minidom manipulation is done exclusively in unicode, meaning that you should pass unicode objects to minidom elements and minidom will return unicode objects when reading elements from nodes and so on..
so you shouldn't have any trouble using unicode as long as your texts are in unicode..
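
a short sketch of passing decoded text to minidom and reading it back. the "movies"/"movie" element names and the title are made up for the example:

```python
from xml.dom import minidom

dom = minidom.parseString(b'<?xml version="1.0" encoding="utf-8"?><movies/>')

# text from the web, decoded once to unicode when it is imported
title = b"Am\xe9lie \xe0 Paris".decode("iso-8859-1")

# hand unicode objects to minidom, both for attributes and text nodes
node = dom.createElement("movie")
node.setAttribute("title", title)
node.appendChild(dom.createTextNode(title))
dom.documentElement.appendChild(node)

# what comes back out of the tree is unicode as well
movie = dom.getElementsByTagName("movie")[0]
read_back = movie.firstChild.data
```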

then in order to store your xml document you have to tell minidom how you want it to be stored, and you do that like this :

f = open(pathtofile, 'wb')
f.write(obj.toxml(encoding="utf-8"))
f.close()

that's for my utf-8 xml doc of course, the other one would be iso-..
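
a self-contained sketch of the save step with a read-back check. the temp folder, the "data.xml" file name and the "data" element are assumptions made for the example:

```python
import os
import tempfile
from xml.dom import minidom

dom = minidom.parseString(b"<data/>")
dom.documentElement.appendChild(dom.createTextNode(u"d\xe9j\xe0 vu"))

# hypothetical location, only for this example
path = os.path.join(tempfile.mkdtemp(), "data.xml")

# toxml(encoding=...) returns an encoded bytestring and writes a matching
# xml declaration, so the file is opened in binary mode
f = open(path, "wb")
try:
    f.write(dom.toxml(encoding="utf-8"))
finally:
    f.close()

# parse the file again to check that the text survived the round trip
reloaded = minidom.parse(path).documentElement.firstChild.data
```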

that's all..


so basically if you're willing to use unicode you should never have to call encode/decode methods within your script in order to manipulate/compare text strings, after they've been imported and properly decoded to unicode.. just keep everything in unicode..

also keep in mind that some string manipulations/comparisons can trigger a hidden conversion between unicode objects and bytestrings (for instance when the two types are mixed), and that conversion uses the default encoding of your python setup, which is usually ascii, so it will most certainly raise encoding errors as soon as non-ascii chars are involved..
the only such manipulation i can think of in my script is "print", so it's quite likely that your calls to print will fail when trying to print unicode objects with non-ascii chars in them..
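
a sketch of the failure and of two explicit ways around it for debug output. the sample text is made up. it reproduces the ascii step by hand rather than through print, since what print does depends on the console or log python is attached to:

```python
text = u"d\xe9j\xe0 vu"

# this is the conversion that the ascii default cannot do
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# explicit alternatives that never raise, handy for log/debug output
safe_for_log = text.encode("ascii", "replace")         # non-ascii becomes ?
escaped = text.encode("ascii", "backslashreplace")     # non-ascii becomes \xNN
```

printing safe_for_log or escaped instead of the unicode object itself avoids depending on the default encoding.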

one other important thing is that some python modules/methods are not able to use unicode strings, like os.path

so you'll have to do this everywhere:
os.path.exists(str(pathtofile))

beware that str() will encode the unicode object pathtofile to a bytestring using the default encoding "ascii", so if your pathtofile is actually a properly formatted ascii path it won't fail, but in any other case it will..
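
one possible way around the ascii limit is to encode the path explicitly instead of relying on str(). this is only a sketch: to_fs_bytes is a hypothetical helper, and whether sys.getfilesystemencoding() returns something useful on a given xbmc build is an assumption that would need checking:

```python
import os
import sys
import tempfile

def to_fs_bytes(path):
    """hypothetical helper: encode a unicode path explicitly, not via str()."""
    encoding = sys.getfilesystemencoding() or "utf-8"
    return path.encode(encoding)

# create a throwaway file so there is something to look for
folder = tempfile.mkdtemp()
pathtofile = os.path.join(folder, u"settings.xml")
open(pathtofile, "wb").close()

# os.path gets a bytestring, built with a known encoding
found = os.path.exists(to_fs_bytes(pathtofile))
```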


hope that helps.. it's been a painful experience but as i finally managed to do almost everything i wanted to do, i guess it wasn't so bad after all..

good luck.

.dvbm
#6
hi dvbm,

i too am having some encoding issues. just wanted to know if you were able to get xbmc to display special characters in gui elements like buttons and lists, as i seem to just get squares.

any help would be appreciated.

chad.