How to deal with EUC-JP encoded Japanese source in scraper
#1
Dear all,
I am writing my own scraper for some Japanese videos, and the source website is encoded in "EUC-JP" instead of UTF-8. As a result, the GetSearchResults() and GetDetails() functions return garbled characters.

Even though I put <?xml version="1.0" encoding="EUC-JP" standalone="yes"?> as the first line of the output of GetSearchResults(), it doesn't work. I don't know whether this is related to my language and charset settings under System settings in XBMC (I left them at the defaults). Is there any way in XBMC to make the HTTP component choose the charset adaptively?

Can anyone here give me some hints or advice? Thank you very much.
#2
This is a little guesswork on my part, so it may not work, but there is a SearchStringEncoding attribute you can add to CreateSearchUrl, like:
Code:
<CreateSearchUrl SearchStringEncoding="EUC-JP">
...
</CreateSearchUrl>

From the code it looks like it only gets used in conjunction with a fixchars attribute in any expression node. Like noclean or trim, you specify which capturing groups you want "fixchars"ed, e.g.
Code:
<RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="4+">
    <expression fixchars="1" trim="1">&lt;movie_title&gt;([^&lt;]*)&lt;/movie_title&gt;</expression>
</RegExp>

The scraper parser definitely uses the SearchStringEncoding to do something to the fixchars groups (presumably render them intelligible).

As I said, that's just a guess based on the code; I don't know whether it actually works.
#3
Thank you for your reply, scudlee, but it doesn't work.
I think the characters come out wrong because XBMC does not have EUC-JP encoding support (it does have Shift-JIS), so it cannot display or handle the characters as UTF-8.

Is it possible to write a plugin that converts the page from EUC-JP to UTF-8 and passes it to the scraper to process?
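
To make the idea concrete, here is a minimal sketch of the conversion step on its own, written for a current Python 3 interpreter. It only shows the transcoding; the function name euc_jp_to_utf8 and the sample string are made up for illustration, and nothing here is an XBMC API. Whether XBMC's scraper engine can be fed the converted page this way is an open question, not something this snippet settles.

```python
def euc_jp_to_utf8(raw: bytes) -> bytes:
    """Decode EUC-JP bytes and re-encode them as UTF-8.

    Hypothetical helper for illustration; not part of XBMC.
    """
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
    # so one bad byte in the page does not abort the whole conversion.
    text = raw.decode("euc_jp", errors="replace")
    return text.encode("utf-8")


# Simulate a page fragment as it would arrive from an EUC-JP encoded site.
sample = "日本語のタイトル".encode("euc_jp")
converted = euc_jp_to_utf8(sample)
print(converted.decode("utf-8"))
```

If the page declares its charset in an XML or HTML header, that declaration would also need to be changed to UTF-8 after converting, otherwise the parser may still try to read the bytes as EUC-JP.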