Stupid question: lowercase the string
#1
Hi all,

I have a stupid question: how do I make a CStdString(W) lowercase?

I start with a UTF-8 Cyrillic string in a CStdString. It shows fine both in the karaoke lyrics and in the logs, but so far every attempt to make it lowercase has failed.

1. The most obvious way

Code:
CStdString lower = m_lyrics[i].text;
lower.ToLower();
lower.MakeLower(); // no idea which to use
CLog::Log(LOGERROR, "lower %s", lower.c_str());

prints the same uppercase string. Using CStdStringW surprisingly doesn't help either:

Code:
CStdStringW temptext;
g_charsetConverter.utf8ToW( m_lyrics[i].text, temptext );
CStdStringW lowertext = temptext;
lowertext.ToLower();
lowertext.MakeLower();
CStdString lower;
g_charsetConverter.wToUTF8( lowertext, lower );

This again returns the same string as above. How do I properly lowercase a non-English UTF-8 string in XBMC?

P.S. I use Linux.
Reply
#2
We pretty much can't. There's a ticket on trac regarding this - there's an Intel lib to handle it all that we're hoping to use eventually (it'll also replace iconv + fribidi).

Feel free to give it a go if you want!
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
Reply
#3
Maybe we could switch to Qt (http://www.trolltech.com)? Of course not whole Qt, just QtCore - strings, vectors and so on.

Obvious benefits of QString:
- thread-safe reference counting (very helpful for those who do not pass by pointer/reference);
- full Unicode support without external dependencies;
- stl-compatible as much as possible;
- has all necessary conversions and even more text encoders/decoders than iconv;
- commercial lib (dual-licensed under GPL as well), widely used;
- excellent documentation and samples.

So we could throw away iconv/fribidi easily.

All we need to do is subclass CStdString and forward calls to QString.

The potential issues:
- It's based on QChars, not chars (which is unavoidable, since there is no other way to enumerate a Unicode string). Therefore all code which goes through the string char-by-char AND assumes that character count = memory buffer size might need fixing. But that code needs fixing anyway, since it's not UTF-8 aware (the number of bytes in a string != the number of chars for any non-English language).
- It's relatively large. But Unicode conversion tables are large, and there is little we can do about that. We could strip it down, but I'd prefer to make as few mods as possible so we can easily update to the latest versions.
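
To illustrate the bytes-vs-chars point above, here is a minimal sketch in plain C++ (no XBMC or Qt code involved). The function names are made up for the example, and it assumes the input is already valid UTF-8:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of CStdString or XBMC.
// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8CodePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80)  // not a continuation byte: a new code point starts here
            ++count;
    }
    return count;
}

// Usage sketch: the Cyrillic word "Привет" is 6 characters but 12 bytes in UTF-8.
void demoByteVsCharCount()
{
    const std::string word = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";
    assert(word.size() == 12);
    assert(utf8CodePointCount(word) == 6);
}
```

Any loop that indexes such a string with `s[i]` and treats each element as one character will see 12 "characters" here, which is exactly the kind of code that would need fixing.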

I haven't seen the Intel library, but on its own it will not resolve the problems with CStdString unless CStdString itself is changed. For example, my karaoke analyzer needs to parse the string char-by-char, not byte-by-byte (which would break UTF-8), so it needs some changes anyway.

What do you think?
Reply
#4
The Intel lib serves the same purpose while being much more self-contained than QString. The changes needed are exactly the same as those that would be needed to use QString (i.e. the issue is the internal representation). For char-by-char processing you need to use a fixed-width charset anyway (wstring or UTF-32).
Reply
#5
Agreed - UTF-16 or (more ideally) UTF-32 seem the most natural things to use for char-by-char processing. UTF-8 <-> UTF-16/32 is of course a very fast conversion.
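
As a rough sketch of that round trip (decode UTF-8 to UTF-32, process per character, encode back), something like the following would do. All function names are hypothetical and none of this is XBMC code. The decoder assumes valid UTF-8, and the case mapping is a hand-rolled table covering only ASCII and the basic Cyrillic block, so it is no substitute for a real Unicode library:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32 code points. Assumes valid input; no error handling.
std::u32string utf8ToUtf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Encodes UTF-32 code points back into UTF-8.
std::string utf32ToUtf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF);  // outside the Unicode range otherwise
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Illustrative mapping only: ASCII plus U+0400..U+042F. Everything else is
// returned unchanged, which is wrong for many other scripts.
char32_t toLowerBasic(char32_t cp)
{
    if (cp >= U'A' && cp <= U'Z')     return cp + 0x20;
    if (cp >= 0x0410 && cp <= 0x042F) return cp + 0x20;  // U+0410..U+042F -> U+0430..U+044F
    if (cp >= 0x0400 && cp <= 0x040F) return cp + 0x50;  // U+0400..U+040F -> U+0450..U+045F
    return cp;
}

// Lowercases a UTF-8 string one code point at a time.
std::string lowerUtf8(const std::string& in)
{
    std::u32string wide = utf8ToUtf32(in);
    for (char32_t& cp : wide)
        cp = toLowerBasic(cp);
    return utf32ToUtf8(wide);
}
```

The point of going through UTF-32 is that the loop in `lowerUtf8` sees one element per character, so the per-character logic never has to know about multi-byte sequences.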

We currently assume everything fits into a single UTF-16 char, which is true for most languages that we're likely to have someone support.

Cheers,
Jonathan
Reply
