[PATCH 2/8] Files: add wrappers for open(), fopen(), sqlite3_open()

Thiago Macieira thiago at macieira.org
Wed Dec 18 15:47:16 UTC 2013


On Wednesday, 18 December 2013 15:36:56, Thiago Macieira wrote:
> In the worst case, conversion from UTF-8 to UTF-16 results in the same
> number of characters, or double the number of bytes. That's actually the
> US-ASCII case: each byte becomes one 16-bit word. For everything else,
> UTF-16 takes fewer characters.
> 
> You multiply by 3 when you convert from UTF-16 to UTF-8 for the worst case 
> scenario.

Range                   UTF-8       UTF-16
U+0000 to U+007F        1 byte      1 word
U+0080 to U+07FF        2 bytes     1 word
U+0800 to U+FFFF        3 bytes     1 word
U+10000 to U+10FFFF     4 bytes     2 words

That's why it works. In the worst case, we allocate a buffer with 3x as many
elements as needed: that happens when converting text in the range U+0800 to
U+FFFF, where every 3 UTF-8 bytes become a single 16-bit word.
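
As a rough illustration of the sizing argument above, the sketch below counts
how many 16-bit words a UTF-8 string needs. The function name
utf16_units_for_utf8 is hypothetical (it is not part of the patch), and the
code assumes well-formed UTF-8 input. The result never exceeds the input byte
count, which is why a buffer of that many words is always large enough.

```c
#include <stddef.h>

/* Hypothetical helper: number of UTF-16 code units (16-bit words) needed
 * to hold a UTF-8 string of `len` bytes. Assumes well-formed UTF-8. */
size_t utf16_units_for_utf8(const char *s, size_t len)
{
    size_t units = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];

        if ((c & 0xC0) == 0x80)
            continue;               /* continuation byte, already counted */

        /* Lead bytes 0xF0 and above start a 4-byte sequence, which
         * becomes a surrogate pair; everything else is one word. */
        units += (c >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

For pure ASCII the result equals the byte count (no over-allocation); for text
in U+0800 to U+FFFF it is one third of the byte count.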

Unfortunately, that's character count, not byte count, which means doing
in-place conversions like I want to do for Qt isn't going to be easy. In-place
conversions from UTF-16 to UTF-8 work for ASCII text (shrinks by half) and for
U+0080 to U+07FF text and non-BMP text (same memory usage), but memory usage
increases by 50% when encoding U+0800 to U+FFFF.
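
To make the byte counts concrete, here is a minimal sketch that computes how
many UTF-8 bytes a UTF-16 string needs, so the result can be compared with the
2 * n bytes the UTF-16 form occupies. The name utf8_bytes_for_utf16 is
hypothetical, and the code assumes well-formed UTF-16 (an unpaired surrogate
is simply counted as 3 bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed to encode `n` UTF-16
 * code units. Assumes well-formed input with properly paired surrogates. */
size_t utf8_bytes_for_utf16(const uint16_t *s, size_t n)
{
    size_t bytes = 0;

    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];

        if (u < 0x80) {
            bytes += 1;             /* U+0000 to U+007F: 2 bytes -> 1 */
        } else if (u < 0x800) {
            bytes += 2;             /* U+0080 to U+07FF: 2 bytes -> 2 */
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n) {
            bytes += 4;             /* surrogate pair:   4 bytes -> 4 */
            i++;                    /* skip the low surrogate */
        } else {
            bytes += 3;             /* U+0800 to U+FFFF: 2 bytes -> 3 */
        }
    }
    return bytes;
}
```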

I'll still try because there's a lot of ASCII text in Qt applications and even 
for CJK text, it might work if there were buffer gains from previous ASCII text 
in the same string ("hello こんにちは" can be converted in-place: it takes 22 
bytes in UTF-16 and 21 bytes in UTF-8).
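
One way to picture the "buffer gains" idea is to simulate a single forward
pass and check that the UTF-8 write offset never gets ahead of the UTF-16 read
offset. The sketch below does only that check; it is an assumption about how
such a conversion could be validated, not the approach actually used in Qt,
and utf16_fits_in_place is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: returns 1 if a single forward pass could rewrite the
 * UTF-16 buffer as UTF-8 in place, i.e. the number of bytes written never
 * exceeds the number of bytes already read. Assumes well-formed UTF-16. */
int utf16_fits_in_place(const uint16_t *s, size_t n)
{
    size_t read = 0;     /* bytes consumed from the UTF-16 input */
    size_t written = 0;  /* bytes produced as UTF-8 output */

    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];

        if (u < 0x80) {
            read += 2; written += 1;    /* gains one byte of slack */
        } else if (u < 0x800) {
            read += 2; written += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n) {
            read += 4; written += 4;    /* surrogate pair */
            i++;
        } else {
            read += 2; written += 3;    /* uses one byte of slack */
        }

        if (written > read)
            return 0;   /* output would overwrite unread input */
    }
    return 1;
}
```

With "hello こんにちは" the six ASCII characters build up 6 bytes of slack and
the five kana use 5 of them, so the check passes; the kana alone fail on the
first character.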

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel Open Source Technology Center
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358