AP Vision import
Linus Torvalds
torvalds at linux-foundation.org
Sat Apr 15 13:54:05 PDT 2017
On Sat, Apr 15, 2017 at 10:10 AM, Miika Turkia <miika.turkia at gmail.com> wrote:
> The problem is that the character encoding in the sample file is
> iso8859-1. Unfortunately we assume the encoding to be UTF-8 on our
> imports. I do not know any efficient way to detect the encoding nor to
> convert the input into UTF-8 in C. This is a problem that our users do
> run into occasionally, so it would be great if someone had an idea how
> to fix this in sensible manner...
So if you *know* the input is in latin1, converting to utf-8 is trivial.
The conversion is literally:

    for (;;) {
        unsigned char c = *input++;

        if (!c)
            break;

        /* US-ASCII? All done */
        if (c < 128) {
            *output++ = c;
            continue;
        }

        /*
         * High bit set - turn character into two-byte UTF-8:
         * 110000xx 10xxxxxx
         */
        *output++ = 0xc0 + (c >> 6);
        *output++ = 0x80 + (c & 0x3f);
    }

it's really that easy.
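
For reference, the loop above can be wrapped into a self-contained function. This is only a minimal sketch: the name latin1_to_utf8 and the choice to return a freshly allocated buffer are assumptions for illustration, not anything from the Subsurface code. The buffer is sized for the worst case, where every input byte expands to two output bytes, and the result is NUL-terminated explicitly since the loop stops before copying the terminator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any existing API): return a newly
 * allocated UTF-8 copy of a NUL-terminated Latin-1 string, or NULL if
 * the allocation fails. The caller frees the result.
 */
char *latin1_to_utf8(const char *input)
{
	const unsigned char *in = (const unsigned char *)input;
	unsigned char *output;
	char *result;

	assert(input != NULL);

	/* Worst case: every input byte becomes two output bytes. */
	result = malloc(2 * strlen(input) + 1);
	if (!result)
		return NULL;
	output = (unsigned char *)result;

	for (;;) {
		unsigned char c = *in++;

		if (!c)
			break;

		/* US-ASCII is copied through unchanged. */
		if (c < 128) {
			*output++ = c;
			continue;
		}

		/* 0x80..0xff becomes a lead byte plus one continuation byte. */
		*output++ = 0xc0 + (c >> 6);
		*output++ = 0x80 + (c & 0x3f);
	}

	/* The loop does not copy the terminator, so add it here. */
	*output = 0;
	return result;
}
```

As an example, the Latin-1 byte 0xe9 (e with acute accent) comes out as the two bytes 0xc3 0xa9.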
It's actually harder to check "is this valid utf-8" than it is to just
convert random latin1 code to valid utf-8.
Linus