AP Vision import

Miika Turkia miika.turkia at gmail.com
Sun Apr 16 08:10:14 PDT 2017


On Sat, Apr 15, 2017 at 11:54 PM, Linus Torvalds
<torvalds at linux-foundation.org> wrote:
> On Sat, Apr 15, 2017 at 10:10 AM, Miika Turkia <miika.turkia at gmail.com> wrote:
>> The problem is that the character encoding in the sample file is
>> ISO-8859-1. Unfortunately, we assume the encoding to be UTF-8 on our
>> imports. I do not know of any efficient way to detect the encoding or to
>> convert the input into UTF-8 in C. This is a problem that our users do
>> run into occasionally, so it would be great if someone had an idea how
>> to fix this in a sensible manner...
>
> So if you *know* the input is in latin1, converting to utf-8 is trivial.

That is the problem. I have no clue what the input is. In the bug
report it was latin1 but it could really be something else as well.

> The conversion is literally:
>
>     for (;;) {
>         unsigned char c = *input++;
>
>         if (!c)
>             break;
>
>         /* US-ASCII? All done */
>         if (c < 128) {
>             *output++ = c;
>             continue;
>         }
>
>         /* High bit set - turn character into two-byte UTF-8 110000xx 10xxxxxx */
>         *output++ = 0xc0 + (c >> 6);
>         *output++ = 0x80 + (c & 0x3f);
>     }
>
> it's really that easy.
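
For reference, the quoted loop can be wrapped into a self-contained function roughly like this. It is only a sketch: the name latin1_to_utf8 and the malloc-based buffer handling are assumptions for illustration, not anything from the Subsurface sources.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Convert a NUL-terminated ISO-8859-1 (latin1) string to UTF-8.
 * Hypothetical helper: returns a malloc'ed buffer that the caller
 * must free, or NULL if the allocation fails.
 */
char *latin1_to_utf8(const char *input)
{
    /* Worst case: every byte becomes two bytes, plus the terminator. */
    char *result = malloc(2 * strlen(input) + 1);
    unsigned char *output = (unsigned char *)result;

    if (!result)
        return NULL;

    for (;;) {
        unsigned char c = (unsigned char)*input++;

        if (!c)
            break;

        /* US-ASCII is identical in latin1 and UTF-8. */
        if (c < 128) {
            *output++ = c;
            continue;
        }

        /* High bit set: emit the two-byte form 110000xx 10xxxxxx. */
        *output++ = 0xc0 | (c >> 6);
        *output++ = 0x80 | (c & 0x3f);
    }
    *output = '\0';
    return result;
}
```

Since latin1 maps byte values directly to code points U+0000..U+00FF, the output never needs more than two bytes per input byte, which is why the 2 * strlen + 1 allocation is enough.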

It seems that no manual conversion is needed. The XML library parses
other encodings just fine if it is instructed to do so. I now have
code that first attempts UTF-8 and, if that fails, retries with latin1.
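
The "try UTF-8 first, fall back to latin1" decision can be sketched without the XML library at all. The following is not the Subsurface code, which lets the parser itself report the failure; it is a standalone illustration, and the names is_valid_utf8 and guess_encoding are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if buf[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;           /* number of continuation bytes expected */
        unsigned long cp, min;  /* decoded code point, smallest legal value */

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xe0) == 0xc0) {
            extra = 1; cp = c & 0x1f; min = 0x80;
        } else if ((c & 0xf0) == 0xe0) {
            extra = 2; cp = c & 0x0f; min = 0x800;
        } else if ((c & 0xf8) == 0xf0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;       /* stray continuation byte or invalid lead */
        }

        if (len - i <= extra)
            return false;       /* sequence is cut off at the end */

        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xc0) != 0x80)
                return false;
            cp = (cp << 6) | (buf[i + k] & 0x3f);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return false;

        i += extra + 1;
    }
    return true;
}

/* Pick the encoding name to hand to the parser: UTF-8 if the data
 * validates, otherwise assume latin1 (an assumption, not a detection). */
const char *guess_encoding(const char *buf, size_t len)
{
    return is_valid_utf8((const unsigned char *)buf, len) ? "UTF-8" : "ISO-8859-1";
}
```

Assuming the XML library is libxml2, the returned name could be passed as the encoding argument of xmlReadMemory(). Note that the fallback is only a guess: any byte sequence is valid latin1, so input in some other 8-bit encoding would still be accepted, just with the wrong characters.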

miika

