AP Vision import

Sat Apr 15 13:54:05 PDT 2017

On Sat, Apr 15, 2017 at 10:10 AM, Miika Turkia <miika.turkia at gmail.com> wrote:
> The problem is that the character encoding in the sample file is
> iso8859-1. Unfortunately we assume the encoding to be UTF-8 on our
> imports. I do not know any efficient way to detect the encoding nor to
> convert the input into UTF-8 in C. This is a problem that our users do
> run into occasionally, so it would be great if someone had an idea how
> to fix this in sensible manner...

So if you *know* the input is in latin1, converting to utf-8 is trivial.

The conversion is literally:

    for (;;) {
        unsigned char c = *input++;

        if (!c)
            break;

        /* US-ASCII? All done */
        if (c < 128) {
            *output++ = c;
            continue;
        }

        /*
         * High bit set - turn character into two-byte UTF-8:
         * 110000xx 10xxxxxx
         */
        *output++ = 0xc0 + (c >> 6);
        *output++ = 0x80 + (c & 0x3f);
    }

    /* The loop stops before copying the NUL, so terminate the output */
    *output = 0;

it's really that easy.
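
For anyone who wants to try it, here is a minimal sketch of that loop wrapped
into a standalone function. The name latin1_to_utf8() and the malloc-based
buffer handling are assumptions made for illustration, not existing import
code. The only extra detail is sizing: each latin1 byte expands to at most
two UTF-8 bytes, so 2 * strlen(input) + 1 bytes is always enough.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (name and interface are illustrative only):
 * convert a NUL-terminated latin1 string into a newly allocated
 * UTF-8 string. Returns NULL if the allocation fails; the caller
 * is responsible for freeing the result.
 */
char *latin1_to_utf8(const char *latin1)
{
    const unsigned char *input = (const unsigned char *)latin1;
    /* Worst case: every byte becomes two bytes, plus the final NUL */
    unsigned char *result = malloc(2 * strlen(latin1) + 1);
    unsigned char *output = result;

    if (!result)
        return NULL;

    for (;;) {
        unsigned char c = *input++;

        if (!c)
            break;

        /* US-ASCII is unchanged in UTF-8 */
        if (c < 128) {
            *output++ = c;
            continue;
        }

        /* 0x80..0xff becomes the two-byte form 110000xx 10xxxxxx */
        *output++ = 0xc0 + (c >> 6);
        *output++ = 0x80 + (c & 0x3f);
    }

    *output = 0;
    return (char *)result;
}
```

As an example, the latin1 bytes "caf\xe9" come out as "caf\xc3\xa9".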

It's actually harder to check "is this valid utf-8" than it is to just
convert arbitrary latin1 text to valid utf-8.
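
To show why the check is the harder half, here is a rough sketch of a strict
validity test. The function name is_valid_utf8() is an assumption for
illustration only. Compared to the conversion loop, it has to track sequence
lengths, continuation bytes, overlong encodings, surrogates and the
U+10FFFF upper limit.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (name is illustrative only): return true if
 * the NUL-terminated string is well-formed UTF-8.
 */
bool is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        unsigned int c = *p++;
        unsigned int min;   /* smallest code point allowed for this length */
        int extra;          /* number of continuation bytes expected */

        if (c < 0x80)
            continue;

        if ((c & 0xe0) == 0xc0) {
            extra = 1; min = 0x80; c &= 0x1f;
        } else if ((c & 0xf0) == 0xe0) {
            extra = 2; min = 0x800; c &= 0x0f;
        } else if ((c & 0xf8) == 0xf0) {
            extra = 3; min = 0x10000; c &= 0x07;
        } else {
            /* Stray continuation byte or invalid lead byte */
            return false;
        }

        while (extra--) {
            /* A NUL here also fails, so we never read past the end */
            if ((*p & 0xc0) != 0x80)
                return false;
            c = (c << 6) | (*p++ & 0x3f);
        }

        /* Reject overlong forms, surrogates and out-of-range values */
        if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
            return false;
    }
    return true;
}
```

Note that a check like this can only say "this is not UTF-8". It cannot
prove that the input is latin1 rather than some other 8-bit encoding, so
falling back to the latin1 conversion when it fails is still a guess.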

              Linus