[PATCH] Check if DLD contains non-ascii characters

Tue Mar 26 13:37:08 PDT 2013

Miika Turkia <miika.turkia at gmail.com> writes:

> On Tue, Mar 26, 2013 at 10:15 PM, Dirk Hohndel <dirk at hohndel.org> wrote:
>> Miika Turkia <miika.turkia at gmail.com> writes:
>>
>>> Valid divelogs.de export might contain non-ascii characters in CDATA
>>> fields as long as these characters are found in iso-8859-1. So we'll
>>> have to test to make sure the content is fully ascii before calling
>>> xmlStringLenDecodeEntities to decode possible character references.
>>
>> So what happens if we have both ä and &#1023; in the CDATA section?
>
> Good point. My first assumption was that the whole XML file would be
> encoded, but now that I tested it, it is not the case. Each CDATA is
> treated independently and it is possible to have öä in one CDATA and
> another CDATA in Cyrillic and thus encoded. I would guess this to be
> unlikely but certainly possible. So this patch would display the
> Cyrillic CDATA in such a mixed case with character references
> (&#1003;).

So what we really need to do is
- scan the file for &#NNNN;
- if found, scan for non-ascii characters
- if found, convert to their &#NNNN; equivalent
- send buffer to conversion routine

Correct?
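
A rough sketch of the steps above, in case it helps the discussion. The
function names has_char_reference() and normalize_to_references() are
made up for illustration and are not part of the patch. It assumes the
raw non-ASCII bytes are ISO-8859-1, so each byte value is also its
Unicode code point, and it only looks for decimal references.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if buf contains a decimal character reference like &#1003; */
static int has_char_reference(const char *buf)
{
	const char *p = buf;

	while ((p = strstr(p, "&#")) != NULL) {
		const char *q = p + 2;

		while (isdigit((unsigned char)*q))
			q++;
		/* need at least one digit followed by ';' */
		if (q > p + 2 && *q == ';')
			return 1;
		p += 2;
	}
	return 0;
}

/*
 * Hypothetical helper: return a newly allocated copy of buf. If buf
 * contains a character reference, every raw non-ASCII byte is rewritten
 * as &#NNN; (assumption: the bytes are ISO-8859-1). Otherwise the copy
 * is unchanged. Returns NULL if allocation fails; caller frees.
 */
static char *normalize_to_references(const char *buf)
{
	size_t len = strlen(buf);
	/* worst case: every byte expands to six characters, e.g. "&#255;" */
	char *out = malloc(len * 6 + 1);
	char *w = out;
	int convert;
	const unsigned char *r;

	if (!out)
		return NULL;
	convert = has_char_reference(buf);
	for (r = (const unsigned char *)buf; *r; r++) {
		if (convert && *r >= 0x80)
			w += sprintf(w, "&#%u;", (unsigned)*r);
		else
			*w++ = (char)*r;
	}
	*w = '\0';
	return out;
}
```

The returned buffer is then pure ASCII whenever a reference was present,
so it could be handed to the conversion routine as the last step. A
buffer with no references comes back untouched.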

/D