[PATCH] Check if DLD contains non-ascii characters

Miika Turkia miika.turkia at gmail.com
Tue Mar 26 13:44:31 PDT 2013


On Tue, Mar 26, 2013 at 10:37 PM, Dirk Hohndel <dirk at hohndel.org> wrote:
> Miika Turkia <miika.turkia at gmail.com> writes:
>
>> On Tue, Mar 26, 2013 at 10:15 PM, Dirk Hohndel <dirk at hohndel.org> wrote:
>>> Miika Turkia <miika.turkia at gmail.com> writes:
>>>
>>>> Valid divelogs.de export might contain non-ascii characters in CDATA
>>>> fields as long as these characters are found in iso-8859-1. So we'll
>>>> have to test to make sure the content is fully ascii before calling
>>>> xmlStringLenDecodeEntities to decode possible character references.
>>>
>>> So what happens if we have both ä and &#1023; in the CDATA section?
>>
>> Good point. My first assumption was that the whole XML file would be
>> encoded, but now that I tested it, it is not the case. Each CDATA is
>> treated independently and it is possible to have öä in one CDATA and
>> another CDATA in Cyrillic and thus encoded. I would guess this to be
>> unlikely but certainly possible. So this patch would display the
>> Cyrillic CDATA in such a mixed case with character references
>> (&#1003;).
>
> So what we really need to do is
> - scan the file for &#NNNN;
> - if found, scan for non-ascii characters
> - if found, convert to their &#NNNN; equivalent
> - send buffer to conversion routine
>
> Correct?

That is the simplest solution that comes to mind at the moment, unless
something can be done on the parsed XML (after xmlReadMemory but before
applying the stylesheet). I didn't find such a solution when we were
fighting with the Cyrillic issue, so that is probably still a dead end.
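
A rough sketch of the first three steps quoted above, in plain C. This is
only an illustration, not the actual patch: the function names are made up,
it assumes the raw bytes are iso-8859-1 (so each byte >= 0x80 maps directly
to the code point with the same value), and it only recognizes decimal
references of the form &#NNNN;. The result is pure ASCII and could then be
handed to the decoding routine.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if buf contains a decimal character reference such as "&#1023;". */
static int has_char_ref(const char *buf, size_t len)
{
	for (size_t i = 0; i + 3 < len; i++) {
		if (buf[i] != '&' || buf[i + 1] != '#')
			continue;
		size_t j = i + 2;
		while (j < len && isdigit((unsigned char)buf[j]))
			j++;
		/* Need at least one digit followed by ';'. */
		if (j > i + 2 && j < len && buf[j] == ';')
			return 1;
	}
	return 0;
}

/* Return 1 if any byte lies outside the 7-bit ASCII range. */
static int has_non_ascii(const char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if ((unsigned char)buf[i] >= 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: returns a newly allocated, NUL-terminated copy of
 * buf. If buf mixes character references with raw non-ASCII bytes, each
 * such byte is rewritten as "&#NNN;" (assumption: input is iso-8859-1).
 * Otherwise the copy is unchanged. The caller frees the result.
 */
static char *normalize_to_char_refs(const char *buf, size_t len)
{
	assert(buf != NULL);
	int convert = has_char_ref(buf, len) && has_non_ascii(buf, len);
	/* Worst case: every byte expands to "&#255;" (6 characters). */
	char *out = malloc(len * 6 + 1);
	if (!out)
		return NULL;
	size_t o = 0;
	for (size_t i = 0; i < len; i++) {
		unsigned char c = (unsigned char)buf[i];
		if (convert && c >= 0x80)
			o += (size_t)sprintf(out + o, "&#%u;", (unsigned)c);
		else
			out[o++] = (char)c;
	}
	out[o] = '\0';
	return out;
}
```

With this, a buffer holding the byte 0xE4 (ä) next to &#1023; would come out
with the ä rewritten as &#228; and the existing reference left alone, while a
buffer with only raw iso-8859-1 bytes and no references is left untouched.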

miika

