[PATCH] DLD upload

Dirk Hohndel dirk at hohndel.org
Fri Mar 15 16:43:10 PDT 2013


Miika Turkia <miika.turkia at gmail.com> writes:
>
> I have also been playing with the attached patch. The debug print
> seems to indicate that the buffer is formatted properly, but then it
> just does not work. I wonder if this is related to the above oddity.

That attached patch finally got me to look in the right direction.

Here's what I just pushed.

Please test (especially Sergey as you were the person initially
reporting the problem).

Rainer - from my experiments I am guessing that you have fixed size
buffers for the strings that you export and that you don't account for
the fact that the HTML encoding turns each Cyrillic character into 7
bytes (e.g. &#1072;). I think that means that your buffers are too
small and some strings get truncated in the middle of an encoded
character, which then causes incorrect encodings like this:

  <BOATNAME><![CDATA[открыти&]]></BOATNAME>
  <CYLINDERDESCRIPTION><![CDATA[алюми&#108]]></CYLINDERDESCRIPTION>

In both cases the end of the string looks truncated, right?
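
To illustrate what I suspect is happening on the export side, here is a
minimal sketch. It is not divelogs.de code: encode_entities and
ends_in_partial_entity are hypothetical names, and the 40 character
field size is only an assumption, picked because it reproduces the
"&#108" tail seen in the CYLINDERDESCRIPTION sample.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical exporter: write each code point as a decimal numeric
 * character reference ("&#NNNN;") into a fixed-size buffer.
 * snprintf() truncates silently once the buffer is full. */
static void encode_entities(const unsigned *cps, size_t n,
			    char *out, size_t outsz)
{
	size_t used = 0;

	assert(outsz > 0);
	out[0] = '\0';
	for (size_t i = 0; i < n && used + 1 < outsz; i++) {
		int w = snprintf(out + used, outsz - used, "&#%u;", cps[i]);

		if (w < 0)
			break;
		/* snprintf returns the untruncated length, so clamp it
		 * to the number of characters that actually fit */
		if ((size_t)w < outsz - used)
			used += (size_t)w;
		else
			used = outsz - 1;
	}
}

/* Returns 1 if the string ends inside an unterminated "&...;" reference,
 * which is what a mid-entity truncation leaves behind. */
static int ends_in_partial_entity(const char *s)
{
	const char *amp = strrchr(s, '&');

	return amp != NULL && strchr(amp, ';') == NULL;
}
```

With the eight code points of "алюминий" (1072, 1083, 1102, 1084, 1080,
1085, 1080, 1081) and a 41 byte buffer (40 characters plus the NUL),
five references fit completely and the sixth is cut off as "&#108". A
check like ends_in_partial_entity() on the export side would catch that
before the file is written.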

Anyway, here's what I have and what seems to work for me.

/D

From 757791335f212a189790452cb2d467c31a2ae672 Mon Sep 17 00:00:00 2001
From: Miika Turkia <miika.turkia at gmail.com>
Date: Fri, 15 Mar 2013 19:02:14 +0200
Subject: [PATCH 1/2] Support divelogs.de exports that include Cyrillic
 characters

divelogs.de sends us XML files that explicitly state that they are in
ISO-8859-1 encoding (which is true). These files contain the HTML encoded
Cyrillic characters. Once we decode those characters the resulting file is
actually UTF-8 encoded (which matches ISO-8859-1 only for plain ASCII), so
the encoding declaration is no longer correct. That seriously confuses
libxml when it tries to parse things.

So instead recognize divelogs.de files and skip the encoding declaration
for them before decoding the HTML encoded non-ISO-8859-1 characters.

This does show, however, that divelogs.de incorrectly truncates the
encoded strings (at least with some sample data that I created, the
parsing throws errors because of that).

Reported-by: Sergey Starosek <sergey.starosek at gmail.com>
Based-on-code-by: Miika Turkia <miika.turkia at gmail.com>
Signed-off-by: Dirk Hohndel <dirk at hohndel.org>
---
 parse-xml.c        | 24 +++++++++++++++++++++++-
 xslt/divelogs.xslt |  2 +-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/parse-xml.c b/parse-xml.c
index 4cdc3d8..b24806b 100644
--- a/parse-xml.c
+++ b/parse-xml.c
@@ -8,6 +8,7 @@
 #define __USE_XOPEN
 #include <time.h>
 #include <libxml/parser.h>
+#include <libxml/parserInternals.h>
 #include <libxml/tree.h>
 #ifdef XSLT
 #include <libxslt/transform.h>
@@ -1533,13 +1534,34 @@ static void reset_all(void)
 	import_source = UNKNOWN;
 }
 
+/* divelogs.de sends us xml files that claim to be iso-8859-1
+ * but once we decode the HTML encoded characters they turn
+ * into UTF-8 instead. So skip the incorrect encoding
+ * declaration and decode the HTML encoded characters */
+const char *preprocess_divelog_de(const char *buffer)
+{
+	char *ret = strstr(buffer, "<DIVELOGSDATA>");
+
+	if (ret) {
+		xmlParserCtxtPtr ctx;
+		char buf[] = "";
+
+		ctx = xmlCreateMemoryParserCtxt(buf, sizeof(buf));
+		ret = xmlStringLenDecodeEntities(ctx, ret, strlen(ret),  XML_SUBSTITUTE_REF, 0, 0, 0);
+
+		return ret;
+	}
+	return buffer;
+}
+
 void parse_xml_buffer(const char *url, const char *buffer, int size,
 			struct dive_table *table, GError **error)
 {
 	xmlDoc *doc;
+	const char *res = preprocess_divelog_de(buffer);
 
 	target_table = table;
-	doc = xmlReadMemory(buffer, size, url, NULL, 0);
+	doc = xmlReadMemory(res, strlen(res), url, NULL, 0);
 	if (!doc) {
 		fprintf(stderr, _("Failed to parse '%s'.\n"), url);
 		parser_error(error, _("Failed to parse '%s'"), url);
diff --git a/xslt/divelogs.xslt b/xslt/divelogs.xslt
index f66ffcc..c0585a5 100644
--- a/xslt/divelogs.xslt
+++ b/xslt/divelogs.xslt
@@ -1,7 +1,7 @@
 <?xml version="1.0"?>
 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:strip-space elements="*"/>
-  <xsl:output method="xml" indent="yes"/>
+  <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
 
   <xsl:template match="/">
     <divelog program='subsurface-import' version='2'>
-- 
1.8.0.rc0.18.gf84667d
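
For anyone wondering why the decoded buffer stops matching the
ISO-8859-1 declaration, here is a minimal sketch of what decoding the
numeric references does at the byte level. This is not the code path in
the patch, which uses libxml's xmlStringLenDecodeEntities(); it is a
hand-rolled stand-in, and utf8_put and decode_numeric_refs are
hypothetical names. It only handles numeric references, and the caller
is assumed to provide a dst of at least strlen(src) + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Write code point cp to out as UTF-8.
 * Returns the number of bytes written, or 0 if cp is not encodable. */
static size_t utf8_put(unsigned long cp, char *out)
{
	if (cp == 0 || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
		return 0;
	if (cp < 0x80) {
		out[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (char)(0xC0 | (cp >> 6));
		out[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (char)(0xE0 | (cp >> 12));
		out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (char)(0xF0 | (cp >> 18));
	out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (char)(0x80 | (cp & 0x3F));
	return 4;
}

/* Decode "&#NNNN;" and "&#xHHHH;" references in src into dst.
 * Malformed or truncated references are copied through unchanged.
 * The output is never longer than the input. */
static void decode_numeric_refs(const char *src, char *dst)
{
	assert(src != NULL && dst != NULL);
	while (*src) {
		if (src[0] == '&' && src[1] == '#') {
			const char *p = src + 2;
			unsigned long base = 10, cp = 0;
			int digits = 0;
			size_t n;

			if (*p == 'x') {
				base = 16;
				p++;
			}
			for (;; p++) {
				unsigned long d;

				if (*p >= '0' && *p <= '9')
					d = (unsigned long)(*p - '0');
				else if (base == 16 && *p >= 'a' && *p <= 'f')
					d = (unsigned long)(*p - 'a') + 10;
				else if (base == 16 && *p >= 'A' && *p <= 'F')
					d = (unsigned long)(*p - 'A') + 10;
				else
					break;
				cp = cp * base + d;
				if (cp > 0x10FFFF)
					cp = 0x110000; /* saturate, stays invalid */
				digits++;
			}
			if (digits > 0 && *p == ';' &&
			    (n = utf8_put(cp, dst)) > 0) {
				dst += n;
				src = p + 1;
				continue;
			}
		}
		*dst++ = *src++;
	}
	*dst = '\0';
}
```

Feeding it "&#1072;&#108" gives the two bytes 0xD0 0xB0 (UTF-8 for "а")
followed by the literal "&#108", since the second reference has no
terminating ';'. Read as ISO-8859-1, those two bytes would be two
unrelated characters, which is why the declaration has to go once the
references are decoded.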


