Thread: Encoding problems in PostgreSQL with XML data
This is not directly related to current development, but it is something that might need a low-level solution. I've been thinking for some time about how to enhance the current "XML support" (e.g., contrib/xml).

The central problem I have is this: How do we deal with the fact that an XML datum carries its own encoding information?

Here's a scenario: It is desirable to have validity checking on XML input, be it a special XML data type or some functions that take XML data. Say we define a data type that stores XML documents and rejects documents that are not well-formed. I want to insert something in psql:

CREATE TABLE test (
    description text,
    content xml
);

INSERT INTO test VALUES ('test document', '<?xml version="1.0"?><doc><para>blah</para>...</doc>');

Now an XML parser will assume this document to be in UTF-8, and say at the client it is. What if client_encoding=UNICODE but server_encoding=LATIN1? Do we expect some layer to rewrite the <?xml?> declaration to contain the correct encoding information? Or can the xml type bypass encoding conversion? What about reading it back out of the database with yet another client encoding?

Rewriting the <?xml?> declaration seems like a workable solution, but it would break the transparency of the client/server encoding conversion. Also, some people might dislike that their documents are being changed as they are stored.

Any ideas?
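[To make the mismatch concrete, here is a minimal sketch of that scenario; the xml type itself is only hypothetical at this point:]

-- Client talks UTF-8, server stores LATIN1.
SET client_encoding = 'UNICODE';
SHOW server_encoding;   -- suppose this reports LATIN1

-- The document declares UTF-8 and really is UTF-8 on the client, but the
-- normal client->server conversion rewrites the bytes to LATIN1 before any
-- xml input function sees them, so the stored bytes no longer agree with
-- the <?xml?> declaration.
INSERT INTO test VALUES ('test document',
    '<?xml version="1.0" encoding="UTF-8"?><doc><para>blah</para></doc>');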
> Rewriting the <?xml?> declaration seems like a workable solution, but it
> would break the transparency of the client/server encoding conversion.
> Also, some people might dislike that their documents are being changed
> as they are stored.

I presume that the XML type stores the textual representation of the XML rather than a binary (parsed) format? I say we treat XML the same way as we deal with things like float and allow some play with the stored object so long as it's reasonably small and has the same final interpretation. If they wanted non-mutable text, they would not have declared it as a structured format. Just like if they don't want leading 0's removed from numeric input, they don't declare it as numeric but as binary.
Peter Eisentraut wrote:
> The central problem I have is this: How do we deal with the fact that
> an XML datum carries its own encoding information?

Maybe I am misunderstanding your question, but IMO postgres should be treating xml documents as if they were binary data, unless the server takes on the role of a parser, in which case it should handle unspecified/unknown encodings just like a normal xml parser would (and this does *not* include changing the encoding!).

According to me, an XML parser should not change one bit of a document, because that is not a 'parse', but a 'transformation'.

> Rewriting the <?xml?> declaration seems like a workable solution, but it
> would break the transparency of the client/server encoding conversion.
> Also, some people might dislike that their documents are being changed
> as they are stored.

Right, your example begs the question: why does the server care what the encoding of the documents is (perhaps indexing)? XML validation is a standardized operation which the server (or psql, I suppose) can subcontract out to another application.

Just a side thought: what if the xml encoding type was built into the domain type itself?

create domain xml_utf8 ...

Which allows casting, etc. which is more natural than an implicit transformation.

Regards,
Merlin
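[For illustration, a rough sketch of how such a domain might look; xml_valid() here merely stands in for whatever well-formedness check contrib/xml or an external parser actually provides:]

CREATE DOMAIN xml_utf8 AS text
    CHECK (VALUE IS NULL OR xml_valid(VALUE));

CREATE TABLE docs (content xml_utf8);

-- The cast then becomes an explicit step:
--   SELECT some_text_column::xml_utf8 FROM ...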
Perhaps the document should be stored in canonical form. See http://www.w3.org/TR/xml-c14n

I think I agree with Rod's opinion elsewhere in this thread. I guess the "philosophical" question is this: If 2 XML documents with different encodings have the same canonical form, or perhaps produce the same DOM, are they equivalent? Merlin appears to want to say "no", and I think I want to say "yes".

cheers

andrew

Merlin Moncure wrote:

> Peter Eisentraut wrote:
>
>> The central problem I have is this: How do we deal with the fact that
>> an XML datum carries its own encoding information?
>
> Maybe I am misunderstanding your question, but IMO postgres should be
> treating xml documents as if they were binary data, unless the server
> takes on the role of a parser, in which case it should handle
> unspecified/unknown encodings just like a normal xml parser would (and
> this does *not* include changing the encoding!).
>
> According to me, an XML parser should not change one bit of a document,
> because that is not a 'parse', but a 'transformation'.
>
>> Rewriting the <?xml?> declaration seems like a workable solution, but it
>> would break the transparency of the client/server encoding conversion.
>> Also, some people might dislike that their documents are being changed
>> as they are stored.
>
> Right, your example begs the question: why does the server care what the
> encoding of the documents is (perhaps indexing)? XML validation is a
> standardized operation which the server (or psql, I suppose) can
> subcontract out to another application.
>
> Just a side thought: what if the xml encoding type was built into the
> domain type itself?
> create domain xml_utf8 ...
> Which allows casting, etc. which is more natural than an implicit
> transformation.
>
> Regards,
> Merlin
Andrew Dunstan wrote:
> I think I agree with Rod's opinion elsewhere in this thread. I guess the
> "philosophical" question is this: If 2 XML documents with different
> encodings have the same canonical form, or perhaps produce the same DOM,
> are they equivalent? Merlin appears to want to say "no", and I think I
> want to say "yes".

Er, yes, except for canonical XML. Canonical XML neatly bypasses all the encoding issues that I can see.

Maybe I am still not getting the basic point, but the part I was not quite clear on is why the server would need to parse the document at all, much less change the encoding. Sure, it doesn't necessarily hurt to do it, but why bother? An external parser could handle both the parsing and the validation. Reading Peter's post, he seems to be primarily concerned with an automatic XML validation trigger that comes built in with the XML 'type'.

*unless*

1. The server needs to parse the document and get values from the document for indexing/key generation purposes; now the encoding becomes very important (especially considering joins between XML and non-XML data types).
2. There are plans to integrate XPath expressions into queries.
3. The server wants to compose generated XML documents from stored XML/non-XML sources, with (substantial) additions to the query language to facilitate this, i.e. a nested data extraction replacement for psql.

But, since I'm wishing for things, I may as well ask for a hockey rink in my living room :)

Merlin
Andrew Dunstan wrote:
> Perhaps the document should be stored in canonical form.

That kills the DTD, the id attributes, thus crippling XPath, and it looks horrible on output. I don't think that can be accepted. Canonical form is useful for comparing documents, but not for operating on them, I think.
Peter Eisentraut wrote:
> Andrew Dunstan wrote:
>
>> Perhaps the document should be stored in canonical form.
>
> That kills the DTD, the id attributes, thus crippling XPath, and it
> looks horrible on output. I don't think that can be accepted.
> Canonical form is useful for comparing documents, but not for operating
> on them, I think.

OK, fair enough. What exactly do we expect the functionality of an xml type to be? Merely a guarantee that it is well-formed?

cheers

andrew
Merlin Moncure wrote on Fri, 09.01.2004 at 22:04:
> Peter Eisentraut wrote:
> > The central problem I have is this: How do we deal with the fact that
> > an XML datum carries its own encoding information?
>
> Maybe I am misunderstanding your question, but IMO postgres should be
> treating xml documents as if they were binary data, unless the server
> takes on the role of a parser, in which case it should handle
> unspecified/unknown encodings just like a normal xml parser would (and
> this does *not* include changing the encoding!).
>
> According to me, an XML parser should not change one bit of a document,
> because that is not a 'parse', but a 'transformation'.

IIRC, the charset transformations are done as a separate step in the wire protocol _before_ any parser has a chance to transform it or not. The charset transform on an incoming query is also agnostic of text contents - it just blindly transforms the whole SQL statement.

---------------
Hannu
Hannu Krosing wrote:
> IIRC, the charset transformations are done as a separate step in the
> wire protocol _before_ any parser has a chance to transform it or not.

Yep. My point is that this is wrong. I think of XML the same way I think of a zip file that contains a text document. Postgres does not unzip a text file to change the char encoding any more than it should parse an XML document and change the encoding, unless this is the specific intent of the user for a specific purpose.

Validation alone does not qualify as a reason, because an XML parser (xerces) can do validation server-side without mucking with the document. Postgres need only be aware of the fact that the data is XML and should be validated. If postgres needs to be aware of internal document contents (for indexing, for example), XSLT can be used for that purpose.

Regards,
Merlin
Merlin Moncure wrote on Mon, 12.01.2004 at 19:56:
> Hannu Krosing wrote:
> > IIRC, the charset transformations are done as a separate step in the
> > wire protocol _before_ any parser has a chance to transform it or not.
>
> Yep. My point is that this is wrong.

Of course :)

It seems to be a quick hack somebody implemented in need, and doing it as a separate step was surely the easiest way to get it working.

I hope that real as-needed-column-by-column translation will be used with bound argument queries.

It also seems possible to delegate the encoding changes to after the query is parsed, but this will never work for EBCDIC and other funny encodings (like rot13 ;). for these we need to define the actual SQL statement encoding on-wire to be always ASCII.

---------------
Hannu
Hannu Krosing wrote:
> I hope that real as-needed-column-by-column translation will be used
> with bound argument queries.
>
> It also seems possible to delegate the encoding changes to after the
> query is parsed, but this will never work for EBCDIC and other funny
> encodings (like rot13 ;).
>
> for these we need to define the actual SQL statement encoding on-wire to
> be always ASCII.

In that case, treat the XML document like a binary stream, using PQescapeBytea, etc. to encode if necessary pre-query. Also, the XML domain should inherit from bytea, not varchar. The document should be stored bit for bit as was submitted. If we can do that for bitmaps, why can't we do it for XML documents?

OTOH, if we are transforming the document down to a more generic format (either canonical or otherwise), then the xml could be dealt with like text in the usual way. Of course, then we are not really storing xml, more like 'meta' xml ;)

Merlin
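[A rough sketch of the "store it bit for bit" idea, purely illustrative: xml_valid_raw() is hypothetical, and a domain over bytea stands in for "inherit from bytea":]

CREATE DOMAIN xml_raw AS bytea
    CHECK (VALUE IS NULL OR xml_valid_raw(VALUE));

CREATE TABLE edi_docs (
    received  timestamptz DEFAULT now(),
    document  xml_raw
);

-- On the client side the document bytes would be run through PQescapeBytea()
-- (or sent as a binary bound parameter) so that no client/server encoding
-- conversion ever touches them.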
Merlin Moncure wrote on Wed, 14.01.2004 at 15:49:
> Hannu Krosing wrote:
> > I hope that real as-needed-column-by-column translation will be used
> > with bound argument queries.
> >
> > It also seems possible to delegate the encoding changes to after the
> > query is parsed, but this will never work for EBCDIC and other funny
> > encodings (like rot13 ;).
> >
> > for these we need to define the actual SQL statement encoding on-wire to
> > be always ASCII.
>
> In that case, treat the XML document like a binary stream, using
> PQescapeBytea, etc. to encode if necessary pre-query. Also, the XML
> domain should inherit from bytea, not varchar.

why ?

the allowed character repertoire in XML is even less than in varchar.

> The document should be stored bit for bit as was submitted.

Or in some pre-parsed form which allows restoration of the submitted form, which could be better for things like xpath queries or subtree extraction.

> If we can do that for bitmaps, why can't we do it for XML documents?
>
> OTOH, if we are transforming the document down to a more generic format
> (either canonical or otherwise), then the xml could be dealt with like
> text in the usual way. Of course, then we are not really storing xml,
> more like 'meta' xml ;)

On the contrary! If there is a DTD or Schema or other structure definition for the XML, then we know which whitespace is significant and can do whatever we like with insignificant whitespace.

It also is ok to store all XML in some UNICODE encoding, as this is what every XML document must be convertible to. It's the same as storing ints - you don't care whether you specified 1000 or 1e3 when doing the insert, as

hannu=# select 1000 = 1e3;
 ?column?
----------
 t
(1 row)

In the same way the following should also be true:

select
'<d/>'::xml == '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml

-----------
Hannu
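[One way that equivalence could be expressed, as a sketch only; both the xml type and xml_canonical() are hypothetical here:]

-- Compare documents by a canonicalized form (see the xml-c14n link above)
-- rather than byte for byte.
CREATE FUNCTION xml_eq(xml, xml) RETURNS boolean AS
    'SELECT xml_canonical($1) = xml_canonical($2)'
    LANGUAGE sql IMMUTABLE;

CREATE OPERATOR = (
    leftarg   = xml,
    rightarg  = xml,
    procedure = xml_eq
);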
Hannu Krosing wrote:
> > In that case, treat the XML document like a binary stream, using
> > PQescapeBytea, etc. to encode if necessary pre-query. Also, the XML
> > domain should inherit from bytea, not varchar.
>
> why ?
>
> the allowed character repertoire in XML is even less than in varchar.

Yes, that is correct. I was resorting to hyperbole...see my reasoning below.

> > The document should be stored bit for bit as was submitted.
>
> Or in some pre-parsed form which allows restoration of the submitted form,
> which could be better for things like xpath queries or subtree extraction.

This is the crucial point: I'll try and explain my thinking better.

> > OTOH, if we are transforming the document down to a more generic format
> > (either canonical or otherwise), then the xml could be dealt with like
> > text in the usual way. Of course, then we are not really storing xml,
> > more like 'meta' xml ;)
>
> On the contrary! If there is a DTD or Schema or other structure definition
> for the XML, then we know which whitespace is significant and can do
> whatever we like with insignificant whitespace.

According to the XML standard, whitespace is always significant unless it is outside an element or attribute and thus not part of the real data. A DTD or Schema adds constraints, not removes them. I'm nitpicking, but this is extra evidence for my philosophy of xml storage that I'll explain below.

> select
> '<d/>'::xml == '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml

Right: I understand your reasoning here. Here is the trick:

select '[...]'::xml introduces a casting step which justifies a transformation. The original input data is not xml, but varchar. Since there are no arbitrary rules on how to do this, we have some flexibility here to do things like change the encoding/mess with the whitespace. I am trying to find a way to break the assumption that my xml data necessarily has to be converted from raw text.

My basic point is that we are confusing the roles of storing and parsing/transformation. The question is: are we storing xml documents or the metadata that makes up xml documents? We need to be absolutely clear on which role the server takes on...in fact both roles may be appropriate for different situations, but they should be represented by different types. I'll try and give examples of both situations.

If we are strictly storing documents, IMO the server should perform zero modification on the document. Validation could be applied conceptually as a constraint (and, possibly, XSLT/XPath to allow a fancy type of indexing). However, there is no advantage that I can see to manipulating the document except to break the 'C' of ACID. My earlier comments wrt binary encoding were that there simply has to be a way to prevent the server from mucking with my document.

For example, if I was using postgres to store XML-EDI documents in a DX system, this is the role I would prefer. Validation and indexing are useful, but my expected use of the server is a type of electronic xerox of the incoming document. I would be highly suspicious of any modification the server made to my document for any reason.

Now, if we are storing xml as content, say for a content database backing a web page, the server takes on the role of a meta-data storage system. Now, it is reasonable to assume the server might do additional processing besides storage and validation. The character encoding of the incoming data is of little or no importance because the xml will almost certainly undergo an additional transformation step after extraction from the database. Flexibility, simplicity, and utility are the most important requirements, so text transformation to a default encoding would be quite useful.

Based on your suggestions I think you are primarily concerned with the second example. However, in my work I do a lot of DX and I see the xml document as a binary object. Server-side validation would be extremely helpful, but please don't change my document!

So, I submit that we are both right for different reasons.

Regards,
Merlin
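[A sketch of the "validation as a constraint, XPath for a fancy type of indexing" idea in the document-storage case; xml_valid() and xpath_string() stand in for whatever contrib/xml(2) or an external parser actually provides, and the stored document itself is never rewritten:]

CREATE TABLE dx_docs (
    received  timestamptz DEFAULT now(),
    document  text CHECK (xml_valid(document))
);

-- Expression index on an extracted value; the document stays byte for byte
-- as submitted, only the derived key lives in the index.
CREATE INDEX dx_docs_order_no
    ON dx_docs ((xpath_string(document, '/order/number')));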
Merlin Moncure wrote on Thu, 15.01.2004 at 18:43:
> Hannu Krosing wrote:
> > select
> > '<d/>'::xml == '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml
>
> Right: I understand your reasoning here. Here is the trick:
>
> select '[...]'::xml introduces a casting step which justifies a
> transformation. The original input data is not xml, but varchar. Since
> there are no arbitrary rules on how to do this, we have some flexibility
> here to do things like change the encoding/mess with the whitespace. I
> am trying to find a way to break the assumption that my xml data
> necessarily has to be converted from raw text.
>
> My basic point is that we are confusing the roles of storing and
> parsing/transformation. The question is: are we storing xml documents
> or the metadata that makes up xml documents? We need to be absolutely
> clear on which role the server takes on...in fact both roles may be
> appropriate for different situations, but they should be represented by
> different types. I'll try and give examples of both situations.
>
> If we are strictly storing documents, IMO the server should perform zero
> modification on the document. Validation could be applied conceptually
> as a constraint (and, possibly, XSLT/XPath to allow a fancy type of
> indexing). However, there is no advantage that I can see to manipulating
> the document except to break the 'C' of ACID. My earlier comments wrt
> binary encoding were that there simply has to be a way to prevent the
> server from mucking with my document.
>
> For example, if I was using postgres to store XML-EDI documents in a DX
> system, this is the role I would prefer. Validation and indexing are
> useful, but my expected use of the server is a type of electronic xerox
> of the incoming document. I would be highly suspicious of any
> modification the server made to my document for any reason.

The current charset/encoding support can be evil in some cases ;( The only solution seems to be keeping both server and client encoding as ASCII (or just disabling it).

The proper path to encodings must unfortunately do the encoding conversions *after* parsing, when it is known which parts of the original query string should be changed.

Or, as you suggested, always encode anything outside plain ASCII (n<32 and n>127), both on input (can be done client-side) and output (IIRC this needs another type with a different output function).

> Based on your suggestions I think you are primarily concerned with the
> second example. However, in my work I do a lot of DX and I see the xml
> document as a binary object. Server-side validation would be extremely
> helpful, but please don't change my document!

So the problem is not exactly XML, but rather problems with changing encodings of "binary" strings that should not be changed. I hope (but I'm not sure) that keeping client and server encodings the same should prevent that.

> So, I submit that we are both right for different reasons.

Seems so.

-----------------
Hannu
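[A rough sketch of the "keep everything on the wire plain ASCII" idea using the existing bytea escape format; purely illustrative and assuming a bytea-like column such as the edi_docs sketch above:]

-- The statement itself contains only ASCII; the non-ASCII byte travels as
-- the octal escape \351 and only becomes a real byte inside the server, in
-- a bytea value that encoding conversion never touches.
INSERT INTO edi_docs (document)
    VALUES (decode('<?xml version="1.0" encoding="ISO-8859-1"?><doc>caf\\351</doc>', 'escape'));

-- Reading it back the same way keeps the output ASCII-only as well.
SELECT encode(document, 'escape') FROM edi_docs;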
> Merlin Moncure wrote on Mon, 12.01.2004 at 19:56:
> > Hannu Krosing wrote:
> > > IIRC, the charset transformations are done as a separate step in the
> > > wire protocol _before_ any parser has a chance to transform it or not.
> >
> > Yep. My point is that this is wrong.
>
> Of course :)

We need this because our parser cannot handle some encodings, including UCS, Shift-JIS and Big5 (we currently do not allow UCS as a client-side encoding, but this is another story).

> It seems to be a quick hack somebody implemented in need, and doing it
> as a separate step was surely the easiest way to get it working.
>
> I hope that real as-needed-column-by-column translation will be used
> with bound argument queries.

IMO this does not help with our parser's limitation either.

> It also seems possible to delegate the encoding changes to after the
> query is parsed, but this will never work for EBCDIC and other funny
> encodings (like rot13 ;).
>
> for these we need to define the actual SQL statement encoding on-wire to
> be always ASCII.