Thread: Encoding problems in PostgreSQL with XML data

Encoding problems in PostgreSQL with XML data

From
Peter Eisentraut
Date:
This is not directly related to current development, but it is something 
that might need a low-level solution.  I've been thinking for some time 
about how to enhance the current "XML support" (e.g., contrib/xml).

The central problem I have is this:  How do we deal with the fact that 
an XML datum carries its own encoding information?

Here's a scenario:  It is desirable to have validity checking on XML 
input, be it a special XML data type or some functions that take XML 
data.  Say we define a data type that stores XML documents and rejects 
documents that are not well-formed.  I want to insert something in 
psql:

CREATE TABLE test (
    description text,
    content xml
);

INSERT INTO test VALUES ('test document', '<?xml 
version="1.0"?><doc><para>blah</para>...</doc>');

Now an XML parser will assume this document to be in UTF-8, and let's say 
at the client it is.  What if client_encoding=UNICODE but 
server_encoding=LATIN1?  Do we expect some layer to rewrite the <?xml?> 
declaration to contain the correct encoding information?  Or can the 
xml type bypass encoding conversion?  What about reading it back out of 
the database with yet another client encoding?
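To make the failure mode concrete, here is a sketch (the Grüße literal is 
just an arbitrary example of non-ASCII content):

-- client_encoding is UNICODE (UTF-8), server_encoding is LATIN1,
-- so the server converts every byte of the literal on the way in
SET client_encoding = 'UNICODE';

INSERT INTO test VALUES ('umlaut doc',
    '<?xml version="1.0" encoding="UTF-8"?><doc><para>Grüße</para></doc>');

-- the stored bytes are now LATIN1, but the embedded declaration still
-- claims UTF-8; an XML parser reading the column verbatim would
-- misinterpret every non-ASCII character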

Rewriting the <?xml?> declaration seems like a workable solution, but it 
would break the transparency of the client/server encoding conversion.  
Also, some people might dislike that their documents are being changed 
as they are stored.

Any ideas?



Re: Encoding problems in PostgreSQL with XML data

From
Rod Taylor
Date:
> Rewriting the <?xml?> declaration seems like a workable solution, but it 
> would break the transparency of the client/server encoding conversion.  
> Also, some people might dislike that their documents are being changed 
> as they are stored.

I presume that the XML type stores the textual representation of the XML
rather than a binary (parsed) format?

I say we treat XML the same way as we deal with things like float and
allow some play with the stored object so long as it's reasonably small
and has the same final interpretation.

If they wanted immutable text, they would not have declared it as a
structured format. Likewise, if they don't want leading 0's removed from
numeric input, they don't declare the column as numeric but as binary.



Re: Encoding problems in PostgreSQL with XML data

From
"Merlin Moncure"
Date:
Peter Eisentraut wrote:
> The central problem I have is this:  How do we deal with the fact that
> an XML datum carries its own encoding information?

Maybe I am misunderstanding your question, but IMO postgres should be
treating xml documents as if they were binary data, unless the server
takes on the role of a parser, in which case it should handle
unspecified/unknown encodings just like a normal xml parser would (and
this does *not* include changing the encoding!).

In my view, an XML parser should not change a single bit of a document,
because that is not a 'parse' but a 'transformation'.
> Rewriting the <?xml?> declaration seems like a workable solution, but it
> would break the transparency of the client/server encoding conversion.
> Also, some people might dislike that their documents are being changed
> as they are stored.

Right, your example raises the question: why does the server care what the
encoding of the documents is (perhaps for indexing)?  XML validation is a
standardized operation which the server (or psql, I suppose) can
subcontract out to another application.

Just a side thought: what if the xml encoding were built into the
domain type itself?
create domain xml_utf8 ...
That allows casting, etc., which is more natural than an implicit
transformation.
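A rough sketch of the idea, assuming a well-formedness function along the
lines of xml_valid() from contrib (the exact function name is an
assumption):

-- documents that fail the well-formedness check are rejected on input
CREATE DOMAIN xml_utf8 AS text
    CHECK (xml_valid(VALUE));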

Regards,
Merlin


Re: Encoding problems in PostgreSQL with XML data

From
Andrew Dunstan
Date:
Perhaps the document should be stored in canonical form. See 
http://www.w3.org/TR/xml-c14n

I think I agree with Rod's opinion elsewhere in this thread. I guess the 
"philosophical" question is this: If 2 XML documents with different 
encodings have the same canonical form, or perhaps produce the same DOM, 
are they equivalent? Merlin appears to want to say "no", and I think I 
want to say "yes".
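In SQL terms, with a hypothetical xml type, the question is whether a
comparison of two documents differing only in their declared encoding
should evaluate to true:

SELECT '<?xml version="1.0" encoding="UTF-8"?><doc>blah</doc>'::xml
     = '<?xml version="1.0" encoding="ISO-8859-1"?><doc>blah</doc>'::xml;

-- both canonicalize to <doc>blah</doc>, so "yes" means true here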

cheers

andrew




Re: Encoding problems in PostgreSQL with XML data

From
"Merlin Moncure"
Date:
Andrew Dunstan wrote:
> I think I agree with Rod's opinion elsewhere in this thread. I guess the
> "philosophical" question is this: If 2 XML documents with different
> encodings have the same canonical form, or perhaps produce the same DOM,
> are they equivalent? Merlin appears to want to say "no", and I think I
> want to say "yes".

Er, yes, except for canonical XML.  Canonical XML neatly bypasses all
the encoding issues that I can see.

Maybe I am still not getting the basic point, but the part I was not
quite clear on is why the server would need to parse the document at
all, much less change the encoding.  Sure, it doesn't necessarily hurt
to do it, but why bother?  An external parser could handle both the
parsing and the validation.  Reading Peter's post, I gather he is
primarily concerned with an automatic XML validation trigger that comes
built in with the XML 'type'.

*unless*

1. The server needs to parse the document and get values from the
document for indexing/key generation purposes; now the encoding becomes
very important (especially considering joins between XML and non-XML data
types).
2. There are plans to integrate XPath expressions into queries (see the
sketch below).
3. The server wants to compose generated XML documents from stored
XML/non-XML sources, with (substantial) additions to the query language
to facilitate this, i.e. a nested data extraction replacement for psql.
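For point 2, the sketch I have in mind is something like the xpath helpers
in contrib (assuming an xpath_string(document, query) function along the
lines of John Gray's pgxml module):

SELECT description
  FROM test
 WHERE xpath_string(content, '/doc/para/text()') = 'blah';

-- the comparison crosses the xml/text boundary, so the server suddenly
-- has to know exactly which encoding the document bytes are in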

But, since I'm wishing for things, I may as well ask for a hockey rink
in my living room :)

Merlin


Re: Encoding problems in PostgreSQL with XML data

From
Peter Eisentraut
Date:
Andrew Dunstan wrote:
> Perhaps the document should be stored in canonical form.

That kills the DTD, the id attributes, thus crippling XPath, and it 
looks horrible on output.  I don't think that can be accepted.  
Canonical form is useful for comparing documents, but not for operating 
on them, I think.



Re: Encoding problems in PostgreSQL with XML data

From
Andrew Dunstan
Date:

Peter Eisentraut wrote:

> Andrew Dunstan wrote:
>> Perhaps the document should be stored in canonical form.
>
> That kills the DTD, the id attributes, thus crippling XPath, and it
> looks horrible on output.  I don't think that can be accepted.
> Canonical form is useful for comparing documents, but not for operating
> on them, I think.

OK, fair enough.

What exactly do we expect the functionality of an xml type to be? Merely 
a guarantee that it is well-formed?

cheers

andrew



Re: Encoding problems in PostgreSQL with XML data

From
Hannu Krosing
Date:
Merlin Moncure wrote on Friday, 09.01.2004 at 22:04:
> Peter Eisentraut wrote:
> > The central problem I have is this:  How do we deal with the fact that
> > an XML datum carries its own encoding information?
> 
> Maybe I am misunderstanding your question, but IMO postgres should be
> treating xml documents as if they were binary data, unless the server
> takes on the role of a parser, in which case it should handle
> unspecified/unknown encodings just like a normal xml parser would (and
> this does *not* include changing the encoding!).
> 
> In my view, an XML parser should not change a single bit of a document,
> because that is not a 'parse' but a 'transformation'.

IIRC, the charset transformations are done as a separate step in the
wire protocol _before_ any parser has a chance to transform it (or not).

The charset transform on an incoming query is also agnostic of the text
contents - it just blindly transforms the whole SQL statement.

---------------
Hannu



Re: Encoding problems in PostgreSQL with XML data

From
"Merlin Moncure"
Date:
Hannu Krosing wrote:
> IIRC, the charset transformations are done as a separate step in the
> wire protocol _before_ any parser has a chance to transform it (or not).

Yep.  My point is that this is wrong.  I think of XML the same way I
think of a zip file that contains a text document.  Postgres does not
unzip the file to change the character encoding, and it likewise should
not parse an XML document and change its encoding unless that is the
specific intent of the user for a specific purpose.

Validation alone does not qualify as a reason, because an XML parser
(e.g. Xerces) can do validation server-side without mucking with the
document.  Postgres need only be aware that the data is XML and should
be validated.

If postgres needs to be aware of internal document contents (for
indexing, for example), XSLT can be used for that purpose.

Regards,
Merlin



Re: Encoding problems in PostgreSQL with XML data

From
Hannu Krosing
Date:
Merlin Moncure wrote on Monday, 12.01.2004 at 19:56:
> Hannu Krosing wrote:
> > IIRC, the charset transformations are done as a separate step in the
> > wire protocol _before_ any parser has a chance to transform it (or not).
> 
> Yep.  My point is that this is wrong.  

Of course :)

It seems to be a quick hack somebody implemented in need, and doing it
as a separate step was surely the easiest way to get it working.

I hope that real as-needed-column-by-column translation will be used
with bound argument queries.

It also seems possible to delegate the encoding changes to after the
query is parsed, but this will never work for EBCDIC and other funny
encodings (like rot13 ;). 

For these we need to define the actual SQL statement encoding on-wire to
always be ASCII.

---------------
Hannu



Re: Encoding problems in PostgreSQL with XML data

From
"Merlin Moncure"
Date:
Hannu Krosing wrote:
> I hope that real as-needed-column-by-column translation will be used
> with bound argument queries.
>
> It also seems possible to delegate the encoding changes to after the
> query is parsed, but this will never work for EBCDIC and other funny
> encodings (like rot13 ;).
>
> For these we need to define the actual SQL statement encoding on-wire to
> always be ASCII.

In that case, treat the XML document like a binary stream, using
PQescapeBytea, etc. to encode if necessary pre-query.  Also, the XML
domain should inherit from bytea, not varchar.  The document should be
stored bit for bit as it was submitted.  If we can do that for bitmaps,
why can't we do it for XML documents?
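A minimal sketch of what I mean (xml_raw is hypothetical; nothing like it
exists today):

-- stored bit for bit, bypassing client/server encoding conversion
CREATE DOMAIN xml_raw AS bytea;

CREATE TABLE archived_docs (
    received  timestamptz DEFAULT now(),
    content   xml_raw
);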

OTOH, if we are transforming the document down to a more generic format
(either canonical or otherwise), then the xml could be dealt with like
text in the usual way.  Of course, then we are not really storing xml,
more like 'meta' xml ;)

Merlin


Re: Encoding problems in PostgreSQL with XML data

From
Hannu Krosing
Date:
Merlin Moncure wrote on Wednesday, 14.01.2004 at 15:49:
> Hannu Krosing wrote:
> > I hope that real as-needed-column-by-column translation will be used
> > with bound argument queries.
> > 
> > It also seems possible to delegate the encoding changes to after the
> > query is parsed, but this will never work for EBCDIC and other funny
> > encodings (like rot13 ;).
> > 
> > For these we need to define the actual SQL statement encoding on-wire to
> > always be ASCII.
> 
> In that case, treat the XML document like a binary stream, using
> PQescapeBytea, etc. to encode if necessary pre-query.  Also, the XML
> domain should inherit from bytea, not varchar.

Why?

The allowed character repertoire in XML is even smaller than in varchar.

> The document should be stored bit for bit as it was submitted.

Or in some pre-parsed form which allows restoration of the submitted form,
which could be more efficient for things like xpath queries or subtree
extraction.

> If we can do that for bitmaps, why can't we do it for XML documents?  
> 
> OTOH, if we are transforming the document down to a more generic format
> (either canonical or otherwise), then the xml could be dealt with like
> text in the usual way.  Of course, then we are not really storing xml,
> more like 'meta' xml ;)

On the contrary! If there is a DTD or Schema or other structure definition
for the XML, then we know which whitespace is significant and can do
whatever we like with the insignificant whitespace.

It is also OK to store all XML in some Unicode encoding, since that is what
every XML document must be convertible to.

It's the same as storing ints - you don't care whether you specified 1000
or 1e3 when doing the insert, as

hannu=# select 1000=1e3;
 ?column?
----------
 t
(1 row)

In the same way, the following should also be true:

select '<d/>'::xml = '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml;

-----------
Hannu



Re: Encoding problems in PostgreSQL with XML data

From
"Merlin Moncure"
Date:
Hannu Krosing wrote:

> > In that case, treat the XML document like a binary stream, using
> > PQescapeBytea, etc. to encode if necessary pre-query.  Also, the XML
> > domain should inherit from bytea, not varchar.
>
> Why?
>
> The allowed character repertoire in XML is even smaller than in varchar.

Yes, that is correct.  I was resorting to hyperbole...see my reasoning
below.

> > The document should be stored bit for bit as it was submitted.
>
> Or in some pre-parsed form which allows restoration of the submitted form,
> which could be more efficient for things like xpath queries or subtree
> extraction.

This is the crucial point:  I'll try and explain my thinking better.

> > OTOH, if we are transforming the document down to a more generic format
> > (either canonical or otherwise), then the xml could be dealt with like
> > text in the usual way.  Of course, then we are not really storing xml,
> > more like 'meta' xml ;)
>
> On the contrary! If there is a DTD or Schema or other structure definition
> for the XML, then we know which whitespace is significant and can do
> whatever we like with the insignificant whitespace.

According to the XML standard, whitespace is always significant unless
it is outside an element or attribute and thus not part of the real
data.  A DTD or Schema adds constraints, it does not remove them.  I'm
nitpicking, but this is extra evidence for the philosophy of xml storage
that I'll explain below.

> select '<d/>'::xml = '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml;

Right: I understand your reasoning here.  Here is the trick:

select '[...]'::xml introduces a casting step which justifies a
transformation.  The original input data is not xml, but varchar.  Since
there are no hard rules on how to do this, we have some flexibility here
to do things like change the encoding or mess with the whitespace.  I am
trying to find a way to break the assumption that my xml data necessarily
has to be converted from raw text.
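To illustrate the distinction (again with a hypothetical xml type):

-- a cast from a text literal: the input is varchar, so normalizing
-- encoding or whitespace during the cast is fair game
SELECT '<d/>'::text::xml;

-- a value that is xml from the start (say, a bound parameter of type
-- xml, sent bit for bit) is the case that should be stored unmodified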

My basic point is that we are confusing the roles of storing and
parsing/transformation.  The question is: are we storing xml documents
or the metadata that makes up xml documents?  We need to be absolutely
clear on which role the server takes on...in fact both roles may be
appropriate for different situations, but should be represented by a
different type.  I'll try and give examples of both situations.

If we are strictly storing documents, IMO the server should perform zero
modification on the document.  Validation could be applied conceptually
as a constraint (and possibly XSLT/XPath to allow a fancy type of
indexing).  However, there is no advantage that I can see to manipulating
the document, except to break the 'C' of ACID.  My earlier comment wrt
binary encoding was that there simply has to be a way to prevent the
server from mucking with my document.
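Validation-as-constraint could look something like this (xml_valid is
assumed to be the well-formedness function from contrib):

CREATE TABLE docs (
    content  text,
    -- reject malformed documents, but never rewrite them
    CHECK (xml_valid(content))
);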

For example, if I was using postgres to store XML-EDI documents in a DX
system this is the role I would prefer.  Validation and indexing are
useful, but my expected use of the server is a type of electronic xerox
of the incoming document.  I would be highly suspicious of any
modification the server made to my document for any reason.

Now, if we are storing xml as content, say for a content database
backing a web page, the server takes on the role of a meta-data storage
system.  Now, it is reasonable to assume the server might do additional
processing besides storage and validation.  The character encoding of
the incoming data is of little or no importance because the xml will
almost certainly undergo an additional transformation step after
extraction from the database.  Flexibility, simplicity, and utility are
the most important requirements, so text transformation to a default
encoding would be quite useful.

Based on your suggestions I think you are primarily concerned with the
second example.  However, in my work I do a lot of DX and I see the xml
document as a binary object.  Server-side validation would be extremely
helpful, but please don't change my document!

So, I submit that we are both right for different reasons.

Regards,
Merlin



Re: Encoding problems in PostgreSQL with XML data

From
Hannu Krosing
Date:
Merlin Moncure wrote on Thursday, 15.01.2004 at 18:43:
> Hannu Krosing wrote:

> > select '<d/>'::xml = '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml;
> 
> Right: I understand your reasoning here.  Here is the trick:
> 
> select '[...]'::xml introduces a casting step which justifies a
> transformation.  The original input data is not xml, but varchar.  Since
> there are no hard rules on how to do this, we have some flexibility here
> to do things like change the encoding or mess with the whitespace.  I am
> trying to find a way to break the assumption that my xml data necessarily
> has to be converted from raw text.
> 
> My basic point is that we are confusing the roles of storing and
> parsing/transformation.  The question is: are we storing xml documents
> or the metadata that makes up xml documents?  We need to be absolutely
> clear on which role the server takes on...in fact both roles may be
> appropriate for different situations, but should be represented by a
> different type.  I'll try and give examples of both situations.
> 
> If we are strictly storing documents, IMO the server should perform zero
> modification on the document.  Validation could be applied conceptually
> as a constraint (and possibly XSLT/XPath to allow a fancy type of
> indexing).  However, there is no advantage that I can see to manipulating
> the document, except to break the 'C' of ACID.  My earlier comment wrt
> binary encoding was that there simply has to be a way to prevent the
> server from mucking with my document.
> 
> For example, if I was using postgres to store XML-EDI documents in a DX
> system this is the role I would prefer.  Validation and indexing are
> useful, but my expected use of the server is a type of electronic xerox
> of the incoming document.  I would be highly suspicious of any
> modification the server made to my document for any reason.  

The current charset/encoding support can be evil in some cases ;(

The only solution seems to be keeping both server and client encodings as
ASCII (or just disabling the conversion).

The proper path to encodings must unfortunately do the encoding
conversions *after* parsing, when it is known which parts of the
original query string should be changed.

Or, as you suggested, always encode anything outside plain ASCII (n<32
and n>127), both on input (which can be done client-side) and on output
(IIRC that needs another type with a different output function).
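With existing functions that could be as simple as base64 (a sketch
assuming a bytea column; 'PGQvPg==' is just '<d/>' encoded):

CREATE TABLE raw_docs (content bytea);

-- the wire traffic stays pure ASCII in both directions
INSERT INTO raw_docs VALUES (decode('PGQvPg==', 'base64'));
SELECT encode(content, 'base64') FROM raw_docs;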

> Based on your suggestions I think you are primarily concerned with the
> second example.  However, in my work I do a lot of DX and I see the xml
> document as a binary object.  Server-side validation would be extremely
> helpful, but please don't change my document!

So the problem is not exactly XML, but rather the changing of encodings
of "binary" strings that should not be changed.

I hope (but I'm not sure) that keeping client and server encodings the
same prevents that.

> So, I submit that we are both right for different reasons.

Seems so.

-----------------
Hannu




Re: Encoding problems in PostgreSQL with XML data

From
Tatsuo Ishii
Date:
> Merlin Moncure wrote on Monday, 12.01.2004 at 19:56:
> > Hannu Krosing wrote:
> > > IIRC, the charset transformations are done as a separate step in the
> > > wire protocol _before_ any parser has a chance to transform it (or not).
> > 
> > Yep.  My point is that this is wrong.  
> 
> Of course :)

We need this because our parser cannot handle some encodings, including
UCS, Shift-JIS and Big5, in which a multibyte character can contain bytes
that look like ASCII metacharacters such as the backslash (we currently
do not allow UCS as a client-side encoding, but that is another story).

> It seems to be a quick hack somebody implemented in need, and doing it
> as a separate step was surely the easiest way to get it working.
> 
> I hope that real as-needed-column-by-column translation will be used
> with bound argument queries.

IMO this does not help with our parser's limitation either.

> It also seems possible to delegate the encoding changes to after the
> query is parsed, but this will never work for EBCDIC and other funny
> encodings (like rot13 ;). 
> 
> For these we need to define the actual SQL statement encoding on-wire to
> always be ASCII.