Re: Encoding problems in PostgreSQL with XML data - Mailing list pgsql-hackers

From: Hannu Krosing
Subject: Re: Encoding problems in PostgreSQL with XML data
Date:
Msg-id: 1074199591.3292.12.camel@fuji.krosing.net
In response to: Re: Encoding problems in PostgreSQL with XML data ("Merlin Moncure" <merlin.moncure@rcsonline.com>)
List: pgsql-hackers
Merlin Moncure wrote on Thu, 15.01.2004 at 18:43:
> Hannu Krosing wrote:

> > select
> > '<d/>'::xml = '<?xml version="1.0" encoding="utf-8"?>\n<d/>\n'::xml
> 
> Right: I understand your reasoning here.  Here is the trick:
> 
> select '[...]'::xml introduces a casting step which justifies a
> transformation.  The original input data is not xml, but varchar.  Since
> there are no arbitrary rules on how to do this, we have some flexibility
> here to do things like change the encoding/mess with the whitespace.  I
> am trying to find a way to break the assumption that my xml data
> necessarily has to be converted from raw text.
> 
> My basic point is that we are confusing the roles of storing and
> parsing/transformation.  The question is: are we storing xml documents
> or the metadata that makes up xml documents?  We need to be absolutely
> clear on which role the server takes on...in fact both roles may be
> appropriate for different situations, but should be represented by a
> different type.  I'll try and give examples of both situations.
> 
> If we are strictly storing documents, IMO the server should perform zero
> modification on the document.  Validation could be applied conceptually
> as a constraint (and, possibly XSLT/XPATH to allow a fancy type of
> indexing).  However there is no advantage that I can see to manipulating
> the document except to break the 'C' of ACID.  My earlier comment wrt
> binary encoding was that there simply has to be a way to prevent the
> server from mucking with my document.
> 
> For example, if I were using postgres to store XML-EDI documents in a DX
> system this is the role I would prefer.  Validation and indexing are
> useful, but my expected use of the server is a type of electronic xerox
> of the incoming document.  I would be highly suspicious of any
> modification the server made to my document for any reason.  

The current charset/encoding support can be evil in some cases ;(

The only solution seems to be keeping both the server and client encodings
as ASCII (or just disabling conversion altogether).
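
Something like this (untested sketch; SQL_ASCII in effect turns the
conversion off):

    -- check what the server cluster was initdb'd with
    SHOW server_encoding;

    -- make this session's client side match / stop conversion
    SET client_encoding = 'SQL_ASCII';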

The proper path for encodings must unfortunately do the encoding
conversions *after* parsing, when it is known which parts of the original
query string should be changed.
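
To illustrate the problem (a made-up example, using a hypothetical docs
table): the backend recodes the *whole* query string from client_encoding
to server_encoding before the parser ever sees it, so the embedded
document below is recoded along with the SQL around it, while its own
declaration keeps claiming iso-8859-1:

    -- session with client_encoding = LATIN1, server_encoding = UNICODE
    insert into docs (body)
    values ('<?xml version="1.0" encoding="iso-8859-1"?><d>Kärt</d>');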

Or, as you suggested, always encode anything outside plain printable ASCII
(n < 32 or n > 127), both on input (which can be done client-side) and on
output (which IIRC needs another type with a different output function).
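
For the XML case that escaping could even live in the document itself
(again just a sketch, using the same hypothetical docs table): represent
everything non-ASCII as numeric character references, so only 7-bit bytes
ever cross the client/server conversion:

    -- 'ü' written as the reference &#252;, which survives any recoding
    insert into docs (body) values ('<name>J&#252;rgen</name>');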

> Based on your suggestions I think you are primarily concerned with the
> second example.  However, in my work I do a lot of DX and I see the xml
> document as a binary object.  Server-side validation would be extremely
> helpful, but please don't change my document!

So the problem is not exactly XML, but rather problems with changing
encodings of "binary" strings that should not be changed.

I hope (but I'm not sure) that keeping the client and server encodings the
same would prevent that.
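
Failing that, a true binary type is never run through encoding conversion
at all. A sketch along the lines of your "binary object" idea (the
edi_doc table name is made up):

    -- bytea values are not recoded between client and server encodings,
    -- at the price of losing text/XML operations on the column
    create table edi_doc (id serial primary key, doc bytea);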

> So, I submit that we are both right for different reasons.

Seems so.

-----------------
Hannu



