Re: Native XML - Mailing list pgsql-hackers
From | Kevin Grittner |
---|---|
Subject | Re: Native XML |
Date | |
Msg-id | 4D6CF171020000250003B20E@gw.wicourts.gov |
In response to | Re: Native XML (Andrew Dunstan <andrew@dunslane.net>) |
Responses |
Re: Native XML
Re: Native XML |
List | pgsql-hackers |
Andrew Dunstan <andrew@dunslane.net> wrote:
> On 02/28/2011 05:28 PM, Kevin Grittner wrote:
>> Anton<antonin.houska@gmail.com> wrote:
>>
>>> it was actually the focal point of my considerations: whether to
>>> store plain text or 'something else'.
>
> There seems to be an almost universal assumption that storing XML
> in its native form (i.e. a text stream) is going to produce
> inefficient results.

Well, certainly not in all cases. Finding those rows which satisfy
an XPath search among a few million rows with 20KB XML fields might
benefit from some sort of indexing, though.

> unless we implemented our own XPath processor to work with our own
> XML format (do we really want to do that?), to evaluate an XPath
> expression for a piece of XML we'd actually need to produce the
> text format from our internal format before passing it to some
> external library to parse into its internal format and then
> process the XPath expression.

My suggestion was that you would store the text format, and allow
the developer to create a sharded format in a different column with
a different type if desired, not the other way around. As I said,
similar to what a developer would do for tsvector to allow text
searches. I agree that creating the text from an internal format
doesn't sound good.

>> Given that there were similar issues for other hierarchical data
>> types, perhaps we need something similar to tsvector, but for
>> hierarchical data. The extra layer of abstraction might not cost
>> much when used for XML compared to the possible benefit with
>> other data. It seems likely to be a very nice fit with GiST
>> indexes.
>>
>> So under this idea, you would always have the text (or maybe byte
>> array?) version of the XML, and you could "shard" it to a
>> separate column for fast searches.

> Tsearch should be able to handle XML now. It certainly knows how
> to recognize XML tags.

I apparently didn't express myself very well, since you seem to have
*completely* missed my point.

I know we can do tsearch2 searches against XML, or JSON, or YAML, or
(insert next week's new favorite format here). What we can't
currently do efficiently is search for particular values in some
particular place in the hierarchy of a document. I've had loads of
fun approximating it with regular expressions, but some days I'd
like life to be easier.

What I was arguing for is a new type which would represent the
structure in a fashion which was independent of the particular text
format and was efficient to traverse hierarchically. Done right,
that would map well to GiST.

Although, thinking about that some more, perhaps there would be a
way to create a GiST index suitable for that straight from the XML
text, and avoid the sharded column. A GiST index actually seems
pretty close to what such a structure would look like anyway....

-Kevin
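
To make the "value at a particular place in the hierarchy" idea concrete, here is a minimal sketch outside the database. It is only an illustration, not anything proposed in the thread and not a PostgreSQL API: the names `shard` and `build_index` are hypothetical, the path notation is invented for the example, and a plain dictionary stands in for whatever index structure (GiST or otherwise) would really be used. The original XML text is kept untouched, and a derived, format-independent set of (path, value) pairs is built next to it, much as a tsvector is derived from text.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def shard(xml_text):
    """Flatten XML text into (path, value) pairs.

    Hypothetical helper: one pair per non-empty element text and per
    attribute. Tail text after child elements is ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        text = (elem.text or "").strip()
        if text:
            pairs.append((path, text))
        for name, value in elem.attrib.items():
            pairs.append((path + "/@" + name, value))
        for child in elem:
            walk(child, path)

    walk(root, "")
    return pairs


def build_index(rows):
    """Map each (path, value) pair to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, xml_text in rows.items():
        for pair in shard(xml_text):
            index[pair].add(row_id)
    return index


# Two stored documents; the text form stays the source of truth.
rows = {
    1: "<order><customer>Smith</customer><item sku='A1'>bolt</item></order>",
    2: "<order><customer>Jones</customer><note>Smith called</note></order>",
}
index = build_index(rows)

# "Smith" appears in both documents, but only row 1 has it as the customer.
hits = sorted(index.get(("/order/customer", "Smith"), set()))
```

A plain text search for "Smith" would match both rows; the lookup keyed on path and value matches only the row where it occurs at `/order/customer`, without re-parsing any document at query time.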