Firstly, I appreciate this may be a hare-brained scheme, but I've been
thinking about indexes in which the tuple pointer is not unique.
The reason for my interest is storing XML documents in text fields in the
database. (It could also help with particular kinds of full-text search?)
I would like to be able to construct indexes on a collection of XML
documents, based on the "value" of certain "fields" within the document.
(In jargon terms, producing an index whose key is the CDATA content of a
particular XML element). This could tie in with the Xpath and XQuery
proposals
A simplified example (from an archaeological site classification system):
<site><name>Glebe Farm, Long Itchington</name><location scheme="osgb">SU41793684</location><feature>
<type>Agricultural:StockControl</type>    <date scheme="code">med</date></feature><feature>
<type>Unassigned:Ditch</type>   <size type="depth" unit="m">1.5</size></feature>
 
</site> 
I'd like to produce an index on feature types so that I could type
(roughly):
SELECT siteid, xpath(doc,'//site/name'), xpath(doc,'//site/location') FROM
documents WHERE xpath(doc,'feature/type') = 'Agricultural: Stock Control';
[create table documents (integer siteid, text doc)]
Obviously I need to write a basic XML parser that can support such an
xpath function, but it would also be good to index by the results of that
function-i.e. to have an index containing feature type values. As each
document could have any number of these instances, the number of index
tuples would differ from the number of heap tuples.
As far as I can see, there is no particular reason why a btree index could
not be used[*]. However, vacuum.c makes assertions about number of index
tuples == number of heap tuples. I realise this is a useful consistency
check, but would it be possible to have a field in pg_index
(indnoidentity?) that indicates that a given index hasn't got a 1:1
index:heap relationship.
I have tried the approach of decomposing documents into cdata, element and
attribute tables, and I can use joins to extract a list of feature types
etc. (and could use triggers to update this) but the idea of not having to
parse a document to enter it into the database and not requiring
application logic to reconstruct it again seems a potential win for a
system which might store complex documents but usually searches on limited
criteria.