Re: Multi-entry indexes (with a view to XPath queries) - Mailing list pgsql-hackers

From John Gray
Subject Re: Multi-entry indexes (with a view to XPath queries)
Date
Msg-id csn9h9.c63.ln@adzuki
In response to Re: Multi-entry indexes (with a view to XPath queries)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
In article <28692.993502132@sss.pgh.pa.us>, tgl@sss.pgh.pa.us (Tom Lane)
wrote:
> "John Gray" <jgray@beansindustry.co.uk> writes:
>> Firstly, I appreciate this may be a hare-brained scheme, but I've been
>> thinking about indexes in which the tuple pointer is not unique.
> 
> It sounds pretty hare-brained to me all right ;-).  What's wrong with
> the normal approach of one index tuple per heap tuple, ie, multiple
> index tuples with the same key?  It seems to me that your idea will just
> make index maintenance a lot more difficult.  For example, what happens
> when one of the referenced rows is deleted?  We'd have to actually
> change, not just remove, the index tuple, since it'd also be pointing at
> undeleted rows.  That'll create a whole bunch of concurrency problems.
> 

Sorry, I fear my explanation has not been very clear. I'm not suggesting
that one index entry will point to more than one heap tuple, but that more
than one index entry may point to the same heap tuple. In this respect it
is the same as a full-text index (or an 'inverted index', as described by
Hannu); i.e. the data format and algorithm of the btree index need not
change.

If the heap tuple is changed, the set of index tuples pointing to it
becomes invalid (and a new set of index tuples is created pointing to the
new version), but VACUUM will dispose of the outdated index tuples
readily enough.

In fact, perhaps my question should be: full-text indexing in contrib is
provided by an auxiliary table. Is there a reason why it couldn't be done
with a functional btree index of (broadly) the same format? For example:

word (key)    document (heap tuple pointer)
apple         2
apple         3
Bob           1
cow           2
coward        1
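
For comparison, here is a minimal sketch of the auxiliary-table approach
(the table and column names are illustrative only, not the actual contrib
schema):

  -- one row per (word, document) pair, kept up to date by a trigger on
  -- the documents table; the btree index on word does the real work
  CREATE TABLE document_words (
      word    text,
      doc_id  integer REFERENCES documents(id)
  );
  CREATE INDEX document_words_word_idx ON document_words (word);

  -- a lookup then goes through the auxiliary table:
  SELECT d.*
    FROM documents d
    JOIN document_words w ON w.doc_id = d.id
   WHERE w.word = 'apple';

The question is whether the same (word, heap tuple pointer) layout could
live directly inside a functional btree index instead, so that the extra
heap table isn't needed.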


>> Obviously I need to write a basic XML parser that can support such an
>> xpath function, but it would also be good to index by the results of
>> that function-i.e. to have an index containing feature type values. As
>> each document could have any number of these instances, the number of
>> index tuples would differ from the number of heap tuples.
> 
> Why would you want multiple index entries for the same key (never mind
> whether they are in a single index tuple or multiple tuples) pointing to
> the same row?
> 

Muddled thinking :). I was trying to decide what to do if two identical
items appeared in the same record: should there be two index entries or
just one?

I had thought that this might help queries where you want to count the
number of instances of a particular element, but on reflection the index
entries aren't a useful way to achieve that.

> Actually, after thinking a little more, I suspect the idea you are
> really trying to describe here is index entries with finer-than-tuple
> granularity.  This is not silly, but it is sufficiently outside the
> normal domain of SQL that I think you are fighting an uphill battle.
> You'd be *much* better off creating a table that has one row per
> indexable entity, whatever that is.
> 

I accept that the offset (and its use) is not so straightforward. It would
have benefits for indexing documents, but as SQL isn't built around
document entities, it may be something best left alone.

In general, I was trying to formulate something where the functionality
didn't rely on extra tables, because I reckoned the choice of extra tables
would depend on the type of documents I was trying to index. HOWEVER,
I realise that this is not true: I can use a table like

path              string      document
/feature/type     Moat        3
/feature/type     Cowshed     3

and put a two-column index on path and string instead.
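
As a concrete sketch of that arrangement (the table and column names are
illustrative only):

  -- one row per (path, value) pair extracted from each document, filled
  -- in when the document is stored (e.g. by a trigger or by the client)
  CREATE TABLE document_paths (
      path    text,
      string  text,
      doc_id  integer REFERENCES documents(id)
  );
  CREATE INDEX document_paths_path_string_idx
      ON document_paths (path, string);

  -- "which documents have a feature of type 'Moat'?"
  SELECT DISTINCT doc_id
    FROM document_paths
   WHERE path = '/feature/type' AND string = 'Moat';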

>> I have tried the approach of decomposing documents into cdata, element
>> and attribute tables, and I can use joins to extract a list of feature
>> types etc. (and could use triggers to update this) but the idea of not
>> having to parse a document to enter it into the database
> 
> How do you expect that to happen, when you will have to parse it to get
> the index terms?
> 

My feeling was that I wanted the database side to be document-structure
agnostic. In other words, I could use the database as a plain document
store and develop the xpath function, which could be used with a
sequential scan at first. Then I started to ask whether I could create a
functional index using it. The point is that the xpath function is like a
'words' function which returns all the words from a string: it returns
more than one entry for a given row. An index based on it would need to
have more than one entry relating to a given heap tuple.
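
To make the sequential-scan starting point concrete (xpath_values here is
the hypothetical set-returning function I would have to write; it is not
an existing function):

  -- parse every document and keep those where any /feature/type value
  -- equals 'Moat'; this is a full scan, no index involved yet
  SELECT d.id
    FROM documents d
   WHERE 'Moat' IN (SELECT xpath_values(d.body, '/feature/type'));

A functional btree index normally stores one entry per heap row, which is
exactly why the question of multiple index entries per heap tuple arises.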

Thanks for your comments. Given that actions speak louder than words,
maybe I'll try to implement my fabled xpath operator first, then worry
about indexing and performance improvements :)

Regards

John


