Re: Adding a suffix array index - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Adding a suffix array index
Date
Msg-id 1100867900.3919.15.camel@fuji.krosing.net
Whole thread Raw
In response to Adding a suffix array index  (Troels Arvin <troels@arvin.dk>)
List pgsql-hackers
On R, 2004-11-19 at 12:42, Troels Arvin wrote:
> The basic parts of the type are pretty much done. Those interested may
> have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
> code organization needs some clean-up). The basic type implementation
> should be improved by adding more string functions and by implementing a
> set of specialized selectivity functions. 

I cant answer your immediate questions, just rant on general issues ;)

> Part of my current code concerns packing DNA characters: As the alphabet 
> of DNA strings is very small (four characters), it seems like a 
> straigt-forward optimization to store each character in two bits. 

My advice would be to get it to work first, oprimize later.

Thus I guess you could start by storing DNA sequences as character
strings.

> A warning: This is my first C project, so please
> don't laugh too much (publicly) if you find strange constructs in my code...

Then even more so - get the novel and generally interesting part
(indexing huge arrays) right first, and optimise for space (compressing
4 chars into 1) later.

You could do this 4->1 compression when storing the string into tuple,
but I strongly recommend doing actual work (indexing/searching) at a
level that C directly supports (i.e. bytes/characters)

This enables you to get the basics right first without distraction from
all bit-shifting inside bytes. A good tuned algorithm will almost
certainly offset the 4 time space disadvantage.

...

> My first and most immediate goal is to support efficient answering of a
> question like "which rows contain the sequence TTGACCACTTG in column foo?".

If you store your sequences as strings, you may try to use trigrams (or
modify them to 4,5,6 or 7-grams ;) to get some feel how that works.

trigram module is in contrib/pg_trgm.

------------
Hannu



pgsql-hackers by date:

Previous
From: Oleg Bartunov
Date:
Subject: Re: Adding a suffix array index
Next
From: Adam Witney
Date:
Subject: Re: Adding a suffix array index