Thread: Adding a suffix array index

Adding a suffix array index

From
Troels Arvin
Date:
Hello,

I'm working on a thesis project where I explore the addition of a
specialized, bioinformatics-related data type to a RDBMS. My choice of
RDBMS is PostgreSQL, of course, and I've started by adding a "dnaseq" (DNA
sequence) data type, using PostgreSQL's APIs for type additions.

The idea is to try to make it practical and even "attractive" to work with
DNA sequences in an RDBMS. My starting goal is to make it viable to work
with sequences in the 50-500 million base range. Some may think that
RDBMSes and long chunks of data don't match well. My opinion is that the
increasing power of computers and RDBMS software should at some point make
it possible to work with DNA sequences in a "normal" data management
setting like a RDBMS, instead of solely using stand-alone tools and
stand-alone data files. Anyways, it's an open question if my hypothesis is
right.

The basic parts of the type are pretty much done. Those interested may
have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
code organization needs some clean-up). The basic type implementation
should be improved by adding more string functions and by implementing a
set of specialized selectivity functions. Part of my current code concerns
packing DNA characters: As the alphabet of DNA strings is very small (four
characters), it seems like a straigt-forward optimization to store each
character in two bits. A warning: This is my first C project, so please
don't laugh too much (publicly) if you find strange constructs in my code...

Although B-trees and hash indexes may be used with my dnaseq type, they
aren't very interesting for long DNA sequences. I would like to add
support for adding a suffix array or suffix tree based index to dnaseq
columns where the user expects long sequences.

My review of the latest academic work on indexing of long sequences
indicates that suffix arrays are probably the way to go: Recent advances
in suffix array algorithms make them more attractive than suffix trees
(because the take up less space). And since the DNA data I'm currently
trying to support aren't separated in words, a string B-tree doesn't seem
to be relevant.

My first and most immediate goal is to support efficient answering of a
question like "which rows contain the sequence TTGACCACTTG in column foo?".

I have to immediate problems:
1. The algoriths I've found don't indicate a good way to  update index arrays. So I'm thinking of implementing  a sort
ofindex-"cluster" which has one sub-index per  indexed value. That way, I don't have to worry about  updating a large
indexcovering all the indexes values  in the column. Is it conceivable to create such an  index "cluster"?
 
2. Does someone know of interesting documentation (perhaps  in the form of interesting code comments) which I should
read,as a basis for creating a non-standard index type  in PostgreSQL?
 

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




Re: Adding a suffix array index

From
Oleg Bartunov
Date:
Hi,

your project looks very attractive. In principle, suffix array should be
implemented using GiST framework. String Btree should be very useful
for your problem. My student is working on string btree library, but we
have no plan to intergrate it into postgresql.
    Oleg


On Fri, 19 Nov 2004, Troels Arvin wrote:

> Hello,
>
> I'm working on a thesis project where I explore the addition of a
> specialized, bioinformatics-related data type to a RDBMS. My choice of
> RDBMS is PostgreSQL, of course, and I've started by adding a "dnaseq" (DNA
> sequence) data type, using PostgreSQL's APIs for type additions.
>
> The idea is to try to make it practical and even "attractive" to work with
> DNA sequences in an RDBMS. My starting goal is to make it viable to work
> with sequences in the 50-500 million base range. Some may think that
> RDBMSes and long chunks of data don't match well. My opinion is that the
> increasing power of computers and RDBMS software should at some point make
> it possible to work with DNA sequences in a "normal" data management
> setting like a RDBMS, instead of solely using stand-alone tools and
> stand-alone data files. Anyways, it's an open question if my hypothesis is
> right.
>
> The basic parts of the type are pretty much done. Those interested may
> have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
> code organization needs some clean-up). The basic type implementation
> should be improved by adding more string functions and by implementing a
> set of specialized selectivity functions. Part of my current code concerns
> packing DNA characters: As the alphabet of DNA strings is very small (four
> characters), it seems like a straigt-forward optimization to store each
> character in two bits. A warning: This is my first C project, so please
> don't laugh too much (publicly) if you find strange constructs in my code...
>
> Although B-trees and hash indexes may be used with my dnaseq type, they
> aren't very interesting for long DNA sequences. I would like to add
> support for adding a suffix array or suffix tree based index to dnaseq
> columns where the user expects long sequences.
>
> My review of the latest academic work on indexing of long sequences
> indicates that suffix arrays are probably the way to go: Recent advances
> in suffix array algorithms make them more attractive than suffix trees
> (because the take up less space). And since the DNA data I'm currently
> trying to support aren't separated in words, a string B-tree doesn't seem
> to be relevant.
>
> My first and most immediate goal is to support efficient answering of a
> question like "which rows contain the sequence TTGACCACTTG in column foo?".
>
> I have to immediate problems:
> 1. The algoriths I've found don't indicate a good way to
>   update index arrays. So I'm thinking of implementing
>   a sort of index-"cluster" which has one sub-index per
>   indexed value. That way, I don't have to worry about
>   updating a large index covering all the indexes values
>   in the column. Is it conceivable to create such an
>   index "cluster"?
> 2. Does someone know of interesting documentation (perhaps
>   in the form of interesting code comments) which I should
>   read, as a basis for creating a non-standard index type
>   in PostgreSQL?
>
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Adding a suffix array index

From
Hannu Krosing
Date:
On R, 2004-11-19 at 12:42, Troels Arvin wrote:
> The basic parts of the type are pretty much done. Those interested may
> have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
> code organization needs some clean-up). The basic type implementation
> should be improved by adding more string functions and by implementing a
> set of specialized selectivity functions. 

I cant answer your immediate questions, just rant on general issues ;)

> Part of my current code concerns packing DNA characters: As the alphabet 
> of DNA strings is very small (four characters), it seems like a 
> straigt-forward optimization to store each character in two bits. 

My advice would be to get it to work first, oprimize later.

Thus I guess you could start by storing DNA sequences as character
strings.

> A warning: This is my first C project, so please
> don't laugh too much (publicly) if you find strange constructs in my code...

Then even more so - get the novel and generally interesting part
(indexing huge arrays) right first, and optimise for space (compressing
4 chars into 1) later.

You could do this 4->1 compression when storing the string into tuple,
but I strongly recommend doing actual work (indexing/searching) at a
level that C directly supports (i.e. bytes/characters)

This enables you to get the basics right first without distraction from
all bit-shifting inside bytes. A good tuned algorithm will almost
certainly offset the 4 time space disadvantage.

...

> My first and most immediate goal is to support efficient answering of a
> question like "which rows contain the sequence TTGACCACTTG in column foo?".

If you store your sequences as strings, you may try to use trigrams (or
modify them to 4,5,6 or 7-grams ;) to get some feel how that works.

trigram module is in contrib/pg_trgm.

------------
Hannu



Re: Adding a suffix array index

From
Adam Witney
Date:
Hi Troels,

This is not related to the database aspects of your question... But there
are more than 4 possible letters in DNA sequences, 16 in fact. Depending on
the accuracy of the DNA sequences you are storing, you may come across
ambiguity DNA bases, so your type will have to take these into account. The
possible letters (according to IUPAC rules) are listed at the bottom of this
page

http://gnomix.ansci.umn.edu/FASTA.html

Cheers

Adam


> Hello,
> 
> I'm working on a thesis project where I explore the addition of a
> specialized, bioinformatics-related data type to a RDBMS. My choice of
> RDBMS is PostgreSQL, of course, and I've started by adding a "dnaseq" (DNA
> sequence) data type, using PostgreSQL's APIs for type additions.
> 
> The idea is to try to make it practical and even "attractive" to work with
> DNA sequences in an RDBMS. My starting goal is to make it viable to work
> with sequences in the 50-500 million base range. Some may think that
> RDBMSes and long chunks of data don't match well. My opinion is that the
> increasing power of computers and RDBMS software should at some point make
> it possible to work with DNA sequences in a "normal" data management
> setting like a RDBMS, instead of solely using stand-alone tools and
> stand-alone data files. Anyways, it's an open question if my hypothesis is
> right.
> 
> The basic parts of the type are pretty much done. Those interested may
> have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
> code organization needs some clean-up). The basic type implementation
> should be improved by adding more string functions and by implementing a
> set of specialized selectivity functions. Part of my current code concerns
> packing DNA characters: As the alphabet of DNA strings is very small (four
> characters), it seems like a straigt-forward optimization to store each
> character in two bits. A warning: This is my first C project, so please
> don't laugh too much (publicly) if you find strange constructs in my code...
> 
> Although B-trees and hash indexes may be used with my dnaseq type, they
> aren't very interesting for long DNA sequences. I would like to add
> support for adding a suffix array or suffix tree based index to dnaseq
> columns where the user expects long sequences.
> 
> My review of the latest academic work on indexing of long sequences
> indicates that suffix arrays are probably the way to go: Recent advances
> in suffix array algorithms make them more attractive than suffix trees
> (because the take up less space). And since the DNA data I'm currently
> trying to support aren't separated in words, a string B-tree doesn't seem
> to be relevant.
> 
> My first and most immediate goal is to support efficient answering of a
> question like "which rows contain the sequence TTGACCACTTG in column foo?".
> 
> I have to immediate problems:
> 1. The algoriths I've found don't indicate a good way to
>  update index arrays. So I'm thinking of implementing
>  a sort of index-"cluster" which has one sub-index per
>  indexed value. That way, I don't have to worry about
>  updating a large index covering all the indexes values
>  in the column. Is it conceivable to create such an
>  index "cluster"?
> 2. Does someone know of interesting documentation (perhaps
>  in the form of interesting code comments) which I should
>  read, as a basis for creating a non-standard index type
>  in PostgreSQL?


-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



Re: Adding a suffix array index

From
Troels Arvin
Date:
Hello Oleg,

On Fri, 2004-11-19 at 15:35 +0300, Oleg Bartunov wrote:

> your project looks very attractive.
Thanks.

> In principle, suffix array should be implemented using GiST framework.

But in a previous conversation between the two of us, you wrote that the
GiST wasn't suitable for this problem? I would very much like to use GiST
instead of having to create a whole new index infrastructure (if that's
the alternative).

> String Btree should be very useful for your problem.

I thought that string b-trees were only usable for texts with natural
stops (word boundaries), as opposed to very long, contiguous sequences. Am
I mistaken?

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




Re: Adding a suffix array index

From
Troels Arvin
Date:
On Fri, 19 Nov 2004 14:38:20 +0200, Hannu Krosing wrote:

>> Part of my current code concerns packing DNA characters: As the alphabet 
>> of DNA strings is very small (four characters), it seems like a 
>> straigt-forward optimization to store each character in two bits. 
> 
> My advice would be to get it to work first, oprimize later.

Valid point. However, I needed something rather basic to work on, to get
to know C and to get to know PostgreSQL in a user defined type context.
But if packing proves to be a problem when implementing the interesting
stuff, then thanks&yes: Packing should be an afterthought.

>> My first and most immediate goal is to support efficient answering of a
>> question like "which rows contain the sequence TTGACCACTTG in column foo?".
> 
> If you store your sequences as strings, you may try to use trigrams (or
> modify them to 4,5,6 or 7-grams ;) to get some feel how that works.
> 
> trigram module is in contrib/pg_trgm.

(/me Printing readme.) Thanks.

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




Re: Adding a suffix array index

From
Oleg Bartunov
Date:
On Fri, 19 Nov 2004, Troels Arvin wrote:

> Hello Oleg,
>
> On Fri, 2004-11-19 at 15:35 +0300, Oleg Bartunov wrote:
>
>> your project looks very attractive.
>
> Thanks.
>
>> In principle, suffix array should be implemented using GiST framework.
>
> But in a previous conversation between the two of us, you wrote that the
> GiST wasn't suitable for this problem? I would very much like to use GiST

it's still true :( some problem with my english grammar.


> instead of having to create a whole new index infrastructure (if that's
> the alternative).
>
>> String Btree should be very useful for your problem.
>
> I thought that string b-trees were only usable for texts with natural
> stops (word boundaries), as opposed to very long, contiguous sequences. Am
> I mistaken?
>

I meant SB-Tree by Paolo Ferragina 
http://citeseer.ist.psu.edu/ferragina98string.html

>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Re: Adding a suffix array index

From
Tom Lane
Date:
Troels Arvin <troels@arvin.dk> writes:
> 2. Does someone know of interesting documentation (perhaps
>    in the form of interesting code comments) which I should
>    read, as a basis for creating a non-standard index type
>    in PostgreSQL?

There's not a whole lot :-( and you should definitely expect to have to
read code, not just comments.  You have of course already read
"Interfacing Extensions To Indexes" and "Index Cost Estimation
Functions" in the SGML docs?  After that I'd suggest looking at
src/backend/access/nbtree/README and src/backend/access/hash/README,
and then diving into the code of one or more of the existing index
access methods.  Offhand I think that hash and rtree might be the best
ones to read.  btree is the most "industrial strength" of the four
because it's been worked over and optimized much more carefully than the
others, but by the same token its code is vastly more bulky than the
others; I think you'd have a harder time seeing the forest instead of
the trees if you read btree.  gist and rtree are nearly alike so you
probably don't want to read both of those.
        regards, tom lane


Re: Adding a suffix array index

From
Simon Riggs
Date:
On Fri, 2004-11-19 at 10:42, Troels Arvin wrote:
> Hello,
> 
> I'm working on a thesis project where I explore the addition of a
> specialized, bioinformatics-related data type to a RDBMS. My choice of
> RDBMS is PostgreSQL, of course, and I've started by adding a "dnaseq" (DNA
> sequence) data type, using PostgreSQL's APIs for type additions.
> 
> The idea is to try to make it practical and even "attractive" to work with
> DNA sequences in an RDBMS. My starting goal is to make it viable to work
> with sequences in the 50-500 million base range. Some may think that
> RDBMSes and long chunks of data don't match well. My opinion is that the
> increasing power of computers and RDBMS software should at some point make
> it possible to work with DNA sequences in a "normal" data management
> setting like a RDBMS, instead of solely using stand-alone tools and
> stand-alone data files. Anyways, it's an open question if my hypothesis is
> right.
> 

Presumably you know about these?

http://www.ncbi.nih.gov/BLAST/
http://www.ciri.upc.es/cela_pblade/BLAST.htm
http://www.netezza.com/products/bio.cfm

I think you're right, but you'd need to have more than one application
of the data for it to be a convincing argument. Without parallelism,
your best efforts will be to equal the speed of the single-use data
structures used in BLAST.

-- 
Best Regards, Simon Riggs



Re: Adding a suffix array index

From
Troels Arvin
Date:
On Fri, 19 Nov 2004 10:35:20 -0500, Tom Lane wrote:

>> 2. Does someone know of interesting documentation (perhaps
>>    in the form of interesting code comments) which I should
>>    read, as a basis for creating a non-standard index type
>>    in PostgreSQL?
> 
> There's not a whole lot :-( and you should definitely expect to have to
> read code, not just comments.

I have read some code, and have gained some understanding. I think/hope.
However:

For the suffix array to work, it needs to store positions from the base
table (among other pieces of information): A suffix array stores the
complete set of suffixes from the indexed string, in sorted order. Storage
is in the form of pointers to the indexed string.

What kind of (logical) block identifier should I point to in my index? Can
I be guaranteed that the block will not move, leaving dangling pointers in
the index?

Am I right that the answer is related to BlockIdData? If I store
BlockIds in an index, do I then have to worry about physical blocks on
the disk which might somehow move?

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




Re: Adding a suffix array index

From
Tom Lane
Date:
Troels Arvin <troels@arvin.dk> writes:
> What kind of (logical) block identifier should I point to in my index?

CTID (block # + line #) is the only valid pointer from an index to a
table.  It doesn't change over the life of an index entry.  I think
though that you'd be making a serious mistake by not duplicating the
suffixes into the index (rather than expecting to retrieve them from the
table every time, as you seem to be contemplating).  You need to be able
to scan the index and identify rows matching a query without making lots
of probes into the table.
        regards, tom lane


Re: Adding a suffix array index

From
Troels Arvin
Date:
On Sun, 28 Nov 2004 16:52:47 -0500, Tom Lane wrote:

> CTID (block # + line #) is the only valid pointer from an index to a
> table.

Thanks.

> I think
> though that you'd be making a serious mistake by not duplicating the
> suffixes into the index (rather than expecting to retrieve them from the
> table every time, as you seem to be contemplating).

Yes, I've thought about this, and I may end up doing that.

> You need to be able
> to scan the index and identify rows matching a query without making lots
> of probes into the table.

But is it cheaper, IO-wise to "jump" around in an index than to go back
and forth between index and tuple blocks?

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




Re: Adding a suffix array index

From
Tom Lane
Date:
Troels Arvin <troels@arvin.dk> writes:
> On Sun, 28 Nov 2004 16:52:47 -0500, Tom Lane wrote:
>> You need to be able
>> to scan the index and identify rows matching a query without making lots
>> of probes into the table.

> But is it cheaper, IO-wise to "jump" around in an index than to go back
> and forth between index and tuple blocks?

Perhaps not --- but why would you be "jumping around"?  Wouldn't the
needed info appear in consecutive locations in the index?

Even if that's not the case, the index should be much denser than the
table because it's only storing the keys and not the rest of the
columns.  So I'd expect less net I/O even if the access pattern is just
as random.
        regards, tom lane


Re: Adding a suffix array index

From
Troels Arvin
Date:
On Sun, 28 Nov 2004 17:53:38 -0500, Tom Lane wrote:
>> But is it cheaper, IO-wise to "jump" around in an index than to go back
>> and forth between index and tuple blocks?
> 
> Perhaps not --- but why would you be "jumping around"?  Wouldn't the
> needed info appear in consecutive locations in the index?

Searching for a match, using a suffix array entails a binary search in the
suffix array (can be optimized through the use of a longest-common-prefix
helper-array). So some amount of "jumping" is needed.

Anyhow: I've given up trying to create the suffix array as a "normal"
index type. As I'm running out of time, I'm inclined to hacks. I'm
considering storing the index of a sequence in a large object which I then
store a reference to in the data item: The large object interface seems
like something I could use.

Or I might store dnaseq values as- some meta-information, perhaps (like a hash value)- a reference to a large object
containingthe sequence- a reference to a large object containing the suffix array- a reference to a large object
containinga helper-array  (longest common prefix-information)
 

One problem with this approach is that the related, large objects will not
automatically be removed when a value is removed from a table (but that
could probably be somewhat fixes using a trigger). Beyond being somewhat
ugly: Is it possible?

How much of[1] is still the case today? Are today's large objects somewhat
corresponding to the article's description of "v-segments"? The article mentions that POSTGRES supported a CREATE LARGE
TYPE
construct. Am I right in assuming that today's corresponding construct is
as exemplified in the manual: CREATE TYPE bigobj (INPUT = lo_filein, OUTPUT = lo_fileout,...)

Reference 1:
Stonebraker & Olson: Large Object Support in POSTGRES (1993)
http://epoch.cs.berkeley.edu:8000/postgres/papers/S2K-93-30.pdf

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




Re: Adding a suffix array index

From
Tom Lane
Date:
Troels Arvin <troels@arvin.dk> writes:
> How much of[1] is still the case today?
> Reference 1:
> Stonebraker & Olson: Large Object Support in POSTGRES (1993)
> http://epoch.cs.berkeley.edu:8000/postgres/papers/S2K-93-30.pdf

Probably almost none of it ... the only thing I know about the
Berkeley-era large object features is that at least two-thirds of that
code got ripped out later.
        regards, tom lane