Adding a suffix array index - Mailing list pgsql-hackers

From Troels Arvin
Subject Adding a suffix array index
Date
Msg-id pan.2004.11.19.10.42.37.901629@arvin.dk
Whole thread Raw
Responses Re: Adding a suffix array index  (Oleg Bartunov <oleg@sai.msu.su>)
Re: Adding a suffix array index  (Hannu Krosing <hannu@tm.ee>)
Re: Adding a suffix array index  (Adam Witney <awitney@sghms.ac.uk>)
Re: Adding a suffix array index  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Adding a suffix array index  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-hackers
Hello,

I'm working on a thesis project where I explore the addition of a
specialized, bioinformatics-related data type to a RDBMS. My choice of
RDBMS is PostgreSQL, of course, and I've started by adding a "dnaseq" (DNA
sequence) data type, using PostgreSQL's APIs for type additions.

The idea is to try to make it practical and even "attractive" to work with
DNA sequences in an RDBMS. My starting goal is to make it viable to work
with sequences in the 50-500 million base range. Some may think that
RDBMSes and long chunks of data don't match well. My opinion is that the
increasing power of computers and RDBMS software should at some point make
it possible to work with DNA sequences in a "normal" data management
setting like a RDBMS, instead of solely using stand-alone tools and
stand-alone data files. Anyways, it's an open question if my hypothesis is
right.

The basic parts of the type are pretty much done. Those interested may
have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
code organization needs some clean-up). The basic type implementation
should be improved by adding more string functions and by implementing a
set of specialized selectivity functions. Part of my current code concerns
packing DNA characters: As the alphabet of DNA strings is very small (four
characters), it seems like a straigt-forward optimization to store each
character in two bits. A warning: This is my first C project, so please
don't laugh too much (publicly) if you find strange constructs in my code...

Although B-trees and hash indexes may be used with my dnaseq type, they
aren't very interesting for long DNA sequences. I would like to add
support for adding a suffix array or suffix tree based index to dnaseq
columns where the user expects long sequences.

My review of the latest academic work on indexing of long sequences
indicates that suffix arrays are probably the way to go: Recent advances
in suffix array algorithms make them more attractive than suffix trees
(because the take up less space). And since the DNA data I'm currently
trying to support aren't separated in words, a string B-tree doesn't seem
to be relevant.

My first and most immediate goal is to support efficient answering of a
question like "which rows contain the sequence TTGACCACTTG in column foo?".

I have to immediate problems:
1. The algoriths I've found don't indicate a good way to  update index arrays. So I'm thinking of implementing  a sort
ofindex-"cluster" which has one sub-index per  indexed value. That way, I don't have to worry about  updating a large
indexcovering all the indexes values  in the column. Is it conceivable to create such an  index "cluster"?
 
2. Does someone know of interesting documentation (perhaps  in the form of interesting code comments) which I should
read,as a basis for creating a non-standard index type  in PostgreSQL?
 

-- 
Greetings from Troels Arvin, Copenhagen, Denmark




pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Test database for new installs?
Next
From: "Andrew Dunstan"
Date:
Subject: Re: [Plperlng-devel] Re: Concern about new PL/Perl