[feature] cached index to speed up specific queries on extremely large data sets - Mailing list pgsql-hackers

From lkcl .
Subject [feature] cached index to speed up specific queries on extremely large data sets
Date
Msg-id CAPweEDzyR923NrEedEUKXS=EdZAkL=bEWB6AN+0sVkoK56o4Vg@mail.gmail.com
Responses Re: [feature] cached index to speed up specific queries on extremely large data sets  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
hi folks, please cc me directly on responses as i am subscribed in digest mode.

i've been asked to look at how to deal with around 7 billion records
(approximately 30 columns, roughly 1 kB of data per record), and this
might have to be in a single system (i will need to Have Words with
the client about that).  the data is read-only, and an arbitrary
number of additional tables may be created to "manage" it.  records
come in at a rate of around 25 million per day; the 7-billion-record
figure is based on the assumption of keeping one month's worth of
data around.
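
(for a rough sense of scale, my own back-of-envelope based on the
figures above:

    7e9 records      x ~1 kB/record  =  ~7 TB of raw table data
    25e6 records/day x ~1 kB/record  =  ~25 GB of new data per day

and that is before counting any index overhead.)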

analysis of this data needs to be done across the entire set: i.e. it
may not be subdivided into isolated tables (by day, for example).  i
am therefore rather concerned about efficiency, even just from the
perspective of using second normal form and not doing JOINs against
other tables.

so i had an idea.  there already exists the concept of indexes, and
there already exists the concept of "cached queries".  question: would
it be practical to *merge* those two concepts, such that specific
queries could be *updated* as new records are added, and the query
then answers more or less immediately when it is called again?  let us
assume that performance degradation on "update" (given that indexes
already exist and are required to be updated anyway) is acceptable.
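
to make the idea concrete, here is a rough sketch of the closest thing
i can do by hand today: a summary table kept up to date by a trigger,
so that one specific query is answered from pre-aggregated data rather
than a scan.  (table and column names below are invented purely for
illustration; i believe this is roughly what other people call an
incrementally-maintained materialized view.)

    -- hypothetical raw table standing in for the real ~30-column one
    create table records (
        device_id   bigint,
        recorded_at timestamptz,
        value       numeric
    );

    -- pre-computed answer for one specific query, maintained on insert
    create table records_by_device (
        device_id bigint  primary key,
        n         bigint  not null,
        total     numeric not null
    );

    create function records_maintain() returns trigger as $$
    begin
        -- update the existing row for this device, or create it;
        -- good enough for a sketch (ignores the insert race on a
        -- brand-new device_id)
        update records_by_device
           set n = n + 1, total = total + new.value
         where device_id = new.device_id;
        if not found then
            insert into records_by_device (device_id, n, total)
            values (new.device_id, 1, new.value);
        end if;
        return new;
    end;
    $$ language plpgsql;

    create trigger records_maintain_trg
        after insert on records
        for each row execute procedure records_maintain();

    -- the "cached query" is then simply:
    --   select device_id, total / n as avg_value from records_by_device;

the difference with what i am asking about is that the above has to be
written and maintained by hand for every query of interest, and the
planner knows nothing about it.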

the only practical way to do this at a higher level (without digging
into postgresql's c code) would be, in effect, to abandon the
advantages of the postgresql query optimisation engine and
*reimplement* it in a high-level language: subdividing the data into
smaller (more manageable) tables, using yet more tables to store the
intermediate results of a previous query, and then somehow stitching
together a new response from those intermediates plus the newly-arrived
records.  it would be a complete nightmare to both implement and
maintain.
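
just to spell out roughly what that would look like (again, names
invented for illustration): one table per day plus a per-day rollup of
the answer, with the application combining the finished days' rollups
with a fresh scan of only the current day's table:

    -- per-day rollup of the answer for one specific query
    create table rollup_daily (
        day   date    primary key,
        n     bigint  not null,
        total numeric not null
    );

    -- full answer = finished days' rollups + live scan of today's table
    select sum(n) as n, sum(total) as total
      from (
            select n, total
              from rollup_daily
             where day < current_date
            union all
            select count(*), sum(value)
              from records_today   -- whatever today's table is called
           ) s;

and even that only works for aggregates that decompose cleanly by day;
anything more involved, and the stitching logic grows without bound.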

second question, then, depending on whether the first is practical:
would anyone be willing (assuming it can be arranged) to engage in a
contract to implement the required functionality?

thanks,

l.


