Doing range-predicates right (correction) - Mailing list pgsql-sql

From Mischa Sandberg
Subject Doing range-predicates right (correction)
Date
Msg-id 1115571038.427e435ee7f3c@webmail.telus.net
Whole thread Raw
List pgsql-sql
A few weeks ago I posted a way to do efficient range predicate joins,
given only B-tree indexes. I've since gotten back home and looked at the
code I last used. My apologies for an offhand hasty posting.

The following is the solution I worked out when I used this method on a
large data conversion. It has the advantage (over what I posted) that
the indexable ranges fit more tightly around the actual ranges --- a
pseudo-range will never be more than twice as wide as the original range.

For those who didn't see the original, the problem was, how do you get
efficient lookups against large tables of the form:
  TABLE Ranges(rangeid <sometype>, lower INT, upper INT)

... or any range-bound type that can be mapped to int (e.g. by mapping
timestamps to epoch-seconds).

... with queries like
 select rangeid, sampleid from   Samples  join Ranges on Samples.val between Ranges.lower and Ranges.upper

The problem is that a Btree index on Ranges.lower (or .upper) is
ineffective; in most cases, the query planner will rightly ignore it.

One (my) solution is to map the given ranges into slightly larger
ranges, that have a small number of different widths, and all start on a
more limited set of boundaries.

One way to do this is to map all ranges to ranges that have a width that
is a power of two, and that begin on a whole multiple of that power of two.

Unfortunately, if you just map a range (width) to the smallest power of
2 greater than (upper-lower), then lower and upper may be in two
different ranges of that width. For example, if your original range is
[1003..1030] (i.e. width 28), the power of 2 that covers this range is
32, but 1003 is in the range [992..1023] and 1030 is the one above it. A
sloppy fix for this is to take the next higher power of two as the
pseudo-width.

The original solution created a new version of Ranges that had as many
rows as the original Ranges table. The following solution creates a new
version with no more than twice as many rows, but with.
 -- Function to return the lowest power of two greater than -- a given inclusive interval:
 create or replace function width2(int, int) returns int   immutable language 'plpgsql' as ' begin   return pow(2,
ceil(log(2,$2-$1+1)))::int; end';
 
 -- Construct an augmented Ranges table:
 select rangeid, lower, upper, width,            start-mod(start, width) as start into    PseudoRange from (  select
rangeid,lower, upper, start,                width2(start, finish) as width         from (           select rangeid,
lower,upper,                   lower as start,                   upper-mod(upper, width2(lower,upper))-1 as finish
     from Range               union all           select rangeid, lower, upper,                   upper-mod(upper,
width2(lower,upper))as start,                  upper as finish            from Range       ) as X where start <= finish
 ) as Y;
 
 create unique index PR_start_width on PseudoRange(start,width);

The query using PseudoRange also uses a table PseudoWidth.
If (lower) and (upper) are ints, this table can at most have 31 rows
with values (1,2,4,8,...). This can also be calculated by:
 select distinct width into PseudoWidth from PseudoRange;

... which will have fewer values, for proportionately faster lookups.

The lookup query becomes:
 select rangeid, sampleid from   Samples, PseudoRange as PR, PseudoWidth as PW where  PR.width = PW.width  and
PR.start= Samples.val - (Samples.val % PW.width)  and   Samples.val between PR.lower and PR.upper
 

The usual query plan will do at most (!) 31 index lookups per Sample
row. If this is unacceptable, a power greater than 2 can be used.
-- 
"Dreams come true, not free." -- S.Sondheim



pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: encoding
Next
From: Aarni Ruuhimäki
Date:
Subject: Re: encoding