Thread: Join with an array

Join with an array

From
Markus Schiltknecht
Date:
Hi,

I'm trying to speed up a query with a lookup table. This lookup table
gets very big and should still fit into memory. It does not change very
often. Given these facts I decided to use an array, as follows:

CREATE TABLE lookup_table (id INT PRIMARY KEY, items INT[] NOT NULL);

I know this is not considered good database design, but it saves a lot
of overhead for tuple visibility compared to a 1:1 table.

To fetch an item via the lookup_table I tried to use the following
query:

SELECT i.id, i.title FROM item iJOIN lookup_table lut ON i.id = ANY(lut.items)WHERE lut.id = $LOOKUP_ID;

Unfortunately that one seems to always use a sequential scan over items.
As the items array in the lookup table often has only 3 - 10 entries
(compared to about 1 mio rows in the item table) this is a very
expensive operation.

I tried to circumvent the problem with generate_series:

SELECT i.id, i.title FROM generate_series(0, $MAX) sJOIN lookup_table lut ON s = ANY(lut.items)JOIN item i ON s =
i.idWHERElut.id = $LOOKUP_ID;
 

That query uses the index to lookup the item, but as soon as $MAX gets
bigger than 10000 generate_series takes too long and too many
comparisons s = ANY(lut.items) need to be done.

I think it would be possible to write a function generate_series(INT[])
which returns all the elements of the array. The query would then look
something like:

SELECT i.id, i.titleFROM generate_series(SELECT lut.items FROM lookup_table lut WHERE
lut.id = $LOOKUP_ID) sJOIN item i ON s = i.id;

Do you see any problem in implementing such function? Does something
similar already existt?

Why does the first query use a seqscan instead of the index on items? Do
I miss anything? What problems do I face if I want to teach the planner
to use the index in the first query [1]?

Regards

Markus

[1]: generally in most cases like "JOIN .. ON x IN ANY($ARRAY)" where
$ARRAY is reasonably small.




Re: Join with an array

From
Martijn van Oosterhout
Date:
On Thu, Feb 23, 2006 at 12:36:35PM +0100, Markus Schiltknecht wrote:
> Hi,
>
> I'm trying to speed up a query with a lookup table. This lookup table
> gets very big and should still fit into memory. It does not change very
> often. Given these facts I decided to use an array, as follows:
>
> CREATE TABLE lookup_table (id INT PRIMARY KEY, items INT[] NOT NULL);

<snip>

> SELECT i.id, i.title FROM item i
>     JOIN lookup_table lut ON i.id = ANY(lut.items)
>     WHERE lut.id = $LOOKUP_ID;

At the very least you're going to have to tell us which version you are
running plus the output of EXPLAIN ANALYZE for that query. Anything
less and we're guessing. Have you got the appropriate indexes?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Join with an array

From
Oleg Bartunov
Date:
Markus,

have you seen contrib/intarray ?
    Oleg
On Thu, 23 Feb 2006, Markus Schiltknecht wrote:

> Hi,
>
> I'm trying to speed up a query with a lookup table. This lookup table
> gets very big and should still fit into memory. It does not change very
> often. Given these facts I decided to use an array, as follows:
>
> CREATE TABLE lookup_table (id INT PRIMARY KEY, items INT[] NOT NULL);
>
> I know this is not considered good database design, but it saves a lot
> of overhead for tuple visibility compared to a 1:1 table.
>
> To fetch an item via the lookup_table I tried to use the following
> query:
>
> SELECT i.id, i.title FROM item i
>     JOIN lookup_table lut ON i.id = ANY(lut.items)
>     WHERE lut.id = $LOOKUP_ID;
>
> Unfortunately that one seems to always use a sequential scan over items.
> As the items array in the lookup table often has only 3 - 10 entries
> (compared to about 1 mio rows in the item table) this is a very
> expensive operation.
>
> I tried to circumvent the problem with generate_series:
>
> SELECT i.id, i.title FROM generate_series(0, $MAX) s
>     JOIN lookup_table lut ON s = ANY(lut.items)
>     JOIN item i ON s = i.id
>     WHERE lut.id = $LOOKUP_ID;
>
> That query uses the index to lookup the item, but as soon as $MAX gets
> bigger than 10000 generate_series takes too long and too many
> comparisons s = ANY(lut.items) need to be done.
>
> I think it would be possible to write a function generate_series(INT[])
> which returns all the elements of the array. The query would then look
> something like:
>
> SELECT i.id, i.title
>     FROM generate_series(SELECT lut.items FROM lookup_table lut WHERE
> lut.id = $LOOKUP_ID) s
>     JOIN item i ON s = i.id;
>
> Do you see any problem in implementing such function? Does something
> similar already existt?
>
> Why does the first query use a seqscan instead of the index on items? Do
> I miss anything? What problems do I face if I want to teach the planner
> to use the index in the first query [1]?
>
> Regards
>
> Markus
>
> [1]: generally in most cases like "JOIN .. ON x IN ANY($ARRAY)" where
> $ARRAY is reasonably small.
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>       choose an index scan if your joining column's datatypes do not
>       match
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: Join with an array

From
Markus Schiltknecht
Date:
Hello Martijn,

On Thu, 2006-02-23 at 12:44 +0100, Martijn van Oosterhout wrote:
> > SELECT i.id, i.title FROM item i
> >     JOIN lookup_table lut ON i.id = ANY(lut.items)
> >     WHERE lut.id = $LOOKUP_ID;
> 
> At the very least you're going to have to tell us which version you are
> running plus the output of EXPLAIN ANALYZE for that query. Anything
> less and we're guessing. Have you got the appropriate indexes?

Sorry, I knew I'd forget something ;-)

I'm on PostgreSQL 8.1.3. The 'PRIMARY KEY' constraint automatically
creates an index on the lookup_table. The items table as well has an
index on item(id).

Because the other, similar queries use the indices I concluded that in
this first query PostgreSQL _never_ uses an index scan. It also should
not always use it, because the array might be large and a seqscan could
be cheaper in such cases. How should the planer know? In my case,
thought, I assume it would always be cheaper to use an index scan.

If this functionality already exists I was very sorry for the noise and
I beg you to tell me what knobs to fiddle with to make the planner use
the index.

Regards

Markus




Re: Join with an array

From
Markus Schiltknecht
Date:
Hello Oleg,

On Thu, 2006-02-23 at 15:02 +0300, Oleg Bartunov wrote:
> have you seen contrib/intarray ?

Yes, but in the documentation I did not find anything like
'generate_series' or thelike. Maybe I'm looking at the wrong place,
please give me a hint.

Regards

Markus




Re: Join with an array

From
Tom Lane
Date:
Markus Schiltknecht <markus@bluegap.ch> writes:
> I'm trying to speed up a query with a lookup table. This lookup table
> gets very big and should still fit into memory. It does not change very
> often. Given these facts I decided to use an array, as follows:

> CREATE TABLE lookup_table (id INT PRIMARY KEY, items INT[] NOT NULL);

> I know this is not considered good database design, but it saves a lot
> of overhead for tuple visibility compared to a 1:1 table.

> To fetch an item via the lookup_table I tried to use the following
> query:

> SELECT i.id, i.title FROM item i
>     JOIN lookup_table lut ON i.id = ANY(lut.items)
>     WHERE lut.id = $LOOKUP_ID;

> Unfortunately that one seems to always use a sequential scan over items.

FWIW, "indexcol = ANY(array)" searches are indexable in CVS tip.
There's no hope in any existing release though :-(

> I tried to circumvent the problem with generate_series:

> SELECT i.id, i.title FROM generate_series(0, $MAX) s
>     JOIN lookup_table lut ON s = ANY(lut.items)
>     JOIN item i ON s = i.id
>     WHERE lut.id = $LOOKUP_ID;

Seems like the hard way --- why aren't you searching over array subscripts?

SELECT i.id, i.title FROM generate_series(1, $MAX) sJOIN lookup_table lut ON s <= array_upper(lut.items)JOIN item i ON
i.id= lut.items[s]WHERE lut.id = $LOOKUP_ID;
 

$MAX need only be as large as the widest array in lookup_table.
        regards, tom lane