Re: [SQL] "SELECT IN" Still Broken in 7.4b - Mailing list pgsql-hackers

Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Within the scope of the new hashed IN stuff I believe so in at least some
> cases.  I have a few million row table of integers where searching for
> values IN (~10000 values) takes longer than creating the temp table,
> copying into it and doing the in subquery.

I did some profiling and soon realized that the main problem is the
executor's trick for not returning the same row multiple times in a
multiple-indexscan plan node.  The point is that givenWHERE a = 1 OR b = 1
you could create a plan that first indexscans on a, then indexscans on
b --- but you mustn't return any tuples in the second scan that you
already returned in the first.  IndexNext solves this by evaluating the
prior-scan index conditions to see if they are true.  Which is okay if
there aren't too many of them.  But when you have an N-element IN list
this means you are going to do O(N^2) index expression evaluations.
In the 10000-element IN-list test case, ExecQual gets called almost
50 million times :-(

I'm toying with the notion of ripping out that logic and instead
building an in-memory hashtable of already-returned TIDs.  This could
theoretically run out of memory if the multiple indexscan returns enough
tuples, but I think in practice that wouldn't happen because the planner
wouldn't choose an indexscan when very large numbers of tuples are being
selected.

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Ian Barwick
Date:
Subject: Re: [GENERAL] [pgsql-advocacy] Need concrete "Why Postgres not MySQL" bullet list
Next
From: Manfred Koizar
Date:
Subject: Decent VACUUM (was: Buglist)