Thread: sequence query performance issues

sequence query performance issues

From
"Peter Koczan"
Date:
Hello,

I have a weird performance issue with a query I'm testing. Basically,
I'm trying to port a function that generates user uids, and since
postgres offers a sequence generator function, I figure I'd take
advantage of that. Basically, I generate our uid range, filter out
those which are in use, and randomly pick however many I need.
However, when I run it it takes forever (>10 minutes and I get nothing
so I cancelled the query) and cpu usage on the server is maxed out.

Here's my query (I'll post the explain output later so as not to
obscure my question):
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select uid from people) order by random() limit 1;

I thought that nulls were a problem, so I tried:
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select coalesce(uid,0) from people) order by random()
limit 1;
And that finished in less than a second.

I then tried:
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select coalesce(uid,0) from people where uid is not
null) order by random() limit 1;
And we're back to taking forever.

So I have 2 questions:

- Is there a better query for this purpose? Mine works when coalesced,
but it seems a little brute-force and the random() sorting, while
kinda nice, is slow.

- Is this in any way expected? I know that nulls sometimes cause
problems, but why is it taking forever even when trying to filter
those out?

Thanks.

Peter

The gory details:
- There is an btree index on people(uid), and there are ~6300 rows, of
which ~1300 have null uids.

- EXPLAIN output (I couldn't get EXPLAIN ANALYZE output from the first
two queries since they took too long):
=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select uid from people) order by random() limit 1;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=40025.57..40025.60 rows=10 width=4)
   ->  Sort  (cost=40025.57..40026.82 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a
(cost=693.16..40003.16 rows=500 width=4)
               Filter: (NOT (subplan))
               SubPlan
                 ->  Materialize  (cost=693.16..756.03 rows=6287 width=2)
                       ->  Seq Scan on people  (cost=0.00..686.87
rows=6287 width=2)
(8 rows)

=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select uid from people where uid is not null)
order by random() limit 1;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=31486.71..31486.73 rows=10 width=4)
   ->  Sort  (cost=31486.71..31487.96 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a
(cost=691.79..31464.29 rows=500 width=4)
               Filter: (NOT (subplan))
               SubPlan
                 ->  Materialize  (cost=691.79..741.00 rows=4921 width=2)
                       ->  Seq Scan on people  (cost=0.00..686.87
rows=4921 width=2)
                             Filter: (uid IS NOT NULL)
(9 rows)

=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select coalesce(uid, 0) from people) order by
random() limit 1;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Limit  (cost=756.97..756.99 rows=10 width=4)
   ->  Sort  (cost=756.97..758.22 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=718.30..734.55
rows=500 width=4)
               Filter: (NOT (hashed subplan))
               SubPlan
                 ->  Seq Scan on people  (cost=0.00..702.59 rows=6287 width=2)
(7 rows)

=> explain analyze select a.uid from generate_series(1000, 32767) as
a(uid) where a.uid not in (select coalesce(uid, 0) from people) order
by random() limit 1;
                                                               QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=756.97..756.99 rows=10 width=4) (actual
time=370.444..370.554 rows=10 loops=1)
   ->  Sort  (cost=756.97..758.22 rows=500 width=4) (actual
time=370.434..370.472 rows=10 loops=1)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=718.30..734.55
rows=500 width=4) (actual time=70.018..199.540 rows=26808 loops=1)
               Filter: (NOT (hashed subplan))
               SubPlan
                 ->  Seq Scan on people  (cost=0.00..702.59 rows=6287
width=2) (actual time=0.023..29.167 rows=6294 loops=1)
 Total runtime: 372.224 ms
(8 rows)

Re: sequence query performance issues

From
Richard Huxton
Date:
Peter Koczan wrote:
> Hello,
>
> I have a weird performance issue with a query I'm testing. Basically,
> I'm trying to port a function that generates user uids, and since
> postgres offers a sequence generator function, I figure I'd take
> advantage of that. Basically, I generate our uid range, filter out
> those which are in use, and randomly pick however many I need.
> However, when I run it it takes forever (>10 minutes and I get nothing
> so I cancelled the query) and cpu usage on the server is maxed out.

I'd suspect either an unconstrained join or looping through seq-scans.

> Here's my query (I'll post the explain output later so as not to
> obscure my question):
> => select a.uid from generate_series(1000, 32767) as a(uid) where
> a.uid not in (select uid from people) order by random() limit 1;

I let this run to it's conclusion and it's the materialize. If you see,
it's materializing the result-set once for every value it tests against
(loops=31768)

                                                                   QUERY
PLAN

----------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=62722.66..62722.67 rows=1 width=4) (actual
time=189963.485..189963.485 rows=0 loops=1)
    ->  Sort  (cost=62722.66..62723.91 rows=500 width=4) (actual
time=189961.063..189961.063 rows=0 loops=1)
          Sort Key: random()
          ->  Function Scan on generate_series a  (cost=184.00..62700.25
rows=500 width=4) (actual time=189960.797..189960.797 rows=0 loops=1)
                Filter: (NOT (subplan))
                SubPlan
                  ->  Materialize  (cost=184.00..284.00 rows=10000
width=2) (actual time=0.000..2.406 rows=9372 loops=31768)
                        ->  Seq Scan on people  (cost=0.00..174.00
rows=10000 width=2) (actual time=0.055..7.181 rows=10000 loops=1)
  Total runtime: 189967.150 ms

Hmm - why is it doing that? It's clearly confused about something.

I suspect the root of the problem is that it doesn't know what
generate_series() will return. To the planner it's just another
set-returning function.

This means it's getting (i) the # of rows wrong (rows=500) and also
doesn't know (ii) there will be no nulls or (iii) what the range of
values returned will be.

Easy enough to test:

CREATE TEMP TABLE all_uids (uid int2);
INSERT INTO all_uids SELECT generate_series(1000,32767);
ANALYSE all_uids;

EXPLAIN ANALYSE SELECT a.uid
FROM all_uids a
WHERE a.uid NOT IN (SELECT uid FROM people)
ORDER BY random() LIMIT 1;
                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=1884.14..1884.14 rows=1 width=2) (actual
time=39.019..39.019 rows=0 loops=1)
    ->  Sort  (cost=1884.14..1923.85 rows=15884 width=2) (actual
time=39.014..39.014 rows=0 loops=1)
          Sort Key: random()
          ->  Seq Scan on all_uids a  (cost=199.00..775.81 rows=15884
width=2) (actual time=38.959..38.959 rows=0 loops=1)
                Filter: (NOT (hashed subplan))
                SubPlan
                  ->  Seq Scan on people  (cost=0.00..174.00 rows=10000
width=2) (actual time=0.046..7.282 rows=10000 loops=1)
  Total runtime: 39.284 ms

That's more sensible.

I'd actually use a table to track unused_uids and have triggers that
kept everything in step. However, if you didn't want to do that, I'd try
a left-join.

EXPLAIN ANALYSE
SELECT a.uid
FROM generate_series(1000, 32767) as a(uid) LEFT JOIN people p ON
a.uid=p.uid
WHERE
     p.uid IS NULL
ORDER BY random() LIMIT 1;

Not ideal, but like I say I'd use an unused_uids table. If nothing else,
I'd be wary about immediately re-using a uid - your db+application might
cope fine, but these values have a tendency to be referred to elsewhere.

HTH
--
   Richard Huxton
   Archonet Ltd

Re: sequence query performance issues

From
Tom Lane
Date:
Richard Huxton <dev@archonet.com> writes:
> Hmm - why is it doing that?

I'm betting that the OP's people.uid column is not an integer.  Existing
PG releases can't use hashed subplans for cross-data-type comparisons
(8.3 will be a bit smarter).

            regards, tom lane

Re: sequence query performance issues

From
Richard Huxton
Date:
Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
>> Hmm - why is it doing that?
>
> I'm betting that the OP's people.uid column is not an integer.  Existing
> PG releases can't use hashed subplans for cross-data-type comparisons
> (8.3 will be a bit smarter).

Looked like an int2 to me (width=2, max value ~ 32k)

--
   Richard Huxton
   Archonet Ltd

Re: sequence query performance issues

From
"Peter Koczan"
Date:
> > Hmm - why is it doing that?
>
> I'm betting that the OP's people.uid column is not an integer.  Existing
> PG releases can't use hashed subplans for cross-data-type comparisons
> (8.3 will be a bit smarter).

*light bulb* Ahhhhhhh, that's it. So, I guess the solution is either
to cast the column or wait for 8.3 (which isn't a problem since the
port won't be done until 8.3 is released anyway).

Thanks again.

Peter

Re: sequence query performance issues

From
"Peter Koczan"
Date:
> *light bulb* Ahhhhhhh, that's it. So, I guess the solution is either
> to cast the column or wait for 8.3 (which isn't a problem since the
> port won't be done until 8.3 is released anyway).

Just a quick bit of follow-up:

This query works and is equivalent to what I was trying to do (minus
the randomization and limiting):
=> select a.uid from generate_series(1000, 32000) as a(uid) where
a.uid::smallint not in (select uid from people where uid is not null);

It turns out that this and using coalesce are a wash in terms of
performance, usually coming within 10 ms of each other no matter what
limit and ordering constraints you put on the queries.

Peter

=> explain analyze select a.uid from generate_series(1000, 32767) as
a(uid) where a.uid not in (select coalesce(uid, 0) from people);
                                                         QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series a  (cost=718.41..733.41 rows=500
width=4) (actual time=68.742..186.340 rows=26808 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on people  (cost=0.00..702.68 rows=6294 width=2)
(actual time=0.025..28.368 rows=6294 loops=1)
 Total runtime: 286.311 ms
(5 rows)

=> explain analyze select a.uid from generate_series(1000, 32767) as
a(uid) where a.uid::smallint not in (select uid from people where uid
is not null);
                                                         QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series a  (cost=699.34..716.84 rows=500
width=4) (actual time=58.508..177.683 rows=26808 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on people  (cost=0.00..686.94 rows=4958 width=2)
(actual time=0.017..23.123 rows=4971 loops=1)
           Filter: (uid IS NOT NULL)
 Total runtime: 277.699 ms
(6 rows)