sequence query performance issues - Mailing list pgsql-performance

From Peter Koczan
Subject sequence query performance issues
Date
Msg-id 4544e0330709271504g364fd042u161966fa448934f6@mail.gmail.com
Whole thread Raw
Responses Re: sequence query performance issues
List pgsql-performance
Hello,

I have a weird performance issue with a query I'm testing. Basically,
I'm trying to port a function that generates user uids, and since
postgres offers a sequence generator function, I figure I'd take
advantage of that. Basically, I generate our uid range, filter out
those which are in use, and randomly pick however many I need.
However, when I run it it takes forever (>10 minutes and I get nothing
so I cancelled the query) and cpu usage on the server is maxed out.

Here's my query (I'll post the explain output later so as not to
obscure my question):
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select uid from people) order by random() limit 1;

I thought that nulls were a problem, so I tried:
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select coalesce(uid,0) from people) order by random()
limit 1;
And that finished in less than a second.

I then tried:
=> select a.uid from generate_series(1000, 32767) as a(uid) where
a.uid not in (select coalesce(uid,0) from people where uid is not
null) order by random() limit 1;
And we're back to taking forever.

So I have 2 questions:

- Is there a better query for this purpose? Mine works when coalesced,
but it seems a little brute-force and the random() sorting, while
kinda nice, is slow.

- Is this in any way expected? I know that nulls sometimes cause
problems, but why is it taking forever even when trying to filter
those out?

Thanks.

Peter

The gory details:
- There is an btree index on people(uid), and there are ~6300 rows, of
which ~1300 have null uids.

- EXPLAIN output (I couldn't get EXPLAIN ANALYZE output from the first
two queries since they took too long):
=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select uid from people) order by random() limit 1;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=40025.57..40025.60 rows=10 width=4)
   ->  Sort  (cost=40025.57..40026.82 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a
(cost=693.16..40003.16 rows=500 width=4)
               Filter: (NOT (subplan))
               SubPlan
                 ->  Materialize  (cost=693.16..756.03 rows=6287 width=2)
                       ->  Seq Scan on people  (cost=0.00..686.87
rows=6287 width=2)
(8 rows)

=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select uid from people where uid is not null)
order by random() limit 1;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=31486.71..31486.73 rows=10 width=4)
   ->  Sort  (cost=31486.71..31487.96 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a
(cost=691.79..31464.29 rows=500 width=4)
               Filter: (NOT (subplan))
               SubPlan
                 ->  Materialize  (cost=691.79..741.00 rows=4921 width=2)
                       ->  Seq Scan on people  (cost=0.00..686.87
rows=4921 width=2)
                             Filter: (uid IS NOT NULL)
(9 rows)

=> explain select a.uid from generate_series(1000, 32767) as a(uid)
where a.uid not in (select coalesce(uid, 0) from people) order by
random() limit 1;
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Limit  (cost=756.97..756.99 rows=10 width=4)
   ->  Sort  (cost=756.97..758.22 rows=500 width=4)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=718.30..734.55
rows=500 width=4)
               Filter: (NOT (hashed subplan))
               SubPlan
                 ->  Seq Scan on people  (cost=0.00..702.59 rows=6287 width=2)
(7 rows)

=> explain analyze select a.uid from generate_series(1000, 32767) as
a(uid) where a.uid not in (select coalesce(uid, 0) from people) order
by random() limit 1;
                                                               QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=756.97..756.99 rows=10 width=4) (actual
time=370.444..370.554 rows=10 loops=1)
   ->  Sort  (cost=756.97..758.22 rows=500 width=4) (actual
time=370.434..370.472 rows=10 loops=1)
         Sort Key: random()
         ->  Function Scan on generate_series a  (cost=718.30..734.55
rows=500 width=4) (actual time=70.018..199.540 rows=26808 loops=1)
               Filter: (NOT (hashed subplan))
               SubPlan
                 ->  Seq Scan on people  (cost=0.00..702.59 rows=6287
width=2) (actual time=0.023..29.167 rows=6294 loops=1)
 Total runtime: 372.224 ms
(8 rows)

pgsql-performance by date:

Previous
From: Ron Mayer
Date:
Subject: Re: Searching for the cause of a bad plan
Next
From: Richard Huxton
Date:
Subject: Re: sequence query performance issues