Thread: sequence query performance issues
Hello, I have a weird performance issue with a query I'm testing. Basically, I'm trying to port a function that generates user uids, and since postgres offers a sequence generator function, I figure I'd take advantage of that. Basically, I generate our uid range, filter out those which are in use, and randomly pick however many I need. However, when I run it it takes forever (>10 minutes and I get nothing so I cancelled the query) and cpu usage on the server is maxed out. Here's my query (I'll post the explain output later so as not to obscure my question): => select a.uid from generate_series(1000, 32767) as a(uid) where a.uid not in (select uid from people) order by random() limit 1; I thought that nulls were a problem, so I tried: => select a.uid from generate_series(1000, 32767) as a(uid) where a.uid not in (select coalesce(uid,0) from people) order by random() limit 1; And that finished in less than a second. I then tried: => select a.uid from generate_series(1000, 32767) as a(uid) where a.uid not in (select coalesce(uid,0) from people where uid is not null) order by random() limit 1; And we're back to taking forever. So I have 2 questions: - Is there a better query for this purpose? Mine works when coalesced, but it seems a little brute-force and the random() sorting, while kinda nice, is slow. - Is this in any way expected? I know that nulls sometimes cause problems, but why is it taking forever even when trying to filter those out? Thanks. Peter The gory details: - There is an btree index on people(uid), and there are ~6300 rows, of which ~1300 have null uids. - EXPLAIN output (I couldn't get EXPLAIN ANALYZE output from the first two queries since they took too long): => explain select a.uid from generate_series(1000, 32767) as a(uid) where a.uid not in (select uid from people) order by random() limit 1; QUERY PLAN ------------------------------------------------------------------------------------------ Limit (cost=40025.57..40025.60 rows=10 width=4) -> Sort (cost=40025.57..40026.82 rows=500 width=4) Sort Key: random() -> Function Scan on generate_series a (cost=693.16..40003.16 rows=500 width=4) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=693.16..756.03 rows=6287 width=2) -> Seq Scan on people (cost=0.00..686.87 rows=6287 width=2) (8 rows) => explain select a.uid from generate_series(1000, 32767) as a(uid) where a.uid not in (select uid from people where uid is not null) order by random() limit 1; QUERY PLAN ------------------------------------------------------------------------------------------ Limit (cost=31486.71..31486.73 rows=10 width=4) -> Sort (cost=31486.71..31487.96 rows=500 width=4) Sort Key: random() -> Function Scan on generate_series a (cost=691.79..31464.29 rows=500 width=4) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=691.79..741.00 rows=4921 width=2) -> Seq Scan on people (cost=0.00..686.87 rows=4921 width=2) Filter: (uid IS NOT NULL) (9 rows) => explain select a.uid from generate_series(1000, 32767) as a(uid) where a.uid not in (select coalesce(uid, 0) from people) order by random() limit 1; QUERY PLAN ---------------------------------------------------------------------------------------- Limit (cost=756.97..756.99 rows=10 width=4) -> Sort (cost=756.97..758.22 rows=500 width=4) Sort Key: random() -> Function Scan on generate_series a (cost=718.30..734.55 rows=500 width=4) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on people (cost=0.00..702.59 rows=6287 width=2) (7 rows) => explain analyze select a.uid from generate_series(1000, 32767) as a(uid) where a.uid not in (select coalesce(uid, 0) from people) order by random() limit 1; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=756.97..756.99 rows=10 width=4) (actual time=370.444..370.554 rows=10 loops=1) -> Sort (cost=756.97..758.22 rows=500 width=4) (actual time=370.434..370.472 rows=10 loops=1) Sort Key: random() -> Function Scan on generate_series a (cost=718.30..734.55 rows=500 width=4) (actual time=70.018..199.540 rows=26808 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on people (cost=0.00..702.59 rows=6287 width=2) (actual time=0.023..29.167 rows=6294 loops=1) Total runtime: 372.224 ms (8 rows)
Peter Koczan wrote: > Hello, > > I have a weird performance issue with a query I'm testing. Basically, > I'm trying to port a function that generates user uids, and since > postgres offers a sequence generator function, I figure I'd take > advantage of that. Basically, I generate our uid range, filter out > those which are in use, and randomly pick however many I need. > However, when I run it it takes forever (>10 minutes and I get nothing > so I cancelled the query) and cpu usage on the server is maxed out. I'd suspect either an unconstrained join or looping through seq-scans. > Here's my query (I'll post the explain output later so as not to > obscure my question): > => select a.uid from generate_series(1000, 32767) as a(uid) where > a.uid not in (select uid from people) order by random() limit 1; I let this run to it's conclusion and it's the materialize. If you see, it's materializing the result-set once for every value it tests against (loops=31768) QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=62722.66..62722.67 rows=1 width=4) (actual time=189963.485..189963.485 rows=0 loops=1) -> Sort (cost=62722.66..62723.91 rows=500 width=4) (actual time=189961.063..189961.063 rows=0 loops=1) Sort Key: random() -> Function Scan on generate_series a (cost=184.00..62700.25 rows=500 width=4) (actual time=189960.797..189960.797 rows=0 loops=1) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=184.00..284.00 rows=10000 width=2) (actual time=0.000..2.406 rows=9372 loops=31768) -> Seq Scan on people (cost=0.00..174.00 rows=10000 width=2) (actual time=0.055..7.181 rows=10000 loops=1) Total runtime: 189967.150 ms Hmm - why is it doing that? It's clearly confused about something. I suspect the root of the problem is that it doesn't know what generate_series() will return. To the planner it's just another set-returning function. This means it's getting (i) the # of rows wrong (rows=500) and also doesn't know (ii) there will be no nulls or (iii) what the range of values returned will be. Easy enough to test: CREATE TEMP TABLE all_uids (uid int2); INSERT INTO all_uids SELECT generate_series(1000,32767); ANALYSE all_uids; EXPLAIN ANALYSE SELECT a.uid FROM all_uids a WHERE a.uid NOT IN (SELECT uid FROM people) ORDER BY random() LIMIT 1; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Limit (cost=1884.14..1884.14 rows=1 width=2) (actual time=39.019..39.019 rows=0 loops=1) -> Sort (cost=1884.14..1923.85 rows=15884 width=2) (actual time=39.014..39.014 rows=0 loops=1) Sort Key: random() -> Seq Scan on all_uids a (cost=199.00..775.81 rows=15884 width=2) (actual time=38.959..38.959 rows=0 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on people (cost=0.00..174.00 rows=10000 width=2) (actual time=0.046..7.282 rows=10000 loops=1) Total runtime: 39.284 ms That's more sensible. I'd actually use a table to track unused_uids and have triggers that kept everything in step. However, if you didn't want to do that, I'd try a left-join. EXPLAIN ANALYSE SELECT a.uid FROM generate_series(1000, 32767) as a(uid) LEFT JOIN people p ON a.uid=p.uid WHERE p.uid IS NULL ORDER BY random() LIMIT 1; Not ideal, but like I say I'd use an unused_uids table. If nothing else, I'd be wary about immediately re-using a uid - your db+application might cope fine, but these values have a tendency to be referred to elsewhere. HTH -- Richard Huxton Archonet Ltd
Richard Huxton <dev@archonet.com> writes: > Hmm - why is it doing that? I'm betting that the OP's people.uid column is not an integer. Existing PG releases can't use hashed subplans for cross-data-type comparisons (8.3 will be a bit smarter). regards, tom lane
Tom Lane wrote: > Richard Huxton <dev@archonet.com> writes: >> Hmm - why is it doing that? > > I'm betting that the OP's people.uid column is not an integer. Existing > PG releases can't use hashed subplans for cross-data-type comparisons > (8.3 will be a bit smarter). Looked like an int2 to me (width=2, max value ~ 32k) -- Richard Huxton Archonet Ltd
> > Hmm - why is it doing that? > > I'm betting that the OP's people.uid column is not an integer. Existing > PG releases can't use hashed subplans for cross-data-type comparisons > (8.3 will be a bit smarter). *light bulb* Ahhhhhhh, that's it. So, I guess the solution is either to cast the column or wait for 8.3 (which isn't a problem since the port won't be done until 8.3 is released anyway). Thanks again. Peter
> *light bulb* Ahhhhhhh, that's it. So, I guess the solution is either > to cast the column or wait for 8.3 (which isn't a problem since the > port won't be done until 8.3 is released anyway). Just a quick bit of follow-up: This query works and is equivalent to what I was trying to do (minus the randomization and limiting): => select a.uid from generate_series(1000, 32000) as a(uid) where a.uid::smallint not in (select uid from people where uid is not null); It turns out that this and using coalesce are a wash in terms of performance, usually coming within 10 ms of each other no matter what limit and ordering constraints you put on the queries. Peter => explain analyze select a.uid from generate_series(1000, 32767) as a(uid) where a.uid not in (select coalesce(uid, 0) from people); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Function Scan on generate_series a (cost=718.41..733.41 rows=500 width=4) (actual time=68.742..186.340 rows=26808 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on people (cost=0.00..702.68 rows=6294 width=2) (actual time=0.025..28.368 rows=6294 loops=1) Total runtime: 286.311 ms (5 rows) => explain analyze select a.uid from generate_series(1000, 32767) as a(uid) where a.uid::smallint not in (select uid from people where uid is not null); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Function Scan on generate_series a (cost=699.34..716.84 rows=500 width=4) (actual time=58.508..177.683 rows=26808 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on people (cost=0.00..686.94 rows=4958 width=2) (actual time=0.017..23.123 rows=4971 loops=1) Filter: (uid IS NOT NULL) Total runtime: 277.699 ms (6 rows)