Thread: analyzing intermediate query
I noticed that query SELECT dok.* FROM dok JOIN (SELECT DISTINCT dokumnr FROM temptbl ) x USING(dokumnr); is slow in 8.1.4 I cannot use explain analyze since this query uses results from temporary table temptbl which is not available. Sometimes innter table returns only 1 row so maybe seq scan is selected instead of single row index access becauses expected count is 1000 As I understand, PostgreSql requires manually running ANALYZE for temporary tables if their row count is different from 1000 How to force PostgreSql to analyze inner table in this query or use other way to get index using query plan if inner query returns single row ? How
> I noticed that query > > SELECT dok.* > FROM dok > JOIN (SELECT DISTINCT dokumnr FROM temptbl ) x USING(dokumnr); > > is slow in 8.1.4 > I cannot use explain analyze since this query uses results from > temporary table temptbl which is not available. Generally if you know your temptbl will always contains a few rows (say, generally a few and never more than a few thousands) it is better to use something like that : - get list of items - SELECT * FROM table WHERE id IN (...) Of course you must be pretty damn sure that the list isn't gonna contain 10 million items. Or else you'll have a little problem. But it generally works pretty well. The overhead of generating and parsing the IN() is lower than the overhead of temptables... By the way, sometime ago there was talk about adding estimation of number of rows returned to set-returning functions. What's the status of this ? It doesn't seem to have survived... 8.3> EXPLAIN SELECT * FROM generate_series( 1,10 ); QUERY PLAN ------------------------------------------------------------------------ Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=4) > Sometimes innter table returns only 1 row so maybe seq scan is selected > instead of single row index access becauses expected count is 1000 > > As I understand, PostgreSql requires manually running ANALYZE for > temporary tables if their row count is different from 1000 > > How to force PostgreSql to analyze inner table in this query or use > other way to get index using query plan if inner query returns single > row ? > > How
> Generally if you know your temptbl will always contains a few rows (say, > generally a few and never more than a few thousands) it is better to use > something like that : > > - get list of items > - SELECT * FROM table WHERE id IN (...) My list can contain 1 .. 100000 records and table contains 3000000 records and is growing. As discussed here few time ago, IN (...) forces seq scan over 3000000 rows and maybe stack overflow exception also occurs (stack overflow occurs in 8.0, maybe it is fixed in 8.1). Using temp table + ANALYZE enables bitmap index scan for this query and is thus a lot faster. I formerly used IN (...) but changed this to use temp table + primary key on temp table + analyze this temp table. Using 8.1.4 I can switch this to temp table also if it helps. This requires some special logic to generate temp table name since there may be a number of such tables in single transaction, so is would be major appl rewrite. Andrus.
> My list can contain 1 .. 100000 records and table contains 3000000 > records and is growing. Ah. No IN(), then ;) Temp table + ANALYZE seems your only option...
>> My list can contain 1 .. 100000 records and table contains 3000000 >> records and is growing. > > Ah. No IN(), then ;) > Temp table + ANALYZE seems your only option... In 8.3 or 8.4 I think that IN() or temp table produce exactly the same result. Andrus.
>>> My list can contain 1 .. 100000 records and table contains 3000000 >>> records and is growing. >> >> Ah. No IN(), then ;) >> Temp table + ANALYZE seems your only option... > > In 8.3 or 8.4 I think that IN() or temp table produce exactly the same > result. > > Andrus. Oh, I just thought about something, I don't remember in which version it was added, but : EXPLAIN ANALYZE SELECT sum(column1) FROM (VALUES ...a million integers... ) AS v Postgres is perfectly happy with that ; it's either a bit slow (about 1 second) or very fast depending on how you view things... Aggregate (cost=15000.00..15000.01 rows=1 width=4) (actual time=1060.253..1060.253 rows=1 loops=1) -> Values Scan on "*VALUES*" (cost=0.00..12500.00 rows=1000000 width=4) (actual time=0.009..634.728 rows=1000000 loops=1) Total runtime: 1091.420 ms The most interesting thing, of course, is that the statistics are exact. You can use VALUES like a table (Join, whatever). Of course it's always slightly annoying to juggle around with result sets and stuff them in comma-separated strings, but it works. Here it knows there's few rows ===> nested loop EXPLAIN SELECT a.* FROM annonces a JOIN (VALUES (0),(1),(2),(3),(4),(5),(6),(7)) AS v ON (a.id=v.column1); QUERY PLAN ---------------------------------------------------------------------------------------- Nested Loop (cost=0.00..66.73 rows=8 width=943) -> Values Scan on "*VALUES*" (cost=0.00..0.10 rows=8 width=4) -> Index Scan using annonces_pkey on annonces a (cost=0.00..8.32 rows=1 width=943) Index Cond: (a.id = "*VALUES*".column1) With a million values it goes hash of course, etc.
> Oh, I just thought about something, I don't remember in which version it > was added, but : > > EXPLAIN ANALYZE SELECT sum(column1) FROM (VALUES ...a million > ntegers... ) AS v > > Postgres is perfectly happy with that ; it's either a bit slow (about 1 > second) or very fast depending on how you view things... I tried in 8.1.4 select * from (values (0)) xx but got ERROR: syntax error at or near ")" SQL state: 42601 Character: 26 Even if this works this may be not solution: I need to apply distinct to temporary table. Temporary table may contain duplicate values and without DISTINCT join produces invalid result. Temporary table itself is created from data from server tables, it is not generated from list. I can use SELECT dok.* FROM dok WHERE dokumnr IN (SELECT dokumnr FROM temptbl) but this seems never use bitmap index scan in 8.1.4 Sadly, creating second temporary table from first temporary table specially for this query seems to be only solution. When materialized row count will be added so that statistics is exact and select count(*) from tbl runs fast ? Andrus.
Often times, switching an inner subselect that requires a distinct to a group by on that column yields better results. Inthis case, the IN should be equivalent, so it probably will not help. This would look like: SELECT dok.* FROM dok JOIN (SELECT dokumnr FROM temptbl GROUP BY dokumnr ) x USING(dokumnr); Whether that hepls depends on how big dokumnr is and where the query bottleneck is. Note there are subtle differences betweenDISTINCT and GROUP BY with respect to nulls. ________________________________________ From: pgsql-performance-owner@postgresql.org [pgsql-performance-owner@postgresql.org] On Behalf Of Andrus [kobruleht2@hot.ee] Sent: Tuesday, December 02, 2008 7:50 AM To: pgsql-performance@postgresql.org; PFC Subject: Re: [PERFORM] analyzing intermediate query > Oh, I just thought about something, I don't remember in which version it > was added, but : > > EXPLAIN ANALYZE SELECT sum(column1) FROM (VALUES ...a million > ntegers... ) AS v > > Postgres is perfectly happy with that ; it's either a bit slow (about 1 > second) or very fast depending on how you view things... I tried in 8.1.4 select * from (values (0)) xx but got ERROR: syntax error at or near ")" SQL state: 42601 Character: 26 Even if this works this may be not solution: I need to apply distinct to temporary table. Temporary table may contain duplicate values and without DISTINCT join produces invalid result. Temporary table itself is created from data from server tables, it is not generated from list. I can use SELECT dok.* FROM dok WHERE dokumnr IN (SELECT dokumnr FROM temptbl) but this seems never use bitmap index scan in 8.1.4 Sadly, creating second temporary table from first temporary table specially for this query seems to be only solution. When materialized row count will be added so that statistics is exact and select count(*) from tbl runs fast ? Andrus. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Scott, >Often times, switching an inner subselect that requires a distinct to a >group by on that column yields better results. In this case, the IN should >be equivalent, so it probably will not help. This would look like: SELECT dok.* FROM dok JOIN (SELECT dokumnr FROM temptbl GROUP BY dokumnr ) x USING(dokumnr); Thank you. This may be great idea. I changed my query to use GROUP BY instead of DISTINCT >Whether that hepls depends on how big dokumnr is and where the query >bottleneck is. I'm wondering how this can solve the issue when there is single or few dokumnr columns. Planner still thinks that temptbl projection contains 1000 rows and uses seq scan instead of using bitmap index on dok table. I tried SELECT dok.* FROM dok JOIN (SELECT dokumnr FROM temptbl GROUP BY dokumnr ANALYZE ) x USING(dokumnr); but got error. > Note there are subtle differences between DISTINCT and GROUP BY with > respect to nulls. dokumnr is int type and is not null always. Andrus.
Have you tried running ANALYZE on the temp table before selecting out of it? That should give it the statistics necessaryto correctly guess the output of a group by on a single column. ANALYZE temptbl; SELECT dok.* FROM dok JOIN (SELECT dokumnr FROM temptbl GROUP BY dokumnr ) x USING(dokumnr); -----Original Message----- From: Andrus [mailto:kobruleht2@hot.ee] Sent: Tuesday, December 02, 2008 10:59 AM To: Scott Carey; pgsql-performance@postgresql.org; PFC Subject: Re: [PERFORM] analyzing intermediate query Scott, >Often times, switching an inner subselect that requires a distinct to a >group by on that column yields better results. In this case, the IN should >be equivalent, so it probably will not help. This would look like: SELECT dok.* FROM dok JOIN (SELECT dokumnr FROM temptbl GROUP BY dokumnr ) x USING(dokumnr); Thank you. This may be great idea. I changed my query to use GROUP BY instead of DISTINCT >Whether that hepls depends on how big dokumnr is and where the query >bottleneck is. I'm wondering how this can solve the issue when there is single or few dokumnr columns. Planner still thinks that temptbl projection contains 1000 rows and uses seq scan instead of using bitmap index on dok table. I tried SELECT dok.* FROM dok JOIN (SELECT dokumnr FROM temptbl GROUP BY dokumnr ANALYZE ) x USING(dokumnr); but got error. > Note there are subtle differences between DISTINCT and GROUP BY with > respect to nulls. dokumnr is int type and is not null always. Andrus.