Thread: analyzing intermediate query

analyzing intermediate query

From
"Andrus"
Date:
I noticed that query

SELECT dok.*
 FROM dok
JOIN  (SELECT DISTINCT dokumnr FROM  temptbl ) x USING(dokumnr);

is slow in 8.1.4
I cannot use EXPLAIN ANALYZE since this query uses results from the temporary
table temptbl, which is not available.

Sometimes the inner table returns only 1 row, so maybe a seq scan is selected
instead of a single-row index access because the expected row count is 1000

As I understand it, PostgreSQL requires manually running ANALYZE on temporary
tables if their row count differs from 1000

How can I force PostgreSQL to analyze the inner table in this query, or
otherwise get a query plan that uses the index when the inner query returns a
single row?


Re: analyzing intermediate query

From
PFC
Date:
> I noticed that query
>
> SELECT dok.*
>  FROM dok
> JOIN  (SELECT DISTINCT dokumnr FROM  temptbl ) x USING(dokumnr);
>
> is slow in 8.1.4
> I cannot use explain analyze since this query uses results from
> temporary table temptbl which is not available.

    Generally, if you know your temptbl will always contain a few rows (say,
generally a few and never more than a few thousand), it is better to use
something like this:

    - get list of items
    - SELECT * FROM table WHERE id IN (...)

    Of course you must be pretty damn sure that the list isn't gonna contain
10 million items. Or else you'll have a little problem. But it generally
works pretty well. The overhead of generating and parsing the IN() is
lower than the overhead of temptables...
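
    For instance, a minimal sketch (the id list would be built by the
application beforehand; the values here are made up):

SELECT dok.*
 FROM dok
WHERE dokumnr IN (101, 102, 103);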


    By the way, some time ago there was talk about adding estimation of the
number of rows returned by set-returning functions. What's the status of
this? It doesn't seem to have survived...

8.3> EXPLAIN SELECT * FROM generate_series( 1,10 );
                                QUERY PLAN
------------------------------------------------------------------------
  Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=4)
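
    If I remember correctly, 8.3 does let you attach your own estimate to a
set-returning function with the ROWS clause. A hypothetical sketch (the
function name is made up):

CREATE FUNCTION small_series() RETURNS SETOF integer AS $$
BEGIN
    -- stand-in body; the point here is the ROWS hint below
    RETURN QUERY SELECT * FROM generate_series(1, 10);
END;
$$ LANGUAGE plpgsql ROWS 10;  -- planner assumes ~10 rows instead of the default 1000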


> Sometimes the inner table returns only 1 row, so maybe a seq scan is
> selected instead of a single-row index access because the expected row
> count is 1000
>
> As I understand it, PostgreSQL requires manually running ANALYZE on
> temporary tables if their row count differs from 1000
>
> How can I force PostgreSQL to analyze the inner table in this query, or
> otherwise get a query plan that uses the index when the inner query
> returns a single row?



Re: analyzing intermediate query

From
"Andrus"
Date:
> Generally if you know your temptbl will always contains a few rows (say,
> generally a few and never more than a few thousands) it is better to use
> something like that :
>
> - get list of items
> - SELECT * FROM table WHERE id IN (...)

My list can contain 1 .. 100000 records, and the table contains 3000000
records and is growing.

As discussed here some time ago, IN (...) forces a seq scan over the 3000000
rows, and a stack overflow exception may also occur (the stack overflow occurs
in 8.0; maybe it is fixed in 8.1).

Using a temp table + ANALYZE enables a bitmap index scan for this query and is
thus a lot faster.

I formerly used IN (...) but changed this to use a temp table + a primary key
on the temp table + ANALYZE on that temp table.
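
Roughly like this (a sketch; the population step and the exact temp table
layout are simplified here):

CREATE TEMP TABLE temptbl (dokumnr int PRIMARY KEY);
-- ... populate temptbl from other server-side tables ...
ANALYZE temptbl;

SELECT dok.*
 FROM dok
JOIN  (SELECT DISTINCT dokumnr FROM  temptbl ) x USING(dokumnr);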

Using 8.1.4

I can switch this to a temp table as well if it helps.
That requires some special logic to generate the temp table name, since there
may be a number of such tables in a single transaction, so it would be a major
application rewrite.

Andrus.


Re: analyzing intermediate query

From
PFC
Date:
> My list can contain 1 .. 100000  records and table contains 3000000
> records and is growing.

    Ah. No IN(), then ;)
    Temp table + ANALYZE seems your only option...


Re: analyzing intermediate query

From
"Andrus"
Date:
>> My list can contain 1 .. 100000  records and table contains 3000000
>> records and is growing.
>
> Ah. No IN(), then ;)
> Temp table + ANALYZE seems your only option...

In 8.3 or 8.4, I think that IN() and a temp table produce exactly the same
result.

Andrus.


Re: analyzing intermediate query

From
PFC
Date:
>>> My list can contain 1 .. 100000  records and table contains 3000000
>>> records and is growing.
>>
>> Ah. No IN(), then ;)
>> Temp table + ANALYZE seems your only option...
>
> In 8.3 or 8.4  I think that IN() or temp table produce exactly the same
> result.
>
> Andrus.

    Oh, I just thought of something; I don't remember in which version it
was added, but:

EXPLAIN ANALYZE SELECT sum(column1) FROM (VALUES ...a million integers...
) AS v

    Postgres is perfectly happy with that; it's either a bit slow (about 1
second) or very fast, depending on how you view things...

  Aggregate  (cost=15000.00..15000.01 rows=1 width=4) (actual time=1060.253..1060.253 rows=1 loops=1)
    ->  Values Scan on "*VALUES*"  (cost=0.00..12500.00 rows=1000000 width=4) (actual time=0.009..634.728 rows=1000000 loops=1)
  Total runtime: 1091.420 ms

    The most interesting thing, of course, is that the statistics are exact.
    You can use VALUES like a table (Join, whatever).
    Of course it's always slightly annoying to juggle around with result sets
and stuff them in comma-separated strings, but it works.

    Here it knows there are only a few rows ===> nested loop

EXPLAIN SELECT a.* FROM annonces a JOIN (VALUES
(0),(1),(2),(3),(4),(5),(6),(7)) AS v ON (a.id=v.column1);
                                        QUERY PLAN
----------------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..66.73 rows=8 width=943)
    ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=4)
    ->  Index Scan using annonces_pkey on annonces a  (cost=0.00..8.32 rows=1 width=943)
          Index Cond: (a.id = "*VALUES*".column1)

    With a million values it goes hash of course, etc.

Re: analyzing intermediate query

From
"Andrus"
Date:
> Oh, I just thought about something, I don't remember in which version it
> was added, but :
>
> EXPLAIN ANALYZE SELECT sum(column1) FROM (VALUES ...a million
> integers...  ) AS v
>
> Postgres is perfectly happy with that ; it's either a bit slow (about 1
> second) or very fast depending on how you view things...

I tried in 8.1.4

select * from (values (0)) xx

but got

ERROR: syntax error at or near ")"
SQL state: 42601
Character: 26

Even if this worked, it may not be a solution: I need to apply DISTINCT to the
temporary table. The temporary table may contain duplicate values, and without
DISTINCT the join produces an invalid result.
The temporary table itself is created from data in server tables; it is not
generated from a list.

I can use

SELECT dok.*
 FROM dok
WHERE dokumnr IN  (SELECT dokumnr FROM  temptbl)

but this never seems to use a bitmap index scan in 8.1.4

Sadly, creating a second temporary table from the first temporary table
specifically for this query seems to be the only solution.
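
Something along these lines, I suppose (just a sketch; temptbl2 is a made-up
name):

CREATE TEMP TABLE temptbl2 AS SELECT DISTINCT dokumnr FROM temptbl;
ANALYZE temptbl2;

SELECT dok.*
 FROM dok
JOIN temptbl2 USING(dokumnr);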

When will a materialized row count be added, so that statistics are exact and
select count(*) from tbl runs fast?

Andrus.


Re: analyzing intermediate query

From
Scott Carey
Date:
Oftentimes, switching an inner subselect that requires a distinct to a group by on that column yields better results.
In this case, the IN should be equivalent, so it probably will not help.  This would look like:

SELECT dok.*
 FROM dok
JOIN  (SELECT dokumnr FROM  temptbl GROUP BY dokumnr ) x USING(dokumnr);

Whether that helps depends on how big dokumnr is and where the query bottleneck is.  Note there are subtle differences
between DISTINCT and GROUP BY with respect to nulls.


Re: analyzing intermediate query

From
"Andrus"
Date:
Scott,

>Often times, switching an inner subselect that requires a distinct to a
>group by on that column yields better results.  In this case, the IN should
>be equivalent, so it probably will not help.  This would look like:

SELECT dok.*
 FROM dok
JOIN  (SELECT dokumnr FROM  temptbl GROUP BY dokumnr ) x USING(dokumnr);

Thank you. This may be a great idea.
I changed my query to use GROUP BY instead of DISTINCT.

>Whether that helps depends on how big dokumnr is and where the query
>bottleneck is.

I'm wondering how this can solve the issue when there is only a single
dokumnr value, or just a few of them.
The planner still thinks that the temptbl projection contains 1000 rows and
uses a seq scan instead of a bitmap index on the dok table.

I tried

SELECT dok.*
 FROM dok
JOIN  (SELECT dokumnr FROM  temptbl GROUP BY dokumnr ANALYZE ) x
USING(dokumnr);

but got an error.

> Note there are subtle differences between DISTINCT and GROUP BY with
> respect to nulls.

dokumnr is of type int and is never null.

Andrus.


Re: analyzing intermediate query

From
Scott Carey
Date:
Have you tried running ANALYZE on the temp table before selecting out of it?  That should give it the statistics
necessary to correctly guess the output of a group by on a single column.

ANALYZE temptbl;
SELECT dok.*
 FROM dok
JOIN  (SELECT dokumnr FROM  temptbl GROUP BY dokumnr ) x USING(dokumnr);
