Thread: Optimizer's issue

Optimizer's issue

From: Vlad Arkhipov
I found a strange issue in a very simple query. The statistics target for all
columns is set to 1000, but I also tried other levels.

create table g (
  id bigint primary key,
  isgroup boolean not null);

create table a (
  groupid bigint references g(id),
  id bigint,
  unique(id, groupid));

analyze g;
analyze a;

select count(*) from a
294

select count(*) from g
320

explain analyze
select *
from g
  join a on a.groupid = g.id
where g.isgroup

Hash Join  (cost=5.35..11.50 rows=11 width=25) (actual time=0.261..1.755
rows=294 loops=1)
  Hash Cond: (a.groupid = g.id)
  ->  Seq Scan on a  (cost=0.00..4.94 rows=294 width=16) (actual
time=0.047..0.482 rows=294 loops=1)
  ->  Hash  (cost=5.20..5.20 rows=12 width=9) (actual time=0.164..0.164
rows=12 loops=1)
        ->  Seq Scan on g  (cost=0.00..5.20 rows=12 width=9) (actual
time=0.042..0.136 rows=12 loops=1)
              Filter: isgroup
Total runtime: 2.225 ms

And this is more interesting:
explain analyze
select *
from g
  join a on a.groupid = g.id
where not g.isgroup

Hash Join  (cost=9.05..17.92 rows=283 width=25) (actual
time=2.038..2.038 rows=0 loops=1)
  Hash Cond: (a.groupid = g.id)
  ->  Seq Scan on a  (cost=0.00..4.94 rows=294 width=16) (actual
time=0.046..0.478 rows=294 loops=1)
  ->  Hash  (cost=5.20..5.20 rows=308 width=9) (actual time=1.090..1.090
rows=308 loops=1)
        ->  Seq Scan on g  (cost=0.00..5.20 rows=308 width=9) (actual
time=0.038..0.557 rows=308 loops=1)
              Filter: (NOT isgroup)
Total runtime: 2.126 ms

PostgreSQL 8.3
These queries are part of a bigger query, and the optimizer puts them at a leaf
of the query tree, so the row miscount causes a real problem.
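
(Presumably the planner arrives at these estimates by treating the isgroup filter
and the join as independent: roughly 294 rows in a * (12 qualifying rows / 320 rows
in g) ≈ 11 for the first query, and 294 * (308 / 320) ≈ 283 for the second, whereas
in the actual data every groupid in a references a row with isgroup = true.)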

Statistics for table a:
id
--
histogram_bounds: {1,40,73,111,143,174,204,484,683,715,753}
correlation: 0.796828

groupid
-------
n_distinct: 12
most_common_vals: {96,98,21,82,114,131,48,44,173,682,752}
most_common_freqs:
{0.265306,0.166667,0.163265,0.136054,0.0884354,0.0782313,0.0714286,0.00680272,0.00680272,0.00680272,0.00680272}
correlation: 0.366704

for table g:
id
--
histogram_bounds: {1,32,64,101,134,166,199,451,677,714,753}
correlation: 1

isgroup
-------
n_distinct: 2
most_common_freqs: {0.9625,0.0375}
correlation: 0.904198
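
(For reference, the figures above come from pg_stats; below is a minimal sketch of
how the per-column statistics level is set and how the stats are read back, not
necessarily the exact commands used here.)

-- raise the per-column statistics target, then refresh the statistics
alter table a alter column groupid set statistics 1000;
alter table g alter column isgroup set statistics 1000;
analyze a;
analyze g;

-- read back the planner statistics quoted above
select tablename, attname, n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds, correlation
from pg_stats
where tablename in ('a', 'g');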



Re: Optimizer's issue

From: "Albe Laurenz"
Vlad Arkhipov wrote:
> I found a strange issue in a very simple query.

You forgot to mention what your problem is.

Yours,
Laurenz Albe

Re: Optimizer's issue

From: Vlad Arkhipov
Albe Laurenz writes:
> Vlad Arkhipov wrote:
>
>> I found a strange issue in a very simple query.
>>
>
> You forgot to mention what your problem is.
>
> Yours,
> Laurenz Albe
>
It was at the end of my first post:
"These queries are part of a bigger query, and the optimizer puts them at a leaf
of the query tree, so the row miscount causes a real problem."
The actual row count for the first query is 294 against an estimate of 11; for the
second, the estimate is 283 but the actual count is 0. Postgres tries to use nested
loops where it should use a merge/hash join, and vice versa.


Re: Optimizer's issue

From: Matthew Wakeling
On Thu, 24 Apr 2008, Vlad Arkhipov wrote:
> It was at the end of my first post:
> "These queries are part of a bigger query, and the optimizer puts them at a leaf
> of the query tree, so the row miscount causes a real problem."
> The actual row count for the first query is 294 against an estimate of 11; for the
> second, the estimate is 283 but the actual count is 0. Postgres tries to use nested
> loops where it should use a merge/hash join, and vice versa.

The queries are taking less than three milliseconds. You only have three
hundred rows in each table. The precise algorithm Postgres uses probably
won't make much of a difference. Have you tried turning off the relevant
join plans to see how long Postgres takes with your desired plan?

When you have millions of rows, the algorithm matters a lot more.
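
For example, something like this (just a sketch, using the small query from above
as a stand-in for the big one; pick whichever enable_* settings correspond to the
join method you want to rule out, and note they only affect the current session):

set enable_nestloop = off;   -- or enable_hashjoin / enable_mergejoin
explain analyze
select *
from g
  join a on a.groupid = g.id
where g.isgroup;
reset enable_nestloop;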

Matthew

--
Richards' Laws of Data Security:
 1. Don't buy a computer.
 2. If you must buy a computer, don't turn it on.

Re: Optimizer's issue

From: PFC
On Thu, 24 Apr 2008 03:14:54 +0200, Vlad Arkhipov <arhipov@dc.baikal.ru>
wrote:

> I found a strange issue in a very simple query. The statistics target for all
> columns is set to 1000, but I also tried other levels.
>
> create table g (
>   id bigint primary key,
>   isgroup boolean not null);
>
> create table a (
>   groupid bigint references g(id),
>   id bigint,
>   unique(id, groupid));
>
> analyze g;
> analyze a;
>
> select count(*) from a
> 294
>
> select count(*) from g
> 320
>
> explain analyze
> select *
> from g
>   join a on a.groupid = g.id
> where g.isgroup
>
> Hash Join  (cost=5.35..11.50 rows=11 width=25) (actual time=0.261..1.755
> rows=294 loops=1)
>   Hash Cond: (a.groupid = g.id)
>   ->  Seq Scan on a  (cost=0.00..4.94 rows=294 width=16) (actual
> time=0.047..0.482 rows=294 loops=1)
>   ->  Hash  (cost=5.20..5.20 rows=12 width=9) (actual time=0.164..0.164
> rows=12 loops=1)
>         ->  Seq Scan on g  (cost=0.00..5.20 rows=12 width=9) (actual
> time=0.042..0.136 rows=12 loops=1)
>               Filter: isgroup
> Total runtime: 2.225 ms

    You should really post an EXPLAIN ANALYZE of your big query.

    This little query plan seems OK to me.
    Two very small tables: OK, hash them, it's the best.
    Now, of course, if it is repeated for every row in your JOIN, you have a
problem.
    The question is, why is it repeated for every row?
    This cannot be answered without seeing the whole query.

    Another question would be, is there a way to structure the tables
differently?
    Again, this cannot be answered without seeing the whole query, and some
explanation of what the data and fields mean.

    Please provide more information...



Re: Optimizer's issue

From: Vlad Arkhipov
PFC writes:
> On Thu, 24 Apr 2008 03:14:54 +0200, Vlad Arkhipov
> <arhipov@dc.baikal.ru> wrote:
>
>> [...]
>
>     You should really post an EXPLAIN ANALYZE of your big query.
>
>     This little query plan seems OK to me.
>     Two very small tables: OK, hash them, it's the best.
>     Now, of course, if it is repeated for every row in your JOIN, you
> have a problem.
>     The question is, why is it repeated for every row?
>     This cannot be answered without seeing the whole query.
>
>     Another question would be, is there a way to structure the tables
> differently?
>     Again, this cannot be answered without seeing the whole query, and
> some explanation of what the data and fields mean.
>
>     Please provide more information...
>
I redesigned the table structure and the query seems to have become faster.
You were right, the problem was not in this query.