Thread: index problem

index problem

From
Szabo Zoltan
Date:
Hi,

I have that:

1)
db=> explain select pxygy_pid from prog_dgy_xy where pxygy_pid=12121;
NOTICE:  QUERY PLAN:

Group  (cost=0.00..29970.34 rows=921 width=4)
   ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy  (cost=0.00..29947.32 rows=9210 width=4)

then:
2)
db=> explain select pxygy_pid from prog_dgy_xy where pxygy_pid>12121;
NOTICE:  QUERY PLAN:

Group  (cost=66927.88..67695.39 rows=30700 width=4)
   ->  Sort  (cost=66927.88..66927.88 rows=307004 width=4)
         ->  Seq Scan on prog_dgy_xy  (cost=0.00..32447.66 rows=307004 width=4)

I am running some benchmarks: Oracle vs PostgreSQL vs MySQL, and this is 
breaking me now ;) MySQL and Oracle, with the same table and index, use 
the index on pxygy_pid.
I ran vacuum beforehand.

Time with mysql:

bash-2.04$ time echo " select count(*) from PROG_DGY_XY where 
pxygy_pid>12121;" | mysql -uuser -ppasswd db
count(*)
484984

real    0m13.761s
user    0m0.008s
sys     0m0.019s

Time with postgres:
bash-2.04$ time echo "select count(*) from PROG_DGY_XY where 
pxygy_pid>12121 " | psql -Uuser db
 count
--------
 484984
(1 row)


real    0m22.480s
user    0m0.011s
sys     0m0.021s

And this is just a small part of other selects joining tables, but 
because this index is not used, selecting from 2 tables (which have 
indexes and keys on the join columns) takes an extreme amount of time 
for postgres: 2m14.978s, while for mysql it takes 0m0.578s !!!

this select is:
select distinct PROG_ID, PROG_FTYPE, PROG_FCASTHOUR, PROG_DATE
from PROG_DATA, PROG_DGY_XY
where prog_id=pxygy_pid
order by prog_date, prog_ftype, prog_fcasthour;

indexes:
PROG_DATA:
create index prod_data_idx1 on prog_data 
(prog_date,prog_ftype,prog_fcasthour);
prog_id is primary key

PROG_DGY_XY:
create unique index progdgyxy_idx1 on PROG_DGY_XY 
(PXYGY_PID,PXYGY_X,PXYGY_Y);
create index progdgyxy_idx2 on PROG_DGY_XY (PXYGY_PID);


Thx
CoL



Re: index problem

From
Stephan Szabo
Date:
On Tue, 16 Oct 2001, CoL wrote:

> ---------------------------
> The 2 table query, where prog_data has ~8800 rowsn and index on prog_id:
> bash-2.04$ time echo "explain select distinct 
> prog_id,prog_ftype,prog_fcasthour,prog_date from prog_dgy_xy,prog_data 
> where pxygy_pid=prog_id " | psql -Uuser db
> NOTICE:  QUERY PLAN:
> 
> Unique  (cost=7432549.69..7680455.07 rows=2479054 width=32)
>    ->  Sort  (cost=7432549.69..7432549.69 rows=24790538 width=32)
>          ->  Merge Join  (cost=148864.65..161189.33 rows=24790538 width=32)
>                ->  Index Scan using prog_data_pkey on prog_data 
> (cost=0.00..701.12 rows=8872 width=28)
>                ->  Sort  (cost=148864.65..148864.65 rows=921013 width=4)
>                      ->  Seq Scan on prog_dgy_xy  (cost=0.00..30145.13 
> rows=921013 width=4)

I'm guessing that the approximately 25 million row estimate on the join
has to be wrong as well given that prog_data.prog_id should be unique.

Hmm, does the explain change if you vacuum analyze the other table
(prog_data)?  If not, what does explain show if you do a
set enable_seqscan='off';
before it?
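
(For reference, a minimal sketch of the commands being suggested here, using
the table and column names from the thread; the exact psql session is
illustrative only:)

db=> VACUUM ANALYZE prog_data;
db=> SET enable_seqscan = 'off';
db=> EXPLAIN SELECT DISTINCT prog_id, prog_ftype, prog_fcasthour, prog_date
db->        FROM prog_dgy_xy, prog_data WHERE pxygy_pid = prog_id;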



Re: index problem

From
Stephan Szabo
Date:
On Mon, 15 Oct 2001, Szabo Zoltan wrote:

> Hi,
> 
> I have that:
> 
> 1)
> db=> explain select pxygy_pid from prog_dgy_xy where pxygy_pid=12121;
> NOTICE:  QUERY PLAN:
> 
> Group  (cost=0.00..29970.34 rows=921 width=4)
>    ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy 
> (cost=0.00..29947.32 rows=9210 width=4)
> 
> than:
> 2)
> db=> explain select pxygy_pid from prog_dgy_xy where pxygy_pid>12121;
> NOTICE:  QUERY PLAN:
> 
> Group  (cost=66927.88..67695.39 rows=30700 width=4)
>    ->  Sort  (cost=66927.88..66927.88 rows=307004 width=4)
>          ->  Seq Scan on prog_dgy_xy  (cost=0.00..32447.66 rows=307004 
> width=4)
> 
> I am running some benchmarks: Oracle vs PostgreSQL vs MySQL, and this is 
> breaking me now ;) MySQL and Oracle, with the same table and index, use 
> the index on pxygy_pid.
> I ran vacuum beforehand.

I assume you mean you did a vacuum analyze (a plain vacuum isn't
sufficient).  If you did just do a regular vacuum, do a vacuum analyze
to get the updated statistics.
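
(A minimal sketch of the difference, using the table name from the thread:)

db=> VACUUM prog_dgy_xy;          -- reclaims dead space only
db=> VACUUM ANALYZE prog_dgy_xy;  -- also collects the planner statistics

or, from the shell, add -z to vacuumdb:

bash-2.04$ vacuumdb -z -Uuser -ddb -tprog_dgy_xy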

How many rows actually match pxygy_pid>12121?  Is 307000 rows a reasonable
estimate?  How many rows are in the table?



Re: index problem

From
Szabo Zoltan
Date:
I forgot:
select version();
                               version
---------------------------------------------------------------------
 PostgreSQL 7.1.3 on i386-unknown-freebsd4.3, compiled by GCC 2.95.3

It seems that there are index-usage problems in 7.1.3?
(Checking the same problem in a comp.databases.postgresql.bugs message from Orion.)

thx
CoL

Szabo Zoltan wrote:

> Hi,
> 
> I have that:
> 
> 1)
> db=> explain select pxygy_pid from prog_dgy_xy where pxygy_pid=12121;
> NOTICE:  QUERY PLAN:
> 
> Group  (cost=0.00..29970.34 rows=921 width=4)
>   ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy 
> (cost=0.00..29947.32 rows=9210 width=4)
> 
> than:
> 2)
> db=> explain select pxygy_pid from prog_dgy_xy where pxygy_pid>12121;
> NOTICE:  QUERY PLAN:
> 
> Group  (cost=66927.88..67695.39 rows=30700 width=4)
>   ->  Sort  (cost=66927.88..66927.88 rows=307004 width=4)
>         ->  Seq Scan on prog_dgy_xy  (cost=0.00..32447.66 rows=307004 
> width=4)
> 
> I am running some benchmarks: Oracle vs PostgreSQL vs MySQL, and this is 
> breaking me now ;) MySQL and Oracle, with the same table and index, use 
> the index on pxygy_pid.
> I ran vacuum beforehand.
> 
> Time with mysql:
> 
> bash-2.04$ time echo " select count(*) from PROG_DGY_XY where 
> pxygy_pid>12121;" | mysql -uuser -ppasswd db
> count(*)
> 484984
> 
> real    0m13.761s
> user    0m0.008s
> sys     0m0.019s
> 
> Time with postgres:
> bash-2.04$ time echo "select count(*) from PROG_DGY_XY where 
> pxygy_pid>12121 " | psql -Uuser db
>  count
> --------
>  484984
> (1 row)
> 
> 
> real    0m22.480s
> user    0m0.011s
> sys     0m0.021s
> 
> And this is just a small part of other selects joining tables, but 
> because this index is not used, selecting from 2 tables (which have 
> indexes and keys on the join columns) takes an extreme amount of time 
> for postgres: 2m14.978s, while for mysql it takes 0m0.578s !!!
> 
> this select is: select distinct 
> PROG_ID,PROG_FTYPE,PROG_FCASTHOUR,PROG_DATE from PROG_DATA, PROG_DGY_XY 
>  where prog_id=pxygy_pid  order by prog_date,prog_ftype,prog_fcasthour
> 
> indexes:
> PROG_DATA:
> create index prod_data_idx1 on prog_data 
> (prog_date,prog_ftype,prog_fcasthour);
> prog_id is primary key
> 
> PROG_DGY_XY:
> create unique index progdgyxy_idx1 on PROG_DGY_XY 
> (PXYGY_PID,PXYGY_X,PXYGY_Y);
> create index progdgyxy_idx2 on PROG_DGY_XY (PXYGY_PID);
> 
> 
> Thx
> CoL
> 


-- 
[         Szabo Zoltan              ]
[       software developer          ]
[    econet.hu Informatikai Rt.     ]
[ 1117 Budapest, Hauszmann A. u. 3. ]
[   tel.: 371 2100 fax: 371 2101    ]



Re: index problem

From
CoL
Date:
Hi,

I did not run vacuum analyze ;), just plain vacuum. I ran it once now:

vacuumdb -Uuser -ddb -v -tprog_dgy_xy

NOTICE:  --Relation prog_dgy_xy--
NOTICE:  Pages 20935: Changed 0, reaped 0, Empty 0, New 0; Tup 921013: 
Vac 0, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 178, MaxLen 184; 
Re-using: Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 
2.71s/0.32u sec.
NOTICE:  Index progdgyxy_idx1: Pages 6679; Tuples 921013. CPU 
1.41s/1.40u sec.
NOTICE:  Index progdgyxy_idx2: Pages 2019; Tuples 921013. CPU 
0.28s/1.28u sec.

I ran it with -z too.
So this table has 921013 rows.
The query plan shows the same as below. The version is 7.1.3.

-------------------------
One more interesting thing: the inserting of these rows.
Postgres:
bash-2.04$ time  psql -q -Uuser -f prog_dgy_xy.dump  db
real    131m50.006s
user    3m21.838s
sys     1m20.963s

Mysql:
bash-2.04$ time cat prog_dgy_xy.dump | mysql -uuser -ppass db
real    24m50.137s
user    2m6.629s
sys     1m37.757s

the dump file was: insert into table (...) values (...);

I tried with copy, and also wrapping the inserts in begin; ... commit;, but 
the result was the same time :(
[For Oracle 8.1.6 sqlloader it takes 450 sec ;)]
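
(A hypothetical sketch of a COPY-based load; the file path and the
tab-separated format are assumptions, not from the thread. Dropping the
indexes before the load and recreating them afterwards usually helps too:)

db=> COPY prog_dgy_xy FROM '/tmp/prog_dgy_xy.tab';   -- tab-delimited file, hypothetical path
db=> CREATE UNIQUE INDEX progdgyxy_idx1 ON prog_dgy_xy (pxygy_pid, pxygy_x, pxygy_y);
db=> CREATE INDEX progdgyxy_idx2 ON prog_dgy_xy (pxygy_pid);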

---------------------------
The 2-table query, where prog_data has ~8800 rows and an index on prog_id:
bash-2.04$ time echo "explain select distinct 
prog_id,prog_ftype,prog_fcasthour,prog_date from prog_dgy_xy,prog_data 
where pxygy_pid=prog_id " | psql -Uuser db
NOTICE:  QUERY PLAN:

Unique  (cost=7432549.69..7680455.07 rows=2479054 width=32)
   ->  Sort  (cost=7432549.69..7432549.69 rows=24790538 width=32)
         ->  Merge Join  (cost=148864.65..161189.33 rows=24790538 width=32)
               ->  Index Scan using prog_data_pkey on prog_data  (cost=0.00..701.12 rows=8872 width=28)
               ->  Sort  (cost=148864.65..148864.65 rows=921013 width=4)
                     ->  Seq Scan on prog_dgy_xy  (cost=0.00..30145.13 rows=921013 width=4)

Time: !!!
real    2m3.620s

the same query with mysql (I ran explain in mysql, and it says it uses the 
indexes):
real    0m1.998s !!!

I'm just asking why? And why is the index used only for the "=" relation?
(same table, same index, vacuumed) (I ran the test more than twice.)
It seems to be a 7.1.3 bug? I have not tested with 7.1.2 yet, but tomorrow 
I will.

CoL

Stephan Szabo wrote:

> On Mon, 15 Oct 2001, Szabo Zoltan wrote:
> 
> 
>>Hi,
>>
>>I have that:
>>
>>1)
>>db=> explain select pxygy_pid from prog_dgy_xy where pxygy_pid=12121;
>>NOTICE:  QUERY PLAN:
>>
>>Group  (cost=0.00..29970.34 rows=921 width=4)
>>   ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy 
>>(cost=0.00..29947.32 rows=9210 width=4)
>>
>>than:
>>2)
>>db=> explain select pxygy_pid from prog_dgy_xy where pxygy_pid>12121;
>>NOTICE:  QUERY PLAN:
>>
>>Group  (cost=66927.88..67695.39 rows=30700 width=4)
>>   ->  Sort  (cost=66927.88..66927.88 rows=307004 width=4)
>>         ->  Seq Scan on prog_dgy_xy  (cost=0.00..32447.66 rows=307004 
>>width=4)
>>
>>I am running some benchmarks: Oracle vs PostgreSQL vs MySQL, and this is 
>>breaking me now ;) MySQL and Oracle, with the same table and index, use 
>>the index on pxygy_pid.
>>I ran vacuum beforehand.
>>
> 
> I assume you mean you did a vacuum analyze (a plain vacuum isn't
> sufficient).  If you did just do a regular vacuum, do a vacuum analyze
> to get the updated statistics.
> 
> How many rows actually match pxygy_pid>12121?  Is 307000 rows a reasonable
> estimate?  How many rows are in the table?
> 
> 
> 





Re: index problem

From
Stephan Szabo
Date:
> > Hmm, does the explain change if you vacuum analyze the other table
> > (prog_data)?  If not, what does explain show if you do a
> > set enable_seqscan='off';
> > before it?

Did you do the vacuum analyze on the other table (prog_data) as well?
It seems to be overestimating the number of joined rows, and I wonder
if it would choose a different plan if it had the correct number.

> The result:
> db=>set enable_seqscan='off';
> db=>explain select distinct prog_id,prog_ftype,prog_fcasthour,prog_date 
> from prog_dgy_xy,prog_data  where pxygy_pid=prog_id;
> NOTICE:  QUERY PLAN:
> 
> Unique  (cost=7606982.10..7854887.48 rows=2479054 width=32)
>    ->  Sort  (cost=7606982.10..7606982.10 rows=24790538 width=32)
>          ->  Merge Join  (cost=0.00..335621.73 rows=24790538 width=32)
>                ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy 
> (cost=0.00..323297.05 rows=921013 width=4)
>                ->  Index Scan using prog_data_pkey on prog_data 
> (cost=0.00..701.12 rows=8872 width=28)
> 
> It "seems" index is used, but the same result :(((, and bigger execution 
> time: real   3m41.830s

Well, that means the plan it chose before was better, so enable_seqscan
isn't a win here.

> And why:
> POSTGRES:
> set enable_seqscan ='off'; select count(*) from prog_dgy_xy where 
> pxygy_pid<13161;
>   count
> --------
>   900029
> real    2m34.340s
> explain:
> Aggregate  (cost=327896.89..327896.89 rows=1 width=0)
>    ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy 
> (cost=0.00..325594.54 rows=920940 width=0)

It's estimating that the entire table will be seen (or most of it anyway),
so it would choose a Seq Scan as faster, but you've basically disallowed
that with enable_seqscan='off'.  Is it faster without the explicit
hint?  (It probably will be.)  Index Scans are not always better than
Sequential Scans (especially when traversing most of the table, as
above), and you don't want to use the enable_* settings unless they are
actually giving you a performance increase.
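
(A small sketch of the comparison being suggested; run both variants and keep
whichever is actually faster:)

db=> SET enable_seqscan = 'on';    -- back to the default
db=> EXPLAIN SELECT count(*) FROM prog_dgy_xy WHERE pxygy_pid < 13161;
db=> SET enable_seqscan = 'off';
db=> EXPLAIN SELECT count(*) FROM prog_dgy_xy WHERE pxygy_pid < 13161;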



Re: index problem

From
CoL
Date:
Hi,

Stephan Szabo wrote:

> On Tue, 16 Oct 2001, CoL wrote:
> 
> 
>>---------------------------
>>The 2 table query, where prog_data has ~8800 rowsn and index on prog_id:
>>bash-2.04$ time echo "explain select distinct 
>>prog_id,prog_ftype,prog_fcasthour,prog_date from prog_dgy_xy,prog_data 
>>where pxygy_pid=prog_id " | psql -Uuser db
>>NOTICE:  QUERY PLAN:
>>
>>Unique  (cost=7432549.69..7680455.07 rows=2479054 width=32)
>>   ->  Sort  (cost=7432549.69..7432549.69 rows=24790538 width=32)
>>         ->  Merge Join  (cost=148864.65..161189.33 rows=24790538 width=32)
>>               ->  Index Scan using prog_data_pkey on prog_data 
>>(cost=0.00..701.12 rows=8872 width=28)
>>               ->  Sort  (cost=148864.65..148864.65 rows=921013 width=4)
>>                     ->  Seq Scan on prog_dgy_xy  (cost=0.00..30145.13 
>>rows=921013 width=4)
>>
> 
> I'm guessing that the approximately 25 million row estimate on the join
> has to be wrong as well given that prog_data.prog_id should be unique.
> 
> Hmm, does the explain change if you vacuum analyze the other table
> (prog_data)?  If not, what does explain show if you do a
> set enable_seqscan='off';
> before it?

The result:
db=>set enable_seqscan='off';
db=>explain select distinct prog_id,prog_ftype,prog_fcasthour,prog_date 
from prog_dgy_xy,prog_data  where pxygy_pid=prog_id;
NOTICE:  QUERY PLAN:

Unique  (cost=7606982.10..7854887.48 rows=2479054 width=32)
   ->  Sort  (cost=7606982.10..7606982.10 rows=24790538 width=32)
         ->  Merge Join  (cost=0.00..335621.73 rows=24790538 width=32)
               ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy  (cost=0.00..323297.05 rows=921013 width=4)
               ->  Index Scan using prog_data_pkey on prog_data  (cost=0.00..701.12 rows=8872 width=28)

It "seems" index is used, but the same result :(((, and bigger execution 
time: real   3m41.830s

What is in the tables?
prog_data contains a unique id and other info.
prog_dgy_xy contains that id with x,y coordinates (so many ids from 
prog_data, each with a unique x,y pair).

#prog_data:
#prog_id  prog_ftype  prog_fcasthour  prog_date
#1        'type'      6               2001-10-14 12:00:00
#2        'type'      12              2001-10-14 12:00:00

#prog_dgy_xy:
#pxygy_pid  pxygy_x  pxygy_y
#1          0.1      0.1
#1          0.1      0.15
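
(A hypothetical sketch of the schema implied above; only the column names
come from the thread, the column types are assumed:)

create table prog_data (
    prog_id        integer primary key,   -- assumed integer
    prog_ftype     varchar(32),           -- assumed varchar
    prog_fcasthour integer,
    prog_date      timestamp
);
create table prog_dgy_xy (
    pxygy_pid integer,                    -- joins to prog_data.prog_id
    pxygy_x   float,
    pxygy_y   float
);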

How can this query take real    0m1.755s for mysql, [17 sec for 
oracle], and 2-3 minutes!! for postgres?

And why:
POSTGRES:
set enable_seqscan ='off';
select count(*) from prog_dgy_xy where pxygy_pid<13161;
  count
--------
  900029
real    2m34.340s
explain:
Aggregate  (cost=327896.89..327896.89 rows=1 width=0)
   ->  Index Scan using progdgyxy_idx2 on prog_dgy_xy  (cost=0.00..325594.54 rows=920940 width=0)


MYSQL:
select count(pxygy_pid) from PROG_DGY_XY where pxygy_pid<13161
count(pxygy_pid)
900029
real    0m27.878s
explain:
table   type    possible_keys   key     key_len ref     rows    Extra
PROG_DGY_XY     range   progdgyxy_idx1,progdgyxy_idx2   progdgyxy_idx2 
4       NULL    906856  where used; Using index

The same time difference in the case of = or >; however, explain says that, 
because seq scan is off, the index is used.
I ran vacuum, and vacuum analyze too, beforehand.

PS: I think I have to make a site for this, because there are many 
questions :), and weird things.
I love postgres but this makes me go "hm?". Today I'll run these tests 
under 7.1.2.
thx
CoL