Thread: Looks like merge join planning time is too big, 55 seconds

Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
Hello, i have a problem with planning time, I do not understand why this can happen.

PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit

# explain
# select i.item_id, u.user_id from items i
# left join users u on u.user_id = i.user_id
# where item_id = 169946840;
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..397.14 rows=1 width=16)
   ->  Index Scan using items_item_ux on items i  (cost=0.00..358.84 rows=1 width=16)
         Index Cond: (item_id = 169946840)
   ->  Index Only Scan using users_user_id_pkey on users u  (cost=0.00..38.30 rows=1 width=8)
         Index Cond: (user_id = i.user_id)

time: 55919.910 ms

# set enable_mergejoin to off;

# explain
select i.item_id, u.user_id from items i
left join users u on u.user_id = i.user_id
where item_id = 169946840;
                                          QUERY PLAN                                          
----------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..397.14 rows=1 width=16)
   ->  Index Scan using items_item_ux on items i  (cost=0.00..358.84 rows=1 width=16)
         Index Cond: (item_id = 169946840)
   ->  Index Only Scan using users_user_id_pkey on users u  (cost=0.00..38.30 rows=1 width=8)
         Index Cond: (user_id = i.user_id)

time: 28.874 ms

--
Sergey Burladyan

Re: Looks like merge join planning time is too big, 55 seconds

From
Thomas Reiss
Date:
Le 01/08/2013 11:55, Sergey Burladyan a écrit :
> Hello, i have a problem with planning time, I do not understand why this
> can happen.
>
> PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc-4.4.real
> (Debian 4.4.5-8) 4.4.5, 64-bit
>
> # explain
> # select i.item_id, u.user_id from items i
> # left join users u on u.user_id = i.user_id
> # where item_id = 169946840;
>                                           QUERY PLAN
>
> ----------------------------------------------------------------------------------------------
>  Nested Loop Left Join  (cost=0.00..397.14 rows=1 width=16)
>    ->  Index Scan using items_item_ux on items i  (cost=0.00..358.84
> rows=1 width=16)
>          Index Cond: (item_id = 169946840)
>    ->  Index Only Scan using users_user_id_pkey on users u
>  (cost=0.00..38.30 rows=1 width=8)
>          Index Cond: (user_id = i.user_id)
>
> time: 55919.910 ms
>
> # set enable_mergejoin to off;
>
> # explain
> select i.item_id, u.user_id from items i
> left join users u on u.user_id = i.user_id
> where item_id = 169946840;
>                                           QUERY PLAN
>
> ----------------------------------------------------------------------------------------------
>  Nested Loop Left Join  (cost=0.00..397.14 rows=1 width=16)
>    ->  Index Scan using items_item_ux on items i  (cost=0.00..358.84
> rows=1 width=16)
>          Index Cond: (item_id = 169946840)
>    ->  Index Only Scan using users_user_id_pkey on users u
>  (cost=0.00..38.30 rows=1 width=8)
>          Index Cond: (user_id = i.user_id)
>
> time: 28.874 ms
>
> --
> Sergey Burladyan

Hello,

If you leave enable_mergejoin to on, what happens if you run the explain
two time in a row ? Do you get the same planning time ?

At first look, this reminds me some catalog bloat issue. Can you provide
the result of these queries :
SELECT pg_size_pretty(pg_table_size('pg_class')) AS size_pg_class;
SELECT pg_size_pretty(pg_table_size('pg_attribute')) AS size_pg_attribute;

Thanks
--
Thomas Reiss
Consultant Dalibo
http://dalibo.com - http://dalibo.org


Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:


01.08.2013 14:05 пользователь "Thomas Reiss" <thomas.reiss@dalibo.com> написал:
>
> If you leave enable_mergejoin to on, what happens if you run the explain
> two time in a row ? Do you get the same planning time ?

Yes, I get the same planning time.

Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
On Thu, Aug 1, 2013 at 2:04 PM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:
Le 01/08/2013 11:55, Sergey Burladyan a écrit :
At first look, this reminds me some catalog bloat issue. Can you provide
the result of these queries :
SELECT pg_size_pretty(pg_table_size('pg_class')) AS size_pg_class;
SELECT pg_size_pretty(pg_table_size('pg_attribute')) AS size_pg_attribute;

SELECT pg_size_pretty(pg_table_size('pg_class')) AS size_pg_class; --- '16 MB'
SELECT pg_size_pretty(pg_table_size('pg_attribute')) AS size_pg_attribute; --- '63 MB'

--
Sergey Burladyan

Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
I find another query with big planning time:
explain select * from xview.user_items_v v where ( v.item_id = 132358330 );
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..363.28 rows=1 width=44)
   Join Filter: (ief.item_id = ix.item_id)
   ->  Index Scan using items_item_ux on items ix  (cost=0.00..359.20 rows=1 width=36)
         Index Cond: (item_id = 132358330)
         Filter: ((xa_txtime IS NULL) AND (user_id > 0) AND (status_id < 20))
   ->  Index Scan using item_enabled_flags_item_id_idx on item_enabled_flags ief  (cost=0.00..4.06 rows=1 width=8)
         Index Cond: (item_id = 132358330)
(7 rows)

Time: 44037.758 ms

looks like planning algorithm hang on 'items' table statistics. Setting enable_mergejoin to off does not help with this query.

--
Sergey Burladyan

Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
Sergey Burladyan <eshkinkot@gmail.com> writes:

> # explain
> # select i.item_id, u.user_id from items i
> # left join users u on u.user_id = i.user_id
> # where item_id = 169946840;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------
> Nested Loop Left Join (cost=0.00..397.14 rows=1 width=16)
> -> Index Scan using items_item_ux on items i (cost=0.00..358.84 rows=1 width=16)
> Index Cond: (item_id = 169946840)
> -> Index Only Scan using users_user_id_pkey on users u (cost=0.00..38.30 rows=1 width=8)
> Index Cond: (user_id = i.user_id)
>
> time: 55919.910 ms

While running this EXPLAIN backend use disk for a long time:
 TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
                                                                     
21638 be/4 postgres    2.10 M/s    9.45 M/s  0.00 % 69.04 % postgres: postgres xxxxx xxx.xxx.xxx.xxx(50987) EXPLAIN

Why it read and write to disk 10 megabytes per second for EXPLAIN query? Cannot understand what is going on here :(



Re: Looks like merge join planning time is too big, 55 seconds

From
David Kerr
Date:
On Thu, Aug 01, 2013 at 07:17:27PM +0400, Sergey Burladyan wrote:
- Sergey Burladyan <eshkinkot@gmail.com> writes:
-
- > # explain
- > # select i.item_id, u.user_id from items i
- > # left join users u on u.user_id = i.user_id
- > # where item_id = 169946840;
- > QUERY PLAN
- > ----------------------------------------------------------------------------------------------
- > Nested Loop Left Join (cost=0.00..397.14 rows=1 width=16)
- > -> Index Scan using items_item_ux on items i (cost=0.00..358.84 rows=1 width=16)
- > Index Cond: (item_id = 169946840)
- > -> Index Only Scan using users_user_id_pkey on users u (cost=0.00..38.30 rows=1 width=8)
- > Index Cond: (user_id = i.user_id)
- >
- > time: 55919.910 ms
-
- While running this EXPLAIN backend use disk for a long time:
-  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
                                                                       
- 21638 be/4 postgres    2.10 M/s    9.45 M/s  0.00 % 69.04 % postgres: postgres xxxxx xxx.xxx.xxx.xxx(50987) EXPLAIN
-
- Why it read and write to disk 10 megabytes per second for EXPLAIN query? Cannot understand what is going on here :(


That sounds familiar - is it possible you're running into this?
http://www.postgresql.org/message-id/20120713065122.GA45804@mr-paradox.net


Re: Looks like merge join planning time is too big, 55 seconds

From
Jeff Janes
Date:
On Thu, Aug 1, 2013 at 8:17 AM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Sergey Burladyan <eshkinkot@gmail.com> writes:
>
>> # explain
>> # select i.item_id, u.user_id from items i
>> # left join users u on u.user_id = i.user_id
>> # where item_id = 169946840;
>> QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>> Nested Loop Left Join (cost=0.00..397.14 rows=1 width=16)
>> -> Index Scan using items_item_ux on items i (cost=0.00..358.84 rows=1 width=16)
>> Index Cond: (item_id = 169946840)
>> -> Index Only Scan using users_user_id_pkey on users u (cost=0.00..38.30 rows=1 width=8)
>> Index Cond: (user_id = i.user_id)
>>
>> time: 55919.910 ms
>
> While running this EXPLAIN backend use disk for a long time:
>  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
> 21638 be/4 postgres    2.10 M/s    9.45 M/s  0.00 % 69.04 % postgres: postgres xxxxx xxx.xxx.xxx.xxx(50987) EXPLAIN
>
> Why it read and write to disk 10 megabytes per second for EXPLAIN query? Cannot understand what is going on here :(

I'd use strace to find what file handle is being read and written, and
lsof to figure out what file that is.

It looks like it is more write than read, which does seem strange.

Any chance you can create a self-contained test case?

Cheers,

Jeff


Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
Jeff Janes <jeff.janes@gmail.com> writes:

> I'd use strace to find what file handle is being read and written, and
> lsof to figure out what file that is.

I use strace, it is more read then write:
$ cut -d '(' -f 1 /var/tmp/pg.trace | sort | uniq -c | sort -n
     49 select
    708 close
   1021 open
   7356 write
 212744 read
 219650 lseek

top reads:
7859 read(150 open("base/16444/17685.129", O_RDWR|O_CREAT, 0600) = 150
9513 read(149 open("base/16444/17685.128", O_RDWR|O_CREAT, 0600) = 149
10529 read(151 open("base/16444/17685.130", O_RDWR|O_CREAT, 0600) = 151
12155 read(152 open("base/16444/17685.131", O_RDWR|O_CREAT, 0600) = 152
12768 read(154 open("base/16444/17685.133", O_RDWR|O_CREAT, 0600) = 154
16210 read(153 open("base/16444/17685.132", O_RDWR|O_CREAT, 0600) = 153

it is 'items' table:
select relname from pg_class where relfilenode = 17685;
 relname
---------
 items

each read is 8192 bytes, so for EXPLAIN query with two simple index scan, *without* ANALYZE postgres
read (7859 + 9513 + 10529 + 12155 + 12768 + 16210) * 8192 = 565 526 528 bytes from it.

> It looks like it is more write than read, which does seem strange.

Why it read something for simple EXPLAIN, without real executing query? :-)

> Any chance you can create a self-contained test case?

I think I cannot do this, it is ~1 Tb heavily load database. This is at standby server.

PS: two strace for quick and slow explain:

explain
select i.item_id from items i
where item_id = 169946840

$ cut -d '(' -f 1 /var/tmp/pg-all-normal.trace | sort | uniq -c
    313 lseek
    308 open
      2 read
     13 recvfrom
      6 sendto

explain
select i.item_id, u.user_id from items i
left join users u on u.user_id = i.user_id
where item_id = 169946840

$ cut -d '(' -f 1 /var/tmp/pg-all-slow.trace | sort | uniq -c
    963 close
      1 fsync
5093393 lseek
    925 open
6004995 read
     14 recvfrom
      1 rt_sigreturn
      9 select
   4361 semop
      7 sendto
      1 --- SIGUSR1
 685605 write



Re: Looks like merge join planning time is too big, 55 seconds

From
Jeff Janes
Date:
On Thu, Aug 1, 2013 at 12:13 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>
>> I'd use strace to find what file handle is being read and written, and
>> lsof to figure out what file that is.
>
> I use strace, it is more read then write:
> $ cut -d '(' -f 1 /var/tmp/pg.trace | sort | uniq -c | sort -n
>      49 select
>     708 close
>    1021 open
>    7356 write
>  212744 read
>  219650 lseek

Based on your iotop (or whatever that was that you posted previously)
most of the reads must be coming from the file system cache.

>
> top reads:
> 7859 read(150 open("base/16444/17685.129", O_RDWR|O_CREAT, 0600) = 150
> 9513 read(149 open("base/16444/17685.128", O_RDWR|O_CREAT, 0600) = 149
> 10529 read(151 open("base/16444/17685.130", O_RDWR|O_CREAT, 0600) = 151
> 12155 read(152 open("base/16444/17685.131", O_RDWR|O_CREAT, 0600) = 152
> 12768 read(154 open("base/16444/17685.133", O_RDWR|O_CREAT, 0600) = 154
> 16210 read(153 open("base/16444/17685.132", O_RDWR|O_CREAT, 0600) = 153
>
> it is 'items' table:
> select relname from pg_class where relfilenode = 17685;
>  relname
> ---------
>  items
>
> each read is 8192 bytes, so for EXPLAIN query with two simple index scan, *without* ANALYZE postgres
> read (7859 + 9513 + 10529 + 12155 + 12768 + 16210) * 8192 = 565 526 528 bytes from it.
>
>> It looks like it is more write than read, which does seem strange.
>
> Why it read something for simple EXPLAIN, without real executing query? :-)

I figured it was reading some system catalogs or something.  I don't
know why it would be reading the table files. Or writing much of
anything, either.

I think the next step would be to run gdb -p <pid> (but don't start
gdb until backend is in the middle of a slow explain), then:

break read
c
bt

Then repeat the c and bt combination a few more times, to build up a
dataset on what the call stack is which is causing the reads to
happen.

Cheers,

Jeff


Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
Jeff Janes <jeff.janes@gmail.com> writes:

> I think the next step would be to run gdb -p <pid> (but don't start
> gdb until backend is in the middle of a slow explain), then:

Sorry, I am lack debug symbols, so call trace is incomplete:

explain select i.item_id, u.user_id from items i left join users u on u.user_id = i.user_id where item_id = 169946840

#0  0x00007ff766967620 in read () from /lib/libc.so.6
#1  0x00007ff7689cfc25 in FileRead ()
#2  0x00007ff7689ea2f6 in mdread ()
#3  0x00007ff7689cc473 in ?? ()
#4  0x00007ff7689ccf54 in ReadBufferExtended ()
#5  0x00007ff7688050ca in index_fetch_heap ()
#6  0x00007ff76880523e in index_getnext ()
#7  0x00007ff768a63306 in ?? ()
#8  0x00007ff768a67624 in ?? ()
#9  0x00007ff768a67d9c in ?? ()
#10 0x00007ff768a68376 in mergejoinscansel ()
#11 0x00007ff76896faa6 in initial_cost_mergejoin ()
#12 0x00007ff768977695 in ?? ()
#13 0x00007ff76897816c in add_paths_to_joinrel ()
#14 0x00007ff76897981b in make_join_rel ()
#15 0x00007ff768979ac9 in join_search_one_level ()
#16 0x00007ff76896a3ab in standard_join_search ()
#17 0x00007ff7689837c1 in query_planner ()
#18 0x00007ff768985260 in ?? ()
#19 0x00007ff7689870a9 in subquery_planner ()
#20 0x00007ff76898736e in standard_planner ()
#21 0x00007ff7689ef3ce in pg_plan_query ()
#22 0x00007ff7688c94a3 in ?? ()
#23 0x00007ff7688c9809 in ExplainQuery ()
#24 0x00007ff7648095e2 in ?? () from /usr/lib/postgresql/9.2/lib/pg_stat_statements.so
#25 0x00007ff7689f1f27 in ?? ()
#26 0x00007ff7689f3295 in ?? ()
#27 0x00007ff7689f388f in PortalRun ()
#28 0x00007ff7689ef96d in ?? ()
#29 0x00007ff7689f0950 in PostgresMain ()
#30 0x00007ff7689aa7a3 in ?? ()
#31 0x00007ff7689ad73c in PostmasterMain ()
#32 0x00007ff768948e4b in main ()

#0  0x00007ff766973950 in lseek64 () from /lib/libc.so.6
#1  0x00007ff7689cf88d in FileSeek ()
#2  0x00007ff7689ea09c in mdwrite ()
#3  0x00007ff7689cb12f in ?? ()
#4  0x00007ff7689cca43 in ?? ()
#5  0x00007ff7689ccf54 in ReadBufferExtended ()
#6  0x00007ff7688050ca in index_fetch_heap ()
#7  0x00007ff76880523e in index_getnext ()
#8  0x00007ff768a63306 in ?? ()
#9  0x00007ff768a67624 in ?? ()
#10 0x00007ff768a67d9c in ?? ()
#11 0x00007ff768a68376 in mergejoinscansel ()
#12 0x00007ff76896faa6 in initial_cost_mergejoin ()
#13 0x00007ff768977695 in ?? ()
#14 0x00007ff76897816c in add_paths_to_joinrel ()
#15 0x00007ff76897981b in make_join_rel ()
#16 0x00007ff768979ac9 in join_search_one_level ()
#17 0x00007ff76896a3ab in standard_join_search ()
#18 0x00007ff7689837c1 in query_planner ()
#19 0x00007ff768985260 in ?? ()
#20 0x00007ff7689870a9 in subquery_planner ()
#21 0x00007ff76898736e in standard_planner ()
#22 0x00007ff7689ef3ce in pg_plan_query ()
#23 0x00007ff7688c94a3 in ?? ()
#24 0x00007ff7688c9809 in ExplainQuery ()
#25 0x00007ff7648095e2 in ?? () from /usr/lib/postgresql/9.2/lib/pg_stat_statements.so
#26 0x00007ff7689f1f27 in ?? ()
#27 0x00007ff7689f3295 in ?? ()
#28 0x00007ff7689f388f in PortalRun ()
#29 0x00007ff7689ef96d in ?? ()
#30 0x00007ff7689f0950 in PostgresMain ()
#31 0x00007ff7689aa7a3 in ?? ()
#32 0x00007ff7689ad73c in PostmasterMain ()
#33 0x00007ff768948e4b in main ()

#0  0x00007ff766973950 in lseek64 () from /lib/libc.so.6
#1  0x00007ff7689cf88d in FileSeek ()
#2  0x00007ff7689ea2b9 in mdread ()
#3  0x00007ff7689cc473 in ?? ()
#4  0x00007ff7689ccf54 in ReadBufferExtended ()
#5  0x00007ff7688050ca in index_fetch_heap ()
#6  0x00007ff76880523e in index_getnext ()
#7  0x00007ff768a63306 in ?? ()
#8  0x00007ff768a67624 in ?? ()
#9  0x00007ff768a67d9c in ?? ()
#10 0x00007ff768a68376 in mergejoinscansel ()
#11 0x00007ff76896faa6 in initial_cost_mergejoin ()
#12 0x00007ff768977695 in ?? ()
#13 0x00007ff76897816c in add_paths_to_joinrel ()
#14 0x00007ff76897981b in make_join_rel ()
#15 0x00007ff768979ac9 in join_search_one_level ()
#16 0x00007ff76896a3ab in standard_join_search ()
#17 0x00007ff7689837c1 in query_planner ()
#18 0x00007ff768985260 in ?? ()
#19 0x00007ff7689870a9 in subquery_planner ()
#20 0x00007ff76898736e in standard_planner ()
#21 0x00007ff7689ef3ce in pg_plan_query ()
#22 0x00007ff7688c94a3 in ?? ()
#23 0x00007ff7688c9809 in ExplainQuery ()
#24 0x00007ff7648095e2 in ?? () from /usr/lib/postgresql/9.2/lib/pg_stat_statements.so
#25 0x00007ff7689f1f27 in ?? ()
#26 0x00007ff7689f3295 in ?? ()
#27 0x00007ff7689f388f in PortalRun ()
#28 0x00007ff7689ef96d in ?? ()
#29 0x00007ff7689f0950 in PostgresMain ()
#30 0x00007ff7689aa7a3 in ?? ()
#31 0x00007ff7689ad73c in PostmasterMain ()
#32 0x00007ff768948e4b in main ()


Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
I also find this trace for other query:
explain select * from xview.user_items_v v where ( v.item_id = 132358330 );

#0  0x00007ff766967620 in read () from /lib/libc.so.6
#1  0x00007ff7689cfc25 in FileRead ()
#2  0x00007ff7689ea2f6 in mdread ()
#3  0x00007ff7689cc473 in ?? ()
#4  0x00007ff7689ccf54 in ReadBufferExtended ()
#5  0x00007ff7688050ca in index_fetch_heap ()
#6  0x00007ff76880523e in index_getnext ()
#7  0x00007ff768a63306 in ?? ()
#8  0x00007ff768a67624 in ?? ()
#9  0x00007ff768a67d9c in ?? ()
#10 0x00007ff768a688fc in scalargtsel ()
#11 0x00007ff768ac5211 in OidFunctionCall4Coll ()
#12 0x00007ff768998ce5 in restriction_selectivity ()
#13 0x00007ff76896c71e in clause_selectivity ()
#14 0x00007ff76896bf60 in clauselist_selectivity ()
#15 0x00007ff76896ddfd in set_baserel_size_estimates ()
#16 0x00007ff76896abf2 in ?? ()
#17 0x00007ff76896bc97 in make_one_rel ()
#18 0x00007ff7689837c1 in query_planner ()
#19 0x00007ff768985260 in ?? ()
#20 0x00007ff7689870a9 in subquery_planner ()
#21 0x00007ff76898736e in standard_planner ()
#22 0x00007ff7689ef3ce in pg_plan_query ()
#23 0x00007ff7688c94a3 in ?? ()
#24 0x00007ff7688c9809 in ExplainQuery ()
#25 0x00007ff7648095e2 in ?? () from /usr/lib/postgresql/9.2/lib/pg_stat_statements.so
#26 0x00007ff7689f1f27 in ?? ()
#27 0x00007ff7689f3295 in ?? ()
#28 0x00007ff7689f388f in PortalRun ()
#29 0x00007ff7689ef96d in ?? ()
#30 0x00007ff7689f0950 in PostgresMain ()
#31 0x00007ff7689aa7a3 in ?? ()
#32 0x00007ff7689ad73c in PostmasterMain ()
#33 0x00007ff768948e4b in main ()

I see two code paths:
#6  0x00007ff76880523e in index_getnext ()
...
#9  0x00007ff768a67d9c in ?? ()
#10 0x00007ff768a688fc in scalargtsel ()
...

and

#6  0x00007ff76880523e in index_getnext ()
...
#9  0x00007ff768a67d9c in ?? ()
#10 0x00007ff768a68376 in mergejoinscansel ()
...

If I not mistaken, may be two code paths like this here:
(1) mergejoinscansel -> scalarineqsel-> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext
(2) scalargtsel -> scalarineqsel -> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext

And may be get_actual_variable_range() function is too expensive for
call with my bloated table items with bloated index items_user_id_idx on it?



Re: Looks like merge join planning time is too big, 55 seconds

From
Alvaro Herrera
Date:
Sergey Burladyan escribió:
> I also find this trace for other query:
> explain select * from xview.user_items_v v where ( v.item_id = 132358330 );
>
> #0  0x00007ff766967620 in read () from /lib/libc.so.6
> #1  0x00007ff7689cfc25 in FileRead ()
> #2  0x00007ff7689ea2f6 in mdread ()
> #3  0x00007ff7689cc473 in ?? ()
> #4  0x00007ff7689ccf54 in ReadBufferExtended ()
> #5  0x00007ff7688050ca in index_fetch_heap ()
> #6  0x00007ff76880523e in index_getnext ()
> #7  0x00007ff768a63306 in ?? ()
> #8  0x00007ff768a67624 in ?? ()
> #9  0x00007ff768a67d9c in ?? ()
> #10 0x00007ff768a688fc in scalargtsel ()

It'd be useful to see what's in frames 7-9, but this might be related to
get_actual_variable_range().  I don't see anything else nearby that
would try to read portions of the table.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Looks like merge join planning time is too big, 55 seconds

From
Jeff Janes
Date:
On Thu, Aug 1, 2013 at 5:16 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> I also find this trace for other query:
> explain select * from xview.user_items_v v where ( v.item_id = 132358330 );
>
>
> If I not mistaken, may be two code paths like this here:
> (1) mergejoinscansel -> scalarineqsel-> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext
> (2) scalargtsel -> scalarineqsel -> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext

Yeah, I think you are correct.

> And may be get_actual_variable_range() function is too expensive for
> call with my bloated table items with bloated index items_user_id_idx on it?

But why is it bloated in this way?  It must be visiting many thousands
of dead/invisible rows before finding the first visible one.  But,
Btree index have a mechanism to remove dead tuples from indexes, so it
doesn't follow them over and over again (see "kill_prior_tuple").  So
is that mechanism not working, or are the tuples not dead but just
invisible (i.e. inserted by a still open transaction)?

Cheers,

Jeff


Re: Looks like merge join planning time is too big, 55 seconds

From
Jeff Janes
Date:
On Fri, Aug 2, 2013 at 2:58 AM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
>
> PS: I think my main problem is here:
> select min(user_id) from items;
>  min
> -----
>    1
> (1 row)
>
> Time: 504.520 ms

That is a long time, but still 100 fold less than the planner is taking.

What about max(user_id)?

>
> also, i cannot reindex it concurrently now, because it run autovacuum: VACUUM ANALYZE public.items (to prevent
wraparound)

That is going to take a long time if you have the cost settings at
their defaults.

Cheers,

Jeff


Re: Looks like merge join planning time is too big, 55 seconds

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Aug 1, 2013 at 5:16 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
>> If I not mistaken, may be two code paths like this here:
>> (1) mergejoinscansel -> scalarineqsel-> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext
>> (2) scalargtsel -> scalarineqsel -> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext

> Yeah, I think you are correct.

mergejoinscansel does *not* call scalarineqsel, nor get_actual_variable_range.
It calls get_variable_range, which only looks at the pg_statistic entries.

I think we need to see the actual stack traces, not incomplete versions.
It's possible that the situation here involves bloat in pg_statistic, but
we're just leaping to conclusions if we assume that that's where the index
fetches are occurring.

            regards, tom lane


Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Jeff Janes <jeff.janes@gmail.com> writes:
> > On Thu, Aug 1, 2013 at 5:16 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> >> If I not mistaken, may be two code paths like this here:
> >> (1) mergejoinscansel -> scalarineqsel-> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext
> >> (2) scalargtsel -> scalarineqsel -> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext
>
> > Yeah, I think you are correct.
>
> mergejoinscansel does *not* call scalarineqsel, nor get_actual_variable_range.
> It calls get_variable_range, which only looks at the pg_statistic
> entries.

Hmm, I speak about 9.2.2 but in current HEAD this call still exist,
please see: http://doxygen.postgresql.org/selfuncs_8c_source.html#l02976

> I think we need to see the actual stack traces, not incomplete versions.
> It's possible that the situation here involves bloat in pg_statistic, but
> we're just leaping to conclusions if we assume that that's where the index
> fetches are occurring.

I found debug symbols and send stack trace to mail list, but it blocked
by size, try again with zip


Attachment

Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
Jeff Janes <jeff.janes@gmail.com> writes:

> On Fri, Aug 2, 2013 at 2:58 AM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> >
> > PS: I think my main problem is here:
> > select min(user_id) from items;
> >  min
> > -----
> >    1
> > (1 row)
> >
> > Time: 504.520 ms
>
> That is a long time, but still 100 fold less than the planner is taking.
>
> What about max(user_id)?

max is good, only rows with user_id = 0 was updated:

select max(user_id) from items;
Time: 59.646 ms

> > also, i cannot reindex it concurrently now, because it run autovacuum: VACUUM ANALYZE public.items (to prevent
wraparound)
>
> That is going to take a long time if you have the cost settings at
> their defaults.

Yes, I have custom setting, more slow, it will last about a week.

> But why is it bloated in this way?

Don't known. It has been updated many items last week. ~ 10% of table.

> It must be visiting many thousands of dead/invisible rows before
> finding the first visible one.  But, Btree index have a mechanism to
> remove dead tuples from indexes, so it doesn't follow them over and
> over again (see "kill_prior_tuple").  So is that mechanism not
> working, or are the tuples not dead but just invisible (i.e. inserted
> by a still open transaction)?

It is deleted, but VACUUM still not completed.

BTW, it is standby server, and it query plan (block read) is very
different from master:

Hot standby:

explain (analyze,verbose,buffers) select min(user_id) from items;

'Result  (cost=0.12..0.13 rows=1 width=0) (actual time=56064.514..56064.514 rows=1 loops=1)'
'  Output: $0'
'  Buffers: shared hit=3694164 read=6591224 written=121652'
'  InitPlan 1 (returns $0)'
'    ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=56064.502..56064.503 rows=1 loops=1)'
'          Output: public.items.user_id'
'          Buffers: shared hit=3694164 read=6591224 written=121652'
'          ->  Index Only Scan using items_user_id_idx on public.items  (cost=0.00..24165743.48 rows=200673143 width=8)
(actualtime=56064.499..56064.499 rows=1 loops=1)' 
'                Output: public.items.user_id'
'                Index Cond: (public.items.user_id IS NOT NULL)'
'                Heap Fetches: 8256426'
'                Buffers: shared hit=3694164 read=6591224 written=121652'
'Total runtime: 56064.571 ms'

Master:

'Result  (cost=0.12..0.13 rows=1 width=0) (actual time=202.759..202.759 rows=1 loops=1)'
'  Output: $0'
'  Buffers: shared hit=153577 read=1'
'  InitPlan 1 (returns $0)'
'    ->  Limit  (cost=0.00..0.12 rows=1 width=8) (actual time=202.756..202.757 rows=1 loops=1)'
'          Output: public.items.user_id'
'          Buffers: shared hit=153577 read=1'
'          ->  Index Only Scan using items_user_id_idx on public.items  (cost=0.00..24166856.02 rows=200680528 width=8)
(actualtime=202.756..202.756 rows=1 loops=1)' 
'                Output: public.items.user_id'
'                Index Cond: (public.items.user_id IS NOT NULL)'
'                Heap Fetches: 0'
'                Buffers: shared hit=153577 read=1'
'Total runtime: 202.786 ms'

And from backup, before index|heap bloated :)

 Result  (cost=0.87..0.88 rows=1 width=0) (actual time=16.002..16.003 rows=1 loops=1)
   Output: $0
   Buffers: shared hit=3 read=4
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.00..0.87 rows=1 width=8) (actual time=15.993..15.995 rows=1 loops=1)
           Output: public.items.user_id
           Buffers: shared hit=3 read=4
           ->  Index Only Scan using items_user_id_idx on public.items  (cost=0.00..169143085.72 rows=193309210
width=8)(actual time=15.987..15.987 rows=1 loops=1) 
                 Output: public.items.user_id
                 Index Cond: (public.items.user_id IS NOT NULL)
                 Heap Fetches: 1
                 Buffers: shared hit=3 read=4
 Total runtime: 16.057 ms


Re: Looks like merge join planning time is too big, 55 seconds

From
Alvaro Herrera
Date:
Tom Lane escribió:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > On Thu, Aug 1, 2013 at 5:16 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> >> If I not mistaken, may be two code paths like this here:
> >> (1) mergejoinscansel -> scalarineqsel-> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext
> >> (2) scalargtsel -> scalarineqsel -> ineq_histogram_selectivity -> get_actual_variable_range -> index_getnext
>
> > Yeah, I think you are correct.
>
> mergejoinscansel does *not* call scalarineqsel, nor get_actual_variable_range.
> It calls get_variable_range, which only looks at the pg_statistic entries.

Uh?  It's right there in line 2976 in HEAD.


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Looks like merge join planning time is too big, 55 seconds

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane escribi�:
>> It calls get_variable_range, which only looks at the pg_statistic entries.

> Uh?  It's right there in line 2976 in HEAD.

Meh.  You're right, I was thinking of this bit in get_variable_range()

    /*
     * XXX It's very tempting to try to use the actual column min and max, if
     * we can get them relatively-cheaply with an index probe.  However, since
     * this function is called many times during join planning, that could
     * have unpleasant effects on planning speed.  Need more investigation
     * before enabling this.
     */
#ifdef NOT_USED
    if (get_actual_variable_range(root, vardata, sortop, min, max))
        return true;
#endif

I think when that was written, we didn't have the code in scalarineqsel
that tries to go out and get the actual endpoints from an index.  Now
that we do, the planning cost impact that I was afraid of here can
actually bite us, and it seems that at least for Sergey's case it's pretty
bad.  Another problem is that we'll end up comparing endpoints gotten from
pg_statistic to endpoints gotten from the index, making the resulting
numbers at least self-inconsistent and very possibly meaningless.

The planner already caches the results of mergejoinscansel in hopes of
alleviating its cost, but I wonder if we need another lower-level cache
for the min/max values of each variable that participates in a
mergejoinable clause.

Having said that, it's still not clear why these probes are so expensive
in Sergey's case.  I favor your idea about lots of dead rows, but we don't
have actual proof of it.  Maybe pgstattuple could help here?

            regards, tom lane


Re: Looks like merge join planning time is too big, 55 seconds

From
Sergey Burladyan
Date:
Sergey Burladyan <eshkinkot@gmail.com> writes:

> Hot standby:
...
> '          ->  Index Only Scan using items_user_id_idx on public.items  (cost=0.00..24165743.48 rows=200673143
width=8)(actual time=56064.499..56064.499 rows=1 loops=1)' 
> '                Output: public.items.user_id'
> '                Index Cond: (public.items.user_id IS NOT NULL)'
> '                Heap Fetches: 8256426'
> '                Buffers: shared hit=3694164 read=6591224 written=121652'
> 'Total runtime: 56064.571 ms'
>
> Master:
>
...
> '          ->  Index Only Scan using items_user_id_idx on public.items  (cost=0.00..24166856.02 rows=200680528
width=8)(actual time=202.756..202.756 rows=1 loops=1)' 
> '                Output: public.items.user_id'
> '                Index Cond: (public.items.user_id IS NOT NULL)'
> '                Heap Fetches: 0'
> '                Buffers: shared hit=153577 read=1'
> 'Total runtime: 202.786 ms'

Looks like visibility map is not replicated into slave somehow?

If it matters, Master was restarted yesterday, Standby was not.