Re: Endless loop in ExecNestLoop - Mailing list pgsql-bugs

From Philipp Reisner
Subject Re: Endless loop in ExecNestLoop
Date
Msg-id 200601311054.23671.philipp.reisner@linbit.com
Whole thread Raw
In response to Re: Endless loop in ExecNestLoop  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Endless loop in ExecNestLoop
List pgsql-bugs
Am Montag, 30. Januar 2006 17:42 schrieb Tom Lane:
> Philipp Reisner <philipp.reisner@linbit.com> writes:
> > Version: 8.0.2
>
> I don't think so ... neither bitmap scans nor slot_getattr exist in 8.0.
>

Sorry, a typo, 8.1.2 of course.

> > Is this bug-report of any use so far ?
>
> Not a lot.  You need to find out what the difference is between the
> cases where the query runs quickly and those where it doesn't.  I'm
> betting that the planner is choosing a very bad plan in the slow cases,
> but there's not a lot of evidence here to show what or why.
>
> The explain analyze output shows two levels of hash join underneath
> four levels of nestloop join, whereas the stack trace looks like there
> are five levels of nestloop and only one hash.  So this is some evidence
> that a different plan is being used in the slow cases.  The stack trace
> doesn't provide nearly enough info about what that plan is, though.
>

What reasons are there for the planer to choose a different plan ?=20

> Do you have geqo_threshold set to less than its default value?  Do you
> sometimes execute the query with different sets of parameters?  Either
> of these might lead to changes of plan.

Since we had very bad experience with the generic query optimizer we
completely disabled it. The non geqo plans perform multitudes better=20
on our data-set and work-load. -- And the time to plan the query is
always negligible compared to the execution time.

We have the endless loop about 10 times a day (but execute about 1.5 millio=
n queries a day) :

Program received signal SIGINT, Interrupt.
0x00002aaaaba687b4 in wcsrtombs () from /lib/libc.so.6
(gdb) bt
#0  0x00002aaaaba687b4 in wcsrtombs () from /lib/libc.so.6
#1  0x00002aaaaba163c8 in wcstombs () from /lib/libc.so.6
#2  0x000000000058dcc4 in text_oid ()
#3  0x000000000058deb9 in upper ()
#4  0x00000000004e216e in ExecMakeFunctionResult ()
#5  0x00000000004e210d in ExecMakeFunctionResult ()
#6  0x00000000004e5063 in ExecQual ()
#7  0x00000000004e554e in ExecScan ()
#8  0x00000000004e0cc8 in ExecProcNode ()
#9  0x00000000004ebeb6 in ExecEndHashJoin ()
#10 0x00000000004eb7fa in ExecHashJoin ()
#11 0x00000000004e0d50 in ExecProcNode ()
#12 0x00000000004ee29d in ExecNestLoop ()
#13 0x00000000004e0d36 in ExecProcNode ()
#14 0x00000000004ee261 in ExecNestLoop () <<-----------
#15 0x00000000004e0d36 in ExecProcNode ()
#16 0x00000000004ee261 in ExecNestLoop ()
#17 0x00000000004e0d36 in ExecProcNode ()
#18 0x00000000004ee261 in ExecNestLoop ()
#19 0x00000000004e0d36 in ExecProcNode ()
#20 0x00000000004ee261 in ExecNestLoop ()
#21 0x00000000004e0d36 in ExecProcNode ()
#22 0x00000000004ef57c in ExecSort ()
#23 0x00000000004e0d6a in ExecProcNode ()
#24 0x00000000004df73f in ExecEndPlan ()
#25 0x00000000004dec09 in ExecutorRun ()
#26 0x000000000055f87a in PortalRun ()
#27 0x000000000055f5f9 in PortalRun ()
#28 0x000000000055b561 in pg_plan_queries ()
#29 0x000000000055df8c in PostgresMain ()
#30 0x0000000000539b01 in ClosePostmasterPorts ()
#31 0x00000000005395d0 in ClosePostmasterPorts ()
#32 0x0000000000537b4d in PostmasterMain ()
#33 0x0000000000536e59 in PostmasterMain ()
#34 0x00000000004fda0c in main ()


After detaching with gdb, I killed it. I configured postgres to dump=20
the statement to the logfile with "log_min_error_statement =3D error".
So I have the exact query that caused the problem.=20

This time the plainer says, only explain:

 Sort  (cost=3D16257.75..16257.76 rows=3D1 width=3D262)
   Sort Key: con.shortname
   ->  Nested Loop  (cost=3D1.18..16257.74 rows=3D1 width=3D262)
         ->  Nested Loop  (cost=3D1.18..16254.57 rows=3D1 width=3D130)
               ->  Nested Loop  (cost=3D1.18..16251.50 rows=3D1 width=3D125)
                     ->  Nested Loop  (cost=3D1.18..16248.45 rows=3D1 width=
=3D111)
                           ->  Nested Loop  (cost=3D1.18..16245.31 rows=3D1=
 width=3D86)
                                 Join Filter: ("outer".cel_objid =3D "inner=
".objid)
                                 ->  Seq Scan on devices dev  (cost=3D0.00.=
.11984.25 rows=3D1 width=3D63)
                                       Filter: ((upper((COALESCE(isactiv, '=
Y'::character varying))::text) =3D 'Y'::text) AND (upper((COALESCE(iscommit=
tedsp, 'Y'::character varying))::text) =3D 'Y'::text) AND (upper((COALESCE(=
iscommittedcust, 'Y'::character varying))::text) =3D 'Y'::text))
                                 ->  Hash Left Join  (cost=3D1.18..4256.22 =
rows=3D387 width=3D27)
                                       Hash Cond: ("outer".ser_objid =3D "i=
nner".objid)
                                       ->  Seq Scan on contractelements cel=
  (cost=3D0.00..4253.10 rows=3D387 width=3D21)
                                             Filter: (upper((isactiv)::text=
) =3D 'Y'::text)
                                       ->  Hash  (cost=3D1.14..1.14 rows=3D=
14 width=3D14)
                                             ->  Seq Scan on servicetypes s=
er  (cost=3D0.00..1.14 rows=3D14 width=3D14)
                           ->  Index Scan using contracts_pkey on contracts=
 con  (cost=3D0.00..3.13 rows=3D1 width=3D33)
                                 Index Cond: ("outer".con_objid =3D con.obj=
id)
                                 Filter: ((ccu_objid =3D 3837149) OR (ccu_o=
bjid =3D 3837320) OR (ccu_objid =3D 3837375) OR (ccu_objid =3D 3974872) OR =
(ccu_objid =3D 5032031) OR (ccu_objid =3D 7220313) OR (ccu_objid =3D 156848=
99) OR (ccu_objid =3D 16052168) OR (ccu_objid =3D 19196573) OR (ccu_objid =
=3D 20095712) OR (ccu_objid =3D 22399813) OR (ccu_objid =3D 42092332) OR (c=
cu_objid =3D 42092454) OR (ccu_objid =3D 42092574) OR (ccu_objid =3D 428869=
47) OR (ccu_objid =3D 43813234))
                     ->  Index Scan using devicetypes_pkey on devicetypes d=
ty  (cost=3D0.00..3.04 rows=3D1 width=3D18)
                           Index Cond: (dty.objid =3D "outer".dty_objid)
               ->  Index Scan using bpartners_pkey on bpartners man  (cost=
=3D0.00..3.06 rows=3D1 width=3D13)
                     Index Cond: (man.objid =3D "outer".bpa_objid)
         ->  Index Scan using locations_pkey on locations loc  (cost=3D0.00=
..3.12 rows=3D1 width=3D136)
               Index Cond: (loc.objid =3D "outer".loc_objid)
               Filter: (upper((shortname)::text) =3D '5195'::text)
(26 rows)

Explain and execute does not terminate... even after hours.

This time I _CAN_ _REPRODUCE_ the issue in our staging system !!  Would it =
help
to run the database server with debug symbols ? That would give more inform=
ation
in the stack traceback ...

Tom, could you give us an advise for what we should look for ? How we should
continue in tracing the bug down ?

In case you care, here is the query as well:

select cel.objid,
             cel.shortname, ser.shortname,
             con.isInventoryFlag,
             dev.objid, dty.objid, loc.objid,
             dev.shortname, loc.shortname, loc.name, man.shortName,
             dty.type, dev.serNr, dev.invNr, dev.component,
             loc.zip, loc.city, con.shortname, con.name,
            CASE
       WHEN CURRENT_DATE between loc.locationstateFrom  and loc.locationsta=
teTo  THEN loc.locationstateText
       WHEN CURRENT_DATE between loc.locationstate2From and loc.locationsta=
te2To THEN loc.locationstate2Text
       ELSE null
     END
     ,      CASE
       WHEN CURRENT_DATE between loc.locationstateFrom  and loc.locationsta=
teTo  THEN loc.locationstateColorCode
       WHEN CURRENT_DATE between loc.locationstate2From and loc.locationsta=
te2To THEN loc.locationstate2ColorCode
       ELSE null
     END
     ,
            loc.locationcategory, loc.province
       from contracts con, contractelements cel
            left outer join servicetypes ser on ser.objid =3D cel.ser_objid,
            devices dev, locations loc, devicetypes dty, bpartners man
     where con.ccu_objid in
  (3837149,3837320,3837375,3974872,5032031,7220313,15684899,16052168,191965=
73,20095712,22399813,42092332,42092454,42092574,42886947,43813234)  and cel=
.con_objid =3D con.objid
       and dev.cel_objid =3D cel.objid
       and loc.objid =3D dev.loc_objid
       and dty.objid =3D dev.dty_objid
       and man.objid =3D dty.bpa_objid
       and upper(cel.isActiv) =3D 'Y'
       and upper(coalesce(dev.isActiv,'Y')) =3D 'Y'
       and upper(coalesce(dev.IsCommittedSP,'Y')) =3D 'Y'
       and upper(coalesce(dev.IsCommittedCust,'Y')) =3D 'Y'
      and upper(loc.shortName)  =3D '5195'
       order by 18 ASC

-philipp
--=20
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Sch=C3=B6nbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :

pgsql-bugs by date:

Previous
From: Neil Conway
Date:
Subject: Re: Bug#347548: DOMAIN CHECK constraint bypassed
Next
From: Peter Eisentraut
Date:
Subject: ECPG connect by URL doesn't work