Re: Endless loop in ExecNestLoop - Mailing list pgsql-bugs
From | Philipp Reisner |
---|---|
Subject | Re: Endless loop in ExecNestLoop |
Date | |
Msg-id | 200601311054.23671.philipp.reisner@linbit.com Whole thread Raw |
In response to | Re: Endless loop in ExecNestLoop (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Endless loop in ExecNestLoop
|
List | pgsql-bugs |
Am Montag, 30. Januar 2006 17:42 schrieb Tom Lane: > Philipp Reisner <philipp.reisner@linbit.com> writes: > > Version: 8.0.2 > > I don't think so ... neither bitmap scans nor slot_getattr exist in 8.0. > Sorry, a typo, 8.1.2 of course. > > Is this bug-report of any use so far ? > > Not a lot. You need to find out what the difference is between the > cases where the query runs quickly and those where it doesn't. I'm > betting that the planner is choosing a very bad plan in the slow cases, > but there's not a lot of evidence here to show what or why. > > The explain analyze output shows two levels of hash join underneath > four levels of nestloop join, whereas the stack trace looks like there > are five levels of nestloop and only one hash. So this is some evidence > that a different plan is being used in the slow cases. The stack trace > doesn't provide nearly enough info about what that plan is, though. > What reasons are there for the planer to choose a different plan ?=20 > Do you have geqo_threshold set to less than its default value? Do you > sometimes execute the query with different sets of parameters? Either > of these might lead to changes of plan. Since we had very bad experience with the generic query optimizer we completely disabled it. The non geqo plans perform multitudes better=20 on our data-set and work-load. -- And the time to plan the query is always negligible compared to the execution time. We have the endless loop about 10 times a day (but execute about 1.5 millio= n queries a day) : Program received signal SIGINT, Interrupt. 0x00002aaaaba687b4 in wcsrtombs () from /lib/libc.so.6 (gdb) bt #0 0x00002aaaaba687b4 in wcsrtombs () from /lib/libc.so.6 #1 0x00002aaaaba163c8 in wcstombs () from /lib/libc.so.6 #2 0x000000000058dcc4 in text_oid () #3 0x000000000058deb9 in upper () #4 0x00000000004e216e in ExecMakeFunctionResult () #5 0x00000000004e210d in ExecMakeFunctionResult () #6 0x00000000004e5063 in ExecQual () #7 0x00000000004e554e in ExecScan () #8 0x00000000004e0cc8 in ExecProcNode () #9 0x00000000004ebeb6 in ExecEndHashJoin () #10 0x00000000004eb7fa in ExecHashJoin () #11 0x00000000004e0d50 in ExecProcNode () #12 0x00000000004ee29d in ExecNestLoop () #13 0x00000000004e0d36 in ExecProcNode () #14 0x00000000004ee261 in ExecNestLoop () <<----------- #15 0x00000000004e0d36 in ExecProcNode () #16 0x00000000004ee261 in ExecNestLoop () #17 0x00000000004e0d36 in ExecProcNode () #18 0x00000000004ee261 in ExecNestLoop () #19 0x00000000004e0d36 in ExecProcNode () #20 0x00000000004ee261 in ExecNestLoop () #21 0x00000000004e0d36 in ExecProcNode () #22 0x00000000004ef57c in ExecSort () #23 0x00000000004e0d6a in ExecProcNode () #24 0x00000000004df73f in ExecEndPlan () #25 0x00000000004dec09 in ExecutorRun () #26 0x000000000055f87a in PortalRun () #27 0x000000000055f5f9 in PortalRun () #28 0x000000000055b561 in pg_plan_queries () #29 0x000000000055df8c in PostgresMain () #30 0x0000000000539b01 in ClosePostmasterPorts () #31 0x00000000005395d0 in ClosePostmasterPorts () #32 0x0000000000537b4d in PostmasterMain () #33 0x0000000000536e59 in PostmasterMain () #34 0x00000000004fda0c in main () After detaching with gdb, I killed it. I configured postgres to dump=20 the statement to the logfile with "log_min_error_statement =3D error". So I have the exact query that caused the problem.=20 This time the plainer says, only explain: Sort (cost=3D16257.75..16257.76 rows=3D1 width=3D262) Sort Key: con.shortname -> Nested Loop (cost=3D1.18..16257.74 rows=3D1 width=3D262) -> Nested Loop (cost=3D1.18..16254.57 rows=3D1 width=3D130) -> Nested Loop (cost=3D1.18..16251.50 rows=3D1 width=3D125) -> Nested Loop (cost=3D1.18..16248.45 rows=3D1 width= =3D111) -> Nested Loop (cost=3D1.18..16245.31 rows=3D1= width=3D86) Join Filter: ("outer".cel_objid =3D "inner= ".objid) -> Seq Scan on devices dev (cost=3D0.00.= .11984.25 rows=3D1 width=3D63) Filter: ((upper((COALESCE(isactiv, '= Y'::character varying))::text) =3D 'Y'::text) AND (upper((COALESCE(iscommit= tedsp, 'Y'::character varying))::text) =3D 'Y'::text) AND (upper((COALESCE(= iscommittedcust, 'Y'::character varying))::text) =3D 'Y'::text)) -> Hash Left Join (cost=3D1.18..4256.22 = rows=3D387 width=3D27) Hash Cond: ("outer".ser_objid =3D "i= nner".objid) -> Seq Scan on contractelements cel= (cost=3D0.00..4253.10 rows=3D387 width=3D21) Filter: (upper((isactiv)::text= ) =3D 'Y'::text) -> Hash (cost=3D1.14..1.14 rows=3D= 14 width=3D14) -> Seq Scan on servicetypes s= er (cost=3D0.00..1.14 rows=3D14 width=3D14) -> Index Scan using contracts_pkey on contracts= con (cost=3D0.00..3.13 rows=3D1 width=3D33) Index Cond: ("outer".con_objid =3D con.obj= id) Filter: ((ccu_objid =3D 3837149) OR (ccu_o= bjid =3D 3837320) OR (ccu_objid =3D 3837375) OR (ccu_objid =3D 3974872) OR = (ccu_objid =3D 5032031) OR (ccu_objid =3D 7220313) OR (ccu_objid =3D 156848= 99) OR (ccu_objid =3D 16052168) OR (ccu_objid =3D 19196573) OR (ccu_objid = =3D 20095712) OR (ccu_objid =3D 22399813) OR (ccu_objid =3D 42092332) OR (c= cu_objid =3D 42092454) OR (ccu_objid =3D 42092574) OR (ccu_objid =3D 428869= 47) OR (ccu_objid =3D 43813234)) -> Index Scan using devicetypes_pkey on devicetypes d= ty (cost=3D0.00..3.04 rows=3D1 width=3D18) Index Cond: (dty.objid =3D "outer".dty_objid) -> Index Scan using bpartners_pkey on bpartners man (cost= =3D0.00..3.06 rows=3D1 width=3D13) Index Cond: (man.objid =3D "outer".bpa_objid) -> Index Scan using locations_pkey on locations loc (cost=3D0.00= ..3.12 rows=3D1 width=3D136) Index Cond: (loc.objid =3D "outer".loc_objid) Filter: (upper((shortname)::text) =3D '5195'::text) (26 rows) Explain and execute does not terminate... even after hours. This time I _CAN_ _REPRODUCE_ the issue in our staging system !! Would it = help to run the database server with debug symbols ? That would give more inform= ation in the stack traceback ... Tom, could you give us an advise for what we should look for ? How we should continue in tracing the bug down ? In case you care, here is the query as well: select cel.objid, cel.shortname, ser.shortname, con.isInventoryFlag, dev.objid, dty.objid, loc.objid, dev.shortname, loc.shortname, loc.name, man.shortName, dty.type, dev.serNr, dev.invNr, dev.component, loc.zip, loc.city, con.shortname, con.name, CASE WHEN CURRENT_DATE between loc.locationstateFrom and loc.locationsta= teTo THEN loc.locationstateText WHEN CURRENT_DATE between loc.locationstate2From and loc.locationsta= te2To THEN loc.locationstate2Text ELSE null END , CASE WHEN CURRENT_DATE between loc.locationstateFrom and loc.locationsta= teTo THEN loc.locationstateColorCode WHEN CURRENT_DATE between loc.locationstate2From and loc.locationsta= te2To THEN loc.locationstate2ColorCode ELSE null END , loc.locationcategory, loc.province from contracts con, contractelements cel left outer join servicetypes ser on ser.objid =3D cel.ser_objid, devices dev, locations loc, devicetypes dty, bpartners man where con.ccu_objid in (3837149,3837320,3837375,3974872,5032031,7220313,15684899,16052168,191965= 73,20095712,22399813,42092332,42092454,42092574,42886947,43813234) and cel= .con_objid =3D con.objid and dev.cel_objid =3D cel.objid and loc.objid =3D dev.loc_objid and dty.objid =3D dev.dty_objid and man.objid =3D dty.bpa_objid and upper(cel.isActiv) =3D 'Y' and upper(coalesce(dev.isActiv,'Y')) =3D 'Y' and upper(coalesce(dev.IsCommittedSP,'Y')) =3D 'Y' and upper(coalesce(dev.IsCommittedCust,'Y')) =3D 'Y' and upper(loc.shortName) =3D '5195' order by 18 ASC -philipp --=20 : Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Sch=C3=B6nbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :
pgsql-bugs by date: