Thread: Assertion failure with LEFT JOINs among >500 relations

Assertion failure with LEFT JOINs among >500 relations

From
Onder Kalaci
Date:

Hi,

 

I hit an assertion failure. When asserts disabled, it works fine even with more tables  (>5000).

 

Steps to reproduce:

 

CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 int, value_3 float, value_4 bigint);

 

250 relations work fine, see the query (too long to copy & paste here): https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L41

 

--  when # relations >500, we hit the assertion (too long to copy & paste here):

See the query: https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L45

 

 

And, the backtrace:

 

(lldb) bt

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT

  * frame #0: 0x00007fff639fa2c2 libsystem_kernel.dylib`__pthread_kill + 10

    frame #1: 0x00007fff63ab5bf1 libsystem_pthread.dylib`pthread_kill + 284

    frame #2: 0x00007fff639646a6 libsystem_c.dylib`abort + 127

    frame #3: 0x0000000102180a02 postgres`ExceptionalCondition(conditionName=<unavailable>, errorType=<unavailable>, fileName=<unavailable>, lineNumber=<unavailable>) at assert.c:67:2

    frame #4: 0x0000000101ece9b2 postgres`initial_cost_mergejoin(root=0x7ff0000000000000, workspace=0x00007ffeedf5b528, jointype=JOIN_INNER, mergeclauses=<unavailable>, outer_path=0x000000012ebf12d0, inner_path=0x4093d80000000000, outersortkeys=0x0000000000000000, innersortkeys=0x000000012ebf68e8, extra=0x00007ffeedf5b6f8) at costsize.c:3043:2

    frame #5: 0x0000000101eda01b postgres`try_mergejoin_path(root=0x0000000104a12618, joinrel=0x000000012ebeede0, outer_path=0x000000012ebf12d0, inner_path=0x00000001283d00e8, pathkeys=0x000000012ebf67e0, mergeclauses=0x000000012ebf6890, outersortkeys=0x0000000000000000, innersortkeys=0x000000012ebf68e8, jointype=JOIN_LEFT, extra=0x00007ffeedf5b6f8, is_partial=<unavailable>) at joinpath.c:615:2

    frame #6: 0x0000000101ed9426 postgres`sort_inner_and_outer(root=0x0000000104a12618, joinrel=0x000000012ebeede0, outerrel=<unavailable>, innerrel=<unavailable>, jointype=JOIN_LEFT, extra=0x00007ffeedf5b6f8) at joinpath.c:1038:3

    frame #7: 0x0000000101ed8f7a postgres`add_paths_to_joinrel(root=0x0000000104a12618, joinrel=0x000000012ebeede0, outerrel=0x000000012ebe7b48, innerrel=0x0000000127f146e0, jointype=<unavailable>, sjinfo=<unavailable>, restrictlist=0x000000012ebf42b0) at joinpath.c:269:3

    frame #8: 0x0000000101edbdc6 postgres`populate_joinrel_with_paths(root=0x0000000104a12618, rel1=0x000000012ebe7b48, rel2=0x0000000127f146e0, joinrel=0x000000012ebeede0, sjinfo=0x000000012809edc8, restrictlist=0x000000012ebf42b0) at joinrels.c:824:4

    frame #9: 0x0000000101edb57a postgres`make_join_rel(root=0x0000000104a12618, rel1=0x000000012ebe7b48, rel2=0x0000000127f146e0) at joinrels.c:760:2

    frame #10: 0x0000000101edb1ec postgres`make_rels_by_clause_joins(root=0x0000000104a12618, old_rel=0x000000012ebe7b48, other_rels_list=<unavailable>, other_rels=<unavailable>) at joinrels.c:312:11

    frame #11: 0x0000000101edada3 postgres`join_search_one_level(root=0x0000000104a12618, level=2) at joinrels.c:123:4

    frame #12: 0x0000000101ec7feb postgres`standard_join_search(root=0x0000000104a12618, levels_needed=8, initial_rels=0x000000012ebf4078) at allpaths.c:3097:3

    frame #13: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280a5618) at allpaths.c:2993:14

    frame #14: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280ab320) at allpaths.c:2993:14

    frame #15: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280b1028) at allpaths.c:2993:14

    frame #16: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280b6d30) at allpaths.c:2993:14

    frame #17: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280bca38) at allpaths.c:2993:14

    frame #18: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280c2740) at allpaths.c:2993:14

    frame #19: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280c8448) at allpaths.c:2993:14

    frame #20: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280ce150) at allpaths.c:2993:14

    frame #21: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280d3e58) at allpaths.c:2993:14

    frame #22: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280d9b60) at allpaths.c:2993:14

    frame #23: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280df868) at allpaths.c:2993:14

    frame #24: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280e5570) at allpaths.c:2993:14

    frame #25: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280eb278) at allpaths.c:2993:14

    frame #26: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280f0f80) at allpaths.c:2993:14

    frame #27: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001280f8d88) at allpaths.c:2993:14

    frame #28: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128101810) at allpaths.c:2993:14

    frame #29: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012810a298) at allpaths.c:2993:14

    frame #30: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128112d20) at allpaths.c:2993:14

    frame #31: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012811b7a8) at allpaths.c:2993:14

    frame #32: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128124230) at allpaths.c:2993:14

    frame #33: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012812ccb8) at allpaths.c:2993:14

    frame #34: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128135740) at allpaths.c:2993:14

    frame #35: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012813e1c8) at allpaths.c:2993:14

    frame #36: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128146c50) at allpaths.c:2993:14

    frame #37: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012814f6d8) at allpaths.c:2993:14

    frame #38: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128158160) at allpaths.c:2993:14

    frame #39: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128160be8) at allpaths.c:2993:14

    frame #40: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128169670) at allpaths.c:2993:14

    frame #41: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281720f8) at allpaths.c:2993:14

    frame #42: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012817ab80) at allpaths.c:2993:14

    frame #43: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128183608) at allpaths.c:2993:14

    frame #44: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012818c090) at allpaths.c:2993:14

    frame #45: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128194b18) at allpaths.c:2993:14

    frame #46: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012819d5a0) at allpaths.c:2993:14

    frame #47: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281a6028) at allpaths.c:2993:14

    frame #48: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281aeab0) at allpaths.c:2993:14

    frame #49: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281b7538) at allpaths.c:2993:14

    frame #50: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281bffc0) at allpaths.c:2993:14

    frame #51: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281c8a48) at allpaths.c:2993:14

    frame #52: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281d14d0) at allpaths.c:2993:14

    frame #53: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281d9f58) at allpaths.c:2993:14

    frame #54: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281e29e0) at allpaths.c:2993:14

    frame #55: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281eb468) at allpaths.c:2993:14

    frame #56: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281f3ef0) at allpaths.c:2993:14

    frame #57: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001281fc978) at allpaths.c:2993:14

    frame #58: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128205400) at allpaths.c:2993:14

    frame #59: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012820de88) at allpaths.c:2993:14

    frame #60: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128216910) at allpaths.c:2993:14

    frame #61: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012821f398) at allpaths.c:2993:14

    frame #62: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128227e20) at allpaths.c:2993:14

    frame #63: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282308a8) at allpaths.c:2993:14

    frame #64: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128239330) at allpaths.c:2993:14

    frame #65: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128241db8) at allpaths.c:2993:14

    frame #66: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012824a840) at allpaths.c:2993:14

    frame #67: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282532c8) at allpaths.c:2993:14

    frame #68: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012825bd50) at allpaths.c:2993:14

    frame #69: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282647d8) at allpaths.c:2993:14

    frame #70: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012826d260) at allpaths.c:2993:14

    frame #71: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128275ce8) at allpaths.c:2993:14

    frame #72: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012827e770) at allpaths.c:2993:14

    frame #73: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282871f8) at allpaths.c:2993:14

    frame #74: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012828fc80) at allpaths.c:2993:14

    frame #75: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128298708) at allpaths.c:2993:14

    frame #76: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282a1190) at allpaths.c:2993:14

    frame #77: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282a9c18) at allpaths.c:2993:14

    frame #78: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282b26a0) at allpaths.c:2993:14

    frame #79: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282bb128) at allpaths.c:2993:14

    frame #80: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282c3bb0) at allpaths.c:2993:14

    frame #81: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282cc638) at allpaths.c:2993:14

    frame #82: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282d50c0) at allpaths.c:2993:14

    frame #83: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282ddb48) at allpaths.c:2993:14

    frame #84: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282e65d0) at allpaths.c:2993:14

    frame #85: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282ef058) at allpaths.c:2993:14

    frame #86: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001282f7ae0) at allpaths.c:2993:14

    frame #87: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128300568) at allpaths.c:2993:14

    frame #88: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128308ff0) at allpaths.c:2993:14

    frame #89: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128311a78) at allpaths.c:2993:14

    frame #90: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012831a500) at allpaths.c:2993:14

    frame #91: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128322f88) at allpaths.c:2993:14

    frame #92: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012832ba10) at allpaths.c:2993:14

    frame #93: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128334498) at allpaths.c:2993:14

    frame #94: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012833cf20) at allpaths.c:2993:14

    frame #95: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001283459a8) at allpaths.c:2993:14

    frame #96: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012834e430) at allpaths.c:2993:14

    frame #97: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x0000000128356eb8) at allpaths.c:2993:14

    frame #98: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012835f940) at allpaths.c:2993:14

    frame #99: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x00000001283683c8) at allpaths.c:2993:14

    frame #100: 0x0000000101ec6b38 postgres`make_rel_from_joinlist(root=0x0000000104a12618, joinlist=0x000000012837e358) at allpaths.c:2993:14

    frame #101: 0x0000000101ec688f postgres`make_one_rel(root=0x0000000104a12618, joinlist=0x000000012837e358) at allpaths.c:227:8

    frame #102: 0x0000000101eec187 postgres`query_planner(root=0x0000000104a12618, qp_callback=<unavailable>, qp_extra=0x00007ffeedf5d000) at planmain.c:269:14

    frame #103: 0x0000000101eeea9b postgres`grouping_planner(root=0x0000000104a12618, inheritance_update=<unavailable>, tuple_fraction=<unavailable>) at planner.c:2058:17

    frame #104: 0x0000000101eed1a1 postgres`subquery_planner(glob=<unavailable>, parse=0x00000001049ad620, parent_root=<unavailable>, hasRecursion=<unavailable>, tuple_fraction=0) at planner.c:1015:3

    frame #105: 0x0000000101eec3b6 postgres`standard_planner(parse=0x00000001049ad620, query_string=<unavailable>, cursorOptions=256, boundParams=0x0000000000000000) at planner.c:405:9

    frame #106: 0x0000000101faeaf1 postgres`pg_plan_query(querytree=0x00000001049ad620, query_string="SELECT count(*) FROM users_table u_1123123123123123 LEFT JOIN users_table u0 USING (user_id)  LEFT JOIN users_table u1 USING (user_id)  LEFT JOIN users_table u2 USING (user_id)  LEFT JOIN users_table u3 USING (user_id)  LEFT JOIN users_table u4 USING (user_id)  LEFT JOIN users_table u5 USING (user_id)  LEFT JOIN users_table u6 USING (user_id)  LEFT JOIN users_table u7 USING (user_id)  LEFT JOIN users_table u8 USING (user_id)  LEFT JOIN users_table u9 USING (user_id)  LEFT JOIN users_table u10 USING (user_id)  LEFT JOIN users_table u11 USING (user_id)  LEFT JOIN users_table u12 USING (user_id)  LEFT JOIN users_table u13 USING (user_id)  LEFT JOIN users_table u14 USING (user_id)  LEFT JOIN users_table u15 USING (user_id)  LEFT JOIN users_table u16 USING (user_id)  LEFT JOIN users_table u17 USING (user_id)  LEFT JOIN users_table u18 USING (user_id)  LEFT JOIN users_table u19 USING (user_id)  LEFT JOIN users_table u20 USING (user_id)  LEFT JOIN users_table u21 USING (user_id)  LEFT JOIN users_table u22 USING (use"..., cursorOptions=256, boundParams=0x0000000000000000) at postgres.c:875:9

    frame #107: 0x0000000101faec32 postgres`pg_plan_queries(querytrees=0x00000001275c20e0, query_string="SELECT count(*) FROM users_table u_1123123123123123 LEFT JOIN users_table u0 USING (user_id)  LEFT JOIN users_table u1 USING (user_id)  LEFT JOIN users_table u2 USING (user_id)  LEFT JOIN users_table u3 USING (user_id)  LEFT JOIN users_table u4 USING (user_id)  LEFT JOIN users_table u5 USING (user_id)  LEFT JOIN users_table u6 USING (user_id)  LEFT JOIN users_table u7 USING (user_id)  LEFT JOIN users_table u8 USING (user_id)  LEFT JOIN users_table u9 USING (user_id)  LEFT JOIN users_table u10 USING (user_id)  LEFT JOIN users_table u11 USING (user_id)  LEFT JOIN users_table u12 USING (user_id)  LEFT JOIN users_table u13 USING (user_id)  LEFT JOIN users_table u14 USING (user_id)  LEFT JOIN users_table u15 USING (user_id)  LEFT JOIN users_table u16 USING (user_id)  LEFT JOIN users_table u17 USING (user_id)  LEFT JOIN users_table u18 USING (user_id)  LEFT JOIN users_table u19 USING (user_id)  LEFT JOIN users_table u20 USING (user_id)  LEFT JOIN users_table u21 USING (user_id)  LEFT JOIN users_table u22 USING (use"..., cursorOptions=256, boundParams=0x0000000000000000) at postgres.c:966:11

    frame #108: 0x0000000101fb09fa postgres`exec_simple_query(query_string="SELECT count(*) FROM users_table u_1123123123123123 LEFT JOIN users_table u0 USING (user_id)  LEFT JOIN users_table u1 USING (user_id)  LEFT JOIN users_table u2 USING (user_id)  LEFT JOIN users_table u3 USING (user_id)  LEFT JOIN users_table u4 USING (user_id)  LEFT JOIN users_table u5 USING (user_id)  LEFT JOIN users_table u6 USING (user_id)  LEFT JOIN users_table u7 USING (user_id)  LEFT JOIN users_table u8 USING (user_id)  LEFT JOIN users_table u9 USING (user_id)  LEFT JOIN users_table u10 USING (user_id)  LEFT JOIN users_table u11 USING (user_id)  LEFT JOIN users_table u12 USING (user_id)  LEFT JOIN users_table u13 USING (user_id)  LEFT JOIN users_table u14 USING (user_id)  LEFT JOIN users_table u15 USING (user_id)  LEFT JOIN users_table u16 USING (user_id)  LEFT JOIN users_table u17 USING (user_id)  LEFT JOIN users_table u18 USING (user_id)  LEFT JOIN users_table u19 USING (user_id)  LEFT JOIN users_table u20 USING (user_id)  LEFT JOIN users_table u21 USING (user_id)  LEFT JOIN users_table u22 USING (use"...) at postgres.c:1158:19

    frame #109: 0x0000000101fb024e postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>, username=<unavailable>) at postgres.c:0

    frame #110: 0x0000000101f35f65 postgres`BackendRun(port=0x0000000000000001) at postmaster.c:4536:2

    frame #111: 0x0000000101f35830 postgres`BackendStartup(port=<unavailable>) at postmaster.c:4220:3

    frame #112: 0x0000000101f35005 postgres`ServerLoop at postmaster.c:1739:7

    frame #113: 0x0000000101f3321c postgres`PostmasterMain(argc=3, argv=0x00007fc7a7403250) at postmaster.c:1412:11

    frame #114: 0x0000000101e91e06 postgres`main(argc=3, argv=0x00007fc7a7403250) at main.c:210:3

    frame #115: 0x00007fff638bf3d5 libdyld.dylib`start + 1

    frame #116: 0x00007fff638bf3d5 libdyld.dylib`start + 1

 

 

 

SELECT version();

                                                      version                                                     

-------------------------------------------------------------------------------------------------------------------

PostgreSQL 13.0 on x86_64-apple-darwin18.7.0, compiled by Apple clang version 11.0.0 (clang-1100.0.33.17), 64-bit

(1 row)

 

 

pg_config

BINDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/bin

DOCDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/share/doc

HTMLDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/share/doc

INCLUDEDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/include

PKGINCLUDEDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/include

INCLUDEDIR-SERVER = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/include/server

LIBDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/lib

PKGLIBDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/lib

LOCALEDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/share/locale

MANDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/share/man

SHAREDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/share

SYSCONFDIR = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/etc

PGXS = /Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0/lib/pgxs/src/makefiles/pgxs.mk

CONFIGURE =  '--prefix=/Users/onderkalaci/Documents/citus_code/pgenv/pgsql-13.0' '--enable-debug' '--enable-cassert' 'CFLAGS=-ggdb -Og -g3 -fno-omit-frame-pointer' '--with-openssl' '--with-icu' 'LDFLAGS=-L/usr/local/opt/readline/lib -L/usr/local/opt/openssl/lib ' 'CPPFLAGS=-I/usr/local/opt/readline/include -I/usr/local/opt/openssl/include/ ' 'PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig'

CC = gcc

CPPFLAGS = -I/usr/local/Cellar/icu4c/66.1/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -I/usr/local/opt/readline/include -I/usr/local/opt/openssl/include/

CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -ggdb -Og -g3 -fno-omit-frame-pointer

CFLAGS_SL =

LDFLAGS = -L/usr/local/opt/readline/lib -L/usr/local/opt/openssl/lib -Wl,-dead_strip_dylibs

LDFLAGS_EX =

LDFLAGS_SL =

LIBS = -lpgcommon -lpgport -lssl -lcrypto -lz -lreadline -lm

VERSION = PostgreSQL 13.0

 

 

 

Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Fri, 9 Oct 2020 at 08:16, Onder Kalaci <onderk@microsoft.com> wrote:
> I hit an assertion failure. When asserts disabled, it works fine even with more tables  (>5000).
>
> Steps to reproduce:
> CREATE TABLE users_table (user_id int, time timestamp, value_1 int, value_2 int, value_3 float, value_4 bigint);
> 250 relations work fine, see the query (too long to copy & paste here):
https://gist.github.com/onderkalaci/2b40a18d989da389ee4fb631e1ad7c0e#file-steps_to_assert_pg-sql-L41

I had a quick look at this and I can recreate it using the following
(using psql)

select 'explain select count(*) from users_table ' || string_Agg('LEFT
JOIN users_table u'|| x::text || ' USING (user_id)',' ') from
generate_Series(1,379)x;
\gexec

That triggers the assert due to the Assert(outer_skip_rows <=
outer_rows); failing in initial_cost_mergejoin().

The reason it fails is that outer_path_rows has become infinity due to
calc_joinrel_size_estimate continually multiplying in the join
selectivity of 0.05 (due to our 200 default num distinct from lack of
any stats) which after a number of iterations causes the number to
become very large.

Instead of running 379 joins from above, try with 378 and you get:

 Aggregate  (cost=NaN..NaN rows=1 width=8)
   ->  Nested Loop Left Join  (cost=33329.16..NaN rows=Infinity width=0)
         Join Filter: (users_table.user_id = u378.user_id)
         ->  Merge Left Join  (cost=33329.16..<very large number> width=4)
               Merge Cond: (users_table.user_id = u377.user_id)
               ->  Merge Left Join  (cost=33240.99..<very large number> width=4)

Changing the code in initial_cost_mergejoin() to add:

if (outer_path_rows <= 0 || isnan(outer_path_rows))
    outer_path_rows = 1;
+else if (isinf(outer_path_rows))
+    outer_path_rows = DBL_MAX;

does seem to fix the problem, but that's certainly not the right fix.

Perhaps the right fix is to modify clamp_row_est() with:

@@ -193,7 +194,9 @@ clamp_row_est(double nrows)
         * better and to avoid possible divide-by-zero when interpolating costs.
         * Make it an integer, too.
         */
-       if (nrows <= 1.0)
+       if (isinf(nrows))
+               nrows = rint(DBL_MAX);
+       else if (nrows <= 1.0)
                nrows = 1.0;
        else
                nrows = rint(nrows);

but the row estimates are getting pretty insane well before then.
DBL_MAX is 226 orders of magnitude more than the estimated number of
atoms in the observable universe, so it seems pretty unreasonable that
someone might figure out a way to store that many tuples on a disk any
time soon.

Perhaps DBL_MAX is way to big a number to clamp at. I'm just not sure
what we should reduce it to so that it is reasonable.

David



Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> The reason it fails is that outer_path_rows has become infinity due to
> calc_joinrel_size_estimate continually multiplying in the join
> selectivity of 0.05 (due to our 200 default num distinct from lack of
> any stats) which after a number of iterations causes the number to
> become very large.

0.005, but yeah.  We're estimating that each additional join inflates
the output size by about 6x (1270 * 0.005), and after a few hundred
of those, it'll overflow.

> Perhaps the right fix is to modify clamp_row_est() with:

I thought of that too, but as you say, if the rowcount has overflowed a
double then we've got way worse problems.  It'd make more sense to try
to keep the count to a saner value in the first place.  

In the end, (a) this is an Assert, so not a problem for production
systems, and (b) it's going to take you longer than you want to
wait to join 500+ tables, anyhow, unless maybe they're empty.
I'm kind of disinclined to do anything in the way of a band-aid fix.

If somebody has an idea for a different way of estimating the join
size with no stats, we could talk about that.  I notice though that
the only way a plan of this sort isn't going to blow up at execution
is if the join multiplication factor is at most 1, ie the join
key is unique.  But guess what, we already know what to do in that
case.  Adding a unique or pkey constraint to users_table.user_id
causes the plan to collapse entirely (if they're left joins) or
at least still produce a small rowcount estimate (if plain joins).

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Fri, 9 Oct 2020 at 12:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Perhaps the right fix is to modify clamp_row_est() with:
>
> I thought of that too, but as you say, if the rowcount has overflowed a
> double then we've got way worse problems.  It'd make more sense to try
> to keep the count to a saner value in the first place.

I wonder if there was something more logical we could do to maintain
sane estimates too, but someone could surely still cause it to blow up
by writing a long series of clause-less joins. We can't really get
away from the fact that we must estimate those as inner_rows *
outer_rows

I admit it's annoying to add cycles to clamp_row_est() for such insane cases.

David



Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> I admit it's annoying to add cycles to clamp_row_est() for such insane cases.

I poked at this a bit more closely, and noted that the actual problem is
that when we do this:

    outer_skip_rows = rint(outer_path_rows * outerstartsel);

we have outer_path_rows = inf, outerstartsel = 0, and of course inf times
zero is NaN.  So we end up asserting "NaN <= Inf", not "Inf <= Inf"
(which wouldn't have caused a problem).

If we did want to do something here, I'd consider something like

    if (isnan(outer_skip_rows))
        outer_skip_rows = 0;
    if (isnan(inner_skip_rows))
        inner_skip_rows = 0;

(We shouldn't need that for outer_rows/inner_rows, since the endsel
values can't be 0.)  Messing with clamp_row_est would be a much more
indirect way of fixing it, as well as having more widespread effects.

In the end though, I'm still not terribly excited about this.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Fri, 9 Oct 2020 at 12:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If we did want to do something here, I'd consider something like
>
>         if (isnan(outer_skip_rows))
>             outer_skip_rows = 0;
>         if (isnan(inner_skip_rows))
>             inner_skip_rows = 0;

Are you worried about the costs above the join that triggers that
coming out as NaN with that fix? It appears that's the case. Cost
comparisons of paths with that are not going to do anything along the
lines of sane.

I guess whether or not that matters depends on if we expect any real
queries to hit this, or if we just want to stop the Assert failure.

... 500 joins. I'm willing to listen to the explanation use case, but
in absence of that explanation, I'd be leaning towards "you're doing
it wrong".  If that turns out to be true, then perhaps your proposed
fix is okay.

David



Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Are you worried about the costs above the join that triggers that
> coming out as NaN with that fix? It appears that's the case.

[ pokes at that... ]  Yeah, it looks like nestloop cost estimation
also has some issues with inf-times-zero producing NaN; it's just
not asserting about it.

I notice there are some other ad-hoc isnan() checks scattered
about costsize.c, too.  Maybe we should indeed consider fixing
clamp_row_estimate to get rid of inf (and nan too, I suppose)
so that we'd not need those.  I don't recall the exact cases
that made us introduce those checks, but they were for cases
a lot more easily reachable than this one, I believe.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Fri, 9 Oct 2020 at 15:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I notice there are some other ad-hoc isnan() checks scattered
> about costsize.c, too.  Maybe we should indeed consider fixing
> clamp_row_estimate to get rid of inf (and nan too, I suppose)
> so that we'd not need those.  I don't recall the exact cases
> that made us introduce those checks, but they were for cases
> a lot more easily reachable than this one, I believe.

Is there actually a case where nrows could be NaN?  If not, then it
seems like a wasted check.  Wouldn't it take one of the input
relations or the input rels to have an Inf row estimate (which won't
happen after changing clamp_row_estimate()), or the selectivity
estimate being NaN.

David



Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Fri, 9 Oct 2020 at 15:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I notice there are some other ad-hoc isnan() checks scattered
>> about costsize.c, too.  Maybe we should indeed consider fixing
>> clamp_row_estimate to get rid of inf (and nan too, I suppose)
>> so that we'd not need those.  I don't recall the exact cases
>> that made us introduce those checks, but they were for cases
>> a lot more easily reachable than this one, I believe.

> Is there actually a case where nrows could be NaN?  If not, then it
> seems like a wasted check.  Wouldn't it take one of the input
> relations or the input rels to have an Inf row estimate (which won't
> happen after changing clamp_row_estimate()), or the selectivity
> estimate being NaN.

I'm fairly certain that every one of the existing NaN checks was put
there on the basis of hard experience.  Possibly digging in the git
history would offer more info about exactly where the NaNs came from.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Sat, 10 Oct 2020 at 02:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm fairly certain that every one of the existing NaN checks was put
> there on the basis of hard experience.  Possibly digging in the git
> history would offer more info about exactly where the NaNs came from.


I had a look at this and found there's been quite a number of fixes
which added either that isnan checks or the <= 0 checks.

Namely:

-----------
commit 72826fb362c4aada6d2431df0b706df448806c02
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri Apr 15 17:45:41 2011 -0400

    Guard against incoming rowcount estimate of NaN in cost_mergejoin().

    Although rowcount estimates really ought not be NaN, a bug elsewhere
    could perhaps result in that, and that would cause Assert failure in
    cost_mergejoin, which I believe to be the explanation for bug #5977 from
    Anton Kuznetsov.  Seems like a good idea to expend a couple more cycles
    to prevent that, even though the real bug is elsewhere.  Not back-patching,
    though, because we don't encourage running production systems with
    Asserts on.


The discussion for that is in
https://www.postgresql.org/message-id/flat/4602.1302705756%40sss.pgh.pa.us#69dd8c334aa714cfac4e0d9b04c5201c

commit 76281aa9647e6a5dfc646514554d0f519e3b8a58
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Mar 26 12:03:12 2016 -0400

    Avoid a couple of zero-divide scenarios in the planner.



commit fd791e7b5a1bf53131ad15e68e4d4f8ca795fcb4
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Mon Mar 24 21:53:04 2008 +0000

    When a relation has been proven empty by constraint exclusion,
propagate that
    knowledge up through any joins it participates in.  We were doing
that already
    in some special cases but not in the general case.  Also, defend
against zero
    row estimates for the input relations in cost_mergejoin --- this
fix may have
    eliminated the only scenario in which that can happen, but be safe.  Per
    report from Alex Solovey.


That was reported in

https://www.postgresql.org/message-id/flat/BLU136-DAV79FF310AC13FFC96FA2FDAEFD0%40phx.gbl#4cde17b2369fc7e0da83cc7d4aeeaa48

The problem was that an Append with no subpaths could have a 0 row estimate.
-----------

Because there's been quite a few of these, and this report is yet
another one, I wonder if it's time to try and stamp these out at the
source rather than where the row counts are being used.

I toyed around with the attached patch, but I'm still not that excited
about the clamping of infinite values to DBL_MAX.  The test case I
showed above with generate_Series(1,379) still ends up with NaN cost
estimates due to costing a sort with DBL_MAX rows.  When I was writing
the patch, I had it in my head that the costs per row will always be
lower than 1. I thought because of that that even if the row count is
dangerously close to DBL_MAX, the costs will never be higher than the
row count... Turns out, I was wrong about that as clearly sorting a
number of rows even close to DBL_MAX would beyond astronomically
expensive and cause the costs would go infinite.

The fd791e7b5 fix was for a subpath-less Append node having a 0-row
estimate and causing problems in the costing of merge join. In the
patch, I thought it would be better just to fix this by insisting that
Append always will have at least 1 row.  That means even a dummy path
would have 1 row, which will become a const-false Result in the plan.
I've had to add a special case to set the plan_rows back to 0 so that
EXPLAIN shows 0 rows as it did before.  That's not exactly pretty, but
I still feel there is merit in insisting we never have 0-row paths to
get away from these types of bugs once at for all.

The patch does fix the failing Assert. However, something along these
lines seems more suitable for master only. The back branches maybe
should just get a more localised isinf() check and clamp to DBL_MAX
that I mentioned earlier in this thread.

I've searched through the code to see if there are other possible
cases where paths may be generated with a 0-row count.  I imagine
anything that has a qual and performs a selectivity estimate will
already have a clamp_row_est() since we'd see fractional row counts if
it didn't.  That leaves me with Append / Merge Append and each join
type + aggregates.  Currently, it seems we never will generate a Merge
Append without any sub-paths. I wondered if I should just Assert
that's the case in create_merge_append_path(). I ended up just adding
a clamp_row_est call instead.  calc_joinrel_size_estimate() seems to
handle all join path row estimates. That uses clamp_row_est.
Aggregate paths can reduce the number of rows, but I think all the row
estimates from those will go through estimate_num_groups(), which
appears to never be able to return 0.

David

Attachment

Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> Because there's been quite a few of these, and this report is yet
> another one, I wonder if it's time to try and stamp these out at the
> source rather than where the row counts are being used.

I'm on board with trying to get rid of NaN rowcount estimates more
centrally.  I do not think it is a good idea to try to wire in a
prohibition against zero rowcounts.  That is actually the correct
thing in assorted scenarios --- one example recently under discussion
was ModifyTable without RETURNING, and another is where we can prove
that a restriction clause is constant-false.  At some point I think
we are going to want to deal honestly with those cases instead of
sweeping them under the rug.  So I'm disinclined to remove zero
defenses that we'll just have to put back someday.

I think converting Inf to DBL_MAX, in hopes of avoiding creation of
NaNs later, is fine.  (Note that applying rint() to that is quite
useless --- in every floating-point system, values bigger than
2^number-of-mantissa-bits are certainly integral.)

I'm not sure why you propose to map NaN to one.  Wouldn't mapping it
to Inf (and thence to DBL_MAX) make at least as much sense?  Probably
more in fact.  We know that unwarranted one-row estimates are absolute
death to our chances of picking a well-chosen plan.

> I toyed around with the attached patch, but I'm still not that excited
> about the clamping of infinite values to DBL_MAX.  The test case I
> showed above with generate_Series(1,379) still ends up with NaN cost
> estimates due to costing a sort with DBL_MAX rows.  When I was writing
> the patch, I had it in my head that the costs per row will always be
> lower than 1.

Yeah, that is a good point.  Maybe instead of clamping to DBL_MAX,
we should clamp rowcounts to something that provides some headroom
for multiplication by per-row costs.  A max rowcount of say 1e100
should serve fine, while still being comfortably more than any
non-insane estimate.

So now I'm imagining something like

#define MAXIMUM_ROWCOUNT 1e100

clamp_row_est(double nrows)
{
    /* Get rid of NaN, Inf, and impossibly large row counts */
    if (isnan(nrows) || nrows >= MAXIMUM_ROWCOUNT)
        nrows = MAXIMUM_ROWCOUNT;
    else
    ... existing logic ...


Perhaps we should also have some sort of clamp for path cost
estimates, at least to prevent them from being NaNs which
is going to confuse add_path terribly.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
Thanks for having a look at this.

On Wed, 14 Oct 2020 at 04:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm on board with trying to get rid of NaN rowcount estimates more
> centrally.  I do not think it is a good idea to try to wire in a
> prohibition against zero rowcounts.  That is actually the correct
> thing in assorted scenarios --- one example recently under discussion
> was ModifyTable without RETURNING, and another is where we can prove
> that a restriction clause is constant-false.  At some point I think
> we are going to want to deal honestly with those cases instead of
> sweeping them under the rug.  So I'm disinclined to remove zero
> defenses that we'll just have to put back someday.

OK, that certainly limits the scope here.  It just means we can't get
rid of the <= 0 checks in join costing functions.  The problem case
that this was added for was a dummy Append. We still have valid cases
that won't convert the join rel to a dummy rel with a dummy Append on
one side.

> I think converting Inf to DBL_MAX, in hopes of avoiding creation of
> NaNs later, is fine.  (Note that applying rint() to that is quite
> useless --- in every floating-point system, values bigger than
> 2^number-of-mantissa-bits are certainly integral.)

Good point.

> I'm not sure why you propose to map NaN to one.  Wouldn't mapping it
> to Inf (and thence to DBL_MAX) make at least as much sense?  Probably
> more in fact.  We know that unwarranted one-row estimates are absolute
> death to our chances of picking a well-chosen plan.

That came around due to what the join costing functions were doing. i.e:

/* Protect some assumptions below that rowcounts aren't zero or NaN */
if (inner_path_rows <= 0 || isnan(inner_path_rows))
   inner_path_rows = 1;

[1] didn't have an example case of how the NaNs were introduced, so I
was mostly just copying the logic that was added to fix that back in
72826fb3.

> > I toyed around with the attached patch, but I'm still not that excited
> > about the clamping of infinite values to DBL_MAX.  The test case I
> > showed above with generate_Series(1,379) still ends up with NaN cost
> > estimates due to costing a sort with DBL_MAX rows.  When I was writing
> > the patch, I had it in my head that the costs per row will always be
> > lower than 1.
>
> Yeah, that is a good point.  Maybe instead of clamping to DBL_MAX,
> we should clamp rowcounts to something that provides some headroom
> for multiplication by per-row costs.  A max rowcount of say 1e100
> should serve fine, while still being comfortably more than any
> non-insane estimate.
>
> So now I'm imagining something like
>
> #define MAXIMUM_ROWCOUNT 1e100

That seems more reasonable. We likely could push it a bit higher, but
I'm not all that motivated to since if that was true, then you could
expect the heat death of the universe to arrive before your query
results. In which case the user would likely struggle to find
electrons to power their computer.

> clamp_row_est(double nrows)
> {
>         /* Get rid of NaN, Inf, and impossibly large row counts */
>         if (isnan(nrows) || nrows >= MAXIMUM_ROWCOUNT)
>             nrows = MAXIMUM_ROWCOUNT;
>         else
>         ... existing logic ...

I've got something along those lines in the attached.

> Perhaps we should also have some sort of clamp for path cost
> estimates, at least to prevent them from being NaNs which
> is going to confuse add_path terribly.

hmm. I'm not quite sure where to start with that one.  Many of the
path estimates will already go through clamp_row_est(). There are
various special requirements, e.g Appends with no subpaths. So when to
apply it would depend on what path type it is. I'd say it would need
lots of careful analysis and a scattering of new calls in pathnode.c

I've ended up leaving the NaN checks in the join costing functions.
There was no case mentioned in [1] that showed how we hit that
reported test case, so I'm not really confident enough to know I'm not
just reintroducing the same problem again by removing that.  The path
row estimate that had the NaN might not have been through
clamp_row_est(). Many don't.

David

[1] https://www.postgresql.org/message-id/7270.1302902842%40sss.pgh.pa.us

Attachment

Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Wed, 14 Oct 2020 at 04:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So now I'm imagining something like
>> #define MAXIMUM_ROWCOUNT 1e100

> That seems more reasonable. We likely could push it a bit higher, but
> I'm not all that motivated to since if that was true, then you could
> expect the heat death of the universe to arrive before your query
> results. In which case the user would likely struggle to find
> electrons to power their computer.

Right.  But I'm thinking about joins in which both inputs are clamped to
that maximum estimate.  If we allowed it to be as high as 1e200, then
multiplying the two input rowcounts together would itself overflow.
At 1e100, we can do that and also multiply in a ridiculous per-row cost,
and we're still well below the overflow threshold.  So this should go
pretty far towards preventing internal overflows in any one plan step's
cost & rows calculations.

(For comparison's sake, I believe the number of atoms in the observable
universe is thought to be somewhere on the order of 1e80.  So we are
pretty safe in thinking that no practically-useful rowcount estimate
will exceed 1e100; there is no need to make it higher.)

> I've ended up leaving the NaN checks in the join costing functions.
> There was no case mentioned in [1] that showed how we hit that
> reported test case, so I'm not really confident enough to know I'm not
> just reintroducing the same problem again by removing that.  The path
> row estimate that had the NaN might not have been through
> clamp_row_est(). Many don't.

Hmm, I will try to find some time tomorrow to reconstruct that.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
I wrote:
> David Rowley <dgrowleyml@gmail.com> writes:
>> I've ended up leaving the NaN checks in the join costing functions.
>> There was no case mentioned in [1] that showed how we hit that
>> reported test case, so I'm not really confident enough to know I'm not
>> just reintroducing the same problem again by removing that.  The path
>> row estimate that had the NaN might not have been through
>> clamp_row_est(). Many don't.

> Hmm, I will try to find some time tomorrow to reconstruct that.

I'm confused now, because the v2 patch does remove those isnan calls?

I rechecked the archives, and I agree that there's no data about
exactly how we could have gotten a NaN here.  My guess though is
infinity-times-zero in some earlier relation size estimate.  So
hopefully the clamp to 1e100 will make that impossible, or if it
doesn't then clamp_row_est() should still prevent a NaN from
propagating to the next level up.

I'm good with the v2 patch.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Sat, 17 Oct 2020 at 06:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm confused now, because the v2 patch does remove those isnan calls?

I think that was a case of a last-minute change of mind and forgetting
to attach the updated patch.

> I rechecked the archives, and I agree that there's no data about
> exactly how we could have gotten a NaN here.  My guess though is
> infinity-times-zero in some earlier relation size estimate.  So
> hopefully the clamp to 1e100 will make that impossible, or if it
> doesn't then clamp_row_est() should still prevent a NaN from
> propagating to the next level up.
>
> I'm good with the v2 patch.

Thanks a lot for having a look. I'll proceed in getting the v2 which I
sent earlier into master.

For the backbranches, I think I go with something more minimal in the
form of adding:

if (outer_path_rows <= 0 || isnan(outer_path_rows))
    outer_path_rows = 1;
+else if (isinf(outer_path_rows))
+    outer_path_rows = DBL_MAX;

and the same for the inner_path_rows to each area in costsize.c which
has that code.

Wondering your thoughts on that.

David



Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Sat, 17 Oct 2020 at 06:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm good with the v2 patch.

> Thanks a lot for having a look. I'll proceed in getting the v2 which I
> sent earlier into master.

> For the backbranches, I think I go with something more minimal in the
> form of adding:

TBH, I see no need to do anything in the back branches.  This is not
an issue for production usage.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Mon, 19 Oct 2020 at 12:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > For the backbranches, I think I go with something more minimal in the
> > form of adding:
>
> TBH, I see no need to do anything in the back branches.  This is not
> an issue for production usage.

I understand the Assert failure is pretty harmless, so non-assert
builds shouldn't suffer too greatly.  I just assumed that any large
stakeholders invested in upgrading to a newer version of PostgreSQL
may like to run various tests with their application against an assert
enabled version of PostgreSQL perhaps to gain some confidence in the
upgrade. A failing assert is unlikely to inspire additional
confidence.

I'm not set on backpatching, but that's just my thoughts.

FWIW, the patch I'd thought of is attached.

David

Attachment

Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Mon, 19 Oct 2020 at 12:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> TBH, I see no need to do anything in the back branches.  This is not
>> an issue for production usage.

> I understand the Assert failure is pretty harmless, so non-assert
> builds shouldn't suffer too greatly.  I just assumed that any large
> stakeholders invested in upgrading to a newer version of PostgreSQL
> may like to run various tests with their application against an assert
> enabled version of PostgreSQL perhaps to gain some confidence in the
> upgrade. A failing assert is unlikely to inspire additional
> confidence.

If any existing outside regression tests hit such corner cases, then
(a) we'd have heard about it, and (b) likely they'd fail in the older
branch as well.  So I don't buy the argument that this will dissuade
somebody from upgrading.

I do, on the other hand, buy the idea that if anyone is indeed working
in this realm, they might be annoyed by a behavior change in a stable
branch.  So it cuts both ways.  On balance I don't think we should
touch this in the back branches.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Mon, 19 Oct 2020 at 12:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > On Mon, 19 Oct 2020 at 12:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> TBH, I see no need to do anything in the back branches.  This is not
> >> an issue for production usage.
>
> > I understand the Assert failure is pretty harmless, so non-assert
> > builds shouldn't suffer too greatly.  I just assumed that any large
> > stakeholders invested in upgrading to a newer version of PostgreSQL
> > may like to run various tests with their application against an assert
> > enabled version of PostgreSQL perhaps to gain some confidence in the
> > upgrade. A failing assert is unlikely to inspire additional
> > confidence.
>
> If any existing outside regression tests hit such corner cases, then
> (a) we'd have heard about it, and (b) likely they'd fail in the older
> branch as well.  So I don't buy the argument that this will dissuade
> somebody from upgrading.

hmm, well it was reported to us. Perhaps swapping the word "upgrading"
for "migrating".

It would be good to hear Onder's case to see if he has a good argument
for having a vested interest in pg13 not failing this way with assets
enabled.

> I do, on the other hand, buy the idea that if anyone is indeed working
> in this realm, they might be annoyed by a behavior change in a stable
> branch.  So it cuts both ways.  On balance I don't think we should
> touch this in the back branches.

I guess we could resolve that concern by just changing the failing
assert to become: Assert(outer_skip_rows <= outer_rows ||
isinf(outer_rows));

It's pretty grotty but should address that concern.

David



Re: Assertion failure with LEFT JOINs among >500 relations

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> It would be good to hear Onder's case to see if he has a good argument
> for having a vested interest in pg13 not failing this way with assets
> enabled.

Yeah, some context for this report would be a good thing.
(BTW, am I wrong to suppose that the same case fails the same
way in our older branches?  Certainly that Assert has been there
a long time.)

> I guess we could resolve that concern by just changing the failing
> assert to become: Assert(outer_skip_rows <= outer_rows ||
> isinf(outer_rows));

I can't really object to just weakening the Assert a tad.
My thoughts would have run towards checking for the NaN though.

            regards, tom lane



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Mon, 19 Oct 2020 at 13:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (BTW, am I wrong to suppose that the same case fails the same
> way in our older branches?  Certainly that Assert has been there
> a long time.)

I only tested as back as far as 9.5, but it does fail there.

David



Re: Assertion failure with LEFT JOINs among >500 relations

From
David Rowley
Date:
On Mon, 19 Oct 2020 at 13:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I guess we could resolve that concern by just changing the failing
> > assert to become: Assert(outer_skip_rows <= outer_rows ||
> > isinf(outer_rows));
>
> I can't really object to just weakening the Assert a tad.
> My thoughts would have run towards checking for the NaN though.

I ended up back-patching a change that does that.

Thanks for your input on this and for the report, Onder.

David