Re: postgres_fdw vs. force_parallel_mode on ppc - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: postgres_fdw vs. force_parallel_mode on ppc |
Date | |
Msg-id | CA+Tgmoa+fjuesD1fC5Q2m_HQ1R_=8Dd7_YyzB8V_L6h=JMWD7w@mail.gmail.com Whole thread Raw |
In response to | Re: postgres_fdw vs. force_parallel_mode on ppc (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: postgres_fdw vs. force_parallel_mode on ppc
|
List | pgsql-hackers |
On Fri, Mar 4, 2016 at 12:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> A couple of my colleagues have been looking into this. It's not >> entirely clear to me what's going on here yet, but it looks like the >> stats get there if you wait long enough. Rahila Syed was able to >> reproduce the problem and says that the attached patch fixes it. But >> I don't quite understand why this should fix it. > > I don't like this patch much. While the new function is not bad in > itself, it looks really weird to call it immediately after the other > wait function. And the reason for that, AFAICT, is that somebody dropped > the entire "truncation stats" test sequence into the middle of unrelated > tests, evidently in the vain hope that that way they could piggyback > on the existing wait. Which these failures are showing us is wrong. > > I think we should move all the inserted logic down so that it's not in the > middle of unrelated testing. Sure. If you have an idea what the right thing to do is, please go ahead. I actually don't have a clear idea what's going on here. I guess it's that the wait_for_stats() guarantees that the stats message from the index insertion has been received but the status messages from the "trunc" tables might not have gotten there yet. I thought maybe that works without parallelism because all of those messages are coming from the same backend, and therefore if you have the later one you must have all of the earlier ones, too. But if you're running some of the queries in parallel workers then it's possible for a stats message from a worker run later to arrive later. But it's not that after all, because when I run the regression tests with the pg_sleep removed, I get this: *** /Users/rhaas/pgsql/src/test/regress/expected/stats.out 2016-03-04 08:55:33.000000000 -0500 --- /Users/rhaas/pgsql/src/test/regress/results/stats.out 2016-03-04 09:00:29.000000000 -0500 *************** *** 127,140 **** 1 (1 row) - -- force the rate-limiting logic in pgstat_report_tabstat() to time out - -- and send a message - SELECT pg_sleep(1.0); - pg_sleep - ---------- - - (1 row) - -- wait for stats collector to update SELECT wait_for_stats(); wait_for_stats --- 127,132 ---- *************** *** 148,158 **** WHERE relname like 'trunc_stats_test%' order by relname; relname | n_tup_ins | n_tup_upd | n_tup_del| n_live_tup | n_dead_tup -------------------+-----------+-----------+-----------+------------+------------ ! trunc_stats_test | 3 | 0 | 0 | 0 | 0 ! trunc_stats_test1 | 4 | 2 | 1 | 1 | 0 ! trunc_stats_test2 | 1 | 0 | 0 | 1 | 0 ! trunc_stats_test3 | 4 | 0 | 0 | 2 | 2 ! trunc_stats_test4 | 2 | 0 | 0 | 0 | 2 (5 rows) SELECT st.seq_scan >= pr.seq_scan + 1, --- 140,150 ---- WHERE relname like 'trunc_stats_test%' order by relname; relname | n_tup_ins | n_tup_upd | n_tup_del| n_live_tup | n_dead_tup -------------------+-----------+-----------+-----------+------------+------------ ! trunc_stats_test | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test1 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test2 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test3 | 0 | 0 | 0 | 0 | 0 ! trunc_stats_test4 | 0 | 0 | 0 | 0 | 0 (5 rows) SELECT st.seq_scan >= pr.seq_scan + 1, *************** *** 163,169 **** WHERE st.relname='tenk2' AND cl.relname='tenk2'; ?column? | ?column? | ?column? | ?column? ----------+----------+----------+---------- ! t | t | t | t (1 row) SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages, --- 155,161 ---- WHERE st.relname='tenk2' AND cl.relname='tenk2'; ?column? | ?column? | ?column? | ?column? ----------+----------+----------+---------- ! f | f | f | f (1 row) SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages, *************** *** 172,178 **** WHERE st.relname='tenk2' AND cl.relname='tenk2'; ?column? | ?column? ----------+---------- ! t | t (1 row) SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer --- 164,170 ---- WHERE st.relname='tenk2' AND cl.relname='tenk2'; ?column? | ?column? ----------+---------- ! t | f (1 row) SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer That looks suspiciously similar to the failure we're getting with the force_parallel_mode testing, but I'm still confused. >> BTW, this comment is obsolete: > >> -- force the rate-limiting logic in pgstat_report_tabstat() to time out >> -- and send a message >> SELECT pg_sleep(1.0); >> pg_sleep >> ---------- > >> (1 row) > >> That function was renamed in commit >> 93c701edc6c6f065cd25f77f63ab31aff085d6ac, but apparently Tom forgot to >> grep for other uses of that identifier name. > > Duh :-(. Actually, do we need that sleep at all anymore? Seems like > wait_for_stats ought to cover it. Yeah. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: