On 03/04/2018 03:40 AM, Andres Freund wrote:
>
>
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>> QUERY PLAN
>>> --------------------------------------------------------------------------
>>> Aggregate (actual rows=1 loops=1)
>>> ! -> Nested Loop (actual rows=98000 loops=1)
>>> -> Seq Scan on tenk2 (actual rows=10 loops=1)
>>> Filter: (thousand = 0)
>>> Rows Removed by Filter: 9990
>>> ! -> Gather (actual rows=9800 loops=10)
>>> Workers Planned: 4
>>> Workers Launched: 4
>>> -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>> --- 485,495 ----
>>> QUERY PLAN
>>> --------------------------------------------------------------------------
>>> Aggregate (actual rows=1 loops=1)
>>> ! -> Nested Loop (actual rows=97984 loops=1)
>>> -> Seq Scan on tenk2 (actual rows=10 loops=1)
>>> Filter: (thousand = 0)
>>> Rows Removed by Filter: 9990
>>> ! -> Gather (actual rows=9798 loops=10)
>>> Workers Planned: 4
>>> Workers Launched: 4
>>> -> Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>> I think the same failure (or at least very similar plan diff) was
>> already mentioned here:
>>
>> https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>>
>> So I guess someone else already noticed, but I don't see the cause
>> identified in that thread.
>
> Robert and I started discussing it a bit over IM. No conclusion. Robert
> tried to reproduce locally, including disabling atomics, without luck.
>
> Can anybody reproduce locally?
>
I've started "make check" with parallel_schedule tweaked to contain many
select_parallel runs, and so far I've seen a couple of failures like
this (about 10 failures out of 1500 runs):

  select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
  tenk2.thousand=0;
! ERROR: lost connection to parallel worker

I have no idea why the worker fails (no segfaults in dmesg, nothing in
the postgres log), or if it's related to the issue discussed here at all.
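
For anyone who wants to try a similar stress run, here is a minimal sketch of one way to build such a schedule. It is not necessarily how the schedule above was tweaked. It assumes the usual schedule format of one "test: name ..." group per line, and that the harness tolerates the same test name appearing many times. The helper name repeat_test is hypothetical.

```python
def repeat_test(schedule_text: str, test_name: str, extra_runs: int) -> str:
    """Return the schedule with extra single-test groups appended.

    Each appended line is its own group, so the repeated test runs
    after all the original groups (which create the tables it needs).
    """
    lines = schedule_text.splitlines()
    lines.extend(f"test: {test_name}" for _ in range(extra_runs))
    return "\n".join(lines) + "\n"
```

The result could be written to a copy of parallel_schedule, for example with extra_runs=100 and test_name="select_parallel", and the regression run pointed at that copy instead of the original.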
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services