Re: select_parallel test failure: gather sometimes losing tuples(maybe during rescans)? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: select_parallel test failure: gather sometimes losing tuples(maybe during rescans)?
Date
Msg-id CAEepm=0pAqV8Nqdt2D6+nH=bkTkchKjL46tokcGUmGo5P7+4zg@mail.gmail.com
Whole thread Raw
In response to Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?  (Andres Freund <andres@anarazel.de>)
Responses Re: select_parallel test failure: gather sometimes losing tuples(maybe during rescans)?  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Sun, Mar 4, 2018 at 3:40 PM, Andres Freund <andres@anarazel.de> wrote:
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>>                                   QUERY PLAN
>>>
>>--------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=98000 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9800 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960
>>loops=50)
>>> --- 485,495 ----
>>>                                   QUERY PLAN
>>>
>>--------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=97984 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9798 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960
>>loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>>I think the same failure (or at least very similar plan diff) was
>>already mentioned here:
>>
>>https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>>
>>So I guess someone else already noticed, but I don't see the cause
>>identified in that thread.

Oh.  Sorry, I didn't recognise that as the same thing, from the title.
Doesn't seem to be related to number of workers launched at all... it
looks more like the tuple queue is misbehaving.  Though I haven't got
any proof of anything yet.

> Robert and I started discussing it a bit over IM. No conclusion. Robert tried to reproduce locally, including
disablingatomics, without luck.
 
>
> Can anybody reproduce locally?

I've seen it several times on Travis CI.  (So I would normally have
been able to tell you about this problem before the was committed,
except that the email thread was too long and the mail archive app
cuts long threads off!)  Will try on some different kind of computers
that I have local control off...  I suspect (knowing how we run it on
Travis CI) that being way overloaded might be helpful...

-- 
Thomas Munro
http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
Next
From: Thomas Munro
Date:
Subject: Re: select_parallel test failure: gather sometimes losing tuples(maybe during rescans)?