On Fri, Mar 4, 2016 at 11:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> That's what it looks like to me. I now think that the apparent
>>> connection to parallel query is a mirage. The reason we've only
>>> seen a few cases so far is that the flapping test is new: it
>>> wad added in commit d42358efb16cc811, on 20 Feb.
>
>> It was added on Feb 20 all right, but of *last year*. It's been there
>> working happily for a year now.
>
> Wup, you're right, failed to look closely enough at the commit log
> entry. So that puts us back to wondering why exactly parallel query
> is triggering this. Still, Robert's experiment with removing the
> pg_sleep seems fairly conclusive: it is possible to get the failure
> without parallel query.
>
>> Instead of adding another sleep function, another possibility is to add
>> two booleans, one for the index counter and another for the truncate
>> counters, and only terminate the sleep if both are true. I don't see
>> any reason to make this test any slower than it already is.
>
> Well, that would make the function more complicated, but maybe it's a
> better answer. On the other hand, we know that the stats updates are
> delivered in a deterministic order, so why not simply replace the
> existing test in the wait function with one that looks for the truncation
> updates? If we've gotten those, we must have gotten the earlier ones.
I'm not sure if that's actually true with parallel mode. I'm pretty
sure the earlier workers will have terminated before the later ones
start, but is that enough to guarantee that the stats collector sees
the messages in that order?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company