Re: Elusive segfault with 9.3.5 & query cancel - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Elusive segfault with 9.3.5 & query cancel
Date
Msg-id 548223BB.90206@agliodbs.com
In response to Elusive segfault with 9.3.5 & query cancel  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Elusive segfault with 9.3.5 & query cancel  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On 12/05/2014 12:54 PM, Josh Berkus wrote:
> Hackers,
> 
> This is not a complete enough report for a diagnosis.  I'm posting it
> here just in case someone else sees something like it, and having an
> additional report will help figure out the underlying issue.
> 
> * 700GB database with around 5,000 writes per second
> * 8 replicas handling around 10,000 read queries per second each
> * replicas are slammed (40-70% utilization)
> * replication produces lots of replication query cancels
> 
> In this scenario, a specific query against some of the less busy and
> fairly small tables would produce a segfault (signal 11) once every 1-4
> days randomly.  This query could have hundreds of successful runs for every
> segfault. This was not reproducible manually, and the segfaults never
> happened on the master.  Nor did we ever see a segfault based on any
> other query, including against the tables which were generally the
> source of the query cancels.
> 
> In case it's relevant, the query included use of regexp_split_to_array()
> and ORDER BY random(), neither of which are generally used in the user's
> other queries.
> 
> We made some changes which decreased query cancel (optimizing queries,
> turning on hot_standby_feedback) and we haven't seen a segfault since
> then.  As far as the user is concerned, this solves the problem, so I'm
> never going to get a trace or a core dump file.

I forgot a major piece of evidence for why I think this is related to
query cancel:  in each case, the segfault was preceded by a
multi-backend query cancel 3ms to 30ms beforehand.  It is possible that
the backend running the query which segfaulted might have been the only
backend *not* cancelled due to query conflict at that moment.
On the other hand, there are other multi-backend query cancels in the
logs which do NOT produce a segfault.
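
For anyone who wants to check their own logs for the same pattern, here is
a rough sketch of the correlation described above.  It is not the tooling
used for this report.  It assumes the log has already been parsed into
(timestamp, message) pairs, and that cancels and crashes can be recognized
by the message substrings below; the function name, the markers and the
30ms default window are illustrative choices, so adjust them to your setup.

```python
from datetime import timedelta

# Assumed message substrings; check them against your own server logs.
CANCEL_MARKER = "canceling statement due to conflict with recovery"
SEGFAULT_MARKER = "terminated by signal 11"


def cancels_before_segfaults(events, window=timedelta(milliseconds=30)):
    """Map each segfault timestamp to the cancel timestamps that
    occurred at most `window` before it.

    `events` is an iterable of (datetime, message) pairs.
    """
    events = list(events)
    # Collect every cancel timestamp once, then look back from each crash.
    cancels = [ts for ts, msg in events if CANCEL_MARKER in msg]
    result = {}
    for ts, msg in events:
        if SEGFAULT_MARKER in msg:
            result[ts] = [
                c for c in cancels
                if timedelta(0) <= ts - c <= window
            ]
    return result
```

A segfault that maps to an empty list had no cancel in the window.  Note
that this only looks backward from crashes; it says nothing about the
cancel bursts that were not followed by a segfault.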

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


