Hi Tom,
Thanks for the reply; we appreciate your time on this. The alloc error
queries all seem to be selects from a btree primary index. I gave an
example from the logins table in my initial post. Usually for us it
is logins, but we have sometimes seen it on a few other tables, and
it is always a very simple select on a btree primary key index.
The queries show up in the logs, which is how we know, but we could
also confirm this from the core dump. If the problem is data
corruption, it is transient: we replay the same queries and get no
errors. We also have jobs that run essentially the same series of
selects every day or every hour, but which ones cause an error is
totally random. I.e., if it is corruption, it somehow magically fixes
itself. Also, we still have not seen any issues on the master; this
seems to be a problem only on hot standby slaves (we have three
slaves). The OP, incidentally, reported the same thing - issue is
only on hot standby slaves, it is transient, and it happens on a
select from a btree primary index.
This also does not seem to be load related. It often happens during
periods of light load for us.
Please let us know if you have any other thoughts on what we should look at...
On Jan 30, 2012, at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bridget Frey <bridget.frey@redfin.com> writes:
>> The second error is an invalid memory alloc error that we're getting ~2
>> dozen times per day in production. The bt for this alloc error is below.
>
> This trace is consistent with the idea that we're getting a corrupt
> tuple out of a table, although it doesn't entirely preclude the
> possibility that the corrupt value is manufactured inside the backend.
> To get much further you're going to need to look at the specific query
> being executed each time this happens, and see if you can detect any
> pattern. Now that you've got debug symbols straightened out, the
> gdb command "p debug_query_string" should accomplish this. (If that
> does not produce anything that looks like one of your application's
> SQL commands, we'll need to try harder to extract the info.) You could
> probably hack the elog(PANIC) to log that before dying, too, if you'd
> rather not manually gdb each core dump.
>
> regards, tom lane
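
For anyone who would rather not open each core dump by hand, the gdb
step suggested above could be scripted. The following is only a rough
sketch, not something from this thread: it assumes gdb is on the PATH
and that the postgres binary has debug symbols, and the function names
and any paths passed to them are hypothetical. It uses gdb's standard
-batch and -ex options to run "p debug_query_string" against one core
file at a time.

```python
import subprocess
from pathlib import Path


def gdb_command(postgres_binary, core_file):
    """Build a gdb batch invocation that prints debug_query_string."""
    return [
        "gdb",
        "-batch",                       # run the commands, then exit
        "-ex", "p debug_query_string",  # the gdb command quoted above
        str(postgres_binary),
        str(core_file),
    ]


def query_from_core(postgres_binary, core_file):
    """Run gdb on a single core file and return its raw stdout."""
    result = subprocess.run(
        gdb_command(postgres_binary, core_file),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero on a damaged core; keep going
    )
    return result.stdout


def dump_queries(postgres_binary, core_dir, pattern="core*"):
    """Print gdb's output for every core file matching pattern (assumed naming)."""
    for core in sorted(Path(core_dir).glob(pattern)):
        print(core)
        print(query_from_core(postgres_binary, core))
```

Calling dump_queries() with the path to the postgres binary and the
directory holding the core files would print gdb's raw output per core;
if that output does not look like an application SQL command, the
caveat in the quoted message still applies.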