Thread: Reproducible "Bus error" in 9.2.3 during database dump restoration (Ubuntu Server 12.04 LTS)

Hello.

I have a database dump file (unfortunately containing proprietary information) whose restoration leads to the following error in the logs. It happens even after a fresh initdb and is reliably reproducible: always at the same large table, at the same point in time:

LOG:  server process (PID 18705) was terminated by signal 7: Bus error
DETAIL:  Failed process was running: COPY br_agent_log (id, agent_id, stamp, trace, message) FROM stdin;
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
...
and then, after recovery:
...
redo done at 0/12DDB7A8
...
LOG:  database system is ready to accept connections
ERROR:  could not read block 1 in file "base/57390/11783": read only 4448 of 8192 bytes at character 39

Could this be memory corruption in PG? BTW, 9.1.8 does not have this problem: the restoration completes OK.

Can I help with investigating this crash? What is the best way to do it? Perhaps you have a tutorial article that shows the preferred error-reporting format?


Dmitry Koterov <dmitry@koterov.ru> wrote:

> LOG:  server process (PID 18705) was terminated by signal 7: Bus error

So far I have only heard of this sort of error when PostgreSQL is
running in a virtual machine and the VM software is buggy.  If you
are not running in a VM, my next two suspects would be
hardware/BIOS configuration issues, or an antivirus product.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Tue, Mar 5, 2013 at 3:04 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Dmitry Koterov <dmitry@koterov.ru> wrote:
>
>> LOG:  server process (PID 18705) was terminated by signal 7: Bus error
>
> So far I have only heard of this sort of error when PostgreSQL is
> running in a virtual machine and the VM software is buggy.  If you
> are not running in a VM, my next two suspects would be
> hardware/BIOS configuration issues, or an antivirus product.

for posterity, what's the hardware platform?  software bus errors are
more likely on non-x86 hardware.

merlin



x86_64. PostgreSQL 9.2 runs inside an OpenVZ container and gets SIGBUS.
PostgreSQL 9.1 has no such problem.

(OpenVZ is Linux kernel-level virtualization that adds namespaces for processes, networking, quotas, etc. It does not work like e.g. Xen or VMware, because all containers share the same kernel.)


On 03/11/2013 09:20 PM, Dmitry Koterov wrote:
> x86_64. PostgreSQL 9.2 runs inside an OpenVZ container and gets SIGBUS.
> PostgreSQL 9.1 has no such problem.
>
> (OpenVZ is Linux kernel-level virtualization that adds namespaces for processes, networking, quotas, etc. It does not work like e.g. Xen or VMware, because all containers share the same kernel.)

Could this be related to SHM vs. mmapped files? It seems unlikely, but I guess it could affect sufficiently low-level work such as kernel TLB usage.

At what point in Pg's execution does the SIGBUS occur? Is it always at the same place, or at one of a few places, in the code? It would be helpful if you could enable core file writing and get backtraces from the core files or (since it's reproducible) attach a debugger directly to a Pg backend. See http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
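
A rough sketch of the debugger route, in case it helps. It assumes gdb is installed and that you run it as a user allowed to attach to the backend (typically postgres or root). The backend PID can be obtained with SELECT pg_backend_pid() in the session that will run the restore. The helper names build_gdb_command and capture_backtrace are made up for illustration; only the gdb options (-p, -batch, -ex) and the gdb commands (handle, continue, bt) are real. Ignoring SIGUSR1 is an assumption on my part, since PostgreSQL uses that signal internally and gdb would otherwise stop on it.

```python
import subprocess


def build_gdb_command(pid):
    """Build a gdb command line that attaches to `pid`, resumes it, and
    prints a backtrace once the process stops on a signal such as SIGBUS."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),     # attach to the running backend
        "-batch",           # run the -ex commands below, then exit
        # Assumption: do not stop on SIGUSR1, which PostgreSQL uses internally.
        "-ex", "handle SIGUSR1 nostop noprint pass",
        "-ex", "continue",  # let the backend run until it stops on a signal
        "-ex", "bt",        # print the stack at the point where it stopped
    ]


def capture_backtrace(pid):
    """Run gdb and return its combined output as text (hypothetical helper).

    This blocks until the backend stops, so start it first and then kick
    off the restore in the session that owns `pid`.
    """
    result = subprocess.run(
        build_gdb_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

The backtrace is only readable if debug symbols for the PostgreSQL binaries are installed; the wiki page above covers that part.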

If you restore just the first half of the table or just the last half does the crash still happen? If it still happens in one part but not in another, can you do a binary search* to isolate the smallest chunk of the input file that still reliably causes the crash?

* (i.e. split the file roughly in half at a record boundary and test each half. Discard the half that doesn't crash and keep the half that does. Repeat the process using the kept half as input until you find the smallest chunk that still crashes, or get down to a single record that causes the problem.)
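
The halving procedure can be sketched roughly as follows. This is only an illustration: smallest_crashing_chunk and the crashes callback are hypothetical names, and the callback is assumed to restore the given rows into a scratch database and report whether the backend died.

```python
def smallest_crashing_chunk(rows, crashes):
    """Repeatedly halve `rows`, keeping whichever half still crashes.

    `crashes(chunk)` is a caller-supplied (hypothetical) callback that
    restores `chunk` into a scratch database and returns True if the
    backend was terminated.
    """
    chunk = list(rows)
    if not crashes(chunk):
        raise ValueError("the full input does not crash; nothing to bisect")
    while len(chunk) > 1:
        mid = len(chunk) // 2
        first, second = chunk[:mid], chunk[mid:]
        if crashes(first):
            chunk = first
        elif crashes(second):
            chunk = second
        else:
            # Neither half crashes on its own: the problem needs rows from
            # both halves (or depends on total volume), so stop here.
            break
    return chunk
```

For a plain-format dump, the rows would be the data lines between the "COPY ... FROM stdin;" line and the terminating "\." line, so each test input needs those two lines put back around the chunk. If the loop stops with more than one row, that in itself is useful information: the crash probably depends on data volume rather than on a single bad record.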

Does the same data cause a crash when restored in another container on the same OpenVZ host? What about when restored on another machine with the same OS and Pg version, outside OpenVZ?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services