On Mon, Sep 21, 2009 at 12:46 PM, Tom Duffey <tduffey@techbydesign.com> wrote:
>
> On Sep 21, 2009, at 12:40 PM, Scott Marlowe wrote:
>
>> On Mon, Sep 21, 2009 at 11:09 AM, Tom Duffey <tduffey@techbydesign.com> wrote:
>>>
>>> Hi All,
>>>
>>> We're having numerous problems with a PostgreSQL 8.3.7 database running on a
>>> virtual Linux server under VMware ESX. This is not by choice, and I have been
>>> asking the operator of this equipment for details about the disk setup. Here's
>>> what I got:
>>>
>>> "We have a SAN that is presenting an NFS share. VMware sees that share and
>>> reads the VMDK files that make up the virtual file system."
>>>
>>> Does anyone with a better understanding of PostgreSQL and VMware know if this
>>> is an unreliable setup for PostgreSQL? I see things like "NFS" and "VMware"
>>> and start to get worried.
>>
>> I see VMware and think performance issues; I see NFS and think dear god help
>> us all. Even when properly set up, NFS is a problem waiting to happen, and in
>> my opinion it's not reliable storage for a database. That said, lots of folks
>> do it. Ask the sysadmin for the NFS mount options.
>
> Thanks to everyone so far for the insight. I'm trying to get more details
> about the hardware setup but am not making much progress.
>
> Here are some of the errors we're getting. I searched through the archives,
> and they all seem to point at hardware trouble, but is there anything else I
> should be looking at?
>
> ERROR: invalid page header in block 2 of relation "pg_toast_19466_index"
>
> ERROR: invalid memory alloc request size 1667592311
> STATEMENT: COPY public.version_bundle (node_id_hi, node_id_lo, bundle_data) TO stdout;
>
> ERROR: unexpected chunk number 1632 (expected 1629) for toast value 19711 in pg_toast_19184
> STATEMENT: COPY public.data_binval (binval_id, binval_data) TO stdout;
>
> ERROR: invalid page header in block 414 of relation "pg_toast_19460_index"
>
> ERROR: could not open segment 1 of relation 1663/16386/16535 (target block 3966127611): No such file or directory
>
> I dealt with some of the above by reindexing or by finding and deleting bad
> rows. I can now dump the database successfully, but of course some data is
> missing, so the application is toast. What I'm really wondering now is how to
> prevent this from happening again, and whether that means moving the database
> to new hardware.

Definitely sounds like file system corruption to me. And who knows what's
gotten hammered that hasn't caused an error yet, eh? Time to move to a
standalone db server, or get a sysadmin who knows how to set up VMware to make
pgsql happy.
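For what it's worth, the "could not open segment 1" error quoted above is itself a hint that garbage is being read. Here is a minimal sketch under a stated assumption: a default build with 8 kB blocks and 1 GB segment files. The constants below reflect those assumed defaults, and `locate_block` is a hypothetical helper, not part of PostgreSQL. Under those defaults, block 3966127611 would fall in segment file 30259, roughly 29.5 TiB into the relation. No real table on this box is likely to be that large, so the block number itself was most likely read from corrupted data.

```python
# Assumption: default PostgreSQL build settings. Non-default builds differ.
BLCKSZ = 8192          # bytes per block (assumed default)
RELSEG_SIZE = 131072   # blocks per 1 GB segment file (assumed default)


def locate_block(block: int) -> tuple[int, int]:
    """Hypothetical helper: return (segment file number, byte offset)
    of a relation block, given the assumed block and segment sizes."""
    segment = block // RELSEG_SIZE
    offset = block * BLCKSZ
    return segment, offset


# Block number taken from the error message quoted above.
segment, offset = locate_block(3966127611)
print(f"segment {segment}, offset {offset / 1024**4:.1f} TiB")
```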