could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes - Mailing list pgsql-bugs

From Антон Степаненко
Subject could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes
Date
Msg-id 1208551308255001@web137.yandex.ru
Responses Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes
List pgsql-bugs
Greetings.

First and foremost - sorry for my bad English.
I have a PostgreSQL 9.0.4 installation using streaming replication, 1 master and 10 replicas. I migrated to it about a month ago, from 8.3 and Slony. The built-in replication is wonderful, Slony sucks =), but I have some performance issues. Trying to deal with them, I have experimented with lots of different things. One of them was raising shared_buffers from 8GB to 12GB (I have 24GB in total). After the restart Postgres crashed within about 3 hours with these messages:

[2301-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 34691 in file "base/17931/407169": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2390-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 2242 in file "base/17931/18984": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2302-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 34691 in file "base/17931/407169": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2391-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2392-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2393-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2394-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 25578 in file "base/17931/686571": Bad address
[2394-2] 2011-06-16 17:40:27 UTC STATEMENT: [some statement]
[4-1] 2011-06-16 17:40:27 UTC LOG:  startup process (PID 15292) was terminated by signal 7: Bus error
[5-1] 2011-06-16 17:40:27 UTC LOG:  terminating any other active server processes
[858-1] 2011-06-16 17:40:27 UTC WARNING:  terminating connection because of crash of another server process
[858-2] 2011-06-16 17:40:27 UTC DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2429-1] 2011-06-16 17:40:27 UTC WARNING:  terminating connection because of crash of another server process
[858-3] 2011-06-16 17:40:27 UTC HINT:  In a moment you should be able to reconnect to the database and repeat your command.
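For reference, the shared_buffers change described above amounts to a single line in postgresql.conf, roughly like this (just a sketch; other settings not shown):

    shared_buffers = 12GB        # was 8GB before this experiment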

Despite "In a moment you should be able to reconnect to the database and repeat your command", Postgres did not restart by itself.
Signal 7 usually means hardware problems. But all 10 replicas crashed within 10 minutes of each other, say from 13:35 to 13:45, and when shared_buffers was set to 8GB I never experienced such troubles.
I checked /proc/sys/kernel/shmmax - 16GB, and /proc/sys/kernel/shmall - 4194304 (pages, which with 4kB pages is also 16GB).
I tried setting vm.overcommit_memory=2 and vm.overcommit_ratio=90 - no effect. Then I tried vm.overcommit_memory=1 - still no effect.
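In shell terms, those checks and changes were roughly the following (a sketch; kernel.shmall is counted in pages, so 4194304 pages at 4kB matches the 16GB shmmax, and ipcs -m is just one way to see how large a segment Postgres actually allocated):

    cat /proc/sys/kernel/shmmax     # 16GB here
    cat /proc/sys/kernel/shmall     # 4194304 pages = 16GB with 4kB pages
    ipcs -m                         # size of the shared memory segment Postgres got

    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=90
    # and later, also with no effect:
    sysctl -w vm.overcommit_memory=1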
Here are vmstat logs (vmstat 10 > vmstat.log):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0  5308644      0      0    0    0   508   133    0  743  2  1 97  0
 0  0      0  5279436      0      0    0    0   620   307    0 1016  3  0 97  0
 1  0      0  5266384      0      0    0    0   242   223    0 1104  8  1 92  0
 1  0      0  5260544      0      0    0    0   359   287    0  959  2  0 98  0
 0  0      0  5245764      0      0    0    0   877   311    0  891  3  0 97  0
 0  0      0 24378380      0      0    0    0   454   314    0 1011  2  7 91  0
 0  0      0 24378380      0      0    0    0     0    69    0  118  0  0 100 0
 0  0      0 24378380      0      0    0    0     0     2    0   99  0  0 100 0
 0  0      0 24378380      0      0    0    0     0     1    0  180  0  0 100 0
 0  0      0 24378024      0      0    0    0     0     1    0  116  0  0 100 0
 0  0      0 24378024      0      0    0    0     0   342    0  126  0  0 100 0

So you can see that Postgres dies when there are still about 5GB of free memory. In one of my experiments Postgres worked for about 30 minutes with 0kB free, so I don't think the problem is a lack of memory.
I thought about bad indexes and reindexing. In messages like "could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes", YYYYY is my database and ZZZZZZ somehow relates to a table or an index - it may be the relfilenode of an index or the oid of a table. The affected indexes and tables can be of any type; there is no regularity. I haven't tried to reindex. After all 10 replicas went down, I ran my experiments on only one of them, while reindexing would affect the whole installation, which is in production and under high load.
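In case it is useful, ZZZZZZ can be mapped back to a relation with a query along these lines (a sketch; "mydb" stands for the database whose OID is 17931, 407169 is one of the file numbers from the log above, and relkind 'r' means a table, 'i' an index):

    psql -d mydb -c "SELECT oid, relname, relkind FROM pg_class WHERE relfilenode = 407169 OR oid = 407169;"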
One important thing - all replicas and the master are running on OpenVZ, Ubuntu Lucid, kernel 2.6.32, and there is no way to drop the virtualization (it is a long story =)).

Please, I do not want to discuss my decision to set shared_buffers to 12GB, or PostgreSQL optimization in general. I just want to understand why I'm getting these errors.
I will appreciate any help. Thank you in advance.

