Thread: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

From
Антон Степаненко
Date:
Greetings.

First and foremost, sorry for my bad English.
I have a PostgreSQL 9.0.4 installation using streaming replication, with 1 master and 10 replicas. I migrated to it about a
month ago, from 8.3 and Slony. Built-in replication is wonderful, Slony sucks =), but I have some performance issues. To
deal with them I've tried lots of different things. One of them was raising shared_buffers from 8GB to 12GB (I have 24GB
in total). After a restart, postgres crashed about 3 hours later with these messages:

[2301-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 34691 in file "base/17931/407169": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2390-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 2242 in file "base/17931/18984": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2302-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 34691 in file "base/17931/407169": read only 160 of 8192 bytes
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2391-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2392-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2393-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 19926 in file "base/17931/686609": Bad address
[2394-2] 2011-06-16 17:40:26 UTC STATEMENT: [some statement]
[2394-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 25578 in file "base/17931/686571": Bad address
[2394-2] 2011-06-16 17:40:27 UTC STATEMENT: [some statement]
[4-1] 2011-06-16 17:40:27 UTC LOG:  startup process (PID 15292) was terminated by signal 7: Bus error
[5-1] 2011-06-16 17:40:27 UTC LOG:  terminating any other active server processes
[858-1] 2011-06-16 17:40:27 UTC WARNING:  terminating connection because of crash of another server process
[858-2] 2011-06-16 17:40:27 UTC DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[2429-1] 2011-06-16 17:40:27 UTC WARNING:  terminating connection because of crash of another server process
[858-3] 2011-06-16 17:40:27 UTC HINT:  In a moment you should be able to reconnect to the database and repeat your command.

Despite the hint "In a moment you should be able to reconnect to the database and repeat your command", postgres did
not restart by itself.
Signal 7 (bus error) usually means hardware problems. But all 10 replicas crashed within 10 minutes, say from 13:35 to
13:45. And when shared_buffers was set to 8GB I never experienced such trouble.
I checked /proc/sys/kernel/shmmax - 16GB, /proc/sys/kernel/shmall - 4194304.
I tried setting vm.overcommit_memory=2 and vm.overcommit_ratio=90 - no effect. Then I tried vm.overcommit_memory=1 - no
effect.
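
The two kernel limits quoted above use different units: shmmax is in bytes, shmall is in pages. A minimal sketch of the arithmetic follows. It assumes a 4096-byte page size (not stated in this thread) and reads "16Gb" and "12Gb" as GiB; the function name is hypothetical.

```python
# Compare the SysV shared memory limits from the post with the requested
# shared_buffers. PAGE_SIZE is an assumption (4 KiB is typical on x86-64).
PAGE_SIZE = 4096
GIB = 1024 ** 3

shmmax_bytes = 16 * GIB          # /proc/sys/kernel/shmmax, in bytes (assumed GiB)
shmall_pages = 4194304           # /proc/sys/kernel/shmall, in pages
shared_buffers_bytes = 12 * GIB  # requested setting (assumed GiB)


def shm_limits_ok(request_bytes, shmmax, shmall, page_size=PAGE_SIZE):
    """Return True if one segment of request_bytes fits under both limits."""
    return request_bytes <= shmmax and request_bytes <= shmall * page_size


shmall_bytes = shmall_pages * PAGE_SIZE
print(shmall_bytes // GIB)  # system-wide limit expressed in GiB
print(shm_limits_ok(shared_buffers_bytes, shmmax_bytes, shmall_pages))
```

Under these assumptions both limits come out at 16 GiB, so they would not be what constrains a 12GB setting. The real segment is somewhat larger than shared_buffers alone, and these numbers say nothing about limits imposed by a container.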
Here are vmstat logs (vmstat 10 > vmstat.log):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy  id wa
 0  0      0  5308644      0      0    0    0   508   133    0  743  2  1  97  0
 0  0      0  5279436      0      0    0    0   620   307    0 1016  3  0  97  0
 1  0      0  5266384      0      0    0    0   242   223    0 1104  8  1  92  0
 1  0      0  5260544      0      0    0    0   359   287    0  959  2  0  98  0
 0  0      0  5245764      0      0    0    0   877   311    0  891  3  0  97  0
 0  0      0 24378380      0      0    0    0   454   314    0 1011  2  7  91  0
 0  0      0 24378380      0      0    0    0     0    69    0  118  0  0 100  0
 0  0      0 24378380      0      0    0    0     0     2    0   99  0  0 100  0
 0  0      0 24378380      0      0    0    0     0     1    0  180  0  0 100  0
 0  0      0 24378024      0      0    0    0     0     1    0  116  0  0 100  0
 0  0      0 24378024      0      0    0    0     0   342    0  126  0  0 100  0

So you can see that postgres died while about 5GB of memory was still free. In one of my experiments postgres worked for
about 30 minutes with 0kB of free memory. So I think the problem is not a lack of memory.
I thought about bad indexes and a reindex. In messages like "could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read
only 160 of 8192 bytes", YYYYY is my database and ZZZZZZ somehow relates to a table or to an index. It may be the
relfilenode of an index or the oid of a table. The indexes and tables involved are of all kinds; there is no pattern. I
haven't tried to reindex after all 10 replicas went down: I ran all my experiments on only one of them, while reindexing
would affect the whole installation, which is in production and under high load.
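
To see which objects are involved, the numbers can be pulled out of the log lines mechanically. A minimal sketch, assuming the message shapes shown in the log above; the function and pattern names are hypothetical.

```python
import re

# Matches both message shapes seen in the log: a short read and an errno text.
# The optional ".N" suffix covers segment files of relations larger than 1 GB.
ERROR_RE = re.compile(
    r'could not read block (?P<block>\d+) in file '
    r'"base/(?P<db_oid>\d+)/(?P<relfilenode>\d+)(?:\.(?P<segment>\d+))?": '
    r'(?P<detail>.+)'
)
SHORT_READ_RE = re.compile(r'read only (\d+) of (\d+) bytes')


def parse_read_error(line):
    """Extract block number, database OID and relfilenode from a log line."""
    m = ERROR_RE.search(line)
    if m is None:
        return None
    info = {
        "block": int(m.group("block")),
        "db_oid": int(m.group("db_oid")),
        "relfilenode": int(m.group("relfilenode")),
        "detail": m.group("detail").strip(),
    }
    short = SHORT_READ_RE.search(info["detail"])
    if short:
        info["bytes_read"] = int(short.group(1))
    return info


line = ('[2301-1] 2011-06-16 17:40:26 UTC ERROR:  could not read block 34691 '
        'in file "base/17931/407169": read only 160 of 8192 bytes')
print(parse_read_error(line))
```

The extracted values can then be looked up in the catalogs: pg_database.oid for the directory name, and pg_class.relfilenode (with relname and relkind) for the file name, queried inside that database.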
One important thing - all replicas and the master are running on OpenVZ (Ubuntu Lucid, kernel 2.6.32), and there is no
way to drop virtualization (it is a long story =))

Please, I do not want to discuss my decision to set buffers to 12GB, or PostgreSQL optimization in general. I just want
to understand why I'm getting such errors.
I would appreciate any help. Thank you in advance.


***** **********<zlobnynigga@yandex.ru> wrote:

> [4-1] 2011-06-16 17:40:27 UTC LOG:  startup process (PID 15292)
> was terminated by signal 7: Bus error

> Signal 7 means  hardware problems. But all 10 replicas crashed
> within 10 minutes, say from 13:35 to 13:45.

> One important thing - all replicas and master are running on
> openvz

Were the PostgreSQL clusters sharing any hardware?

> there is no way to reject virtualization (it is a long story =))
>
> Please, I do not want to discuss my decision to set buffers to
> 12Gb and postgresql optimization at all. I just want to undestand
> why I'm getting such errors.

On the face of it, the most likely cause would seem to be hardware
or the virtual environment.  Without knowing more about the exact
messages on the replicas and how they compared to each other and the
master it's hard to know whether any of the replica failures were
from passing corrupted data from the master to the replicas, versus
having a common hardware/vm flaw.

-Kevin

Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

From
Антон Степаненко
Date:

17.06.2011, 00:28, "Kevin Grittner" <Kevin.Grittner@wicourts.gov>:
> ***** **********<zlobnynigga@yandex.ru> wrote:
>
>>  [4-1] 2011-06-16 17:40:27 UTC LOG:  startup process (PID 15292)
>>  was terminated by signal 7: Bus error
>>  Signal 7 means  hardware problems. But all 10 replicas crashed
>>  within 10 minutes, say from 13:35 to 13:45.
>>  One important thing - all replicas and master are running on
>>  openvz
>
> Were the PostgreSQL clusters sharing any hardware?
>
>>  there is no way to reject virtualization (it is a long story =))
>>
>>  Please, I do not want to discuss my decision to set buffers to
>>  12Gb and postgresql optimization at all. I just want to undestand
>>  why I'm getting such errors.
>
> On the face of it, the most likely cause would seem to be hardware
> or the virtual environment.  Without knowing more about the exact
> messages on the replicas and how they compared to each other and the
> master it's hard to know whether any of the replica failures were
> from passing corrupted data from the master to the replicas, versus
> having a common hardware/vm flaw.
>
> -Kevin

I noticed that the crash takes place when shared buffers are almost full, i.e. SELECT SUM(size) FROM adm.buffercache()
returns 11670 about one minute before the crash. Furthermore, last night I set buffers to 11GB, and it is working: no
crash, all buffers are used (11120).
I still do not believe that this is a hardware problem. Each replica and the master run on a dedicated server; no
hardware is shared. There is only PostgreSQL on each server, no other software (just crond, zabbix, atop).
Actually OpenVZ is used only for portability (to easily add new replicas or migrate one of them to a new server).
The messages on the replicas are all the same: "could not read block", then "signal 7". I copy-pasted the error log as
is; that is all I know.
The master did not crash, I think because it processes fewer SELECT queries, so its buffers do not reach the limit.
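
The two buffer counts can be compared directly. This is a rough sketch under two loud assumptions: that adm.buffercache() (the poster's own function, not shown in the thread) reports sizes in MB, and that "Gb" here means GiB. The helper name is hypothetical.

```python
MB_PER_GB = 1024  # assumption: the shared_buffers settings are in GiB


def fill_ratio(used_mb, shared_buffers_gb):
    """Fraction of shared_buffers in use, given usage in MB (assumed unit)."""
    return used_mb / (shared_buffers_gb * MB_PER_GB)


crash_run = fill_ratio(11670, 12)   # 12GB setting, one minute before the crash
stable_run = fill_ratio(11120, 11)  # 11GB setting, no crash

print(f"12GB run before crash: {crash_run:.1%}")
print(f"11GB run, stable:      {stable_run:.1%}")
print("crash point beyond the whole 11GB pool:", 11670 > 11 * MB_PER_GB)
```

If the assumptions hold, the 11GB run is proportionally fuller than the 12GB run was when it crashed, and the crash happened at a usage level the 11GB configuration can never reach. That would fit an absolute threshold somewhere between the two values better than "buffers almost full" as such.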


***** **********<zlobnynigga@yandex.ru> wrote:
> 17.06.2011, 00:28, "Kevin Grittner" <Kevin.Grittner@wicourts.gov>:
>> ***** **********<zlobnynigga@yandex.ru> wrote:
>>
>>>  [4-1] 2011-06-16 17:40:27 UTC LOG:  startup process (PID 15292)
>>>  was terminated by signal 7: Bus error
>>>  Signal 7 means  hardware problems. But all 10 replicas crashed
>>>  within 10 minutes, say from 13:35 to 13:45.
>>>  One important thing - all replicas and master are running on
>>>  openvz

>> On the face of it, the most likely cause would seem to be
>> hardware or the virtual environment.

> I noticed that crash takes place when shared buffers are almost
> full, i.e. SELECT SUM(size)  FROM adm.buffercache() returns 11670
> at about one minute before crash. Furthermore, last night I set
> buffers  to 11Gb, at it is working, no crash, all buffers are used
> (11120).

Well then, in a pinch you could always fall back to using what
works.

> I still do not believe that this is hardware problem.

How would an application cause a bus error?

> Each replica and master runs on dedicated server, no hardware is
> shared.

OK.  If they had been on the same blade chassis or something I would
have suspected hardware.

> There is only postgresql on each server, no any other
> software(just crond, zabbix, atop). Actually openvz is used only
> for portability(easily add new replicas or migrate one of them to
> new server).

Still, it emulates hardware, so you have to consider it a suspect
for any hardware problem -- at least if you want to solve that
problem.

> Master did not crash

Ah, that wasn't clear from the earlier post.  I'm not sure how
significant it is, but it's good to know.

> I think because it processes less SELECT queries, therefore his
> buffers do not reach limit.

In your shoes I would now be trying to construct a test program to
exercise progressively larger allocations of shared memory, and test
them both under openvz and without it.  Well, first I would probably
try loading the master with queries to drive it to use the full
shared_buffers space, *then* move on to the test program.

The relevant question here is why others can successfully use large
shared_buffers settings while you can't.  Something is different in
your environment.  What?
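
A test program of the kind described above could start out like the following minimal sketch. It maps anonymous shared memory with Python's mmap module and writes to every page. Note the limitation: PostgreSQL 9.0 allocates its shared memory through System V shmget, not mmap, so this exercises a related but not identical kernel path. The function names and demo sizes are hypothetical.

```python
import mmap

PAGE = mmap.PAGESIZE


def touch_shared(size_bytes):
    """Map size_bytes of anonymous shared memory and write to every page.

    Returns the number of pages touched. A SIGBUS while touching would kill
    the process, which is itself the result the test is looking for.
    """
    region = mmap.mmap(-1, size_bytes)  # anonymous mapping, shared by default on Unix
    try:
        pages = 0
        for offset in range(0, size_bytes, PAGE):
            region[offset] = 1          # force the kernel to back this page
            pages += 1
        return pages
    finally:
        region.close()


def probe(sizes_mb):
    """Try progressively larger allocations and report each success."""
    for mb in sizes_mb:
        pages = touch_shared(mb * 1024 * 1024)
        print(f"{mb} MB ok ({pages} pages touched)")


if __name__ == "__main__":
    probe([1, 2, 4])  # small demo sizes; step up toward 12 GB on a test box only
```

Running the same script inside and outside the container, with the size list stepped up past the point where the replicas failed, would show whether the failure follows the environment rather than the database.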

-Kevin

On Fri, Jun 17, 2011 at 10:56 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>> I still do not believe that this is hardware problem.
>
> How would an application cause a bus error?

unaligned memory access on risc maybe?  what's this running on?

merlin

Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

From
Антон Степаненко
Date:

17.06.2011, 20:19, "Merlin Moncure" <mmoncure@gmail.com>:
> On Fri, Jun 17, 2011 at 10:56 AM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>
>>>  I still do not believe that this is hardware problem.
>>  How would an application cause a bus error?
>
> unaligned memory access on risc maybe?  what's this running on?
>
> merlin

*****:~$ cat /proc/cpuinfo
processor       : 0
....
processor       : 23
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
stepping        : 2
cpu MHz         : 2400.468
cache size      : 12288 KB
physical id     : 1
siblings        : 12
core id         : 10
cpu cores       : 6
apicid          : 53
initial apicid  : 53
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4799.88
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

*****:~$ cat /proc/meminfo
MemTotal:       24681200 kB
MemFree:         4443356 kB
Cached:                0 kB
Active:                0 kB
Inactive:              0 kB
Active(anon):          0 kB
Inactive(anon):        0 kB
Active(file):          0 kB
Inactive(file):        0 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
AnonPages:             0 kB
Mapped:                0 kB
Shmem:                 0 kB
Slab:                  0 kB
SReclaimable:          0 kB
SUnreclaim:            0 kB

*****:~$ fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
..
Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

*****:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active raid10 sdc3[2] sdd3[3] sda3[0] sdb3[1]
      955770752 blocks super 1.2 64K chunks 2 near-copies [4/4] [UUUU]
      [>....................]  resync =  3.3% (31857408/955770752) finish=77.2min speed=199401K/sec


2011/6/17 Антон Степаненко <zlobnynigga@yandex.ru>:
>
>
> 17.06.2011, 20:19, "Merlin Moncure" <mmoncure@gmail.com>:
>> On Fri, Jun 17, 2011 at 10:56 AM, Kevin Grittner
>> <Kevin.Grittner@wicourts.gov> wrote:
>>
>>>> I still do not believe that this is hardware problem.
>>> How would an application cause a bus error?
>>
>> unaligned memory access on risc maybe?  what's this running on?
>>
>> merlin
>
> *****:~$ cat /proc/cpuinfo
> processor       : 0
> ....
> processor       : 23
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 44
> model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz

hm, I'm wondering if this
(http://us.generation-nt.com/bug-626451-linux-image-mremap-returns-useless-pages-moving-anonymous-shared-mmap-access-causes-sigbus-help-203302832.html)
has anything to do with your problem.

merlin

Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

From
Антон Степаненко
Date:

17.06.2011, 21:24, "Merlin Moncure" <mmoncure@gmail.com>:
> 2011/6/17 Антон Степаненко <zlobnynigga@yandex.ru>:
>
>>  17.06.2011, 20:19, "Merlin Moncure" <mmoncure@gmail.com>:
>>>  On Fri, Jun 17, 2011 at 10:56 AM, Kevin Grittner
>>>  <Kevin.Grittner@wicourts.gov> wrote:
>>>>>   I still do not believe that this is hardware problem.
>>>>   How would an application cause a bus error?
>>>  unaligned memory access on risc maybe?  what's this running on?
>>>
>>>  merlin
>>  *****:~$ cat /proc/cpuinfo
>>  processor       : 0
>>  ....
>>  processor       : 23
>>  vendor_id       : GenuineIntel
>>  cpu family      : 6
>>  model           : 44
>>  model name      : Intel(R) Xeon(R) CPU           E5645  @ 2.40GHz
>
> hm, I'm wondering if this
> (http://us.generation-nt.com/bug-626451-linux-image-mremap-returns-useless-pages-moving-anonymous-shared-mmap-access-causes-sigbus-help-203302832.html)
> has anything to do with your problem.
>
> merlin

Thank you very much, that is a very interesting link. I compiled the test case from it under my Ubuntu Lucid, and it
really causes SIGBUS. But when compiled under CentOS (kernel 2.6.18) it does the same. So I am not sure that this is a
bug.
And even if it is, why does it occur only when buffers are set to 12GB and filled...
I've read some of the PostgreSQL sources, e.g. /src/backend/storage/smgr/md.c:
void
mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
       char *buffer)
{
..
    if (nbytes != BLCKSZ)
    {
        if (nbytes < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read block %u in file \"%s\": %m",
                            blocknum, FilePathName(v->mdfd_vfd))));

        /*
         * Short read: we are at or past EOF, or we read a partial block at
         * EOF.  Normally this is an error; upper levels should never try to
         * read a nonexistent block.  However, if zero_damaged_pages is ON or
         * we are InRecovery, we should instead return zeroes without
         * complaining.  This allows, for example, the case of trying to
         * update a block that was later truncated away.
         */
        if (zero_damaged_pages || InRecovery)
            MemSet(buffer, 0, BLCKSZ);
        else
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
                            blocknum, FilePathName(v->mdfd_vfd),
                            nbytes, BLCKSZ)));
    }
}

This is the only place that reports errors like 'could not read block in file'.
Then I looked at /src/backend/storage/file/fd.c:
int
FileRead(File file, char *buffer, int amount)
{
..
retry:
    returnCode = read(VfdCache[file].fd, buffer, amount);

    if (returnCode >= 0)
        VfdCache[file].seekPos += returnCode;
    else
    {
        /*
         * Windows may run out of kernel buffers and return "Insufficient
         * system resources" error.  Wait a bit and retry to solve it.
         *
         * It is rumored that EINTR is also possible on some Unix filesystems,
         * in which case immediate retry is indicated.
         */
#ifdef WIN32
        ...
#endif
        /* OK to retry if interrupted */
        if (errno == EINTR)
            goto retry;

        /* Trouble, so assume we don't know the file position anymore */
        VfdCache[file].seekPos = FileUnknownPos;
    }

    return returnCode;
}

First, a comment that starts with 'It is rumored' looks suspicious =) But I am not a kernel developer, I am not even a
C++ developer, so I trust the authors.
I've read 'man read' and 'man 7 signal', and they say that syscalls can be interrupted by some signals, including
SIGBUS, but when that happens they should return to normal behaviour:
"the call will be automatically restarted after the signal handler returns if the SA_RESTART flag was used; otherwise
the call will fail with the error EINTR" - from man 7 signal
So as far as I understand, even if postgresql gets signal 7 it should see EINTR and retry immediately. What I am trying
to say is that I do not know why I am getting SIGBUS, but no matter where it comes from, according to the sources
postgresql should just try to read one more time, and one more, and so on until the read succeeds. But I'm not quite
sure what happens first - the SIGBUS or the 'could not read block' error.
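
The "read only 160 of 8192 bytes" case in mdread() is a short read, not a failed one: read() returns a non-negative count, so the EINTR retry in FileRead() is never reached. A minimal sketch of that path follows, using a temporary file and os.pread (Unix only). The helper name is hypothetical and this is not PostgreSQL's code.

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size


def read_block(fd, blocknum, path="<file>"):
    """Read one block; raise if the file ends before a full block."""
    data = os.pread(fd, BLCKSZ, blocknum * BLCKSZ)
    if len(data) != BLCKSZ:
        # No errno is set here: the call succeeded but returned fewer bytes.
        raise IOError(
            f'could not read block {blocknum} in file "{path}": '
            f"read only {len(data)} of {BLCKSZ} bytes"
        )
    return data


# Demo file: one full block followed by 160 stray bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (BLCKSZ + 160))
    demo_path = tmp.name

fd = os.open(demo_path, os.O_RDONLY)
try:
    read_block(fd, 0, demo_path)      # full block, succeeds
    try:
        read_block(fd, 1, demo_path)  # partial block at end of file
    except IOError as exc:
        message = str(exc)
        print(message)
finally:
    os.close(fd)
    os.unlink(demo_path)
```

As for the signal itself, EINTR only arises when a signal handler runs and returns. The log line "terminated by signal 7" shows the startup process was killed by the signal, so it never got back to a retry loop.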


2011/6/17 Антон Степаненко <zlobnynigga@yandex.ru>:

I wonder if you are oversubscribing your memory, and are getting weird
errors when reading data into memory because the pages can't be
reserved to do that.  What happens when you enable overcommit and
attempt to start the server?

merlin

Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

From
Антон Степаненко
Date:
>
> I wonder if you are oversubscribing your memory, and are getting weird
> errors when reading data into memory because the pages can't be
> reserved to do that.  What happens when you enable overcommit and
> attempt to start the server?
>
> merlin

In my first post I wrote that I had tried vm.overcommit_memory=2 with vm.overcommit_ratio=90, and then
vm.overcommit_memory=1, and that neither made any difference.
That means the server starts, works for about 3 hours, and then dies with signal 7 when almost all buffers are filled,
just as with vm.overcommit_memory=0.
I copy-pasted the vmstat log, which shows that there were 5GB of free memory when postgresql died. In one of my
experiments postgresql worked for about half an hour with 0k free memory (at least top and vmstat said so). The absence
of free memory was caused by the fact that the replica had been down for 12 hours or so, and when it started the wal
writer process took a lot of resources. But it was working! With no free memory.
But this is not important. As I noticed, what matters is not how much free memory I have but how full the shared
buffers are. And filling the shared buffers matters only when they are set to 12GB. When they are set to less,
everything works fine.
If I were oversubscribing memory, I would expect to get some "out of memory" error and see 0k free in the top output.
Memory for shared buffers cannot be oversubscribed, because if the kernel did not provide enough shared memory postgres
would not start.
If I am wrong, please explain why and where.


2011/6/17 Антон Степаненко <zlobnynigga@yandex.ru>:

No, that's all correct, but I smell a rat.  Could be the
virtualization software, not sure.  But the problem looks not to be
with postgres...the server is just reporting o/s calls that are
returning with error.

merlin

On 06/17/2011 04:47 PM, Антон Степаненко wrote:
> Memory for shared buffers can not be ovesubscribed - because if kernel
> did not provide enough shared memory postgres will not start.

The block is allocated at once.  But the amount of it that various
client backends end up touching varies as they run, slowly increasing
over time as they access more buffers.  After running for a while, the
individual processes will look like this:

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  2645 gsmith    20   0 12.3g 5.1g 5.1g D   45 32.8  16:59.19 postgres: gsmith pgbench [local] SELECT

Where their virtual memory size becomes slightly larger than shared_buffers.

I tested this out on a Debian system here, set shared_buffers to 12GB
and beat on the server until every one of them was used by clients
(which is proven by how they've mapped the whole memory set in the
above).  It worked fine.

I suspect you're running into some sort of OpenVZ shared memory handling
bug.  The way it handles this is one of the more complicated, and
therefore likely to have odd failure cases, part of the design.  There's
notes at http://wiki.openvz.org/Postgresql_and_shared_memory about
container-specific things to tune here, so maybe there's just a setting
to tweak you've missed so far.  I'm guessing you already went through
that though.

A quick look around shows there are far more regularly reported bugs
like this in OpenVZ than there are in PostgreSQL, and Ubuntu is not
known for bug-free release practices either.  You're probably chasing
after the wrong thing trying to find a database problem here.  Likely to
end up in the same situation as the last one of these I remember:

http://archives.postgresql.org/pgsql-general/2009-10/msg00125.php
http://lists.debian.org/debian-kernel/2010/03/msg00401.html

...waiting for the OpenVZ problem that's the real cause to get fixed and
make its way to your distribution.

In your situation, I'd just use the smaller setting to avoid the known
problem, and try to focus my energy on finding a platform that isn't as
risky to deploy on instead.  Even if that's not your main deployment
one, just having something on real hardware to compare against would be
extremely valuable for isolating the problem here.

(And that's without even considering that setting shared_buffers so high
on Linux is more likely to slow the server than speed it up, which you
said you didn't want to discuss.  Just pointing it out so no one else
gets the wrong idea from your configuration.)

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

Re: could not read block XXXXX in file "base/YYYYY/ZZZZZZ": read only 160 of 8192 bytes

From
Антон Степаненко
Date:

18.06.2011, 09:58, "Greg Smith" <greg@2ndQuadrant.com>:
> I suspect you're running into some sort of OpenVZ shared memory handling
> bug.  The way it handles this is one of the more complicated, and
> therefore likely to have odd failure cases, part of the design.  There's
> notes at http://wiki.openvz.org/Postgresql_and_shared_memory about
> container-specific things to tune here, so maybe there's just a setting
> to tweak you've missed so far.  I'm guessing you already went through
> that though.
>
> A quick look around shows there are far more regularly reported bugs
> like this in OpenVZ than there are in PostgreSQL, and Ubuntu is not
> known for bug-free release practices either.  You're probably chasing
> after the wrong thing trying to find a database problem here.  Likely to
> end up in the same situation as the last one of these I remember:
>
> http://archives.postgresql.org/pgsql-general/2009-10/msg00125.php
> http://lists.debian.org/debian-kernel/2010/03/msg00401.html
>
> ...waiting for the OpenVZ problem that's the real cause to get fixed and
> make its way to your distribution.
>

I finally found out that the problem is in OpenVZ; you were right. I managed to set up one replica on "pure" Linux
without OpenVZ, and it has been working for about a week already with 12GB shared buffers.
I'm not sure what exactly is wrong with OpenVZ, but one thing that bothers me is that I am not running the latest
stable OpenVZ kernel. I will try to compile the latest one (unfortunately it requires a new vzctl) and then run some
tests on it. And if that doesn't help, I'll go to the OpenVZ community mailing lists and pester them =)
Thank you very much for your help, I appreciate it a lot. And sorry again for my bad English.