Thread: Infamous 'could not read block ... in file "...": read only 0 of 8192 bytes' again

One of the servers I support produced the following error two days ago:

ERROR:  could not read block 135 in file "base/16404/118881486": read only 0 of 8192 bytes

Server version is 9.0.6. No database or server crashes, and no recovery, have happened on that server since setup.
The server is the database backend for a very large web project under quite a heavy load (10k+ requests per second to the database).
No other strange errors have been found in the logs since the server was put into production half a year ago.

Here is what happened (from the DB log):

2012-02-17 22:35:58 MSK 14333 [vxid:340/18229999 txid:2341883282] [DELETE] LOG:  duration: 5.669 ms  execute <unnamed>: delete from agency_statistics
2012-02-17 22:35:58 MSK 14333 [vxid:340/18230000 txid:0] [BIND] LOG:  duration: 2.787 ms  bind <unnamed>: insert into agency_statistics (employer_id, area_id, area_path, professional_area_id, vacancies_number) select e.employer_id, vb.area_id, a.path, s.professional_area_id, count(distinct v.vacancy_id) from vacancy v inner join employer e on v.employer_id = e.employer_id inner join vacancy_body vb on v.vacancy_body_id = vb.vacancy_body_id inner join vacancy_body_specialization vs on vb.vacancy_body_id = vs.vacancy_body_id inner join specialization s on s.specialization_id = vs.specialization_id inner join area a on vb.area_id = a.area_id where v.archived_status in (0) and v.disabled = false and e.category = $1 and e.state = $2 and e.manager_id > 0 group by e.employer_id, vb.area_id, a.path, s.professional_area_id
2012-02-17 22:35:59 MSK 14333 [vxid:340/18230000 txid:2341883336] [INSERT] LOG:  duration: 726.700 ms  execute <unnamed>: insert into agency_statistics (employer_id, area_id, area_path, professional_area_id, vacancies_number) select e.employer_id, vb.area_id, a.path, s.professional_area_id, count(distinct v.vacancy_id) from vacancy v inner join employer e on v.employer_id = e.employer_id inner join vacancy_body vb on v.vacancy_body_id = vb.vacancy_body_id inner join vacancy_body_specialization vs on vb.vacancy_body_id = vs.vacancy_body_id inner join specialization s on s.specialization_id = vs.specialization_id inner join area a on vb.area_id = a.area_id where v.archived_status in (0) and v.disabled = false and e.category = $1 and e.state = $2 and e.manager_id > 0 group by e.employer_id, vb.area_id, a.path, s.professional_area_id
2012-02-17 22:35:59 MSK 14333 [vxid:340/18230070 txid:0] [BIND] LOG:  duration: 30.195 ms  bind S_35: insert into agency_statistics (employer_id, area_id, area_path, professional_area_id, vacancies_number, rank) values ($1, $2, $3, $4, $5, 0)
2012-02-17 22:35:59 MSK 14333 [vxid:340/18230082 txid:2341883427] [INSERT] ERROR:  could not read block 135 in file "base/16404/118881486": read only 0 of 8192 bytes
2012-02-17 22:35:59 MSK 14333 [vxid:340/18230082 txid:2341883427] [INSERT] STATEMENT:  insert into agency_statistics (employer_id, area_id, area_path, professional_area_id, vacancies_number, rank) values ($1, $2, $3, $4, $5, 0)

No concurrent writes to that table happened outside of that connection.
Maybe autovacuum fired on that table between the delete and the first or second insert (unfortunately it wasn't logged).

I recreated the table from scratch and kept the damaged table under another name (via alter table agency_statistics rename to agency_statistics_old),
so I have the files to dig into.
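
In SQL terms the preservation looked roughly like this; only the rename is quoted above, and the create statement is just a sketch of the idea (a LIKE-based copy, not necessarily how it was actually done):

alter table agency_statistics rename to agency_statistics_old;
create table agency_statistics
    (like agency_statistics_old including defaults including constraints including indexes);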

What I see in file system:
hh=# SELECT relfilenode from pg_class where relname='agency_statistics_old';
 relfilenode
-------------
   118881486
postgres@db10:~/tmp$ ls -la /var/lib/postgresql/9.0/main/base/16404/118881486
-rw------- 1 postgres postgres 0 2012-02-20 12:04 /var/lib/postgresql/9.0/main/base/16404/118881486

So the table file is zero bytes long (it seems autovacuum truncated the table to 0 bytes).

But the index shows:
hh=#  SELECT relfilenode from pg_class where relname='agency_statistics_pkey';
 relfilenode
-------------
   118881489
postgres@db10:~/tmp$ ls -la /var/lib/postgresql/9.0/main/base/16404/118881489
-rw------- 1 postgres postgres 4956160 2012-02-20 12:07 /var/lib/postgresql/9.0/main/base/16404/118881489
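
As a convenience (not something run in the thread), the relation, its on-disk path and its size can be cross-checked in one query; pg_relation_filepath() should be available on 9.0:

select relname,
       pg_relation_filepath(oid) as file_path,
       pg_relation_size(oid)     as size_bytes
from pg_class
where relname in ('agency_statistics_old', 'agency_statistics_pkey');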

hh=# SELECT * from pgstatindex('agency_statistics_pkey');
-[ RECORD 1 ]------+--------
version            | 2
tree_level         | 2
index_size         | 4947968
root_block_no      | 295
internal_pages     | 1
leaf_pages         | 601
empty_pages        | 0
deleted_pages      | 1
avg_leaf_density   | 0.45
leaf_fragmentation | 13.14
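
For reference, pgstatindex() comes from the pgstattuple contrib module; on 9.0 it is loaded by running the contrib SQL script (the path below assumes a Debian-style layout and is only a guess), and the heap side can be checked the same way:

-- psql -d hh -f /usr/share/postgresql/9.0/contrib/pgstattuple.sql   (path is an assumption)
select * from pgstatindex('agency_statistics_pkey');
select * from pgstattuple('agency_statistics_old');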

I'm out of ideas now.
Any suggestions on where and what I should look at next?

Kind Regards,
Maksym

--
Maxim Boguk
Senior Postgresql DBA.

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com

LinkedIn profile: http://nz.linkedin.com/in/maximboguk
"If they can send one man to the moon... why can't they send them all?"

МойКруг: http://mboguk.moikrug.ru/
"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."


Maxim Boguk <maxim.boguk@gmail.com> writes:
> One of servers under my support 2 days ago produced the next error:
> ERROR:  could not read block 135 in file "base/16404/118881486": read only
> 0 of 8192 bytes
> ...
> What I see in file system:
> hh=# SELECT relfilenode from pg_class where relname='agency_statistics_old';
>  relfilenode
> -------------
>    118881486
> postgres@db10:~/tmp$ ls -la
> /var/lib/postgresql/9.0/main/base/16404/118881486
> -rw------- 1 postgres postgres 0 2012-02-20 12:04
> /var/lib/postgresql/9.0/main/base/16404/118881486

> So table file size zero bytes (seems autovacuum truncated that table to 0
> bytes).

Hmmm .... something did, but I see no clear evidence that it was
autovacuum.

Do you know why the mod date on the file is 2012-02-20 12:04?  That's
more than two days after the error in your logs, so it's not clear to me
that the current state of the file tells us much about what happened on
the 17th.  If autovacuum had truncated the table then, and the table
wasn't touched otherwise, the file mod date shouldn't have increased.

            regards, tom lane

On Monday, February 20, 2012 5:32:01 pm Tom Lane wrote:
> Maxim Boguk <maxim.boguk@gmail.com> writes:
> > One of servers under my support 2 days ago produced the next error:
> > ERROR:  could not read block 135 in file "base/16404/118881486": read
> > only 0 of 8192 bytes
> > ...
> > What I see in file system:
> > hh=# SELECT relfilenode from pg_class where relname='agency_statistics_old';
> >  relfilenode
> > -------------
> >    118881486
> > postgres@db10:~/tmp$ ls -la /var/lib/postgresql/9.0/main/base/16404/118881486
> > -rw------- 1 postgres postgres 0 2012-02-20 12:04 /var/lib/postgresql/9.0/main/base/16404/118881486
> >
> > So table file size zero bytes (seems autovacuum truncated that table to 0 bytes).
>
> Hmmm .... something did, but I see no clear evidence that it was
> autovacuum.
>
> Do you know why the mod date on the file is 2012-02-20 12:04?  That's
> more than two days after the error in your logs, so it's not clear to me
> that the current state of the file tells us much about what happened on
> the 17th.  If autovacuum had truncated the table then, and the table
> wasn't touched otherwise, the file mod date shouldn't have increased.

If I am following correctly that is due to this:

"I recreated table from scratch and keep the damaged table under another
name (through alter table agency_statistics rename to
agency_statistics_old).
So I have files to dig into."

>
>             regards, tom lane

--
Adrian Klaver
adrian.klaver@gmail.com



> Do you know why the mod date on the file is 2012-02-20 12:04?


Hi,

A cron job attempted to populate the table once per hour after the problem happened, and each time it produced the same error.
So the table did have attempted write activity after the error happened:
(
2012-02-18 00:36:01 MSK 27743 hh.app@hh from [local] [vxid:95/19965879 txid:2342867908] [INSERT] ERROR:  could not read block 171 in file "base/16404/118881486": read only 0 of 8192 bytes
2012-02-18 00:36:01 MSK 27743 hh.app@hh from [local] [vxid:95/19965879 txid:2342867908] [INSERT] STATEMENT:  insert into agency_statistics (employer_id, area_id, area_path, professional_area_id, vacancies_number) select e.employer_id, vb.area_id, a.path, s.professional_area_id, count(distinct v.vacancy_id) from vacancy v inner join employer e on v.employer_id = e.employer_id inner join vacancy_body vb on v.vacancy_body_id = vb.vacancy_body_id inner join vacancy_body_specialization vs on vb.vacancy_body_id = vs.vacancy_body_id inner join specialization s on s.specialization_id = vs.specialization_id inner join area a on vb.area_id = a.area_id where v.archived_status in (0) and v.disabled = false and e.category = $1 and e.state = $2 and e.manager_id > 0 group by e.employer_id, vb.area_id, a.path, s.professional_area_id
...
and so on until Monday when I was informed about that problem.
).

For information, the table has quite a simple structure without any unusual features:

hh=# \d+ agency_statistics_old
                        Table "public.agency_statistics_old"
        Column        |          Type          | Modifiers | Storage  | Description
----------------------+------------------------+-----------+----------+-------------
 employer_id          | integer                | not null  | plain    |
 area_id              | integer                | not null  | plain    |
 area_path            | character varying(255) | not null  | extended |
 professional_area_id | integer                | not null  | plain    |
 vacancies_number     | integer                | not null  | plain    |
 rank                 | integer                |           | plain    |
 normalised_rank      | integer                |           | plain    |
Indexes:
    "agency_statistics_pkey" PRIMARY KEY, btree (employer_id, area_id, professional_area_id) CLUSTER
Has OIDs: no
Options: fillfactor=50
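
The \d+ output above corresponds roughly to the following DDL (a reconstruction from that output, not copied from anywhere):

create table agency_statistics (
    employer_id          integer not null,
    area_id              integer not null,
    area_path            varchar(255) not null,
    professional_area_id integer not null,
    vacancies_number     integer not null,
    rank                 integer,
    normalised_rank      integer,
    constraint agency_statistics_pkey primary key (employer_id, area_id, professional_area_id)
) with (fillfactor = 50);
alter table agency_statistics cluster on agency_statistics_pkey;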


I'm almost sure I will get an error if I try to insert something into agency_statistics_old.
Should I perform that test?

Kind Regards,
Maxim




Maxim Boguk <maxim.boguk@gmail.com> writes:
>> Do you know why the mod date on the file is 2012-02-20 12:04?

> Cron was attempt to populate the table once per hour after that problem
> happened.
> And each time it was produced the same error.

That's interesting ... is there any possibility that the insertions were
attempting to insert values that matched a previously-existing primary
key value?  I'm thinking there's no reason for the INSERT per se to be
touching nonexistent blocks, but if for some reason the pkey index still
had entries pointing at vanished rows (as it seems to) then the errors
could be coming from uniqueness checks attempting to fetch those rows to
see if they're live.

            regards, tom lane
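
One way to test that hypothesis directly against the preserved copy would be a pair of inserts along these lines (the key values here are made up): if the theory is right, a key that existed before should trip the uniqueness check and reproduce the read error, while a key that never existed should go through.

-- expected to fail with the 'could not read block' error if this key was present before the truncation:
insert into agency_statistics_old (employer_id, area_id, area_path, professional_area_id, vacancies_number, rank)
values (1, 1, '.1.', 1, 0, 0);

-- expected to succeed, since this key can never have existed:
insert into agency_statistics_old (employer_id, area_id, area_path, professional_area_id, vacancies_number, rank)
values (-1, -1, 'test', -1, 0, 0);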



On Tue, Feb 21, 2012 at 1:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That's interesting ... is there any possibility that the insertions were
> attempting to insert values that matched a previously-existing primary
> key value?

Hi,

There isn't possibility but close to 100% new inserted values were matched a previously-existing primary
key value.
The table is a hand-made 'materialized view'-type statistics table which is recalculated via cron.

--
Maxim Boguk
Senior Postgresql DBA.



On 02/20/2012 07:46 PM, Maxim Boguk wrote:
> There isn't possibility but close to 100% new inserted values were matched a previously-existing primary
> key value.



Sorry Maxim,
Trying to follow along here: are you saying the inserted records DO or DO NOT match previously existing key values?




On Tue, Feb 21, 2012 at 1:46 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> There isn't possibility but close to 100% new inserted values were matched a previously-existing primary
> key value.


To be clear: the newly inserted values do match previously-existing primary key values almost always.
Sorry for not being clear.

Maxim Boguk <maxim.boguk@gmail.com> writes:
> On Tue, Feb 21, 2012 at 1:46 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
>> There isn't possibility but close to 100% new inserted values were matched
>> a previously-existing primary
>> key value.
>> The table is hand-made 'materialyzed view'-type statistic table which is
>> getting recalculated via cron.

> To be clear - the new inserted values do match a previously-existing
> primary key values almost always.

OK, so that pretty much explains where the visible symptoms are coming
from: somehow, the table got truncated but its pkey index did not get
cleared out.  So an insert creates an empty page zero, inserts a heap
tuple there, tries to insert an index entry.  The btree code sees there
is an index entry for that key already, and tries to fetch the heap
tuple for that index entry to see if it's dead (which would allow the
insertion to proceed).  But the block number the index is pointing at
isn't there, so you get the quoted error message.  The insertion rolls
back, leaving a dead tuple that can be garbage-collected by autovacuum,
after which it truncates the table again --- but of course without
removing any index entries, except maybe one for TID (0,1) if that's
still there.  Lather rinse repeat.

But this brings us little closer to understanding the cause of the
problem.  How could the table have gotten truncated without vacuuming
out the index?

Are you sure the only operations happening on that table are INSERT,
DELETE, autovacuum?  No DDL of any kind?

            regards, tom lane

I wrote:
> OK, so that pretty much explains where the visible symptoms are coming
> from: somehow, the table got truncated but its pkey index did not get
> cleared out.  So an insert creates an empty page zero, inserts a heap
> tuple there, tries to insert an index entry.  The btree code sees there
> is an index entry for that key already, and tries to fetch the heap
> tuple for that index entry to see if it's dead (which would allow the
> insertion to proceed).  But the block number the index is pointing at
> isn't there, so you get the quoted error message.  The insertion rolls
> back, leaving a dead tuple that can be garbage-collected by autovacuum,
> after which it truncates the table again --- but of course without
> removing any index entries, except maybe one for TID (0,1) if that's
> still there.  Lather rinse repeat.

Hmm ... actually there is a point that this theory doesn't explain
entirely.  If the probability of a collision with an existing index
entry was near 100%, then each hourly cron job should only have been
able to insert one or a few heap tuples before failing.  That would
not trigger an autovacuum right away.  Eventually the number of dead
tuples would build up to the point where autovacuum got interested,
but it strains credulity a bit to assume that this happened exactly
after the last hourly run before you renamed the table.  Yet, if that
didn't happen just that way, how come the size of the table is exactly
zero now?

The theory would be more satisfactory if we could expect that an hourly
run would be able to insert some thousands of tuples before failing,
enough to trigger an autovacuum run.  So I'm wondering if maybe the
index is *partially* cleaned out, but not completely.  Does this
materialized view have a fairly predictable number of rows, and if so
how does that compare to the number of entries in the index?  (If you
have no other way to get the number of entries in the index, try
inserting a dummy row, deleting it, and then VACUUM VERBOSE.)

            regards, tom lane
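
Spelled out, the dummy-row trick looks roughly like this (the key values are made up; any key that never existed in the table will do):

insert into agency_statistics_old (employer_id, area_id, area_path, professional_area_id, vacancies_number)
values (-1, -1, 'dummy', -1, 0);
delete from agency_statistics_old where employer_id = -1;
vacuum verbose agency_statistics_old;
-- the INFO line for the index ("now contains N row versions") gives the number of index entries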



> Are you sure the only operations happening on that table are INSERT,
> DELETE, autovacuum?  No DDL of any kind?

Not that I know of. There could be some updates as well, but they are performed by the same cron job as the initial delete, and they didn't happen in this case because the cron script exits on the first database error. I.e. no one, to my knowledge, could write to that table except that single cron job and autovacuum.
And I log all queries with duration over 1 ms.

Kind Regards,
Maksym
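
(For reference, "all queries with duration over 1 ms" corresponds to a setting along the following lines; the actual configuration is not shown in the thread.)

-- in postgresql.conf, value in milliseconds:
--   log_min_duration_statement = 1
-- or per session (requires superuser):
set log_min_duration_statement = 1;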
 







On Tue, Feb 21, 2012 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So I'm wondering if maybe the
> index is *partially* cleaned out, but not completely.  Does this
> materialized view have a fairly predictable number of rows, and if so
> how does that compare to the number of entries in the index?  (If you
> have no other way to get the number of entries in the index, try
> inserting a dummy row, deleting it, and then VACUUM VERBOSE.)

Here are some funny results:

hh=# VACUUM verbose agency_statistics_old;
INFO:  vacuuming "public.agency_statistics_old"
INFO:  index "agency_statistics_pkey" now contains 0 row versions in 605 pages
DETAIL:  0 index row versions were removed.
595 index pages have been deleted, 595 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO:  "agency_statistics_old": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
VACUUM

However, when I try to populate that table with production data I get an error:

hh=# insert into agency_statistics_old select * from agency_statistics;
ERROR:  could not read block 228 in file "base/16404/118881486": read only 0 of 8192 bytes

I.e. the database sees that the index contains zero rows, but an insert still fails.

Maybe I should use the pageinspect addon to see the actual index page contents?

--
Maxim Boguk
Senior Postgresql DBA.






On Tue, Feb 21, 2012 at 3:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So I'm wondering if maybe the
> index is *partially* cleaned out, but not completely.

What makes your idea that the index is *partially* cleaned out, but not completely, highly probable is that the next query produces no error:

hh=# insert into agency_statistics_old select * from agency_statistics limit 1;
INSERT 0 1


--
Maxim Boguk
Senior Postgresql DBA.



Maxim Boguk <maxim.boguk@gmail.com> writes:
> There is some funny results:

> hh=# VACUUM verbose agency_statistics_old;
> INFO:  vacuuming "public.agency_statistics_old"
> INFO:  index "agency_statistics_pkey" now contains 0 row versions in 605
> pages
> DETAIL:  0 index row versions were removed.

Wow.  That seems to blow my theory to small pieces.  If the index
contains no entries then it shouldn't be causing any uniqueness check
probes.  But at the same time, if the index is empty then how come
pgstatindex showed avg_leaf_density = 0.45 ?

> May be I should use pageinspect addon to see an actual index pages content?

That or pg_filedump would be interesting.  But your experiments with
adding data from the other table will probably have produced some new
index entries, which will confuse the situation.  Did you save a
physical copy of the index before that?

Another idea is to attach to the backend with gdb, set a breakpoint at
errfinish, and get a stack trace from the point of the "could not read
block" error.  That would show definitively if this is coming from a
uniqueness check or something else entirely.

            regards, tom lane
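
On the pageinspect idea: with 9.0 it is a contrib module loaded from its SQL script (the path below assumes a Debian-style layout and is only a guess), and the index pages could then be examined along these lines; the block number is only an example:

-- psql -d hh -f /usr/share/postgresql/9.0/contrib/pageinspect.sql   (path is an assumption)
select * from bt_metap('agency_statistics_pkey');           -- root block and tree level
select * from bt_page_stats('agency_statistics_pkey', 2);   -- per-page item counts and flags
select itemoffset, ctid, data
from bt_page_items('agency_statistics_pkey', 2);            -- ctid shows which heap block each entry points at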



On Tue, Feb 21, 2012 at 4:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That or pg_filedump would be interesting.  But your experiments with
> adding data from the other table will probably have produced some new
> index entries, which will confuse the situation.  Did you save a
> physical copy of the index before that?
>
> Another idea is to attach to the backend with gdb, set a breakpoint at
> errfinish, and get a stack trace from the point of the "could not read
> block" error.

Yes, I saved a physical copy of the file before I started playing with it.

Unfortunately that is a production server with no gcc or gdb available, so no pg_filedump or gdb yet (but I'm going to work on it).

While waiting for gdb/gcc on that server, I built pg_filedump on a development server using the same PostgreSQL version and created a pg_filedump of the index file.
It can be downloaded here:  http://maximboguk.com/static/etc/agency_statistics_pkey.pg_filedump.gz

I don't have enough knowledge of the b-tree index structure to extract any useful information from that file.

Kind Regards,
Maksym




Maxim Boguk <maxim.boguk@gmail.com> writes:
> While I waiting for gdb/gcc on that server I had built pg_filedump on the
> development server using same postgresql version and created pg_filedump of
> the index file.
> It can be downloaded there:
> http://maximboguk.com/static/etc/agency_statistics_pkey.pg_filedump.gz

If the index key values are not private information, could we see that
with "pg_filedump -i -f" not just bare?

            regards, tom lane



On Tue, Feb 21, 2012 at 5:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If the index key values are not private information, could we see that
> with "pg_filedump -i -f" not just bare?

Here it is:
http://maximboguk.com/static/etc/agency_statistics_pkey.pg_full_filedump.gz

--
Maxim Boguk
Senior Postgresql DBA.

Maxim Boguk <maxim.boguk@gmail.com> writes:
> On Tue, Feb 21, 2012 at 5:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If the index key values are not private information, could we see that
>> with "pg_filedump -i -f" not just bare?

> There is it:
> http://maximboguk.com/static/etc/agency_statistics_pkey.pg_full_filedump.gz

Thanks.  The index definitely contains about a hundred apparently-live
index entries, which shouldn't be there considering the table is empty,
and it doesn't seem to be corrupted in any obvious way.

However, I noticed that all the pages have btpo_cycleid zero, which
seems a bit improbable given that there are some with BTP_SPLIT_END set,
and the only way for that to become set is if there are adjacent index
pages with different btpo_cycleid values.  This led me to look for bugs
that might reset btpo_cycleid incorrectly, and I believe I see one.
_bt_delitems_delete has this code that was copied-and-pasted from its
previous incarnation as _bt_delitems:

    /*
     * We can clear the vacuum cycle ID since this page has certainly been
     * processed by the current vacuum scan.
     */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    opaque->btpo_cycleid = 0;

Now that comment is a lie; this is called from _bt_findinsertloc not
from vacuum.  So the series of events I'm imagining is

1. Autovacuum starts.

2. A page gets split; the page in question is beyond where autovac has
reached in its index scan, but a page from before that gets recycled to
become the new right half.  Both these pages get marked with the
vacuum's cycleid.

3. Whatever's doing the insertions continues to insert into the same
range of the index.  (I notice that your INSERT query has a GROUP BY
that uses the high columns of your pkey, which means it seems fairly
likely that insertions occur in something close to index order.  So this
is not unlikely.)  It hits a couple of duplicate index entries, finds
they're dead, and hence marks them dead and sets the BTP_HAS_GARBAGE
field on the new right-half page (we need to assume this happens because
the split would've cleared that flag).

4. It keeps on inserting and eventually fills the new right-half
page.  At this point, since BTP_HAS_GARBAGE is set, it runs
_bt_delitems_delete and incorrectly clears btpo_cycleid on that page.

5. Finally, the concurrent vacuum reaches the originally split page.
It notes the btpo_cycleid matching its cycleid and correctly realizes
it has to go back and process the right-half page.  However, when it
reaches that page, it finds btpo_cycleid already zero and concludes it
doesn't need to scan that page.  Result: any entries that should have
been removed, but were not already marked DEAD at the time
_bt_delitems_delete ran, are not removed from the index.

Now obviously I can't prove that this is exactly what happened in your
index.  It would have to have happened several times during a single
vacuum run, I think, because the remaining index entries are spread
across half a dozen index pages.  However, all those pages are
relatively early in the index --- the first such page is block 2, and
the last one block 49, out of a 605-page index.  And we know that your
usage pattern is such that a whole lot of inserts would be going on at
the same time that vacuum had been removing a whole lot of old entries
(and hence causing low-numbered pages to get freed up and returned to
the free space map).  So it seems to fit.  Anyway, clearing
btpo_cycleid there is unquestionably wrong.

            regards, tom lane
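
Given that diagnosis, the leftover entries in the preserved index could be cleared without touching the heap; since the heap is now empty, a rebuild would simply discard them (a recovery sketch, not something done in the thread):

reindex index agency_statistics_pkey;   -- rebuilds the index from the now-empty heap, dropping the dangling entries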