Re: BUG #9741: Mininal case for the BUG #9735: Error: "ERROR: tuple offset out of range: 0" during bitmap scan - Mailing list pgsql-bugs

From Maxim Boguk
Subject Re: BUG #9741: Mininal case for the BUG #9735: Error: "ERROR: tuple offset out of range: 0" during bitmap scan
Date
Msg-id CAK-MWwQe_tihMYEa=O8hp6d14AQnwdmPLpEcgQaUWpx9xPMEcw@mail.gmail.com
Whole thread Raw
In response to Re: BUG #9741: Mininal case for the BUG #9735: Error: "ERROR: tuple offset out of range: 0" during bitmap scan  (Maxim Boguk <maxim.boguk@gmail.com>)
List pgsql-bugs
On Thu, Mar 27, 2014 at 11:27 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:

>
>
>
> On Thu, Mar 27, 2014 at 10:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Maxim Boguk <maxim.boguk@gmail.com> writes:
>> > select * from bt_page_items('transactions_pkey', 88472) where
>> itemoffset in
>> > (93,94);
>> >  itemoffset |    ctid     | itemlen | nulls | vars |          data
>> >
>> ------------+-------------+---------+-------+------+-------------------------
>> >          93 | (2413168,1) |      16 | f     | f    | 7c c2 2c 03 00 00
>> 00 00
>> >          94 | (2337931,0) |      16 | f     | f    | 7c c2 2c 03 00 00
>> 00 00
>>
>> This is sort of what I expected.  Surely reindexing the index will get
>> rid of that bogus entry?
>>
>>                         regards, tom lane
>>
>
> Yep reindexing that particular index fixed problem.
> But this index had been reindexed just 2 days ago in the same situation,
> and now problem reappear (with different table row/index entries).
>
> If problem reappear after the last reindex and with 9.3.4 version I will
> investigate future.
>
>
> It would be interesting know how simple integer btree index could have
> entry with definitely wrong ctid?
> It very strange for problem reappear with the only one table (there are
> couple other tables in the database with similar load but with no issues at
> all), it looks too selective and repeatable for a hardware glitch.
>

Hi,

Situation reappeared just in 3 hours after reindex (and no database
crash/restart happens during that period).
Problem found via manual search of the potentially troublesome id's:

Step 1:
search potentially suspicious index entry/id's (with index ctid have offset
zero):
select *,(get_byte(val, 0))+(get_byte(val, 1)<<8)+(get_byte(val,
2)<<16)+(get_byte(val, 3)<<24) as id from (select
gs.i,t.*,(E'\\x'||replace(trim(trailing ' 0' from data), ' ', ''))::bytea
as val from (select generate_series(1,90000) as i) as gs,
bt_page_items('transactions_pkey', i) as t where ctid::text like '%,0)') as
_t order by id,i,itemoffset;
   i   | itemoffset |    ctid     | itemlen | nulls | vars |
data           |    val     |    id
-------+------------+-------------+---------+-------+------+-------------------------+------------+----------
 83604 |         60 | (2233938,0) |      16 | f     | f    | df aa 11 03 00
00 00 00 | \xdfaa1103 | 51489503
 83648 |         92 | (2245568,0) |      16 | f     | f    | e7 e9 11 03 00
00 00 00 | \xe7e91103 | 51505639
 83819 |         37 | (2253801,0) |      16 | f     | f    | bc dc 12 03 00
00 00 00 | \xbcdc1203 | 51567804
 83907 |        184 | (2256376,0) |      16 | f     | f    | 1f 5b 13 03 00
00 00 00 | \x1f5b1303 | 51600159
...
Found 25 rows.

Now situation become much worse: the test query (
set enable_indexscan to 0;
select * from transactions where id in (25 rows from previous query result);
) produce the same error on the master database as well as on the replicas.

And reindex index transactions_pkey; doesn't help.
However, filenode of index changed for sure (so reindex happened):

select
*,pg_relation_filenode('transactions_pkey'),pg_relation_filepath('transactions_pkey')
from pg_stat_file(pg_relation_filepath('transactions_pkey'));
   size    |         access         |      modification      |
change         | creation | isdir | pg_relation_filenode |
pg_relation_filepath

-----------+------------------------+------------------------+------------------------+----------+-------+----------------------+----------------------
 760471552 | 2014-03-28 04:02:22+04 | 2014-03-28 04:16:34+04 | 2014-03-28
04:16:34+04 |          | f     |               408909 | base/16424/408909
reindex index transactions_pkey;
REINDEX
select
*,pg_relation_filenode('transactions_pkey'),pg_relation_filepath('transactions_pkey')
from pg_stat_file(pg_relation_filepath('transactions_pkey'));
   size    |         access         |      modification      |
change         | creation | isdir | pg_relation_filenode |
pg_relation_filepath

-----------+------------------------+------------------------+------------------------+----------+-------+----------------------+----------------------
 760438784 | 2014-03-28 04:18:20+04 | 2014-03-28 04:19:39+04 | 2014-03-28
04:19:39+04 |          | f     |               408931 | base/16424/408931
set enable_indexscan to 0;
SET
select id from transactions where id in (51489503,51505639,51567804,... 25
total);
ERROR:  tuple offset out of range: 0

Any suggestions what I can do next and where problem could be?
Now it doesn't seems replication issue... may be some repeatable low level
database corruption? But I have no idea what type corruption could produce
so weird results.

Kind Regards,
Maskym

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #9743: subquery on view is not pulling up.
Next
From: Christoph Berg
Date:
Subject: Re: BUG #8139: initdb: Misleading error message when current user not in /etc/passwd