On Thu, Feb 7, 2013 at 11:09 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> While stress testing Pavan's 2nd pass vacuum visibility patch, I realized
> that vacuum/visibility was busted. But it wasn't his patch that busted it.
> As far as I can tell, the bad commit was in the range
> 692079e5dcb331..168d3157032879
>
> Since a run takes 12 to 24 hours, it will take a while to refine that
> interval.
>
> I was testing using the framework explained here:
>
> http://www.postgresql.org/message-id/CAMkU=1xoA6Fdyoj_4fMLqpicZR1V9GP7cLnXJdHU+iGgqb6WUw@mail.gmail.com
>
> Except that I increased JJ_torn_page to 8000, so that autovacuum has a
> chance to run to completion before each crash; and I turned off archive_mode
> as it was not relevant and caused annoying noise. As far as I know,
> crashing is entirely irrelevant to the current problem, but I just used and
> adapted the framework I had at hand.
>
> A tarball of the data directory is available below, for those who would
> like to do a forensic inspection. The table jjanes.public.foo is clearly in
> violation of its unique index.

The xmins of all the duplicate tuples look dangerously close to 2^31.
I wonder if XID wraparound has anything to do with it.
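
For context on why 2^31 matters: as far as I understand it, normal XIDs are compared modulo 2^32, so any given XID has roughly 2^31 XIDs that count as older and 2^31 that count as newer. The standalone sketch below mimics that comparison. The names xid_t, xid_precedes and xid_wraparound_demo are illustrative and not the server's own code, and nothing here shows that wraparound is actually the cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;                 /* stand-in for a 32-bit transaction ID */

/* XIDs 0, 1 and 2 are reserved; normal XIDs start at 3. */
#define FIRST_NORMAL_XID ((xid_t) 3)

/* Returns true if a is logically older than b. */
bool xid_precedes(xid_t a, xid_t b)
{
    /* Reserved XIDs are compared as plain unsigned numbers. */
    if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
        return a < b;

    /* Normal XIDs: the sign of the wrapped 32-bit difference decides. */
    return (int32_t) (a - b) < 0;
}

/* Shows how the answer flips once two XIDs are more than 2^31 apart. */
void xid_wraparound_demo(void)
{
    xid_t xmin = 1000u;

    /* 2^31 - 1 transactions later: xmin is still seen as older. */
    assert(xid_precedes(xmin, xmin + 2147483647u));

    /* 2^31 + 1 transactions later: xmin now appears to be in the future. */
    assert(!xid_precedes(xmin, xmin + 2147483649u));
}
```

If a tuple's xmin ended up on the wrong side of that boundary without having been frozen, its visibility would flip, which is the kind of thing I have in mind.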
Index scans do not return any duplicates; you need to force a seq
scan to see them. Suspecting that the page-level VM bit might be
corrupted, I tried to REINDEX the table to see if it would complain of
unique key violations, but that crashes the server with:
TRAP: FailedAssertion("!(((bool) ((root_offsets[offnum - 1] !=
((OffsetNumber) 0)) && (root_offsets[offnum - 1] <= ((OffsetNumber)
(8192 / sizeof(ItemIdData)))))))", File: "index.c", Line: 2482)
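
Reading the failed assertion: it looks like an expanded offset-validity check, i.e. the root offset recorded for the tuple at offnum must be non-zero and no larger than 8192 / sizeof(ItemIdData). A minimal standalone sketch of that condition follows, assuming sizeof(ItemIdData) is 4 bytes (so the upper bound is 2048). The type and function names are illustrative, not taken from index.c.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE   8192u                        /* the 8192 in the assertion text */
#define ITEM_ID_SIZE 4u                           /* ASSUMED sizeof(ItemIdData) */
#define MAX_OFFSET   (BLOCK_SIZE / ITEM_ID_SIZE)  /* 2048 under that assumption */

typedef uint16_t offset_number_t;                 /* stand-in for OffsetNumber */

/* An offset is valid if it is non-zero and within the per-page maximum. */
bool offset_is_valid(offset_number_t off)
{
    return off != 0 && off <= MAX_OFFSET;
}

/*
 * Mirrors the asserted condition: the root offset stored for the tuple
 * at 1-based position offnum must itself be a valid offset.
 */
bool root_offset_ok(const offset_number_t *root_offsets, offset_number_t offnum)
{
    return offset_is_valid(root_offsets[offnum - 1]);
}
```

So the TRAP would mean that some tuple visited during the index build had no valid root offset recorded for it (a zero entry, or one past the maximum), assuming I am reading the macro expansion correctly.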
Will look more into it, but thought this might be useful for others to
spot the problem.
Thanks,
Pavan
P.S. BTW, you would need to connect as user "jjanes" to a database
"jjanes" to see the offending table.
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee