Re: ERROR: MultiXactId xxxx has not been created yet -- apparent wraparound - Mailing list pgsql-hackers
From | Paul Smith
Subject | Re: ERROR: MultiXactId xxxx has not been created yet -- apparent wraparound
Date |
Msg-id | 5564DDDF.7050302@pscs.co.uk
In response to | Re: ERROR: MultiXactId xxxx has not been created yet -- apparent wraparound (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses | Re: ERROR: MultiXactId xxxx has not been created yet -- apparent wraparound
List | pgsql-hackers
On 26/05/2015 19:47, Alvaro Herrera wrote:
> Paul Smith wrote:
>
>> No, nothing like that. It was just running fine, and then suddenly (at 2am
>> on 23 May) it started throwing up loads of these errors. The DB server
>> wasn't even restarted at that point. It was just working fine, then suddenly
>> wasn't. (The first error was at 02:00:32 BST, then every few minutes after
>> that there's another one).
>
> Another crazy thought is that the host system messed up and overwrote
> part of the table with pages from some other guest system. I have never
> seen a report of such a thing happening.

Yes, that seems unlikely to me, but you never know... While I was trying to
work out how to fix it (after my original email), I did see a pgsql server
crash:

2015-05-26 12:26:30 BST LOG: server process (PID 35493) was terminated by signal 11: Segmentation fault

>> It's running in a Hyper-V guest. We had taken a backup of the VM at 00:34 on
>> 23 May and that looks to be absolutely fine. What I have done now is restore
>> that backup and import the new data which arrived since that backup was
>> made, and it seems OK now. I still have the 'broken' installation in case
>> more information is needed from it. I'd try to get a raw dump of the damaged
>> tuple data if I knew how to find where it is in the relation file...
>
> Well, you could try a binary search to locate the (or a) borked page
> within the table. Or you could write a plpgsql function with an
> EXCEPTION block that reads each tuple, and in case of error reports the
> CTID of the tuple being read. (select * from tab where
> ctid=generated_ctid; EXCEPT WHEN OTHERS THEN raise notice ' .. ',
> generated_ctid).
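For reference, my reading of that suggestion, as a rough and untested sketch
(the function name, the fixed per-page offset cap of 291 for an 8kB block
size, and returning the bad CTIDs as a set are all my own choices):

create or replace function find_bad_tuples(rel regclass)
returns setof tid
language plpgsql
as $$
declare
    nblocks bigint;
    t       tid;
    dummy   record;
begin
    -- number of pages in the relation
    nblocks := pg_relation_size(rel) / current_setting('block_size')::bigint;

    for blk in 0 .. nblocks - 1 loop
        -- 291 is MaxHeapTuplesPerPage for the default 8kB block size;
        -- probing offsets that don't exist just returns no row
        for off in 1 .. 291 loop
            t := format('(%s,%s)', blk, off)::tid;
            begin
                execute format('select * from %s where ctid = $1', rel)
                    into dummy using t;
            exception when others then
                raise notice 'error reading %: %', t, sqlerrm;
                return next t;
            end;
        end loop;
    end loop;
end;
$$;

-- e.g. select find_bad_tuples('messages.msgdata');

(Of course, if a page is bad enough, the select itself may take the backend
down with a segfault like the one above rather than raising a catchable
error.)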
In the backup, for the row which will become broken, I get:

    ctid    |   xmin   |   xmax
------------+----------+----------
 (18722,29) | 23862661 | 23862661

Assuming it hadn't changed in the broken version (from application knowledge,
it's unlikely to change), then on the broken version,

select heap_page_items(get_raw_page('messages.msgdata', 18722));

gives:

(1,2,2,0,,,,,,,,,)
(2,7816,1,376,21857691,21857691,11,"(18722,2)",32784,10646,24,,)
(3,4,2,0,,,,,,,,,)
(4,7568,1,248,21858126,21858126,11,"(18722,4)",32784,10646,24,,)
(5,6,2,0,,,,,,,,,)
(6,7216,1,352,22823803,23871405,41,"(18722,30)",49168,9478,24,,)
(7,8,2,0,,,,,,,,,)
(8,6968,1,248,22825206,22825206,11,"(18722,8)",32784,10646,24,,)
(9,10,2,0,,,,,,,,,)
(10,6720,1,248,22826427,22826427,11,"(18722,10)",32784,10646,24,,)
(11,12,2,0,,,,,,,,,)
(12,6336,1,380,22826899,23935627,1,"(18722,31)",49168,9478,24,,)
(13,0,3,0,,,,,,,,,)
(14,15,2,0,,,,,,,,,)
(15,5400,1,932,23862314,23862314,11,"(18722,15)",32784,10642,24,,)
(16,17,2,0,,,,,,,,,)
(17,5168,1,228,23862333,23862333,11,"(18722,17)",32784,10646,24,,)
(18,19,2,0,,,,,,,,,)
(19,4800,1,368,23862365,23862365,11,"(18722,19)",32784,10646,24,,)
(20,24,2,0,,,,,,,,,)
(21,3112,1,812,1332175474,894779760,223823413,"(1443526721,25142)",12141,18516,57,,)
(22,23,2,0,,,,,,,,,)
(23,4448,1,352,728903474,1466642284,1869042033,"(1262568257,16743)",16751,26455,65,111101101000001011010010100000101010001011110110100000101101001010000010100000000000000000000000000000001010000001110110110000001000000001001000101000010110111000000000000000001100000111100010000000000000000010110010101100101101100000000000110010110010110011000000000000000000000000000000000000000000000000000000000000000000000010100001,)
(24,4184,1,264,619804,228,8000,"(956969542,26694)",22857,29251,98,110001101100001001100110011001100001110010000010001000100000111011110100110100101001011000101100100101100110001001000010101011100101111001010110111100101100001011000110101000101101011000011100010101100100110010011100110100101001101001100110110001101000101000110110100101100100001010011110000010100011011001000010101011001111001010100110010010100111011000101110010010101111010000011010110101101000001000101100101100101000101000001110000110101110001000001100110001100001001001010010000010100101001000011110010010100110110010000110101100000101000011100110110010100011011010110110100110101111010000100110,)
(25,26,2,0,,,,,,,,,)
(26,3928,1,252,1212837441,1232350037,1668772674,"(728910135,11093)",31284,16687,70,1111011010100010100010100100110011001100100000101101001001110110000101101110011001100110101101101001110010011110010010101000110011110100000100100010111011010100011010100001001010110110101100100011001000101010111011000000111011110100110011001011000001010000101101101111010010110110100001101000001010100010011011100100101010110110001100100101011001001110001101101100011001100110,1717791858)
(27,21,2,0,,,,,,,,,)
(28,29,2,0,,,,,,,,,)
(29,2296,1,812,1229271346,1934308693,1733443957,"(1848337707,21358)",31337,21592,88,,1178686785)
(30,1944,1,352,947155032,1380218998,792031842,"(1211650927,22579)",30585,20532,80,,)
(31,1560,1,380,23935627,23935627,1,"(18722,31)",40976,9510,24,,)
(31 rows)

The row for line pointer 29 has 1934308693 as its t_xmax (which is the
MultiXactId reported in the error message when trying to retrieve that row).
I'm struggling to find the definitions for the columns returned by
heap_page_items, but the output from this page (for line pointers 21 to 30)
looks quite different from other neighbouring pages.

If I look at the raw page data, it contains data relating to this
installation, so it doesn't look like a page from a different Hyper-V guest.
It also doesn't look like just random data from a memory error. It actually
looks like legitimate data which should possibly have been written elsewhere.
(I don't want to post it here because it contains possibly confidential data
from one of our customers.)
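On the column definitions: heap_page_items comes from the pageinspect
extension, and its columns seem to be lp, lp_off, lp_flags, lp_len, t_xmin,
t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits and t_oid,
with the t_infomask bits defined in src/include/access/htup_details.h. As a
rough, untested illustration (0x1000 and 0x0800 being HEAP_XMAX_IS_MULTI and
HEAP_XMAX_INVALID from that header), something like this should show which
normal line pointers on the page claim a MultiXactId in xmax:

select lp, lp_flags, t_xmin, t_xmax, t_infomask,
       (t_infomask & x'1000'::int) <> 0 as xmax_is_multi,  -- HEAP_XMAX_IS_MULTI
       (t_infomask & x'0800'::int) <> 0 as xmax_invalid    -- HEAP_XMAX_INVALID
from heap_page_items(get_raw_page('messages.msgdata', 18722))
where lp_flags = 1;  -- LP_NORMAL (0 = unused, 2 = redirect, 3 = dead)

If I'm reading it right, item 29's t_infomask of 21592 (0x5458) does have the
0x1000 bit set, which is consistent with the error complaining about that
MultiXactId.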