Thread: invalid memory alloc request size
Yesterday I had a problem on a 64-bit 9.1.1 install:

# select version();
version
----------------------------------------------------------------------------------------------------------------
PostgreSQL 9.1.1 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit
(1 row)

The logs showed this anomaly:

2011-12-25T19:33:18+00:00 pgdb2-vpc postgres[27546]: [74474-1] ERROR: invalid memory alloc request size 18446744073709551613
2011-12-25T19:33:18+00:00 pgdb2-vpc postgres[27546]: [74474-2] STATEMENT: SELECT * FROM "asset_user_accesses" WHERE ("asset_user_accesses"."asset_code"= 'assignments:course_141208' AND "asset_user_accesses"."user_id" = 618503) LIMIT 1;

Googling around, it sounds like this is often due to table corruption, which would be unfortunate, but usually seems to be repeatable. I can re-run that query without issue, and in fact can select * from the entire table without issue. I do see the row was updated a few minutes after this error, so is it wishful thinking that vacuum came around and successfully removed the old, corrupted row version?
On Dec 26, 2011, at 8:08 AM, Ben Chobot wrote:

> Yesterday I had a problem on a 64-bit 9.1.1 install:
>
> # select version();
> version
> ----------------------------------------------------------------------------------------------------------------
> PostgreSQL 9.1.1 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit
> (1 row)
>
> The logs showed this anomaly:
>
> 2011-12-25T19:33:18+00:00 pgdb2-vpc postgres[27546]: [74474-1] ERROR: invalid memory alloc request size 18446744073709551613
> 2011-12-25T19:33:18+00:00 pgdb2-vpc postgres[27546]: [74474-2] STATEMENT: SELECT * FROM "asset_user_accesses" WHERE ("asset_user_accesses"."asset_code"= 'assignments:course_141208' AND "asset_user_accesses"."user_id" = 618503) LIMIT 1;
>
> Googling around, it sounds like this is often due to table corruption, which would be unfortunate, but usually seems to be repeatable. I can re-run that query without issue, and in fact can select * from the entire table without issue. I do see the row was updated a few minutes after this error, so is it wishful thinking that vacuum came around and successfully removed the old, corrupted row version?

It also happens that 18446744073709551613 is what -3 looks like in 64-bit two's complement when it is read as an unsigned value. Is it possible that -3 was some error return code that got cast and then passed directly to malloc()?
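
The arithmetic in the message above is easy to verify. The following is a minimal Python sketch, not anything taken from PostgreSQL; the helper names `as_unsigned64` and `as_signed64` are made up for illustration. It reinterprets a signed 64-bit value as unsigned and back again.

```python
MASK64 = (1 << 64) - 1  # 64 bits, all set


def as_unsigned64(value: int) -> int:
    """Reinterpret a signed 64-bit integer as unsigned (two's complement)."""
    return value & MASK64


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as signed (two's complement)."""
    value &= MASK64
    # Values with the top bit set represent negative numbers.
    return value - (1 << 64) if value >= (1 << 63) else value


print(as_unsigned64(-3))                  # 18446744073709551613
print(as_signed64(18446744073709551613))  # -3
```

So the number in the log is indeed 2^64 - 3, which is consistent with a small negative value having been converted to an unsigned size somewhere along the way. The sketch says nothing about where that conversion happened.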
On 27.12.2011 18:34, Ben Chobot wrote:
> On Dec 26, 2011, at 8:08 AM, Ben Chobot wrote:
>
>> Yesterday I had a problem on a 64-bit 9.1.1 install:
>>
>> # select version();
>> version
>> ----------------------------------------------------------------------------------------------------------------
>> PostgreSQL 9.1.1 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit
>> (1 row)
>>
>> The logs showed this anomaly:
>>
>> 2011-12-25T19:33:18+00:00 pgdb2-vpc postgres[27546]: [74474-1] ERROR: invalid memory alloc request size 18446744073709551613
>> 2011-12-25T19:33:18+00:00 pgdb2-vpc postgres[27546]: [74474-2] STATEMENT: SELECT * FROM "asset_user_accesses" WHERE ("asset_user_accesses"."asset_code"= 'assignments:course_141208' AND "asset_user_accesses"."user_id" = 618503) LIMIT 1;
>>
>> Googling around, it sounds like this is often due to table corruption, which would be unfortunate, but usually seems to be repeatable. I can re-run that query without issue, and in fact can select * from the entire table without issue. I do see the row was updated a few minutes after this error, so is it wishful thinking that vacuum came around and successfully removed the old, corrupted row version?
>
> It also happens that 18446744073709551613 is -3 in 64-bit 2's complement if it was unsigned. Is it possible that -3 was some error return code that got cast and then passed directly to malloc()?

That's not likely. Corruption is usually the cause, when it hits the varlena header, because that's where the length info is stored. In that case PostgreSQL suddenly thinks the varlena field has a negative length (and malloc accepts unsigned integers, so the negative value shows up as a huge number).

Some time ago I wrote an extension that might help you locate where the actual issue is (which block / row / field). Heikki did some review about a month ago, so there's a chance it might work. It's available here:

http://github.com/tvondra/pg_check

Let me know in case of any issues.

regards
Tomas
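
To illustrate how a damaged length header can turn into a huge allocation request, here is a deliberately simplified Python model. Everything in it is an assumption made for illustration: the 4-byte header that counts itself in the stored length, the extra terminator byte, the size cap, and all function names. The real varlena header encoding is more involved, and this does not establish what actually happened on the server in question.

```python
import struct

HEADER_SIZE = 4              # assumed: 4-byte length header, included in the stored length
MAX_ALLOC_SIZE = 0x3FFFFFFF  # assumed cap on a single allocation in this sketch (1 GB - 1)
MASK64 = (1 << 64) - 1


def payload_length(field: bytes) -> int:
    """Read the stored total length and subtract the header size."""
    (total,) = struct.unpack_from("<I", field, 0)  # little-endian unsigned 32-bit
    return total - HEADER_SIZE


def alloc_request(field: bytes) -> int:
    """Bytes requested for the payload plus one terminator, as an unsigned 64-bit size."""
    return (payload_length(field) + 1) & MASK64


def check_alloc(size: int) -> int:
    """Reject requests above the cap, mimicking the wording of the logged error."""
    if size > MAX_ALLOC_SIZE:
        raise MemoryError(f"invalid memory alloc request size {size}")
    return size


healthy = struct.pack("<I", HEADER_SIZE + 5) + b"hello"
zeroed = bytes(HEADER_SIZE) + b"hello"  # header overwritten with zeros

print(check_alloc(alloc_request(healthy)))
try:
    check_alloc(alloc_request(zeroed))
except MemoryError as exc:
    print(exc)
```

In this model a zeroed header gives a payload length of -4, and adding one terminator byte gives -3, which wraps to 18446744073709551613 once it is treated as an unsigned 64-bit size. That is only one arithmetic path that yields the logged number; other kinds of header damage would produce other values.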
On Tue, Dec 27, 2011 at 4:07 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Googling around, it sounds like this is often due to table corruption, which would be unfortunate, but usually seems to be repeatable. I can re-run that query without issue, and in fact can select * from the entire table without issue. I do see the row was updated a few minutes after this error, so is it wishful thinking that vacuum came around and successfully removed the old, corrupted row version?
>>
>> It also happens that 18446744073709551613 is -3 in 64-bit 2's complement if it was unsigned. Is it possible that -3 was some error return code that got cast and then passed directly to malloc()?
>
> That's not likely. The corruption is usually the cause, when it hits
> varlena header - that's where the length info is stored. In that case
> PostgreSQL suddenly thinks the varlena field has a negative value (and
> malloc accepts unsigned integers).

If the problem truly went away, one likely possibility is that the bad tuple was simply deleted -- occasionally the corruption is limited to a tuple or two but doesn't spill over into the page itself -- in such situations some judicious deletion of rows can get you to a point where you can pull off a dump.

merlin
On 27.12.2011 23:23, Merlin Moncure wrote:
> On Tue, Dec 27, 2011 at 4:07 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> That's not likely. The corruption is usually the cause, when it hits
>> varlena header - that's where the length info is stored. In that case
>> PostgreSQL suddenly thinks the varlena field has a negative value (and
>> malloc accepts unsigned integers).
>
> If the problem truly went away, one likely possibility is that the bad
> tuple was simply deleted -- occasionally the corruption is limited to
> a tuple or two but doesn't spill over into the page itself -- in such
> situations some judicious deletion of rows can get you to a point
> where you can pull off a dump.

Or maybe the record is not read for some other reason ... maybe the table is accessed in a different way and the corrupted column is not checked. Or maybe it does not match the WHERE condition or something.

I've seen cases where the table was accessed sequentially and it was failing (as the column was checked because of the WHERE condition), and then it switched to index scan and it did not fail anymore (because it was not necessary to check the column anymore).

Tomas
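
The effect described above, where the same query fails under one access path and succeeds under another, can be shown with a toy model. The Python sketch below is not PostgreSQL's executor. The `CORRUPT` sentinel, the row layout, and the `seq_scan` and `index_scan` functions are all hypothetical stand-ins, assuming only that a damaged field raises an error at the moment it is decoded.

```python
CORRUPT = object()  # sentinel standing in for a field with a damaged header


def decode(raw):
    """Decode a stored field; touching a damaged one raises."""
    if raw is CORRUPT:
        raise ValueError("invalid memory alloc request size")
    return raw


rows = [
    {"user_id": 1, "asset_code": "a"},
    {"user_id": 2, "asset_code": CORRUPT},  # the damaged row
    {"user_id": 3, "asset_code": "c"},
]

# A tiny "index" on user_id: key -> positions of matching rows.
index = {}
for pos, row in enumerate(rows):
    index.setdefault(row["user_id"], []).append(pos)


def seq_scan(rows, user_id, asset_code):
    """Visit every row and evaluate the asset_code condition on each one."""
    return [r for r in rows
            if decode(r["asset_code"]) == asset_code and r["user_id"] == user_id]


def index_scan(rows, index, user_id, asset_code):
    """Fetch only the rows the user_id index points at, then check asset_code."""
    return [rows[i] for i in index.get(user_id, [])
            if decode(rows[i]["asset_code"]) == asset_code]


print(index_scan(rows, index, 3, "c"))  # never touches the damaged row
try:
    seq_scan(rows, 3, "c")              # decodes asset_code on every row
except ValueError as exc:
    print("seq scan failed:", exc)
```

In this model the index lookup succeeds because the damaged field is never decoded, while the sequential filter fails as soon as it reaches the damaged row. A query that runs cleanly is therefore weak evidence that the corruption is gone.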