Thread: BUG #6200: standby bad memory allocations on SELECT
The following bug has been logged online:

Bug reference:      6200
Logged by:          Daniel Farina
Email address:      daniel@heroku.com
PostgreSQL version: 9.0.4
Operating system:   Ubuntu 10.04
Description:        standby bad memory allocations on SELECT
Details:

A huge thanks to Conrad Irwin of Rapportive for furnishing virtually all the details of this bug report.

The following error occurs up to a couple of times a day on a busy PostgreSQL database:

  ERROR: invalid memory alloc request size 18446744073709551613

The occurrence rate is somewhere around one per tens of millions of queries. The number is always the same (2**64 - 3), and there's no obvious pattern in the distribution of errors (they don't even appear to be correlated with system load). The error has not been recorded on the primary database, even though the same workload is submitted there.

These errors do not reproduce, seeming to evaporate almost immediately on the standby, so durable or long-lasting index corruption is not likely. The problem has persisted across multiple generations of hot standbys on different hardware and sourced from different base backups. At least once, a hot standby was promoted to a primary and the errors seemed to discontinue, but then reappeared on a newly provisioned standby.

The VERSION() string is:

  PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit

The problem is confined to particular access patterns and schema objects, enumerated below. The points2 table looks like:

                                Table "public.points2"
      Column      |            Type             |                      Modifiers
 ------------------+-----------------------------+------------------------------------------------------
 id               | integer                     | not null default nextval('points2_id_seq'::regclass)
 identifier       | text                        | not null
 scope_id         | integer                     | not null
 class_number     | smallint                    | not null
 authorization_id | integer                     | not null
 sum_json         | text                        | not null
 amended_at       | timestamp without time zone | not null
Indexes:
    "points2_pkey" PRIMARY KEY, btree (id)
    "points2_special" btree (identifier_hash(identifier), scope_id, class_number, authorization_id)

CREATE FUNCTION identifier_hash(text) RETURNS bigint
IMMUTABLE LANGUAGE SQL AS $$
  SELECT ('x' || md5($1))::bit(64)::bigint;
$$;

This has only been seen on queries of the form:

SELECT * FROM "points2"
WHERE (identifier_hash(identifier) = identifier_hash('1104131405')
       AND identifier = '1104131405'
       AND scope_id = 0
       AND authorization_id = 0
       AND class_number = 25)

Though the table is accessed similarly frequently by queries of the following form, these do not trigger the problem:

SELECT points2.*
FROM points2
JOIN (VALUES (8585261297509044776, 0, 47, 'ae9064e6f24127c6a1f483cd71e14e64'))
  AS query(hashed_identifier, scope_id, class_number, identifier)
  ON identifier_hash(points2.identifier) = query.hashed_identifier
 AND points2.scope_id = query.scope_id
 AND points2.class_number = query.class_number
 AND points2.identifier = query.identifier;

The table is always updated or inserted into one row at a time (using the "id" primary key for updates), though we sometimes update multiple rows in a single transaction. synchronous_commit is turned off for connections that touch the points2 table on the primary.
"Daniel Farina" <daniel@heroku.com> writes: > A huge thanks to Conrad Irwin of Rapportive for furnishing virtually all the > details of this bug report. This isn't really enough information to reproduce the problem ... > The occurrence rate is somewhere in the one per tens-of-millions of > queries. ... and that statement is going to discourage anyone from even trying, since with such a low occurrence rate it's going to be impossible to be sure whether the setup to reproduce the problem is correct. So if you'd like this to be fixed, you're either going to need to show us exactly how to reproduce it, or investigate it yourself. The way that I'd personally proceed to investigate it would probably be to change the "invalid memory alloc request size" size errors (in src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from ERROR to PANIC so that they'll provoke a core dump, and then use gdb to get a stack trace, which would provide at least a little more information about what happened. However, if you are only able to reproduce it in a production server, you might not like that approach. Perhaps you can set up an extra standby that's only there for testing, so you don't mind if it crashes? regards, tom lane
On 09.09.2011 18:02, Tom Lane wrote:
> The way that I'd personally proceed to investigate it would probably be
> to change the "invalid memory alloc request size" errors (in
> src/backend/utils/mmgr/mcxt.c; there are about four occurrences) from
> ERROR to PANIC so that they'll provoke a core dump, and then use gdb
> to get a stack trace, which would provide at least a little more
> information about what happened. However, if you are only able to
> reproduce it on a production server, you might not like that approach.
> Perhaps you can set up an extra standby that's only there for testing,
> so you don't mind if it crashes?

If that's not possible or doesn't reproduce the issue, there are also functions in glibc for producing a backtrace without aborting the program: https://www.gnu.org/s/libc/manual/html_node/Backtraces.html. I think you could also fork() + abort() to generate a core dump, not just a backtrace.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
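A minimal, self-contained sketch of the glibc facility Heikki mentions (standard execinfo.h API; the function name log_backtrace is just an example, not anything in PostgreSQL):

    #include <execinfo.h>
    #include <unistd.h>

    /* Print the current call stack to stderr without aborting the process. */
    static void
    log_backtrace(void)
    {
        void *frames[64];
        int   nframes = backtrace(frames, 64);

        /* One symbolic frame per line; symbol quality depends on how the
         * binary was built (whether symbols and frame pointers survive). */
        backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
    }

    int
    main(void)
    {
        log_backtrace();        /* demo call */
        return 0;
    }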
On Thu, Sep 8, 2011 at 11:33 PM, Daniel Farina <daniel@heroku.com> wrote:

>  ERROR: invalid memory alloc request size 18446744073709551613

> At least once, a hot standby was promoted to a primary and the errors seem
> to discontinue, but then reappear on a newly-provisioned standby.

So the query that fails uses a btree index on a hot standby. I don't fully accept it as an HS bug, but let's assume that it is and analyse what could cause it.

The MO is certain user queries, only observed in HS. So it might be related to the way we use indexes, or not. There is a single, small difference between how a btree index operates in HS and in "normal" operation, which relates to whether we kill tuples in the index. That's simple code and there are no obvious bugs there, nor even anything that specifically allocates memory. So the only bug that springs to mind is something related to how we navigate HOT chains with/without killed tuples, i.e. the bug is not actually HS related, but is only observed under conditions typical in HS.

HS touches almost nothing else in user space, apart from snapshots. So there could be a bug there also, maybe in CopySnapshot().

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hello,

We upgraded to postgres 9.1.2 two weeks ago, and we are also experiencing an issue that seems very similar to the one reported as bug 6200. We see approximately 2 dozen alloc errors per day across 3 slaves, and we are getting one segfault approximately every 3 days. We did not experience this issue before our upgrade (we were on version 8.4, and used skytools for replication).

We are attempting to get a core dump on segfault (our last attempt did not work due to a config issue for the core dump). We're also attempting to repro the alloc errors on a test setup, but it seems like we may need quite a bit of load to trigger the issue. We're not certain that the alloc issues and the segfaults are "the same issue" - but it seems that it may be, since the OP for bug 6200 sees the same behavior. We have seen no issues on the master; all alloc errors and segfaults have been on the slaves.

We've seen the alloc errors on a few different tables, but most frequently on logins. Rows are added to the logins table one-by-one, and updates generally happen one row at a time. The table is pretty basic, it looks like this...

CREATE TABLE logins
(
  login_id bigserial NOT NULL,
  <snip - a bunch of columns>
  CONSTRAINT logins_pkey PRIMARY KEY (login_id ),
  <snip - some other constraints...>
)
WITH (
  FILLFACTOR=80,
  OIDS=FALSE
);

The queries that trigger the alloc error on this table look like this (we use hibernate, hence the funny underscoring...)

select login0_.login_id as login1_468_0_, l...  from logins login0_ where login0_.login_id=$1

The alloc error in the logs looks like this:

-01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR: invalid memory alloc request size 18446744073709551613

The alloc error is nearly always for size 18446744073709551613 - though we have seen one time where it was a different amount...

We have been in touch with the OP for bug 6200, who said he may have time to help us out a bit on debugging this. It seems like what is being suggested is getting a build of postgres that will capture a stack trace for each alloc issue and/or simply dump core when that happens. As this is a production system, we would prefer the former. As I mentioned above, we're also trying to get a core dump for the segfault.

We are treating this as extremely high priority, as it is currently causing 2 dozen failures for users of our site per day, as well as a few minutes of downtime for the segfault every 3 days. I realize there may be little that the postgres experts can do until we provide more information - but since our use case is really not very complicated here (basic use of HS), and another site is also experiencing it, I figured it would be worth posting about what we're seeing.

Thanks,
-Bridget Frey
Redfin
On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey <bridget.frey@redfin.com> wrote:
> Hello,
> We upgraded to postgres 9.1.2 two weeks ago, and we are also experiencing an
> issue that seems very similar to the one reported as bug 6200.  We see
> approximately 2 dozen alloc errors per day across 3 slaves, and we are
> getting one segfault approximately every 3 days.  We did not experience this
> issue before our upgrade (we were on version 8.4, and used skytools for
> replication).
>
> We are attempting to get a core dump on segfault (our last attempt did not
> work due to a config issue for the core dump).  We're also attempting to
> repro the alloc errors on a test setup, but it seems like we may need quite
> a bit of load to trigger the issue.  We're not certain that the alloc issues
> and the segfaults are "the same issue" - but it seems that it may be, since
> the OP for bug 6200 sees the same behavior.  We have seen no issues on the
> master, all alloc errors and segfaults have been on the slaves.
>
> We've seen the alloc errors on a few different tables, but most frequently
> on logins.  Rows are added to the logins table one-by-one, and updates
> generally happen one row at a time.  The table is pretty basic, it looks
> like this...
>
> CREATE TABLE logins
> (
>   login_id bigserial NOT NULL,
>   <snip - a bunch of columns>
>   CONSTRAINT logins_pkey PRIMARY KEY (login_id ),
>   <snip - some other constraints...>
> )
> WITH (
>   FILLFACTOR=80,
>   OIDS=FALSE
> );
>
> The queries that trigger the alloc error on this table look like this (we
> use hibernate hence the funny underscoring...)
> select login0_.login_id as login1_468_0_, l...  from logins login0_ where
> login0_.login_id=$1
>
> The alloc error in the logs looks like this:
> -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR:
> invalid memory alloc request size 18446744073709551613
>
> The alloc error is nearly always for size 18446744073709551613 - though we
> have seen one time where it was a different amount...

Hmm, that number in hex works out to 0xfffffffffffffffd, which makes it sound an awful lot like the system (for some unknown reason) attempted to allocate -3 bytes of memory.

I've seen something like this once before on a customer system running a modified version of PostgreSQL. In that case, the problem turned out to be page corruption. Circumstances didn't permit determination of the root cause of the page corruption, however, nor was I able to figure out exactly how the corruption I saw resulted in an allocation request like this. It would be nice to figure out where in the code this is happening and put in a higher-level guard so that we get a better error message.

You might want to compile a modified PostgreSQL executable that puts an extremely long sleep (like a year) just before this error is reported. Then, when the system hangs at that point, you can attach a debugger and pull a stack backtrace. Or you could insert an abort() at that point in the code and get a backtrace from the core dump.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
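Robert's arithmetic can be checked with a trivial standalone program (assuming a 64-bit build, where Size/size_t is 8 bytes):

    #include <stdio.h>
    #include <stddef.h>

    int
    main(void)
    {
        long   len = -3;                /* a length of -3 ...                  */
        size_t request = (size_t) len;  /* ... reinterpreted as an unsigned    */
                                        /* allocation size                     */

        /* Prints: 18446744073709551613 0xfffffffffffffffd */
        printf("%zu 0x%zx\n", request, request);
        return 0;
    }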
On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey <bridget.frey@redfin.com> wrote:
> Thanks for the info - that's very helpful.  We had also noted that the alloc
> seems to be -3 bytes.  We have run pg_check and it found no instances of
> corruption. We've also replayed queries that have failed, and have never
> been able to get the same query to fail twice.  In the case you
> investigated, was there permanent page corruption - e.g. could you run the
> same query over and over and get the same result?

Yes. I observed that the infomask bits on several tuples had somehow been overwritten by nonsense. I am not sure whether there were other kinds of corruption as well - I suspect probably so - but that's the only one I saw with my own eyes, courtesy of pg_filedump.

> It really does seem like this is an issue either in Hot Standby or very
> closely related to that feature, where there is temporary corruption of a
> btree index that then disappears.  Our master is not experiencing any malloc
> issues, while the 3 slaves get about a dozen per day, despite similar
> workloads.  We haven't had a slave segfault since we set it up to produce a
> core dump, but we're expecting to have that within the next few days
> (assuming we'll continue to get a segfault every 3-4 days).  We're also
> planning to set up one slave that will panic when it gets a malloc issue, as
> you (and other posters on 6200) had suggested.
>
> Thanks again for the help, and we'll keep you posted as we learn more...

The case I investigated involved corruption on the master, and I think it predated Hot Standby. However, the symptom is generic enough that it seems quite possible that there's more than one way for it to happen. :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for the info - that's very helpful. We had also noted that the alloc seems to be -3 bytes. We have run pg_check and it found no instances of corruption. We've also replayed queries that have failed, and have never been able to get the same query to fail twice. In the case you investigated, was there permanent page corruption - e.g. could you run the same query over and over and get the same result?

It really does seem like this is an issue either in Hot Standby or very closely related to that feature, where there is temporary corruption of a btree index that then disappears. Our master is not experiencing any malloc issues, while the 3 slaves get about a dozen per day, despite similar workloads. We haven't had a slave segfault since we set it up to produce a core dump, but we're expecting to have that within the next few days (assuming we'll continue to get a segfault every 3-4 days). We're also planning to set up one slave that will panic when it gets a malloc issue, as you (and other posters on 6200) had suggested.

Thanks again for the help, and we'll keep you posted as we learn more...
-B

On Fri, Jan 27, 2012 at 6:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 23, 2012 at 3:22 PM, Bridget Frey <bridget.frey@redfin.com> wrote:
> > Hello,
> > We upgraded to postgres 9.1.2 two weeks ago, and we are also experiencing an
> > issue that seems very similar to the one reported as bug 6200.  We see
> > approximately 2 dozen alloc errors per day across 3 slaves, and we are
> > getting one segfault approximately every 3 days.  We did not experience this
> > issue before our upgrade (we were on version 8.4, and used skytools for
> > replication).
> >
> > We are attempting to get a core dump on segfault (our last attempt did not
> > work due to a config issue for the core dump).  We're also attempting to
> > repro the alloc errors on a test setup, but it seems like we may need quite
> > a bit of load to trigger the issue.  We're not certain that the alloc issues
> > and the segfaults are "the same issue" - but it seems that it may be, since
> > the OP for bug 6200 sees the same behavior.  We have seen no issues on the
> > master, all alloc errors and segfaults have been on the slaves.
> >
> > We've seen the alloc errors on a few different tables, but most frequently
> > on logins.  Rows are added to the logins table one-by-one, and updates
> > generally happen one row at a time.  The table is pretty basic, it looks
> > like this...
> >
> > CREATE TABLE logins
> > (
> >   login_id bigserial NOT NULL,
> >   <snip - a bunch of columns>
> >   CONSTRAINT logins_pkey PRIMARY KEY (login_id ),
> >   <snip - some other constraints...>
> > )
> > WITH (
> >   FILLFACTOR=80,
> >   OIDS=FALSE
> > );
> >
> > The queries that trigger the alloc error on this table look like this (we
> > use hibernate hence the funny underscoring...)
> > select login0_.login_id as login1_468_0_, l...  from logins login0_ where
> > login0_.login_id=$1
> >
> > The alloc error in the logs looks like this:
> > -01-12_080925.log:2012-01-12 17:33:46 PST [16034]: [7-1] [24/25934] ERROR:
> > invalid memory alloc request size 18446744073709551613
> >
> > The alloc error is nearly always for size 18446744073709551613 - though we
> > have seen one time where it was a different amount...
>
> Hmm, that number in hex works out to 0xfffffffffffffffd, which makes
> it sound an awful lot like the system (for some unknown reason)
> attempted to allocate -3 bytes of memory.  I've seen something like
> this once before on a customer system running a modified version of
> PostgreSQL.  In that case, the problem turned out to be page
> corruption.  Circumstances didn't permit determination of the root
> cause of the page corruption, however, nor was I able to figure out
> exactly how the corruption I saw resulted in an allocation request
> like this.  It would be nice to figure out where in the code this is
> happening and put in a higher-level guard so that we get a better
> error message.
>
> You might want to compile a modified PostgreSQL executable that puts an
> extremely long sleep (like a year) just before this error is reported.
> Then, when the system hangs at that point, you can attach a debugger
> and pull a stack backtrace.  Or you could insert an abort() at that
> point in the code and get a backtrace from the core dump.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- 
Bridget Frey
Director, Data & Analytics Engineering | Redfin
bridget.frey@redfin.com | tel: 206.576.5894
I work with Bridget at Redfin.

We have a core dump from a once-in-5-days (multi-million queries) hot standby segfault in pg 9.1.2. (It might or might not be the same root issue as the "alloc" errors. If I should file a new bug report, let me know.)

The postgres executable that crashed did not have debugging symbols installed, and we were unable to debug (gdb) the core file using a debug build of postgres. (Symbols didn't match.)

Running gdb against a non-debug postgres executable gave us this stack trace:

[root@query-7 core]# gdb -q -c /postgres/core/query-9.core.19678 /usr/pgsql-9.1/bin/postgres-non-debug
Reading symbols from /usr/pgsql-9.1/bin/postgres-non-debug...(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New Thread 19678]
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffdcd58000
Core was generated by `postgres: datamover stingray_prod 10.11.0.134(54140) SELEC'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000045694c in nocachegetattr ()
(gdb) bt
#0  0x000000000045694c in nocachegetattr ()
#1  0x00000000006f93c9 in ?? ()
#2  0x00000000006fa231 in tuplesort_puttupleslot ()
#3  0x0000000000573ad1 in ExecSort ()
#4  0x000000000055cdda in ExecProcNode ()
#5  0x000000000055bcd1 in standard_ExecutorRun ()
#6  0x0000000000623594 in ?? ()
#7  0x0000000000624ae0 in PortalRun ()
#8  0x00000000006220f2 in PostgresMain ()
#9  0x00000000005e6ba4 in ?? ()
#10 0x00000000005e791c in PostmasterMain ()
#11 0x000000000058b9ae in main ()

We have the (5GB) core file, and are happy to do any more forensics anyone can advise. Please instruct.

I hope this helps point to a root cause and resolution....

Thank you,
Mike Brauwerman
Data Team, Redfin

On Fri, Jan 27, 2012 at 10:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 27, 2012 at 1:31 PM, Bridget Frey <bridget.frey@redfin.com> wrote:
> > Thanks for the info - that's very helpful.  We had also noted that the alloc
> > seems to be -3 bytes.  We have run pg_check and it found no instances of
> > corruption. We've also replayed queries that have failed, and have never
> > been able to get the same query to fail twice.  In the case you
> > investigated, was there permanent page corruption - e.g. could you run the
> > same query over and over and get the same result?
>
> Yes.  I observed that the infomask bits on several tuples had somehow
> been overwritten by nonsense.  I am not sure whether there were other
> kinds of corruption as well - I suspect probably so - but that's the
> only one I saw with my own eyes, courtesy of pg_filedump.
>
> > It really does seem like this is an issue either in Hot Standby or very
> > closely related to that feature, where there is temporary corruption of a
> > btree index that then disappears.  Our master is not experiencing any malloc
> > issues, while the 3 slaves get about a dozen per day, despite similar
> > workloads.  We haven't had a slave segfault since we set it up to produce a
> > core dump, but we're expecting to have that within the next few days
> > (assuming we'll continue to get a segfault every 3-4 days).  We're also
> > planning to set up one slave that will panic when it gets a malloc issue, as
> > you (and other posters on 6200) had suggested.
> >
> > Thanks again for the help, and we'll keep you posted as we learn more...
>
> The case I investigated involved corruption on the master, and I think
> it predated Hot Standby.  However, the symptom is generic enough that
> it seems quite possible that there's more than one way for it to
> happen.  :-(
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- 
Mike Brauwerman
Data Team, Redfin
On 28 January 2012 21:34, Michael Brauwerman <michael.brauwerman@redfin.com> wrote:
> We have the (5GB) core file, and are happy to do any more forensics anyone
> can advise.

Ideally, you'd be able to install debug information packages, which should give a more detailed and useful stack trace, as described here:

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

-- 
 Peter Geoghegan       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training and Services
We did try that with a postgres 9.1.2, compiled from source with debug flags, but we got 0x10 bad address in gdb. (Obviously we did it wrong somehow.)

We will keep trying to get a good set of symbols set up.

On Jan 28, 2012 2:34 PM, "Peter Geoghegan" <peter@2ndquadrant.com> wrote:
> On 28 January 2012 21:34, Michael Brauwerman
> <michael.brauwerman@redfin.com> wrote:
> > We have the (5GB) core file, and are happy to do any more forensics
> > anyone can advise.
>
> Ideally, you'd be able to install debug information packages, which
> should give a more detailed and useful stack trace, as described here:
>
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
>
> --
> Peter Geoghegan       http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training and Services
On Sat, Jan 28, 2012 at 8:45 PM, Michael Brauwerman <michael.brauwerman@redfin.com> wrote:
> We did try that with a postgres 9.1.2, compiled from source with debug
> flags, but we got 0x10 bad address in gdb. (Obviously we did it wrong
> somehow.)
>
> We will keep trying to get a good set of symbols set up.

Hmm. Your backtrace is, on the one hand, unsurprising - because I don't have complete confidence that nocachegetattr won't index off into oblivion given corrupt data on disk - and on the other hand, unconvincing, mostly because of this:

warning: core file may not match specified executable file.

If you're doing a custom build, it should be enough to use --enable-debug and then gdb the core file with the path to the executable you built. If you're using Red Hat, you can just do debuginfo-install package-name, where package-name is probably something like "postgresql-server". But either way, it's critical that the executable you use to pull the backtrace be exactly the same one that was running when the system went belly-up, and it's hard to be confident that's the case here.

All that having been said, it *is* possible, even if you do everything right, to run across a situation where gdb can't extract a backtrace. This most commonly happens at higher optimization levels. You might want to try building with -fno-omit-frame-pointer or even -O0 to minimize the chances of the compiler doing something that's too clever for gdb.

If you do manage to get it working correctly, the backtrace should show not only ALL (rather than only some) functions in the call stack but also the values of the arguments passed to each of those functions, which, probably needless to say, would be awfully helpful in figuring this one out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
All right, so we were able to get a full bt of the alloc error on a test system.

Also, since we have a lot of emails going around on this - I wanted to make it clear that we're seeing *two* production errors, which may or may not be related. (The OP for bug #6200 also sees both issues.) One is a segfault; our emails over the weekend were for that, and we don't have a complete bt for that yet (we're going to install our own build of postgres to get that). The second error is an invalid memory alloc error that we're getting ~2 dozen times per day in production. The bt for this alloc error is below.

Please note also that this is from a version of postgres which we modified to panic when we hit the invalid memory alloc line. So, an unmodified postgres won't actually segfault here; it just outputs an error and continues. We modified this to panic so we could catch a bt, and it looks like we did get one :)

Anyway, here goes...

#0  0x0000003a83e30265 in raise () from /lib64/libc.so.6
#1  0x0000003a83e31d10 in abort () from /lib64/libc.so.6
#2  0x00000000007cb84e in errfinish (dummy=0) at elog.c:523
#3  0x00000000007cd951 in elog_finish (elevel=22, fmt=0x95cdf0 "invalid memory alloc request size %lu") at elog.c:1202
#4  0x00000000007f115c in MemoryContextAlloc (context=0x17b581d0, size=18446744073709551613) at mcxt.c:516
#5  0x0000000000771a46 in text_to_cstring (t=0x17b23608) at varlena.c:139
#6  0x0000000000770747 in varcharout (fcinfo=0x7fffd44854e0) at varchar.c:515
#7  0x00000000007d3ea8 in FunctionCall1Coll (flinfo=0x17b7b5c0, collation=0, arg1=397555208) at fmgr.c:1293
#8  0x00000000007d5562 in OutputFunctionCall (flinfo=0x17b7b5c0, val=397555208) at fmgr.c:1946
#9  0x000000000045a4a1 in printtup (slot=0x17d27b80, self=0x17a9cf00) at printtup.c:338
#10 0x00000000005bab14 in ExecutePlan (estate=0x17d27a70, planstate=0x17d27c10, operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x17a9cf00) at execMain.c:1464
#11 0x00000000005b9205 in standard_ExecutorRun (queryDesc=0x17b87918, direction=ForwardScanDirection, count=0) at execMain.c:313
#12 0x00000000005b9100 in ExecutorRun (queryDesc=0x17b87918, direction=ForwardScanDirection, count=0) at execMain.c:261
#13 0x00000000006d3b88 in PortalRunSelect (portal=0x17ae2c30, forward=1 '\001', count=0, dest=0x17a9cf00) at pquery.c:943
#14 0x00000000006d3849 in PortalRun (portal=0x17ae2c30, count=9223372036854775807, isTopLevel=1 '\001', dest=0x17a9cf00, altdest=0x17a9cf00, completionTag=0x7fffd4485cd0 "") at pquery.c:787
#15 0x00000000006cf62d in exec_execute_message (portal_name=0x17a9caf0 "", max_rows=9223372036854775807) at postgres.c:1963
#16 0x00000000006d1e81 in PostgresMain (argc=2, argv=0x179bdaa8, username=0x179bd8c0 "datamover") at postgres.c:3983
#17 0x0000000000686256 in BackendRun (port=0x179e2350) at postmaster.c:3601
#18 0x000000000068593f in BackendStartup (port=0x179e2350) at postmaster.c:3286
#19 0x0000000000682af2 in ServerLoop () at postmaster.c:1455
#20 0x0000000000682298 in PostmasterMain (argc=3, argv=0x179bbc80) at postmaster.c:1116
#21 0x00000000005fe007 in main (argc=3, argv=0x179bbc80) at main.c:199
Bridget Frey <bridget.frey@redfin.com> writes:
> The second error is an invalid memory alloc error that we're getting ~2
> dozen times per day in production. The bt for this alloc error is below.

This trace is consistent with the idea that we're getting a corrupt tuple out of a table, although it doesn't entirely preclude the possibility that the corrupt value is manufactured inside the backend. To get much further you're going to need to look at the specific query being executed each time this happens, and see if you can detect any pattern. Now that you've got debug symbols straightened out, the gdb command "p debug_query_string" should accomplish this. (If that does not produce anything that looks like one of your application's SQL commands, we'll need to try harder to extract the info.) You could probably hack the elog(PANIC) to log that before dying, too, if you'd rather not manually gdb each core dump.

			regards, tom lane
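A rough sketch of the instrumentation Tom suggests, layered on top of the earlier ERROR-to-PANIC change; debug_query_string is the backend-global query text declared in tcop/tcopprot.h, but the exact placement shown here is illustrative rather than a tested patch:

    /* Hypothetical variant of MemoryContextAlloc(), for illustration only */
    #include "tcop/tcopprot.h"      /* for debug_query_string */

    void *
    MemoryContextAlloc(MemoryContext context, Size size)
    {
        AssertArg(MemoryContextIsValid(context));

        if (!AllocSizeIsValid(size))
        {
            /* Record the offending query before the PANIC kills the backend */
            if (debug_query_string != NULL)
                elog(LOG, "bad alloc request while executing: %s",
                     debug_query_string);
            elog(PANIC, "invalid memory alloc request size %lu",
                 (unsigned long) size);
        }

        return (*context->methods->alloc) (context, size);
    }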
Hi Tom,

Thanks for the reply, we appreciate your time on this. The alloc error queries all seem to be selects from a btree primary index. I gave an example in my initial post from the logins table. Usually for us it is logins, but sometimes we have seen it on a few other tables, and it's always a btree primary key index, a very simple type of select. The queries have been showing up in the logs, which is how we know, but we could also confirm in the core dump.

If the problem is data corruption, it is transient. We replay the same queries and get no errors. We also have jobs that run that basically do the same series of selects every day or hour etc., but it is totally random which ones cause an error. E.g. if it is corruption, it somehow magically fixes itself. Also, we still have not seen any issues on the master; this seems to be a problem only on hot standby slaves (we have three slaves).

The OP, incidentally, reported the same thing - the issue is only on hot standby slaves, it is transient, and it happens on a select from a btree primary index. This also does not seem to be load related. It often happens under periods of light load for us.

Please let us know if you have any other thoughts on what we should look at...

Sent from my iPhone

On Jan 30, 2012, at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Bridget Frey <bridget.frey@redfin.com> writes:
>> The second error is an invalid memory alloc error that we're getting ~2
>> dozen times per day in production. The bt for this alloc error is below.
>
> This trace is consistent with the idea that we're getting a corrupt
> tuple out of a table, although it doesn't entirely preclude the
> possibility that the corrupt value is manufactured inside the backend.
> To get much further you're going to need to look at the specific query
> being executed each time this happens, and see if you can detect any
> pattern. Now that you've got debug symbols straightened out, the
> gdb command "p debug_query_string" should accomplish this. (If that
> does not produce anything that looks like one of your application's
> SQL commands, we'll need to try harder to extract the info.) You could
> probably hack the elog(PANIC) to log that before dying, too, if you'd
> rather not manually gdb each core dump.
>
> regards, tom lane
Bridget Frey <bridget.frey@redfin.com> writes:
> Thanks for the reply, we appreciate your time on this. The alloc error
> queries all seem to be selects from a btree primary index. I gave an
> example in my initial post from the logins table. Usually for us it
> is logins but sometimes we have seen it on a few other tables, and
> it's always a btree primary key index, very simple type of select.

Hm. The stack trace is definitive that it's finding the bad data in a tuple that it's trying to print to the client, not in an index. That tuple might've been straight from disk, or it could have been constructed inside the backend ... but if it's a simple SELECT FROM single-table WHERE index-condition, then the tuple should be raw data found in a shared buffer.

> The queries have been showing up in the logs which is how we know, but
> we could also confirm in the core dump. If the problem is data
> corruption, it is transient. We replay the same queries and get no
> errors.

The idea that comes to mind is that somehow btree index updates are reaching the standby in advance of the heap updates they refer to. But how could that be? And even more to the point, if we did follow a bogus TID pointer from an index, how come it's failing there? You'd expect it to usually notice such a problem much earlier, while examining the heap tuple header. (Invalid xmin complaints are the typical symptom from that, since the xmin is one of the first fields we look at that can be sanity-checked to any extent.)

Still baffled here.

			regards, tom lane
I wrote:
> Hm. The stack trace is definitive that it's finding the bad data in a
> tuple that it's trying to print to the client, not in an index.

BTW, after a bit more reflection it occurs to me that it's not so much that the data is necessarily *bad*, as that it seemingly doesn't match the tuple descriptor that the backend's trying to interpret it with. (In particular, the reported symptom would be consistent with finding a small integer constant at a place where the descriptor expects to find a variable-length field.) So that opens up a different line of thought about how those could get out of sync on a standby. Are you in the habit of issuing ALTER TABLE commands to add/delete/change columns on these tables? In fact, is there any DDL whatsoever going on around the time these failures happen?

			regards, tom lane
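To make that point concrete: the failing frame in the earlier backtrace was text_to_cstring(), which computes the payload length from a varlena header and then palloc()s length + 1. A simplified standalone sketch (it ignores the flag bits the real varlena macros encode) of how a zeroed 4-byte word sitting where a variable-length header was expected would yield a request for exactly -3 bytes:

    #include <stdio.h>
    #include <stdint.h>

    #define VARHDRSZ 4              /* size of a 4-byte varlena header */

    int
    main(void)
    {
        uint32_t stored_len = 0;    /* bogus word read as a varlena length */
        int      len = (int) stored_len - VARHDRSZ;   /* -4 */
        size_t   request = (size_t) (len + 1);        /* palloc(len + 1)  */

        /* Prints 18446744073709551613 -- the number in the reported error */
        printf("%zu\n", request);
        return 0;
    }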
We have no DDL whatsoever in the code. We do update rows in the logins table frequently, but we basically have a policy of only doing DDL changes during scheduled upgrades when we bring the site down.

We have been discussing this issue a lot and we really haven't come up with anything that would be considered unusual here. The tables experiencing issues have maybe 1M to 200M rows, we do updates and selects frequently, they have standard btree primary key indexes, and the failing query always seems to be a select for a single row based on a primary key lookup. All of these code paths worked flawlessly prior to our 9.1 upgrade (we had been using skytools replication). And we see no problems on the master despite similar workloads there.

It is definitely puzzling, and we are not too sure what to look into next...

Sent from my iPhone

On Jan 30, 2012, at 9:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wrote:
>> Hm. The stack trace is definitive that it's finding the bad data in a
>> tuple that it's trying to print to the client, not in an index.
>
> BTW, after a bit more reflection it occurs to me that it's not so much
> that the data is necessarily *bad*, as that it seemingly doesn't match
> the tuple descriptor that the backend's trying to interpret it with.
> (In particular, the reported symptom would be consistent with finding
> a small integer constant at a place where the descriptor expects to find
> a variable-length field.) So that opens up a different line of thought
> about how those could get out of sync on a standby. Are you in the
> habit of issuing ALTER TABLE commands to add/delete/change columns on
> these tables? In fact, is there any DDL whatsoever going on around the
> time these failures happen?
>
> regards, tom lane
Excerpts from Bridget Frey's message of Mon Jan 30 18:59:08 -0300 2012:

> Anyway, here goes...

Maybe a "bt full" could give more insight into what's going on ...

> #0  0x0000003a83e30265 in raise () from /lib64/libc.so.6
> #1  0x0000003a83e31d10 in abort () from /lib64/libc.so.6
> #2  0x00000000007cb84e in errfinish (dummy=0) at elog.c:523
> #3  0x00000000007cd951 in elog_finish (elevel=22, fmt=0x95cdf0 "invalid
> memory alloc request size %lu") at elog.c:1202
> #4  0x00000000007f115c in MemoryContextAlloc (context=0x17b581d0,
> size=18446744073709551613) at mcxt.c:516
> #5  0x0000000000771a46 in text_to_cstring (t=0x17b23608) at varlena.c:139
> #6  0x0000000000770747 in varcharout (fcinfo=0x7fffd44854e0) at
> varchar.c:515

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Hm.  The stack trace is definitive that it's finding the bad data in a
>> tuple that it's trying to print to the client, not in an index.
>
> BTW, after a bit more reflection it occurs to me that it's not so much
> that the data is necessarily *bad*, as that it seemingly doesn't match
> the tuple descriptor that the backend's trying to interpret it with.

Hmm. Could this be caused by the recovery process failing to obtain a sufficiently strong lock on a buffer before replaying some WAL record? For example, getting only an exclusive content lock where a cleanup lock is needed could presumably cause something like this to happen - it would explain the transient nature of the errors as well as the fact that they only seem to occur during Hot Standby operation. On the other hand, it's a little hard to believe we would have missed something that obvious; there aren't that many things that need a cleanup lock on a heap page.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, after a bit more reflection it occurs to me that it's not so much
>> that the data is necessarily *bad*, as that it seemingly doesn't match
>> the tuple descriptor that the backend's trying to interpret it with.

> Hmm. Could this be caused by the recovery process failing to obtain a
> sufficiently strong lock on a buffer before replaying some WAL record?

Well, I was kinda speculating that inadequate locking could result in use of a stale (or too-new?) tuple descriptor, and that would be as good a candidate as any if the basic theory were right. But Bridget says they are not doing any DDL, so it's hard to see how there'd be any tuple descriptor mismatch at all. Still baffled ...

			regards, tom lane
So here's a better stack trace for the segfault issue. (Again, just to summarize, since this is a long thread: we're seeing two issues, 1) alloc errors that do not crash the DB (although we modified postgres to panic when this happens in our test environment, and posted a stack earlier), and 2) a postgres segfault that happens once every couple of days on our slaves. We're still not sure if these are the same issue or not.)

This stack is not perfect because it still has some things optimized out (this came from our production database), but it's much more detailed than the last one we posted for a segfault... hope this helps get closer to an answer on this... I'd also be interested in knowing whether the postgres experts think these two symptoms are likely related... or totally separate issues...

Thanks!
-B

#0  0x0000000000455dc1 in slot_deform_tuple (slot=0x53cfc20, natts=70) at heaptuple.c:1090
1090            off = att_align_pointer(off, thisatt->attalign, -1,
(gdb) bt
#0  0x0000000000455dc1 in slot_deform_tuple (slot=0x53cfc20, natts=70) at heaptuple.c:1090
#1  0x0000000000455fbd in slot_getallattrs (slot=0x53cfc20) at heaptuple.c:1253
#2  0x0000000000458ac7 in printtup (slot=0x53cfc20, self=0x534f1e0) at printtup.c:300
#3  0x000000000055bd69 in ExecutePlan (queryDesc=0x5515978, direction=<value optimized out>, count=0) at execMain.c:1464
#4  standard_ExecutorRun (queryDesc=0x5515978, direction=<value optimized out>, count=0) at execMain.c:313
#5  0x0000000000623594 in PortalRunSelect (portal=0x5394f10, forward=<value optimized out>, count=0, dest=0x534f1e0) at pquery.c:943
#6  0x0000000000624ae0 in PortalRun (portal=0x5394f10, count=9223372036854775807, isTopLevel=1 '\001', dest=0x534f1e0, altdest=0x534f1e0, completionTag=0x7fff014e0640 "") at pquery.c:787
#7  0x00000000006220f2 in exec_execute_message (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:1963
#8  PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:3983
#9  0x00000000005e6ba4 in ServerLoop () at postmaster.c:3601
#10 0x00000000005e791c in PostmasterMain (argc=5, argv=0x524cab0) at postmaster.c:1116
#11 0x000000000058b9ae in main (argc=5, argv=<value optimized out>) at main.c:199
On Tue, Jan 31, 2012 at 4:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jan 31, 2012 at 12:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> BTW, after a bit more reflection it occurs to me that it's not so much
>>> that the data is necessarily *bad*, as that it seemingly doesn't match
>>> the tuple descriptor that the backend's trying to interpret it with.
>
>> Hmm.  Could this be caused by the recovery process failing to obtain a
>> sufficiently strong lock on a buffer before replaying some WAL record?
>
> Well, I was kinda speculating that inadequate locking could result in
> use of a stale (or too-new?) tuple descriptor, and that would be as good
> a candidate as any if the basic theory were right.  But Bridget says
> they are not doing any DDL, so it's hard to see how there'd be any tuple
> descriptor mismatch at all.  Still baffled ...

No, I wasn't thinking about a tuple descriptor mismatch. I was imagining that the page contents themselves might be in flux while we're trying to read from it. Off the top of my head I don't see how that can happen, but it would be awfully interesting to be able to see which WAL record last touched the relevant heap page, and how long before the error that happened.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> No, I wasn't thinking about a tuple descriptor mismatch. I was
> imagining that the page contents themselves might be in flux while
> we're trying to read from it.

Oh, gotcha. Yes, that's a horribly plausible idea. All it'd take is one WAL replay routine that hasn't been upgraded to acquire sufficient buffer locks. Pre-hot-standby, there was no reason for them to be careful about locking.

On the other hand, if that were the cause, you'd expect the symptoms to be a bit more variable...

			regards, tom lane
On Wed, Feb 1, 2012 at 11:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> No, I wasn't thinking about a tuple descriptor mismatch.  I was
>> imagining that the page contents themselves might be in flux while
>> we're trying to read from it.
>
> Oh, gotcha.  Yes, that's a horribly plausible idea.  All it'd take is
> one WAL replay routine that hasn't been upgraded to acquire sufficient
> buffer locks.  Pre-hot-standby, there was no reason for them to be
> careful about locking.
>
> On the other hand, if that were the cause, you'd expect the symptoms
> to be a bit more variable...

Well, OP has two: crash, and invalid memory allocation. Both share the common thread that they happen while trying to decode a tuple. It would be nice to get a dump of what PostgreSQL thought the entire block looked like at the time the crash happened. That information is presumably already in the core dump, but I'm not sure if there's a nice way to extract it using gdb.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
>>> No, I wasn't thinking about a tuple descriptor mismatch.  I was
>>> imagining that the page contents themselves might be in flux while
>>> we're trying to read from it.

> It would be nice to get a dump of what PostgreSQL thought the entire
> block looked like at the time the crash happened. That information is
> presumably already in the core dump, but I'm not sure if there's a
> nice way to extract it using gdb.

It probably would be possible to get the page out of the dump, but I'd be really surprised if that proved much. By the time the crash-dump-making code gets around to examining the shared memory, the other process that's hypothetically changing the page will have done its work and moved on. A crash in process X doesn't freeze execution in process Y, at least not in any Unixoid system I've ever heard of.

			regards, tom lane
Excerpts from Tom Lane's message of Wed Feb 01 18:06:27 -0300 2012:
> Robert Haas <robertmhaas@gmail.com> writes:
>>> No, I wasn't thinking about a tuple descriptor mismatch. I was
>>> imagining that the page contents themselves might be in flux while
>>> we're trying to read from it.
>
>> It would be nice to get a dump of what PostgreSQL thought the entire
>> block looked like at the time the crash happened. That information is
>> presumably already in the core dump, but I'm not sure if there's a
>> nice way to extract it using gdb.
>
> It probably would be possible to get the page out of the dump, but
> I'd be really surprised if that proved much. By the time the
> crash-dump-making code gets around to examining the shared memory, the
> other process that's hypothetically changing the page will have done its
> work and moved on. A crash in process X doesn't freeze execution in
> process Y, at least not in any Unixoid system I've ever heard of.

Maybe you can do something like send SIGSTOP to every other backend, then attach to them and find which one was touching the same buffer, then peek at what it was doing.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
I recently raised "BUG #6425: Bus error in slot_deform_tuple". During the last reproduction of the problem I saw this:

Client 2 aborted in state 0: ERROR: invalid memory alloc request size 18446744073709551613

So like Tom said, these two issues could well be related. I just wanted to mention it here in this thread, FYI.