Thread: seg fault crashed the postmaster
Hi,

Running CentOS. We just upgraded our production db from 8.4.4 to 9.0.2 last night. About 20 hours later, an update statement seg faulted and crashed the server. This is a typical update that has worked fine for a long time.

20898 datafeed (58628) 2010-12-30 19:28:14 EST [103]LOG: process 20898 acquired ShareLock on transaction 1286738762 after 90174.969 ms
20898 datafeed (58628) 2010-12-30 19:28:14 EST [104]STATEMENT: update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 34409854 and status != 'S' limit 10000)
5802 2010-12-30 19:28:14 EST [4]LOG: server process (PID 20898) was terminated by signal 11: Segmentation fault
5802 2010-12-30 19:28:14 EST [5]LOG: terminating any other active server processes
15426 pipeline (36834) 2010-12-30 19:28:14 EST [1]WARNING: terminating connection because of crash of another server process
15426 pipeline 10.10.11.54(36834) 2010-12-30 19:28:14 EST [2]DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

One hint that something might be wrong was this:

18235 datafeed (44228) 2010-12-30 18:16:37 EST [11]LOG: process 18235 acquired ShareLock on transaction 1285952031 after 29966.703 ms
18235 datafeed (44228) 2010-12-30 18:16:37 EST [12]STATEMENT: update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 25301995 and status != 'S' limit 10000)
18235 datafeed (44228) 2010-12-30 18:16:43 EST [13]ERROR: compressed data is corrupt
18235 datafeed (44228) 2010-12-30 18:16:43 EST [14]STATEMENT: update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 25301995 and status != 'S' limit 10000)

How concerned should I be?

Thanks!
Gordon
Gordon Shannon <gordo169@gmail.com> writes:
> Running CentOS. We just upgraded our production db from 8.4.4 to 9.0.2 last
> night. About 20 hours later, an update statement seg faulted and crashed
> the server. This is a typical update that has worked fine for a long time.

Could we see a stack trace from that? Or at least a self-contained test case?

        regards, tom lane
I'd love to send you a stack trace. Any suggestions on how to get one? It has since happened again, on the same update command, so I'm guessing I can repeat it.
Gordon Shannon <gordo169@gmail.com> writes:
> I'd love to send you a stack trace. Any suggestions on how to get one? It
> has since happened again, on the same update command, so I'm guessing I can
> repeat it.

http://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

        regards, tom lane
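In case it helps anyone reading the archive, the short version of that wiki page looks roughly like this on CentOS -- the package name and paths below assume the PGDG 9.0 RPM layout and are guesses, so adjust for your install:

  # allow core dumps in the postmaster's environment, then restart it
  ulimit -c unlimited
  # install debugging symbols so the backtrace has names (needs yum-utils)
  debuginfo-install postgresql90-server
  # after the next crash, point gdb at the server binary and the core file
  # (cores usually land in the data directory)
  gdb /usr/pgsql-9.0/bin/postgres /var/lib/pgsql/9.0/data/core.<pid>
  (gdb) bt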
Stack trace:
#0 0x00000031a147c15c in memcpy () from /lib64/libc.so.6
#1 0x0000000000450cb8 in __memcpy_ichk (tuple=0x7fffb29ac900) at /usr/include/bits/string3.h:51
#2 heap_copytuple (tuple=0x7fffb29ac900) at heaptuple.c:592
#3 0x0000000000543d4c in EvalPlanQualFetchRowMarks (epqstate=0x3cd85ab8) at execMain.c:1794
#4 0x00000000005440db in EvalPlanQual (estate=0x1e0db420, epqstate=0x3cd85ab8, relation=<value optimized out>, rti=4, tid=0x7fffb29aca20, priorXmax=<value optimized out>) at execMain.c:1401
#5 0x00000000005592eb in ExecUpdate (node=0x3cd85a28) at nodeModifyTable.c:527
#6 ExecModifyTable (node=0x3cd85a28) at nodeModifyTable.c:748
#7 0x0000000000545953 in ExecProcNode (node=0x3cd85a28) at execProcnode.c:359
#8 0x0000000000544881 in ExecutePlan (queryDesc=0x1e727990, direction=1039265768, count=0) at execMain.c:1190
#9 standard_ExecutorRun (queryDesc=0x1e727990, direction=1039265768, count=0) at execMain.c:280
#10 0x00002ab002c0f2b5 in explain_ExecutorRun (queryDesc=0x1e727990, direction=ForwardScanDirection, count=0) at auto_explain.c:203
#11 0x00000000005f9c81 in ProcessQuery (plan=0x2112ad60,
sourceText=0x1b3e59e0 "update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 33138761 and status != 'S' limit 10000)",
params=<value optimized out>, dest=0x2112ae40, completionTag=0x7fffb29ace20 "") at pquery.c:197
#12 0x00000000005f9e99 in PortalRunMulti (portal=0x1b32aed0, isTopLevel=1 '\001', dest=0x2112ae40, altdest=0x2112ae40, completionTag=0x7fffb29ace20 "") at pquery.c:1268
#13 0x00000000005fa965 in PortalRun (portal=0x1b32aed0, count=9223372036854775807, isTopLevel=1 '\001', dest=0x2112ae40, altdest=0x2112ae40, completionTag=0x7fffb29ace20 "") at pquery.c:822
#14 0x00000000005f7455 in exec_simple_query (
query_string=0x1b3e59e0 "update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 33138761 and status != 'S' limit 10000)")
at postgres.c:1058
#15 0x00000000005f7d14 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, username=<value optimized out>) at postgres.c:3929
#16 0x00000000005c7ce5 in ServerLoop () at postmaster.c:3555
#17 0x00000000005c89ec in PostmasterMain (argc=5, argv=0x1b31ea00) at postmaster.c:1092
#18 0x00000000005725fe in main (argc=5, argv=<value optimized out>) at main.c:188
--
If I had more time, I could have written you a shorter letter. (Blaise Pascal)
Gordon Shannon <gordo169@gmail.com> writes:
> Stack trace:
> #0 0x00000031a147c15c in memcpy () from /lib64/libc.so.6
> #1 0x0000000000450cb8 in __memcpy_ichk (tuple=0x7fffb29ac900) at /usr/include/bits/string3.h:51
> #2 heap_copytuple (tuple=0x7fffb29ac900) at heaptuple.c:592
> #3 0x0000000000543d4c in EvalPlanQualFetchRowMarks (epqstate=0x3cd85ab8) at execMain.c:1794
> #4 0x00000000005440db in EvalPlanQual (estate=0x1e0db420, epqstate=0x3cd85ab8, relation=<value optimized out>, rti=4, tid=0x7fffb29aca20, priorXmax=<value optimized out>) at execMain.c:1401
> #5 0x00000000005592eb in ExecUpdate (node=0x3cd85a28) at nodeModifyTable.c:527
> #6 ExecModifyTable (node=0x3cd85a28) at nodeModifyTable.c:748
> #7 0x0000000000545953 in ExecProcNode (node=0x3cd85a28) at execProcnode.c:359
> #8 0x0000000000544881 in ExecutePlan (queryDesc=0x1e727990, direction=1039265768, count=0) at execMain.c:1190
> #9 standard_ExecutorRun (queryDesc=0x1e727990, direction=1039265768, count=0) at execMain.c:280
> #10 0x00002ab002c0f2b5 in explain_ExecutorRun (queryDesc=0x1e727990, direction=ForwardScanDirection, count=0) at auto_explain.c:203
> #11 0x00000000005f9c81 in ProcessQuery (plan=0x2112ad60,
>     sourceText=0x1b3e59e0 "update v_messages set status = 'S', updated_on = now() where id in (select id from v_messages where author_id = 33138761 and status != 'S' limit 10000)",
>     params=<value optimized out>, dest=0x2112ae40, completionTag=0x7fffb29ace20 "") at pquery.c:197

Hmm. This suggests that there's something wrong in the EvalPlanQual code, which gets invoked when there are concurrent updates to the same row (ie, the row this UPDATE is trying to change is one that was changed by some other transaction since the query started). That stuff got rewritten rather thoroughly for 9.0, so the idea of a new bug there isn't exactly surprising. But it's going to be hard to find without a test case.

Can you show us the full schema for this table and all the queries that execute against it up till the point of the failure? (Turning on log_statement across all sessions would help collect that info, if you don't have it already.)

        regards, tom lane
Interesting. That's exactly what we have been doing -- trying to update the same rows in multiple txns. For us to proceed in production, I will take steps to ensure we stop doing that, as it's just an app bug really.
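For context, here is a minimal sketch of the kind of collision that sends an UPDATE through EvalPlanQual under READ COMMITTED; the id value is made up, and the real workload used the IN (... limit 10000) form rather than a single id:

  -- session 1
  BEGIN;
  UPDATE v_messages SET status = 'S', updated_on = now() WHERE id = 12345;
  -- transaction left open

  -- session 2 (READ COMMITTED, the default) runs the same statement:
  -- it blocks on session 1's row lock, and once session 1 commits it
  -- re-fetches and re-checks the updated row via EvalPlanQual before
  -- applying its own change
  UPDATE v_messages SET status = 'S', updated_on = now() WHERE id = 12345;

  -- session 1
  COMMIT;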
The table in question -- v_messages -- is an empty base table with 76 partitions, with a total of 2.8 billion rows.
Let me summarize what I see as the key facts here:
(All problems have come from the UPDATE query, all identical except for different "author_id" values.)
1. We did a "link" upgrade Wednesday night, from 8.4.4 to 9.0.2, so the upgrade happened in place; no data files were copied.
2. The first error was "compressed data is corrupt" at 18:16.
3. We got 2 seg fault crashes before turning on cores and getting a 3rd crash with the stack trace.
4. We then got an "invalid memory alloc request size 18446744073449177092" at 23:50. This was an ERROR, not a crash.
At this point, is it your suspicion that there is a code bug in 9.0.2, rather than corrupt data?
I will post the schema and then work on a test case.
-gordon
Here is the DDL for the tables in question. There are foreign keys to other tables that I omitted.

http://postgresql.1045698.n5.nabble.com/file/n3323804/parts.sql (parts.sql)
Gordon Shannon <gordo169@gmail.com> writes:
> Here is the DDL for the tables in question. There are foreign keys to other
> tables that I omitted.
> http://postgresql.1045698.n5.nabble.com/file/n3323804/parts.sql (parts.sql)

Hmmm ... what is "message_status_enum"? Is that an actual enum type, or some kind of domain over text or varchar? If it's an enum, I'm a bit baffled where the "corrupt compressed data" message could have come from.

        regards, tom lane
Sorry, I left that out. Yeah, I wondered that too, since these tables do not use toast.
CREATE TYPE message_status_enum AS ENUM ( 'V', 'X', 'S', 'R', 'U', 'D' );
Gordon Shannon <gordo169@gmail.com> writes:
> Sorry, I left that out. Yeah, I wondered that too, since these tables do
> not use toast.

Hm. Well, given that the stack trace suggests we're trying to access a tuple value that's not there (bogus pointer, or data overwritten since the pointer was created), the "invalid memory alloc request size" message is believable as another manifestation of the same problem. The odds seem pretty good that the "corrupt compressed data" message has the same origin at bottom, although the lack of any obvious data to be compressed in this table is confusing. Maybe you could get that from trying to copy over a garbage value of that one varchar column, though.

I tried to replicate the problem here without success, which suggests to me that maybe it's plan-dependent. Can you show what EXPLAIN gives you for the troublesome query?

Another point here is that your logs suggest that the query blocked behind the previous updater of the row for a very long time --- 90 sec in one case, 30 sec in another. Is that the typical runtime for one of these updates? If not, can you identify what the blocking query was? If it's something different from another instance of the identical update, it'd be useful to know what.

        regards, tom lane
Yes, that query does take 30 or 90 secs. I'm pretty sure it was blocking on its twin update running concurrently. However, I'm not really sure how to identify what "transaction 1283585646" was.
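I suppose something along these lines against pg_locks would have shown who held it while the wait was in progress -- just a sketch, using 9.0's pg_stat_activity column names (procpid, current_query); I didn't have anything like this running at the time:

  -- backends waiting on a transaction-id lock, and the backend whose
  -- transaction they are waiting for
  SELECT w.pid AS waiting_pid,
         h.pid AS holding_pid,
         a.current_query AS holder_query
  FROM pg_locks w
  JOIN pg_locks h
    ON h.locktype = 'transactionid'
   AND h.transactionid = w.transactionid
   AND h.granted
  JOIN pg_stat_activity a ON a.procpid = h.pid
  WHERE w.locktype = 'transactionid'
    AND NOT w.granted;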
Enclosed is the query plan -- 21000 lines
-gordon
Gordon Shannon <gordo169@gmail.com> writes:
> Enclosed is the query plan -- 21000 lines

Um ... nope?

        regards, tom lane
Maybe it doesn't work from gmail. I'll try uploading from here.

http://postgresql.1045698.n5.nabble.com/file/n3323933/plan.txt (plan.txt)
I wrote:
> The odds seem pretty good that the "corrupt compressed data" message
> has the same origin at bottom, although the lack of any obvious data
> to be compressed in this table is confusing. Maybe you could get that
> from trying to copy over a garbage value of that one varchar column,
> though.

Nah, that's overthinking the problem. Just before the previously identified crash point, we have

    td = DatumGetHeapTupleHeader(datum);

That's a pg_detoast_datum call underneath the macro. So if you assume that the tuple-datum pointer is pointing at garbage, you can get all three of the observed symptoms very easily, depending on what the garbage passing for its length word happens to look like.

So I'm pretty sure that what we're dealing with is a case of the plan freeing a transient tuple datum sooner than it should. But the toy case I tried here didn't fail, so evidently I'm not close enough to the plan you're actually using. Still need to see that EXPLAIN output. It'd be helpful also to know what nondefault configuration settings you're using, especially work_mem and planner-related settings. Also, do you have an idea of how many rows might've matched the WHERE conditions?

        regards, tom lane
The number of matching rows on these queries is anything from 0 to 10000. I don't think I can tell how many would have matched on the ones that crashed, although I suspect it would have been toward the 10000 end. I've been trying to get a reproducible test case with no luck so far.
I assume you can now see the plan? I uploaded it twice, once via gmail and once on Nabble.
Here are all the non-default settings:
> listen_addresses = '*'
> max_connections = 450
> authentication_timeout = 20s
> shared_buffers = 18GB
> work_mem = 200MB
> maintenance_work_mem = 8GB
> shared_preload_libraries = 'auto_explain'
> wal_level = hot_standby
> synchronous_commit = off
> wal_buffers = 8MB
> commit_siblings = 1
> checkpoint_segments = 256
> checkpoint_warning = 5min
> archive_mode = on
> archive_command = 'cp %p /var/lib/pgsql/wal/%f.wrk; mv /var/lib/pgsql/wal/%f.wrk /var/lib/pgsql/wal/%f'
> max_wal_senders = 1
> cpu_tuple_cost = 0.003
> cpu_index_tuple_cost = 0.001
> cpu_operator_cost = 0.0005
> effective_cache_size = 52GB
> default_statistics_target = 200
> log_directory = '/var/log/postgres'
> log_filename = 'pg%d.log'
> log_min_duration_statement = 7min
> log_line_prefix = '%p %u %r %t [%l]'
> log_lock_waits = on
> log_temp_files = 0
> track_functions = pl # none, pl, all
> log_autovacuum_min_duration = 5min
> autovacuum_vacuum_scale_factor = 0.1
> autovacuum_analyze_scale_factor = 0.03
> autovacuum_freeze_max_age = 1500000000 # 1,500,000,000
> autovacuum_vacuum_cost_delay = -1
> temp_tablespaces = 'ts03,ts04,ts05,ts06,ts07,ts08,ts09,ts10,ts11,ts12,ts13,ts14,ts15,ts16,ts17,ts18,ts19,ts20,ts21,ts22,ts23,ts24,ts25,ts26,ts27,ts28,ts29,ts30,ts31,ts32,ts33,ts34,ts35,ts36,ts37,ts38'
> vacuum_freeze_min_age = 500000000 # 500,000,000
> vacuum_freeze_table_age = 1300000000 # 1,300,000,000
> bytea_output = 'escape'
> deadlock_timeout = 5s
> standard_conforming_strings = on
> custom_variable_classes = 'auto_explain'
> auto_explain.log_min_duration = -1 # Remember this means for everybody!
> auto_explain.log_analyze = off ## DANGER! Don't set log_analyze to true unless you know what you're doing!
> auto_explain.log_verbose = off
> auto_explain.log_nested_statements = on
Gordon Shannon <gordo169@gmail.com> writes:
> I assume you can now see the plan? I uploaded it twice, once via gmail and
> once on Nabble.

Yeah, the Nabble one works. Now I'm even more confused, because the whole-row var seems to be coming from the outside of the nestloop, which is about the simplest possible case.

> The number of matching rows on these queries is anything from 0 to 10000. I
> don't think I can tell how many would have matched on the ones that
> crashed, although I suspect it would have been toward the 10000 end. I've
> been trying to get a reproducible test case with no luck so far.

No luck here either. A couple further questions --- do you think it's likely that you could have been over 10000 matches, ie, the LIMIT actually kicked in? Also, could there have been more than two concurrent updates acting on the same rows?

        regards, tom lane
Unfortunately it's now impossible to say how many were updated, as they get deleted by another process later. I may be able to restore part of a dump from 2 days ago on another machine, and get some counts from that, assuming I have the disk space. I'll work on that.
I do not believe there could have been more than 2 concurrent updates because the app is limited to 2 threads, but I will verify that with the developer.
-gordon
I'm putting this on this thread, since it could be related to the issue. I'm now seeing this in the log on the HSB/SR server. It's happened about 4 times in the past 2 days.

23964 2011-01-04 05:23:00 EST [47]LOG: invalid record length at 6E53/46E8A010
23535 2011-01-04 05:23:00 EST [2]FATAL: terminating walreceiver process due to administrator command
cp: cannot stat `/data23/wal_sync/0000000100006E5300000046': No such file or directory
23964 2011-01-04 05:23:00 EST [48]LOG: invalid record length at 6E53/46E8A010
cp: cannot stat `/data23/wal_sync/0000000100006E5300000046': No such file or directory
2486 2011-01-04 05:23:00 EST [1]LOG: streaming replication successfully connected to primary
23964 2011-01-04 15:47:59 EST [49]LOG: invalid record length at 6E6B/F8222010
2486 2011-01-04 15:47:59 EST [2]FATAL: terminating walreceiver process due to administrator command
cp: cannot stat `/data23/wal_sync/0000000100006E6B000000F8': No such file or directory
23964 2011-01-04 15:48:00 EST [50]LOG: invalid record length at 6E6B/F8222010
cp: cannot stat `/data23/wal_sync/0000000100006E6B000000F8': No such file or directory

The word FATAL sounds very ominous.
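For context, the cp errors come from the standby retrying its restore_command for a WAL segment that isn't in the archive directory; a recovery.conf roughly like the following -- an assumption, since the actual file wasn't posted and only the /data23/wal_sync path appears in the log -- would produce exactly those messages in between streaming reconnects:

  standby_mode = 'on'
  primary_conninfo = 'host=<primary host> user=<replication user>'   # assumed
  restore_command = 'cp /data23/wal_sync/%f %p'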
I'm tracking a problem with our tables being bloated, and was curious: if someone regularly kills autovacuum jobs, will autovacuum later reattempt to vacuum the table it was working on when it was killed?

I've made autovacuum more aggressive and given it more worker threads, yet for some reason we're not keeping up. The tables on this system are very active (at least many of them are). We have a large database (above 3.5 TB at this time) on a Sun Solaris Thumper (Sun Fire 4500 series, I believe).

Vacuum does complete when I run it manually on a table. But I suspect a coworker working late at night may be killing autovacuum. I'm reading through the logs right now to see if this is the case.

Thanks!
Hi,

On Wed, 2011-01-05 at 18:08 -0700, u235sentinel wrote:
> I'm tracking a problem with our tables being bloated and was curious
> if someone regularly kills autovacuum jobs, will autovacuum later
> reattempt to vacuum the table it was killed under?

In 8.3+, autovacuum kills itself when it conflicts with other queries. If it is killed by a human or so, it will try to run again in the next cycle (which depends on max_workers and timeout values).

> I've made autovacuum more aggressive and given more worker threads.

Giving it more worker threads will probably not make autovacuum more aggressive. You will also want to decrease the threshold values, either in postgresql.conf or as custom per-relation parameters (stored in pg_class).

> Yet for some reason we're not keeping up.

Probably because of the threshold values.

> Vacuum does complete when I run it manually on a table. But I'm
> suspecting a coworker working late at night may be killing autovacuum.
> Reading through the logs right now to see if this is the case.

Set log_autovacuum_min_duration to 0, and log all autovacuum activities. It will let you know when it cancels itself, with a message like "ERROR: cancelling autovacuum task" or so (this may not be the exact message, I did not check).

Regards,
--
Devrim GÜNDÜZ
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
PostgreSQL RPM Repository: http://yum.pgrpms.org
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org Twitter: http://twitter.com/devrimgunduz
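A sketch of what the per-relation route looks like on 8.4 or later (the table name and numbers are only examples; on 8.3 the equivalent settings live in the pg_autovacuum catalog instead):

  -- vacuum this table after ~1000 dead rows plus 2% of the table,
  -- rather than the global default of 20%
  ALTER TABLE my_busy_table SET (
      autovacuum_vacuum_scale_factor = 0.02,
      autovacuum_vacuum_threshold    = 1000
  );

  -- and in postgresql.conf, to log every autovacuum run:
  -- log_autovacuum_min_duration = 0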
On Wed, Jan 5, 2011 at 6:08 PM, u235sentinel <u235sentinel@gmail.com> wrote:
> I'm tracking a problem with our tables being bloated and was curious if
> someone regularly kills autovacuum jobs, will autovacuum later reattempt to
> vacuum the table it was killed under?
>
> I've made autovacuum more aggressive and given more worker threads. Yet for
> some reason we're not keeping up.

Can you show us what you've changed to make autovac more aggressive? You might want to make sure you've lowered autovacuum_vacuum_cost_delay (mine's at 0, as we have 28 or so disks in a single RAID-10 and lots of spare IO). Also raise autovacuum_vacuum_cost_limit much higher (mine's at 5000).

Assuming these tables get updated a lot, I'm gonna guess that you're not having a problem with too-high threshold settings but with autovacuum keeping up. But it's just a guess.

Also, what do you get from things like iostat, vmstat, and sar as regards your IO utilization and such? If your IO system is always at 100%, then more aggressive vacuuming isn't gonna do much, because you'll always be behind.
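For reference, the two autovacuum knobs Scott mentions, in postgresql.conf form with his stated values (the defaults noted in the comments are the stock 8.x/9.0 ones):

  autovacuum_vacuum_cost_delay = 0      # default 20ms; 0 turns off I/O throttling for autovacuum
  autovacuum_vacuum_cost_limit = 5000   # default -1, which falls back to vacuum_cost_limit (200)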