Thread: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 18165 Logged by: Maxime Boyer Email address: maxime.boyer@cra-arc.gc.ca PostgreSQL version: 14.9 Operating system: Windows Server 2019 Description: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor Error started after a pg_upgrade from 11.17 to 14.9 on the same physical servers. Very little information about this online. I'm not sure which file it's referring to. Server: Windows Server 2019 Standard 64-bit HP ProLiant BL460c Gen9 Intel Xeon CPU E5-2620 v4 @ 2.10 GHz (32 CPUs) 128 GB RAM Installed the bundled Microsoft Visual C++ 2015-2022 Redistributable (x64) - 14.36.32532 Facts: - The instance hosts multiple databases of the same application (Jira) - Only one database has this error - Same story on our production instance. The production database has the errors, not training. - The affected databases are larger than the others (5 & 17 GB vs ~1 GB) - The affected databases also show expected "duplicate key value violates unique constraint" errors, although some without the problem also do. Errors: 2023-09-18 12:06:19.375 EDT,"dbuser","database",37884,"app_node_1_ip:59368",6508750f.93fc,5,"",2023-09-18 12:04:31 EDT,182/105680,0,ERROR,XX000,"could not duplicate handle for ""Global/PostgreSQL.2437152779"": Bad file descriptor",,,,,,"select string_agg(word, ',') from pg_catalog.pg_get_keywords() where word <> ALL ('{a,abs,absolute,action,ada,add,admin,after,all,allocate,alter,always,and,any,are,array,as,asc,asensitive,assertion,assignment,asymmetric,at,atomic,attribute,attributes,authorization,avg,before,begin,bernoulli,between,bigint,binary,blob,boolean,both,breadth,by,c,call,called,cardinality,cascade,cascaded,case,cast,catalog,catalog_name,ceil,ceiling,chain,char,char_length,character,character_length,character_set_catalog,character_set_name,character_set_schema,characteristics,characters,check,checked,class_origin,clob,close,coalesce,cobol,code_units,collate,collation,collation_catalog,collation_name,collation_schema,collect,column,column_name,command_function,command_function_code,commit,committed,condition,condition_number,connect,connection_name,constraint,constraint_catalog,constraint_name,constraint_schema,constraints,constructors,contains,continue,convert,corr,corresponding,count,covar_pop,covar_samp,create,cross,cube,cume_dist,current,current_collation,current_date,current_default_transform_group,current_path,current_role,current_time,current_timestamp,current_transform_group_for_type,current_user,cursor,cursor_name,cycle,data,date,datetime_interval_code,datetime_interval_precision,day,deallocate,dec,decimal,declare,default,defaults,deferrable,deferred,defined,definer,degree,delete,dense_rank,depth,deref,derived,desc,describe,descriptor,deterministic,diagnostics,disconnect,dispatch,distinct,domain,double,drop,dynamic,dynamic_function,dynamic_function_code,each,element,else,end,end-exec,equals,escape,every,except,exception,exclude,excluding,exec,execute,exists,exp,external,extract,false,fetch,filter,final,first,float,floor,following,for,foreign,fortran,found,free,from,full,function,fusion,g,general,get,global,go,goto,grant,granted,group,grouping,having,hierarchy,hold,hour,identity,immediate,implementation,in,including,increment,indicator,initially,inner,inout,input,insensitive,insert,instance,instantiable,int,integer,intersect,intersection,interval,into,invoker,is,isolation,join,k,key,key_member,key_type,language,large,last,lateral,leading,left,length,level,like,ln,local,localtime,localtimestamp,locator,lower,m,map,match,matched,max,maxvalue,member,merge,message_length,message_octet_length,message_text,method,min,minute,minvalue,mod,modifies,module,month,more,multiset,mumps,name,names,national,natural,nchar,nclob,nesting,new,next,no,none,normalize,normalized,not,""null"",nullable,nullif,nulls,number,numeric,object,octet_length,octets,of,old,on,only,open,option,options,or,order,ordering,ordinality,others,out,outer,output,over,overlaps,overlay,overriding,pad,parameter,parameter_mode,parameter_name,parameter_ordinal_position,parameter_specific_catalog,parameter_specific_name,parameter_specific_schema,partial,partition,pascal,path,percent_rank,percentile_cont,percentile_disc,placing,pli,position,power,preceding,precision,prepare,preserve,primary,prior,privileges,procedure,public,range,rank,read,reads,real,recursive,ref,references,referencing,regr_avgx,regr_avgy,regr_count,regr_intercept,regr_r2,regr_slope,regr_sxx,regr_sxy,regr_syy,relative,release,repeatable,restart,result,return,returned_cardinality,returned_length,returned_octet_length,returned_sqlstate,returns,revoke,right,role,rollback,rollup,routine,routine_catalog,routine_name,routine_schema,row,row_count,row_number,rows,savepoint,scale,schema,schema_name,scope_catalog,scope_name,scope_schema,scroll,search,second,section,security,select,self,sensitive,sequence,serializable,server_name,session,session_user,set,sets,similar,simple,size,smallint,some,source,space,specific,specific_name,specifictype,sql,sqlexception,sqlstate,sqlwarning,sqrt,start,state,statement,static,stddev_pop,stddev_samp,structure,style,subclass_origin,submultiset,substring,sum,symmetric,system,system_user,table,table_name,tablesample,temporary,then,ties,time,timestamp,timezone_hour,timezone_minute,to,top_level_count,trailing,transaction,transaction_active,transactions_committed,transactions_rolled_back,transform,transforms,translate,translation,treat,trigger,trigger_catalog,trigger_name,trigger_schema,trim,true,type,uescape,unbounded,uncommitted,under,union,unique,unknown,unnamed,unnest,update,upper,usage,user,user_defined_type_catalog,user_defined_type_code,user_defined_type_name,user_defined_type_schema,using,value,values,var_pop,var_samp,varchar,varying,view,when,whenever,where,width_bucket,window,with,within,without,work,write,year,zone}'::text[])",,"dsm_impl_pin_segment, dsm_impl.c:991","PostgreSQL JDBC Driver","client backend",,0 2023-09-18 13:09:52.122 EDT,"dbuser","database",36664,"app_node_2_ip:61657",650883f8.8f38,17,"",2023-09-18 13:08:08 EDT,185/19545,0,ERROR,XX000,"could not duplicate handle for ""Global/PostgreSQL.158941235"": Bad file descriptor",,,,,,"SELECT COUNT(I.ID) FROM public.jiraissue I",,"dsm_impl_pin_segment, dsm_impl.c:991","PostgreSQL JDBC Driver","client backend",,0 2023-09-18 13:10:25.322 EDT,"dbuser","database",28488,"app_node_2_ip:61653",650883e8.6f48,12,"",2023-09-18 13:07:52 EDT,184/44578,0,ERROR,XX000,"could not duplicate handle for ""Global/PostgreSQL.1225474099"": Bad file descriptor",,,,,,"SELECT COUNT(I.ID) FROM public.jiraissue I",,"dsm_impl_pin_segment, dsm_impl.c:991","PostgreSQL JDBC Driver","client backend",,0 2023-09-18 13:11:58.207 EDT,"dbuser","database",36664,"app_node_2_ip:61657",650883f8.8f38,21,"",2023-09-18 13:08:08 EDT,185/53587,0,ERROR,XX000,"could not duplicate handle for ""Global/PostgreSQL.1764720693"": Bad file descriptor",,,,,,"SELECT COUNT(I.ID) FROM public.jiraissue I WHERE (I.ARCHIVED = $1 ) OR (I.ARCHIVED IS NULL )",,"dsm_impl_pin_segment, dsm_impl.c:991","PostgreSQL JDBC Driver","client backend",,0 2023-09-18 13:09:50.434 EDT,"dbuser","database",55164,"app_node_1_ip:51366",6508831b.d77c,4,"",2023-09-18 13:04:27 EDT,164/254836,758544373,ERROR,23505,"duplicate key value violates unique constraint ""cluster_lock_name_idx""","Key (lock_name)=(com.atlassian.jira.service.DefaultServiceManager.updateLock) already exists.",,,,,"insert into public.clusterlockstatus (lock_name, update_time, id) values ($1, $2, $3)",,"_bt_check_unique, nbtinsert.c:670","PostgreSQL JDBC Driver","client backend",,0 Thanks, Max
Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes: > Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file > descriptor That must be coming from dsm_impl_pin_segment or dsm_impl_unpin_segment; noplace else has the same error message spelling. > Error started after a pg_upgrade from 11.17 to 14.9 on the same physical > servers. Both those messages predate v11, so it's hard to tell what changed. Are you sure the OS environment didn't change too? regards, tom lane
Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
Andres Freund
Date:
Hi, On 2023-10-20 18:16:04 -0400, Tom Lane wrote: > PG Bug reporting form <noreply@postgresql.org> writes: > > Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file > > descriptor > > That must be coming from dsm_impl_pin_segment or dsm_impl_unpin_segment; > noplace else has the same error message spelling. > > > Error started after a pg_upgrade from 11.17 to 14.9 on the same physical > > servers. > > Both those messages predate v11, so it's hard to tell what changed. > Are you sure the OS environment didn't change too? Hm. The first failing query in the log queries pg_catalog.pg_get_keywords(), which uses record types, makes me suspect that the use of dsms is triggered by the shared typmod registry. However, the shared typmod registry already exists in 11 and the query should be way too cheap to trigger use of parallel query. But once we started using the registry I think we continue using it for the rest of the lifetime of the backend. I guess we might now parallelize queries that we didn't in 11 and thus it's more likely that we end up needing the typmod registry? Greetings, Andres Freund
Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On 2023-10-20 18:16:04 -0400, Tom Lane wrote: >> That must be coming from dsm_impl_pin_segment or dsm_impl_unpin_segment; >> noplace else has the same error message spelling. > I guess we might now parallelize queries that we didn't in 11 and thus it's > more likely that we end up needing the typmod registry? Maybe. Disabling parallel queries should be an easy stopgap workaround, in any case. regards, tom lane
Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
Thomas Munro
Date:
On Sat, Oct 21, 2023 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-10-20 18:16:04 -0400, Tom Lane wrote: > >> That must be coming from dsm_impl_pin_segment or dsm_impl_unpin_segment; > >> noplace else has the same error message spelling. > > > I guess we might now parallelize queries that we didn't in 11 and thus it's > > more likely that we end up needing the typmod registry? > > Maybe. Disabling parallel queries should be an easy stopgap workaround, > in any case. Yeah. Some initial observations about the error: it's dsm_impl_pin_segment(), not dsm_impl_unpin_segment() (kinda strange that we have the same error message for the latter given that it's trying to close it, but anyway here we can see the line number in the OP's report). Both callers of dsm_pin_segment() are immediately after dsm_create(), so we're dealing with an inability to pin a freshly allocated memory object. The way we pin in Windows is that we duplicate the handle into the postmaster for safe-keeping, to keep its refcount (the kernel's refcount, not ours) from reaching zero, and then we unpin by closing that dup'd handle in the postmaster. Perhaps there are two handles that could be b0rked there, triggering an error message like that: the postmaster process handle, and the memory object handle.
Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
Thomas Munro
Date:
Ahh, I think I might see it. "Global/PostgreSQL.2437152779" is an odd DSM handle (as in, the low order bit is set). That means that it's a "main region" DSM segment, of the type that you get if you set min_dynamic_shared_memory (commit 84b1c63a). That looks potentially broken on Windows, because we have this extra NT handle sharing stuff in dsm_impl_pin_segment(), which can't possibly work and should probably be gated on !is_main_region_dsm_handle(seg->handle). On every other OS that function does nothing, which would explain how we didn't notice for so long. Maxime, do you have min_dynamic_shared_memory set?
Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
Thomas Munro
Date:
On Sat, Oct 21, 2023 at 5:10 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Ahh, I think I might see it. "Global/PostgreSQL.2437152779" is an odd > DSM handle (as in, the low order bit is set). That means that it's a > "main region" DSM segment, of the type that you get if you set > min_dynamic_shared_memory (commit 84b1c63a). That looks potentially > broken on Windows, because we have this extra NT handle sharing stuff > in dsm_impl_pin_segment(), which can't possibly work and should > probably be gated on !is_main_region_dsm_handle(seg->handle). On > every other OS that function does nothing, which would explain how we > didn't notice for so long. > > Maxime, do you have min_dynamic_shared_memory set? And concretely this would be the fix, I think. But I don't have Windows, and for reasons I haven't fathomed I haven't yet managed to reproduce it by pushing a change of default min_dynamic_shared_memory to CI. Is there any chance that a Postgres hacker on Windows could confirm the above theory, and that the attached fixes it?
Attachment
Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
Alexander Lakhin
Date:
Hi Thomas, 21.10.2023 08:26, Thomas Munro wrote: > And concretely this would be the fix, I think. But I don't have > Windows, and for reasons I haven't fathomed I haven't yet managed to > reproduce it by pushing a change of default min_dynamic_shared_memory > to CI. Is there any chance that a Postgres hacker on Windows could > confirm the above theory, and that the attached fixes it? I've tried on master: echo min_dynamic_shared_memory = 10MB >c:\temp\extra.config set TEMP_CONFIG=c:\temp\extra.config vcregress check and got: ... Installation complete. # initializing database system by running initdb # postmaster failed, examine ".../src/test/regress/log/postmaster.log" for the reasonBail out! >cat ../../../src/test/regress/log/postmaster.log 2023-10-20 23:12:46.110 PDT postmaster[3140] LOG: starting PostgreSQL 17devel, compiled by Visual C++ build 1928, 64-bit 2023-10-20 23:12:46.113 PDT postmaster[3140] LOG: listening on IPv6 address "::1", port 55312 2023-10-20 23:12:46.113 PDT postmaster[3140] LOG: listening on IPv4 address "127.0.0.1", port 55312 2023-10-20 23:12:46.129 PDT startup[6188] LOG: database system was shut down at 2023-10-20 23:12:41 PDT 2023-10-20 23:12:46.130 PDT startup[6188] FATAL: could not duplicate handle for "Global/PostgreSQL.753594369": Bad file descriptor 2023-10-20 23:12:46.132 PDT postmaster[3140] LOG: startup process (PID 6188) exited with exit code 1 2023-10-20 23:12:46.132 PDT postmaster[3140] LOG: aborting startup due to startup process failure 2023-10-20 23:12:46.132 PDT postmaster[3140] LOG: database system is shut down With your patch applied, `vcregress check` passes just fine. Best regards, Alexander
RE: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
"Boyer, Maxime (he/him | il/lui)"
Date:
You guys are good! I did set min_dynamic_shared_memory to 2 GB in the new config, which didn't exist on 11. I can set itto 0 and give it a try on Tuesday. Does 2 GB seem overkill? We have plenty of memory to go by. dynamic_shared_memory_type = windows # the default is the first option # supported by the operating system: # posix # sysv # windows # mmap # (change requires restart) min_dynamic_shared_memory = 2GB # (change requires restart) Thanks! Maxime -----Original Message----- From: Alexander Lakhin <exclusion@gmail.com> Sent: October 21, 2023 2:30 AM To: Thomas Munro <thomas.munro@gmail.com>; Tom Lane <tgl@sss.pgh.pa.us> Cc: Andres Freund <andres@anarazel.de>; Boyer, Maxime (he/him | il/lui) <Maxime.Boyer@cra-arc.gc.ca>; pgsql-bugs@lists.postgresql.org Subject: Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor Hi Thomas, 21.10.2023 08:26, Thomas Munro wrote: > And concretely this would be the fix, I think. But I don't have > Windows, and for reasons I haven't fathomed I haven't yet managed to > reproduce it by pushing a change of default min_dynamic_shared_memory > to CI. Is there any chance that a Postgres hacker on Windows could > confirm the above theory, and that the attached fixes it? I've tried on master: echo min_dynamic_shared_memory = 10MB >c:\temp\extra.config set TEMP_CONFIG=c:\temp\extra.config vcregress check and got: ... Installation complete. # initializing database system by running initdb # postmaster failed, examine ".../src/test/regress/log/postmaster.log" forthe reasonBail out! >cat ../../../src/test/regress/log/postmaster.log 2023-10-20 23:12:46.110 PDT postmaster[3140] LOG: starting PostgreSQL 17devel, compiled by Visual C++ build 1928, 64-bit 2023-10-20 23:12:46.113 PDT postmaster[3140] LOG: listening on IPv6 address "::1", port 55312 2023-10-20 23:12:46.113 PDT postmaster[3140] LOG: listening on IPv4 address "127.0.0.1", port 55312 2023-10-20 23:12:46.129 PDT startup[6188] LOG: database system was shut down at 2023-10-20 23:12:41 PDT 2023-10-20 23:12:46.130 PDT startup[6188] FATAL: could not duplicate handle for "Global/PostgreSQL.753594369": Bad filedescriptor 2023-10-20 23:12:46.132 PDT postmaster[3140] LOG: startup process (PID 6188) exited with exit code 1 2023-10-20 23:12:46.132 PDT postmaster[3140] LOG: aborting startup due to startup process failure 2023-10-20 23:12:46.132 PDT postmaster[3140] LOG: database system is shut down With your patch applied, `vcregress check` passes just fine. Best regards, Alexander
Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
Thomas Munro
Date:
On Sun, Oct 22, 2023 at 4:38 AM Boyer, Maxime (he/him | il/lui) <Maxime.Boyer@cra-arc.gc.ca> wrote: > You guys are good! I did set min_dynamic_shared_memory to 2 GB in the new config, which didn't exist on 11. I can set itto 0 and give it a try on Tuesday. Does 2 GB seem overkill? We have plenty of memory to go by. Apparently it has never worked on Windows and you're the first to notice and report that, so we can't expect anyone to have platform-specific experience with that. But on other OSes we've seen it improve performance of large parallel joins and parallel vacuum, especially when 'huge pages' are configured. (Note that we had reports that our huge page support for windows stopped working on recent Windows versions, addressed in PostgreSQL 16 by commit fdd8937c, so YMMV on PostgreSQL 14.) Thanks for the report, and to Alexander for testing. I've pushed the fix now, so it will ship in the next point release in a few weeks. In the meantime, the workaround is to set it to 0.
RE: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor
From
"Boyer, Maxime (he/him | il/lui)"
Date:
Categorization: Unclassified Thank you all! :) I changed the setting back to 0 and we're not getting the error anymore. I have another error from pg11I'll submit a new bug for. Maxime -----Original Message----- From: Thomas Munro <thomas.munro@gmail.com> Sent: October 21, 2023 5:29 PM To: Boyer, Maxime (he/him | il/lui) <Maxime.Boyer@cra-arc.gc.ca> Cc: Alexander Lakhin <exclusion@gmail.com>; Tom Lane <tgl@sss.pgh.pa.us>; Andres Freund <andres@anarazel.de>; pgsql-bugs@lists.postgresql.org Subject: Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor On Sun, Oct 22, 2023 at 4:38 AM Boyer, Maxime (he/him | il/lui) <Maxime.Boyer@cra-arc.gc.ca> wrote: > You guys are good! I did set min_dynamic_shared_memory to 2 GB in the new config, which didn't exist on 11. I can set itto 0 and give it a try on Tuesday. Does 2 GB seem overkill? We have plenty of memory to go by. Apparently it has never worked on Windows and you're the first to notice and report that, so we can't expect anyone to haveplatform-specific experience with that. But on other OSes we've seen it improve performance of large parallel joinsand parallel vacuum, especially when 'huge pages' are configured. (Note that we had reports that our huge page supportfor windows stopped working on recent Windows versions, addressed in PostgreSQL 16 by commit fdd8937c, so YMMV onPostgreSQL 14.) Thanks for the report, and to Alexander for testing. I've pushed the fix now, so it will ship in the next point releasein a few weeks. In the meantime, the workaround is to set it to 0.