Thread: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor

The following bug has been logged on the website:

Bug reference:      18165
Logged by:          Maxime Boyer
Email address:      maxime.boyer@cra-arc.gc.ca
PostgreSQL version: 14.9
Operating system:   Windows Server 2019
Description:

Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file
descriptor

Error started after a pg_upgrade from 11.17 to 14.9 on the same physical
servers. Very little information about this online. I'm not sure which file
it's referring to.

Server:
Windows Server 2019 Standard 64-bit
HP ProLiant BL460c Gen9
Intel Xeon CPU E5-2620 v4 @ 2.10 GHz (32 CPUs)
128 GB RAM

Installed the bundled Microsoft Visual C++ 2015-2022 Redistributable (x64) -
14.36.32532

Facts:
- The instance hosts multiple databases of the same application (Jira)
- Only one database has this error
- Same story on our production instance. The production database has the
errors, not training.
- The affected databases are larger than the others (5 & 17 GB vs ~1 GB)
- The affected databases also show expected "duplicate key value violates
unique constraint" errors, although some without the problem also do.

Errors:
2023-09-18 12:06:19.375
EDT,"dbuser","database",37884,"app_node_1_ip:59368",6508750f.93fc,5,"",2023-09-18
12:04:31 EDT,182/105680,0,ERROR,XX000,"could not duplicate handle for
""Global/PostgreSQL.2437152779"": Bad file descriptor",,,,,,"select
string_agg(word, ',') from pg_catalog.pg_get_keywords() where word <> ALL

('{a,abs,absolute,action,ada,add,admin,after,all,allocate,alter,always,and,any,are,array,as,asc,asensitive,assertion,assignment,asymmetric,at,atomic,attribute,attributes,authorization,avg,before,begin,bernoulli,between,bigint,binary,blob,boolean,both,breadth,by,c,call,called,cardinality,cascade,cascaded,case,cast,catalog,catalog_name,ceil,ceiling,chain,char,char_length,character,character_length,character_set_catalog,character_set_name,character_set_schema,characteristics,characters,check,checked,class_origin,clob,close,coalesce,cobol,code_units,collate,collation,collation_catalog,collation_name,collation_schema,collect,column,column_name,command_function,command_function_code,commit,committed,condition,condition_number,connect,connection_name,constraint,constraint_catalog,constraint_name,constraint_schema,constraints,constructors,contains,continue,convert,corr,corresponding,count,covar_pop,covar_samp,create,cross,cube,cume_dist,current,current_collation,current_date,current_default_transform_group,current_path,current_role,current_time,current_timestamp,current_transform_group_for_type,current_user,cursor,cursor_name,cycle,data,date,datetime_interval_code,datetime_interval_precision,day,deallocate,dec,decimal,declare,default,defaults,deferrable,deferred,defined,definer,degree,delete,dense_rank,depth,deref,derived,desc,describe,descriptor,deterministic,diagnostics,disconnect,dispatch,distinct,domain,double,drop,dynamic,dynamic_function,dynamic_function_code,each,element,else,end,end-exec,equals,escape,every,except,exception,exclude,excluding,exec,execute,exists,exp,external,extract,false,fetch,filter,final,first,float,floor,following,for,foreign,fortran,found,free,from,full,function,fusion,g,general,get,global,go,goto,grant,granted,group,grouping,having,hierarchy,hold,hour,identity,immediate,implementation,in,including,increment,indicator,initially,inner,inout,input,insensitive,insert,instance,instantiable,int,integer,intersect,intersection,interval,into,invoker,is,isolation,join,k,key,key_member,key_type,language,large,last,lateral,leading,left,length,level,like,ln,local,localtime,localtimestamp,locator,lower,m,map,match,matched,max,maxvalue,member,merge,message_length,message_octet_length,message_text,method,min,minute,minvalue,mod,modifies,module,month,more,multiset,mumps,name,names,national,natural,nchar,nclob,nesting,new,next,no,none,normalize,normalized,not,""null"",nullable,nullif,nulls,number,numeric,object,octet_length,octets,of,old,on,only,open,option,options,or,order,ordering,ordinality,others,out,outer,output,over,overlaps,overlay,overriding,pad,parameter,parameter_mode,parameter_name,parameter_ordinal_position,parameter_specific_catalog,parameter_specific_name,parameter_specific_schema,partial,partition,pascal,path,percent_rank,percentile_cont,percentile_disc,placing,pli,position,power,preceding,precision,prepare,preserve,primary,prior,privileges,procedure,public,range,rank,read,reads,real,recursive,ref,references,referencing,regr_avgx,regr_avgy,regr_count,regr_intercept,regr_r2,regr_slope,regr_sxx,regr_sxy,regr_syy,relative,release,repeatable,restart,result,return,returned_cardinality,returned_length,returned_octet_length,returned_sqlstate,returns,revoke,right,role,rollback,rollup,routine,routine_catalog,routine_name,routine_schema,row,row_count,row_number,rows,savepoint,scale,schema,schema_name,scope_catalog,scope_name,scope_schema,scroll,search,second,section,security,select,self,sensitive,sequence,serializable,server_name,session,session_user,set,sets,similar,simple,size,smallint,some,source,space,specific,specific_name,specifictype,sql,sqlexception,sqlstate,sqlwarning,sqrt,start,state,statement,static,stddev_pop,stddev_samp,structure,style,subclass_origin,submultiset,substring,sum,symmetric,system,system_user,table,table_name,tablesample,temporary,then,ties,time,timestamp,timezone_hour,timezone_minute,to,top_level_count,trailing,transaction,transaction_active,transactions_committed,transactions_rolled_back,transform,transforms,translate,translation,treat,trigger,trigger_catalog,trigger_name,trigger_schema,trim,true,type,uescape,unbounded,uncommitted,under,union,unique,unknown,unnamed,unnest,update,upper,usage,user,user_defined_type_catalog,user_defined_type_code,user_defined_type_name,user_defined_type_schema,using,value,values,var_pop,var_samp,varchar,varying,view,when,whenever,where,width_bucket,window,with,within,without,work,write,year,zone}'::text[])",,"dsm_impl_pin_segment,
dsm_impl.c:991","PostgreSQL JDBC Driver","client backend",,0
2023-09-18 13:09:52.122
EDT,"dbuser","database",36664,"app_node_2_ip:61657",650883f8.8f38,17,"",2023-09-18
13:08:08 EDT,185/19545,0,ERROR,XX000,"could not duplicate handle for
""Global/PostgreSQL.158941235"": Bad file descriptor",,,,,,"SELECT
COUNT(I.ID) FROM public.jiraissue I",,"dsm_impl_pin_segment,
dsm_impl.c:991","PostgreSQL JDBC Driver","client backend",,0
2023-09-18 13:10:25.322
EDT,"dbuser","database",28488,"app_node_2_ip:61653",650883e8.6f48,12,"",2023-09-18
13:07:52 EDT,184/44578,0,ERROR,XX000,"could not duplicate handle for
""Global/PostgreSQL.1225474099"": Bad file descriptor",,,,,,"SELECT
COUNT(I.ID) FROM public.jiraissue I",,"dsm_impl_pin_segment,
dsm_impl.c:991","PostgreSQL JDBC Driver","client backend",,0
2023-09-18 13:11:58.207
EDT,"dbuser","database",36664,"app_node_2_ip:61657",650883f8.8f38,21,"",2023-09-18
13:08:08 EDT,185/53587,0,ERROR,XX000,"could not duplicate handle for
""Global/PostgreSQL.1764720693"": Bad file descriptor",,,,,,"SELECT
COUNT(I.ID) FROM public.jiraissue I WHERE (I.ARCHIVED =  $1 ) OR (I.ARCHIVED
IS NULL )",,"dsm_impl_pin_segment, dsm_impl.c:991","PostgreSQL JDBC
Driver","client backend",,0

2023-09-18 13:09:50.434
EDT,"dbuser","database",55164,"app_node_1_ip:51366",6508831b.d77c,4,"",2023-09-18
13:04:27 EDT,164/254836,758544373,ERROR,23505,"duplicate key value violates
unique constraint ""cluster_lock_name_idx""","Key
(lock_name)=(com.atlassian.jira.service.DefaultServiceManager.updateLock)
already exists.",,,,,"insert into public.clusterlockstatus (lock_name,
update_time, id) values ($1, $2, $3)",,"_bt_check_unique,
nbtinsert.c:670","PostgreSQL JDBC Driver","client backend",,0

Thanks,
Max


PG Bug reporting form <noreply@postgresql.org> writes:
> Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file
> descriptor

That must be coming from dsm_impl_pin_segment or dsm_impl_unpin_segment;
noplace else has the same error message spelling.

> Error started after a pg_upgrade from 11.17 to 14.9 on the same physical
> servers.

Both those messages predate v11, so it's hard to tell what changed.
Are you sure the OS environment didn't change too?

            regards, tom lane



Hi,

On 2023-10-20 18:16:04 -0400, Tom Lane wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file
> > descriptor
> 
> That must be coming from dsm_impl_pin_segment or dsm_impl_unpin_segment;
> noplace else has the same error message spelling.
> 
> > Error started after a pg_upgrade from 11.17 to 14.9 on the same physical
> > servers.
> 
> Both those messages predate v11, so it's hard to tell what changed.
> Are you sure the OS environment didn't change too?

Hm. The first failing query in the log queries pg_catalog.pg_get_keywords(),
which uses record types, makes me suspect that the use of dsms is triggered by
the shared typmod registry. However, the shared typmod registry already exists
in 11 and the query should be way too cheap to trigger use of parallel
query. But once we started using the registry I think we continue using it for
the rest of the lifetime of the backend.

I guess we might now parallelize queries that we didn't in 11 and thus it's
more likely that we end up needing the typmod registry?

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2023-10-20 18:16:04 -0400, Tom Lane wrote:
>> That must be coming from dsm_impl_pin_segment or dsm_impl_unpin_segment;
>> noplace else has the same error message spelling.

> I guess we might now parallelize queries that we didn't in 11 and thus it's
> more likely that we end up needing the typmod registry?

Maybe.  Disabling parallel queries should be an easy stopgap workaround,
in any case.

            regards, tom lane



On Sat, Oct 21, 2023 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-10-20 18:16:04 -0400, Tom Lane wrote:
> >> That must be coming from dsm_impl_pin_segment or dsm_impl_unpin_segment;
> >> noplace else has the same error message spelling.
>
> > I guess we might now parallelize queries that we didn't in 11 and thus it's
> > more likely that we end up needing the typmod registry?
>
> Maybe.  Disabling parallel queries should be an easy stopgap workaround,
> in any case.

Yeah.

Some initial observations about the error: it's
dsm_impl_pin_segment(), not dsm_impl_unpin_segment() (kinda strange
that we have the same error message for the latter given that it's
trying to close it, but anyway here we can see the line number in the
OP's report).  Both callers of dsm_pin_segment() are immediately after
dsm_create(), so we're dealing with an inability to pin a freshly
allocated memory object.  The way we pin in Windows is that we
duplicate the handle into the postmaster for safe-keeping, to keep its
refcount (the kernel's refcount, not ours) from reaching zero, and
then we unpin by closing that dup'd handle in the postmaster.  Perhaps
there are two handles that could be b0rked there, triggering an error
message like that: the postmaster process handle, and the memory
object handle.



Ahh, I think I might see it.  "Global/PostgreSQL.2437152779" is an odd
DSM handle (as in, the low order bit is set).  That means that it's a
"main region" DSM segment, of the type that you get if you set
min_dynamic_shared_memory (commit 84b1c63a).  That looks potentially
broken on Windows, because we have this extra NT handle sharing stuff
in dsm_impl_pin_segment(), which can't possibly work and should
probably be gated on !is_main_region_dsm_handle(seg->handle).  On
every other OS that function does nothing, which would explain how we
didn't notice for so long.

Maxime, do you have min_dynamic_shared_memory set?



On Sat, Oct 21, 2023 at 5:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Ahh, I think I might see it.  "Global/PostgreSQL.2437152779" is an odd
> DSM handle (as in, the low order bit is set).  That means that it's a
> "main region" DSM segment, of the type that you get if you set
> min_dynamic_shared_memory (commit 84b1c63a).  That looks potentially
> broken on Windows, because we have this extra NT handle sharing stuff
> in dsm_impl_pin_segment(), which can't possibly work and should
> probably be gated on !is_main_region_dsm_handle(seg->handle).  On
> every other OS that function does nothing, which would explain how we
> didn't notice for so long.
>
> Maxime, do you have min_dynamic_shared_memory set?

And concretely this would be the fix, I think.  But I don't have
Windows, and for reasons I haven't fathomed I haven't yet managed to
reproduce it by pushing a change of default min_dynamic_shared_memory
to CI.  Is there any chance that a Postgres hacker on Windows could
confirm the above theory, and that the attached fixes it?

Attachment
Hi Thomas,

21.10.2023 08:26, Thomas Munro wrote:
> And concretely this would be the fix, I think.  But I don't have
> Windows, and for reasons I haven't fathomed I haven't yet managed to
> reproduce it by pushing a change of default min_dynamic_shared_memory
> to CI.  Is there any chance that a Postgres hacker on Windows could
> confirm the above theory, and that the attached fixes it?

I've tried on master:
echo min_dynamic_shared_memory = 10MB >c:\temp\extra.config
set TEMP_CONFIG=c:\temp\extra.config
vcregress check

and got:
...
Installation complete.
# initializing database system by running initdb
# postmaster failed, examine ".../src/test/regress/log/postmaster.log" for the reasonBail out!

 >cat ../../../src/test/regress/log/postmaster.log
2023-10-20 23:12:46.110 PDT postmaster[3140] LOG:  starting PostgreSQL 17devel, compiled by Visual C++ build 1928,
64-bit
2023-10-20 23:12:46.113 PDT postmaster[3140] LOG:  listening on IPv6 address "::1", port 55312
2023-10-20 23:12:46.113 PDT postmaster[3140] LOG:  listening on IPv4 address "127.0.0.1", port 55312
2023-10-20 23:12:46.129 PDT startup[6188] LOG:  database system was shut down at 2023-10-20 23:12:41 PDT
2023-10-20 23:12:46.130 PDT startup[6188] FATAL:  could not duplicate handle for "Global/PostgreSQL.753594369": Bad
file
 
descriptor
2023-10-20 23:12:46.132 PDT postmaster[3140] LOG:  startup process (PID 6188) exited with exit code 1
2023-10-20 23:12:46.132 PDT postmaster[3140] LOG:  aborting startup due to startup process failure
2023-10-20 23:12:46.132 PDT postmaster[3140] LOG:  database system is shut down

With your patch applied, `vcregress check` passes just fine.

Best regards,
Alexander



RE: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor

From
"Boyer, Maxime (he/him | il/lui)"
Date:
You guys are good! I did set min_dynamic_shared_memory to 2 GB in the new config, which didn't exist on 11. I can set
itto 0 and give it a try on Tuesday. Does 2 GB seem overkill? We have plenty of memory to go by.
 

dynamic_shared_memory_type = windows    # the default is the first option
                    # supported by the operating system:
                    #   posix
                    #   sysv
                    #   windows
                    #   mmap
                    # (change requires restart)
min_dynamic_shared_memory = 2GB    # (change requires restart)

Thanks!
Maxime

-----Original Message-----
From: Alexander Lakhin <exclusion@gmail.com> 
Sent: October 21, 2023 2:30 AM
To: Thomas Munro <thomas.munro@gmail.com>; Tom Lane <tgl@sss.pgh.pa.us>
Cc: Andres Freund <andres@anarazel.de>; Boyer, Maxime (he/him | il/lui) <Maxime.Boyer@cra-arc.gc.ca>;
pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor

Hi Thomas,

21.10.2023 08:26, Thomas Munro wrote:
> And concretely this would be the fix, I think.  But I don't have 
> Windows, and for reasons I haven't fathomed I haven't yet managed to 
> reproduce it by pushing a change of default min_dynamic_shared_memory 
> to CI.  Is there any chance that a Postgres hacker on Windows could 
> confirm the above theory, and that the attached fixes it?

I've tried on master:
echo min_dynamic_shared_memory = 10MB >c:\temp\extra.config set TEMP_CONFIG=c:\temp\extra.config vcregress check

and got:
...
Installation complete.
# initializing database system by running initdb # postmaster failed, examine ".../src/test/regress/log/postmaster.log"
forthe reasonBail out!
 

 >cat ../../../src/test/regress/log/postmaster.log
2023-10-20 23:12:46.110 PDT postmaster[3140] LOG:  starting PostgreSQL 17devel, compiled by Visual C++ build 1928,
64-bit
2023-10-20 23:12:46.113 PDT postmaster[3140] LOG:  listening on IPv6 address "::1", port 55312
2023-10-20 23:12:46.113 PDT postmaster[3140] LOG:  listening on IPv4 address "127.0.0.1", port 55312
2023-10-20 23:12:46.129 PDT startup[6188] LOG:  database system was shut down at 2023-10-20 23:12:41 PDT
2023-10-20 23:12:46.130 PDT startup[6188] FATAL:  could not duplicate handle for "Global/PostgreSQL.753594369": Bad
filedescriptor
 
2023-10-20 23:12:46.132 PDT postmaster[3140] LOG:  startup process (PID 6188) exited with exit code 1
2023-10-20 23:12:46.132 PDT postmaster[3140] LOG:  aborting startup due to startup process failure
2023-10-20 23:12:46.132 PDT postmaster[3140] LOG:  database system is shut down

With your patch applied, `vcregress check` passes just fine.

Best regards,
Alexander

On Sun, Oct 22, 2023 at 4:38 AM Boyer, Maxime (he/him | il/lui)
<Maxime.Boyer@cra-arc.gc.ca> wrote:
> You guys are good! I did set min_dynamic_shared_memory to 2 GB in the new config, which didn't exist on 11. I can set
itto 0 and give it a try on Tuesday. Does 2 GB seem overkill? We have plenty of memory to go by. 

Apparently it has never worked on Windows and you're the first to
notice and report that, so we can't expect anyone to have
platform-specific experience with that.  But on other OSes we've seen
it improve performance of large parallel joins and parallel vacuum,
especially when 'huge pages' are configured.  (Note that we had
reports that our huge page support for windows stopped working on
recent Windows versions, addressed in PostgreSQL 16 by commit
fdd8937c, so YMMV on PostgreSQL 14.)

Thanks for the report, and to Alexander for testing.  I've pushed the
fix now, so it will ship in the next point release in a few weeks.  In
the meantime, the workaround is to set it to 0.



RE: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor

From
"Boyer, Maxime (he/him | il/lui)"
Date:
Categorization: 
Unclassified 

Thank you all! :) I changed the setting back to 0 and we're not getting the error anymore. I have another error from
pg11I'll submit a new bug for.
 

Maxime

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com> 
Sent: October 21, 2023 5:29 PM
To: Boyer, Maxime (he/him | il/lui) <Maxime.Boyer@cra-arc.gc.ca>
Cc: Alexander Lakhin <exclusion@gmail.com>; Tom Lane <tgl@sss.pgh.pa.us>; Andres Freund <andres@anarazel.de>;
pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #18165: Could not duplicate handle for "Global/PostgreSQL.xxxxxxxxxx": Bad file descriptor

On Sun, Oct 22, 2023 at 4:38 AM Boyer, Maxime (he/him | il/lui) <Maxime.Boyer@cra-arc.gc.ca> wrote:
> You guys are good! I did set min_dynamic_shared_memory to 2 GB in the new config, which didn't exist on 11. I can set
itto 0 and give it a try on Tuesday. Does 2 GB seem overkill? We have plenty of memory to go by.
 

Apparently it has never worked on Windows and you're the first to notice and report that, so we can't expect anyone to
haveplatform-specific experience with that.  But on other OSes we've seen it improve performance of large parallel
joinsand parallel vacuum, especially when 'huge pages' are configured.  (Note that we had reports that our huge page
supportfor windows stopped working on recent Windows versions, addressed in PostgreSQL 16 by commit fdd8937c, so YMMV
onPostgreSQL 14.)
 

Thanks for the report, and to Alexander for testing.  I've pushed the fix now, so it will ship in the next point
releasein a few weeks.  In the meantime, the workaround is to set it to 0.