RE: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode - Mailing list pgsql-general

From Reid Thompson
Subject RE: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode
Date
Msg-id SJ0PR11MB484893B3096252DC73E0F8B09E499@SJ0PR11MB4848.namprd11.prod.outlook.com
Whole thread Raw
In response to Re: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Responses inline below
Thanks,
reid

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, April 19, 2021 9:54 AM
To: Reid Thompson <Reid.Thompson@omnicell.com>
Cc: pgsql-general@lists.postgresql.org
Subject: Re: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode

Reid Thompson <Reid.Thompson@omnicell.com> writes:
> Hi I'm looking for some guidance related to the subject line issue.

Is this repeatable?  If so you've found a bug, I think.

Alvaro has noted   https://www.postgresql.org/message-id/20210419155928.GA3253%40alvherre.pgsql  that the mentioned bug
fixon the error reporting should correct this ( as the hash join will abort with the fix rather than not recognizing
thatit needs to, and then attempt a later read) 
Somewhat -- I think it's dependent on what else may be writing tmp files ( i.e. it's happened twice in the past few
dayswith successful runs in the interim and I believe that this query has been in place for a long time prior to the
firstoccurrence) 

>   1.  That the error message has been updated ( i.e. SUCCESS is not
> the proper value)

Yeah, what this really indicates is an incomplete read (file shorter than expected).  Since 11.8, we've improved the
errorreporting for that, but that wouldn't in itself fix whatever the underlying problem is.   

See message noted above from Alvaro.

>   2.  That the error is due to running out of temporary space either disk space or maybe temp_buffers?

That could be the proximate cause, although then there would be a bug that the original write failure wasn't detected.
Butit seems about as likely that there's just some inconsistency between what the temp file writing code wrote and what
thereading code expects to read. 

Is this a parallelized hash join by any chance?  That's new in v11 if memory serves, so it'd be interesting to see if
disablingenable_parallel_hash changes anything. 

Explain on the host does not indicate parallelization anywhere - it does indicate that we can narrow the result set by
optimizingour query which we are implementing now.  Hopefully this will get us to a scheduled upgrade timeframe. 

Anyway, I'd counsel updating to current (11.11), and then if you can still reproduce the problem, try to reduce it to a
self-containedtest case. 

                        regards, tom lane



pgsql-general by date:

Previous
From: Reid Thompson
Date:
Subject: RE: could not read from hash-join temporary file: SUCCESS && DB goes into recovery mode
Next
From: Allie Crawford
Date:
Subject: archive_commnad parameter question