Re: How to perform a long running dry run transaction without blocking - Mailing list pgsql-general

From Adrian Klaver
Subject Re: How to perform a long running dry run transaction without blocking
Date
Msg-id 047f05ca-1519-410b-abe6-3773b6533b52@aklaver.com
In response to How to perform a long running dry run transaction without blocking  (Robert Leach <rleach@princeton.edu>)
List pgsql-general
On 2/7/25 05:49, Robert Leach wrote:

Ccing list
>>>> Alright, I am trying to reconcile this with, from below, 'The largest studies take just under a minute'.
>>> The context of the 'The largest studies take just under a minute' statement is that it's not loading the
>>> hefty/time-consuming raw data.  It's only validating the metadata.  That's fast (5-60s).  And that data is a
>>> portion of the transaction in the back-end load.  There are errors that validation can miss that are due to not
>>> touching the raw data, and in fact, those errors are addressed by curators editing the excel sheets.  That's why
>>> it's all in the load transaction instead of
>>
>> As a scientist that makes me start to twitch.
>>
>> Is there an audit trail for that?
> 
> We have a well-defined curation process.  Original files are checked into a private data repo.  There is a
> CHANGES.md file that details any/every change a curator makes, and the curator coordinates these changes with the
> researcher.
> 

>> Aah, time travel.
> 
> Sorry, "anticipate" wasn't the best word choice.  What I was thinking was more along the lines of "detect" when a
blockhappens and at that point, either stop and queue up the validation or provide a status/progress that shows it's
waitingon validations in line before it.  Probably not possible.  It's just what I'm imagining would be the most
efficientstrategy with the least wait time.
 
> 

>>> Other documentation I read referred to the state of the DB (when a transaction starts) as a "snapshot" and I
>>> thought... what if I could save such a snapshot automatically just *before* a back-end load starts, and use that
>>> snapshot for validation, such that my validation processes could use that to validate against and not encounter
>>> any locks?  The validation will never commit, so there's no risk.
>>
>> Hmm. I don't think so.
> 
>  From a separate thread, which seems analogous...:
> 
>> Seems to me this could be dealt with using a schema named validate that contains 'shadow' tables of those in the
>> live schema(s).  Import into them and see what fails.
> 
> What is a "shadow table"?  Is that a technical thing?  Could these shadow tables be in the same database?  (Trying to
wrapmy head around what this implies is possible.)
 

No, it's a concept of my own making. Basically, create copies of the 
tables you are loading into now in a separate schema and load into them 
instead. Then do the validation against those tables; that would reduce 
the contention with the existing data. It is something I have done to 
validate data from another data source before loading it into production 
tables. A possible issue to deal with is how much, if at all, you 
depend on other tables for the validation process.
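
As a rough illustration of the idea, here is a minimal SQL sketch. The 
table name public.samples and the CSV path are hypothetical placeholders, 
not something from this thread:

-- Shadow schema holding structural copies of the production tables.
CREATE SCHEMA IF NOT EXISTS validate;

-- Clone the shape of a production table, including defaults, NOT NULL,
-- CHECK constraints and indexes (foreign keys are not copied by LIKE).
CREATE TABLE validate.samples (LIKE public.samples INCLUDING ALL);

-- Load the candidate data into the shadow table instead of the live one;
-- type and constraint errors surface here without locking production rows.
COPY validate.samples FROM '/tmp/samples.csv' WITH (FORMAT csv, HEADER);

-- Run validation queries against validate.samples, then clear it out so
-- the next dry run starts from an empty table.
TRUNCATE validate.samples;

Note that LIKE ... INCLUDING ALL does not carry over foreign key 
constraints, which is one concrete form of the dependence on other 
tables mentioned above; cross-table checks would have to be written as 
explicit queries against the live schema, or against shadow copies of 
those tables as well.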


-- 
Adrian Klaver
adrian.klaver@aklaver.com



