Thread: How to perform a long running dry run transaction without blocking

How to perform a long running dry run transaction without blocking

From
Robert Leach
Date:
I've been trying to solve this problem in Django and I've finally decided, after over a year going down this path, that there's no way to solve it in Django (in a database-agnostic fashion). So I'm thinking that I need to explore a Postgres-specific solution. FYI, I'm not a database expert. I just know enough to get myself into trouble, like this problem.

Let me try and distill all the back story down to an oversimplified explanation:

I created a scientific data submission validation interface that helps researchers compile their data submissions to our database. To do this, I decided to add a `--validate` option to the load scripts that essentially raises a dry run exception at the end of a load run, so that the data is all rolled back before leaving the atomic transaction block.
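
To make the pattern concrete, here's a rough sketch of what the `--validate` path does (the function and exception names are illustrative, not the actual codebase):

    from django.db import transaction

    class DryRun(Exception):
        """Raised at the end of a --validate run so the atomic block rolls back."""

    def run_load(sheets, validate=False):
        # Illustrative only: the real loaders insert the metadata via the ORM.
        try:
            with transaction.atomic():
                load_metadata(sheets)   # hypothetical entry point for the per-sheet loaders
                if validate:
                    raise DryRun()      # forces a rollback of everything inserted above
        except DryRun:
            pass                        # nothing was committed; buffered errors go into the report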

This validation interface skips the raw data load step, which is the heftiest, most long-running part, and it generally finishes in well under a minute.

The interface works fantastically well. It rolls back problematic data in smaller transactions and buffers those errors for a final report, so the user can fix the problems in their data files and retry validation until everything is solved and the data is ready to load.

The problem is that if we have a long running load going on in the background (which can take hours) and, during that load, a user tries to validate a new submission that contains some overlapping common data (e.g. a new tissue type or protocol that is currently being loaded), or perhaps they are trying to validate data added to an existing submission that is being partially loaded, that validation process gets blocked and the validation interface encounters a gateway timeout.

I had tried changing the isolation level to both repeatable read and serializable, but the hang can still occur (though serializable seems to avoid hangs in some cases that repeatable read does not).
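
(For context, in case the mechanism matters: the isolation level can be set either in Django's DATABASES OPTIONS or per transaction in SQL. A rough sketch, assuming the psycopg2 backend; the database name is illustrative.)

    # settings.py sketch (psycopg2 backend); isolation_level is a documented OPTIONS key
    import psycopg2.extensions

    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "tracebase",  # illustrative
            "OPTIONS": {
                "isolation_level": psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ,
            },
        },
    }

    # or per transaction, in raw SQL:
    #   BEGIN;
    #   SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;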

My initial interpretation of the isolation level documentation was that transactions would not be aware of what is happening in other transactions, and that if there was a problem, you would only ever get a serialization error at the end when the data is committed. But after encountering the hang yesterday, I realized that the "snapshot" the second transaction gets does not shield it from locks already established by the first transaction.

I had hoped that since my validation interface would never commit, it would never have a serialization error or hang.

What options are available that might allow this all to work as I intended? Is there some way to tell the validation process's transaction to assume that any pending changes from another transaction will succeed?

I have lots of ideas on how to mitigate these occurrences, but I don't want to bias any potentially helpful responses. Am I screwed by my early decision to use the loading code to validate data submissions?


Re: How to perform a long running dry run transaction without blocking

From
Adrian Klaver
Date:
On 2/6/25 07:40, Robert Leach wrote:

Comments inline.

> Let me try and distill all the back story down to an oversimplified explanation:
> 
> I created a scientific data submission validation interface that helps researchers compile their data submissions to our database. To do this, I decided to add a `--validate` option to the load scripts that essentially raises a dry run exception at the end of a load run, so that the data is all rolled back before leaving the atomic transaction block.
 

Load to where, an existing table or a temporary table?

> 
> This validation interface skips the raw data load step, which is the heftiest, most long-running part, and it generally finishes in well under a minute.
 

So what is the validation actually doing?

> 
> The interface works fantastically well. It rolls back problematic data in smaller transactions and buffers those errors for a final report, so the user can fix the problems in their data files and retry validation until everything is solved and the data is ready to load.
 
> 
> The problem is that if we have a long running load going on in the background (which can take hours) and, during that load, a user tries to validate a new submission that contains some overlapping common data (e.g. a new tissue type or protocol that is currently being loaded), or perhaps they are trying to validate data added to an existing submission that is being partially loaded, that validation process gets blocked and the validation interface encounters a gateway timeout.

Where are the background processes loading data to?

> 
> I had tried changing the isolation level to both repeatable read and serializable, but the hang can still occur (though serializable seems to avoid hangs in some cases that repeatable read does not).
 
> 
> My initial interpretation of the isolation level documentation was that transactions would not be aware of what is happening in other transactions, and that if there was a problem, you would only ever get a serialization error at the end when the data is committed. But after encountering the hang yesterday, I realized that the "snapshot" the second transaction gets does not shield it from locks already established by the first transaction.
 
> 
> I had hoped that since my validation interface would never commit, it would never have a serialization error or hang.
> 
> What options are available that might allow this all to work as I intended? Is there some way to tell the validation process's transaction to assume that any pending changes from another transaction will succeed?
 

Not that I know of and that would be a false promise anyway as I know of 
no process that is perfect.

> 
> I have lots of ideas on how to mitigate these occurrences, but I don't want to bias any potentially helpful responses. Am I screwed by my early decision to use the loading code to validate data submissions?
 
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: How to perform a long running dry run transaction without blocking

From
Adrian Klaver
Date:
On 2/6/25 08:53, Robert Leach wrote:
> Great questions!  Responses inline...

Please reply to list also.
Ccing list

>> Load to where existing table or temporary table?
> 
> Existing tables.  Note that (and this is a point of contention in our team) we have some fields that must be universally unique across all study data, so if we were to employ temporary (empty) tables, we would miss out on the uniqueness checks.
 

That is another discussion.

> 
> Background: The previous solution for this validation interface that I had tried used a partial copy of the database which I called the "validation" database.  (I'd been forced to do it that way because previous developers hadn't used transactions and their "dry run" mode had side effects that I didn't want to happen due to the validation interface, so instead of engaging in a massive refactor, I tried using a separate database as a quick temporary fix.)  But mind you, all of the code I was writing was in Django (Python), and in order to maintain 2 different databases and stay database-architecture agnostic, I learned that your code must be littered with hundreds of insertions of `.using()` (and other) statements, and it turned out that that strategy is not 100% compatible with every kind of Django ORM thing you can do (e.g. `full_clean()`), so I'd had to dig around in Django core code to ensure every operation was being performed on the correct database.  It was a nightmare to maintain and I happily ripped it all out when I corrected the original problems by wrapping everything in a transaction.

Yeah, reason why I bypass the ORM.

> 
>>> This validation interface skips the raw data load step, which is the heftiest, most long-running part, and it generally finishes in well under a minute.
 
>>
>> So what is the validation actually doing?
> 
> The raw data is pretty solid.  It is automatically generated by hardware (mass spectrometer) and software (peak analysis suites).  So it doesn't need much (if any) validation.  What needs validation is all the metadata associated with the samples that is totally generated by hand by the researchers.
 
> 
> They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences
 

Well there's your problem:) I will spare you my full spreadsheet rant.

Long term, is there a thought to have them enter directly into the database, where validation can happen in real time?

>> Where are the background processes loading data to?
> 
> We first run our loads on a development server with a separate copy of the database, but then we re-run those same loads on the production server, where users perform their validations.
 

The load to the development server does no validation?

If so what is the purpose?

The background processes are other validation runs?

> 
> One of the thoughts I'd had to work around the problem was to somehow dump the data from the development load and load it onto production in some sort of scheduled downtime or something.  However, even if we do that, I'm concerned that multiple users concurrently validating different submissions would encounter this blocking issue, and since those validations can take (on the upper end) just under a minute, it's enough for at least 1 user to encounter a timeout.  I have not yet proven that can happen, but based on my reading of the postgres documentation, it seems logical.
 
> 

Seems you are looking for some sort of queuing system.

What are the time constraints for getting the validation turned around?


>> Not that I know of and that would be a false promise anyway as I know of no process that is perfect.
> 
> I realize that it's a false promise WRT the background load transaction, but it's a moot concern from the perspective of the validation page, because it is making a guaranteed promise that it will never commit.  All I want is for it not to be blocked, so I can report as many errors as I can to the researcher so they can advance their submission compilation.  Besides, there's a pretty solid bet that since the load succeeded on the development server, it will succeed on the production server where this block would happen.
 
> 
> So if the load transaction does fail, and the data the validation process was waiting on (to see the outcome) is not reported as problematic to the user, it will be reported as problematic to those responsible for the load on the back-end, so it gets dealt with either way.
 
> 

>>
>> -- 
>> Adrian Klaver
>> adrian.klaver@aklaver.com
> 
> 
> 
> Robert William Leach
> Research Software Engineer
> 133 Carl C. Icahn Lab
> Lewis-Sigler Institute for Integrative Genomics
> Princeton University
> Princeton, NJ 08544
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: How to perform a long running dry run transaction without blocking

From
Robert Leach
Date:
Please reply to list also.
Ccing list

👍🏻

They enter the data in an excel spreadsheet containing about a dozen inter-related worksheets, named, for example: Study, Animals, Samples, Tissues, Treatments, Sequences

Long term, is there a thought to have them enter directly into the database, where validation can happen in real time?

No, not really.  Sample metadata tends to be an afterthought to researchers.  They have it in their notebooks and getting them to enter it at all is like pulling teeth.  The validation interface actually has a bunch of other features I haven't mentioned that streamline the process for them.  Before it gets to actually validating the data, it tries to lighten the manual burden on the researchers (and help with consistent nomenclature) by pulling sample names out of the raw files, massaging them, and filling those in along with a mass of common data that is used to populate drop-downs in the excel columns to avoid researcher typos and value variants.

Having everything work with excel actually made the site more attractive to the researchers, because they're comfortable with it and use it already, so it lowered the bar for using our software.

Besides, we don't trust the users enough to enter data unsupervised.  There are a lot of aspects of the data that cannot be automatically validated and involve experimental parameters that are adjacent to the purpose of our site.  We have curators that need to look at everything to ensure consistency, and looking at all the data in context is necessary before any of it is entered.

That said, back in the aughts, I wrote a perl cgi site for a toxin and virulence factor database that used a web interface for data entry and achieved the curation goal by saving a form of all inter-related data.  The submit button sent that form to a list of curators who could approve the insert/update and make it actually happen.  I think I had actually suggested that form of data entry when this current project first started, but I was overruled.  However, in this project, the equivalent procedure would be per-sample, and you'd lose out on the overall context.  It's an interesting challenge, but I think we're pretty committed now on this file load path.

Where are the background processes loading data to?
We first run our loads on a development server with a separate copy of the database, but then we re-run those same loads on the production server, where users perform their validations.

The load to the development server does no validation?

If so what is the purpose?

The background processes are other validation runs?

It's the same code that executes in both cases (with or without the `--validate` flag).  All that flag does is (effectively) raise the dry run exception before the code leaves the transaction block, so it always validates (whether the flag is supplied or not).

So the load doesn't fail until the end of the run, which is inefficient from a maintenance perspective.  I've been thinking of adding a `--failfast` option for use on the back end.  Haven't done it yet.  I started a load yesterday in fact that ran 2 hours before it buffered an exception related to a newly introduced bug.  I fixed the bug and ran the load again.  It finished sometime between COB yesterday and this morning (successfully!).

One of the thoughts I'd had to work around the problem was to somehow dump the data from the development load and load it onto production in some sort of scheduled downtime or something.  However, even if we do that, I'm concerned that multiple users concurrently validating different submissions would encounter this blocking issue, and since those validations can take (on the upper end) just under a minute, it's enough for at least 1 user to encounter a timeout.  I have not yet proven that can happen, but based on my reading of the postgres documentation, it seems logical.

Seems you are looking for some sort of queuing system.

What are the time constraints for getting the validation turned around.

I have considered a queuing system, though when I previously floated a proof of concept using celery, I was informed it was too much.  Though, at the time, all I was trying to do was a progress bar for a query stats feature.  So proposing celery in this instance may get more traction with the rest of the team.

Most of the small validation processes finish in under a dozen seconds.  The largest studies take just under a minute.  I have plans to optimize the loading scripts that hopefully could get the largest studies down to a dozen seconds.  If I could do that, and do the back end loads in off-peak hours, then I'd be willing to suffer the rare timeouts from concurrent validations.  The raw data loads will still likely take a much longer time.
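
If celery does get traction, the queued job itself could stay small, since the loaders already do all the work; something like this sketch (the task and helper names are hypothetical, not our actual code):

    # tasks.py sketch; run_loaders and BlockedOrTimedOut are hypothetical names
    from celery import shared_task

    @shared_task(bind=True, max_retries=3)
    def validate_submission(self, submission_id):
        """Run the normal loaders in --validate mode; since the dry-run rollback
        leaves nothing behind, a blocked or timed-out attempt can just be retried."""
        try:
            run_loaders(submission_id, validate=True)
        except BlockedOrTimedOut as exc:
            raise self.retry(exc=exc, countdown=60)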


Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544

Re: How to perform a long running dry run transaction without blocking

From
Adrian Klaver
Date:
On 2/6/25 09:58, Robert Leach wrote:

>>> They enter the data in an excel spreadsheet containing about a dozen 
>>> inter-related worksheets, named, for example: Study, Animals, 
>>> Samples, Tissues, Treatments, Sequences
>>
>> Long term, is there a thought to have them enter directly into the database, where validation can happen in real time?

> Having everything work with excel actually made the site more attractive 
> to the researchers, because they're comfortable with it and use it 
> already, so it lowered the bar for using our software.

[...]

>  It's an interesting challenge, but I think we're 
> pretty committed now on this file load path.

Ok, that is generally how my discussions on this subject end. It's my 
windmill.

>> The load to the development server does no validation?
>>
>> If so what is the purpose?
>>
>> The background processes are other validation runs?
> 
> It's the same code that executes in both cases (with or without the 
> `--validate` flag).  All that that flag does is it (effectively) raises 
> the dry run exception before it leaves the transaction block, so it 
> always validates (whether the flag is supplied or not).

More for my sake than anything else, why do the load to the development server at all if the production load is the only one that counts?


> So the load doesn't fail until the end of the run, which is inefficient 
> from a maintenance perspective.  I've been thinking of adding a 
> `--failfast` option for use on the back end.  Haven't done it yet.  I 
> started a load yesterday in fact that ran 2 hours before it buffered an 
> exception related to a newly introduced bug.  I fixed the bug and ran 
> the load again.  It finished sometime between COB yesterday and this 
> morning (successfully!).

Alright, I am trying to reconcile this with 'The largest studies take just under a minute' from below.


>> Seems you are looking for some sort of queuing system.
>>
>> What are the time constraints for getting the validation turned around.
> 
> I have considered a queuing system, though when I previously floated a 
> proof of concept using celery, I was informed it was too much.  Though, 
> at the time, all I was trying to do was a progress bar for a query stats 
> feature.  So proposing celery in this instance may get more traction 
> with the rest of the team.
> 
> Most of the small validation processes finish in under a dozen seconds. 
>   The largest studies take just under a minute.  I have plans to 
> optimize the loading scripts that hopefully could get the largest 
> studies down to a dozen seconds.  If I could do that, and do the back 
> end loads in off-peak hours, then I'd be willing to suffer the rare 
> timeouts from concurrent validations.  The raw data loads will still 
> likely take a much longer time.

This is where I get confused, probably because I am not exactly sure what constitutes validation. My sense is that it involves a load of data into live tables and seeing what fails PK, FK, or other constraints.

If that is the case I am not seeing how the 'for real' data load would 
be longer?

At any rate I can't see how loading multiple sets of data into a live database while operations are going on in the database can be made conflict free. To me it seems the best that can be done is:

1) Reduce chance for conflict by spreading the actions out.

2) Have retry logic that deals with conflicts.


> 
> 
> Robert William Leach
> Research Software Engineer
> 133 Carl C. Icahn Lab
> Lewis-Sigler Institute for Integrative Genomics
> Princeton University
> Princeton, NJ 08544
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: How to perform a long running dry run transaction without blocking

From
Robert Leach
Date:
> Have you considered a validation app? Have it read the inputs and look in db for conflicts, rather than attempt the insert. Zero transactions necessary.


I did consider that about a year or two ago when I first conceived the data validation interface.  Doing that now would solve the problem of concurrent validations blocking one another, but selects would still get blocked if their result is pending the outcome of a load running in the background.  If I mitigate that issue by running those loads overnight on a schedule, I still lose out on the benefits of having the loading code do the validation for me...

I would have to explicitly find and report the problems that the load exceptions currently surface for me without any extra code.

So what I'm saying is that the data being validated is inter-dependent.  There are about a dozen different loading scripts (one for each sheet of the uploaded excel file) whose runs are orchestrated by a master load script that ensures they are run in the right order so that the interdependent data can be checked.  For example, these are some relative orders of what needs to be loaded so that data can be validated:

Study > Animals
Tracers > Infusates > Animals
Treatments > Animals
Tissues > Samples

The Animal load script would fail if the new data in (the tables) Study, Tracers, Infusates, and Treatments aren't inserted, because it links to those newly created records.  And there's no way to detect problems in those new relationships in the unchanged database if they aren't inserted.  That's what doing this all in a transaction, and actually doing the inserts (for which I use Django `get_or_create` method calls), provides.

In other words, I would have to save and explicitly check the inter-related sheet data in data structures independent of the database in order to find the equivalent of (for example) `ObjectDoesNotExist` errors that originate from the database.  Right now, I get those errors caught "for free".  All I have to do is tell the user what sheet/row/column is related to that error.  And it saves me the overhead of having to maintain synchronicity between separate validation code and loading code when the loading code changes.
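
Just to illustrate (the model and column names here are guesses based on the worksheet names above, not our actual schema), this is roughly the kind of error that comes "for free" from doing the real inserts:

    from django.core.exceptions import ObjectDoesNotExist
    # Study, Treatment, Animal: illustrative Django models for the worksheets above

    def load_animal_row(row, buffered_errors):
        try:
            # the parent records must already be inserted earlier in the same transaction
            study = Study.objects.get(name=row["study"])
            treatment = Treatment.objects.get(name=row["treatment"])
        except ObjectDoesNotExist as err:
            buffered_errors.append(("Animals", row["row_number"], err))
            return
        Animal.objects.get_or_create(
            name=row["animal"],
            defaults={"study": study, "treatment": treatment},
        )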

Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544




Re: How to perform a long running dry run transaction without blocking

From
Robert Leach
Date:
The load to the development server does no validation?

If so what is the purpose?

The background processes are other validation runs?
It's the same code that executes in both cases (with or without the `--validate` flag).  All that flag does is (effectively) raise the dry run exception before the code leaves the transaction block, so it always validates (whether the flag is supplied or not).

More for my sake than anything else, why do the load to the development server at all if the production load is the only one that counts?

The software is still in a new major version beta.  We're adding features and fixing bugs.  It's not unusual to encounter a new bug, fix it on dev to get the load to work, then deploy a point release on prod.  And that means repeated load attempts that interfere with the validation interface.  Besides, beyond this, we're planning on a separate staging database that dev effectively now is.  Sometimes, a curator only finds a technical data issue after the initial load while browsing the newly loaded data on the dev site.

So the load doesn't fail until the end of the run, which is inefficient from a maintenance perspective.  I've been thinking of adding a `--failfast` option for use on the back end.  Haven't done it yet.  I started a load yesterday in fact that ran 2 hours before it buffered an exception related to a newly introduced bug.  I fixed the bug and ran the load again.  It finished sometime between COB yesterday and this morning (successfully!).

Alright, I am trying to reconcile this with 'The largest studies take just under a minute' from below.

The context of the 'The largest studies take just under a minute' statement is that it's not loading the hefty/time-consuming raw data.  It's only validating the metadata.  That's fast (5-60s).  And that data is a portion of the transaction in the back-end load.  There are errors that validation can miss that are due to not touching the raw data, and in fact, those errors are addressed by curators editing the excel sheets.  That's why it's all in the load transaction instead of loaded separately, but those problems are somewhat rare (and we currently have a new feature in the design phase that should almost completely eliminate those issues).

Seems you are looking for some sort of queuing system.

What are the time constraints for getting the validation turned around.
I have considered a queuing system, though when I previously floated a proof of concept using celery, I was informed it was too much.  Though, at the time, all I was trying to do was a progress bar for a query stats feature.  So proposing celery in this instance may get more traction with the rest of the team.
Most of the small validation processes finish in under a dozen seconds.   The largest studies take just under a minute.  I have plans to optimize the loading scripts that hopefully could get the largest studies down to a dozen seconds.  If I could do that, and do the back end loads in off-peak hours, then I'd be willing to suffer the rare timeouts from concurrent validations.  The raw data loads will still likely take a much longer time.

This is where I get confused, probably because I am not exactly sure what constitutes validation. My sense is that it involves a load of data into live tables and seeing what fails PK, FK, or other constraints.

If that is the case I am not seeing how the 'for real' data load would be longer?

The validation skips the time-consuming raw data load.  That raw data is collectively hundreds of gigs in size and could not be uploaded on the validation page anyway.  The feature I alluded to above that would make errors associated with the raw data almost completely eliminated is one where the researcher can drop the raw data folder into the form and it just walks the directory to get all the raw data file names and relative paths.  It's those data relationships whose validations are currently skipped.

At any rate I can't see how loading multiple sets of data into a live database while operations are going on in the database can be made conflict free. To me it seems the best that can be done is:

1) Reduce chance for conflict by spreading the actions out.

2) Have retry logic that deals with conflicts.

I'm unfamiliar with retry functionality, but those options sound logical to me as a good path forward, particularly using celery to spread out validations and doing the back end loads at night (or using some sort of fast dump/load).  The thing that bothers me about the celery solution is that most of the time, 2 users validating different data will not block, so I would be making users wait for no reason.  Ideally, I could anticipate the block and only at that point, separate those validations.

This brings up a question though about a possibility I suspect is not practical.  My initial read of the isolation levels documentation found this section really promising:

> The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions.

This was before I realized that the actions of the previously started transaction would include "locks" that would block validation even though the load transaction hasn't committed yet:

> a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the repeatable read transaction will wait for the first updating transaction to commit or roll back

Other documentation I read referred to the state of the DB (when a transaction starts) as a "snapshot" and I thought... what if I could save such a snapshot automatically just before a back-end load starts, and use that snapshot for validation, such that my validation processes could use that to validate against and not encounter any locks?  The validation will never commit, so there's no risk.
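
The closest built-in thing I've found so far is exported snapshots (`pg_export_snapshot()` plus `SET TRANSACTION SNAPSHOT`), something like the sketch below (plain psycopg2 rather than the ORM; the DSN is illustrative).  My understanding, though, is that a snapshot only controls what the validating transaction *sees*; its own inserts could still wait on rows or unique-index entries locked by the concurrent load, so this alone may not remove the block:

    import psycopg2

    dsn = "dbname=tracebase"  # illustrative

    # A helper transaction opened just before the back-end load starts pins the
    # pre-load view of the data; it must stay open for the snapshot to be usable.
    snap_conn = psycopg2.connect(dsn)
    snap_conn.set_session(isolation_level="REPEATABLE READ")
    snap_cur = snap_conn.cursor()
    snap_cur.execute("SELECT pg_export_snapshot()")
    snapshot_id = snap_cur.fetchone()[0]

    # The validation transaction adopts that snapshot before running any queries.
    val_conn = psycopg2.connect(dsn)
    val_conn.set_session(isolation_level="REPEATABLE READ")
    val_cur = val_conn.cursor()
    val_cur.execute("SET TRANSACTION SNAPSHOT %s", [snapshot_id])
    # ... validation reads here see the pre-load state ...
    val_conn.rollback()  # never commit; it's a dry run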

I know Django's ORM wouldn't support that, but I kind of hoped that someone in this email list might suggest a snapshot functionality as a possible solution.  Since the validations never commit, the only downside would be if the backend load changed something that introduces a problem with the validated data that would not be fixed until we actually attempt to load it.

Is that too science-fictiony of an idea?

Re: How to perform a long running dry run transaction without blocking

From
Adrian Klaver
Date:
On 2/6/25 12:08, Robert Leach wrote:


>> Alright I am trying to reconcile this with from below, 'The largest 
>> studies take just under a minute'.
> 
> The context of the 'The largest studies take just under a minute' 
> statement is that it's not loading the hefty/time-consuming raw data. 
>   It's only validating the metadata.  That's fast (5-60s).  And that 
> data is a portion of the transaction in the back-end load.  There are 
> errors that validation can miss that are due to not touching the raw 
> data, and in fact, those errors are addressed by curators editing the 
> excel sheets.  That's why it's all in the load transaction instead of 

As a scientist that makes me start to twitch.

Is there an audit trail for that?


> 
> I'm unfamiliar with retry functionality, but those options sound logical 
> to me as a good path forward, particularly using celery to spread out 
> validations and doing the back end loads at night (or using some sort of 
> fast dump/load).  The thing that bothers me about the celery solution is 
> that most of the time, 2 users validating different data will not block, 
> so I would be making users wait for no reason.  Ideally, I could 
> anticipate the block and only at that point, separate those validations.

Aah, time travel.

For fast dump/load on validated data see:

https://www.postgresql.org/docs/current/sql-copy.html

Though note that in Postgres 16 and earlier COPY is all or nothing, so if there is an error nothing will be loaded. With version 17 you get ON_ERROR and LOG_VERBOSITY. One way to deal with this is to load into a staging table, do your validation there, and then move the data to the final table.
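
Something along these lines, to give you an idea (the table and path names are made up, and it assumes Postgres 17+ and a file the server itself can read):

    from django.db import connection

    LOAD_STAGING_SQL = """
        COPY staging.samples
        FROM '/data/incoming/samples.csv'
        WITH (FORMAT csv, HEADER true, ON_ERROR ignore, LOG_VERBOSITY verbose)
    """

    def load_staging():
        # Malformed rows are skipped and logged instead of aborting the whole COPY.
        with connection.cursor() as cur:
            cur.execute(LOAD_STAGING_SQL)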

As to retry that depends on where you want to do it. For subtransactions 
(SAVEPOINT) see:

https://www.postgresql.org/docs/current/sql-savepoint.html

https://www.cybertec-postgresql.com/en/subtransactions-and-performance-in-postgresql/

In Python there is try/except.
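
In Django terms the two combine naturally, since a nested atomic() block is implemented with a SAVEPOINT; a per-row sketch (names illustrative):

    from django.db import transaction, IntegrityError

    def load_row_with_retry(load_row, row, buffered_errors):
        """Sketch: each row gets its own subtransaction (SAVEPOINT), so one bad
        row is rolled back and buffered without poisoning the outer transaction."""
        try:
            with transaction.atomic():  # nested atomic == SAVEPOINT under the hood
                load_row(row)
        except IntegrityError as err:
            buffered_errors.append((row, err))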


> 
> This brings up a question though about a possibility I suspect is not 
> practical.  My initial read of the isolation levels documentation found 
> this section really promising:
> 
>  > The Repeatable Read isolation level only sees data committed before 
> the transaction began; it never sees either uncommitted data or changes 
> committed during transaction execution by concurrent transactions.
> 
> This was before I realized that the actions of the previously started 
> transaction would include "locks" that would block validation even 
> though the load transaction hasn't committed yet:
> 
>  > a target row might have already been updated (or deleted or *locked*) 
> by another concurrent transaction by the time it is found. In this case, 
> the repeatable read transaction will wait for the first updating 
> transaction to commit or roll back
> 
> Other documentation I read referred to the state of the DB (when a 
> transaction starts) as a "snapshot" and I thought... what if I could 
> save such a snapshot automatically just *before* a back-end load starts, 
> and use that snapshot for validation, such that my validation processes 
> could use that to validate against and not encounter any locks?  The 
> validation will never commit, so there's no risk.

Hmm. I don't think so.

> 
> I know Django's ORM wouldn't support that, but I kind of hoped that 
> someone in this email list might suggest a snapshot functionality as a 
> possible solution.  Since the validations never commit, the only 
> downside would be if the backend load changed something that introduces 
> a problem with the validated data that would not be fixed until we 
> actually attempt to load it.
> 
> Is that too science-fictiony of an idea?

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: How to perform a long running dry run transaction without blocking

From
Adrian Klaver
Date:
On 2/6/25 11:25, Robert Leach wrote:
>> Have you considered a validation app? Have it read the inputs and look in db for conflicts, rather than attempt the insert. Zero transactions necessary.
 
> 
> 
> I did consider that about a year or two ago when I first conceived the data validation interface.  Doing that now would solve the problem of concurrent validations blocking one another, but selects would still get blocked if their result is pending the outcome of a load running in the background.  If I mitigate that issue by running those loads overnight on a schedule, I still lose out on the benefits of having the loading code do the validation for me...
 
> 
> I would have to explicitly find and report the problems that the load exceptions currently surface for me without any extra code.
 
> 
> So what I'm saying is that the data being validated is inter-dependent.  There are about a dozen different loading scripts (one for each sheet of the uploaded excel file) whose runs are orchestrated by a master load script that ensures they are run in the right order so that the interdependent data can be checked.  For example, these are some relative orders of what needs to be loaded so that data can be validated:
 
> 
> Study > Animals
> Tracers > Infusates > Animals
> Treatments > Animals
> Tissues > Samples
> 
> The Animal load script would fail if the new data in (the tables) Study, Tracers, Infusates, and Treatments aren't inserted, because it links to those newly created records.  And there's no way to detect problems in those new relationships in the unchanged database if they aren't inserted.  That's what doing this all in a transaction, and actually doing the inserts (for which I use Django `get_or_create` method calls), provides.
 
> 
> In other words, I would have to save and explicitly check the inter-related sheet data in data structures independent of the database in order to find the equivalent of (for example) `ObjectDoesNotExist` errors that originate from the database.  Right now, I get those errors caught "for free".  All I have to do is tell the user what sheet/row/column is related to that error.  And it saves me the overhead of having to maintain synchronicity between separate validation code and loading code when the loading code changes.
 

Seems to me this could be dealt with using a schema named validate that contains 'shadow' tables of those in the live schema(s). Import into them and see what fails.

> 
> Robert William Leach
> Research Software Engineer
> 133 Carl C. Icahn Lab
> Lewis-Sigler Institute for Integrative Genomics
> Princeton University
> Princeton, NJ 08544
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: How to perform a long running dry run transaction without blocking

From
Adrian Klaver
Date:
On 2/7/25 05:49, Robert Leach wrote:

Ccing list
>>>> Alright I am trying to reconcile this with from below, 'The largest studies take just under a minute'.
>>> The context of the 'The largest studies take just under a minute' statement is that it's not loading the hefty/time-consuming raw data.  It's only validating the metadata.  That's fast (5-60s).  And that data is a portion of the transaction in the back-end load.  There are errors that validation can miss that are due to not touching the raw data, and in fact, those errors are addressed by curators editing the excel sheets.  That's why it's all in the load transaction instead of
 
>>
>> As a scientist that makes me start to twitch.
>>
>> Is there an audit trail for that?
> 
> We have a well-defined curation process.  Original files are checked into a private data repo.  There is a CHANGES.md file that details any/every change a curator makes, and the curator coordinates these changes with the researcher.
 
> 

>> Aah, time travel.
> 
> Sorry, "anticipate" wasn't the best word choice.  What I was thinking was more along the lines of "detect" when a block happens and, at that point, either stop and queue up the validation or provide a status/progress that shows it's waiting on validations in line before it.  Probably not possible.  It's just what I'm imagining would be the most efficient strategy with the least wait time.
 
> 

>>> Other documentation I read referred to the state of the DB (when a transaction starts) as a "snapshot" and I thought... what if I could save such a snapshot automatically just *before* a back-end load starts, and use that snapshot for validation, such that my validation processes could use that to validate against and not encounter any locks?  The validation will never commit, so there's no risk.
 
>>
>> Hmm. I don't think so.
> 
>  From a separate thread, which seems analogous...:
> 
>> Seems to me this could be dealt with using a schema named validate that contains 'shadow' tables of those in the live schema(s). Import into them and see what fails.
 
> 
> What is a "shadow table"?  Is that a technical thing?  Could these shadow tables be in the same database?  (Trying to wrap my head around what this implies is possible.)
 

No, it's a concept of my own making. Basically, create copies of the tables you are loading into now in a separate schema and load into them instead. Then do the validation against those tables; that would reduce the contention with the existing data. It is something I have done to validate data from another data source before loading into production tables. A possible issue to deal with is how much, if at all, you depend on other tables for the validation process.
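
A minimal sketch of the idea (illustrative table names; note that LIKE ... INCLUDING ALL copies columns, defaults, indexes and check constraints, but not foreign keys):

    from django.db import connection

    def build_validate_schema(tables=("study", "animal", "sample", "tissue")):
        """Clone empty copies of the live tables into a separate 'validate' schema."""
        with connection.cursor() as cur:
            cur.execute("CREATE SCHEMA IF NOT EXISTS validate")
            for t in tables:
                cur.execute(
                    f"CREATE TABLE IF NOT EXISTS validate.{t} "
                    f"(LIKE public.{t} INCLUDING ALL)"
                )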


-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: How to perform a long running dry run transaction without blocking

From
Robert Leach
Date:
I'd run across a Stack Exchange answer <https://stackoverflow.com/a/75037925/2057516> that suggested setting a statement timeout (specific to Postgres) in a Django view.  If I did that for the validation view, it wouldn't necessarily 100% correspond to a blocking issue, but since the overall goal is to prevent a gateway timeout, I'm fine with that.

When you do a SET statement_timeout, it is session/connection specific.

For what that means in Django see:

https://docs.djangoproject.com/en/5.1/ref/databases/

Personally I think this will introduce more complexity than it's worth.

Further down the page at the above link it shows how you can use a psycopg(2) pool, but that is another layer added to an already complex setup.

TBF, adding a celery strategy to the codebase to queue jobs is already comparatively very complex to start out.  It's a pretty heavy handed solution to what seems to be a relatively rare occurrence (though rather recurrent when it does happen).  By that metric, I would think that queuing jobs at all is too complex to be worth it.  So I'm not sure that catching a timeout exception to queue a job instead of queuing the job at the outset is that much more complex.  I can just create a view that inherits from the current view (to which the statement timeout is applied) which is *not* subject to the statement timeout when I want to queue a longer running validation due to the block.  I think I should be able to redirect the request to that derived class view.  Not sure exactly how ATM, but it shouldn't be more than 1 line of code/statement, I would think.

Regarding statement timeouts in general, I found a pretty great article that suggests setting site-wide statement timeouts as a general rule, which I think makes a lot of sense.  In fact, some of the pages on our site (that I didn't write) are not going to scale well as the DB grows.  I already feel like the samples page is way too slow.  They should probably all gracefully handle timeouts so that users don't encounter raw 500 errors.  So whether or not I decide to "queue on statement timeout", I think using statement timeouts is worthwhile.

I take your point though.  How many more lines of code do you think would make the strategy worthwhile?  I'd be willing to bet I could keep it under 20 compared to straight up queuing every validation.  In fact, I'm leaning toward implementing your suggestion of queuing everything as a "phase 1" of this endeavor, and then implement my proposed "queue on timeout" idea as phase 2 (perhaps as a configurable option - at least until I'm satisfied it works well).

I don't know if that Stack Exchange suggestion does what it claims to do or not.  I haven't tried it.  But the key claim of interest here is that the timeout is specific to the view, i.e. any other views or background processes wouldn't be subject to the timeout.  I'm suspicious as to whether this is actually true.  My novice understanding of Django's cursors is that it maintains a pool of connections, and I'm not sure a timeout set that way would stay confined to the one view.

Actually cursors are attributes of connections.

Yeah, thus my suspicion that it may not apply to only the view in the example.  I suspected their "proof" that it didn't affect other views was due to randomly getting a connection to which that statement timeout had not yet been applied.  *However*, I just read this article which describes Django's connection management and I realized that Django's default behavior (which is what we use) is to create a new connection for every request, so I think that, with the caveat that it wouldn't work with persistent connections, the stack exchange claim is actually correct.

I would just have to figure out how best to apply the timeout to everything except those queued validations.  And I think maybe a context manager might be the simplest way to do it.
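
Something like this sketch is what I have in mind (assuming Django's default non-persistent connections, so the setting dies with the request; the helper names are mine, not existing code):

    from contextlib import contextmanager
    from django.db import connection
    from django.db.utils import OperationalError

    @contextmanager
    def statement_timeout(ms):
        """Apply a Postgres statement_timeout to the current connection; with the
        default CONN_MAX_AGE=0 the connection is per-request, so it isn't shared.
        (A lock_timeout could be set the same way to trip only on blocked locks.)"""
        with connection.cursor() as cur:
            cur.execute(f"SET statement_timeout = {int(ms)}")
        try:
            yield
        finally:
            with connection.cursor() as cur:
                cur.execute("SET statement_timeout = DEFAULT")

    # Hypothetical use in the validation view: fall back to the queued path when
    # Postgres cancels the statement (OperationalError, SQLSTATE 57014).
    # try:
    #     with statement_timeout(20000):  # 20s, under the gateway timeout
    #         run_validation(submission)
    # except OperationalError:
    #     enqueue_validation(submission)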

Anyway, thanks so much for your help.  This discussion has been very useful, and I think I will proceed at first, exactly how you suggested, by queuing every validation job (using celery).  Then I will explore whether or not I can apply the "on timeout" strategy in a small patch.

Incidentally, during our Wednesday meeting this week, we actually opened our public instance to the world for the first time, in preparation for the upcoming publication.  This discussion is about the data submission interface, but that interface is actually disabled on the public-facing instance.  The other part of the codebase that I was primarily responsible for was the advanced search.  Everything else was primarily by other team members.  If you would like to check it out, let me know what you think: http://tracebase.princeton.edu

Cheers,
Rob


Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544

Re: How to perform a long running dry run transaction without blocking

From
Robert Leach
Date:
>> Anyway, thanks so much for your help.  This discussion has been very useful, and I think I will proceed at first, exactly how you suggested, by queuing every validation job (using celery).  Then I will explore whether or not I can apply the "on timeout" strategy in a small patch.
>> Incidentally, during our Wednesday meeting this week, we actually opened our public instance to the world for the first time, in preparation for the upcoming publication.  This discussion is about the data submission interface, but that interface is actually disabled on the public-facing instance.  The other part of the codebase that I was primarily responsible for was the advanced search.  Everything else was primarily by other team members.  If you would like to check it out, let me know what you think: http://tracebase.princeton.edu
>
> I would have to hit the books again to understand all of what is going on here.

It's a mass spec tracing database.  Animals are infused with radio-labeled compounds and mass spec is used to see what the animal's biochemistry turns those compounds into.  (My undergrad was biochem, so I've been resurrecting my biochem knowledge as needed for this project.  I've been mostly doing RNA and DNA sequence analysis since undergrad, and most of that was prokaryotic.)

> One quibble with the Download tab, there is no indication of the size of the datasets. I generally like to know what I am getting into before I start a download. Also, is there explicit throttling going on? I am seeing 10.2kb/sec, whereas from here https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page I downloaded a 47.65M file at 41.9MB/s

Thank you!  Not knowing the download size is exactly a complaint I had.  That download actually uses my advanced search interface (in browse mode).  There is the same issue with the download buttons on the advanced search.  With the streaming, we're not dealing with temp files, which is nice, at least for the advanced search, but we can't know the download size that way.  So I had wanted a progress bar to at least show progress (current record per total).  I could even estimate the size (an option I explored for a few days).  Eventually, I proposed a celery solution for that and I was overruled.

As for the download in the nav bar, we have an issue to change that to a listing of actual files broken down by study (3 files per study).  There's not much actual utility from a user perspective for downloading everything anyway.  We've just been focussed on other things.  In fact, we have a request from a user for that specific feature, done in a way that's compatible with curl/scp.  We just have to figure out how to not have to CAS authenticate each command, something I don't have experience with.