Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Robert Haas
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CA+TgmoaaL_p3SFy13LGMKW35SQxsa-PGVFZ9zfksBkDMqQQVdg@mail.gmail.com
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Mon, Jul 8, 2019 at 6:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I didn't have other use cases in mind and I think to some extent this
> argument holds true for existing binaryheap_allocate allocate.  If we
> want to make it more generic, then shouldn't we try to even change the
> existing binaryheap_allocate to use this new model, that way the
> binary heap allocation API will be more generic?

No.  binaryheap_allocate is fine for simple cases and there's no
reason that I can see to change it.

> You are right that it won't be nth smallest element from the queue and
> we don't even care for that here.  The peeking logic is not to find
> the next prioritized element but to check if we can find some element
> for the same database in the next few entries to avoid frequent undo
> worker restart.

You *should* care for that here. The whole purpose of a binary heap is
to help us choose which task we should do first and which ones should
be done later.  I don't see why it's OK to decide that we only care
about doing the tasks in priority order sometimes, and other times
it's OK to just pick semi-randomly.

> This is a good test scenario, but I think it has not been considered
> that there are multiple queues and we peek into each one.

I think that makes very little difference, so I don't see why it
should be considered. It's true that this will sometimes mask the
problem, but so what? An algorithm that works 90% of the time is not
much better than one that works 80% of the time, and neither is the
caliber of work we are expecting to see in PostgreSQL.

> I think we should once try with the actual program which can test such
> a scenario on the undo patches before reaching any conclusion.  I or
> one of my colleague will work on this and report back the results.

There is certainly a place for empirical testing of a patch like this
(perhaps even before posting it). It does not substitute for a good
theoretical explanation of why the algorithm is correct, and I don't
think it is.

> >  You can construct way worse examples than
> > this one, too: imagine that there are two databases, each with a
> > worker, and one has 99% of the requests and the other one has 1% of
> > the requests.  It's really unlikely that there's going to be an entry
> > for the second database within the lookahead window.
>
> I am not sure if that is the case because as soon as the request from
> other database get prioritized (say because its XID becomes older) and
> came as the first request in one of the queues, the undo worker will
> exit (provided it has worked for some threshold time (10s) in that
> database) and allow the request from another database to be processed.

I don't see how this responds to what I wrote.  Neither worker needs
to exit in this scenario, but the worker from the less-popular
database is likely to exit anyway, which seems like it's probably not
the right thing.

> >  And note that
> > increasing the window doesn't really help either: you just need more
> > databases than the size of the lookahead window, or even almost as
> > many as the lookahead window, and things are going to stop working
> > properly.
> >
> > On the other hand, suppose that you have 10 databases and one undo
> > worker.  One database is pretty active and generates a continuous
> > stream of undo requests at exactly the same speed we can process them.
> > The others all have 1 pending undo request.  Now, what's going to
> > happen is that you'll always find the undo request for the current
> > database within the lookahead window.  So, you'll never exit.
>
> Following the logic given above, I think here also worker will exit as
> soon as the request from other database get prioritised.

OK.

> >  But
> > that means the undo requests in the other 9 databases will just sit
> > there for all eternity, because there's no other worker to process
> > them.  On the other hand, if you had 11 databases, there's a good
> > chance it would work fine, because the new request for the active
> > database would likely be outside the lookahead window, and so you'd
> > find no work to do and exit, allowing a worker to be started up in
> > some other database.
>
> As explained above, I think it will work the same way both for 10 or
> 11 databases.  Note, that we don't always try to look ahead. We look
> ahead when we have not worked on the current database for some
> threshold amount of time.

That's interesting, and it means that some of the scenarios that I
mentioned are not problems. However, I don't believe it means that
your code is actually correct. It's just means that it's wrong in
different ways.  The point is that, with the way you've implemented
this, whenever you do lookahead, you will, basically randomly,
sometimes find the next entry for the current database within the
lookahead window, and sometimes you won't.  And sometimes it will be
the next-highest-priority request, and sometimes it won't. That just
cannot possibly be the right thing to do.

Would you propose to commit a patch that implemented the following pseudocode?

find-next-thing-to-do:
   see if the highest-priority task in any database is for our database.
   if it is, do it and stop here.
   if it is not, and if we haven't worked on the current database for
at least 10 seconds, look for an item in the current database.
   ...but don't look very hard, so that we'll sometimes,
semi-randomly, find nothing even when there is something we could do.
   ...and also, sometimes find a lower-priority item that we can do,
possibly much lower-priority, instead of the highest priority thing we
can do.

Because that's what your patch is doing.

In contrast, the algorithm that I proposed would work like this:

find-next-thing-to-do:
   find the highest-priority item for the current database.
   do it.

I venture to propose that the second one is the superior algorithm
here.  One problem with the second algorithm, which I pointed out in
my previous email, is that sometimes we might want the worker to exit
even though there is work to do in the current database. My algorithm
makes no provision for that, and yours does.  However, yours does that
in a way that's totally unprincipled: it just sometimes fails to find
any work that it could do even though there is work that it could do.
No amount of testing or argumentation is going to convince me that
this is a good approach. The decision about when a worker should exit
to allow a new one to be launched needs to be based on clear,
understandable rules, not be something that happens semi-randomly when
a haphazard search for the next entry fails, as if by chance, to find
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Duplicated LSN in ReorderBuffer
Next
From: Tomas Vondra
Date:
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)