Re: [PERFORM] DELETE vs TRUNCATE explanation - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [PERFORM] DELETE vs TRUNCATE explanation
Msg-id 19949.1342376967@sss.pgh.pa.us
In response to Re: [PERFORM] DELETE vs TRUNCATE explanation  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: [PERFORM] DELETE vs TRUNCATE explanation  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [PERFORM] DELETE vs TRUNCATE explanation  (Craig Ringer <ringerc@ringerc.id.au>)
Re: [PERFORM] DELETE vs TRUNCATE explanation  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Jul 12, 2012 at 9:55 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> The topic was poor performance when truncating lots of small tables
>> repeatedly on test environments with fsync=off.
>> 
>> On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> I think the problem is in the Fsync Absorption queue.  Every truncate
>>> adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
>>> those leads to sequential scanning the checkpointer's pending ops hash
>>> table, which is quite large.  It is almost entirely full of other
>>> requests which have already been canceled, but it still has to dig
>>> through them all.   So this is essentially an N^2 operation.

> The attached patch addresses this problem by deleting the entry when
> it is safe to do so, and flagging it as canceled otherwise.

I don't like this patch at all.  It seems ugly and not terribly safe,
and it won't help at all when the checkpointer is in the midst of an
mdsync scan, which is a nontrivial part of its cycle.

I think what we ought to do is bite the bullet and refactor the
representation of the pendingOps table.  What I'm thinking about
is reducing the hash key to just RelFileNodeBackend + ForkNumber,
so that there's one hashtable entry per fork, and then storing a
bitmap to indicate which segment numbers need to be sync'd.  At
one gigabyte to the bit, I think we could expect the bitmap would
not get terribly large.  We'd still have a "cancel" flag in each
hash entry, but it'd apply to the whole relation fork not each
segment.

If we did this then the FORGET_RELATION_FSYNC code path could use
a hashtable lookup instead of having to traverse the table
linearly; and that would get rid of the O(N^2) performance issue.
The performance of FORGET_DATABASE_FSYNC might still suck, but
DROP DATABASE is a pretty heavyweight operation anyhow.
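A minimal sketch of what this design could look like (the key and table types here are toy stand-ins of my own, not the actual RelFileNodeBackend or dynahash structures): one entry per relation fork, a bitmap at one bit per 1GB segment, and FORGET_RELATION_FSYNC reduced to a single lookup instead of a full-table scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for the real hash key (RelFileNodeBackend + ForkNumber). */
typedef struct { uint32_t relNode; int forknum; } ForkKey;

/* One entry per relation fork.  At one gigabyte to the bit, a 64-bit
 * bitmap already covers a 64GB relation fork. */
typedef struct {
    ForkKey  key;
    uint64_t segno_bitmap;   /* which segments still need fsync */
    bool     canceled;       /* cancel applies to the whole fork */
    bool     in_use;
} PendingEntry;

/* Toy fixed-size open-addressed table; the real thing would be dynahash. */
#define NBUCKETS 64
static PendingEntry pending[NBUCKETS];

static unsigned fork_hash(ForkKey k)
{
    return (k.relNode * 2654435761u + (unsigned) k.forknum) % NBUCKETS;
}

/* Probe for the entry; optionally create it if absent. */
static PendingEntry *pending_lookup(ForkKey k, bool create)
{
    for (unsigned i = 0; i < NBUCKETS; i++)
    {
        PendingEntry *e = &pending[(fork_hash(k) + i) % NBUCKETS];
        if (!e->in_use)
        {
            if (!create)
                return NULL;
            e->in_use = true;
            e->key = k;
            return e;
        }
        if (memcmp(&e->key, &k, sizeof k) == 0)
            return e;
    }
    return NULL;             /* table full (toy limitation) */
}

/* Remember that segment segno of this fork needs an fsync. */
static void remember_fsync(ForkKey k, int segno)
{
    PendingEntry *e = pending_lookup(k, true);
    if (e)
        e->segno_bitmap |= UINT64_C(1) << segno;
}

/* FORGET_RELATION_FSYNC: one hashtable lookup, O(1), instead of
 * linearly scanning every pending request. */
static void forget_relation_fsync(ForkKey k)
{
    PendingEntry *e = pending_lookup(k, false);
    if (e)
    {
        e->canceled = true;
        e->segno_bitmap = 0;
    }
}
```

With N truncates against a table of N pending requests, this turns the N scans of the old code path into N constant-time lookups.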

I'm willing to have a go at coding this design if it sounds sane.
Comments?

> Also, I still wonder if it is worth memorizing fsyncs (under
> fsync=off) that may or may not ever take place.  Is there any
> guarantee that we can make by doing so, that couldn't be made
> otherwise?

Yeah, you have a point there.  It's not real clear that switching fsync
from off to on is an operation that we can make any guarantees about,
short of executing something like the code recently added to initdb
to force-sync the entire PGDATA tree.  Perhaps we should change fsync
to be PGC_POSTMASTER (ie frozen at postmaster start), and then we could
skip forwarding fsync requests when it's off?
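If fsync did become PGC_POSTMASTER, the skip could be a one-line gate in the forwarding path. A toy sketch (the function and counter are hypothetical stand-ins; only the enableFsync flag corresponds to a real backend variable):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the backend's enableFsync flag; with fsync frozen at
 * postmaster start, this value could never flip mid-run. */
static bool enableFsync = false;
static int  requests_queued = 0;

/* Hypothetical forwarding path: when fsync is off for the life of the
 * postmaster, the request is never queued, so it never has to be
 * absorbed, scanned, or canceled later. */
static void forward_fsync_request(void)
{
    if (!enableFsync)
        return;
    requests_queued++;
}
```

The benefit is that the pathological case above (thousands of FORGET messages chewing through a queue of requests that will never be acted on) simply cannot arise under fsync=off.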
        regards, tom lane

