Thread: Sorting writes during checkpoint

Sorting writes during checkpoint

From
ITAGAKI Takahiro
Date:
Here is a patch for the TODO item "Consider sorting writes during checkpoint".
It writes dirty buffers in block-number order during checkpoint
so that buffers are written sequentially.

I proposed this patch before, but it was rejected because the 8.3 feature
set had already been frozen at that time.
http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

I rewrote it to apply cleanly against current HEAD, but the concept
is not changed at all: memorize a (buf_id, BufferTag) pair for each
dirty buffer in a palloc'd array at the start of checkpoint, sort
the array in BufferTag order, and write the buffers in that order.
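
In pseudocode terms, the flow looks roughly like this (the helper names
below are only illustrative; the real code is in the attached patch):

/* Sketch of the sorted checkpoint write path; helpers are hypothetical. */
BufAndTag  *todo = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));
int         n = 0;
int         i;

/* 1. Remember (buf_id, tag) for every buffer that is dirty right now. */
for (i = 0; i < NBuffers; i++)
    if (buffer_is_dirty(i))                     /* hypothetical helper */
        todo[n++] = make_buf_and_tag(i);        /* hypothetical helper */

/* 2. Sort by (relfilenode, block number) so each file is written sequentially. */
qsort(todo, n, sizeof(BufAndTag), buf_and_tag_cmp);

/* 3. Write in that order, skipping buffers that were written in the meantime. */
for (i = 0; i < n; i++)
    sync_buffer_if_still_dirty(todo[i].buf_id); /* hypothetical helper */

pfree(todo);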

There is a 10% performance win in pgbench on my machine with RAID-0
disks. There could be more benefit on RAID-5 disks, because random writes
are slower than sequential writes there.


[HEAD]
  tps = 1134.233955 (excluding connections establishing)
[HEAD with patch]
  tps = 1267.446249 (excluding connections establishing)

[pgbench]
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: simple
number of clients: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000

[hardware]
2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)

[postgresql.conf]
shared_buffers = 2GB
wal_buffers = 4MB
checkpoint_segments = 64
checkpoint_timeout = 5min
checkpoint_completion_target = 0.5

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


Attachment

Re: Sorting writes during checkpoint

From
Greg Smith
Date:
On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote:

> 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)

What is the disk controller in this system?  I'm specifically curious
about what write cache was involved, so I can get a better feel for the
hardware your results came from.

I'm busy rebuilding my performance testing systems right now; once that's
done I can review this on a few platforms.  One thing that jumped out at
me just reading the code is this happening inside BufferSync:

buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));

If shared_buffers(=NBuffers) is set to something big, this could give some
memory churn.  And I think it's a bad idea to allocate something this
large at checkpoint time, because what happens if that fails?  Really not
the time you want to discover there's no RAM left.

Since you're always going to need this much memory for the system to
operate, and the current model has the system running a checkpoint >50% of
the time, the only thing that makes sense to me is to allocate it at
server start time once and be done with it.  That should improve
performance over the original patch as well.

BufAndTag is a relatively small structure (5 ints).  Let's call it 40
bytes; even that's only a 0.5% overhead relative to the shared buffer
allocation.  If we can speed checkpoints significantly with that much
overhead it sounds like a good tradeoff to me.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Sorting writes during checkpoint

From
ITAGAKI Takahiro
Date:
Greg Smith <gsmith@gregsmith.com> wrote:

> On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote:
>
> > 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0)
>
> What is the disk controller in this system?  I'm specifically curious
> about what write cache was involved, so I can get a better feel for the
> hardware your results came from.

I used an HP ProLiant DL380 G5 with a Smart Array P400 with 256MB cache
(http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/15351-15351-3328412-241644-241475-1121516.html)
and ext3 on LVM under CentOS 5.1 (Linux 2.6.18-53.el5).
The dirty region of the database was probably larger than the disk controller's cache.


> buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag));
>
> If shared_buffers(=NBuffers) is set to something big, this could give some
> memory churn.  And I think it's a bad idea to allocate something this
> large at checkpoint time, because what happens if that fails?  Really not
> the time you want to discover there's no RAM left.

Hmm, but I think we need to copy the buffer tags into the bgwriter's local memory
in order to avoid locking the tags many times during the sort. Is it better to
allocate the sorting buffer the first time it is needed and then keep and reuse it from then on?


> BufAndTag is a relatively small structure (5 ints).  Let's call it 40
> bytes; even that's only a 0.5% overhead relative to the shared buffer
> allocation.  If we can speed checkpoints significantly with that much
> overhead it sounds like a good tradeoff to me.

I think sizeof(BufAndTag) is 20 bytes, because sizeof(int) is 4 on typical
platforms (and if not, I should rewrite the patch so that it always is).
It is 0.25% of shared buffers; when shared_buffers is set to 10GB,
it takes 25MB of process local memory. If we want to consume less memory
for it, RelFileNode in BufferTag could be hashed and packed into an integer;
the blockNum order is important for this purpose, but the RelFileNode order is not.
That would reduce the overhead to 12 bytes per page (0.15%). Is it worth doing?
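
For anyone who wants to check that arithmetic, here is a small standalone
sketch. The struct layouts are copied in only so it compiles on its own
(the real definitions live in the storage headers), and an 8kB block size
is assumed:

/* Back-of-envelope check of the per-page overhead of BufAndTag. */
#include <stdio.h>

typedef unsigned int Oid;
typedef unsigned int BlockNumber;
typedef struct RelFileNode { Oid spcNode; Oid dbNode; Oid relNode; } RelFileNode;
typedef struct BufferTag   { RelFileNode rnode; BlockNumber blockNum; } BufferTag;
typedef struct BufAndTag   { int buf_id; BufferTag tag; } BufAndTag;

int
main(void)
{
    const double block_size = 8192.0;                         /* default BLCKSZ */
    const double shared_buffers = 10.0 * 1024 * 1024 * 1024;  /* 10GB */
    double      nbuffers = shared_buffers / block_size;

    printf("sizeof(BufAndTag) = %zu bytes\n", sizeof(BufAndTag));    /* 20 */
    printf("overhead per 8kB page = %.2f%%\n",
           (double) sizeof(BufAndTag) / block_size * 100.0);         /* roughly a quarter percent */
    printf("local memory with shared_buffers = 10GB: %.0f MB\n",
           nbuffers * (double) sizeof(BufAndTag) / (1024.0 * 1024.0));  /* 25 MB */
    return 0;
}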

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



Re: Sorting writes during checkpoint

From
Greg Smith
Date:
On Wed, 16 Apr 2008, ITAGAKI Takahiro wrote:

> The dirty region of the database was probably larger than the disk controller's cache.

Might be worthwhile to run with log_checkpoints on and collect some
statistics there next time you're running these tests.  It's a good habit
to get other testers into regardless; it's nice to be able to say
something like "during the 15 checkpoints encountered during this test,
the largest dirty area was 516MB while the median was 175MB".

> Hmm, but I think we need to copy the buffer tags into the bgwriter's local memory
> in order to avoid locking the tags many times during the sort. Is it better to
> allocate the sorting buffer the first time it is needed and then keep and reuse it from then on?

That's what I was thinking:  allocate the memory when the background writer
starts and just always have it there; the allocation you're doing is
always the same size.  If it's in use 50% of the time anyway (which it is
if you have checkpoint_completion_target at its default), why introduce
the risk that an allocation will fail at checkpoint time?  Just allocate
it once and keep it around.

> It is 0.25% of shared buffers; when shared_buffers is set to 10GB,
> it takes 25MB of process local memory.

Your numbers are probably closer to correct.  I was being pessimistic
about the size of all the integers just to demonstrate that it's not
really a significant amount of memory even if they're large.

> If we want to consume less memory for it, RelFileNode in BufferTag could
> be hashed and packed into an integer

I personally don't feel it's worth making the code any more complicated
than it needs to be just to save a fraction of a percent of the total
memory used by the buffer pool.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Sorting writes during checkpoint

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Greg Smith <gsmith@gregsmith.com> wrote:
>> If shared_buffers(=NBuffers) is set to something big, this could give some
>> memory churn.  And I think it's a bad idea to allocate something this
>> large at checkpoint time, because what happens if that fails?  Really not
>> the time you want to discover there's no RAM left.

> Hmm, but I think we need to copy the buffer tags into the bgwriter's local memory
> in order to avoid locking the tags many times during the sort.

I updated this patch to permanently allocate the working array as Greg
suggests, and to fix a bunch of commenting issues (attached).

However, I am completely unable to measure any performance improvement
from it.  Given the possible risk of out-of-memory failures, I think the
patch should not be applied without some direct proof of performance
benefits, and I don't see any.

            regards, tom lane

Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.228
diff -c -r1.228 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    1 Jan 2008 19:45:51 -0000    1.228
--- src/backend/storage/buffer/bufmgr.c    4 May 2008 01:11:08 -0000
***************
*** 56,61 ****
--- 56,68 ----
  #define BUF_WRITTEN                0x01
  #define BUF_REUSABLE            0x02

+ /* Struct for BufferSync's internal to-do list */
+ typedef struct BufAndTag
+ {
+     int            buf_id;
+     BufferTag    tag;
+ } BufAndTag;
+

  /* GUC variables */
  bool        zero_damaged_pages = false;
***************
*** 986,991 ****
--- 993,1025 ----
  }

  /*
+  * qsort comparator for BufferSync
+  */
+ static int
+ bufandtagcmp(const void *a, const void *b)
+ {
+     const BufAndTag *lhs = (const BufAndTag *) a;
+     const BufAndTag *rhs = (const BufAndTag *) b;
+     int            r;
+
+     /*
+      * We don't much care about the order in which different relations get
+      * written, so memcmp is enough for comparing the relfilenodes,
+      * even though its behavior will be platform-dependent.
+      */
+     r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+     if (r != 0)
+         return r;
+
+     /* We do want blocks within a relation to be ordered accurately */
+     if (lhs->tag.blockNum < rhs->tag.blockNum)
+         return -1;
+     if (lhs->tag.blockNum > rhs->tag.blockNum)
+         return 1;
+     return 0;
+ }
+
+ /*
   * BufferSync -- Write out all dirty buffers in the pool.
   *
   * This is called at checkpoint time to write out all dirty shared buffers.
***************
*** 995,1004 ****
  static void
  BufferSync(int flags)
  {
      int            buf_id;
-     int            num_to_scan;
      int            num_to_write;
      int            num_written;

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1029,1056 ----
  static void
  BufferSync(int flags)
  {
+     static BufAndTag *bufs_to_write = NULL;
      int            buf_id;
      int            num_to_write;
      int            num_written;
+     int            i;
+
+     /*
+      * We allocate the bufs_to_write[] array on first call and keep it
+      * around for the life of the process.  This is okay because in normal
+      * operation this function is only called within the bgwriter, so
+      * we won't have lots of large arrays floating around.  We prefer this
+      * way because we don't want checkpoints to suddenly start failing
+      * when the system gets under memory pressure.
+      */
+     if (bufs_to_write == NULL)
+     {
+         bufs_to_write = (BufAndTag *) malloc(NBuffers * sizeof(BufAndTag));
+         if (bufs_to_write == NULL)
+             ereport(FATAL,
+                     (errcode(ERRCODE_OUT_OF_MEMORY),
+                      errmsg("out of memory")));
+     }

      /* Make sure we can handle the pin inside SyncOneBuffer */
      ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
***************
*** 1033,1038 ****
--- 1085,1092 ----
          if (bufHdr->flags & BM_DIRTY)
          {
              bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+             bufs_to_write[num_to_write].buf_id = buf_id;
+             bufs_to_write[num_to_write].tag = bufHdr->tag;
              num_to_write++;
          }

***************
*** 1043,1061 ****
          return;                    /* nothing to do */

      /*
!      * Loop over all buffers again, and write the ones (still) marked with
!      * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
!      * since we might as well dump soon-to-be-recycled buffers first.
!      *
!      * Note that we don't read the buffer alloc count here --- that should be
!      * left untouched till the next BgBufferSync() call.
       */
-     buf_id = StrategySyncStart(NULL, NULL);
-     num_to_scan = NBuffers;
      num_written = 0;
!     while (num_to_scan-- > 0)
      {
!         volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];

          /*
           * We don't need to acquire the lock here, because we're only looking
--- 1097,1120 ----
          return;                    /* nothing to do */

      /*
!      * Sort the buffers-to-be-written into order by file and block number.
!      * This improves sequentiality of access for the upcoming I/O.
!      */
!     qsort(bufs_to_write, num_to_write, sizeof(BufAndTag), bufandtagcmp);
!
!     /*
!      * Loop over all buffers to be written, and write the ones (still) marked
!      * with BM_CHECKPOINT_NEEDED.  Note that we don't need to recheck the
!      * buffer tag, because if the buffer has been reassigned it cannot have
!      * BM_CHECKPOINT_NEEDED still set.
       */
      num_written = 0;
!     for (i = 0; i < num_to_write; i++)
      {
!         volatile BufferDesc *bufHdr;
!
!         buf_id = bufs_to_write[i].buf_id;
!         bufHdr = &BufferDescriptors[buf_id];

          /*
           * We don't need to acquire the lock here, because we're only looking
***************
*** 1077,1096 ****
                  num_written++;

                  /*
-                  * We know there are at most num_to_write buffers with
-                  * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
-                  * num_written reaches num_to_write.
-                  *
-                  * Note that num_written doesn't include buffers written by
-                  * other backends, or by the bgwriter cleaning scan. That
-                  * means that the estimate of how much progress we've made is
-                  * conservative, and also that this test will often fail to
-                  * trigger.  But it seems worth making anyway.
-                  */
-                 if (num_written >= num_to_write)
-                     break;
-
-                 /*
                   * Perform normal bgwriter duties and sleep to throttle our
                   * I/O rate.
                   */
--- 1136,1141 ----
***************
*** 1098,1110 ****
                                       (double) num_written / num_to_write);
              }
          }
-
-         if (++buf_id >= NBuffers)
-             buf_id = 0;
      }

      /*
!      * Update checkpoint statistics. As noted above, this doesn't include
       * buffers written by other backends or bgwriter scan.
       */
      CheckpointStats.ckpt_bufs_written += num_written;
--- 1143,1152 ----
                                       (double) num_written / num_to_write);
              }
          }
      }

      /*
!      * Update checkpoint statistics.  The num_written count doesn't include
       * buffers written by other backends or bgwriter scan.
       */
      CheckpointStats.ckpt_bufs_written += num_written;
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.64
diff -c -r1.64 freelist.c
*** src/backend/storage/buffer/freelist.c    1 Jan 2008 19:45:51 -0000    1.64
--- src/backend/storage/buffer/freelist.c    4 May 2008 01:11:08 -0000
***************
*** 241,250 ****
  }

  /*
!  * StrategySyncStart -- tell BufferSync where to start syncing
   *
!  * The result is the buffer index of the best buffer to sync first.
!  * BufferSync() will proceed circularly around the buffer array from there.
   *
   * In addition, we return the completed-pass count (which is effectively
   * the higher-order bits of nextVictimBuffer) and the count of recent buffer
--- 241,251 ----
  }

  /*
!  * StrategySyncStart -- tell BgBufferSync where we are reclaiming buffers
   *
!  * The result is the buffer index of the next possible victim buffer.
!  * BgBufferSync() tries to keep the buffers immediately in front of this
!  * point clean.
   *
   * In addition, we return the completed-pass count (which is effectively
   * the higher-order bits of nextVictimBuffer) and the count of recent buffer

Re: Sorting writes during checkpoint

From
Greg Smith
Date:
On Sun, 4 May 2008, Tom Lane wrote:

> However, I am completely unable to measure any performance improvement
> from it.  Given the possible risk of out-of-memory failures, I think the
> patch should not be applied without some direct proof of performance
> benefits, and I don't see any.

Fair enough.  There were some pgbench results attached to the original
patch submission that gave me a good idea how to replicate the situation
where there's some improvement.  I expect I can take a shot at quantifying
that independently near the end of this month if nobody else gets to it
before then (I'm stuck sorting out a number of OS-level issues right now
before my testing system is online again).  Was planning to take a longer
look at Greg Stark's prefetching work at that point as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Sorting writes during checkpoint

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> On Sun, 4 May 2008, Tom Lane wrote:
>> However, I am completely unable to measure any performance improvement
>> from it.  Given the possible risk of out-of-memory failures, I think the
>> patch should not be applied without some direct proof of performance
>> benefits, and I don't see any.

> Fair enough.  There were some pgbench results attached to the original
> patch submission that gave me a good idea how to replicate the situation
> where there's some improvement.

Well, I tried a pgbench test similar to that one --- on smaller hardware
than was reported, so it was a bit smaller test case, but it should have
given similar results.  I didn't see any improvement; if anything it was
a bit worse.  So that's what got me concerned.

Of course it's notoriously hard to get consistent numbers out of pgbench
anyway, so I'd rather see some other test case ...

> I expect I can take a shot at quantifying
> that independently near the end of this month if nobody else gets to it
> before then (I'm stuck sorting out a number of OS-level issues right now
> before my testing system is online again).  Was planning to take a longer
> look at Greg Stark's prefetching work at that point as well.

Fair enough.  Unless someone can volunteer to test sooner, I think we
should drop this item from the current commitfest queue.

            regards, tom lane

Re: Sorting writes during checkpoint

From
Greg Smith
Date:
On Sun, 4 May 2008, Tom Lane wrote:

> Well, I tried a pgbench test similar to that one --- on smaller hardware
> than was reported, so it was a bit smaller test case, but it should have
> given similar results.

My pet theory on cases where sorting will help suggests you may need a
write-caching controller for this patch to be useful.  I expect we'll see
the biggest improvement in situations where the total amount of dirty
buffers is larger than the write cache and the cache becomes blocked.  If
you're not offloading to another device like that, the OS-level elevator
sorting will handle sorting for you close enough to optimally that I doubt
this will help much (and in fact may just get in the way).

> Of course it's notoriously hard to get consistent numbers out of pgbench
> anyway, so I'd rather see some other test case ...

I have some tools that run pgbench many times and look for patterns, which
work fairly well for the consistency part.  pgbench will dirty a very
high percentage of the buffer cache by checkpoint time relative to how
much work it does, which makes it close to a best case for confirming
there is a potential improvement here.

I think a reasonable approach is to continue trying to quantify some
improvement using pgbench with an eye toward also doing DBT2 tests, which
provoke similar behavior at checkpoint time.  I suspect someone who
already has a known good DBT2 lab setup with caching controller hardware
(EDB?) might be able to do a useful test of this patch without too much
trouble on their part.

> Unless someone can volunteer to test sooner, I think we should drop this
> item from the current commitfest queue.

This patch took a good step forward toward being committed this round with
your review, which is the important part from my perspective (as someone
who would like this to be committed if it truly works).  I expect that
performance related patches will often take more than one commitfest to
pass through.

From the perspective of keeping the committer's plates clean, a reasonable
system for this situation might be for you to bounce this into the
rejected pile as "Returned for testing" immediately, to clearly remove it
from the main queue.  A reasonable expectation there is that you might
consider it again during May if someone gets back with said testing
results before the 'fest ends.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Sorting writes during checkpoint

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> On Sun, 4 May 2008, Tom Lane wrote:
>> Well, I tried a pgbench test similar to that one --- on smaller hardware
>> than was reported, so it was a bit smaller test case, but it should have
>> given similar results.

> ... If
> you're not offloading to another device like that, the OS-level elevator
> sorting will handle sorting for you close enough to optimally that I doubt
> this will help much (and in fact may just get in the way).

Yeah.  It bothers me a bit that the patch forces writes to be done "all
of file A in order, then all of file B in order, etc".  We don't know
enough about the disk layout of the files to be sure that that's good.
(This might also mean that whether there is a win is going to be
platform and filesystem dependent ...)

>> Unless someone can volunteer to test sooner, I think we should drop this
>> item from the current commitfest queue.

> From the perspective of keeping the committer's plates clean, a reasonable
> system for this situation might be for you to bounce this into the
> rejected pile as "Returned for testing" immediately, to clearly remove it
> from the main queue.  A reasonable expectation there is that you might
> consider it again during May if someone gets back with said testing
> results before the 'fest ends.

Right, that's in the ground rules for commitfests: if the submitter can
respond to complaints before the fest is over, we'll reconsider the
patch.

            regards, tom lane

Re: Sorting writes during checkpoint

From
Greg Smith
Date:
On Mon, 5 May 2008, Tom Lane wrote:

> It bothers me a bit that the patch forces writes to be done "all of file
> A in order, then all of file B in order, etc".  We don't know enough
> about the disk layout of the files to be sure that that's good. (This
> might also mean that whether there is a win is going to be platform and
> filesystem dependent ...)

I think most platform and filesystem implementations have disk location
correlated enough with block order that this particular issue isn't a
large one.  If the writes are mainly going to one logical area (a single
partition or disk array), it should be a win as long as the sorting step
itself isn't introducing a delay.  I am concerned that in a more
complicated case than pgbench, where the writes are spread across multiple
arrays, say, forcing writes in order may slow things down.

Example:  let's say there's two tablespaces mapped to two arrays, A and B,
that the data is being written to at checkpoint time.  In the current
case, that I/O might be AABAABABBBAB, which is going to keep both arrays
busy writing.  The sorted case would instead make that AAAAAABBBBBB so
only one array will be active at a time.  It may very well be the case
that the improvement from lowering seeks on the writes to A and B is less
than the loss coming from not keeping both continuously busy.

I think I can simulate this by using a modified pgbench script that works
against accounts1 and accounts2 tables with equal frequency, where the two are
actually on different tablespaces on two disks.

> Right, that's in the ground rules for commitfests: if the submitter can
> respond to complaints before the fest is over, we'll reconsider the
> patch.

The small optimization I was trying to suggest was that you just bounce
this type of patch automatically to the "rejected for <x>" section of the
commitfest wiki page in cases like these.  The standard practice on this
sort of queue is to automatically reclassify when someone has made a pass
over the patch, leaving the original source to re-open with more
information.  That keeps the unprocessed part of the queue always
shrinking, and as long as people know that they can get it reconsidered by
submitting new results it's not unfair to them.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Sorting writes during checkpoint

From
Simon Riggs
Date:
On Mon, 2008-05-05 at 00:23 -0400, Tom Lane wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
> > On Sun, 4 May 2008, Tom Lane wrote:
> >> Well, I tried a pgbench test similar to that one --- on smaller hardware
> >> than was reported, so it was a bit smaller test case, but it should have
> >> given similar results.
>
> > ... If
> > you're not offloading to another device like that, the OS-level elevator
> > sorting will handle sorting for you close enough to optimally that I doubt
> > this will help much (and in fact may just get in the way).
>
> Yeah.  It bothers me a bit that the patch forces writes to be done "all
> of file A in order, then all of file B in order, etc".  We don't know
> enough about the disk layout of the files to be sure that that's good.
> (This might also mean that whether there is a win is going to be
> platform and filesystem dependent ...)

No action on this seen since last commitfest, but I think we should do
something with it, rather than just ignore it.

I agree with all the comments myself, so my proposed solution is to implement
this as an I/O elevator hook. The standard elevator issues the writes in the
order they come; an additional elevator in contrib sorts them by file/block.
That will make testing easier and will also give Itagaki his benefit,
while allowing ongoing research. If this solution is good enough for
Linux it ought to be good enough for us.

Note that if we do this for checkpoint we should also do this for
FlushRelationBuffers(), used during heap_sync(), for exactly the same
reasons.

Would suggest calling it bulk_io_hook() or similar.
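
To make that concrete, one possible shape for the hook (the name is the one
suggested above; the signature and wiring are only a sketch, reusing the
BufAndTag array and the bufandtagcmp comparator from the patch upthread):

/* Hook type: reorder the checkpoint's to-do list before the writes are issued. */
typedef void (*bulk_io_hook_type) (BufAndTag *buffers, int nbuffers);

/* NULL means the standard elevator: issue the writes in the order they come. */
bulk_io_hook_type bulk_io_hook = NULL;

/* Called by BufferSync() just before its write loop (sketch only). */
static void
order_checkpoint_writes(BufAndTag *buffers, int nbuffers)
{
    if (bulk_io_hook != NULL)
        bulk_io_hook(buffers, nbuffers);
}

/* A contrib module could then install a file/block-sorting elevator: */
static void
sorted_bulk_io(BufAndTag *buffers, int nbuffers)
{
    qsort(buffers, nbuffers, sizeof(BufAndTag), bufandtagcmp);
}

void
_PG_init(void)
{
    bulk_io_hook = sorted_bulk_io;
}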

Further observation would be that if there was an effect then it would
be at the block-device level, i.e. tablespace. Sorting the writes so
that we issued one tablespace at a time might at least help the I/O
elevators/disk caches to work with the whole problem at once. We might
get benefit on one tablespace but not on another.

Sorting by file might have inadvertently shown benefit at the tablespace
level on a larger server with spread out data whereas on Tom's test
system I would guess just a single tablespace was used.

Anyway, I note that we don't have an easy way of sorting by tablespace,
but I'm sure it would be possible to look up the tablespace for a file
within a plugin.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Sorting writes during checkpoint

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Anyway, I note that we don't have an easy way of sorting by tablespace,

Say what?  tablespace is the first component of relfilenode.

> but I'm sure it would be possible to look up the tablespace for a file
> within a plugin.

If the information weren't readily available from relfilenode, it would
*not* be possible for a bufmgr plugin to look it up.  bufmgr is much too
low-level to be dependent on performing catalog lookups.
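
For what it's worth, grouping the writes by tablespace needs nothing beyond
the fields already in the buffer tag; a comparator keyed on spcNode first
would be a trivial variant of the patch's bufandtagcmp (sketch only, untested):

static int
bufandtagcmp_by_tablespace(const void *a, const void *b)
{
    const BufAndTag *lhs = (const BufAndTag *) a;
    const BufAndTag *rhs = (const BufAndTag *) b;

    /* group the writes by tablespace first ... */
    if (lhs->tag.rnode.spcNode != rhs->tag.rnode.spcNode)
        return (lhs->tag.rnode.spcNode < rhs->tag.rnode.spcNode) ? -1 : 1;
    /* ... then by database and relation ... */
    if (lhs->tag.rnode.dbNode != rhs->tag.rnode.dbNode)
        return (lhs->tag.rnode.dbNode < rhs->tag.rnode.dbNode) ? -1 : 1;
    if (lhs->tag.rnode.relNode != rhs->tag.rnode.relNode)
        return (lhs->tag.rnode.relNode < rhs->tag.rnode.relNode) ? -1 : 1;
    /* ... and finally by block number within the relation */
    if (lhs->tag.blockNum < rhs->tag.blockNum)
        return -1;
    if (lhs->tag.blockNum > rhs->tag.blockNum)
        return 1;
    return 0;
}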

            regards, tom lane

Re: Sorting writes during checkpoint

From
Simon Riggs
Date:
On Fri, 2008-07-04 at 12:05 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Anyway, I note that we don't have an easy way of sorting by tablespace,
>
> Say what?  tablespace is the first component of relfilenode.

OK, that's a mistake... what about the rest of the idea?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Sorting writes during checkpoint

From
Greg Smith
Date:
On Fri, 4 Jul 2008, Simon Riggs wrote:

> No action on this seen since last commitfest, but I think we should do
> something with it, rather than just ignore it.

Just no action worth reporting yet.  Over the weekend I finally reached
the point where I've got a system that should be capable of independently
replicating the setup where the improvement was reported, and I've started
performance testing of the patch.  Getting useful checkpoint test results
from pgbench is really a pain.

> Sorting by file might have inadvertently shown benefit at the tablespace
> level on a larger server with spread out data whereas on Tom's test
> system I would guess just a single tablespace was used.

I doubt this has anything to do with it, only because the pgbench schema
doesn't split into tablespaces usefully.  Almost all of the real action is
on a single table, accounts.

My suspicion is that sorting only benefits in situations where you have a
disk controller with a significant amount of RAM on it, but the server RAM
is much larger.  In that case the sorting horizon of the controller itself
is smaller than what the server can do, and the sorting makes it less
likely you'll end up with the controller filled with unsorted stuff that
takes a long time to clear.

In Tom's test, there's probably only 8 or 16MB worth of cache on the disk
itself, so you can't get a large backlog of unsorted writes clogging the
write pipeline.  But most server systems have 256MB or more of RAM there,
and if you get that filled with seek-heavy writes (which might only clear
at a couple of MB a second) the delay for that cache to empty can be
considerable.

That said, I've got a 256MB controller here and have a very similar disk
setup to the one the positive results were reported on, but so far I don't see
any significant difference after applying the sorted writes patch.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Sorting writes during checkpoint

From
Simon Riggs
Date:
On Wed, 2008-07-09 at 21:39 -0400, Greg Smith wrote:
> On Fri, 4 Jul 2008, Simon Riggs wrote:
>
> > No action on this seen since last commitfest, but I think we should do
> > something with it, rather than just ignore it.
>
> Just no action worth reporting yet.  Over the weekend I finally reached
> the point where I've got a system that should be capable of independently
> replicating the setup where the improvement was reported, and I've started
> performance testing of the patch.  Getting useful checkpoint test results
> from pgbench is really a pain.

I agree completely. That's why I've suggested a plugin approach. That
way Itagaki can have his performance, the rest of us don't need to fret
and yet we hold open the door indefinitely for additional ways of doing
it. And we can test it on production systems with realistic workloads.
If one clear way emerges as best, we adopt that plugin permanently.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support