Re: Decreasing BLKSZ - Mailing list pgsql-performance

From Mark Lewis
Subject Re: Decreasing BLKSZ
Msg-id 1159223304.9657.1780.camel@archimedes
In response to Re: Decreasing BLKSZ  ("Marc Morin" <marc@sandvine.com>)
List pgsql-performance
I'm not sure decreasing BLKSZ is the way to go.  It would let you keep
more, smaller blocks in memory, but the actual coverage of the index
would remain the same: if only 33% of the index fits in memory with an
8K BLKSZ, then only 33% would fit with a 4K BLKSZ.  I can see where
you're going if the tree nodes for all 15 million key entries fit in
memory, along with the most recent logtime nodes lower down in the
index; basically you're trying to make sure that the "right" 33% of
the index is in memory.

But it seems like it might be more useful to have two indexes, one on
logtime and one on key.  Inserts into the logtime index would be
correlated with your insert order and as such be cache-friendly, so
that's not an issue.  The index on just the key column would be at least
as small as the active subset of a combined index, so performance should
be at least as good as you could possibly achieve by reducing BLKSZ.
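
Concretely, that might look something like the sketch below (table and
column names are just guesses at your schema):

    -- two single-column indexes instead of one composite (key, logtime)
    CREATE INDEX logtable_logtime_idx ON logtable (logtime);
    CREATE INDEX logtable_key_idx     ON logtable (key);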

PG 8.1 is smart enough to use a bitmap index scan to combine the two
indexes at query time; if that gives you adequate performance then it
would be simpler than reducing BLKSZ.
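
With both indexes in place, a range query like the sketch below should
show a BitmapAnd of two bitmap index scans in its EXPLAIN output
(names made up, as above):

    EXPLAIN
    SELECT *
      FROM logtable
     WHERE key = 42
       AND logtime BETWEEN '2006-09-25 00:00' AND '2006-09-25 23:59';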

-- Mark Lewis

On Mon, 2006-09-25 at 17:54 -0400, Marc Morin wrote:
> > Would it be possible to change the primary key to
> > (logtime,key)? This could help keep the "working window" small.
>
> No, the application accessing the data wants all the rows between start
> and end time for a particular key value.
>
> >
> > Secondly, the real working set is smaller, as the rows are
> > all inserted at the end of the table, filling each page until
> > it's full, so only the last pages are accessed. There's no
> > relation between the index order and the order of data on
> > disk, unless you CLUSTER.
>
> I'd theorize that my problem is in updating the index itself, not the
> heap.  "Insert order" refers to the order in which the applications
> are inserting the rows, and as such the order in which the index is
> being updated.  This, in turn, causes the b-tree to be traversed.
> The problem is that the working set of blocks at the bottom of the
> btree is too big for my cache.
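
One quick way to sanity-check that theory is to compare the index's
on-disk size against your buffer cache; a rough sketch (the index name
here is a placeholder for your real primary key index):

    -- on-disk size of the index, in bytes (available since 8.1)
    SELECT pg_relation_size('logtable_pkey');

    -- shared_buffers is reported as a count of 8 KB pages in 8.1
    SELECT current_setting('shared_buffers');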
>
> >
> > > Any comment on other effects or gotchas with lowering the size of
> > > BLKSZ?  Currently, our database is thrashing its cache of blocks;
> > > we're getting only ~100 inserts/second, and every insert results
> > > in an evict-read-modify operation.
> >
> > I'm not sure that's the correct diagnosis.
> >
> > Do you have one transaction per insert? Every transaction
> > means a forced sync to the disk, so you won't get more than
> > about 100-200 commits per second, depending on your actual
> > disk rotation speed (a 7,200 RPM disk spins 120 times per
> > second, capping a single committer at roughly 120 synchronous
> > commits per second).
>
> No, an insert consists of roughly 10,000+ rows per transaction block.
>
> >
> > To improve concurrency of the number of inserters running in
> > parallel, try tweaking the config variables commit_delay and
> > commit_siblings, so you get higher overall throughput at the
> > cost of an increased delay per connection, and increase the
> > number of inserters. With sensible tweaking, the throughput
> > should scale nearly linearly with the number of backends. :-)
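
For what it's worth, both are plain GUCs you can experiment with per
session before touching postgresql.conf; the values below are
placeholders to tune, not recommendations:

    -- wait up to 10 ms before flushing the commit record to WAL,
    -- but only if at least 5 other transactions are currently open
    SET commit_delay = 10000;   -- in microseconds
    SET commit_siblings = 5;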
>
> I don't think this will help us here due to large transactions already.
>
> >
> > If feasible for your application, you can also bundle several
> > log entries into a single transaction. If you're CPU bound,
> > you can use COPY instead of INSERT or (if you can wait for
> > 8.2) the new multi-row INSERT to further improve performance,
> > but I doubt that you're CPU bound.
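
For illustration, assuming a hypothetical table logtable(key, logtime,
data) and a made-up staging file path:

    -- COPY: bulk load with much less per-row parse/plan overhead
    -- (a server-side file path like this requires superuser)
    COPY logtable (key, logtime, data) FROM '/tmp/batch.csv' WITH CSV;

    -- the 8.2-style multi-row INSERT: many rows in one statement
    INSERT INTO logtable (key, logtime, data)
    VALUES (42, '2006-09-25 17:54:00', 'foo'),
           (43, '2006-09-25 17:54:01', 'bar');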
>
> >
> > The only way to "really" get over the sync limit is to have
> > (at least) the WAL on battery-backed RAM / SSD media that has
> > no "spinning disk" physical limit, or to abandon crash safety
> > by turning fsync off.
>
> Again, the problem is not with WAL writing; the WAL is already on its
> own RAID1 disk pair.  The I/O pattern we see is about 1-2% load on
> the WAL array and 100% load on the array holding the indexes and
> tables.  Throughput is very low, something like 150-200 KB/second of
> real rows being deposited on the disk.
>
> The disks are busy seeking all over the disk platter to fetch a block,
> add a single row, then seek to another spot and write back a previously
> dirty buffer....
>
> >
> > Thanks,
> > Markus.
> > --
> > Markus Schaber | Logical Tracking&Tracing International AG
> > Dipl. Inf.     | Software Development GIS
> >
> > Fight against software patents in Europe! www.ffii.org
> > www.nosoftwarepatents.org
> >
>
