Thread: Decreasing BLKSZ

Decreasing BLKSZ

From: "Marc Morin"
Our application has a number of inserters posting rows of network statistics into a database.  This is occurring continuously.  The following is an example of a stats table (simplified, but it maintains the key concepts).
 
 
CREATE TABLE stats
(
      logtime timestamptz,
      key     int,
      stat1   bigint,
      stat2   bigint,
      stat3   bigint,
      PRIMARY KEY (key,logtime)
);
CREATE INDEX x ON stats(logtime);
 
There are on the order of 1M unique values for "key", and a new row for each key value will be inserted, say, every 15 minutes.  These rows are divided up between a number of different inserting elements, but that isn't relevant.
 
The problem is, the insert pattern has low correlation with the (key,logtime) index.  In this case, I would need >1M blocks in my shared_buffers space to prevent a read-modify-write type of pattern from happening during the inserts (given a large enough database).
 
I'm wondering about lowering the BLKSZ value so that the total working set of blocks required can be maintained in my shared buffers.  Our database server only has 8G of memory, and I'd likely need to reduce BLKSZ to 512....
 
Any comment on other effects or gotchas with lowering the size of BLKSZ?  Currently, our database is thrashing its cache of blocks and we're getting only ~100 inserts/second; every insert results in an evict-read-modify operation.
 
 
Ideally, I'd like to keep the entire working set of blocks in memory across insert periods so that the I/O looks more like full-block writes....
 
Thanks
Marc
 
 

Re: Decreasing BLKSZ

From: Markus Schaber
Hi, Marc,

Marc Morin wrote:

> The problem is, the insert pattern has low correlation with the
> (key,logtime) index.  In this case, I would need >1M blocks in my
> shared_buffers space to prevent a read-modify-write type of pattern
> from happening during the inserts (given a large enough database).

Would it be possible to change the primary key to (logtime,key)? This
could help keeping the "working window" small.
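
For illustration, the same table with the key columns swapped in the primary
key (an untested sketch; the column definitions are taken from the CREATE
TABLE above):

CREATE TABLE stats
(
      logtime timestamptz,
      key     int,
      stat1   bigint,
      stat2   bigint,
      stat3   bigint,
      PRIMARY KEY (logtime, key)  -- logtime first, so new rows land on the
                                  -- rightmost index pages
);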

Secondly, the real working set is smaller, as the rows are all inserted
at the end of the table, filling each page until it's full, so only the
last pages are accessed. There's no relation between the index order,
and the order of data on disk, unless you CLUSTER.

> Any comment on other effects or gotchas with lowering the size of
> BLKSZ?  Currently, our database is thrashing its cache of blocks and
> we're getting only ~100 inserts/second; every insert results in an
> evict-read-modify operation.

I'm not sure that's the correct diagnosis.

Do you have one transaction per insert? Every transaction means a forced
sync to the disk, so you won't get more than about 100-200 commits per
second, depending on your actual disk rotation speed.

To improve concurrency of the "number of inserters" running in parallel,
try to tweak the config variables commit_delay and commit_siblings, so
you get a higher overall throughput at the cost of an increased delay per
connection, and increase the number of inserters. With sensible
tweaking, the throughput should scale nearly linearly with the number of
backends. :-)
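
For example (the values here are only illustrative starting points and need
tuning for your actual hardware and load):

-- both can also be set in postgresql.conf for all backends
SET commit_delay = 10000;    -- wait up to 10 ms before flushing WAL...
SET commit_siblings = 5;     -- ...but only if at least 5 other transactions
                             -- are open at commit time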

If feasible for your application, you can also bundle several log
entries into a single transaction. If you're CPU bound, you can use COPY
instead of INSERT or (if you can wait for 8.2) the new multi-row INSERT
to further improve performance, but I doubt that you're CPU bound.
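
As a rough sketch of both (the column list matches your stats table, the
values are made up; the multi-row INSERT needs 8.2):

COPY stats (logtime, key, stat1, stat2, stat3) FROM STDIN;
-- ...stream one tab-separated row per line here, terminated by \.

-- 8.2-style multi-row INSERT, several rows per statement:
INSERT INTO stats (logtime, key, stat1, stat2, stat3) VALUES
  (now(), 1, 10, 20, 30),
  (now(), 2, 11, 21, 31);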

The only way to "really" get over the sync limit is to have (at least)
the WAL on a battery backed ram / SSD media that has no "spinning disk"
physical limit, or abandon crash safety by turning fsync off.

Thanks,
Markus.
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Decreasing BLKSZ

From: "Marc Morin"
> Would it be possible to change the primary key to
> (logtime,key)? This could help keeping the "working window" small.

No, the application accessing the data wants all the rows between start
and end time for a particular key value.

>
> Secondly, the real working set is smaller, as the rows are
> all inserted at the end of the table, filling each page until
> it's full, so only the last pages are accessed. There's no
> relation between the index order, and the order of data on
> disk, unless you CLUSTER.

I'm theorizing that my problem is in updating the index itself and not
the heap.  Insert order refers to the order in which the applications are
inserting the rows and, as such, the order in which the index is being
updated.  This, in turn, is causing the b-tree to be traversed.  The
problem is that the working set of blocks at the bottom of the btree is
too big for my cache.

>
> > Any comment on other effects or gotchas with lowering the size of
> > BLKSZ?  Currently, our database is thrashing its cache of blocks and
> > we're getting only ~100 inserts/second; every insert results in an
> > evict-read-modify operation.
>
> I'm not sure that's the correct diagnosis.
>
> Do you have one transaction per insert? Every transaction
> means a forced sync to the disk, so you won't get more than
> about 100-200 commits per second, depending on your actual
> disk rotation speed.

No, an insert consists of roughly 10,000+ rows per transaction block.

>
> To improve concurrency of the "number of inserters" running in
> parallel, try to tweak the config variables commit_delay and
> commit_siblings, so you get a higher overall throughput at the
> cost of an increased delay per connection, and increase the
> number of inserters. With sensible tweaking, the throughput
> should scale nearly linearly with the number of backends. :-)

I don't think this will help us here, since we're already using large transactions.

>
> If feasible for your application, you can also bundle several
> log entries into a single transaction. If you're CPU bound,
> you can use COPY instead of INSERT or (if you can wait for
> 8.2) the new multi-row INSERT to further improve performance,
> but I doubt that you're CPU bound.

>
> The only way to "really" get over the sync limit is to have
> (at least) the WAL on a battery backed ram / SSD media that
> has no "spinning disk"
> physical limit, or abandon crash safety by turning fsync off.

Again, the problem is not with WAL writing; it's already on its own raid1
disk pair.  The I/O pattern we see is about 1-2% load on WAL and 100% load
on the array holding the indexes and tables.  Throughput is very low,
something like 150-200 KB/second of real rows being deposited on the disk.

The disks are busy seeking all over the disk platter to fetch a block,
add a single row, then seek to another spot and write back a previously
dirty buffer....


Re: Decreasing BLKSZ

From: Tom Lane
"Marc Morin" <marc@sandvine.com> writes:
> No, an insert consists of roughly 10,000+ rows per transaction block.

Perhaps it would help to pre-sort these rows by key?
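
For example, if each batch is staged first (the staging table name here is
hypothetical), something like this presents the index with keys in order
instead of at random:

-- load the 10,000+ rows of a batch into an unindexed staging table,
-- then move them over sorted by key, so the btree is walked
-- left-to-right rather than hopping all over the leaf level
INSERT INTO stats
SELECT logtime, key, stat1, stat2, stat3
FROM stats_staging
ORDER BY key, logtime;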

Like Markus, I'm pretty suspicious of lowering BLCKSZ ... you can try it
but it's likely to prove counterproductive (more btree index levels,
more rows requiring toasting, a tighter limit on what rows will fit at
all).  I doubt I'd try to make it lower than a couple K in any case.

The bottom line here is likely to be "you need more RAM" :-(

I wonder whether there is a way to use table partitioning to make the
insert pattern more localized?  We'd need to know a lot more about your
insertion patterns to guess how, though.
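
As a very rough sketch (8.1 inheritance-style partitioning with made-up key
ranges; the application or a rule/trigger would have to route the inserts):

-- one child table per range of keys, so each inserter touches a much
-- smaller set of index blocks; indexes are created per child table
CREATE TABLE stats_key_0 (CHECK (key >= 0      AND key < 100000)) INHERITS (stats);
CREATE TABLE stats_key_1 (CHECK (key >= 100000 AND key < 200000)) INHERITS (stats);
CREATE INDEX stats_key_0_idx ON stats_key_0 (key, logtime);
CREATE INDEX stats_key_1_idx ON stats_key_1 (key, logtime);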

            regards, tom lane

Re: Decreasing BLKSZ

From: Mark Lewis
I'm not sure if decreasing BLKSZ is the way to go.  It would allow you
to have more smaller blocks in memory, but the actual coverage of the
index would remain the same; if only 33% of the index fits in memory
with the 8K BLKSZ then only 33% would fit in memory with a 4k BLKSZ.  I
can see where you're going if the tree nodes for all 15 million key
entries fit in memory, as well as the most recent nodes for the logtime
values lower down in the index; basically you're trying to make sure
that the "right" 33% of the index is in memory.

But it seems like it might be more useful to have two indexes, one on
logtime and one on key.  Inserts into the logtime index would be
correlated with your insert order and as such be cache-friendly so
that's not an issue.  The index on just the key column would be at least
as small as the active subset of a combined index, so performance should
be at least as good as you could possibly achieve by reducing BLKSZ.

PG 8.1 is smart enough to use a bitmap index scan to combine the two
indexes at query time; if that gives you adequate performance then it
would be simpler than reducing BLKSZ.
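
Something like this (sketch; the WHERE values are made up):

CREATE INDEX stats_key_idx     ON stats (key);
CREATE INDEX stats_logtime_idx ON stats (logtime);

-- typical drill-down query; 8.1 can AND the two bitmap index scans
EXPLAIN SELECT * FROM stats
WHERE key = 42
  AND logtime BETWEEN '2006-09-25 00:00' AND '2006-09-26 00:00';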

-- Mark Lewis


Re: Decreasing BLKSZ

From: "Marc Morin"
>
> The bottom line here is likely to be "you need more RAM" :-(

Yup.  Just trying to get a handle on what I can do if I need more than
16G of RAM... That's as much as I can put on the installed base of
servers.... 100s of them.

>
> I wonder whether there is a way to use table partitioning to
> make the insert pattern more localized?  We'd need to know a
> lot more about your insertion patterns to guess how, though.
>
>             regards, tom lane

We're doing partitioning as well.....

Re: Decreasing BLKSZ

From: "Bucky Jordan"
> > The bottom line here is likely to be "you need more RAM" :-(
>
> Yup.  Just trying to get a handle on what I can do if I need more than
> 16G of RAM... That's as much as I can put on the installed base of
> servers.... 100s of them.
>
> >
> > I wonder whether there is a way to use table partitioning to
> > make the insert pattern more localized?  We'd need to know a
> > lot more about your insertion patterns to guess how, though.
> >
> >             regards, tom lane
>
> We're doing partitioning as well.....
> >
I'm guessing that you basically have a data collection application that
sends in lots of records, and a reporting application that wants
summaries of the data? So, if I understand the problem correctly, you
don't have enough ram (or may not in the future) to index the data as it
comes in.

Not sure how much you can change the design, but what about either
updating one or more summary tables as the records come in (via a trigger,
as part of the transaction, or in the application), or indexing
periodically? In other words, load a partition (say a day's worth), then
index that partition all at once. If you're doing real-time analysis that
might not work so well, but the summary tables should.
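
For instance, a completely made-up hourly rollup (the table, trigger
function and names are all hypothetical, and the UPDATE-then-INSERT isn't
concurrency-safe as written; it's just to show the shape of it):

CREATE TABLE stats_hourly
(
      hour_start timestamptz,
      key        int,
      stat1      bigint,
      PRIMARY KEY (hour_start, key)
);

CREATE OR REPLACE FUNCTION stats_rollup() RETURNS trigger AS $$
BEGIN
    -- add the new row's value into the matching hourly bucket
    UPDATE stats_hourly
       SET stat1 = stat1 + NEW.stat1
     WHERE hour_start = date_trunc('hour', NEW.logtime) AND key = NEW.key;
    IF NOT FOUND THEN
        INSERT INTO stats_hourly
            VALUES (date_trunc('hour', NEW.logtime), NEW.key, NEW.stat1);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_rollup_trg AFTER INSERT ON stats
    FOR EACH ROW EXECUTE PROCEDURE stats_rollup();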

I assume the application generates unique records on its own due to the
timestamp, so this isn't really about checking for constraint
violations? If so, you can probably do away with the index on the tables
that you're running the inserts on.

- Bucky

Re: Decreasing BLKSZ

From: "Marc Morin"
Yes, that is our application.   We have implemented both scenarios...

1- Partitions loaded without indexes on them, and the index built when
the partition is full.  Slow to drill down into incomplete partitions.
2- Partitions with the index in place as they're loaded.  Slow on insert
(the problem mentioned), but good to drill down....

So, I'd like my cake and eat it too... :-)

I'd like to have my indexes built as rows are inserted into the
partition, to help with the drill down...


Re: Decreasing BLKSZ

From: "Bucky Jordan"
>
> So, I'd like my cake and eat it too... :-)
>
> I'd like to have my indexes built as rows are inserted into the
> partition, to help with the drill down...
>
So you want to drill down so fine-grained that summary tables don't do
much good? Keep in mind, even if you roll up only two records, that's
half as many rows you have to process (be it for drill down or index).

I've seen applications that have a log table with no indexes/constraints
and lots of records being inserted; they then report only from very
fine-grained summary tables. Drill downs still work pretty well, but if you
get audited and want to see that specific action, well, you're in for a
bit of a wait, but hopefully that doesn't happen too often.

If that's the case (summary tables won't work), I'd be very curious how
you manage to get your cake and eat it too :)

- Bucky

Re: Decreasing BLKSZ

From: Markus Schaber
Hi, Marc,

Marc Morin wrote:

>> I wonder whether there is a way to use table partitioning to
>> make the insert pattern more localized?  We'd need to know a
>> lot more about your insertion patterns to guess how, though.
>
> We're doing partitioning as well.....

And is constraint exclusion set up properly, and have you verified that
it works?
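
I.e. something like this (assuming your partitions carry CHECK constraints
on logtime; the dates and key value are made up):

SET constraint_exclusion = on;   -- off by default in 8.1

EXPLAIN SELECT * FROM stats
WHERE key = 42
  AND logtime >= '2006-09-25' AND logtime < '2006-09-26';
-- the plan should touch only the matching child table(s)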

HTH,
Markus

Re: Decreasing BLKSZ

From: Jim Nasby
On Sep 26, 2006, at 5:36 PM, Marc Morin wrote:
> 1- Partitions loaded without indexes on them, and the index built when
> the partition is full.  Slow to drill down into incomplete partitions.
> 2- Partitions with the index in place as they're loaded.  Slow on insert
> (the problem mentioned), but good to drill down....

How big are your partitions? The number of rows in your active
partition will determine how large your indexes are (and probably
more importantly, how many levels there are), which will definitely
affect your timing. So, you might have better luck with a smaller
partition size.

I'd definitely try someone else's suggestion of making the PK
logtime, key (assuming that you need to enforce uniqueness) and
having an extra index on just key. If you don't need to enforce
uniqueness, just have one index on key and one on logtime. Or if your
partitions are small enough, don't even create the logtime index
until the partition isn't being inserted into anymore.
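
Roughly (the partition name is made up):

-- per active partition: the PK enforces uniqueness and serves the
-- time-range drill-down; the extra index serves lookups by key
ALTER TABLE stats_20060925 ADD PRIMARY KEY (logtime, key);
CREATE INDEX stats_20060925_key_idx ON stats_20060925 (key);

-- or, if uniqueness isn't needed, keep only the key index while loading
-- and build the logtime index once the partition stops receiving inserts:
CREATE INDEX stats_20060925_logtime_idx ON stats_20060925 (logtime);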

If the number of key values is pretty fixed, it'd be an interesting
experiment to try partitioning on that, perhaps even with one key per
partition (which would allow you to drop the key from the tables
entirely), ie:

CREATE TABLE stats_1 (logtime timestamptz PRIMARY KEY,
                      stat1 bigint, stat2 bigint, stat3 bigint);
CREATE TABLE stats_2 ...

CREATE VIEW stats AS
SELECT 1 AS key, * FROM stats_1
UNION ALL SELECT 2, * FROM stats_2
...

I wouldn't put too much work into that as no real effort's been
expended to optimize for that case (especially the resulting monster
UNION ALL), but you might get lucky.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)