Thread: Decreasing BLKSZ
Our application has a number of inserters posting rows of network statistics into a database. This happens continuously. The following is an example of a stats table (simplified, but it preserves the key concepts).
CREATE TABLE stats
(
logtime timestamptz,
key int,
stat1 bigint,
stat2 bigint,
stat3 bigint,
PRIMARY KEY (key,logtime)
);
CREATE INDEX x ON stats(logtime);
There are on the order of 1M unique values for "key", and a new row for each key value is inserted roughly every 15 minutes. These rows are divided up between a number of different inserting elements, but that isn't relevant.
The problem is that the insert pattern has low correlation with the (key,logtime) index. In this case, we would need >1M blocks in shared_buffers to prevent a read-modify-write pattern during the inserts (given a large enough database).
We're wondering about lowering BLKSZ so that the total working set of blocks required can be kept in shared buffers. Our database server only has 8G of memory, and we would likely need to reduce BLKSZ to 512....
Any comments on other effects or gotchas from lowering BLKSZ? Currently our database is thrashing its cache of blocks: we're getting only ~100 inserts/second, and every insert results in an evict-read-modify operation.
Ideally, we'd like to keep the entire working set of blocks in memory across insert periods so that the I/O looks more like writing full blocks....
Thanks
Marc
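Before touching BLKSZ, a quick way to sanity-check the working-set math, assuming PostgreSQL 8.1 or later; "stats_pkey" is the assumed default name of the primary-key index:

-- How big are the indexes every insert has to keep hot?
SELECT relname, pg_relation_size(oid) AS bytes
FROM pg_class
WHERE relname IN ('stats_pkey', 'x');

-- Compare against the buffer cache (on 8.1 this is a count of 8K buffers)
SHOW shared_buffers;

If the (key,logtime) index alone is much larger than shared_buffers plus the OS cache, that is a hint that shrinking the block size by itself is unlikely to make the random index updates cache-resident.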
Hi, Marc,

Marc Morin wrote:
> The problem is that the insert pattern has low correlation with the
> (key,logtime) index. In this case, we would need >1M blocks in
> shared_buffers to prevent a read-modify-write pattern during the
> inserts (given a large enough database).

Would it be possible to change the primary key to (logtime,key)? This
could help keep the "working window" small.

Secondly, the real working set is smaller, as the rows are all inserted
at the end of the table, filling each page until it's full, so only the
last pages are accessed. There's no relation between the index order
and the order of data on disk, unless you CLUSTER.

> Any comments on other effects or gotchas from lowering BLKSZ?
> Currently our database is thrashing its cache of blocks: we're
> getting only ~100 inserts/second, and every insert results in an
> evict-read-modify operation.

I'm not sure that's the correct diagnosis.

Do you have one transaction per insert? Every transaction means a
forced sync to the disk, so you won't get more than about 100-200
commits per second, depending on your actual disk rotation speed.

To improve concurrency of the "number of inserters" running in
parallel, try tweaking the config variables commit_delay and
commit_siblings, so you get higher overall throughput at the cost of
an increased delay per connection, and increase the number of
inserters. With sensible tweaking, the throughput should scale nearly
linearly with the number of backends. :-)

If feasible for your application, you can also bundle several log
entries into a single transaction. If you're CPU bound, you can use
COPY instead of INSERT or (if you can wait for 8.2) the new multi-row
INSERT to further improve performance, but I doubt that you're CPU
bound.

The only way to "really" get over the sync limit is to have (at least)
the WAL on battery-backed RAM / SSD media that has no "spinning disk"
physical limit, or to abandon crash safety by turning fsync off.

Thanks,
Markus.

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe!
www.ffii.org  www.nosoftwarepatents.org
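A minimal sketch of the batching, COPY, and group-commit suggestions above, assuming the stats table from the first post; the file path and the commit_delay/commit_siblings values are illustrative only:

-- Bundle many rows into one transaction instead of one commit per row
BEGIN;
INSERT INTO stats VALUES (now(), 1, 10, 20, 30);
INSERT INTO stats VALUES (now(), 2, 11, 21, 31);
-- ... thousands more rows ...
COMMIT;

-- Or load a whole batch with COPY (server-side path shown; use \copy in
-- psql to read a file on the client instead)
COPY stats (logtime, key, stat1, stat2, stat3)
FROM '/tmp/stats_batch.csv' WITH CSV;

-- Group-commit tuning in postgresql.conf (starting-point values only)
-- commit_delay = 100      # microseconds to wait for companion commits
-- commit_siblings = 5     # only delay if this many transactions are active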
> Would it be possible to change the primary key to
> (logtime,key)? This could help keep the "working window" small.

No, the application accessing the data wants all the rows between start
and end time for a particular key value.

> Secondly, the real working set is smaller, as the rows are
> all inserted at the end of the table, filling each page until
> it's full, so only the last pages are accessed. There's no
> relation between the index order and the order of data on
> disk, unless you CLUSTER.

I'm theorizing that my problem is in updating the index itself and not
the heap. Insert order refers to the order in which the applications
are inserting the rows and, as such, the order in which the index is
being updated. This, in turn, causes the b-tree to be traversed. The
problem is that the working set of blocks at the bottom of the btree is
too big for my cache.

> > Any comments on other effects or gotchas from lowering BLKSZ?
> > Currently our database is thrashing its cache of blocks: we're
> > getting only ~100 inserts/second, and every insert results in an
> > evict-read-modify operation.
>
> I'm not sure that's the correct diagnosis.
>
> Do you have one transaction per insert? Every transaction
> means a forced sync to the disk, so you won't get more than
> about 100-200 commits per second, depending on your actual
> disk rotation speed.

No, an insert consists of roughly 10,000+ rows per transaction block.

> To improve concurrency of the "number of inserters" running in
> parallel, try tweaking the config variables commit_delay and
> commit_siblings, so you get higher overall throughput at the cost
> of an increased delay per connection, and increase the number of
> inserters. With sensible tweaking, the throughput should scale
> nearly linearly with the number of backends. :-)

I don't think this will help us here due to the large transactions
already.

> If feasible for your application, you can also bundle several
> log entries into a single transaction. If you're CPU bound, you can
> use COPY instead of INSERT or (if you can wait for 8.2) the new
> multi-row INSERT to further improve performance, but I doubt that
> you're CPU bound.
>
> The only way to "really" get over the sync limit is to have (at
> least) the WAL on battery-backed RAM / SSD media that has no
> "spinning disk" physical limit, or to abandon crash safety by
> turning fsync off.

Again, the problem is not with WAL writing; it's already on its own
RAID1 disk pair. The I/O pattern we see is about 1-2% load on the WAL
and 100% load on the array holding the indexes and tables. Throughput
is very low, something like 150K-200K bytes/second of real rows being
deposited on the disk.

The disks are busy seeking all over the platter to fetch a block, add a
single row, then seek to another spot and write back a previously dirty
buffer....

> Thanks,
> Markus.
> --
> Markus Schaber | Logical Tracking&Tracing International AG
> Dipl. Inf.     | Software Development GIS
>
> Fight against software patents in Europe! www.ffii.org
> www.nosoftwarepatents.org
"Marc Morin" <marc@sandvine.com> writes: > No, an insert consists of roughly 10,000+ rows per transaction block. Perhaps it would help to pre-sort these rows by key? Like Markus, I'm pretty suspicious of lowering BLCKSZ ... you can try it but it's likely to prove counterproductive (more btree index levels, more rows requiring toasting, a tighter limit on what rows will fit at all). I doubt I'd try to make it lower than a couple K in any case. The bottom line here is likely to be "you need more RAM" :-( I wonder whether there is a way to use table partitioning to make the insert pattern more localized? We'd need to know a lot more about your insertion patterns to guess how, though. regards, tom lane
I'm not sure if decreasing BLKSZ is the way to go. It would allow you
to have more, smaller blocks in memory, but the actual coverage of the
index would remain the same; if only 33% of the index fits in memory
with the 8K BLKSZ, then only 33% would fit in memory with a 4K BLKSZ.

I can see where you're going if the tree nodes for all 15 million key
entries fit in memory, as well as the most recent nodes for the logtime
nodes lower down in the index; basically trying to make sure that the
"right" 33% of the index is in memory.

But it seems like it might be more useful to have two indexes, one on
logtime and one on key. Inserts into the logtime index would be
correlated with your insert order and as such be cache-friendly, so
that's not an issue. The index on just the key column would be at
least as small as the active subset of a combined index, so performance
should be at least as good as you could possibly achieve by reducing
BLKSZ.

PG 8.1 is smart enough to use a bitmap index scan to combine the two
indexes at query time; if that gives you adequate performance, then it
would be simpler than reducing BLKSZ.

-- Mark Lewis

On Mon, 2006-09-25 at 17:54 -0400, Marc Morin wrote:
> > Would it be possible to change the primary key to
> > (logtime,key)? This could help keep the "working window" small.
>
> No, the application accessing the data wants all the rows between
> start and end time for a particular key value.
>
> > Secondly, the real working set is smaller, as the rows are
> > all inserted at the end of the table, filling each page until
> > it's full, so only the last pages are accessed. There's no
> > relation between the index order and the order of data on
> > disk, unless you CLUSTER.
>
> I'm theorizing that my problem is in updating the index itself and not
> the heap. Insert order refers to the order in which the applications
> are inserting the rows and, as such, the order in which the index is
> being updated. This, in turn, causes the b-tree to be traversed. The
> problem is that the working set of blocks at the bottom of the btree
> is too big for my cache.
>
> > > Any comments on other effects or gotchas from lowering BLKSZ?
> > > Currently our database is thrashing its cache of blocks: we're
> > > getting only ~100 inserts/second, and every insert results in an
> > > evict-read-modify operation.
> >
> > I'm not sure that's the correct diagnosis.
> >
> > Do you have one transaction per insert? Every transaction
> > means a forced sync to the disk, so you won't get more than
> > about 100-200 commits per second, depending on your actual
> > disk rotation speed.
>
> No, an insert consists of roughly 10,000+ rows per transaction block.
>
> > To improve concurrency of the "number of inserters" running in
> > parallel, try tweaking the config variables commit_delay and
> > commit_siblings, so you get higher overall throughput at the cost
> > of an increased delay per connection, and increase the number of
> > inserters. With sensible tweaking, the throughput should scale
> > nearly linearly with the number of backends. :-)
>
> I don't think this will help us here due to the large transactions
> already.
>
> > If feasible for your application, you can also bundle several
> > log entries into a single transaction. If you're CPU bound, you can
> > use COPY instead of INSERT or (if you can wait for 8.2) the new
> > multi-row INSERT to further improve performance, but I doubt that
> > you're CPU bound.
> >
> > The only way to "really" get over the sync limit is to have (at
> > least) the WAL on battery-backed RAM / SSD media that has no
> > "spinning disk" physical limit, or to abandon crash safety by
> > turning fsync off.
>
> Again, the problem is not with WAL writing; it's already on its own
> RAID1 disk pair. The I/O pattern we see is about 1-2% load on the WAL
> and 100% load on the array holding the indexes and tables. Throughput
> is very low, something like 150K-200K bytes/second of real rows being
> deposited on the disk.
>
> The disks are busy seeking all over the platter to fetch a block, add
> a single row, then seek to another spot and write back a previously
> dirty buffer....
>
> > Thanks,
> > Markus.
> > --
> > Markus Schaber | Logical Tracking&Tracing International AG
> > Dipl. Inf.     | Software Development GIS
> >
> > Fight against software patents in Europe! www.ffii.org
> > www.nosoftwarepatents.org
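A sketch of Mark's two-index layout; the new index name is made up, and the EXPLAIN is only there to confirm that 8.1 combines the indexes with a bitmap AND for the drill-down query:

-- The original schema already has an index on logtime ("x"); the new
-- piece is a small index on just key. Keep the (key,logtime) primary
-- key only if uniqueness must be enforced.
CREATE INDEX stats_key_idx ON stats (key);

-- Typical drill-down: all rows for one key between a start and end time
EXPLAIN ANALYZE
SELECT *
FROM stats
WHERE key = 42
  AND logtime BETWEEN '2006-09-25 00:00' AND '2006-09-26 00:00';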
> The bottom line here is likely to be "you need more RAM" :-(

Yup. Just trying to get a handle on what I can do if I need more than
16G of RAM... That's as much as I can put on the installed base of
servers.... 100s of them.

> I wonder whether there is a way to use table partitioning to
> make the insert pattern more localized? We'd need to know a
> lot more about your insertion patterns to guess how, though.
>
> 			regards, tom lane

We're doing partitioning as well.....
> > The bottom line here is likely to be "you need more RAM" :-(
>
> Yup. Just trying to get a handle on what I can do if I need more than
> 16G of RAM... That's as much as I can put on the installed base of
> servers.... 100s of them.
>
> > I wonder whether there is a way to use table partitioning to
> > make the insert pattern more localized? We'd need to know a
> > lot more about your insertion patterns to guess how, though.
> >
> > 			regards, tom lane
>
> We're doing partitioning as well.....

I'm guessing that you basically have a data collection application that
sends in lots of records, and a reporting application that wants
summaries of the data? So, if I understand the problem correctly, you
don't have enough RAM (or may not in the future) to index the data as
it comes in.

Not sure how much you can change the design, but what about either
updating a summary table(s) as the records come in (trigger, part of
the transaction, or do it in the application), or indexing
periodically? In other words, load a partition (say a day's worth),
then index that partition all at once. If you're doing real-time
analysis that might not work so well, though, but the summary tables
should.

I assume the application generates unique records on its own due to the
timestamp, so this isn't really about checking for constraint
violations? If so, you can probably do away with the index on the
tables that you're running the inserts on.

- Bucky
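A sketch of the load-then-index variant Bucky describes, assuming daily child partitions; the partition name is made up:

-- Empty partition for the day, inheriting the parent's columns
CREATE TABLE stats_20060926 () INHERITS (stats);

-- ... bulk COPY / batched INSERTs land in stats_20060926 all day,
-- with no indexes to maintain ...

-- Once the partition is full, build the indexes in one pass
CREATE INDEX stats_20060926_key_logtime_idx ON stats_20060926 (key, logtime);
CREATE INDEX stats_20060926_logtime_idx     ON stats_20060926 (logtime);
ANALYZE stats_20060926;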
Yes, that is our application. We have implemented both scenarios...

1- Partitions loaded without indexes on them, and build the index "when
the partition is full". Slow to drill down into incomplete partitions.
2- Partitions with indexes as loaded. Slow on insert (the problem
mentioned), but good to drill down....

So, I'd like my cake and eat it too... :-)

I'd like to have my indexes built as rows are inserted into the
partition to help with the drill down...

> -----Original Message-----
> From: Bucky Jordan [mailto:bjordan@lumeta.com]
> Sent: Tuesday, September 26, 2006 5:26 PM
> To: Marc Morin; Tom Lane
> Cc: Markus Schaber; pgsql-performance@postgresql.org
> Subject: RE: [PERFORM] Decreasing BLKSZ
>
> > > The bottom line here is likely to be "you need more RAM" :-(
> >
> > Yup. Just trying to get a handle on what I can do if I need more
> > than 16G of RAM... That's as much as I can put on the installed
> > base of servers.... 100s of them.
> >
> > > I wonder whether there is a way to use table partitioning to make
> > > the insert pattern more localized? We'd need to know a lot more
> > > about your insertion patterns to guess how, though.
> > >
> > > 			regards, tom lane
> >
> > We're doing partitioning as well.....
>
> I'm guessing that you basically have a data collection application
> that sends in lots of records, and a reporting application that wants
> summaries of the data? So, if I understand the problem correctly, you
> don't have enough RAM (or may not in the future) to index the data as
> it comes in.
>
> Not sure how much you can change the design, but what about either
> updating a summary table(s) as the records come in (trigger, part of
> the transaction, or do it in the application), or indexing
> periodically? In other words, load a partition (say a day's worth),
> then index that partition all at once. If you're doing real-time
> analysis that might not work so well, though, but the summary tables
> should.
>
> I assume the application generates unique records on its own due to
> the timestamp, so this isn't really about checking for constraint
> violations? If so, you can probably do away with the index on the
> tables that you're running the inserts on.
>
> - Bucky
> So, I'd like my cake and eat it too... :-)
>
> I'd like to have my indexes built as rows are inserted into the
> partition to help with the drill down...

So you want to drill down so fine-grained that summary tables don't do
much good? Keep in mind, even if you roll up only two records, that's
half as many you have to process (be it for drill down or index).

I've seen applications that have a log table with no indexes/constraints
and lots of records being inserted, and then they only report on very
fine-grained summary tables. Drill downs still work pretty well, but if
you get audited and want to see that specific action, well, you're in
for a bit of a wait, but hopefully that doesn't happen too often.

If that's the case (summary tables won't work), I'd be very curious how
you manage to get your cake and eat it too :)

- Bucky
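A sketch of the summary-table idea, assuming an hourly rollup is fine-grained enough for most reports; the table name and interval are made up:

-- Hourly rollup table, much smaller than the raw stats table
CREATE TABLE stats_hourly
(
hour timestamptz,
key int,
stat1 bigint,
stat2 bigint,
stat3 bigint,
PRIMARY KEY (key, hour)
);

-- Run once per hour (cron or the application) to fold in the previous hour
INSERT INTO stats_hourly
SELECT date_trunc('hour', logtime), key, sum(stat1), sum(stat2), sum(stat3)
FROM stats
WHERE logtime >= date_trunc('hour', now()) - interval '1 hour'
  AND logtime <  date_trunc('hour', now())
GROUP BY date_trunc('hour', logtime), key;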
Hi, Marc,

Marc Morin wrote:
>> I wonder whether there is a way to use table partitioning to
>> make the insert pattern more localized? We'd need to know a
>> lot more about your insertion patterns to guess how, though.
>
> We're doing partitioning as well.....

And is constraint exclusion set up properly, and have you verified that
it works?

HTH,
Markus
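A sketch of what Markus is asking about, assuming date-ranged child tables like the one in the earlier partition sketch; on 8.1 constraint_exclusion is off by default, so it has to be enabled and then verified with EXPLAIN:

-- Each child needs a non-overlapping CHECK constraint on logtime
CREATE TABLE stats_20060927
(
CHECK (logtime >= '2006-09-27' AND logtime < '2006-09-28')
) INHERITS (stats);

-- Enable exclusion (postgresql.conf or per session)
SET constraint_exclusion = on;

-- The plan should now scan only the child tables whose CHECK
-- constraints overlap the queried time range
EXPLAIN
SELECT * FROM stats
WHERE logtime >= '2006-09-27 00:00' AND logtime < '2006-09-27 12:00';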
On Sep 26, 2006, at 5:36 PM, Marc Morin wrote:
> 1- Partitions loaded without indexes on them, and build the index
> "when the partition is full". Slow to drill down into incomplete
> partitions.
> 2- Partitions with indexes as loaded. Slow on insert (the problem
> mentioned), but good to drill down....

How big are your partitions? The number of rows in your active
partition will determine how large your indexes are (and probably more
importantly, how many levels there are), which will definitely affect
your timing. So you might have better luck with a smaller partition
size.

I'd definitely try someone else's suggestion of making the PK
(logtime, key) (assuming that you need to enforce uniqueness) and
having an extra index on just key. If you don't need to enforce
uniqueness, just have one index on key and one on logtime. Or, if your
partitions are small enough, don't even create the logtime index until
the partition isn't being inserted into anymore.

If the number of key values is pretty fixed, it'd be an interesting
experiment to try partitioning on that, perhaps even with one key per
partition (which would allow you to drop the key from the tables
entirely), i.e.:

CREATE TABLE stats_1 (logtime timestamptz PRIMARY KEY, stat1 bigint, stat2 bigint, stat3 bigint);
CREATE TABLE stats_2 ...

CREATE VIEW stats AS
SELECT 1 AS key, * FROM stats_1
UNION ALL SELECT 2, * FROM stats_2
...

I wouldn't put too much work into that, as no real effort's been
expended to optimize for that case (especially the resulting monster
UNION ALL), but you might get lucky.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)