Thread: how many record versions
begin;
update t set val=val+1; -- 1000 times
commit;

How many record versions does it create? 1 or 1000? I'm implementing a banner counter which is incremented at least 2-3 million times a day. I thought I'd cheat by only committing every few minutes. Would that work, or would I still create just as many record versions?

-- dave
On Sun, May 23, 2004 at 02:44:31 +0700, David Garamond <lists@zara.6.isreserved.com> wrote:
> begin;
> update t set val=val+1; -- 1000 times
> commit;
>
> How many record versions does it create? 1 or 1000? I'm implementing a
> banner counter which is incremented at least 2-3 million times a day. I
> thought I'd cheat by only committing every few minutes. Would that
> work, or would I still create just as many record versions?

You might be better off keeping the counter in its own table and vacuuming that table very often. It is unlikely that holding transactions open for several minutes is a good idea. Also, if you do multiple updates in a single transaction, you still get multiple row versions.
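[For illustration, a minimal sketch of the counter-in-its-own-table approach; the table and column names here are made up, not from the thread:]

-- hypothetical single-row counter table, vacuumed frequently
CREATE TABLE counter (id int PRIMARY KEY, val bigint NOT NULL);
INSERT INTO counter VALUES (1, 0);

UPDATE counter SET val = val + 1 WHERE id = 1;  -- each update leaves a dead row version behind
VACUUM counter;                                 -- run often (e.g. from cron) to reclaim them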
David Garamond <lists@zara.6.isreserved.com> writes:
> begin;
> update t set val=val+1; -- 1000 times
> commit;
> How many record versions does it create? 1 or 1000?

1000.

> I'm implementing a banner counter which is incremented at least 2-3
> million times a day. I thought I'd cheat by only committing every few
> minutes. Would that work, or would I still create just as many record
> versions?

Won't make a difference. You should think seriously about using a sequence rather than an ordinary table for this.

			regards, tom lane
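[A minimal sketch of the sequence suggestion; the sequence name is hypothetical. Unlike an updated row, a sequence increment does not leave a dead row version behind:]

CREATE SEQUENCE banner_hits;
SELECT nextval('banner_hits');       -- one increment per impression
SELECT last_value FROM banner_hits;  -- read the current count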
Manfred Koizar wrote:
>> begin;
>> update t set val=val+1; -- 1000 times
>> commit;
>>
>> How many record versions does it create? 1 or 1000?
>
> 1000
>
>> I'm implementing a banner counter which is incremented at least 2-3
>> million times a day.
>
> How many rows? I would VACUUM that table after every few hundred
> updates or whenever 10% to 20% of the rows have been updated, whichever
> is greater.

Actually, each record will probably be incremented only thousands of times a day, but there are many banners. Each record has (bannerid, campaignid, websiteid, date, countrycode) "dimensions" and (impression, click) "measures". The table currently has about 1.5-2 million records (it's in MySQL MyISAM), so I'm not sure I can use that many sequences, as Tom suggested. Every impression (banner view) and click results in an SQL statement (= a "transaction" in MyISAM, since MyISAM doesn't support BEGIN + COMMIT).

I'm contemplating moving to Postgres, but am worried about the MVCC thing. I previously tried InnoDB in MySQL briefly but had to revert to MyISAM because the load increased significantly.

-- dave
On Sun, 23 May 2004 23:32:48 +0700, David Garamond <lists@zara.6.isreserved.com> wrote:
> Actually, each record will probably be incremented only thousands of
> times a day, but there are many banners. Each record has (bannerid,
> campaignid, websiteid, date, countrycode) "dimensions" and (impression,
> click) "measures".

If you need all of bannerid, campaignid, websiteid, date, countrycode to identify a row, it may be worth the effort to split this up into two tables:

CREATE TABLE dimensions (
    dimensionid int PRIMARY KEY,
    bannerid ...,
    campaignid ...,
    websiteid ...,
    date ...,
    countrycode ...,
    UNIQUE (bannerid, ..., countrycode)
);

CREATE TABLE measures (
    dimensionid int PRIMARY KEY REFERENCES dimensions,
    impression ...,
    click ...
);

Thus you'd only update measures thousands of times a day, and the index would be much more compact, because the PK is only a four-byte integer.

> The table currently has about 1.5-2 million records (it's in MySQL
> MyISAM), so I'm not sure I can use that many sequences, as Tom
> suggested. Every impression (banner view) and click results in an SQL
> statement.

Schedule a VACUUM ANALYSE measures; for every 100000 updates or so.

> I'm contemplating moving to Postgres, but am worried about the MVCC
> thing. I previously tried InnoDB in MySQL briefly but had to revert to
> MyISAM because the load increased significantly.

You mean InnoDB cannot handle the load?

Servus
 Manfred
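[A hedged sketch of the update path under that split; $1..$5 stand for the incoming values as driver-level or prepared-statement placeholders, and the column types are whatever the "..." above end up being:]

-- resolve the wide dimension key once, then bump only the narrow measures row
UPDATE measures
   SET impression = impression + 1
 WHERE dimensionid = (SELECT dimensionid
                        FROM dimensions
                       WHERE bannerid = $1 AND campaignid = $2 AND websiteid = $3
                         AND date = $4 AND countrycode = $5);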
Manfred Koizar <mkoi-pg@aon.at> writes:
> On Sun, 23 May 2004 23:32:48 +0700, David Garamond
> <lists@zara.6.isreserved.com> wrote:
>> I'm contemplating moving to Postgres, but am worried about the MVCC
>> thing. I previously tried InnoDB in MySQL briefly but had to revert to
>> MyISAM because the load increased significantly.

> You mean InnoDB cannot handle the load?

I suspect what he meant is that InnoDB had exactly the same performance issues with lots-of-dead-rows that Postgres will have. Around here, however, we are accustomed to that behavior and know how to deal with it, whereas I'll bet the MySQL community hasn't got that down yet ...

			regards, tom lane
David Garamond <lists@zara.6.isreserved.com> writes:

> Actually, each record will probably be incremented only thousands of times a
> day, but there are many banners. Each record has (bannerid, campaignid,
> websiteid, date, countrycode) "dimensions" and (impression, click) "measures".

In the past, when I had a very similar situation, we kept the raw impression and click event data: one record per impression in the impression table and one record per click in the click table.

That makes the tables insert-only, which is efficient and not prone to locking contention. They never have to be vacuumed except after purging old data.

Then, to accelerate queries, we had denormalized aggregate tables maintained by a cron job that did the equivalent of

insert into agg_clicks (
    select count(*), bannerid
    from clicks
    where date between ? and ?
    group by bannerid
)

where the ?s were actually hourly periods, i.e. at 12:15 it ran this query for the 11-12 period.

This meant we didn't have immediate up-to-date stats on banners, but it did mean we had stats on every single impression and click, including time and information about the users. This worked out very well for reporting needs.

If your system is using the data to handle serving the ads, though, it's a different kettle of fish. For that I think you'll want something that avoids having to do a database query for every single impression.

-- greg
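[A hedged Postgres sketch of the insert-only event table plus hourly rollup described above; every table and column name here is hypothetical, and the aggregate table is assumed to already exist:]

-- raw event table: insert-only, never updated, so no dead tuples to vacuum
CREATE TABLE impressions (
    bannerid    integer     NOT NULL,
    ts          timestamptz NOT NULL DEFAULT now(),
    countrycode char(2)
);

-- hourly rollup, e.g. run at 12:15 for the 11:00-12:00 window
INSERT INTO agg_impressions (period_start, bannerid, views)
SELECT date_trunc('hour', ts), bannerid, count(*)
  FROM impressions
 WHERE ts >= '2004-05-24 11:00' AND ts < '2004-05-24 12:00'
 GROUP BY 1, 2;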
Manfred Koizar wrote:
> You mean InnoDB cannot handle the load?

Perhaps it's more appropriate to say that the disk becomes the bottleneck.

-- dave
Greg Stark wrote:
>> Actually, each record will probably be incremented only thousands of times a
>> day, but there are many banners. Each record has (bannerid, campaignid,
>> websiteid, date, countrycode) "dimensions" and (impression, click) "measures".
>
> In the past, when I had a very similar situation, we kept the raw impression
> and click event data: one record per impression in the impression table and
> one record per click in the click table.
>
> That makes the tables insert-only, which is efficient and not prone to locking
> contention. They never have to be vacuumed except after purging old data.

Assuming there are 10 million impressions per day, the impression table will grow at least 200-400MB per day, is that correct? What do you do, and how often do you purge old data? Do you do a mass DELETE on the impression table itself, or do you switch to another table? I've found that deleting tens/hundreds of thousands of rows, at least in InnoDB, takes a long, long time (plus it sucks CPU and slows other queries).

> Then, to accelerate queries, we had denormalized aggregate tables maintained
> by a cron job that did the equivalent of
>
> insert into agg_clicks (
>     select count(*), bannerid
>     from clicks
>     where date between ? and ?
>     group by bannerid
> )

-- dave
On Mon, 24 May 2004, David Garamond wrote:

> Manfred Koizar wrote:
>> You mean InnoDB cannot handle the load?
>
> Perhaps it's more appropriate to say that the disk becomes the bottleneck.

Was this attempting to do each update in a separate transaction? If so, that is certainly expected with whatever disk-based transactional database you use. With InnoDB, you could hack around it by configuring InnoDB not to flush its log to disk at every transaction, obviously at the risk of losing data if something crashes.

From what I have seen, I would expect InnoDB's multiversioning to work better for the use you describe than PostgreSQL's, due to how it implements undo logs for updates. However, there could well be other differences that could make PostgreSQL work better for your application, depending on exactly what issues you are seeing.
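[For reference, the InnoDB knob being described is innodb_flush_log_at_trx_commit. A my.cnf sketch, with the value shown as one possible trade-off rather than a recommendation:]

[mysqld]
# 1 (default) = flush the log to disk at every commit
# 2 = write the log at commit, flush to disk about once per second
# 0 = write and flush about once per second
innodb_flush_log_at_trx_commit = 2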
David Garamond <lists@zara.6.isreserved.com> writes:

> Greg Stark wrote:
>>> Actually, each record will probably be incremented only thousands of times
>>> a day, but there are many banners. Each record has (bannerid, campaignid,
>>> websiteid, date, countrycode) "dimensions" and (impression, click)
>>> "measures".
>>
>> In the past, when I had a very similar situation, we kept the raw impression
>> and click event data: one record per impression in the impression table and
>> one record per click in the click table.
>>
>> That makes the tables insert-only, which is efficient and not prone to
>> locking contention. They never have to be vacuumed except after purging old
>> data.
>
> Assuming there are 10 million impressions per day, the impression table
> will grow at least 200-400MB per day, is that correct? What do you do, and
> how often do you purge old data? Do you do a mass DELETE on the impression
> table itself, or do you switch to another table? I've found that deleting
> tens/hundreds of thousands of rows, at least in InnoDB, takes a long, long
> time (plus it sucks CPU and slows other queries).

Well, this was actually under Oracle, but I can extrapolate to Postgres given my experience.

The ideal tool for the job is a feature that Postgres has discussed but hasn't implemented yet, "partitioned tables". Under Oracle with partitioned tables we were able to drop entire partitions virtually instantaneously. It also made copying the data out to near-line backups much more efficient than index scanning.

Before we implemented partitioned tables we used both techniques you described. At first we had an ad-hoc procedure of creating a new table and swapping it out, but that involved a short downtime and was a manual process. Eventually we set up an automated batch job which used deletes.

Deletes under Postgres should be fairly efficient. The I/O use is unavoidable, so doing it during off-peak hours is still a good idea, but it shouldn't otherwise interfere with other queries. There should be no locking contention, no additional work for other queries (like checking rollback segments or logs), or any of the other problems other databases suffer from with large updates.

I find the 10 million impressions per day pretty scary, though. That's over 100/s across the entire 24-hour period, and probably twice that at peak hours. That would have to be one pretty beefy server just to handle the transaction processing itself. (And updates under Postgres are essentially inserts where vacuum cleans up the old tuple later, so they would be no less taxing.) A server capable of handling that ought to be able to make quick work of deleting a few hundred megabytes of records.

Another option is simply logging this data to a text file, or multiple text files, one per server. Then you can load the text files with batch loads offline. This avoids slowing down the servers handling the transactions in the critical path, but it's yet more complex, with more points of failure.

Something else you might be interested in is using a tool like this:

http://www.danga.com/memcached/

I could see it being useful for caching the counts you were looking to keep, so that the ad server doesn't need to consult the database to calculate which ad to show. A separate job could periodically sync the counts to or from the database.

-- greg
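[A hedged sketch of the off-peak purge mentioned above; the table, timestamp column, and 30-day retention window are all hypothetical:]

-- nightly batch job: drop everything older than the retention window
DELETE FROM impressions WHERE ts < now() - interval '30 days';
VACUUM ANALYZE impressions;  -- reclaim the space and refresh planner stats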
Greg Stark wrote:
> Another option is simply logging this data to a text file, or multiple text
> files, one per server. Then you can load the text files with batch loads
> offline. This avoids slowing down the servers handling the transactions in
> the critical path, but it's yet more complex, with more points of failure.

Yes, this is what we've been doing recently. We write to a set of text files, and a process commits them to MySQL every 2-3 minutes (if the commit fails, we write to another text file to avoid losing the data). It works, but I keep thinking how ugly the whole thing is :-)

-- dave
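[If the same scheme were pointed at Postgres, the batch-load step might be a COPY; the file path and table name below are hypothetical:]

-- server-side bulk load of one accumulated log file (tab-delimited by default)
COPY impressions FROM '/var/log/ads/impressions.20040524';
-- or, from psql, the client-side equivalent:
-- \copy impressions from 'impressions.20040524'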
Greg Stark wrote:
> Well, this was actually under Oracle, but I can extrapolate to Postgres
> given my experience.
>
> The ideal tool for the job is a feature that Postgres has discussed but
> hasn't implemented yet, "partitioned tables". Under Oracle with partitioned
> tables we were able to drop entire partitions virtually instantaneously. It
> also made copying the data out to near-line backups much more efficient
> than index scanning.

I think you can get a similar effect by using inherited tables. Create one "master" table, and then inherit individual "partition" tables from that. Then you can easily create or drop a "partition", while still being able to query the "master" and see all the rows.

HTH,

Joe
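[A minimal sketch of the inheritance approach; all object names are hypothetical:]

CREATE TABLE impressions (bannerid integer, ts timestamptz);  -- the "master"
CREATE TABLE impressions_2004_05 () INHERITS (impressions);   -- one "partition"

SELECT count(*) FROM impressions;   -- sees rows from the master and all children
DROP TABLE impressions_2004_05;     -- dropping a month of data is near-instant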
On Mon, May 24, 2004 at 11:15:07AM -0700, Joe Conway wrote:
> Greg Stark wrote:
>> Well, this was actually under Oracle, but I can extrapolate to Postgres
>> given my experience.
>>
>> The ideal tool for the job is a feature that Postgres has discussed but
>> hasn't implemented yet, "partitioned tables". Under Oracle with
>> partitioned tables we were able to drop entire partitions virtually
>> instantaneously. It also made copying the data out to near-line backups
>> much more efficient than index scanning.
>
> I think you can get a similar effect by using inherited tables. Create
> one "master" table, and then inherit individual "partition" tables from
> that. Then you can easily create or drop a "partition", while still
> being able to query the "master" and see all the rows.

I've done this in production, and it works fairly well. It's not as clean as true partitioned tables (a lot of things don't inherit), but you can localise the nastiness in a pretty small bit of application code.

Any query against the parent ends up looking like a long union of selects, which slows things down somewhat. I found that most of my queries had a date-range restriction on them, though, so the application code could query just the relevant subset of the inherited tables for most generated queries, while I could still do ad-hoc work from the psql command line using the parent table.

Cheers,
  Steve
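[Continuing the hypothetical names from the earlier sketch, date-bounded application queries could target a single child table directly, while ad-hoc work still goes through the parent:]

-- application query, restricted to one inherited table
SELECT bannerid, count(*)
  FROM impressions_2004_05
 WHERE ts >= '2004-05-01' AND ts < '2004-06-01'
 GROUP BY bannerid;

-- ad-hoc query against the parent (expands to the parent plus all children)
SELECT bannerid, count(*) FROM impressions GROUP BY bannerid;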