Thread: Question about updates and MVCC
Hello,

I have a couple of questions regarding how MVCC (in postgres 8.3.3, if it makes a difference) affects vacuum.

#1. If I am doing an update to a row and none of the values have changed, will that cause a "hole" that requires vacuum to reclaim?

#2. I have a column in my table (called "status", if you can believe *that*). This contains 1 of 4 values:

  -1: row is expired, but needs to be marked deleted from index
   0: row is expired, and has been indexed
   1: row is active, and has been indexed
   2: row is new or updated, and needs to be indexed

The point of all this is that when a row is added or updated, it goes in with status = 2, so the process that comes along later to build search indexes can quickly query any listings with status = 2 and incrementally update the index. (Same with respect to status = -1, except those rows are no longer active and need to be deleted from the index.)

The issue is that this seems to be causing a lot of vacuum work. The total number of rows in the table is about 30 million, partitioned into about 130 segments based on a category. I'm trying to minimize the amount of vacuum work, because not much else in the table changes over time, but the status column will get fiddled with 4 times during the life of a row.

Thanks, as always!

- Greg
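(For reference, a minimal sketch of the kind of setup described above. Only the status values come from the post; the table name, the other columns, and the partitioning-by-category detail are illustrative assumptions:)

    -- one category segment of the ~30M-row table (hypothetical layout)
    CREATE TABLE listings_cat1 (
        listing_id  INT PRIMARY KEY,
        category    INT NOT NULL,
        body        TEXT,              -- the wide, rarely-changing data
        status      SMALLINT NOT NULL DEFAULT 2
                    CHECK (status IN (-1, 0, 1, 2))
    );

    -- the incremental indexer picks up new/updated rows...
    SELECT listing_id FROM listings_cat1 WHERE status = 2;
    -- ...and rows that need to be removed from the search index
    SELECT listing_id FROM listings_cat1 WHERE status = -1;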
On Sat, Jan 10, 2009 at 08:00:29AM -0800, mailinglists@net-virtual.com wrote:
> Hello,
>
> I have a couple of questions regarding how MVCC (in postgres 8.3.3 if it
> makes a difference) affects vacuum.
>
> #1. If I am doing an update to a row and none of the values have changed,
> will that cause a "hole" that requires vacuum to reclaim?

Yes. There are various ways to work with this, depending on your
capabilities.

> #2. I have a column in my table (called "status", if you can believe
> *that*). This contains 1 of 4 values:
>
>   -1: row is expired, but needs to be marked deleted from index
>    0: row is expired, and has been indexed
>    1: row is active, and has been indexed
>    2: row is new or updated, and needs to be indexed

How wide is the row otherwise? Each status update will make a copy of the
complete row. One thought is to consider how often you actually need the
status and decide whether it's worthwhile to split the status off into a
separate table, which will be much smaller and vacuum more quickly.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
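(A rough sketch of what the suggested split might look like; the table and column names other than "status" are hypothetical:)

    -- the wide, mostly-static data stays where it is
    CREATE TABLE listings_data (
        listing_id  INT PRIMARY KEY,
        category    INT NOT NULL,
        body        TEXT
    );

    -- only this narrow table is updated (and therefore vacuumed) often
    CREATE TABLE listings_status (
        listing_id  INT PRIMARY KEY REFERENCES listings_data,
        status      SMALLINT NOT NULL DEFAULT 2
    );

    -- fiddling with the status now copies only the small row
    UPDATE listings_status SET status = 1 WHERE listing_id = 12345;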
Hello,

Thanks to Martijn's reply (thank you very much!) to an earlier question I had about MVCC, I've decided to entirely re-think my use of the status column in a table. What I've decided to do is put all of my new or changed records into a "holding" table, then, after they are indexed, commit the changes to their final location. This has worked extremely well, except when I am querying the holding table.

This is not the actual table, but my problem can be demonstrated by this (mode can be "U" for an update/insert or "D" for a delete):

    CREATE TABLE listings (
        trans_id    SERIAL,
        mode        CHAR(1),
        listing_id  INT,
        region_id   INT,
        category    INT
    );

So, my process goes along and inserts all these rows into the table, about 2,000,000 a day. Then, when it comes time to query the data, I do a query like this:

    SELECT * FROM listings ORDER BY region_id, category, listing_id, trans_id;

This is *very* expensive obviously, but since multiple rows can be inserted for the same listing_id, I have to get the data into some deterministic order. There can be multiple writers adding to this listings table, and when it comes time to process it, what I want to do is get only the last transaction for a given listing_id, because the earlier ones don't matter. On top of that, each region_id and category_id has its own index. I need to be able to process the indexes in full, one at a time, because there are too many to hold that many open filehandles/processes at one time.

So, my question is: is there some way to return the rows in a deterministic order, without actually having to do an explicit sort on the data? What I mean is, I don't care if category_id 4 / region_id 10 / listing_id 10000 comes before category_id 1 / region_id 1 / listing_id 1 -- I just need them returned to me in that sort of grouped order (although sorted by trans_id). And would this even be more efficient in the first place, or am I barking up the wrong tree?

I hope this makes sense -- I've been up all night so not thinking too clearly....

Thanks!

- Greg
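(A rough sketch of the holding-table lifecycle described above. The "listings_final" name and the merge step are hypothetical, and the real handling of 'U' versus 'D' records is omitted for brevity:)

    -- writers append change records to the holding table
    INSERT INTO listings (mode, listing_id, region_id, category)
    VALUES ('U', 10000, 10, 4);

    -- after the search indexes have been rebuilt, commit the batch to its
    -- final location and empty the holding table (simplified: updates only)
    BEGIN;
    INSERT INTO listings_final (listing_id, region_id, category)
        SELECT listing_id, region_id, category
        FROM listings
        WHERE mode = 'U';
    DELETE FROM listings;
    COMMIT;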
On Wed, Jan 14, 2009 at 07:36:03PM -0800, mailinglists@net-virtual.com wrote:
> CREATE TABLE listings (
>     trans_id    SERIAL,
>     mode        CHAR(1),
>     listing_id  INT,
>     region_id   INT,
>     category    INT
> );
>
> SELECT * FROM listings ORDER BY region_id, category, listing_id,
> trans_id;
> [...] what I want to do is get only the last transaction for
> a given listing_id, because the earlier ones don't matter.

If you have an index on (region_id, category, listing_id, trans_id) you
should be able to do:

    SELECT region_id, category, listing_id, MAX(trans_id)
    FROM listings
    GROUP BY region_id, category, listing_id;

and have PG answer this using the index (it'll only do this if it thinks
there are many transactions for each group, though).

> On top of that, each region_id and category_id has its own index. I need
> to be able to process the indexes in full, one at a time, because there
> are too many to hold that many open filehandles/processes at one time.

Not sure what you mean by "index" here; I'm assuming you're talking about
something outside PG, or am I missing some context? If it's inside PG,
then you do know that every index you have will slow down every
modification of the table?

> So, my question is: is there some way to return the rows in a
> deterministic order, without actually having to do an explicit sort on
> the data? What I mean is, I don't care if category_id 4 / region_id 10 /
> listing_id 10000 comes before category_id 1 / region_id 1 / listing_id 1
> -- I just need them returned to me in that sort of grouped order
> (although sorted by trans_id).

If you want to know all the transaction ids then you need to do the sort;
if you only want the largest/latest then you're probably better off
telling PG that's what you want (i.e. using GROUP BY and the MAX aggregate
and letting it make an appropriate decision).

--
Sam  http://samason.me.uk/
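(Putting the suggestion together as a sketch; the index name and the join back to fetch the full latest row are illustrative assumptions, not from the thread:)

    -- the index described above
    CREATE INDEX listings_grp_idx
        ON listings (region_id, category, listing_id, trans_id);

    -- latest trans_id per (region_id, category, listing_id), joined back
    -- to pick up the rest of the row (e.g. mode)
    SELECT l.*
    FROM listings l
    JOIN (SELECT region_id, category, listing_id, MAX(trans_id) AS trans_id
          FROM listings
          GROUP BY region_id, category, listing_id) latest
      USING (region_id, category, listing_id, trans_id);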