Thread: queries on xmin
the openstreetmap project (http://osm.org/) recently moved from using mysql to postgres and we're trying to improve some of our tools using the new functionality that postgres provides.

in particular, we are dumping changes to the database at short intervals (currently every minute, hour and day [1,2]) so that 3rd party sites can use this to keep up-to-date with the main database. it previously worked by examining the timestamp of each modified element, but this is no longer practical due to new features in the openstreetmap API which can cause long-running transactions [3].

we've been working out a scheme based on taking txid_snapshots at short intervals and dumping the new rows (due to the way it's implemented, all edits are inserted rows) and querying xmin. the query looks something like this:

  select id, version from (nodes|ways|relations)
  where timestamp > (now() - '1 hour'::interval) and xmin in (...)

and we build up the txid list on the client from the two snapshots we're dumping between. however, we're finding that this becomes much less efficient as the txid list grows longer. in an effort to reduce the query time we're looking to index the xmin column. it seems that hash indexes are already supported on the xid type, but btrees are not [4].

the queries we're doing would usually be of the form "xmin in previous_unfinished_txids or (xmin > previous_max_txid and xmin <= current_max_txid and xmin not in current_unfinished_txids)", except when wrap-around occurs, so it would seem that a btree index would be superior to building this list client-side and using a hash index.

what problems are we going to create for ourselves if we create a btree index on xmin cast to int4? would it be as efficient to use a hash index, create a temporary table of the txids we're querying with a hash index, and do an explicit join? have i missed the point entirely?
many thanks,

matt

[1] http://wiki.openstreetmap.org/wiki/Planet.osm/diffs
[2] http://wiki.openstreetmap.org/wiki/OsmChange
[3] http://wiki.openstreetmap.org/wiki/OSM_Protocol_Version_0.6#Diff_upload:_POST_.2Fapi.2F0.6.2Fchangeset.2F.23id.2Fupload
[4] http://archives.postgresql.org/pgsql-general/2004-10/msg01474.php
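The client-side txid list Matt describes can be sketched roughly as follows. This is an illustrative Python sketch, not OSM's actual code: it assumes snapshots in the `xmin:xmax:xip,...` text format that `txid_current_snapshot()` returns, and it ignores wraparound (the 64-bit snapshot values don't wrap in practice).

```python
def txids_between(prev, curr):
    """Return the txids that committed between two txid_snapshot values.

    Each snapshot is a string 'xmin:xmax:xip1,xip2,...' as returned by
    txid_current_snapshot().  Hypothetical helper for illustration only.
    """
    def parse(s):
        xmin, xmax, xip = (s.split(":") + [""])[:3]
        in_progress = {int(x) for x in xip.split(",") if x}
        return int(xmin), int(xmax), in_progress

    _, prev_xmax, prev_xip = parse(prev)
    _, curr_xmax, curr_xip = parse(curr)

    # Candidates: transactions still in flight at the previous snapshot,
    # plus everything that started since, minus those still in flight now.
    candidates = prev_xip | set(range(prev_xmax, curr_xmax))
    return sorted(candidates - curr_xip)
```

This is the list that then feeds the `xmin in (...)` clause, and whose growth (driven by `curr_xmax - prev_xmax` plus long-running in-flight transactions) is what makes the query slow.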
On Thu, Jun 11, 2009 at 11:25 AM, Matt Amos <zerebubuth@gmail.com> wrote:
> what problems are we going to create for ourselves if we create a
> btree index on xmin casted to int4? would it be as efficient to use a
> hash index, create a temporary table of txids that we're querying with
> a hash index and do an explicit join? have i missed the point
> entirely?

Wow, I'm quite shocked that we don't already detect attempts to create an index on xmin or xmax. There's no way that'll work properly, since those fields can change spontaneously: xmin when, for example, vacuum runs, and xmax when things like UPDATE, SELECT FOR SHARE or SELECT FOR UPDATE are used.

Incidentally, the reason there's no btree opclass is that xids don't monotonically increase. They wrap around. So periodically you would lose updates, or see them repeatedly, whenever the xid wrapped around and the old transactions appeared to be in the future.

If you never run updates and are never interested in tuples that are old enough to be frozen then perhaps you could mostly get away with it. But I really don't think it's a good idea.

Much better would be to store a user-space column with something like a txid or a timestamp and use that directly. That way you have control over the behaviour of the column.

Another option to consider would be including a boolean column "dumped" defaulted to false. Then you could have a partial index on the primary key or date "WHERE NOT dumped". Then when you dump you can "SELECT FOR UPDATE * WHERE NOT dumped" and when you're done "UPDATE SET dumped = 't'". Alternately you could use "UPDATE SET dumped = 't' WHERE NOT dumped RETURNING *", which is basically equivalent.

That would create twice as much traffic in the table, which would make vacuums much more important. But it would mean you could quickly access undumped records using the index and know that your process doesn't depend on following a strict clock schedule.

--
Gregory Stark
http://mit.edu/~gsstark/resume.pdf
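Greg's wraparound point can be made concrete with a small sketch of the modular comparison the backend uses, modelled on `TransactionIdPrecedes` (an illustrative Python rendering, not the actual C code):

```python
def xid_precedes(a, b):
    """True if 32-bit xid 'a' is logically older than 'b', using modular
    arithmetic in the style of the backend's TransactionIdPrecedes:
    the sign of the 32-bit difference decides the ordering."""
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 0x80000000:        # reinterpret as a signed int32
        diff -= 0x100000000
    return diff < 0

# Near a wraparound, the plain numeric order a btree would use
# disagrees with the logical transaction order:
old_xid = 0xFFFFFFF0   # allocated just before the xid counter wrapped
new_xid = 0x00000010   # allocated just after
```

Here `xid_precedes(old_xid, new_xid)` is true, yet numerically `old_xid > new_xid`, which is exactly why a btree over raw xids would order recent transactions before old ones after a wrap.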
On 6/11/09, Matt Amos <zerebubuth@gmail.com> wrote:
> the openstreetmap project (http://osm.org/) recently moved from using
> mysql to postgres and we're trying to improve some of our tools using
> the new functionality that postgres provides.
>
> in particular, we are dumping changes to the database at short
> intervals (currently every minute, hour and day [1,2]) so that 3rd
> party sites can use this to keep up-to-date with the main database. it
> previously worked by examining the timestamp of each modified element,
> but this is no longer practical due to new features in the
> openstreetmap API which can cause long-running transactions [3].
>
> we've been working out a scheme based on taking txid_snapshots at
> short intervals and dumping the new rows (due to the way it's
> implemented, all edits are inserted rows) and querying xmin. the query
> looks something like this:
>
> select id,version from (nodes|ways|relations) where timestamp > (now()
> - '1 hour'::interval) and xmin in (...)
>
> and we build up the txid list from the two snapshots we're dumping
> between on the client. however, we're finding that this becomes much
> less efficient as the txid list becomes longer. in an effort to reduce
> the query time we're looking to index the xmin column. it seems that
> hash indexes are already supported on the txid type, but btree are not
> [4].
>
> the queries we're doing would usually be of the form "xmin in
> previous_unfinished_txids or (xmin > previous_max_txid and xmin <=
> current_max_txid and not in current_unfinished_txids)" except when
> wrap-around occurs, so it would seem that a btree index would be
> superior to building this list client-side and using a hash index.
>
> what problems are we going to create for ourselves if we create a
> btree index on xmin casted to int4? would it be as efficient to use a
> hash index, create a temporary table of txids that we're querying with
> a hash index and do an explicit join?
> have i missed the point entirely?

A hash index would not work, as you cannot do range queries on it.

4-byte xids in a btree may create data corruption.

The solution is to use 8-byte txids via txid_current() for indexing. [1]

See the pgq.batch_event_sql() function in Skytools [2] for how to query txids between snapshots efficiently and without being affected by long transactions.

In fact, perhaps you can use PgQ directly instead of building your own. It is built quite similarly to what you are planning - periodic snapshots and then queries on txids to get the data.

--
marko

[1] http://www.postgresql.org/docs/8.3/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT
[2] http://wiki.postgresql.org/wiki/Skytools
Marko Kreen wrote:
> 4-byte xids in a btree may create data corruption.

Can you be more specific on this? I'm aware of xid being an unsigned integer, which means we need to deal with the cast resulting in negative numbers. This means we have to split our range queries into several pieces when crossing overflow points. But I'm not aware of any other issues.

> The solution is to use 8-byte txids via txid_current() for indexing. [1]

The main issue with 8-byte values is the size. When multiplied by 500 million rows it turns into a lot of space. I'd prefer to use int4 if it's possible ...

Sorry, I'm not sure what you're suggesting with txid_current(). We're currently using the txid_current_snapshot() method, which returns us the maximum transaction id plus in-flight transactions. We specifically exclude transactions that are in-flight from the query, then include them on subsequent queries when they have committed.

> See the pgq.batch_event_sql() function in Skytools [2] for how to
> query txids between snapshots efficiently and without being affected
> by long transactions.

I'll take a look.

> In fact, perhaps you can use PgQ directly instead of building your own.
> It is built quite similarly to what you are planning - periodic
> snapshots and then queries on txids to get the data.

I must admit I'm not very familiar with PgQ, but I did take a quick look when evaluating options. I'm possibly misguided here, but serialising everything via a queue doesn't seem like the most efficient way of replicating large changesets. Another issue is that it would only allow read-once behaviour, which prevents a client from re-replicating if something goes wrong.

How would you trigger message creation using PgQ? Would you typically use triggers on the tables to be replicated?

Brett
On Thu, Jun 11, 2009 at 12:59 PM, Brett Henderson <brett@bretth.com> wrote:
> I have a couple of hesitations with using this approach:
> 1. We can only run the replicator once.
> 2. We can only run a single replicator.
> 3. It requires write access to the db.
>
> 1 is perhaps the biggest issue. It means that we only get one shot at
> reading changes, and if something goes wrong we lose the results. It's nice
> being able to re-generate when something goes wrong.

I was picturing only actually committing the update once you're sure all the files are generated. So if something goes wrong you roll back the database update.

Another option would be to use an integer "batch_id" column instead of a boolean. Then you could recreate a previous batch if the file is subsequently lost. An integer takes more space than a boolean, but due to alignment issues it often works out the same.

> We could live with 2, although it makes it impossible to test new
> replication mechanisms without adding additional columns for each.

Well, four boolean columns would, depending on the rest of the table layout, probably take the same space as one anyway due to alignment issues.

> 3 is also a major consideration, it makes everybody's life easier if we can
> avoid updates being made to the db by the replicator.

Yeah, having to update every record once would have a major impact. It would mean twice as many tuples in the table and in every index (except the partial index). Vacuum would be a major issue for avoiding bloat in both the table and the indexes.

I'm trying to see how to do it without at least updating, or inserting into a separate queue table, but I'm not immediately seeing a good solution. I've done something similar to what you're doing now once, but there the transactions were simple inserts; we just selected all the work for the previous hour at 5 minutes past the hour and were done with it. In your case you would have to delay replication by 30min+.
Implementing what you describe with txids would depend on a lot of internal logic which could change in future releases. It would also tie you to never updating or deleting rows in those tables, which seems like it might be a problem in the future.

--
Gregory Stark
http://mit.edu/~gsstark/resume.pdf
I've been working with Matt on this. Thanks for the suggestions.

Greg Stark wrote:
> On Thu, Jun 11, 2009 at 11:25 AM, Matt Amos <zerebubuth@gmail.com> wrote:
>> what problems are we going to create for ourselves if we create a
>> btree index on xmin casted to int4? would it be as efficient to use a
>> hash index, create a temporary table of txids that we're querying with
>> a hash index and do an explicit join? have i missed the point
>> entirely?
>
> Wow, I'm quite shocked that we don't already detect attempts to create
> an index on xmin or xmax. There's no way that'll work properly since
> those fields can change spontaneously when, for example vacuum runs or
> for xmax when things like UPDATE or SELECT FOR SHARE or SELECT FOR
> UPDATE are used.

I had just assumed that the index would be updated along with the xmin value. Are you saying that the index would remain set to the original value? I think I read that xmin only gets rewritten (frozen) after 100 million transactions (by default), which would be okay for our needs; we'd have long ago replicated the changes by then. If we really do need to re-generate replication files after that long we could just do it using a timestamp.

> Incidentally the reason there's no btree opclass is because xids don't
> monotonically increase. They wrap around. So periodically you would
> lose updates or see them repeatedly whenever the xid wrapped around
> and the old transactions appear to be in the future.

Yep, understood. We'd be converting to a signed integer, which means we'd have two special points to deal with: the overflow point from 2^31 - 1 to -2^31, and the wraparound back to 0, with xids 0-2 to be ignored. Would it work if we did something along the lines of:

  (xmin >= startTx AND xmin <= (2^31 - 1)) OR (xmin >= -2^31 AND xmin <= finishTx)

One other thing I should mention is that we'd be adding a new function to the db (eg. xid_to_int4) that would convert the xid to an int4, wrapping values greater than 2^31 around to negative values and thus avoiding the database raising overflow errors.

> If you never run updates and are never interested in tuples that are
> old enough to be frozen then perhaps you could mostly get away with
> it. But I really don't think it's a good idea.
>
> Much better would be to store a user-space column with something like
> a txid or a timestamp and use that directly. That way you have control
> over the behaviour of the column.

We already have a timestamp, but with long running transactions some of the rows don't become visible until after we've replicated that time interval. The existing implementation queries for changes based on timestamp and runs with a 5 minute lag, but we see some transactions taking up to half an hour, which means we miss the data committed as part of those transactions.

When you suggest adding a txid column, are you suggesting to have something like an int8 column populated from a global monotonically increasing sequence? That sounds like an elegant approach. The downside is that between an int8 column and the index we'd be talking approximately 16 bytes per row, which when multiplied by 500 million rows (I'm not sure exactly how many there are, but that's ballpark) comes out at 8GB of additional disk usage. It's approximately 4 times the size of an int4 index on xmin. Disk consumption isn't critical, but it is a consideration.

> Another option to consider would be including a boolean column
> "dumped" defaulted to false. Then you could have a partial index on
> the primary key or date "WHERE NOT dumped". Then when you dump you can
> "SELECT FOR UPDATE * WHERE NOT dumped" and then when you're done
> "UPDATE SET dumped = 't'". Alternately you could use "UPDATE SET
> dumped='t' WHERE NOT dumped RETURNING *" which is basically
> equivalent.

I have a couple of hesitations with using this approach:
1. We can only run the replicator once.
2. We can only run a single replicator.
3. It requires write access to the db.

1 is perhaps the biggest issue. It means that we only get one shot at reading changes, and if something goes wrong we lose the results. It's nice being able to re-generate when something goes wrong.

We could live with 2, although it makes it impossible to test new replication mechanisms without adding additional columns for each.

3 is also a major consideration; it makes everybody's life easier if we can avoid updates being made to the db by the replicator.

> That would create twice as much traffic in the table which would make
> vacuums much more important. But it would mean you could quickly access
> undumped records using the index and know that your process doesn't
> depend on following a strict clock schedule.

If we do use range queries then I don't think we have any clock dependencies, because we'd remove the timestamp clause from the queries.

I hope I don't sound too negative. My gut also tells me that what we're doing is not the "right" solution, and I've had fairly similar arguments with Matt already :-) But having spent some time playing with it I can't find any reason why it won't work, and from a performance point of view I suspect it will win ...

Brett
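The xid_to_int4 conversion and the two-piece range query Brett describes could look roughly like this. This is a Python sketch of the arithmetic only; `xid_to_int4` is the hypothetical db function named above, not an existing postgres function:

```python
INT32 = 2**31

def xid_to_int4(xid):
    """Reinterpret an unsigned 32-bit xid as a signed int4 (two's
    complement), as the proposed xid_to_int4 db function would."""
    return xid - 2**32 if xid >= INT32 else xid

def range_pieces(start_xid, finish_xid):
    """Int4 ranges covering 'xmin between start and finish', split at
    the signed-overflow point when the xid range crosses 2^31."""
    s, f = xid_to_int4(start_xid), xid_to_int4(finish_xid)
    if s <= f:
        return [(s, f)]
    # crossed the overflow point: 2^31 - 1 wraps to -2^31
    return [(s, INT32 - 1), (-INT32, f)]
```

Each returned pair would become one `xmin BETWEEN lo AND hi` clause, so a query crossing the overflow point turns into two OR'd range clauses, matching the formula in Brett's mail.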
On Thu, Jun 11, 2009 at 12:59 PM, Brett Henderson <brett@bretth.com> wrote:
> Greg Stark wrote:
>> Another option to consider would be including a boolean column
>> "dumped" defaulted to false. Then you could have a partial index on
>> the primary key or date "WHERE NOT dumped". Then when you dump you can
>> "SELECT FOR UPDATE * WHERE NOT dumped" and then when you're done
>> "UPDATE SET dumped = 't'". Alternately you could use "UPDATE SET
>> dumped='t' WHERE NOT dumped RETURNING *" which is basically
>> equivalent.
>
> I have a couple of hesitations with using this approach:
> 1. We can only run the replicator once.
> 2. We can only run a single replicator.
> 3. It requires write access to the db.

4. it requires a schema change to support this use case.

again, this isn't a massive issue, but for the time being we're at least making an effort to not extend the schema beyond what the rails code itself requires.

> 3 is also a major consideration, it makes everybody's life easier if we can
> avoid updates being made to the db by the replicator.

for safety's sake i think this makes a lot of sense.

> I hope I don't sound too negative. My gut also tells me that what we're
> doing is not the "right" solution and I've had fairly similar arguments with
> Matt already :-) But having spent some time playing with it I can't find
> any reason why it won't work, and from a performance point of view I suspect
> it will win ...

it seems right to me to use postgres' existing features to support this. we've got pretty close to a working system without needing schema changes, so it would be a shame if they turn out to be necessary.

cheers,

matt
On Thu, Jun 11, 2009 at 1:13 PM, Brett Henderson <brett@bretth.com> wrote:
> Marko Kreen wrote:
> Sorry, I'm not sure what you're suggesting with txid_current(). We're
> currently using the txid_current_snapshot() method which returns us the
> maximum transaction id plus in-flight transactions. We specifically exclude
> transactions that are in-flight from the query, then include them on
> subsequent queries when they have committed.

i think what marko is suggesting is a new column, e.g:

  alter table X add column txid bigint default txid_current();

the txid_current() function returns an int8 which holds the 32-bit xid in the lower word and a 32-bit "epoch" counter in the upper word, so that it doesn't wrap around.

>> See pgq.batch_event_sql() function in Skytools [2] for how to
>> query txids between snapshots efficiently and without being affected
>> by long transactions.
>
> I'll take a look.

it was looking at the skytools stuff which got me thinking about using txids in the first place. someone on the osm-dev list had suggested using PgQ, but we weren't keen on the schema changes that would have been necessary.

cheers,

matt
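The epoch trick described here can be sketched in a few lines of illustrative Python (the real value comes from txid_current(); this just shows why the composed 64-bit value stays monotonic across a wraparound):

```python
def extended_txid(epoch, xid):
    """Compose the 64-bit value txid_current() exposes: the wraparound
    epoch in the high 32 bits, the raw 32-bit xid in the low 32 bits."""
    return (epoch << 32) | xid

# Across an xid wraparound the raw xids go "backwards", but the epoch
# bump keeps the 64-bit txids monotonically increasing:
before_wrap = extended_txid(0, 0xFFFFFFF0)
after_wrap = extended_txid(1, 0x00000010)
```

Because the composed values never wrap, an ordinary btree index on such an int8 column is safe, which is the whole point of Marko's suggestion.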
On 6/11/09, Brett Henderson <brett@bretth.com> wrote:
> Marko Kreen wrote:
>> 4-byte xids in a btree may create data corruption.
>
> Can you be more specific on this? I'm aware of xid being an unsigned
> integer which means we need to deal with the cast resulting in negative
> numbers. This means we have to split our range queries into several pieces
> when crossing overflow points. But I'm not aware of any other issues.

No, the issue is unstable '<' and '>' relationships between values.

>> The solution is to use 8-byte txids via txid_current() for indexing. [1]
>
> The main issue with 8-byte values is the size. When multiplied by 500
> million rows it turns into a lot of space. I'd prefer to use int4 if it's
> possible ...

If you use PgQ, you don't need any extra column in the table. As the data in the PgQ queue tables is garbage-collected pretty quickly, the extra space will not be a problem.

> Sorry, I'm not sure what you're suggesting with txid_current(). We're
> currently using the txid_current_snapshot() method which returns us the
> maximum transaction id plus in-flight transactions. We specifically exclude
> transactions that are in-flight from the query, then include them on
> subsequent queries when they have committed.

My suggestion was to use an int8 txid column for querying, either on the table (a bit expensive indeed) or on queue tables (pgq).

>> In fact, perhaps you can use PgQ directly instead of building your own.
>> It is built quite similarly to what you are planning - periodic
>> snapshots and then queries on txids to get the data.
>
> I must admit I'm not very familiar with PgQ, but I did take a quick look
> when evaluating options. I'm possibly misguided here but serialising
> everything via a queue doesn't seem like the most efficient way of
> replicating large changesets.

Most efficient is COPY-ing the whole table. If you want to replicate piecemeal, only the changes that happened since the last sync point, PgQ is quite efficient compared to trying to read changes from the main table:

- Only a small amount of recent data is duplicated; older data is cleaned up quite fast (configurable).
- The extra index is small, as it needs to cover only a small amount of data.

> Another issue is that it would only allow
> read-once behaviour that prevents a client from re-replicating if something
> goes wrong.

PgQ will give you the exact same batch repeatedly, until you call pgq.finish_batch(). This allows you to coordinate commits between different databases.

> How would you trigger message creation using PgQ? Would you typically use
> triggers on the tables to be replicated?

http://skytools.projects.postgresql.org/pgq/files/triggers-sql.html

--
marko
On 6/11/09, Matt Amos <zerebubuth@gmail.com> wrote:
> On Thu, Jun 11, 2009 at 1:13 PM, Brett Henderson <brett@bretth.com> wrote:
> >> See pgq.batch_event_sql() function in Skytools [2] for how to
> >> query txids between snapshots efficiently and without being affected
> >> by long transactions.
> >
> > I'll take a look.
>
> it was looking at the skytools stuff which got me thinking about using
> txids in the first place. someone on the osm-dev list had suggested
> using PgQ, but we weren't keen on the schema changes that would have
> been necessary.

Apart from the trigger, PgQ does not need any schema changes?

--
marko
On Thu, Jun 11, 2009 at 2:48 PM, Marko Kreen <markokr@gmail.com> wrote:
> On 6/11/09, Matt Amos <zerebubuth@gmail.com> wrote:
>> it was looking at the skytools stuff which got me thinking about using
>> txids in the first place. someone on the osm-dev list had suggested
>> using PgQ, but we weren't keen on the schema changes that would have
>> been necessary.
>
> Except the trigger, PgQ does not need any schema changes?

i've been having a look and it seems to me that PgQ requires some extra tables as well as the trigger. am i missing something?

PgQ might be a good solution, but i'm worried that after calling pgq.finish_batch() the batch is released. this would mean it wouldn't be possible to regenerate older files (e.g. a few days to a week old) in case something unexpected went wrong. it might not be a major problem, though.

i think we could get the same functionality without the extra daemons by putting a trigger on those tables for insert and recording the object id, version and 64-bit txid in another table. but if we're going to alter the schema we might as well put the txid column directly into those tables...

cheers,

matt
On 6/11/09, Matt Amos <zerebubuth@gmail.com> wrote:
> On Thu, Jun 11, 2009 at 2:48 PM, Marko Kreen <markokr@gmail.com> wrote:
>> Except the trigger, PgQ does not need any schema changes?
>
> i've been having a look and it seems to me that PgQ requires some
> extra tables as well as the trigger. am i missing something?

Well, my point was that you don't need to change your existing tables. PgQ indeed has a few internal and per-queue tables, but they come with pgq.sql, live under the 'pgq' schema and are maintained by it. There is no need for you to worry about them.

> PgQ might be a good solution, but i'm worried that after calling
> pgq.finish_batch() the batch is released. this would mean it wouldn't
> be possible to regenerate older files (e.g. a few days to a week old) in
> case something unexpected went wrong. it might not be a major problem,
> though.

You can register a 'safety' consumer on each queue that lags by a specified amount of time, to hold back pgq rotation for that queue, or set the rotation time big enough to guarantee that old batches stay around.

> i think we could get the same functionality without the extra daemons
> by putting a trigger on those tables for insert and recording the
> object id, version and 64-bit txid in another table.

We have tried storing ids in the queue and later joining them with the data rows, but it seems more efficient to duplicate the data into the queue and thus avoid joins when reading from it. The queue space will be recovered soon anyway. On light loads it probably does not matter either way.

> but if we're
> going to alter the schema we might as well put the txid column
> directly into those tables...

It's a permanent space loss in the data tables vs. a temporary space loss in the queue tables. If the former fits your situation better, go ahead.

--
marko