Thread: queries on xmin
the openstreetmap project (http://osm.org/) recently moved from using mysql to postgres and we're trying to improve some of our tools using the new functionality that postgres provides.

in particular, we are dumping changes to the database at short intervals (currently every minute, hour and day [1,2]) so that 3rd party sites can use this to keep up-to-date with the main database. it previously worked by examining the timestamp of each modified element, but this is no longer practical due to new features in the openstreetmap API which can cause long-running transactions [3].

we've been working out a scheme based on taking txid_snapshots at short intervals and dumping the new rows (due to the way it's implemented, all edits are inserted rows) and querying xmin. the query looks something like this:

  select id, version from (nodes|ways|relations)
  where timestamp > (now() - '1 hour'::interval) and xmin in (...)

and we build up the txid list on the client from the two snapshots we're dumping between. however, we're finding that this becomes much less efficient as the txid list grows longer. in an effort to reduce the query time we're looking to index the xmin column. it seems that hash indexes are already supported on the xid type, but btrees are not [4].

the queries we're doing would usually be of the form "xmin in previous_unfinished_txids or (xmin > previous_max_txid and xmin <= current_max_txid and xmin not in current_unfinished_txids)", except when wrap-around occurs, so it would seem that a btree index would be superior to building this list client-side and using a hash index.

what problems are we going to create for ourselves if we create a btree index on xmin cast to int4? would it be as efficient to use a hash index, create a temporary table of the txids we're querying with a hash index, and do an explicit join? have i missed the point entirely?
many thanks,

matt

[1] http://wiki.openstreetmap.org/wiki/Planet.osm/diffs
[2] http://wiki.openstreetmap.org/wiki/OsmChange
[3] http://wiki.openstreetmap.org/wiki/OSM_Protocol_Version_0.6#Diff_upload:_POST_.2Fapi.2F0.6.2Fchangeset.2F.23id.2Fupload
[4] http://archives.postgresql.org/pgsql-general/2004-10/msg01474.php
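The client-side txid list Matt describes can be sketched roughly as follows. This is an illustrative Python sketch, not OSM's actual code: it assumes snapshots in the `xmin:xmax:xip,...` text format that `txid_current_snapshot()` returns, and it ignores wraparound (the 64-bit snapshot values don't wrap in practice).

```python
def txids_between(prev, curr):
    """Return the txids that committed between two txid_snapshot values.

    Each snapshot is a string 'xmin:xmax:xip1,xip2,...' as returned by
    txid_current_snapshot().  Hypothetical helper for illustration only.
    """
    def parse(s):
        xmin, xmax, xip = (s.split(":") + [""])[:3]
        in_progress = {int(x) for x in xip.split(",") if x}
        return int(xmin), int(xmax), in_progress

    _, prev_xmax, prev_xip = parse(prev)
    _, curr_xmax, curr_xip = parse(curr)

    # Candidates: transactions still in flight at the previous snapshot,
    # plus everything that started since, minus those still in flight now.
    candidates = prev_xip | set(range(prev_xmax, curr_xmax))
    return sorted(candidates - curr_xip)
```

This is the list that then feeds the `xmin in (...)` clause, and whose growth (driven by `curr_xmax - prev_xmax` plus long-running in-flight transactions) is what makes the query slow.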
On Thu, Jun 11, 2009 at 11:25 AM, Matt Amos <zerebubuth@gmail.com> wrote:
> what problems are we going to create for ourselves if we create a
> btree index on xmin casted to int4? would it be as efficient to use a
> hash index, create a temporary table of txids that we're querying with
> a hash index and do an explicit join? have i missed the point
> entirely?

Wow, I'm quite shocked that we don't already detect attempts to create an index on xmin or xmax. There's no way that'll work properly, since those fields can change spontaneously: xmin when, for example, vacuum runs, and xmax when things like UPDATE, SELECT FOR SHARE or SELECT FOR UPDATE are used.

Incidentally, the reason there's no btree opclass is that xids don't monotonically increase. They wrap around. So periodically you would lose updates, or see them repeatedly, whenever the xid wrapped around and the old transactions appeared to be in the future.

If you never run updates and are never interested in tuples that are old enough to be frozen then perhaps you could mostly get away with it. But I really don't think it's a good idea.

Much better would be to store a user-space column with something like a txid or a timestamp and use that directly. That way you have control over the behaviour of the column.

Another option to consider would be including a boolean column "dumped" defaulted to false. Then you could have a partial index on the primary key or date "WHERE NOT dumped". Then when you dump you can "SELECT FOR UPDATE * WHERE NOT dumped" and when you're done "UPDATE SET dumped = 't'". Alternately you could use "UPDATE SET dumped = 't' WHERE NOT dumped RETURNING *", which is basically equivalent.

That would create twice as much traffic in the table, which would make vacuums much more important. But it would mean you could quickly access undumped records using the index and know that your process doesn't depend on following a strict clock schedule.

--
Gregory Stark
http://mit.edu/~gsstark/resume.pdf
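Greg's wraparound point can be made concrete with a small sketch of the modular comparison the backend uses, modelled on `TransactionIdPrecedes` (an illustrative Python rendering, not the actual C code):

```python
def xid_precedes(a, b):
    """True if 32-bit xid 'a' is logically older than 'b', using modular
    arithmetic in the style of the backend's TransactionIdPrecedes:
    the sign of the 32-bit difference decides the ordering."""
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 0x80000000:        # reinterpret as a signed int32
        diff -= 0x100000000
    return diff < 0

# Near a wraparound, the plain numeric order a btree would use
# disagrees with the logical transaction order:
old_xid = 0xFFFFFFF0   # allocated just before the xid counter wrapped
new_xid = 0x00000010   # allocated just after
```

Here `xid_precedes(old_xid, new_xid)` is true, yet numerically `old_xid > new_xid`, which is exactly why a btree over raw xids would order recent transactions before old ones after a wrap.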
On 6/11/09, Matt Amos <zerebubuth@gmail.com> wrote:
> the openstreetmap project (http://osm.org/) recently moved from using
> mysql to postgres and we're trying to improve some of our tools using
> the new functionality that postgres provides.
>
> in particular, we are dumping changes to the database at short
> intervals (currently every minute, hour and day [1,2]) so that 3rd
> party sites can use this to keep up-to-date with the main database. it
> previously worked by examining the timestamp of each modified element,
> but this is no longer practical due to new features in the
> openstreetmap API which can cause long-running transactions [3].
>
> we've been working out a scheme based on taking txid_snapshots at
> short intervals and dumping the new rows (due to the way it's
> implemented, all edits are inserted rows) and querying xmin. the query
> looks something like this:
>
> select id,version from (nodes|ways|relations) where timestamp > (now()
> - '1 hour'::interval) and xmin in (...)
>
> and we build up the txid list from the two snapshots we're dumping
> between on the client. however, we're finding that this becomes much
> less efficient as the txid list becomes longer. in an effort to reduce
> the query time we're looking to index the xmin column. it seems that
> hash indexes are already supported on the txid type, but btree are not
> [4].
>
> the queries we're doing would usually be of the form "xmin in
> previous_unfinished_txids or (xmin > previous_max_txid and xmin <=
> current_max_txid and not in current_unfinished_txids)" except when
> wrap-around occurs, so it would seem that a btree index would be
> superior to building this list client-side and using a hash index.
>
> what problems are we going to create for ourselves if we create a
> btree index on xmin casted to int4? would it be as efficient to use a
> hash index, create a temporary table of txids that we're querying with
> a hash index and do an explicit join?
> have i missed the point entirely?

A hash index would not work, as you cannot do range queries on it.

4-byte xids in a btree may create data corruption.

The solution is to use 8-byte txids via txid_current() for indexing. [1]

See the pgq.batch_event_sql() function in Skytools [2] for how to query txids between snapshots efficiently and without being affected by long transactions.

In fact, perhaps you can use PgQ directly instead of building your own. It is built quite similarly to what you are planning - periodic snapshots and then queries on txids to get the data.

--
marko

[1] http://www.postgresql.org/docs/8.3/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT
[2] http://wiki.postgresql.org/wiki/Skytools
Marko Kreen wrote:
> 4-byte xids in a btree may create data corruption.

Can you be more specific on this? I'm aware of xid being an unsigned integer, which means we need to deal with the cast resulting in negative numbers. This means we have to split our range queries into several pieces when crossing overflow points. But I'm not aware of any other issues.

> The solution is to use 8-byte txids via txid_current() for indexing. [1]

The main issue with 8-byte values is the size. When multiplied by 500 million rows it turns into a lot of space. I'd prefer to use int4 if it's possible ...

Sorry, I'm not sure what you're suggesting with txid_current(). We're currently using the txid_current_snapshot() method, which returns us the maximum transaction id plus in-flight transactions. We specifically exclude transactions that are in-flight from the query, then include them on subsequent queries when they have committed.

> See the pgq.batch_event_sql() function in Skytools [2] for how to
> query txids between snapshots efficiently and without being affected
> by long transactions.

I'll take a look.

> In fact, perhaps you can use PgQ directly instead of building your own.
> It is built quite similarly to what you are planning - periodic
> snapshots and then queries on txids to get the data.

I must admit I'm not very familiar with PgQ, but I did take a quick look when evaluating options. I'm possibly misguided here, but serialising everything via a queue doesn't seem like the most efficient way of replicating large changesets. Another issue is that it would only allow read-once behaviour, which prevents a client from re-replicating if something goes wrong.

How would you trigger message creation using PgQ? Would you typically use triggers on the tables to be replicated?

Brett
On Thu, Jun 11, 2009 at 12:59 PM, Brett Henderson <brett@bretth.com> wrote:
> I have a couple of hesitations with using this approach:
> 1. We can only run the replicator once.
> 2. We can only run a single replicator.
> 3. It requires write access to the db.
>
> 1 is perhaps the biggest issue. It means that we only get one shot at
> reading changes, and if something goes wrong we lose the results. It's nice
> being able to re-generate when something goes wrong.

I was picturing only actually committing the update once you're sure all the files are generated. So if something goes wrong you roll back the database update.

Another option would be to use an integer "batch_id" column instead of a boolean. Then you could recreate a previous batch if the file is subsequently lost. An integer takes more space than a boolean, but due to alignment issues it often works out the same.

> We could live with 2, although it makes it impossible to test new
> replication mechanisms without adding additional columns for each.

Well, four boolean columns would, depending on the rest of the table layout, probably take the same space as one anyway due to alignment issues.

> 3 is also a major consideration, it makes everybody's life easier if we can
> avoid updates being made to the db by the replicator.

Yeah, having to update every record once would have a major impact. It would mean twice as many tuples in the table and in every index (except the partial index). Vacuum would be a major issue for avoiding bloat in both the table and the indexes.

I'm trying to see how to do it without at least updating, or inserting into a separate queue table, but I'm not immediately seeing a good solution. I've done something similar to what you're doing now once, but there the transactions were simple inserts; we just selected all the work for the previous hour at 5 minutes past the hour and were done with it. In your case you would have to delay replication by 30min+.
Implementing what you describe with txids would depend on a lot of internal logic which could change in future releases. It would also tie you to never updating or deleting rows in those tables, which seems like it might be a problem in the future.

--
Gregory Stark
http://mit.edu/~gsstark/resume.pdf
I've been working with Matt on this. Thanks for the suggestions.

Greg Stark wrote:
> On Thu, Jun 11, 2009 at 11:25 AM, Matt Amos <zerebubuth@gmail.com> wrote:
>> what problems are we going to create for ourselves if we create a
>> btree index on xmin casted to int4? would it be as efficient to use a
>> hash index, create a temporary table of txids that we're querying with
>> a hash index and do an explicit join? have i missed the point
>> entirely?
>
> Wow, I'm quite shocked that we don't already detect attempts to create
> an index on xmin or xmax. There's no way that'll work properly since
> those fields can change spontaneously when, for example vacuum runs or
> for xmax when things like UPDATE or SELECT FOR SHARE or SELECT FOR
> UPDATE are used.

I had just assumed that the index would be updated along with the xmin value. Are you saying that the index would remain set to the original value? I think I read that xmin only gets rewritten (frozen) after 100 million transactions (by default), which would be okay for our needs; we'd have long ago replicated the changes by then. If we really do need to re-generate replication files after that long we could just do it using a timestamp.

> Incidentally the reason there's no btree opclass is because xids don't
> monotonically increase. They wrap around. So periodically you would
> lose updates or see them repeatedly whenever the xid wrapped around
> and the old transactions appear to be in the future.

Yep, understood. We'd be converting to a signed integer, which means we'd have two special points to deal with: the overflow point from 2^31 - 1 to -2^31, and the wraparound back to 0, with xids 0-2 to be ignored. Would it work if we did something along the lines of:

  (xmin >= startTx AND xmin <= (2^31 - 1)) OR (xmin >= -2^31 AND xmin <= finishTx)

One other thing I should mention is that we'd be adding a new function to the db (eg. xid_to_int4) that would convert the xid to an int4, wrapping values greater than 2^31 around to negative values and thus avoiding the database raising overflow errors.

> If you never run updates and are never interested in tuples that are
> old enough to be frozen then perhaps you could mostly get away with
> it. But I really don't think it's a good idea.
>
> Much better would be to store a user-space column with something like
> a txid or a timestamp and use that directly. That way you have control
> over the behaviour of the column.

We already have a timestamp, but with long running transactions some of the rows don't become visible until after we've replicated that time interval. The existing implementation queries for changes based on timestamp and runs with a 5 minute lag, but we see some transactions taking up to half an hour, which means we miss the data committed as part of those transactions.

When you suggest adding a txid column, are you suggesting to have something like an int8 column populated from a global monotonically increasing sequence? That sounds like an elegant approach. The downside is that between an int8 column and the index we'd be talking approximately 16 bytes per row, which when multiplied by 500 million rows (I'm not sure exactly how many there are, but that's ballpark) comes out at 8GB of additional disk usage. It's approximately 4 times the size of an int4 index on xmin. Disk consumption isn't critical, but it is a consideration.

> Another option to consider would be including a boolean column
> "dumped" defaulted to false. Then you could have a partial index on
> the primary key or date "WHERE NOT dumped". Then when you dump you can
> "SELECT FOR UPDATE * WHERE NOT dumped" and then when you're done
> "UPDATE SET dumped = 't'". Alternately you could use "UPDATE SET
> dumped='t' WHERE NOT dumped RETURNING *" which is basically
> equivalent.

I have a couple of hesitations with using this approach:
1. We can only run the replicator once.
2. We can only run a single replicator.
3. It requires write access to the db.

1 is perhaps the biggest issue. It means that we only get one shot at reading changes, and if something goes wrong we lose the results. It's nice being able to re-generate when something goes wrong.

We could live with 2, although it makes it impossible to test new replication mechanisms without adding additional columns for each.

3 is also a major consideration; it makes everybody's life easier if we can avoid updates being made to the db by the replicator.

> That would create twice as much traffic in the table which would make
> vacuums much more important. But it would mean you could quickly access
> undumped records using the index and know that your process doesn't
> depend on following a strict clock schedule.

If we do use range queries then I don't think we have any clock dependencies, because we'd remove the timestamp clause from the queries.

I hope I don't sound too negative. My gut also tells me that what we're doing is not the "right" solution, and I've had fairly similar arguments with Matt already :-) But having spent some time playing with it I can't find any reason why it won't work, and from a performance point of view I suspect it will win ...

Brett
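The xid_to_int4 conversion and the two-piece range query Brett describes could look roughly like this. This is a Python sketch of the arithmetic only; `xid_to_int4` is the hypothetical db function named above, not an existing postgres function:

```python
INT32 = 2**31

def xid_to_int4(xid):
    """Reinterpret an unsigned 32-bit xid as a signed int4 (two's
    complement), as the proposed xid_to_int4 db function would."""
    return xid - 2**32 if xid >= INT32 else xid

def range_pieces(start_xid, finish_xid):
    """Int4 ranges covering 'xmin between start and finish', split at
    the signed-overflow point when the xid range crosses 2^31."""
    s, f = xid_to_int4(start_xid), xid_to_int4(finish_xid)
    if s <= f:
        return [(s, f)]
    # crossed the overflow point: 2^31 - 1 wraps to -2^31
    return [(s, INT32 - 1), (-INT32, f)]
```

Each returned pair would become one `xmin BETWEEN lo AND hi` clause, so a query crossing the overflow point turns into two OR'd range clauses, matching the formula in Brett's mail.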
On Thu, Jun 11, 2009 at 12:59 PM, Brett Henderson <brett@bretth.com> wrote:
> Greg Stark wrote:
>> Another option to consider would be including a boolean column
>> "dumped" defaulted to false. Then you could have a partial index on
>> the primary key or date "WHERE NOT dumped". Then when you dump you can
>> "SELECT FOR UPDATE * WHERE NOT dumped" and then when you're done
>> "UPDATE SET dumped = 't'". Alternately you could use "UPDATE SET
>> dumped='t' WHERE NOT dumped RETURNING *" which is basically
>> equivalent.
>
> I have a couple of hesitations with using this approach:
> 1. We can only run the replicator once.
> 2. We can only run a single replicator.
> 3. It requires write access to the db.

4. it requires a schema change to support this use case.

again, this isn't a massive issue, but for the time being we're at least making an effort to not extend the schema beyond what the rails code itself requires.

> 3 is also a major consideration, it makes everybody's life easier if we can
> avoid updates being made to the db by the replicator.

for safety's sake i think this makes a lot of sense.

> I hope I don't sound too negative. My gut also tells me that what we're
> doing is not the "right" solution and I've had fairly similar arguments with
> Matt already :-) But having spent some time playing with it I can't find
> any reason why it won't work, and from a performance point of view I suspect
> it will win ...

it seems right to me to use postgres' existing features to support this. we've got pretty close to a working system without needing schema changes, so it would be a shame if they turn out to be necessary.

cheers,

matt
On Thu, Jun 11, 2009 at 1:13 PM, Brett Henderson <brett@bretth.com> wrote:
> Marko Kreen wrote:
> Sorry, I'm not sure what you're suggesting with txid_current(). We're
> currently using the txid_current_snapshot() method which returns us the
> maximum transaction id plus in-flight transactions. We specifically exclude
> transactions that are in-flight from the query, then include them on
> subsequent queries when they have committed.

i think what marko is suggesting is a new column, e.g:

  alter table X add column txid bigint default txid_current();

the txid_current() function returns an int8 which holds the 32-bit xid in the lower word and a 32-bit "epoch" counter in the upper word, so that it doesn't wrap around.

>> See pgq.batch_event_sql() function in Skytools [2] for how to
>> query txids between snapshots efficiently and without being affected
>> by long transactions.
>
> I'll take a look.

it was looking at the skytools stuff which got me thinking about using txids in the first place. someone on the osm-dev list had suggested using PgQ, but we weren't keen on the schema changes that would have been necessary.

cheers,

matt
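The epoch trick described here can be sketched in a few lines of illustrative Python (the real value comes from txid_current(); this just shows why the composed 64-bit value stays monotonic across a wraparound):

```python
def extended_txid(epoch, xid):
    """Compose the 64-bit value txid_current() exposes: the wraparound
    epoch in the high 32 bits, the raw 32-bit xid in the low 32 bits."""
    return (epoch << 32) | xid

# Across an xid wraparound the raw xids go "backwards", but the epoch
# bump keeps the 64-bit txids monotonically increasing:
before_wrap = extended_txid(0, 0xFFFFFFF0)
after_wrap = extended_txid(1, 0x00000010)
```

Because the composed values never wrap, an ordinary btree index on such an int8 column is safe, which is the whole point of Marko's suggestion.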
On 6/11/09, Brett Henderson <brett@bretth.com> wrote:
> Marko Kreen wrote:
>> 4-byte xids in a btree may create data corruption.
>
> Can you be more specific on this? I'm aware of xid being an unsigned
> integer which means we need to deal with the cast resulting in negative
> numbers. This means we have to split our range queries into several pieces
> when crossing overflow points. But I'm not aware of any other issues.

No, the issue is unstable '<' and '>' relationships between values.

>> The solution is to use 8-byte txids via txid_current() for indexing. [1]
>
> The main issue with 8-byte values is the size. When multiplied by 500
> million rows it turns into a lot of space. I'd prefer to use int4 if it's
> possible ...

If you use PgQ, you don't need any extra column in the table. As the data in the PgQ queue tables is garbage-collected pretty quickly, the extra space will not be a problem.

> Sorry, I'm not sure what you're suggesting with txid_current(). We're
> currently using the txid_current_snapshot() method which returns us the
> maximum transaction id plus in-flight transactions. We specifically exclude
> transactions that are in-flight from the query, then include them on
> subsequent queries when they have committed.

My suggestion was to use an int8 txid column for querying, either on the table (a bit expensive indeed) or on queue tables (pgq).

>> In fact, perhaps you can use PgQ directly instead of building your own.
>> It is built quite similarly to what you are planning - periodic
>> snapshots and then queries on txids to get the data.
>
> I must admit I'm not very familiar with PgQ, but I did take a quick look
> when evaluating options. I'm possibly misguided here but serialising
> everything via a queue doesn't seem like the most efficient way of
> replicating large changesets.

Most efficient is COPY-ing the whole table. If you want to replicate piecemeal, only the changes that happened since the last sync point, PgQ is quite efficient compared to trying to read changes from the main table:

- Only a small amount of recent data is duplicated; older data is cleaned up quite fast (configurable).
- The extra index is small, as it needs to cover only a small amount of data.

> Another issue is that it would only allow
> read-once behaviour that prevents a client from re-replicating if something
> goes wrong.

PgQ will give you the exact same batch repeatedly, until you call pgq.finish_batch(). This allows you to coordinate commits between different databases.

> How would you trigger message creation using PgQ? Would you typically use
> triggers on the tables to be replicated?

http://skytools.projects.postgresql.org/pgq/files/triggers-sql.html

--
marko
On 6/11/09, Matt Amos <zerebubuth@gmail.com> wrote:
> On Thu, Jun 11, 2009 at 1:13 PM, Brett Henderson <brett@bretth.com> wrote:
> >> See pgq.batch_event_sql() function in Skytools [2] for how to
> >> query txids between snapshots efficiently and without being affected
> >> by long transactions.
> >
> > I'll take a look.
>
> it was looking at the skytools stuff which got me thinking about using
> txids in the first place. someone on the osm-dev list had suggested
> using PgQ, but we weren't keen on the schema changes that would have
> been necessary.

Apart from the trigger, PgQ does not need any schema changes?

--
marko
On Thu, Jun 11, 2009 at 2:48 PM, Marko Kreen <markokr@gmail.com> wrote:
> On 6/11/09, Matt Amos <zerebubuth@gmail.com> wrote:
>> it was looking at the skytools stuff which got me thinking about using
>> txids in the first place. someone on the osm-dev list had suggested
>> using PgQ, but we weren't keen on the schema changes that would have
>> been necessary.
>
> Except the trigger, PgQ does not need any schema changes?

i've been having a look and it seems to me that PgQ requires some extra tables as well as the trigger. am i missing something?

PgQ might be a good solution, but i'm worried that after calling pgq.finish_batch() the batch is released. this would mean it wouldn't be possible to regenerate older files (e.g. a few days to a week old) in case something unexpected went wrong. it might not be a major problem, though.

i think we could get the same functionality without the extra daemons by putting a trigger on those tables for insert and recording the object id, version and 64-bit txid in another table. but if we're going to alter the schema we might as well put the txid column directly into those tables...

cheers,

matt
On 6/11/09, Matt Amos <zerebubuth@gmail.com> wrote:
> On Thu, Jun 11, 2009 at 2:48 PM, Marko Kreen <markokr@gmail.com> wrote:
>> Except the trigger, PgQ does not need any schema changes?
>
> i've been having a look and it seems to me that PgQ requires some
> extra tables as well as the trigger. am i missing something?

Well, my point was that you don't need to change your existing tables. PgQ indeed has a few internal and per-queue tables, but they come with pgq.sql, live under the 'pgq' schema and are maintained by it. There is no need for you to worry about them.

> PgQ might be a good solution, but i'm worried that after calling
> pgq.finish_batch() the batch is released. this would mean it wouldn't
> be possible to regenerate older files (e.g. a few days to a week old) in
> case something unexpected went wrong. it might not be a major problem,
> though.

You can register a 'safety' consumer on each queue that lags by a specified amount of time, to hold back pgq rotation for that queue, or set the rotation time big enough to guarantee that old batches stay around.

> i think we could get the same functionality without the extra daemons
> by putting a trigger on those tables for insert and recording the
> object id, version and 64-bit txid in another table.

We have tried storing ids in the queue and later joining them with the data rows, but it seems more efficient to duplicate the data into the queue and thus avoid joins when reading from it. The queue space will be recovered soon anyway. On light loads it probably does not matter either way.

> but if we're
> going to alter the schema we might as well put the txid column
> directly into those tables...

It's a permanent space loss in the data tables vs. a temporary space loss in the queue tables. If the former fits your situation better, go ahead.

--
marko