Thread: database corruption
I received this from a friend: Lance Rushing wrote: > Something bad happened in Postgres. I lost the ability to connect to the > 'pegasus2' database. > > i was doing some updates when I got an error, saying something about the > backend dying... > > When I restarted the database I could no longer connect to my database. I > can connect to the other databases, just not 'pegasus2'. This is a 7.3.3 installation on Red Hat (9 I think). I don't have any more detail yet on exactly what he was doing at this point, but I grabbed a copy of $PGDATA and looked at it on my own machine (since he doesn't have debug and assert support). Logging into any other database works fine, but the offending database produces this backtrace: ... #85355 0x0817c0b5 in RelationBuildDesc (buildinfo= {infotype = 2, i = {info_id = 135924865, info_name = 0x81a0c81 "pg_trigger"}}, oldrelation=0x0) at relcache.c:890 #85356 0x0817cf54 in RelationSysNameGetRelation (relationName=0x81a0c81 "pg_trigger") at relcache.c:1591 #85357 0x0807c066 in relation_openr (sysRelationName=0x81a0c81 "pg_trigger", lockmode=1) at heapam.c:550 #85358 0x0807c20a in heap_openr (sysRelationName=0x81a0c81 "pg_trigger", lockmode=1) at heapam.c:660 #85359 0x080d5872 in RelationBuildTriggers (relation=0x41ad740c) at trigger.c:720 #85360 0x0817c0b5 in RelationBuildDesc (buildinfo= {infotype = 2, i = {info_id = 135924865, info_name = 0x81a0c81 "pg_trigger"}}, oldrelation=0x0) at relcache.c:890 #85361 0x0817cf54 in RelationSysNameGetRelation (relationName=0x81a0c81 "pg_trigger") at relcache.c:1591 #85362 0x0807c066 in relation_openr (sysRelationName=0x81a0c81 "pg_trigger", lockmode=1) at heapam.c:550 #85363 0x0807c20a in heap_openr (sysRelationName=0x81a0c81 "pg_trigger", lockmode=1) at heapam.c:660 #85364 0x080d5872 in RelationBuildTriggers (relation=0x41ad6b54) at trigger.c:720 #85365 0x0817c0b5 in RelationBuildDesc (buildinfo= {infotype = 2, i = {info_id = 135924865, info_name = 0x81a0c81 "pg_trigger"}}, oldrelation=0x0) at relcache.c:890 ... i.e. it recurses itself to a horrible death. Any ideas? Is this an indication that pg_trigger is corrupted? Thanks, Joe
Joe Conway wrote: > I don't have any more detail yet on exactly what he was doing at this > point, but I grabbed a copy of $PGDATA and looked at it on my own > machine (since he doesn't have debug and assert support). Logging into > any other database works fine, but the offending database produces this > backtrace: It turns out the "corruption" was user error. He ran a statement that inadvertantly set reltriggers = 1 for every row in pg_class. This is what led to the infinite recursion. For the archives, I was able to hack my way into the database by wrapping the body of RelationBuildTriggers() in "if (relation->rd_id > 17000) {}", thus excluding system tables. Once back in the database, updating reltriggers in pg_class to appropriate values restored access to the database with an unhacked backend. Joe
On Sat, Aug 30, 2003 at 12:57:26PM -0700, Joe Conway wrote: > Joe Conway wrote: > >I don't have any more detail yet on exactly what he was doing at this > >point, but I grabbed a copy of $PGDATA and looked at it on my own > >machine (since he doesn't have debug and assert support). Logging into > >any other database works fine, but the offending database produces this > >backtrace: > > It turns out the "corruption" was user error. He ran a statement that > inadvertantly set reltriggers = 1 for every row in pg_class. This is > what led to the infinite recursion. For the record, how were you able to detect this? -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Ellos andaban todos desnudos como su madre los parió, y también las mujeres, aunque no vi más que una, harto moza, y todos los que yo vi eran todos mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)
Alvaro Herrera wrote: > On Sat, Aug 30, 2003 at 12:57:26PM -0700, Joe Conway wrote: > >>Joe Conway wrote: >> >>>I don't have any more detail yet on exactly what he was doing at this >>>point, but I grabbed a copy of $PGDATA and looked at it on my own >>>machine (since he doesn't have debug and assert support). Logging into >>>any other database works fine, but the offending database produces this >>>backtrace: >> >>It turns out the "corruption" was user error. He ran a statement that >>inadvertantly set reltriggers = 1 for every row in pg_class. This is >>what led to the infinite recursion. > > For the record, how were you able to detect this? The backtrace from the core file showed that the recursion was going on between RelationSysNameGetRelation(), relation_openr(), heap_openr(), RelationBuildTriggers() and RelationBuildDesc(). It seemed odd that pg_trigger seemed to be getting a trigger applied to it, so I guessed that bypassing RelationBuildTriggers() for system tables would allow me in to the database. About the time I got in and started looking around, I finally got in touch with the user, who confirmed he had been trying to disable then re-enable triggers when the problem occurred. He ran a statement like: update pg_class set reltriggers = 1; thinking that reltriggers was a "flag" (0 == off, 1 == on). That's when it all suddenly made sense to me ;-) Fortunately this was his own development database he was messing with, but it was an interesting exercise none-the-less. Maybe this reinforces the need to have a command for enabling/disabling triggers. I vaguely remember discussion about that -- did it make it into 7.4? Joe
Joe Conway <mail@joeconway.com> writes: > Fortunately this was his own development database he was messing with, > but it was an interesting exercise none-the-less. Maybe this reinforces > the need to have a command for enabling/disabling triggers. No, it reinforces the cardinal rule about not experimenting with system catalog alterations in databases you care about. If he'd done his experimentation in a scratch database, he'd not have been in trouble. Perhaps we need some documentation about What Not To Do As Superuser. regards, tom lane
(Perhaps a newbie question, but I tried to google this out without success). Why postgres does an expensive seqscan to find the max(value) for an indexed column? I think MAX() does not know or cares if a column is indexed, but... Should not it? BTW, is there some smarter trick to do that? I know I can just do a very fast (SELECT pk FROM foo ORDER BY pk DESC LIMIT 1) instead, but my coleagues are arguing that MAX(indexed_column) seems to be a lot more smarter in MS-SQLServer and I end up without a good response. Thank you, -- Paulo Scardine Brazil
On 4 Sep 2003 at 11:32, Paulo Scardine wrote: > (Perhaps a newbie question, but I tried to google this out without success). > > Why postgres does an expensive seqscan to find the max(value) for an indexed > column? I think MAX() does not know or cares if a column is indexed, but... > Should not it? BTW, is there some smarter trick to do that? No. Postgresql uses MVCC which mean there could be multiple views of sample tuple active at the same time. There is no way to tell which is max. value for a column as definition of a committed value can be a moving target. It can not be cached, at least easily. That's the price to pay for MVCC. Same goes for select count(*) from table. That query has to end up with a sequential scan. > > I know I can just do a very fast (SELECT pk FROM foo ORDER BY pk DESC LIMIT > 1) instead, but my coleagues are arguing that MAX(indexed_column) seems to > be a lot > more smarter in MS-SQLServer and I end up without a good response. Well, postgresql earns solid concurrency due to MVCC. Set up postgresql and MS SQL server on same machine and do a rudimentary benchmark with 100 clients hitting database hard. See where you get more tps'.s In postgresql, readers and writers don't block each other. AFAIK, in MS SQL server rows are ocked for update. So if you lock a row in transaction and does not commit for long, MS SQL will have serious problems. All night long transactions are no problem to postgresql except for the fact that vacuum can not clean the tuples locked in tranactions. HTH ByeShridhar -- Blutarsky's Axiom: Nothing is impossible for the man who will not listen to reason.
On Thu, 4 Sep 2003, Shridhar Daithankar wrote: > > column? I think MAX() does not know or cares if a column is indexed, but... > > No. Postgresql uses MVCC which mean there could be multiple views of sample > tuple active at the same time. There is no way to tell which is max. value for > a column as definition of a committed value can be a moving target. > > It can not be cached, at least easily. That's the price to pay for MVCC. Same > goes for select count(*) from table. That query has to end up with a sequential > scan. It does not have to be like that. Even with a mvcc database it can use the index for max/min and in my opinion it should. As far as I know the only reason why it's not implemented in postgresql is because pg has a general aggregate model and max/min are implemented using that. Still, max/min are special in that they are almost the only aggregates that can use an index to deliver the result directly. Some day someone should make max/min a special case in pg. Exactly how is the question. I don't know mssql much, but I guess you can't define your own aggregate functions there? Then all aggregate functions are special anyway. -- /Dennis
This is an FAQ, BTW -- try searching the archives again. It's also mentioned in the documentation: http://candle.pha.pa.us/main/writings/pgsql/sgml/functions-aggregate.html On Thu, 2003-09-04 at 11:10, Dennis Bjorklund wrote: > On Thu, 4 Sep 2003, Shridhar Daithankar wrote: > > It can not be cached, at least easily. That's the price to pay for MVCC. Same > > goes for select count(*) from table. That query has to end up with a sequential > > scan. > > It does not have to be like that. Even with a mvcc database it can use the > index for max/min and in my opinion it should. Right, AFAIK MVCC isn't relevant to MAX() (given a btree index, you can just read the index in the right order and return the first valid tuple), although it makes optimizing COUNT(*) trickier, I believe. > As far as I know the only reason why it's not implemented in postgresql is > because pg has a general aggregate model and max/min are implemented using > that. Still, max/min are special in that they are almost the only > aggregates that can use an index to deliver the result directly. Some day > someone should make max/min a special case in pg. Exactly how is the > question. Well, it's an open question whether it's worth uglifying the backend to support this optimization, given that there is a trivial workaround that people can use. It would make it easier to port code to PostgreSQL from other RDBMSs, though... -Neil
Hello In my opinion, in 7.4 this optimized max() aggregate function would be a very small, but significant improvement. As one of the members on the list said, it would be a lot easier to port from/to other RDBMSes, with keeping the same optimalization of the queries. Bye, Gergely Czuczy mailto: phoemix@harmless.hu PGP: http://phoemix.harmless.hu/phoemix.pgp The point is, that geeks are not necessarily the outcasts society often believes they are. The fact is that society isn't cool enough to be included in our activities.
"Shridhar Daithankar" <shridhar_daithankar@persistent.co.in> writes: > On 4 Sep 2003 at 11:32, Paulo Scardine wrote: > > > (Perhaps a newbie question, but I tried to google this out without success). > > > > Why postgres does an expensive seqscan to find the max(value) for an indexed > > column? I think MAX() does not know or cares if a column is indexed, but... > > Should not it? BTW, is there some smarter trick to do that? > > No. Postgresql uses MVCC which mean there could be multiple views of sample > tuple active at the same time. There is no way to tell which is max. value for > a column as definition of a committed value can be a moving target. It has nothing to do with MVCC. It has to do with implementing this is hard in the general case. Think of examples like: select max(foo) group by bar; or select max(foo) where xyz = z; To do it properly max/min have to be special-cased and tightly integrated with other code to handle index scans and aggregates. As it currently stands they're implemented the same way as any other aggregate, which means they get to see all the records in the grouping. This is a frequently asked question, I'm surprised you didn't find stuff searching with google. There have been numerous long discussions on this topic not long ago. People are still trying to think about how to handle this better. -- greg
Greg Stark wrote: > It has nothing to do with MVCC. It has to do with implementing this is hard in > the general case. > > Think of examples like: > > select max(foo) group by bar; > > or > > select max(foo) where xyz = z; > > To do it properly max/min have to be special-cased and tightly integrated with > other code to handle index scans and aggregates. As it currently stands > they're implemented the same way as any other aggregate, which means they get > to see all the records in the grouping. > > This is a frequently asked question, I'm surprised you didn't find stuff > searching with google. There have been numerous long discussions on this topic > not long ago. People are still trying to think about how to handle this > better. The FAQ does have the example of using ORDER BY LIMIT 1 for MAX(). What we don't have a workaround for is COUNT(*). I think that will require some cached value that obeys MVCC rules of visibility. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: >Greg Stark wrote: > > >>It has nothing to do with MVCC. It has to do with implementing this is hard in >>the general case. >> >>Think of examples like: >> >>select max(foo) group by bar; >> >>or >> >>select max(foo) where xyz = z; >> >>To do it properly max/min have to be special-cased and tightly integrated with >>other code to handle index scans and aggregates. As it currently stands >>they're implemented the same way as any other aggregate, which means they get >>to see all the records in the grouping. >> >>This is a frequently asked question, I'm surprised you didn't find stuff >>searching with google. There have been numerous long discussions on this topic >>not long ago. People are still trying to think about how to handle this >>better. >> >> > >The FAQ does have the example of using ORDER BY LIMIT 1 for MAX(). What >we don't have a workaround for is COUNT(*). I think that will require >some cached value that obeys MVCC rules of visibility. > > IMHO portability is an important point. People are used to MAX() and COUNT(*), and will be surprised that they need some special treatment. While the reasons for this are perfectly explainable, speeding up these aggregates with some extra effort would make porting a bit easier. Regards, Andreas
The world rejoiced as pgadmin@pse-consulting.de (Andreas Pflug) wrote: > Bruce Momjian wrote: > >>Greg Stark wrote: >> >> >>>It has nothing to do with MVCC. It has to do with implementing this is hard in >>>the general case. >>> >>>Think of examples like: >>> >>>select max(foo) group by bar; >>> >>>or >>> >>>select max(foo) where xyz = z; >>> >>>To do it properly max/min have to be special-cased and tightly integrated with >>>other code to handle index scans and aggregates. As it currently stands >>>they're implemented the same way as any other aggregate, which means they get >>>to see all the records in the grouping. >>> >>>This is a frequently asked question, I'm surprised you didn't find stuff >>>searching with google. There have been numerous long discussions on this topic >>>not long ago. People are still trying to think about how to handle this >>>better. >>> >>> >> >>The FAQ does have the example of using ORDER BY LIMIT 1 for MAX(). What >>we don't have a workaround for is COUNT(*). I think that will require >>some cached value that obeys MVCC rules of visibility. >> >> > IMHO portability is an important point. People are used to MAX() and > COUNT(*), and will be surprised that they need some special > treatment. While the reasons for this are perfectly explainable, > speeding up these aggregates with some extra effort would make porting > a bit easier. The availability of cleverness with MAX()/MIN() is no grand surprise; it would be very nice to get some expansion of that to "SELECT VALUE FROM TABLE WHERE (CRITERIA) ORDER BY VALUE DESCENDING LIMIT 1;" But I'm _very_ curious as to what the anticipated treatment to collect COUNT() more efficiently would be. I would expect that it would only be able to get tuned much more if there's NO "where" clause, so that it could use some ("magically-kept-up-to-date") stats on table size. I don't see any way to optimize COUNT when numbers of rows can continually vary. Storing stats somewhere will just make updates more expensive. And if those stats are for the table, that doesn't help me if I want "COUNT(*) FROM TABLE WHERE UPDATED_ON BETWEEN NOW() - '1 day' and NOW()". -- (format nil "~S@~S" "aa454" "freenet.carleton.ca") http://cbbrowne.com/info/linuxdistributions.html "Recursion is the root of computation since it trades description for time." -- Alan Perlis
Christopher Browne wrote: > > IMHO portability is an important point. People are used to MAX() and > > COUNT(*), and will be surprised that they need some special > > treatment. While the reasons for this are perfectly explainable, > > speeding up these aggregates with some extra effort would make porting > > a bit easier. > > The availability of cleverness with MAX()/MIN() is no grand surprise; > it would be very nice to get some expansion of that to "SELECT VALUE > FROM TABLE WHERE (CRITERIA) ORDER BY VALUE DESCENDING LIMIT 1;" > > But I'm _very_ curious as to what the anticipated treatment to collect > COUNT() more efficiently would be. I would expect that it would only > be able to get tuned much more if there's NO "where" clause, so that > it could use some ("magically-kept-up-to-date") stats on table size. > > I don't see any way to optimize COUNT when numbers of rows can > continually vary. Storing stats somewhere will just make updates more > expensive. And if those stats are for the table, that doesn't help me > if I want "COUNT(*) FROM TABLE WHERE UPDATED_ON BETWEEN NOW() - '1 > day' and NOW()". Yes, count would only use the cached stats for non-WHERE clause COUNT(*). My idea is that if a transaction doing a COUNT(*) would first look to see if there already was a visible cached value, and if not, it would do the COUNT(*) and insert into the cache table. Any INSERT/DELETE would remove the value from the cache. As I see it, the commit of the INSERT/DELETE transaction would then auto-invalidate the cache at the exact time the transaction commits. This would allow MVCC visibility of the counts. A trickier idea would be for INSERT/DELETE to UPDATE the cached value. It might be possible to always have a valid cache value for COUNT(*). (COPY would also need to update the cache.) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Thu, 2003-09-04 at 22:02, Bruce Momjian wrote: > My idea is that if a transaction doing a COUNT(*) would first look to > see if there already was a visible cached value, and if not, it would do > the COUNT(*) and insert into the cache table. Any INSERT/DELETE would > remove the value from the cache. As I see it, the commit of the > INSERT/DELETE transaction would then auto-invalidate the cache at the > exact time the transaction commits. This would allow MVCC visibility of > the counts. But this means that some of the time (indeed, *much* of the time), COUNT(*) would require a seqscan of the entire table. Since at many sites that will take an enormous amount of time (and disk I/O), that makes this solution infeasible IMHO. In general, I don't think this is worth doing. -Neil
Neil Conway wrote: > On Thu, 2003-09-04 at 22:02, Bruce Momjian wrote: > > My idea is that if a transaction doing a COUNT(*) would first look to > > see if there already was a visible cached value, and if not, it would do > > the COUNT(*) and insert into the cache table. Any INSERT/DELETE would > > remove the value from the cache. As I see it, the commit of the > > INSERT/DELETE transaction would then auto-invalidate the cache at the > > exact time the transaction commits. This would allow MVCC visibility of > > the counts. > > But this means that some of the time (indeed, *much* of the time), > COUNT(*) would require a seqscan of the entire table. Since at many > sites that will take an enormous amount of time (and disk I/O), that > makes this solution infeasible IMHO. > > In general, I don't think this is worth doing. It is possible it isn't worth doing. Can the INSERT/DELETE incrementing/decrementing the cached count work reliabily? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Neil Conway wrote: >> In general, I don't think this is worth doing. > It is possible it isn't worth doing. Can the INSERT/DELETE > incrementing/decrementing the cached count work reliabily? I don't even see how the notion of a single cached value makes theoretical sense, when in principle every transaction may have a different idea of the correct answer. You could doubtless maintain a fairly good approximate total this way, and that would be highly useful for some applications ... but it isn't COUNT(*). regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Neil Conway wrote: > >> In general, I don't think this is worth doing. > > > It is possible it isn't worth doing. Can the INSERT/DELETE > > incrementing/decrementing the cached count work reliabily? > > I don't even see how the notion of a single cached value makes > theoretical sense, when in principle every transaction may have > a different idea of the correct answer. > > You could doubtless maintain a fairly good approximate total this > way, and that would be highly useful for some applications ... > but it isn't COUNT(*). With MVCC allowing multiple rows with only one visible, I thought the INSERT/DELETE system would work --- once the delete becomes visible, the change becomes visible. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> You could doubtless maintain a fairly good approximate total this >> way, and that would be highly useful for some applications ... >> but it isn't COUNT(*). > With MVCC allowing multiple rows with only one visible, I thought the > INSERT/DELETE system would work --- once the delete becomes visible, the > change becomes visible. Oh, you're imagining the cache as being a row in an ordinary table? I doubt that could work. Multiple transactions trying to update these rows would suffer from contention and deadlock problems, wouldn't they? regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> You could doubtless maintain a fairly good approximate total this > >> way, and that would be highly useful for some applications ... > >> but it isn't COUNT(*). > > > With MVCC allowing multiple rows with only one visible, I thought the > > INSERT/DELETE system would work --- once the delete becomes visible, the > > change becomes visible. > > Oh, you're imagining the cache as being a row in an ordinary table? > I doubt that could work. Multiple transactions trying to update these > rows would suffer from contention and deadlock problems, wouldn't they? Oh, they would, woudn't they. I was thinking of the counter UPDATE as a DELETE and an INSERT. In fact, when we do UPDATE col SET col = col + 1, we lock the row only so we know the count. Instead, could we insert +/-1 records into the cache table that were visible only to the running transaction, and on commit (ON COMMIT TRIGGER) adjust the cached aggregate counts without requiring locks? I know I am just shooting out ideas, but it might give someone else a better idea. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Oops! pgman@candle.pha.pa.us (Bruce Momjian) was seen spray-painting on a wall: > Neil Conway wrote: >> On Thu, 2003-09-04 at 22:02, Bruce Momjian wrote: >> > My idea is that if a transaction doing a COUNT(*) would first look to >> > see if there already was a visible cached value, and if not, it would do >> > the COUNT(*) and insert into the cache table. Any INSERT/DELETE would >> > remove the value from the cache. As I see it, the commit of the >> > INSERT/DELETE transaction would then auto-invalidate the cache at the >> > exact time the transaction commits. This would allow MVCC visibility of >> > the counts. >> >> But this means that some of the time (indeed, *much* of the time), >> COUNT(*) would require a seqscan of the entire table. Since at many >> sites that will take an enormous amount of time (and disk I/O), >> that makes this solution infeasible IMHO. >> >> In general, I don't think this is worth doing. > > It is possible it isn't worth doing. Can the INSERT/DELETE > incrementing/decrementing the cached count work reliabily? Wouldn't this more or less be the same thing as having a trigger that does, upon each insert/delete "update pg_counts set count = count + 1 where reltable = 45232;"? (... where 1 would be -1 for deletes, and where 45232 is the OID of the table...) Technically, it seems _feasible_, albeit with the problem that it turns pg_counts into a pretty horrid bottleneck. If lots of backends are updating that table, then row 45232 in pg_counts becomes troublesome because all those processes have to serialize around updating it. And if I have tables where I insert lots of data, but couldn't care less how many rows they have, this effort is wasted. When I was curious as to how COUNT might be maintained, I was pretty sure that this wouldn't be the preferred method... -- If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me http://cbbrowne.com/info/emacs.html :FATAL ERROR -- ERROR IN ERROR HANDLER
Christopher Browne wrote: > Wouldn't this more or less be the same thing as having a trigger that > does, upon each insert/delete "update pg_counts set count = count + 1 > where reltable = 45232;"? (... where 1 would be -1 for deletes, and where > 45232 is the OID of the table...) > > Technically, it seems _feasible_, albeit with the problem that it > turns pg_counts into a pretty horrid bottleneck. If lots of backends > are updating that table, then row 45232 in pg_counts becomes > troublesome because all those processes have to serialize around > updating it. > > And if I have tables where I insert lots of data, but couldn't care > less how many rows they have, this effort is wasted. > > When I was curious as to how COUNT might be maintained, I was pretty > sure that this wouldn't be the preferred method... See my later idea of the trigger doing +/-1 rather than locking the value during the transaction. If we don't do it this way, I can't think of another way that would honor MVCC visibility. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Fri, 5 Sep 2003, Bruce Momjian wrote: > > When I was curious as to how COUNT might be maintained, I was pretty > > sure that this wouldn't be the preferred method... > > See my later idea of the trigger doing +/-1 rather than locking the > value during the transaction. > > If we don't do it this way, I can't think of another way that would > honor MVCC visibility. A general query cache is something that is fairly clean and which might help both with count(*) and other queries. Many databases has a lot of tables that are more or less stable where this would work fine. From what I have heard mysql has something like this and it works well. For tables that change a lot the the cached queries will almost always be invalid so one might want to let the user decide which tables should never be cached. It could at least be an interesting experiment. -- /Dennis
> A general query cache is something that is fairly clean and which might > help both with count(*) and other queries. > > Many databases has a lot of tables that are more or less stable where this > would work fine. From what I have heard mysql has something like this and > it works well. For tables that change a lot the the cached queries will > almost always be invalid so one might want to let the user decide which > tables should never be cached. It works well because MySQL doesn't have MVCC... Chris
Christopher Browne <cbbrowne@acm.org> writes: > Wouldn't this more or less be the same thing as having a trigger that > does, upon each insert/delete "update pg_counts set count = count + 1 > where reltable = 45232;"? (... where 1 would be -1 for deletes, and where > 45232 is the OID of the table...) I think that's exactly what Bruce was suggesting. A slightly more efficient variant is for each transaction to save up its net deltas, and issue a single UPDATE for each table it's touched just before it commits. But that just reduces the volume of update traffic, it doesn't fundamentally alter the concept. > Technically, it seems _feasible_, albeit with the problem that it > turns pg_counts into a pretty horrid bottleneck. Not to mention a likely source of deadlocks. And it still doesn't solve the fundamental objection that you can't get an MVCC-correct answer by examining the table. An idea I was toying with is to do something similar to what was just suggested to David Skoll for his stats problem: instead of using UPDATEs, use INSERTs of delta records. That is, every time a transaction is about to commit, it INSERTs into the counts table a row like "45232 +1" ("I inserted one row") or "45232 -10" ("I deleted ten rows"). Assume that we somehow initialized the counts table with an entry "45232 total-rows" for each table. Then, a COUNT(*) on table 45232 is equivalent to "SELECT SUM(deltas) FROM counts WHERE reltable = 45232". As long as the number of rows you have to look at to compute this sum is smaller than the number of rows in the original table, it's a win. The cool thing about this approach is that it is actually MVCC-correct. If some transaction has committed, but is uncommitted according to your worldview, your SUM will automatically ignore its delta row. Another cool thing is that the INSERTs don't conflict with each other, so there's no contention or deadlock risk. You would periodically (perhaps during VACUUM) update the counts table with operations that are conceptually like BEGIN; INSERT INTO counts SELECT reltable, SUM(deltas) FROM counts WHERE xid < GLOBALXMIN GROUPBY reltable; DELETE FROM counts WHERE xid < GLOBALXMIN; COMMIT; to sweep together the past deltas from transactions that are so old no one cares about their individual effects anymore (GLOBALXMIN is the same cutoff used by VACUUM to decide it can remove a committed-dead tuple). This prevents the number of delta rows from growing indefinitely large over time. > And if I have tables where I insert lots of data, but couldn't care > less how many rows they have, this effort is wasted. Yes, this mechanism would be hugely expensive in any case. I can't see enabling it overall, it would have to be turned on only for specific tables by user command. It'd be interesting to try to code it as a contrib module that's fired by triggers on the tables you want to track. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > The FAQ does have the example of using ORDER BY LIMIT 1 for MAX(). What > we don't have a workaround for is COUNT(*). I think that will require > some cached value that obeys MVCC rules of visibility. Note that that only handles min()/max() for the whole table. It doesn't handle the GROUP BY case, for that you need DISTINCT ON with an ORDER BY clause. I don't see anything special about count(*) that makes it especially amenable to optimization. In fact I think you're headed to a full updatable materialized views implementation with this approach. Materialized views are basically tables that are guaranteed to always contain the results of a view. They're constructed by executing the specified query (eg "select bar,count(*) n from foo group by bar"). Then updated every time the underlying tables are modified (eg "insert into foo (bar) values (1)" does an "update foo_count_view set n = n+1 where bar=1"). Then they're available for the optimizer to substitute whenever it sees an expression they can answer. (so if you do "select count(*) from foo where bar=1" it gets transformed into "select n from foo_count_view where bar=1"). It's a big project. I think the min/max optimization is one of those things that "has to happen sometime". It's something people expect to work, and as long as it doesn't the database just isn't using the data it already has as well as it could. Materialized views would be nice, Oracle has them largely because they let the Oracle techs make a *huge* increase in their spec numbers. They were at the heart of that challenge a few years ago When Ellison said he would pay a million dollars to anyone who showed that MSSQL could come within a factor of 10 of Oracle. It was impossible only because Oracle wasn't really doing the same order of work because of materialized views. But they're a "would be neat" kind of thing. Nobody comes to the database expecting to find them. If postgres had them it would be really really cool. But it seems like there are more important things to be working on. -- greg
Would it be possible to catch an unconstrained max(id)/min(id) and rewrite it as "select id from table order by id [desc] limit1" on the fly in the parser somewhere? That would require fairly little code, and be transparent to the user. I.e. low hanging fruit. On 5 Sep 2003, Greg Stark wrote: > Note that that only handles min()/max() for the whole table. It doesn't handle > the GROUP BY case, for that you need DISTINCT ON with an ORDER BY clause.
"scott.marlowe" <scott.marlowe@ihs.com> writes: > Would it be possible to catch an unconstrained max(id)/min(id) and rewrite > it as "select id from table order by id [desc] limit1" on the fly in the > parser somewhere? > > That would require fairly little code, and be transparent to the user. > I.e. low hanging fruit. What if there's no index on id? Then it would actually be slower than the straightforward approach. You would have to check both versions and take the one with the lowest cost, or check before rewriting for possible paths on that column. The problem with low hanging fruit is sometimes it makes people stop looking for real solutions. And I think the real solution is worthwhile here. -- greg
Greg Stark <gsstark@mit.edu> writes: > "scott.marlowe" <scott.marlowe@ihs.com> writes: >> Would it be possible to catch an unconstrained max(id)/min(id) and rewrite >> it as "select id from table order by id [desc] limit1" on the fly in the >> parser somewhere? >> That would require fairly little code, and be transparent to the user. >> I.e. low hanging fruit. > What if there's no index on id? Then it would actually be slower than the > straightforward approach. You would have to check both versions and take the > one with the lowest cost, or check before rewriting for possible paths on that > column. If the fruit were all that low-hanging, it would've been done before now, as I think this is all that people coming from other DBs expect. But as Greg points out, it's not really a trivial planner change. There are also semantic issues: how shall the planner decide which aggregates are candidates for this treatment (I don't much care for hardwiring some behavior to the names "max" and "min") and how shall it decide which indexes match a given aggregate? In the presence of multiple operator classes for a datatype, it's not obvious whether a btree index has the same sort order that max/min need. If you dig in the pghackers archives you can find some speculation about extending aggregate definitions to associate max/min with appropriate sort operators, but no one's done the legwork to make a concrete proposal, let alone actually code it up. regards, tom lane