Thread: mdnblocks is an amazing time sink in huge relations
I have been doing some testing with multi-segment relations. (Not having sixty gigabytes laying around, like some folk I've been talking to recently, I rebuilt with a very small RELSEG_SIZE for testing...) I have discovered that performance goes to zero as the number of segments gets large --- in particular, if the number of segments exceeds the number of kernel file descriptors that fd.c thinks it can use, you can take a coffee break while inserting a few tuples :-(.

The main problem is mdnblocks() in md.c, which is called for each tuple inserted; it insists on touching every segment of the relation to verify its length via lseek(). That's an unreasonable number of kernel calls in any case, and if you add a file open and close to each touch because of file-descriptor thrashing, it's coffee break time.

It seems to me that mdnblocks should assume that all segments previously verified to be of length RELSEG_SIZE are still of that length, and only do an _mdnblocks() call on the last element of the segment chain. That way the number of kernel interactions needed is a constant independent of the number of segments in the table. (This doesn't make truncation of the relation by a different backend any more or less dangerous; we're still hosed if we don't get a notification before we try to do something with the relation. I think that is fixed, and if it's not this doesn't make it worse.)

Any objections?

regards, tom lane
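For illustration, here is a minimal standalone sketch of the proposed behavior (not the actual md.c code; BLCKSZ, RELSEG_SIZE, and the struct and function names below are simplified stand-ins): segments already verified to be full are trusted, and only the last file in the chain is lseek()'d.

    /*
     * Simplified sketch of the proposed mdnblocks() behavior -- not the
     * real md.c code.  BLCKSZ, RELSEG_SIZE and the segment struct are
     * stand-ins for the real definitions.
     */
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ      8192        /* bytes per disk block (assumed) */
    #define RELSEG_SIZE 131072      /* blocks per segment file (assumed) */

    typedef struct MdSegment
    {
        int               fd;       /* open file descriptor for this segment */
        struct MdSegment *next;     /* next segment in the chain, or NULL */
    } MdSegment;

    /* Count blocks in a single segment by seeking to its end. */
    static long
    segment_nblocks(const MdSegment *seg)
    {
        off_t   len = lseek(seg->fd, 0, SEEK_END);

        return (len < 0) ? 0 : (long) (len / BLCKSZ);
    }

    /*
     * Total blocks in the relation: trust every non-last segment to be
     * exactly RELSEG_SIZE blocks and lseek() only the last one, so the
     * number of kernel calls stays constant no matter how many segments
     * the table has.
     */
    long
    relation_nblocks(const MdSegment *chain)
    {
        long    nblocks = 0;

        while (chain->next != NULL)
        {
            nblocks += RELSEG_SIZE;     /* previously verified full segment */
            chain = chain->next;
        }
        return nblocks + segment_nblocks(chain);
    }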
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org [mailto:owner-pgsql-hackers@postgreSQL.org] On Behalf Of Tom Lane
> Sent: Tuesday, October 12, 1999 1:12 PM
> To: pgsql-hackers@postgreSQL.org
> Subject: [HACKERS] mdnblocks is an amazing time sink in huge relations
>
> I have been doing some testing with multi-segment relations. (Not having sixty gigabytes laying around, like some folk I've been talking to recently, I rebuilt with a very small RELSEG_SIZE for testing...) I have discovered that performance goes to zero as the number of segments gets large --- in particular, if the number of segments exceeds the number of kernel file descriptors that fd.c thinks it can use, you can take a coffee break while inserting a few tuples :-(. The main problem is mdnblocks() in md.c, which is called for each tuple inserted; it insists on touching every segment of the relation to verify its length via lseek(). That's an unreasonable number of kernel calls in any case, and if you add a file open and close to each touch because of file-descriptor thrashing, it's coffee break time.

It's me who removed the following code from mdnblocks(). The code caused a disaster in case of mdtruncate().

    if (v->mdfd_lstbcnt == RELSEG_SIZE
        || ....

At that time I was anxious about the change in performance, but I had forgotten about it, sorry.

> It seems to me that mdnblocks should assume that all segments previously verified to be of length RELSEG_SIZE are still of that length, and only do an _mdnblocks() call on the last element of the segment chain. That way the number of kernel interactions needed is a constant independent of the number of segments in the table. (This doesn't make truncation of the relation by a different backend any more or less dangerous; we're still hosed if we don't get a notification before we try to do something with the relation. I think that is fixed, and if it's not this doesn't make it worse.)

Currently only the first segment is opened when mdopen() etc. is called. Would you change it to check all segments at first open?

If there is a relation which has multiple segments of size 0, which would be the last segment?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
> It seems to me that mdnblocks should assume that all segments previously verified to be of length RELSEG_SIZE are still of that length, and only do an _mdnblocks() call on the last element of the segment chain. That way the number of kernel interactions needed is a constant independent of the number of segments in the table. (This doesn't make truncation of the relation by a different backend any more or less dangerous; we're still hosed if we don't get a notification before we try to do something with the relation. I think that is fixed, and if it's not this doesn't make it worse.)
>
> Any objections?

Sounds great. Now that the segments work properly, it is time for some optimization. Vadim has complained about the lseek() from day-1. Someday he wants a shared catalog cache to prevent that from being needed.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > It's me who removed the following code from mdnblocks(). > The code caused a disaster in case of mdtruncate(). > > if (v->mdfd_lstbcnt == RELSEG_SIZE > || .... OK, but do you think there's still a risk, given that we've changed the relcache-level interlocking? > Currently only the first segment is opened when mdopen() etc is > called. Would you change to check all segments at first open ? No, I don't see a need to change that logic. I was thinking that since mdnblocks adds another link to the chain whenever it sees that the current segment is exactly RELSEG_SIZE, it could just assume that if the next-segment link is not NULL then this segment must be of size RELSEG_SIZE and it doesn't need to check that again. It'd only need to do an actual check on the last chain element (plus any elements it adds during the current call, of course). We'd rely on the higher-level interlock to close and reopen the virtual file when a truncate has happened. > If there is a relation which has multi-segment of size 0,which would > be the last segment ? Only the last segment can have any size other than RELSEG_SIZE, I think. regards, tom lane
> > Currently only the first segment is opened when mdopen() etc. is called. Would you change it to check all segments at first open?
>
> No, I don't see a need to change that logic. I was thinking that since mdnblocks adds another link to the chain whenever it sees that the current segment is exactly RELSEG_SIZE, it could just assume that if the next-segment link is not NULL then this segment must be of size RELSEG_SIZE and it doesn't need to check that again. It'd only need to do an actual check on the last chain element (plus any elements it adds during the current call, of course). We'd rely on the higher-level interlock to close and reopen the virtual file when a truncate has happened.

You are right. The last segment is always lseeked. And does this mean that the first mdnblocks() opens all segments and checks them?

> > If there is a relation which has multiple segments of size 0, which would be the last segment?
>
> Only the last segment can have any size other than RELSEG_SIZE, I think.

This may be another issue. I have been suspicious about the current implementation of md.c. It relies so much on information about existent physical files. What do you think about the following?

1. Partial blocks (as you know, I have changed the handling of this kind of blocks recently).
2. If a backend was killed or crashed in the middle of execution of mdunlink()/mdtruncate(), half of the segments wouldn't be unlinked/truncated.
3. In the cygwin port, mdunlink()/mdtruncate() may leave segments of 0 length.
4. We couldn't mdcreate() existent files and couldn't mdopen()/mdunlink() non-existent files. So there are some cases where we could neither CREATE TABLE nor DROP TABLE.

I have no solution, but it seems the count of valid segments and blocks should be held in other places.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
(Sorry for slow response, I've been off chasing psort problems...)

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> I have been suspicious about the current implementation of md.c. It relies so much on information about existent physical files.

Yes, but on the other hand we rely completely on those same physical files to hold our data ;-). I don't see anything fundamentally wrong with using the existence and size of a data file as useful information. It's not a substitute for a lock, of course, and there may be places where we need cross-backend interlocks that we haven't got now.

> What do you think about the following?
>
> 1. Partial blocks (as you know, I have changed the handling of this kind of blocks recently).

Yes. I think your fix was good.

> 2. If a backend was killed or crashed in the middle of execution of mdunlink()/mdtruncate(), half of the segments wouldn't be unlinked/truncated.

That's bothered me too. A possible answer would be to do the unlinking back-to-front (zap the last file first); that'd require a few more lines of code in md.c, but a crash midway through would then leave a legal file configuration that another backend could still do something with.

> 3. In the cygwin port, mdunlink()/mdtruncate() may leave segments of 0 length.

I don't understand what causes this. Can you explain?

BTW, I think that having the last segment be 0 length is OK and indeed expected --- mdnblocks will create the next segment as soon as it notices the currently last segment has reached RELSEG_SIZE, even if there's not yet a disk page to put in the next segment. This seems OK to me, although it's not really necessary.

> 4. We couldn't mdcreate() existent files and couldn't mdopen()/mdunlink() non-existent files. So there are some cases where we could neither CREATE TABLE nor DROP TABLE.

True, but I think this is probably the best thing for safety's sake. It seems to me there is too much risk of losing or overwriting valid data if md.c bulls ahead when it finds an unexpected file configuration. I'd rather rely on manual cleanup if things have gotten that seriously out of whack... (but that's just my opinion, perhaps I'm in the minority?)

regards, tom lane
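To make the back-to-front idea concrete, here is a hedged sketch rather than an actual md.c patch; the helper and the "rel", "rel.1", "rel.2" file naming are simplifications. Removing the highest-numbered segment first means a crash partway through leaves a contiguous prefix of files, i.e. a legal, merely shorter relation.

    /*
     * Hypothetical sketch of back-to-front segment removal -- not the real
     * md.c code.  Segment files are assumed to be named "rel", "rel.1",
     * "rel.2", ...; a crash after removing only some of them leaves a
     * contiguous prefix, i.e. a legal (merely truncated) relation.
     */
    #include <stdio.h>
    #include <unistd.h>

    static void
    segment_path(char *buf, size_t buflen, const char *relpath, int segno)
    {
        if (segno == 0)
            snprintf(buf, buflen, "%s", relpath);
        else
            snprintf(buf, buflen, "%s.%d", relpath, segno);
    }

    /*
     * Remove segments nsegs-1 .. 0, last file first.  Returns 0 on success,
     * -1 if an unlink failed; in the failure case the files that remain
     * still form a valid, shorter relation.
     */
    int
    unlink_segments_back_to_front(const char *relpath, int nsegs)
    {
        char    path[1024];

        for (int segno = nsegs - 1; segno >= 0; segno--)
        {
            segment_path(path, sizeof(path), relpath, segno);
            if (unlink(path) < 0)
                return -1;
        }
        return 0;
    }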
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > > I have been suspicious about current implementation of md.c. > > It relies so much on information about existent phisical files. > > Yes, but on the other hand we rely completely on those same physical > files to hold our data ;-). I don't see anything fundamentally > wrong with using the existence and size of a data file as useful > information. It's not a substitute for a lock, of course, and there > may be places where we need cross-backend interlocks that we haven't > got now. > We have to lseek() each time to know the number of blocks of a table file. Isn't it a overhead ? > > How do you think about the following ? > > > > 2. If a backend was killed or crashed in the middle of execution of > > mdunlink()/mdtruncate(),half of segments wouldn't be unlink/ > > truncated. > > That's bothered me too. A possible answer would be to do the unlinking > back-to-front (zap the last file first); that'd require a few more lines > of code in md.c, but a crash midway through would then leave a legal > file configuration that another backend could still do something with. Oops,it's more serious than I have thought. mdunlink() may only truncates a table file by a crash while unlinking back-to-front. A crash while unlinking front-to-back may leave unlinked segments and they would suddenly appear as segments of the recreated table. Seems there's no easy fix. > > 3. In cygwin port,mdunlink()/mdtruncate() may leave segments of 0 > > length. > > I don't understand what causes this. Can you explain? > You call FileUnlink() after FileTrucnate() to unlink in md.c. If FileUnlink() fails there remains segments of 0 length. But it seems not critical in this issue. > > 4. We couldn't mdcreate() existent files and coudn't mdopen()/md > > unlink() non-existent files. So there are some cases that we > > could neither CREATE TABLE nor DROP TABLE. > > True, but I think this is probably the best thing for safety's sake. > It seems to me there is too much risk of losing or overwriting valid > data if md.c bulls ahead when it finds an unexpected file configuration. > I'd rather rely on manual cleanup if things have gotten that seriously > out of whack... (but that's just my opinion, perhaps I'm in the > minority?) > There is another risk. We may remove other table files manually by mistake. And if I were a newcomer,I would not consider PostgreSQL as a real DBMS(Fortunately I have never seen the reference to this). However,I don't object to you because I also have the same anxiety and could provide no easy solution, Probably it would require a lot of work to fix correctly. Postponing real unlink/truncating until commit and creating table files which correspond to their oids ..... etc ... It's same as "DROP TABLE inside transations" requires. Hmm,is it worth the work ? Regards. Hiroshi Inoue Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: >> ... I don't see anything fundamentally >> wrong with using the existence and size of a data file as useful >> information. It's not a substitute for a lock, of course, and there >> may be places where we need cross-backend interlocks that we haven't >> got now. > We have to lseek() each time to know the number of blocks of a table > file. Isn't it a overhead ? True, but lseek is pretty cheap as kernel calls go (the kernel just has to consult the file's inode, which should be in memory already). We're not going to get the info for free; any other way of keeping track of it is going to have its own costs. Vadim's been muttering about using a shared cache for system catalog tuples, which might be a win but I'm not sure (I'm worried about contention for the cache, especially if it's protected by just one or a few spinlocks). Anyway, if we did have one then keeping an accurate block count in the relation's pg_class row would be a practical alternative. >> That's bothered me too. A possible answer would be to do the unlinking >> back-to-front (zap the last file first); that'd require a few more lines >> of code in md.c, but a crash midway through would then leave a legal >> file configuration that another backend could still do something with. > Oops,it's more serious than I have thought. > mdunlink() may only truncates a table file by a crash while unlinking > back-to-front. > A crash while unlinking front-to-back may leave unlinked segments > and they would suddenly appear as segments of the recreated table. > Seems there's no easy fix. Well, it seems to me that the first misbehavior (incomplete delete becomes a partial truncate, and you can try again) is a lot better than the second (incomplete delete leaves an undeletable, unrecreatable table). Should I go ahead and make delete/truncate work back-to-front, or do you see a reason why that'd be a bad thing to do? >>>> 3. In cygwin port,mdunlink()/mdtruncate() may leave segments of 0 >>>> length. >> >> I don't understand what causes this. Can you explain? > You call FileUnlink() after FileTrucnate() to unlink in md.c. If > FileUnlink() > fails there remains segments of 0 length. But it seems not critical in > this issue. Ah, I see. It's not specific to cygwin then. I think you are right that it is not critical, at least not if we change to do the operations back-to-front. Truncating and then failing to delete will still leave a valid file configuration. > Probably it would require a lot of work to fix correctly. > Postponing real unlink/truncating until commit and creating table > files which correspond to their oids ..... etc ... > It's same as "DROP TABLE inside transations" requires. > Hmm,is it worth the work ? I'm not eager to do that either (not just because of the low return on work invested, but because naming table files by OIDs would be a big handicap for debugging and database admin work). However, I'm not quite sure why you see it as a solution to the problem of recovering from a failed md.c operation. The risks would be the same if the deletion was happening during commit, no? And I'd be a lot *more* worried about deleting the wrong file during manual cleanup if the files were named by OID ;-) regards, tom lane
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> >> ... I don't see anything fundamentally wrong with using the existence and size of a data file as useful information. It's not a substitute for a lock, of course, and there may be places where we need cross-backend interlocks that we haven't got now.
>
> > We have to lseek() each time to know the number of blocks of a table file. Isn't that an overhead?
>
> True, but lseek is pretty cheap as kernel calls go (the kernel just has to consult the file's inode, which should be in memory already). We're not going to get the info for free; any other way of keeping track of it is going to have its own costs. Vadim's been muttering about using a shared cache for system catalog tuples, which might be a win but I'm not sure (I'm worried about contention for the cache, especially if it's protected by just one or a few spinlocks). Anyway, if we did have one then keeping an accurate block count in the relation's pg_class row would be a practical alternative.

Seems it's related to a TODO item:

    * Shared catalog cache, reduce lseek()'s by caching table size in shared area

But there would be a problem if we use a shared catalog cache. System tuples being updated are only visible to the updating backend, and other backends should see the committed tuples. On the other hand, an accurate block count should be visible to all backends. Which tuple of a row should we load into the catalog cache and update?

> >> That's bothered me too. A possible answer would be to do the unlinking back-to-front (zap the last file first); that'd require a few more lines of code in md.c, but a crash midway through would then leave a legal file configuration that another backend could still do something with.
>
> > Oops, it's more serious than I had thought. With back-to-front unlinking, a crash during mdunlink() can at worst leave a truncated table file. But a crash while unlinking front-to-back may leave unlinked segments behind, and they would suddenly appear as segments of the recreated table. Seems there's no easy fix.
>
> Well, it seems to me that the first misbehavior (incomplete delete becomes a partial truncate, and you can try again) is a lot better than the second (incomplete delete leaves an undeletable, unrecreatable table). Should I go ahead and make delete/truncate work back-to-front, or do you see a reason why that'd be a bad thing to do?

I also think back-to-front is better.

> > Probably it would require a lot of work to fix correctly: postponing real unlink/truncation until commit, creating table files which correspond to their oids, etc. It's the same as what "DROP TABLE inside transactions" requires. Hmm, is it worth the work?
>
> I'm not eager to do that either (not just because of the low return on work invested, but because naming table files by OIDs would be a big handicap for debugging and database admin work). However, I'm not quite sure why you see it as a solution to the problem of recovering from a failed md.c operation. The risks would be the same if the deletion was happening during commit, no? And I'd be a lot *more* worried about deleting the wrong file during manual cleanup if the files were named by OID ;-)

We don't have to delete relation files even after commit. Backends would never see them and would access the files corresponding to the new oids of the (being) recreated relations. Deletion is necessary only so as not to consume disk space.

For example, vacuum could remove the not-deleted files. It may be a PostgreSQL style.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: >> a shared cache for system catalog tuples, which might be a win but I'm >> not sure (I'm worried about contention for the cache, especially if it's >> protected by just one or a few spinlocks). Anyway, if we did have one >> then keeping an accurate block count in the relation's pg_class row >> would be a practical alternative. > But there would be a problem if we use shared catalog cache. > Being updated system tuples are only visible to an updating backend > and other backends should see committed tuples. > On the other hand,an accurate block count should be visible to all > backends. > Which tuple of a row should we load to catalog cache and update ? Good point --- rolling back a transaction would cancel changes to the pg_class row, but it mustn't cause the relation's file to get truncated (since there could be tuples of other uncommitted transactions in the newly added block(s)). This says that having a block count column in pg_class is the Wrong Thing; we should get rid of relpages entirely. The Right Thing is a separate data structure in shared memory that stores the current physical block count for each active relation. The first backend to touch a given relation would insert an entry, and then subsequent extensions/truncations/deletions would need to update it. We already obtain a special lock when extending a relation, so seems like there'd be no extra locking cost to have a table like this. Anyone up for actually implementing this ;-) ? I have other things I want to work on... >> Well, it seems to me that the first misbehavior (incomplete delete becomes >> a partial truncate, and you can try again) is a lot better than the >> second (incomplete delete leaves an undeletable, unrecreatable table). >> Should I go ahead and make delete/truncate work back-to-front, or do you >> see a reason why that'd be a bad thing to do? > I also think back-to-front is better. OK, I have a couple other little things I want to do in md.c, so I'll see what I can do about that. Even with a shared-memory relation length table, back-to-front truncation would be the safest way to proceed, so we'll want to make this change in any case. > Deletion is necessary only not to consume disk space. > > For example vacuum could remove not deleted files. Hmm ... interesting idea ... but I can hear the complaints from users already... regards, tom lane
Tom Lane wrote:
>
> > > a shared cache for system catalog tuples, which might be a win but I'm not sure (I'm worried about contention for the cache, especially if it's protected by just one or a few spinlocks). Anyway, if we did have one

Commercial DBMSes have this... Isn't it a good reason? -:)

> > But there would be a problem if we use a shared catalog cache. System tuples being updated are only visible to the updating backend, and other backends should see the committed tuples. On the other hand, an accurate block count should be visible to all backends. Which tuple of a row should we load into the catalog cache and update?
>
> Good point --- rolling back a transaction would cancel changes to the pg_class row, but it mustn't cause the relation's file to get truncated (since there could be tuples of other uncommitted transactions in the newly added block(s)).
>
> This says that having a block count column in pg_class is the Wrong Thing; we should get rid of relpages entirely. The Right Thing is a separate data structure in shared memory that stores the current physical block count for each active relation. The first backend to touch a given relation would insert an entry, and then subsequent extensions/truncations/deletions would need to update it. We already obtain a special lock when extending a relation, so it seems like there'd be no extra locking cost to have a table like this.

I supposed that each backend will still use its own catalog cache (after reading entries from the shared one) and synchronize shared/private caches on commit - e.g. update reltuples! relpages will be updated immediately after physical changes - what's the problem with this?

> Anyone up for actually implementing this ;-) ? I have other things I want to work on...

And me too -:))

Vadim
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

[snip]

> > Deletion is necessary only so as not to consume disk space.
> >
> > For example, vacuum could remove the not-deleted files.
>
> Hmm ... interesting idea ... but I can hear the complaints from users already...

My idea is only an analogy of PostgreSQL's simple recovery mechanism for tuples. And my main point is that "delete fails after commit" doesn't harm the database, except that the not-deleted files consume disk space. Of course, it's preferable to delete relation files immediately after (or just at) commit. Useless files are visible, though useless tuples are invisible.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
> Tom Lane wrote:
>
> > > > a shared cache for system catalog tuples, which might be a win but I'm not sure (I'm worried about contention for the cache, especially if it's protected by just one or a few spinlocks). Anyway, if we did have one
>
> Commercial DBMSes have this... Isn't it a good reason? -:)
>
> > > But there would be a problem if we use a shared catalog cache. System tuples being updated are only visible to the updating backend, and other backends should see the committed tuples. On the other hand, an accurate block count should be visible to all backends. Which tuple of a row should we load into the catalog cache and update?
> >
> > Good point --- rolling back a transaction would cancel changes to the pg_class row, but it mustn't cause the relation's file to get truncated (since there could be tuples of other uncommitted transactions in the newly added block(s)).
> >
> > This says that having a block count column in pg_class is the Wrong Thing; we should get rid of relpages entirely. The Right Thing is a separate data structure in shared memory that stores the current physical block count for each active relation. The first backend to touch a given relation would insert an entry, and then subsequent extensions/truncations/deletions would need to update it. We already obtain a special lock when extending a relation, so it seems like there'd be no extra locking cost to have a table like this.
>
> I supposed that each backend will still use its own catalog cache (after reading entries from the shared one) and synchronize shared/private caches on commit - e.g. update reltuples! relpages will be updated immediately after physical changes - what's the problem with this?

Does this mean the following?

1. The shared cache holds committed system tuples.
2. The private cache holds uncommitted system tuples.
3. relpages of the shared cache is updated immediately by a physical change, and the corresponding buffer pages are marked dirty.
4. On commit, the contents of uncommitted tuples except relpages, reltuples, ... are copied to the corresponding tuples in the shared cache, and the combined contents are committed.

If so, catalog cache invalidation would be no longer needed. But synchronization of step 4 may be difficult.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > 1. shared cache holds committed system tuples. > 2. private cache holds uncommitted system tuples. > 3. relpages of shared cache are updated immediately by > phisical change and corresponding buffer pages are > marked dirty. > 4. on commit, the contents of uncommitted tuples except > relpages,reltuples,... are copied to correponding tuples > in shared cache and the combined contents are > committed. > If so,catalog cache invalidation would be no longer needed. > But synchronization of the step 4. may be difficult. I think the main problem is that relpages and reltuples shouldn't be kept in pg_class columns at all, because they need to have very different update behavior from the other pg_class columns. The rest of pg_class is update-on-commit, and we can lock down any one row in the normal MVCC way (if transaction A has modified a row and transaction B also wants to modify it, B waits for A to commit or abort, so it can know which version of the row to start from). Furthermore, there can legitimately be several different values of a row in use in different places: the latest committed, an uncommitted modification, and one or more old values that are still being used by active transactions because they were current when those transactions started. (BTW, the present relcache is pretty bad about maintaining pure MVCC transaction semantics like this, but it seems clear to me that that's the direction we want to go in.) relpages cannot operate this way. To be useful for avoiding lseeks, relpages *must* change exactly when the physical file changes. It matters not at all whether the particular transaction that extended the file ultimately commits or not. Moreover there can be only one correct value (per relation) across the whole system, because there is only one length of the relation file. If we want to take reltuples seriously and try to maintain it on-the-fly, then I think it needs still a third behavior. Clearly it cannot be updated using MVCC rules, or we lose all writer concurrency (if A has added tuples to a rel, B would have to wait for A to commit before it could update reltuples...). Furthermore "updating" isn't a simple matter of storing what you think the new value is; otherwise two transactions adding tuples in parallel would leave the wrong answer after B commits and overwrites A's value. I think it would work for each transaction to keep track of a net delta in reltuples for each table it's changed (total tuples added less total tuples deleted), and then atomically add that value to the table's shared reltuples counter during commit. But that still leaves the problem of how you use the counter during a transaction to get an accurate answer to the question "If I scan this table now, how many tuples will I see?" At the time the question is asked, the current shared counter value might include the effects of transactions that have committed since your transaction started, and therefore are not visible under MVCC rules. I think getting the correct answer would involve making an instantaneous copy of the current counter at the start of your xact, and then adding your own private net-uncommitted-delta to the saved shared counter value when asked the question. This doesn't look real practical --- you'd have to save the reltuples counts of *all* tables in the database at the start of each xact, on the off chance that you might need them. Ugh. Perhaps someone has a better idea. 
In any case, reltuples clearly needs different mechanisms than the ordinary fields in pg_class do, because updating it will be a performance bottleneck otherwise. If we allow reltuples to be updated only by vacuum-like events, as it is now, then I think keeping it in pg_class is still OK.

In short, it seems clear to me that relpages should be removed from pg_class and kept somewhere else if we want to make it more reliable than it is now, and the same for reltuples (but reltuples doesn't behave the same as relpages, and probably ought to be handled differently).

regards, tom lane
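To make the per-transaction delta idea concrete, here is a hedged, self-contained sketch (hypothetical names; a C11 atomic stands in for a per-relation counter kept in shared memory under a spinlock): each transaction accumulates its own net tuple delta and folds it into the shared counter only at commit, and, as noted above, the sum of the shared counter and the private delta is only an approximation under MVCC rules.

    /*
     * Hypothetical sketch of per-transaction reltuples tracking -- not
     * actual PostgreSQL code.  A C11 atomic stands in for a counter kept
     * in shared memory, and one counter stands in for one-per-relation.
     */
    #include <stdatomic.h>

    /* shared, one per relation: net committed tuple count */
    static atomic_long shared_reltuples = 0;

    /* private to the current transaction: net uncommitted delta */
    static long my_reltuples_delta = 0;

    void
    count_insert(void)
    {
        my_reltuples_delta++;
    }

    void
    count_delete(void)
    {
        my_reltuples_delta--;
    }

    /* On commit, atomically fold the private delta into the shared counter. */
    void
    at_commit(void)
    {
        atomic_fetch_add(&shared_reltuples, my_reltuples_delta);
        my_reltuples_delta = 0;
    }

    /* On abort, simply throw the private delta away. */
    void
    at_abort(void)
    {
        my_reltuples_delta = 0;
    }

    /*
     * "How many tuples would I see?"  The shared counter may already
     * include other transactions' committed changes, so this is only an
     * approximation under MVCC rules, as the mail above points out.
     */
    long
    approximate_reltuples(void)
    {
        return atomic_load(&shared_reltuples) + my_reltuples_delta;
    }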
Hiroshi Inoue wrote:
>
> > I supposed that each backend will still use its own catalog cache (after reading entries from the shared one) and synchronize shared/private caches on commit - e.g. update reltuples! relpages will be updated immediately after physical changes - what's the problem with this?
>
> Does this mean the following?
>
> 1. The shared cache holds committed system tuples.
> 2. The private cache holds uncommitted system tuples.
> 3. relpages of the shared cache is updated immediately by a physical change, and the corresponding buffer pages are marked dirty.
> 4. On commit, the contents of uncommitted tuples except
>    relpages, reltuples, ... are copied to the corresponding tuples
               ^^^^^^^^^
reltuples in the shared catalog cache (SCC) will be updated! If a transaction inserted some tuples then SCC->reltuples will be incremented, etc.

> in the shared cache and the combined contents are committed.
>
> If so, catalog cache invalidation would be no longer needed.

I never liked our invalidation code -:)

> But synchronization of step 4 may be difficult.

What's easy in this life? -:)

Vadim
Tom Lane wrote:
>
> I think the main problem is that relpages and reltuples shouldn't be kept in pg_class columns at all, because they need to have very different update behavior from the other pg_class columns.

Yes, but is this a reason to move them somewhere else? Let's update them differently (i.e. update in-place) but keep them in pg_class. Should we provide read consistency for these internal-use columns? I'm not sure.

> If we want to take reltuples seriously and try to maintain it on-the-fly, then I think it needs still a third behavior. Clearly

...snip...

I agreed that there is no way to get an accurate estimation of the # of rows to be seen by a query... Well, let's keep an up-to-date # of rows present in the relation: in any case a query will have to read them, and this is what we need to estimate the cost of simple scans; as for the costs of joins - no way, currently(?) -:(

But please remember that there is another SCC goal - faster catalog access...

Vadim
> I agreed that there is no way to get an accurate estimation of the # of rows to be seen by a query... Well, let's keep an up-to-date # of rows present in the relation: in any case a query will have to read them, and this is what we need to estimate the cost of simple scans; as for the costs of joins - no way, currently(?) -:(
>
> But please remember that there is another SCC goal - faster catalog access...

I want to index access more cache entries on cache miss for 7.0.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> Hiroshi Inoue wrote:
>
> > > I supposed that each backend will still use its own catalog cache (after reading entries from the shared one) and synchronize shared/private caches on commit - e.g. update reltuples! relpages will be updated immediately after physical changes - what's the problem with this?
> >
> > Does this mean the following?
> >
> > 1. The shared cache holds committed system tuples.
> > 2. The private cache holds uncommitted system tuples.
> > 3. relpages of the shared cache is updated immediately by a physical change, and the corresponding buffer pages are marked dirty.
> > 4. On commit, the contents of uncommitted tuples except
> >    relpages, reltuples, ... are copied to the corresponding tuples
>                ^^^^^^^^^
> reltuples in the shared catalog cache (SCC) will be updated! If a transaction inserted some tuples then SCC->reltuples will be incremented, etc.

Are system tuples only modified, or are they (insert and delet)ed like user tuples, when reltuples is updated? If only modified, we couldn't use it in SERIALIZABLE mode.

> > in the shared cache and the combined contents are committed.
> >
> > If so, catalog cache invalidation would be no longer needed.
>
> I never liked our invalidation code -:)
>
> > But synchronization of step 4 may be difficult.
>
> What's easy in this life? -:)

As for relpages (block count), my first thought was that PostgreSQL relation files have their control data on their first page; I was surprised to learn there's no such data.

It seems catalog cache sharing is another issue, as Tom says. Of course it's unnatural to hold separate catalog caches.

I will be absent for the rest of this week. I will think about it for a while.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Bruce Momjian wrote:
>
> > I agreed that there is no way to get an accurate estimation of the # of rows to be seen by a query... Well, let's keep an up-to-date # of rows present in the relation: in any case a query will have to read them, and this is what we need to estimate the cost of simple scans; as for the costs of joins - no way, currently(?) -:(
> >
> > But please remember that there is another SCC goal - faster catalog access...
>
> I want to index access more cache entries on cache miss for 7.0.

Good. But in any case we'll gain faster access from a _shared_ cache.

Vadim
Vadim Mikheev <vadim@krs.ru> writes:
> I agreed that there is no way to get an accurate estimation of the # of rows to be seen by a query... Well, let's keep an up-to-date # of rows present in the relation: in any case a query will have to read them, and this is what we need to estimate the cost of simple scans; as for the costs of joins - no way, currently(?) -:(

The optimizer is perfectly happy with approximate tuple counts. It has to make enough other guesstimates that I really doubt an exact tuple count could help it measurably. So updating the count via VACUUM is an appropriate solution as far as the optimizer is concerned. The only good reason I've ever seen for trying to keep an exact tuple count at all times is to allow short-circuiting of SELECT COUNT(*) queries --- and even there, it's only useful if there's no WHERE or GROUP BY.

As far as I can see, keeping an exact tuple count couldn't possibly be worth the overhead it'd take. Keeping an exact block count may have the same problem, but I'm not sure either way.

regards, tom lane
Hiroshi Inoue wrote:
>
> > > Does this mean the following?
> > >
> > > 1. The shared cache holds committed system tuples.
> > > 2. The private cache holds uncommitted system tuples.
> > > 3. relpages of the shared cache is updated immediately by a physical change, and the corresponding buffer pages are marked dirty.
> > > 4. On commit, the contents of uncommitted tuples except
> > >    relpages, reltuples, ... are copied to the corresponding tuples
> >                 ^^^^^^^^^
> > reltuples in the shared catalog cache (SCC) will be updated! If a transaction inserted some tuples then SCC->reltuples will be incremented, etc.
>
> Are system tuples only modified, or are they (insert and delet)ed like user tuples, when reltuples is updated?
> If only modified, we couldn't use it in SERIALIZABLE mode.
     ^^^^^^^^^^^^^                     ^^^^^^^^^^^^^^^^^^^^
                                            ...this...

I'm not sure that we must provide read consistency for internal-use columns...

Nevertheless, I agreed that keeping internal-use columns in a table is a bad thing, but let's do it for a while: I believe that someday we'll re-implement our mdmgr - using a separate file for each table/index is a bad thing too - and move these columns "somewhere else" on that day.

Vadim
Tom Lane wrote:
>
> Vadim Mikheev <vadim@krs.ru> writes:
> > I agreed that there is no way to get an accurate estimation of the # of rows to be seen by a query... Well, let's keep an up-to-date # of rows present in the relation: in any case a query will have to read them, and this is what we need to estimate the cost of simple scans; as for the costs of joins - no way, currently(?) -:(
>
> The optimizer is perfectly happy with approximate tuple counts. It has to make enough other guesstimates that I really doubt an exact tuple count could help it measurably. So updating the count via VACUUM is an appropriate solution as far as the optimizer is concerned. The only

I'm not sure: scans read all (visible/invisible, committed/uncommitted) tuples before making the visibility decision... though relpages is the primary value for cost estimation.

> good reason I've ever seen for trying to keep an exact tuple count at all times is to allow short-circuiting of SELECT COUNT(*) queries --- and even there, it's only useful if there's no WHERE or GROUP BY.

...and only in READ COMMITTED mode...

> As far as I can see, keeping an exact tuple count couldn't possibly be worth the overhead it'd take. Keeping an exact block count may have the same problem, but I'm not sure either way.

relpages is the more significant thing to know. Extending a relation by one block at a time is bad for insert performance. It would be nice to have two values per relation in the shared cache - # of blocks and last block used. On the other hand, this is mostly an mdmgr issue.

But in any case, it seems to me that using shmem for a shared catalog cache is much more worthwhile than for shared cache invalidation...

Vadim
> Hiroshi Inoue wrote:
>
> > Are system tuples only modified, or are they (insert and delet)ed like user tuples, when reltuples is updated?
> > If only modified, we couldn't use it in SERIALIZABLE mode.
>      ^^^^^^^^^^^^^                     ^^^^^^^^^^^^^^^^^^^^
>                                             ...this...
>
> I'm not sure that we must provide read consistency for internal-use columns...

As for relpages, read consistency has no meaning because they are out of transaction control. But as for reltuples, isn't it difficult to commit/rollback correctly without using insert-delete updates? Does your WAL system make it possible?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
> But in any case, it seems to me that using shmem for a shared catalog cache is much more worthwhile than for shared cache invalidation...

I don't like the current cache invalidation either. And a shared catalog cache would make it easy to roll back the catalog cache (the catalog/relation cache is not rolled back correctly now). But there are some problems with implementing a shared catalog cache.

1. Relation cache invalidation remains. It has almost the same mechanism as catalog cache invalidation. Cache invalidation is still incomprehensible as a whole.

2. Is it clear how consistency is kept between system tuples? It's quite unclear to me, and there are at least 4 anxieties.

   a. Locking for system tuples is short term (this is for DDL commands inside transactions). This would break 2-phase locking easily. Is there another principle instead?

   b. SnapshotNow is used to access system tuples in most cases. SnapshotNow isn't a real snapshot, i.e. SnapshotNow couldn't guarantee read consistency.

   c. Row-level locking is not implemented for system tuples yet.

   d. AccessShare locks are acquired for system tuples in many places. But does it make any sense?

3. If neither relpages nor reltuples is held, are there any other things to speed up?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp