Re: [HACKERS] mdnblocks is an amazing time sink in huge relations - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [HACKERS] mdnblocks is an amazing time sink in huge relations |
Date | |
Msg-id | 9036.940342155@sss.pgh.pa.us Whole thread Raw |
In response to | RE: [HACKERS] mdnblocks is an amazing time sink in huge relations ("Hiroshi Inoue" <Inoue@tpf.co.jp>) |
List | pgsql-hackers |
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > 1. shared cache holds committed system tuples. > 2. private cache holds uncommitted system tuples. > 3. relpages of shared cache are updated immediately by > phisical change and corresponding buffer pages are > marked dirty. > 4. on commit, the contents of uncommitted tuples except > relpages,reltuples,... are copied to correponding tuples > in shared cache and the combined contents are > committed. > If so,catalog cache invalidation would be no longer needed. > But synchronization of the step 4. may be difficult. I think the main problem is that relpages and reltuples shouldn't be kept in pg_class columns at all, because they need to have very different update behavior from the other pg_class columns. The rest of pg_class is update-on-commit, and we can lock down any one row in the normal MVCC way (if transaction A has modified a row and transaction B also wants to modify it, B waits for A to commit or abort, so it can know which version of the row to start from). Furthermore, there can legitimately be several different values of a row in use in different places: the latest committed, an uncommitted modification, and one or more old values that are still being used by active transactions because they were current when those transactions started. (BTW, the present relcache is pretty bad about maintaining pure MVCC transaction semantics like this, but it seems clear to me that that's the direction we want to go in.) relpages cannot operate this way. To be useful for avoiding lseeks, relpages *must* change exactly when the physical file changes. It matters not at all whether the particular transaction that extended the file ultimately commits or not. Moreover there can be only one correct value (per relation) across the whole system, because there is only one length of the relation file. If we want to take reltuples seriously and try to maintain it on-the-fly, then I think it needs still a third behavior. Clearly it cannot be updated using MVCC rules, or we lose all writer concurrency (if A has added tuples to a rel, B would have to wait for A to commit before it could update reltuples...). Furthermore "updating" isn't a simple matter of storing what you think the new value is; otherwise two transactions adding tuples in parallel would leave the wrong answer after B commits and overwrites A's value. I think it would work for each transaction to keep track of a net delta in reltuples for each table it's changed (total tuples added less total tuples deleted), and then atomically add that value to the table's shared reltuples counter during commit. But that still leaves the problem of how you use the counter during a transaction to get an accurate answer to the question "If I scan this table now, how many tuples will I see?" At the time the question is asked, the current shared counter value might include the effects of transactions that have committed since your transaction started, and therefore are not visible under MVCC rules. I think getting the correct answer would involve making an instantaneous copy of the current counter at the start of your xact, and then adding your own private net-uncommitted-delta to the saved shared counter value when asked the question. This doesn't look real practical --- you'd have to save the reltuples counts of *all* tables in the database at the start of each xact, on the off chance that you might need them. Ugh. Perhaps someone has a better idea. In any case, reltuples clearly needs different mechanisms than the ordinary fields in pg_class do, because updating it will be a performance bottleneck otherwise. If we allow reltuples to be updated only by vacuum-like events, as it is now, then I think keeping it in pg_class is still OK. In short, it seems clear to me that relpages should be removed from pg_class and kept somewhere else if we want to make it more reliable than it is now, and the same for reltuples (but reltuples doesn't behave the same as relpages, and probably ought to be handled differently). regards, tom lane
pgsql-hackers by date: