Thread: mdnblocks is an amazing time sink in huge relations

mdnblocks is an amazing time sink in huge relations

From
Tom Lane
Date:
I have been doing some testing with multi-segment relations.
(Not having sixty gigabytes laying around, like some folk I've been
talking to recently, I rebuilt with a very small RELSEG_SIZE for
testing...)  I have discovered that performance goes to zero as the
number of segments gets large --- in particular, if the number of
segments exceeds the number of kernel file descriptors that fd.c
thinks it can use, you can take a coffee break while inserting a
few tuples :-(.  The main problem is mdnblocks() in md.c, which
is called for each tuple inserted; it insists on touching every
segment of the relation to verify its length via lseek().  That's an
unreasonable number of kernel calls in any case, and if you add a file
open and close to each touch because of file-descriptor thrashing,
it's coffee break time.
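
For illustration, here is a minimal standalone sketch of the behavior being
described, using plain POSIX calls and a made-up Seg chain rather than the
actual md.c code.  Every segment in the chain costs one lseek() per
mdnblocks() call, plus an open() whenever descriptor thrashing has closed
its file:

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ      8192
    #define RELSEG_SIZE 131072          /* blocks per segment file */

    /* hypothetical, simplified stand-in for md.c's segment chain */
    typedef struct Seg
    {
        int         fd;                 /* -1 if the descriptor was given up */
        const char *path;               /* "table", "table.1", "table.2", ... */
        struct Seg *next;               /* next segment, or NULL */
    } Seg;

    /*
     * The complained-about behavior: walk the whole chain and lseek()
     * every segment to learn the total block count, so the kernel-call
     * cost of each call grows with the number of segments.
     */
    static long
    nblocks_touch_everything(Seg *chain)
    {
        long    total = 0;

        for (Seg *v = chain; v != NULL; v = v->next)
        {
            if (v->fd < 0)              /* descriptor thrashing: reopen */
                v->fd = open(v->path, O_RDWR);
            total += lseek(v->fd, 0, SEEK_END) / BLCKSZ;
        }
        return total;
    }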

It seems to me that mdnblocks should assume that all segments
previously verified to be of length RELSEG_SIZE are still of that
length, and only do an _mdnblocks() call on the last element of the
segment chain.  That way the number of kernel interactions needed
is a constant independent of the number of segments in the table.
(This doesn't make truncation of the relation by a different backend
any more or less dangerous; we're still hosed if we don't get a
notification before we try to do something with the relation.
I think that is fixed, and if it's not this doesn't make it worse.)

Any objections?
        regards, tom lane


RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
> Sent: Tuesday, October 12, 1999 1:12 PM
> To: pgsql-hackers@postgreSQL.org
> Subject: [HACKERS] mdnblocks is an amazing time sink in huge relations
> 
> 
> I have been doing some testing with multi-segment relations.
> (Not having sixty gigabytes laying around, like some folk I've been
> talking to recently, I rebuilt with a very small RELSEG_SIZE for
> testing...)  I have discovered that performance goes to zero as the
> number of segments gets large --- in particular, if the number of
> segments exceeds the number of kernel file descriptors that fd.c
> thinks it can use, you can take a coffee break while inserting a
> few tuples :-(.  The main problem is mdnblocks() in md.c, which
> is called for each tuple inserted; it insists on touching every
> segment of the relation to verify its length via lseek().  That's an
> unreasonable number of kernel calls in any case, and if you add a file
> open and close to each touch because of file-descriptor thrashing,
> it's coffee break time.
>

It was me who removed the following code from mdnblocks().
The code caused a disaster in the case of mdtruncate().
               if (v->mdfd_lstbcnt == RELSEG_SIZE || ....

At that time I was anxious about the performance impact,
but I had forgotten about it, sorry.

> It seems to me that mdnblocks should assume that all segments
> previously verified to be of length RELSEG_SIZE are still of that
> length, and only do an _mdnblocks() call on the last element of the
> segment chain.  That way the number of kernel interactions needed
> is a constant independent of the number of segments in the table.
> (This doesn't make truncation of the relation by a different backend
> any more or less dangerous; we're still hosed if we don't get a
> notification before we try to do something with the relation.
> I think that is fixed, and if it's not this doesn't make it worse.)
>

Currently only the first segment is opened when mdopen() etc is
called.  Would you change to check all segments at first open ?

If there is a relation which has multiple segments of size 0, which
would be the last segment ?

Regards. 
Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Bruce Momjian
Date:
> It seems to me that mdnblocks should assume that all segments
> previously verified to be of length RELSEG_SIZE are still of that
> length, and only do an _mdnblocks() call on the last element of the
> segment chain.  That way the number of kernel interactions needed
> is a constant independent of the number of segments in the table.
> (This doesn't make truncation of the relation by a different backend
> any more or less dangerous; we're still hosed if we don't get a
> notification before we try to do something with the relation.
> I think that is fixed, and if it's not this doesn't make it worse.)
> 
> Any objections?

Sounds great.  Now that the segments work properly, it is time for some
optimization.  Vadim has complained about the lseek() from day-1. 
Someday he wants a shared catalog cache to prevent that from being
needed.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> It's me who removed the following code from mdnblocks().
> The code caused a disaster in case of mdtruncate().
>
>                 if (v->mdfd_lstbcnt == RELSEG_SIZE
>             || .... 

OK, but do you think there's still a risk, given that we've changed
the relcache-level interlocking?

> Currently only the first segment is opened when mdopen() etc is
> called.  Would you change to check all segments at first open ?

No, I don't see a need to change that logic.  I was thinking that
since mdnblocks adds another link to the chain whenever it sees that
the current segment is exactly RELSEG_SIZE, it could just assume that
if the next-segment link is not NULL then this segment must be of
size RELSEG_SIZE and it doesn't need to check that again.  It'd only
need to do an actual check on the last chain element (plus any elements
it adds during the current call, of course).  We'd rely on the
higher-level interlock to close and reopen the virtual file when a
truncate has happened.
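
Continuing the simplified sketch from the first message (the same
hypothetical Seg chain, still not the real md.c code), the proposal amounts
to trusting every chain element that has a successor and doing the kernel
call only for the last one:

    /*
     * Proposed behavior: a segment that has a successor was already seen
     * to be exactly RELSEG_SIZE when that successor was added, so assume
     * it still is and lseek() only the last chain element.  The number
     * of kernel calls per invocation is now constant, independent of the
     * number of segments.
     */
    static long
    nblocks_check_last_only(Seg *chain)
    {
        long    total = 0;
        Seg    *v = chain;

        while (v->next != NULL)
        {
            total += RELSEG_SIZE;       /* assumed still full length */
            v = v->next;
        }
        if (v->fd < 0)
            v->fd = open(v->path, O_RDWR);
        return total + lseek(v->fd, 0, SEEK_END) / BLCKSZ;
    }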

> If there is a relation which has multi-segment of size 0,which would
> be the last segment ?

Only the last segment can have any size other than RELSEG_SIZE, I think.
        regards, tom lane


RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
> 
> > Currently only the first segment is opened when mdopen() etc is
> > called.  Would you change to check all segments at first open ?
> 
> No, I don't see a need to change that logic.  I was thinking that
> since mdnblocks adds another link to the chain whenever it sees that
> the current segment is exactly RELSEG_SIZE, it could just assume that
> if the next-segment link is not NULL then this segment must be of
> size RELSEG_SIZE and it doesn't need to check that again.  It'd only
> need to do an actual check on the last chain element (plus any elements
> it adds during the current call, of course).  We'd rely on the
> higher-level interlock to close and reopen the virtual file when a
> truncate has happened.
>

You are right.
The last segment is always lseek()ed.
And does this mean that the first mdnblocks() call opens all segments
and checks them ?
> > If there is a relation which has multi-segment of size 0,which would
> > be the last segment ?
> 
> Only the last segment can have any size other than RELSEG_SIZE, I think.
>

This may be another issue.
I have been suspicious about the current implementation of md.c.
It relies so much on information about the existent physical files.

What do you think about the following ?

1. Partial blocks (as you know, I have changed the handling of this
   kind of blocks recently).
2. If a backend was killed or crashed in the middle of execution of
   mdunlink()/mdtruncate(), half of the segments wouldn't be unlinked/
   truncated.
3. In the cygwin port, mdunlink()/mdtruncate() may leave segments of 0
   length.
4. We couldn't mdcreate() existent files and couldn't mdopen()/
   mdunlink() non-existent files.  So there are some cases where we
   could neither CREATE TABLE nor DROP TABLE.

I have no solution, but it seems the count of valid segments and blocks
should be held somewhere else.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Tom Lane
Date:
(Sorry for slow response, I've been off chasing psort problems...)

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> I have been suspicious about current implementation of md.c.
> It relies so much on information about existent phisical files.

Yes, but on the other hand we rely completely on those same physical
files to hold our data ;-).  I don't see anything fundamentally
wrong with using the existence and size of a data file as useful
information.  It's not a substitute for a lock, of course, and there
may be places where we need cross-backend interlocks that we haven't
got now.

> How do you think about the following ?
>
> 1. Partial blocks(As you know,I have changed the handling of this
>     kind of blocks recently).

Yes.  I think your fix was good.

> 2. If a backend was killed or crashed in the middle of execution of 
>     mdunlink()/mdtruncate(),half of segments wouldn't be unlink/
>     truncated.

That's bothered me too.  A possible answer would be to do the unlinking
back-to-front (zap the last file first); that'd require a few more lines
of code in md.c, but a crash midway through would then leave a legal
file configuration that another backend could still do something with.
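
A minimal sketch of that ordering, under the same assumptions as the
earlier sketches (plain POSIX, a hypothetical naming convention of path,
path.1, path.2, ..., not the real md.c/fd.c routines):

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Remove the highest-numbered segment first.  If we crash partway
     * through, what is left is "path", "path.1", ..., "path.k", which is
     * a legal (merely shortened) relation that a later attempt can still
     * operate on; nothing is orphaned.
     */
    static int
    unlink_back_to_front(const char *path, int nsegments)
    {
        char    segpath[1024];

        for (int segno = nsegments - 1; segno > 0; segno--)
        {
            snprintf(segpath, sizeof(segpath), "%s.%d", path, segno);
            if (unlink(segpath) < 0)
                return -1;          /* stop; the remaining prefix is consistent */
        }
        return unlink(path);        /* the base segment goes last */
    }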

> 3. In cygwin port,mdunlink()/mdtruncate() may leave segments of 0
>     length. 

I don't understand what causes this.  Can you explain?

BTW, I think that having the last segment be 0 length is OK and indeed
expected --- mdnblocks will create the next segment as soon as it
notices the currently last segment has reached RELSEG_SIZE, even if
there's not yet a disk page to put in the next segment.  This seems
OK to me, although it's not really necessary.

> 4. We couldn't mdcreate() existent files and coudn't mdopen()/md
>     unlink() non-existent files.  So there are some cases that we
>     could neither CREATE TABLE nor DROP TABLE. 

True, but I think this is probably the best thing for safety's sake.
It seems to me there is too much risk of losing or overwriting valid
data if md.c bulls ahead when it finds an unexpected file configuration.
I'd rather rely on manual cleanup if things have gotten that seriously
out of whack... (but that's just my opinion, perhaps I'm in the
minority?)
        regards, tom lane


RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > I have been suspicious about current implementation of md.c.
> > It relies so much on information about existent phisical files.
>
> Yes, but on the other hand we rely completely on those same physical
> files to hold our data ;-).  I don't see anything fundamentally
> wrong with using the existence and size of a data file as useful
> information.  It's not a substitute for a lock, of course, and there
> may be places where we need cross-backend interlocks that we haven't
> got now.
>

We have to lseek() each time to know the number of blocks of a table
file.  Isn't that an overhead ?

> > How do you think about the following ?
> >
> > 2. If a backend was killed or crashed in the middle of execution of
> >     mdunlink()/mdtruncate(),half of segments wouldn't be unlink/
> >     truncated.
>
> That's bothered me too.  A possible answer would be to do the unlinking
> back-to-front (zap the last file first); that'd require a few more lines
> of code in md.c, but a crash midway through would then leave a legal
> file configuration that another backend could still do something with.

Oops, it's more serious than I had thought.
With back-to-front unlinking, a crash during mdunlink() would at worst
leave the table file truncated.
A crash while unlinking front-to-back may leave some segments not
unlinked, and they would suddenly appear as segments of the recreated
table.
Seems there's no easy fix.

> > 3. In cygwin port,mdunlink()/mdtruncate() may leave segments of 0
> >     length.
>
> I don't understand what causes this.  Can you explain?
>

In md.c you call FileUnlink() after FileTruncate() to unlink.  If
FileUnlink() fails, there remain segments of 0 length.  But it seems
not critical for this issue.

> > 4. We couldn't mdcreate() existent files and coudn't mdopen()/md
> >     unlink() non-existent files.  So there are some cases that we
> >     could neither CREATE TABLE nor DROP TABLE.
>
> True, but I think this is probably the best thing for safety's sake.
> It seems to me there is too much risk of losing or overwriting valid
> data if md.c bulls ahead when it finds an unexpected file configuration.
> I'd rather rely on manual cleanup if things have gotten that seriously
> out of whack... (but that's just my opinion, perhaps I'm in the
> minority?)
>

There is another risk.
We may remove other tables' files manually by mistake.
And if I were a newcomer, I would not consider PostgreSQL a
real DBMS (fortunately I have never seen such a reference).

However, I don't object to you, because I also have the same anxiety
and could provide no easy solution.

Probably it would require a lot of work to fix correctly.
Postponing the real unlink/truncate until commit and creating table
files which correspond to their oids ..... etc ...
It's the same as what "DROP TABLE inside transactions" requires.

Hmm, is it worth the work ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> ...  I don't see anything fundamentally
>> wrong with using the existence and size of a data file as useful
>> information.  It's not a substitute for a lock, of course, and there
>> may be places where we need cross-backend interlocks that we haven't
>> got now.

> We have to lseek() each time to know the number of blocks of a table
> file.  Isn't it a overhead ?

True, but lseek is pretty cheap as kernel calls go (the kernel just has
to consult the file's inode, which should be in memory already).  We're
not going to get the info for free; any other way of keeping track of
it is going to have its own costs.  Vadim's been muttering about using
a shared cache for system catalog tuples, which might be a win but I'm
not sure (I'm worried about contention for the cache, especially if it's
protected by just one or a few spinlocks).  Anyway, if we did have one
then keeping an accurate block count in the relation's pg_class row
would be a practical alternative.

>> That's bothered me too.  A possible answer would be to do the unlinking
>> back-to-front (zap the last file first); that'd require a few more lines
>> of code in md.c, but a crash midway through would then leave a legal
>> file configuration that another backend could still do something with.

> Oops,it's more serious than I have thought.
> mdunlink() may only truncates a table file by a crash while unlinking
> back-to-front.
> A crash while unlinking front-to-back may leave unlinked segments
> and they would suddenly appear as segments of the recreated table.
> Seems there's no easy fix.

Well, it seems to me that the first misbehavior (incomplete delete becomes
a partial truncate, and you can try again) is a lot better than the
second (incomplete delete leaves an undeletable, unrecreatable table).
Should I go ahead and make delete/truncate work back-to-front, or do you
see a reason why that'd be a bad thing to do?

>>>> 3. In cygwin port,mdunlink()/mdtruncate() may leave segments of 0
>>>> length.
>> 
>> I don't understand what causes this.  Can you explain?

> You call FileUnlink() after FileTrucnate() to unlink in md.c. If
> FileUnlink()
> fails there remains segments of 0 length. But it seems not critical in
> this issue.

Ah, I see.  It's not specific to cygwin then.  I think you are right
that it is not critical, at least not if we change to do the operations
back-to-front.  Truncating and then failing to delete will still leave
a valid file configuration.

> Probably it would require a lot of work to fix correctly.
> Postponing real unlink/truncating until commit and creating table
> files which correspond to their oids ..... etc ...
> It's same as "DROP TABLE inside transations" requires.
> Hmm,is it worth the work ?

I'm not eager to do that either (not just because of the low return on
work invested, but because naming table files by OIDs would be a big
handicap for debugging and database admin work).  However, I'm not
quite sure why you see it as a solution to the problem of recovering
from a failed md.c operation.  The risks would be the same if the
deletion was happening during commit, no?  And I'd be a lot *more*
worried about deleting the wrong file during manual cleanup if the
files were named by OID ;-)
        regards, tom lane


RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> >> ...  I don't see anything fundamentally
> >> wrong with using the existence and size of a data file as useful
> >> information.  It's not a substitute for a lock, of course, and there
> >> may be places where we need cross-backend interlocks that we haven't
> >> got now.
>
> > We have to lseek() each time to know the number of blocks of a table
> > file.  Isn't it a overhead ?
>
> True, but lseek is pretty cheap as kernel calls go (the kernel just has
> to consult the file's inode, which should be in memory already).  We're
> not going to get the info for free; any other way of keeping track of
> it is going to have its own costs.  Vadim's been muttering about using
> a shared cache for system catalog tuples, which might be a win but I'm
> not sure (I'm worried about contention for the cache, especially if it's
> protected by just one or a few spinlocks).  Anyway, if we did have one
> then keeping an accurate block count in the relation's pg_class row
> would be a practical alternative.
>

Seems it's related to a TODO item:
* Shared catalog cache, reduce lseek()'s by caching table size in shared area

But there would be a problem if we use a shared catalog cache.
System tuples being updated are only visible to the updating backend,
and other backends should see the committed tuples.
On the other hand, an accurate block count should be visible to all
backends.
Which tuple of a row should we load into the catalog cache and update ?

> >> That's bothered me too.  A possible answer would be to do the unlinking
> >> back-to-front (zap the last file first); that'd require a few
> more lines
> >> of code in md.c, but a crash midway through would then leave a legal
> >> file configuration that another backend could still do something with.
>
> > Oops,it's more serious than I have thought.
> > mdunlink() may only truncates a table file by a crash while unlinking
> > back-to-front.
> > A crash while unlinking front-to-back may leave unlinked segments
> > and they would suddenly appear as segments of the recreated table.
> > Seems there's no easy fix.
>
> Well, it seems to me that the first misbehavior (incomplete delete becomes
> a partial truncate, and you can try again) is a lot better than the
> second (incomplete delete leaves an undeletable, unrecreatable table).
> Should I go ahead and make delete/truncate work back-to-front, or do you
> see a reason why that'd be a bad thing to do?
>

I also think back-to-front is better.

> > Probably it would require a lot of work to fix correctly.
> > Postponing real unlink/truncating until commit and creating table
> > files which correspond to their oids ..... etc ...
> > It's same as "DROP TABLE inside transations" requires.
> > Hmm,is it worth the work ?
>
> I'm not eager to do that either (not just because of the low return on
> work invested, but because naming table files by OIDs would be a big
> handicap for debugging and database admin work).  However, I'm not
> quite sure why you see it as a solution to the problem of recovering
> from a failed md.c operation.  The risks would be the same if the
> deletion was happening during commit, no?  And I'd be a lot *more*
> worried about deleting the wrong file during manual cleanup if the
> files were named by OID ;-)
>

We don't have to delete relation files even after commit.
Backends would never see them and would access the files
corresponding to the new oids of the (being) recreated relations.
Deletion is necessary only so as not to consume disk space.

For example, vacuum could remove the not-deleted files.
It may be a PostgreSQL style.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> a shared cache for system catalog tuples, which might be a win but I'm
>> not sure (I'm worried about contention for the cache, especially if it's
>> protected by just one or a few spinlocks).  Anyway, if we did have one
>> then keeping an accurate block count in the relation's pg_class row
>> would be a practical alternative.

> But there would be a problem if we use shared catalog cache.
> Being updated system tuples are only visible to an updating backend
> and other backends should see committed tuples.
> On the other hand,an accurate block count should be visible to all
> backends.
> Which tuple of a row should we load to catalog cache and update ?

Good point --- rolling back a transaction would cancel changes to the
pg_class row, but it mustn't cause the relation's file to get truncated
(since there could be tuples of other uncommitted transactions in the
newly added block(s)).

This says that having a block count column in pg_class is the Wrong
Thing; we should get rid of relpages entirely.  The Right Thing is a
separate data structure in shared memory that stores the current
physical block count for each active relation.  The first backend to
touch a given relation would insert an entry, and then subsequent
extensions/truncations/deletions would need to update it.  We already
obtain a special lock when extending a relation, so seems like there'd
be no extra locking cost to have a table like this.
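
As a rough illustration of the shape such a structure could take, here is a
hedged standalone sketch: a fixed-size table keyed by relation id, with a
single pthread mutex standing in for PostgreSQL's shared memory and
spinlocks (hash collisions simply overwrite, which a real implementation
obviously could not do):

    #include <pthread.h>

    #define REL_SIZE_SLOTS 256          /* arbitrary size for this sketch */

    /* hypothetical entry: one per active relation */
    typedef struct RelSizeEntry
    {
        unsigned int relid;             /* relation identifier (e.g. its OID) */
        long         nblocks;           /* current physical block count */
        int          valid;
    } RelSizeEntry;

    static RelSizeEntry     rel_size_table[REL_SIZE_SLOTS];
    static pthread_mutex_t  rel_size_lock = PTHREAD_MUTEX_INITIALIZER;

    /*
     * The first backend to touch a relation inserts its size (obtained
     * from the storage manager); later extensions, truncations and
     * deletions update the entry in place.
     */
    static void
    rel_size_set(unsigned int relid, long nblocks)
    {
        pthread_mutex_lock(&rel_size_lock);
        RelSizeEntry *e = &rel_size_table[relid % REL_SIZE_SLOTS];
        e->relid = relid;
        e->nblocks = nblocks;
        e->valid = 1;
        pthread_mutex_unlock(&rel_size_lock);
    }

    /* returns -1 when there is no entry yet, so the caller falls back to lseek() */
    static long
    rel_size_get(unsigned int relid)
    {
        long    result = -1;

        pthread_mutex_lock(&rel_size_lock);
        RelSizeEntry *e = &rel_size_table[relid % REL_SIZE_SLOTS];
        if (e->valid && e->relid == relid)
            result = e->nblocks;
        pthread_mutex_unlock(&rel_size_lock);
        return result;
    }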

Anyone up for actually implementing this ;-) ?  I have other things
I want to work on...

>> Well, it seems to me that the first misbehavior (incomplete delete becomes
>> a partial truncate, and you can try again) is a lot better than the
>> second (incomplete delete leaves an undeletable, unrecreatable table).
>> Should I go ahead and make delete/truncate work back-to-front, or do you
>> see a reason why that'd be a bad thing to do?

> I also think back-to-front is better.

OK, I have a couple other little things I want to do in md.c, so I'll
see what I can do about that.  Even with a shared-memory relation
length table, back-to-front truncation would be the safest way to
proceed, so we'll want to make this change in any case.

> Deletion is necessary only not to consume disk space.
>
> For example vacuum could remove not deleted files.

Hmm ... interesting idea ... but I can hear the complaints
from users already...
        regards, tom lane


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Vadim Mikheev
Date:
Tom Lane wrote:
> 
> >> a shared cache for system catalog tuples, which might be a win but I'm
> >> not sure (I'm worried about contention for the cache, especially if it's
> >> protected by just one or a few spinlocks).  Anyway, if we did have one

Commercial DBMSes have this... Isn't it a good reason? -:)

> > But there would be a problem if we use shared catalog cache.
> > Being updated system tuples are only visible to an updating backend
> > and other backends should see committed tuples.
> > On the other hand,an accurate block count should be visible to all
> > backends.
> > Which tuple of a row should we load to catalog cache and update ?
> 
> Good point --- rolling back a transaction would cancel changes to the
> pg_class row, but it mustn't cause the relation's file to get truncated
> (since there could be tuples of other uncommitted transactions in the
> newly added block(s)).
> 
> This says that having a block count column in pg_class is the Wrong
> Thing; we should get rid of relpages entirely.  The Right Thing is a
> separate data structure in shared memory that stores the current
> physical block count for each active relation.  The first backend to
> touch a given relation would insert an entry, and then subsequent
> extensions/truncations/deletions would need to update it.  We already
> obtain a special lock when extending a relation, so seems like there'd
> be no extra locking cost to have a table like this.

I supposed that each backend will still use own catalog 
cache (after reading entries from shared one) and synchronize 
shared/private caches on commit - e.g. update reltuples!
relpages will be updated immediately after physical changes -
what's problem with this?

> Anyone up for actually implementing this ;-) ?  I have other things
> I want to work on...

And me too -:))

Vadim


RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

[snip]
> 
> > Deletion is necessary only not to consume disk space.
> >
> > For example vacuum could remove not deleted files.
> 
> Hmm ... interesting idea ... but I can hear the complaints
> from users already...
>

My idea is only an analogy of PostgreSQL's simple recovery
mechanism for tuples.

And my main point is that "delete fails after commit" doesn't harm the
database except that the not-deleted files consume disk space.

Of course, it's preferable to delete relation files immediately
after (or just at) commit.
Useless files are visible, though useless tuples are invisible.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
> 
> Tom Lane wrote:
> > 
> > >> a shared cache for system catalog tuples, which might be a 
> win but I'm
> > >> not sure (I'm worried about contention for the cache, 
> especially if it's
> > >> protected by just one or a few spinlocks).  Anyway, if we 
> did have one
> 
> Commercial DBMSes have this... Isn't it a good reason? -:)
> 
> > > But there would be a problem if we use shared catalog cache.
> > > Being updated system tuples are only visible to an updating backend
> > > and other backends should see committed tuples.
> > > On the other hand,an accurate block count should be visible to all
> > > backends.
> > > Which tuple of a row should we load to catalog cache and update ?
> > 
> > Good point --- rolling back a transaction would cancel changes to the
> > pg_class row, but it mustn't cause the relation's file to get truncated
> > (since there could be tuples of other uncommitted transactions in the
> > newly added block(s)).
> > 
> > This says that having a block count column in pg_class is the Wrong
> > Thing; we should get rid of relpages entirely.  The Right Thing is a
> > separate data structure in shared memory that stores the current
> > physical block count for each active relation.  The first backend to
> > touch a given relation would insert an entry, and then subsequent
> > extensions/truncations/deletions would need to update it.  We already
> > obtain a special lock when extending a relation, so seems like there'd
> > be no extra locking cost to have a table like this.
> 
> I supposed that each backend will still use own catalog 
> cache (after reading entries from shared one) and synchronize 
> shared/private caches on commit - e.g. update reltuples!
> relpages will be updated immediately after physical changes -
> what's problem with this?
>

Does this mean the following ?

1. shared cache holds committed system tuples.
2. private cache holds uncommitted system tuples.
3. relpages of the shared cache are updated immediately by
   physical change, and the corresponding buffer pages are
   marked dirty.
4. on commit, the contents of uncommitted tuples except
   relpages, reltuples, ... are copied to the corresponding tuples
   in the shared cache and the combined contents are
   committed.

If so, catalog cache invalidation would no longer be needed.
But synchronization of step 4 may be difficult.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> 1. shared cache holds committed system tuples.
> 2. private cache holds uncommitted system tuples.
> 3. relpages of shared cache are updated immediately by
>     phisical change and corresponding buffer pages are
>     marked dirty.
> 4. on commit, the contents of uncommitted tuples except
>    relpages,reltuples,... are copied to correponding tuples
>    in shared cache and the combined contents are
>    committed.
> If so,catalog cache invalidation would be no longer needed.
> But synchronization of the step 4. may be difficult.

I think the main problem is that relpages and reltuples shouldn't
be kept in pg_class columns at all, because they need to have
very different update behavior from the other pg_class columns.

The rest of pg_class is update-on-commit, and we can lock down any one
row in the normal MVCC way (if transaction A has modified a row and
transaction B also wants to modify it, B waits for A to commit or abort,
so it can know which version of the row to start from).  Furthermore,
there can legitimately be several different values of a row in use in
different places: the latest committed, an uncommitted modification, and
one or more old values that are still being used by active transactions
because they were current when those transactions started.  (BTW, the
present relcache is pretty bad about maintaining pure MVCC transaction
semantics like this, but it seems clear to me that that's the direction
we want to go in.)

relpages cannot operate this way.  To be useful for avoiding lseeks,
relpages *must* change exactly when the physical file changes.  It
matters not at all whether the particular transaction that extended the
file ultimately commits or not.  Moreover there can be only one correct
value (per relation) across the whole system, because there is only one
length of the relation file.

If we want to take reltuples seriously and try to maintain it
on-the-fly, then I think it needs still a third behavior.  Clearly
it cannot be updated using MVCC rules, or we lose all writer
concurrency (if A has added tuples to a rel, B would have to wait
for A to commit before it could update reltuples...).  Furthermore
"updating" isn't a simple matter of storing what you think the new
value is; otherwise two transactions adding tuples in parallel would
leave the wrong answer after B commits and overwrites A's value.
I think it would work for each transaction to keep track of a net delta
in reltuples for each table it's changed (total tuples added less total
tuples deleted), and then atomically add that value to the table's
shared reltuples counter during commit.  But that still leaves the
problem of how you use the counter during a transaction to get an
accurate answer to the question "If I scan this table now, how many tuples
will I see?"  At the time the question is asked, the current shared
counter value might include the effects of transactions that have
committed since your transaction started, and therefore are not visible
under MVCC rules.  I think getting the correct answer would involve
making an instantaneous copy of the current counter at the start of
your xact, and then adding your own private net-uncommitted-delta to
the saved shared counter value when asked the question.  This doesn't
look real practical --- you'd have to save the reltuples counts of
*all* tables in the database at the start of each xact, on the off
chance that you might need them.  Ugh.  Perhaps someone has a better
idea.  In any case, reltuples clearly needs different mechanisms than
the ordinary fields in pg_class do, because updating it will be a
performance bottleneck otherwise.
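
To make the delta idea concrete, here is a hedged sketch with made-up
names, using one counter for a single relation in place of the real shared
per-relation counter: each transaction accumulates inserts minus deletes
privately and folds the total in atomically only at commit, so writers
never block one another on the count:

    #include <stdatomic.h>

    /* stand-in for the shared per-relation tuple counter */
    static atomic_long shared_reltuples;

    /* this transaction's net change for the relation: inserts minus deletes */
    static long xact_reltuples_delta;

    static void count_insert(void) { xact_reltuples_delta++; }
    static void count_delete(void) { xact_reltuples_delta--; }

    /* at commit, publish the whole delta in one atomic step */
    static void
    reltuples_at_commit(void)
    {
        atomic_fetch_add(&shared_reltuples, xact_reltuples_delta);
        xact_reltuples_delta = 0;
    }

    /* at abort, the uncommitted delta is simply discarded */
    static void
    reltuples_at_abort(void)
    {
        xact_reltuples_delta = 0;
    }

The snapshot difficulty described above is visible even in this toy:
shared_reltuples moves as soon as any transaction commits, so answering
"how many tuples will I see" would still need a start-of-transaction copy
of the counter plus the private delta.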

If we allow reltuples to be updated only by vacuum-like events, as
it is now, then I think keeping it in pg_class is still OK.

In short, it seems clear to me that relpages should be removed from
pg_class and kept somewhere else if we want to make it more reliable
than it is now, and the same for reltuples (but reltuples doesn't
behave the same as relpages, and probably ought to be handled
differently).
        regards, tom lane


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Vadim Mikheev
Date:
Hiroshi Inoue wrote:
> 
> > I supposed that each backend will still use own catalog
> > cache (after reading entries from shared one) and synchronize
> > shared/private caches on commit - e.g. update reltuples!
> > relpages will be updated immediately after physical changes -
> > what's problem with this?
> >
> 
> Does this mean the following ?
> 
> 1. shared cache holds committed system tuples.
> 2. private cache holds uncommitted system tuples.
> 3. relpages of shared cache are updated immediately by
>     phisical change and corresponding buffer pages are
>     marked dirty.
> 4. on commit, the contents of uncommitted tuples except
> >    relpages,reltuples,... are copied to correponding tuples
                ^^^^^^^^^
reltuples in shared catalog cache (SCC) will be updated!
If transaction inserted some tuples then SCC->reltuples
will be incremented, etc.

>    in shared cache and the combined contents are
>    committed.
> 
> If so,catalog cache invalidation would be no longer needed.

I never liked our invalidation code -:)

> But synchronization of the step 4. may be difficult.

What's easy in this life? -:)

Vadim


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Vadim Mikheev
Date:
Tom Lane wrote:
> 
> I think the main problem is that relpages and reltuples shouldn't
> be kept in pg_class columns at all, because they need to have
> very different update behavior from the other pg_class columns.

Yes, but is this a reason to move them somewhere else?
Let's update them differently (i.e. update in-place)
but keep them in pg_class.
Should we provide read consistency for these internal-use columns?
I'm not sure.

> If we want to take reltuples seriously and try to maintain it
> on-the-fly, then I think it needs still a third behavior.  Clearly

...snip...
I agreed that there is no way to get accurate estimation for
# of rows to be seen by a query...
Well, let's keep up-to-date # of rows present in relation:
in any case a query will have to read them and this is what
we need to estimate cost of simple scans, as for costs of
joins - no way, currently(?) -:(

But please remember that there is another SCC goal -
faster catalog access...

Vadim


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Bruce Momjian
Date:
> I agreed that there is no way to get accurate estimation for
> # of rows to be seen by a query...
> Well, let's keep up-to-date # of rows present in relation:
> in any case a query will have to read them and this is what
> we need to estimate cost of simple scans, as for costs of
> joins - now way, currently(?) -:(
> 
> But please remember that there is another SCC goal -
> faster catalog access...

I want to index access more cache entries on cache miss for 7.0.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
> 
> Hiroshi Inoue wrote:
> > 
> > > I supposed that each backend will still use own catalog
> > > cache (after reading entries from shared one) and synchronize
> > > shared/private caches on commit - e.g. update reltuples!
> > > relpages will be updated immediately after physical changes -
> > > what's problem with this?
> > >
> > 
> > Does this mean the following ?
> > 
> > 1. shared cache holds committed system tuples.
> > 2. private cache holds uncommitted system tuples.
> > 3. relpages of shared cache are updated immediately by
> >     phisical change and corresponding buffer pages are
> >     marked dirty.
> > 4. on commit, the contents of uncommitted tuples except
> >    relpages,reltuples,... are copied to correponding tuples
>               ^^^^^^^^^
> reltuples in shared catalog cache (SCC) will be updated!
> If transaction inserted some tuples then SCC->reltuples
> will be incremented, etc.
>

When reltuples is updated, are the system tuples only modified in place,
or are they (insert and delete)d like user tuples ?
If only modified, we couldn't use it in SERIALIZABLE mode.
> >    in shared cache and the combined contents are
> >    committed.
> > 
> > If so,catalog cache invalidation would be no longer needed.
> 
> I never liked our invalidation code -:)
> 
> > But synchronization of the step 4. may be difficult.
> 
> What's easy in this life? -:)
>

As for relpages (the block count), my first thought was that
PostgreSQL relation files had their control data on their
first page; I was surprised to learn there's no such data.

Seems catalog cache sharing is another issue, as Tom
says.  Of course it's unnatural to hold a separate catalog
cache.

I will be absent for the rest of this week.
I will think about it for a while.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Vadim Mikheev
Date:
Bruce Momjian wrote:
> 
> > I agreed that there is no way to get accurate estimation for
> > # of rows to be seen by a query...
> > Well, let's keep up-to-date # of rows present in relation:
> > in any case a query will have to read them and this is what
> > we need to estimate cost of simple scans, as for costs of
> > joins - now way, currently(?) -:(
> >
> > But please remember that there is another SCC goal -
> > faster catalog access...
> 
> I want to index access more cache entries on cache miss for 7.0.

Good. But in any case we'll gain faster access from _shared_ cache.

Vadim


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Tom Lane
Date:
Vadim Mikheev <vadim@krs.ru> writes:
> I agreed that there is no way to get accurate estimation for
> # of rows to be seen by a query...
> Well, let's keep up-to-date # of rows present in relation:
> in any case a query will have to read them and this is what
> we need to estimate cost of simple scans, as for costs of
> joins - now way, currently(?) -:(

The optimizer is perfectly happy with approximate tuple counts.  It has
to make enough other guesstimates that I really doubt an exact tuple
count could help it measurably.  So updating the count via VACUUM is an
appropriate solution as far as the optimizer is concerned.  The only
good reason I've ever seen for trying to keep an exact tuple count at
all times is to allow short-circuiting of SELECT COUNT(*) queries ---
and even there, it's only useful if there's no WHERE or GROUP BY.

As far as I can see, keeping an exact tuple count couldn't possibly
be worth the overhead it'd take.  Keeping an exact block count may
have the same problem, but I'm not sure either way.
        regards, tom lane


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Vadim Mikheev
Date:
Hiroshi Inoue wrote:
> 
> > > Does this mean the following ?
> > >
> > > 1. shared cache holds committed system tuples.
> > > 2. private cache holds uncommitted system tuples.
> > > 3. relpages of shared cache are updated immediately by
> > >     phisical change and corresponding buffer pages are
> > >     marked dirty.
> > > 4. on commit, the contents of uncommitted tuples except
> > >    relpages,reltuples,... are copied to correponding tuples
> >               ^^^^^^^^^
> > reltuples in shared catalog cache (SCC) will be updated!
> > If transaction inserted some tuples then SCC->reltuples
> > will be incremented, etc.
> >
> 
> System tuples are only modifiled or (insert and delet)ed like
> user tuples when reltuples are updated ?
> If only modified,we couldn't use it in SERIALIZABLE mode.
     ^^^^^^^^^^^^^                      ^^^^^^^^^^^^^^^^^^^^
      ...this...
 

I'm not sure that we must provide read consistency
for internal-use columns...
Nevertheless, I agreed that keeping internal-use columns
in the table is a bad thing, but let's do it for a while: I believe
that someday we'll re-implement our mdmgr - using a separate
file for each table/index is a bad thing too - and move these
columns "somewhere else" on that day.

Vadim


Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
Vadim Mikheev
Date:
Tom Lane wrote:
> 
> Vadim Mikheev <vadim@krs.ru> writes:
> > I agreed that there is no way to get accurate estimation for
> > # of rows to be seen by a query...
> > Well, let's keep up-to-date # of rows present in relation:
> > in any case a query will have to read them and this is what
> > we need to estimate cost of simple scans, as for costs of
> > joins - now way, currently(?) -:(
> 
> The optimizer is perfectly happy with approximate tuple counts.  It has
> to make enough other guesstimates that I really doubt an exact tuple
> count could help it measurably.  So updating the count via VACUUM is an
> appropriate solution as far as the optimizer is concerned.  The only

I'm not sure: scans read all (visible/invisible, committed/uncommitted)
tuples before making the visibility decision... though, relpages is the
primary value for cost estimation.

> good reason I've ever seen for trying to keep an exact tuple count at
> all times is to allow short-circuiting of SELECT COUNT(*) queries ---
> and even there, it's only useful if there's no WHERE or GROUP BY.

...and only in READ COMMITTED mode...

> As far as I can see, keeping an exact tuple count couldn't possibly
> be worth the overhead it'd take.  Keeping an exact block count may
> have the same problem, but I'm not sure either way.

relpages is the more significant thing to know.  Extending a relation
by one block at a time is bad for insert performance.  It would be nice
to have two values per relation in the shared cache - # of blocks and
last block used.  On the other hand this is mostly an mdmgr issue.
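
A tiny sketch of the two values mentioned here, again purely hypothetical
(the names and the 8-block growth increment are invented for illustration):
keeping both the physical length and the last block used lets inserts
consume preallocated blocks without touching the file, and lets extension
grab several blocks at once:

    #define EXTEND_CHUNK 8              /* arbitrary: blocks added per extension */

    /* hypothetical shared-cache entry carrying both values */
    typedef struct RelExtentEntry
    {
        unsigned int relid;
        long         nblocks;           /* physical length of the relation */
        long         last_used_block;   /* highest block actually holding data */
    } RelExtentEntry;

    /* pick the block an insert should use; grow only when the slack is gone */
    static long
    choose_insert_block(RelExtentEntry *e)
    {
        if (e->last_used_block + 1 >= e->nblocks)
            e->nblocks += EXTEND_CHUNK; /* the file would be grown here */
        return ++e->last_used_block;
    }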

But in any case, it seems to me that using shmem for a
shared catalog cache is much more worthwhile than for
shared cache invalidation...

Vadim


RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
> 
> Hiroshi Inoue wrote:
> > 
> > > > Does this mean the following ?
> > > >
> > > > 1. shared cache holds committed system tuples.
> > > > 2. private cache holds uncommitted system tuples.
> > > > 3. relpages of shared cache are updated immediately by
> > > >     phisical change and corresponding buffer pages are
> > > >     marked dirty.
> > > > 4. on commit, the contents of uncommitted tuples except
> > > >    relpages,reltuples,... are copied to correponding tuples
> > >               ^^^^^^^^^
> > > reltuples in shared catalog cache (SCC) will be updated!
> > > If transaction inserted some tuples then SCC->reltuples
> > > will be incremented, etc.
> > >
> > 
> > System tuples are only modifiled or (insert and delet)ed like
> > user tuples when reltuples are updated ?
> > If only modified,we couldn't use it in SERIALIZABLE mode.
>      ^^^^^^^^^^^^^                    ^^^^^^^^^^^^^^^^^^^^
>       ...this...
> 
> I'm not sure that we must provide read consistency
> for internal-use columns...

As for relpages, read consistency has no meaning
because it is outside transaction control.
But as for reltuples, isn't it difficult to commit/rollback
correctly without using insert-delete updates ?
Does your WAL system make it possible ?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



RE: [HACKERS] mdnblocks is an amazing time sink in huge relations

From
"Hiroshi Inoue"
Date:
> 
> 
> But in any case, it seems to me that using shmem for 
> shared catalog cache is much worther than for
> shared cache invalidation...
>

I don't like the current cache invalidation either.
And a shared catalog cache would make it easy to roll back the
catalog cache (the catalog/relation cache is not rolled back
correctly now).

But there are some problems in implementing a shared catalog
cache.

1. Relation cache invalidation remains.  It has almost the same
   mechanism as catalog cache invalidation.  Cache invalidation
   is still incomprehensible as a whole.

2. Is it clear how consistency is kept between system tuples ?
   It's quite unclear to me, and there are 4 anxieties at least.

   a. Locking for system tuples is short term (this is for DDL
      commands inside transactions).  This would break 2-phase
      lock easily.  Is there another principle instead ?

   b. SnapshotNow is used to access system tuples in most
      cases.  SnapshotNow isn't a real snapshot, i.e.
      SnapshotNow couldn't guarantee read consistency.

   c. Row level locking is not implemented for system tuples yet.

   d. AccessShare locks are acquired for system tuples in many
      places.  But does that make any sense ?

3. If neither relpages nor reltuples is held, are there any other
   things that could be sped up ?

Regards. 

Hiroshi Inoue
Inoue@tpf.co.jp