Re: mosbench revisited - Mailing list pgsql-hackers

From Tom Lane
Subject Re: mosbench revisited
Date 2011-08-03 21:35:57
Msg-id 23924.1312407357@sss.pgh.pa.us
In response to Re: mosbench revisited  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... We could possibly accept stale values for the
>> planner estimates, but I think heapam's number had better be accurate.

> I think the exact requirement is that, if the relation turns out to be
> larger than the size we read, the extra blocks had better not contain
> any tuples our snapshot can see.  There's actually no interlock
> between smgrnblocks() and smgrextend() right now, so presumably we
> don't need to add one.

No interlock in userspace, you mean.  We're relying on the kernel to do
it, ie, give us a number that is not older than the time of our (already
taken at this point) snapshot.
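
To be concrete: under the hood, smgrnblocks() essentially just asks the
kernel how long the file is.  Stripped of error handling and segment
bookkeeping, it amounts to something like this sketch (not the actual
md.c code):

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* default PostgreSQL block size */

/*
 * Simplified sketch of what smgrnblocks() boils down to: an lseek to
 * the end of the segment file, so the answer is whatever the kernel
 * thinks the file's length is right now.
 */
static off_t
sketch_nblocks(int fd)
{
    off_t       len = lseek(fd, 0, SEEK_END);

    if (len < 0)
        return -1;                      /* real code would ereport() */
    return len / BLCKSZ;                /* count only whole blocks */
}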

> I don't really think there's any sensible way to implement a
> per-backend cache, because that would require invalidation events of
> some kind to be sent on relation extension, and that seems utterly
> insane from a performance standpoint, even if we invented something
> less expensive than sinval.

Yeah, that's the issue.  But "relation extension" is not actually a
cheap operation, since it requires a minimum of one kernel call that is
presumably doing something nontrivial in the filesystem.  I'm not
entirely convinced that we couldn't make this work --- especially since
we could certainly derate the duty cycle by a factor of ten or more
without giving up anything remotely meaningful in planning accuracy.
(I'd be inclined to make it send an inval only once the relation size
had changed at least, say, 10%.)
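
Roughly this sort of test, with invented names and the floor picked out
of the air:

#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/* per-relation bookkeeping; real code would key this by RelFileNode */
typedef struct RelSizeHint
{
    BlockNumber last_advertised_nblocks;
} RelSizeHint;

/*
 * Is this extension "interesting enough" to broadcast an invalidation
 * for?  Only once the relation has grown by at least 10%, with a small
 * floor so tiny relations don't send one message per added page.
 */
static int
size_change_needs_inval(const RelSizeHint *hint, BlockNumber new_nblocks)
{
    BlockNumber old = hint->last_advertised_nblocks;
    BlockNumber slack = (old / 10 > 16) ? old / 10 : 16;

    return new_nblocks > old + slack;
}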

> A shared cache seems like it could work, but the locking is tricky.
> Normally we'd just use a hash table protected by an LWLock, one
> LWLock per partition, but here that's clearly not going to work.  The
> kernel is using a spinlock per file, and that's still too
> heavy-weight.

That still seems utterly astonishing to me.  We're touching each of
those files once per query cycle; a cycle that contains two message
sends, who knows how many internal spinlock/lwlock/heavyweightlock
acquisitions inside Postgres (some of which *do* contend with each
other), and a not insignificant amount of plain old computing.
Meanwhile, this particular spinlock inside the kernel is protecting
what, a single doubleword fetch?  How is that the bottleneck?

I am wondering whether kernel spinlocks are broken.
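
For concreteness, the sort of shared structure under discussion would
presumably look something like the sketch below (all names invented,
and in reality the table would have to live in shared memory); the open
question is whether even one lock acquisition per lookup is cheap
enough here.

#include <stdint.h>
#include <pthread.h>

#define NUM_SIZE_PARTITIONS 16
#define BUCKETS_PER_PARTITION 64

/* hypothetical shared relation-size cache, partitioned to spread contention */
typedef struct RelSizeEntry
{
    uint32_t    relfilenode;            /* stand-in for the full RelFileNode key */
    uint32_t    nblocks;                /* last known length in blocks */
    struct RelSizeEntry *next;          /* hash chain */
} RelSizeEntry;

typedef struct RelSizePartition
{
    pthread_spinlock_t lock;            /* stand-in for an LWLock or spinlock */
    RelSizeEntry *buckets[BUCKETS_PER_PARTITION];
} RelSizePartition;

static RelSizePartition SizeCache[NUM_SIZE_PARTITIONS];

static void
init_size_cache(void)
{
    int         i;

    for (i = 0; i < NUM_SIZE_PARTITIONS; i++)
        pthread_spin_init(&SizeCache[i].lock, PTHREAD_PROCESS_SHARED);
}

static uint32_t
lookup_cached_nblocks(uint32_t relfilenode)
{
    RelSizePartition *part = &SizeCache[relfilenode % NUM_SIZE_PARTITIONS];
    RelSizeEntry *e;
    uint32_t    result = 0;             /* 0 doubles as "not cached" */

    pthread_spin_lock(&part->lock);
    for (e = part->buckets[relfilenode % BUCKETS_PER_PARTITION]; e != NULL; e = e->next)
    {
        if (e->relfilenode == relfilenode)
        {
            result = e->nblocks;
            break;
        }
    }
    pthread_spin_unlock(&part->lock);

    return result;
}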
        regards, tom lane

