On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
> I'd like to get access to a box with (a lot) more cores, to see
> whether the lock stuff moves up in the profile. A big chunk of that
> hash_search_with_hash_value overhead is coming from
> LockAcquireExtended. The __strcmp_sse2 is almost entirely parsing
> overhead. In general, I'm not sure there's much hope for reducing the
> parsing overhead, although ScanKeywordLookup() can certainly be done
> better. XLogInsert() is spending a lot of time doing CRC's.
> LWLockAcquire() is dropping cycles in many different places.
I can get you profiles of machines with up to 24 real cores; unfortunately I
can't give away access.
Regarding CRCs:
I spent some time optimizing these, as you might remember. The wall I hit,
benefit-wise, is that the individual CRC calls (4 for a non-indexed
single-row insert on a table with 1 column inside a transaction: 21, 5, 1 and
28 bytes) are just too damn small to get more efficient. It's causing pipeline
stalls all over...
I have a very preliminary patch calculating the CRC over the whole thing in
one go if it can do so (no switch, no xl buffers wraparound), but it's pretty
ugly as it needs to read from the xl insert buffers and then reinsert the CRC
at the correct position.
While it shows a noticeable improvement, that doesn't seem to be a good way to
go. It could be made to work properly though.
I played around with some ideas to do that more nicely, but none were
gratifying.
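
To make the one-go idea concrete, here is a minimal sketch. It is written in
Python rather than C for brevity, and zlib's CRC-32 is only a stand-in for the
WAL CRC code. The record layout (a 4-byte CRC slot in front of the payload) and
all helper names are made up for illustration; this is not the patch. It only
shows that feeding the four small chunks one by one gives the same value as a
single pass over the assembled buffer, with the CRC written back afterwards.

```python
import struct
import zlib

# Chunk sizes mentioned above for the single-row insert case.
CHUNK_SIZES = (21, 5, 1, 28)


def crc_chunked(chunks):
    """One CRC call per small chunk, carrying the running value along."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
    return crc


def crc_one_pass(buf):
    """A single CRC call over the already-assembled contiguous buffer."""
    return zlib.crc32(bytes(buf))


def patch_crc(record, crc_offset, payload_offset):
    """Compute the CRC over the payload in one go and write it back.

    Hypothetical layout: a 4-byte little-endian CRC slot at crc_offset,
    covering everything from payload_offset to the end of the record.
    """
    crc = crc_one_pass(record[payload_offset:])
    struct.pack_into("<I", record, crc_offset, crc)
    return crc


# Dummy data: each chunk is filled with a distinct byte value.
chunks = [bytes([i]) * n for i, n in enumerate(CHUNK_SIZES, start=1)]
payload = b"".join(chunks)

# Assemble first (CRC slot left zeroed), then checksum and patch the slot.
record = bytearray(4) + bytearray(payload)
crc = patch_crc(record, crc_offset=0, payload_offset=4)

assert crc == crc_chunked(chunks)
```

The sketch ignores the cases where the data is not contiguous (switch, buffer
wraparound), which is exactly where the per-chunk path would still be needed.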
Regarding LWLockAcquire costs:
Yes, it's pretty noticeable, across loads of different usages. On a bunch of
production machines it's the second most expensive function (behind
XLogInsert), on some the most expensive. Most of the time
All of those machines are Nehalems though, so the picture may be a bit
distorted.
Andres