Re: Seq scans roadmap - Mailing list pgsql-hackers

From Luke Lonergan
Subject Re: Seq scans roadmap
Date
Msg-id C3E62232E3BCF24CBA20D72BFDCB6BF803EB6A65@MI8NYCMAIL08.Mi8.com
In response to Re: Seq scans roadmap  (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses Re: Seq scans roadmap
Re: Seq scans roadmap
List pgsql-hackers
Heikki,

> That's interesting. Care to share the results of the
> experiments you ran? I was thinking of running tests of my
> own with varying table sizes.

Yah - it may take a while - you might get there faster.

There are some interesting effects to look at between I/O cache
performance and PG bufcache, and at those speeds the only tool I've
found that actually measures scan rate in PG is VACUUM.  "SELECT
COUNT(*)" measures CPU consumption in the aggregation node, not scan
rate.

Note that the copy from I/O cache to PG bufcache is where the L2 effect
is seen.

> The main motivation here is to avoid the sudden drop in
> performance when a table grows big enough not to fit in RAM.
> See attached diagram for what I mean. Maybe you're right and
> the effect isn't that bad in practice.

There are going to be two performance drops: the first when the table
doesn't fit into PG bufcache, the second when it doesn't fit in bufcache
+ I/O cache.  The second is severe; the first is almost insignificant
(for common queries).

> How is that different from what I described?

My impression of your descriptions is that they overvalue the case where
there are multiple scanners of a large (> 1x bufcache) table such that
they can share the "first load" of the bufcache, e.g. your 10% benefit
for table = 10x bufcache argument.  I think this is an uncommon
workload; normally there are many small tables and several large
tables, such that sharing the PG bufcache is irrelevant to query
speed.

> Yeah I remember the discussion on the L2 cache a while back.
>
> What do you mean with using readahead inside the heapscan?
> Starting an async read request?

Nope - just reading N buffers ahead for seqscans.  Subsequent calls use
previously read pages.  The objective is to issue contiguous reads to
the OS in sizes greater than the PG page size (which is much smaller
than what is needed for fast sequential I/O).
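
To illustrate the idea only (this is not PostgreSQL code; the class name,
the counter and the value of N are made up for the example), a scan that
reads N pages per OS request and hands them out one at a time could be
sketched like this:

```python
import io

PAGE_SIZE = 8192        # PostgreSQL's default block size
READAHEAD_PAGES = 16    # hypothetical N: 16 x 8KB = 128KB per OS read


class ReadaheadScanner:
    """Serve pages one at a time, but fetch them N at a time."""

    def __init__(self, f, n_pages=READAHEAD_PAGES, page_size=PAGE_SIZE):
        self.f = f
        self.n_pages = n_pages
        self.page_size = page_size
        self.chunk = b""      # pages fetched by the last OS read
        self.offset = 0       # position of the next unread page in chunk
        self.os_reads = 0     # read requests issued (incl. the final empty one)

    def next_page(self):
        """Return the next page, or None at end of file."""
        if self.offset >= len(self.chunk):
            # Previously read pages are used up: issue one large
            # contiguous read instead of n_pages small ones.
            self.chunk = self.f.read(self.n_pages * self.page_size)
            self.offset = 0
            self.os_reads += 1
            if not self.chunk:
                return None
        page = self.chunk[self.offset:self.offset + self.page_size]
        self.offset += self.page_size
        return page


def scan(n_table_pages):
    """Scan an in-memory 'table'; return (pages seen, OS reads issued)."""
    scanner = ReadaheadScanner(io.BytesIO(bytes(n_table_pages * PAGE_SIZE)))
    pages = 0
    while scanner.next_page() is not None:
        pages += 1
    return pages, scanner.os_reads
```

The point of the sketch is the ratio: the caller still sees one page per
call, while the OS sees roughly 1/N as many, larger, contiguous requests.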

> > The modifications you suggest here may not have the following
> > properties:
> > - don't pollute bufcache for seqscan of tables > 1 x bufcache
> > - for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB to
> >   minimize L2 cache pollution
>
> So the difference is that you don't want 3A (the take
> advantage of pages already in buffer cache) strategy at all,
> and want the buffer ring strategy to kick in earlier instead.
> Am I reading you correctly?

Yes, I think the ring buffer strategy should be used when the table size
is > 1 x bufcache and the ring buffer should be of a fixed size smaller
than L2 cache (32KB - 128KB seems to work well).
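
As a rough sketch of that rule (illustration only, not PostgreSQL code;
the function and class names are hypothetical, and 128KB is just the top
of the range above), the strategy choice and a fixed-size ring whose
slots are reused round-robin might look like:

```python
PAGE_SIZE = 8192              # PostgreSQL's default block size
RING_BYTES = 128 * 1024       # assumed ring size, chosen to stay below L2


def choose_strategy(table_bytes, bufcache_bytes):
    """Use the ring only when the table is larger than 1 x bufcache."""
    return "ring" if table_bytes > bufcache_bytes else "bufcache"


class BufferRing:
    """Fixed set of page slots reused round-robin by a single scan."""

    def __init__(self, ring_bytes=RING_BYTES, page_size=PAGE_SIZE):
        self.slots = [None] * (ring_bytes // page_size)
        self.next_slot = 0

    def load(self, page_no, page):
        """Place a page in the next slot, overwriting the oldest one."""
        slot = self.next_slot
        self.slots[slot] = (page_no, page)
        self.next_slot = (slot + 1) % len(self.slots)
        return slot

    def resident_pages(self):
        """Page numbers currently held in the ring, in ascending order."""
        return sorted(entry[0] for entry in self.slots if entry is not None)
```

However large the table, the scan touches only the ring's 16 slots, so
the rest of the bufcache (and most of L2) is left alone.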

- Luke


