Thread: Cache relation sizes?
Hello,

PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a lot. For example, a fully prewarmed pgbench -S transaction consists of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think lseek() is probably about as cheap as a syscall can be so I doubt it really costs us much, but it's still a context switch and it stands out when tracing syscalls, especially now that all the lseek(SEEK_SET) calls are gone (commit c24dcd0cfd).

If we had a different kind of buffer mapping system of the kind that Andres has described, there might be a place in shared memory that could track the size of the relation. Even if we could do that, I wonder if it would still be better to do a kind of per-backend lock-free caching, like this:

1. Whenever a file size has been changed by extending or truncating (ie immediately after the syscall), bump a shared "size_change" invalidation counter.

2. Somewhere in SMgrRelation, store the last known size_change counter and the last known size. In _mdnblocks(), if the counter hasn't moved, we can use the cached size and skip the call to FileSize().

3. To minimise false cache invalidations (caused by other relations' size changes), instead of using a single size_change counter in shared memory, use an array of N and map relation OIDs onto them.

4. As for memory coherency, I think it might be enough to use uint32 without locks or read barriers on the read side, since you have a view of memory at least as new as your snapshot (the taking of which included a memory barrier). That's good enough because we don't need to see blocks added after our snapshot was taken (the same assumption applies today, this just takes further advantage of it), and truncations can't happen while we have a share lock on the relation (the taking of which also includes a memory barrier, covering the case where the truncation happened after our snapshot and the acquisition of the share lock on the relation). In other words, there is heavy locking around truncation already, and for extension we don't care about recent extensions, so we can be quite relaxed about memory. Right?

I don't have a patch for this (though I did once try it as a throw-away hack and it seemed to work), but I just wanted to share the idea and see if anyone sees a problem with the logic/interlocking, or has a better idea for how to do this. It occurred to me that I might be missing something or this would have been done already...

--
Thomas Munro
http://www.enterprisedb.com
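For illustration only, here is a rough, untested C sketch of points 1-3; every name in it (NUM_SIZE_COUNTERS, RelSizeCounters, bump_size_counter(), and the cached_* fields imagined inside SMgrRelationData) is invented for this example rather than taken from any real patch, and error handling is elided:

/* Point 3: a small, fixed array of shared invalidation counters. */
#define NUM_SIZE_COUNTERS 64            /* arbitrary for this sketch */

typedef struct RelSizeCounters
{
    pg_atomic_uint32 counters[NUM_SIZE_COUNTERS];
} RelSizeCounters;

static RelSizeCounters *SizeCounters;   /* lives in shared memory */

static inline pg_atomic_uint32 *
size_counter_for(Oid relNode)
{
    return &SizeCounters->counters[relNode % NUM_SIZE_COUNTERS];
}

/* Point 1: call this immediately after extending or truncating the file. */
static void
bump_size_counter(SMgrRelation reln)
{
    pg_atomic_fetch_add_u32(size_counter_for(reln->smgr_rnode.node.relNode), 1);
}

/* Point 2: _mdnblocks() consults the per-backend cache before FileSize(). */
static BlockNumber
cached_nblocks(SMgrRelation reln, MdfdVec *seg)
{
    uint32      counter;

    counter = pg_atomic_read_u32(size_counter_for(reln->smgr_rnode.node.relNode));
    if (reln->cached_nblocks_valid && reln->cached_counter == counter)
        return reln->cached_nblocks;            /* no lseek(SEEK_END) */

    reln->cached_nblocks = (BlockNumber) (FileSize(seg->mdfd_vfd) / BLCKSZ);
    reln->cached_counter = counter;
    reln->cached_nblocks_valid = true;
    return reln->cached_nblocks;
}

Mapping relation OIDs onto a fixed number of counters means unrelated relations will occasionally invalidate each other, but the only cost of a false invalidation is one extra lseek(SEEK_END), so the array would not need to be large.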
Hi, On 2018-11-07 11:40:22 +1300, Thomas Munro wrote: > PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a > lot. For example, a fully prewarmed pgbench -S transaction consists > of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think > lseek() is probably about as cheap as a syscall can be so I doubt it > really costs us much, but it's still a context switch and it stands > out when tracing syscalls, especially now that all the lseek(SEEK_SET) > calls are gone (commit c24dcd0cfd). I'd really really like to see some benchmarking before embarking on a more complex scheme. I aesthetically dislike those lseeks, but ... > If we had a different kind of buffer mapping system of the kind that > Andres has described, there might be a place in shared memory that > could track the size of the relation. Even if we could do that, I > wonder if it would still be better to do a kind of per-backend > lock-free caching, like this: Note that the reason for introducing that isn't primarily motivated by getting rid of the size determining lseeks, but reducing the locking for extending *and* truncating relations. Greetings, Andres Freund
On Wed, 7 Nov 2018 at 11:41, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > Hello, > > PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a > lot. For example, a fully prewarmed pgbench -S transaction consists > of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think > lseek() is probably about as cheap as a syscall can be so I doubt it > really costs us much, but it's still a context switch and it stands > out when tracing syscalls, especially now that all the lseek(SEEK_SET) > calls are gone (commit c24dcd0cfd). > > If we had a different kind of buffer mapping system of the kind that > Andres has described, there might be a place in shared memory that > could track the size of the relation. Even if we could do that, I > wonder if it would still be better to do a kind of per-backend > lock-free caching, like this: On behalf of those looking after databases running over NFS (sigh!), I think this is definitely worth exploring. Looking at the behaviour of my (9.4.9) server, there's an lseek(SEEK_END) for every relation (table or index) used by a query, which is a lot of them for a heavily partitioned database. The lseek counts seem to be the same with native partitions and 10.4. As an incredibly rough benchmark, a "SELECT * FROM t ORDER BY pk LIMIT 0" on a table with 600 partitions, which builds a MergeAppend/IndexScan plan, invokes lseek around 1200 times, and takes 600ms to return when repeated. (It's much slower the first time, because the backend has to open the files, and read index blocks. I found that increasing max_files_per_process above the number of tables/indexes in the query made a huge difference, too!) Testing separately, 1200 lseeks on that NFS mount takes around 400ms. I'm aware of other improvements since 9.4.9 that would likely improve things (the pread/pwrite change; possibly using native partitioning instead of inheritance), but I imagine reducing lseeks would too. (Of course, an even better improvement is to not put your data directory on an NFS mount (sigh).)
On 7 November 2018 at 11:46, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-11-07 11:40:22 +1300, Thomas Munro wrote: >> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a >> lot. For example, a fully prewarmed pgbench -S transaction consists >> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think >> lseek() is probably about as cheap as a syscall can be so I doubt it >> really costs us much, but it's still a context switch and it stands >> out when tracing syscalls, especially now that all the lseek(SEEK_SET) >> calls are gone (commit c24dcd0cfd). > > I'd really really like to see some benchmarking before embarking on a > more complex scheme. I aesthetically dislike those lseeks, but ... I agree. It would be good to see benchmarks on this first. Those could be as simple as just some crude local backend cache that stuff the return value of RelationGetNumberOfBlocks in estimate_rel_size into a hashtable and does not take into account the fact that it might change. Should be okay to do some read-only benchmarking. The partitioning case is probably a less interesting case to improve if we get [1] as we'll no longer ask for the size of any pruned partitions. Queries that don't prune any partitions are less likely to notice the extra overhead of the lseek(SEEK_END) since they've probably got more work to do elsewhere. [1] https://commitfest.postgresql.org/20/1778/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
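For what it's worth, the crude benchmarking cache being described could be as small as the following hypothetical sketch (a dynahash keyed by relation OID, with no invalidation whatsoever, so it is only meaningful for read-only tests exactly as noted above); estimate_rel_size() would call this wrapper instead of RelationGetNumberOfBlocks() directly:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/hsearch.h"
#include "utils/rel.h"

typedef struct RelSizeCacheEntry
{
    Oid         relid;          /* hash key; must be first */
    BlockNumber nblocks;        /* memoized RelationGetNumberOfBlocks() result */
} RelSizeCacheEntry;

static HTAB *rel_size_cache = NULL;

static BlockNumber
crude_cached_nblocks(Relation rel)
{
    Oid         relid = RelationGetRelid(rel);
    RelSizeCacheEntry *entry;
    bool        found;

    if (rel_size_cache == NULL)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(RelSizeCacheEntry);
        rel_size_cache = hash_create("crude rel size cache", 1024, &ctl,
                                     HASH_ELEM | HASH_BLOBS);
    }

    entry = (RelSizeCacheEntry *) hash_search(rel_size_cache, &relid,
                                              HASH_ENTER, &found);
    if (!found)
        entry->nblocks = RelationGetNumberOfBlocks(rel);    /* the lseek */

    return entry->nblocks;
}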
On Fri, Nov 9, 2018 at 4:42 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > On 7 November 2018 at 11:46, Andres Freund <andres@anarazel.de> wrote: > > On 2018-11-07 11:40:22 +1300, Thomas Munro wrote: > >> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a > >> lot. For example, a fully prewarmed pgbench -S transaction consists > >> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think > >> lseek() is probably about as cheap as a syscall can be so I doubt it > >> really costs us much, but it's still a context switch and it stands > >> out when tracing syscalls, especially now that all the lseek(SEEK_SET) > >> calls are gone (commit c24dcd0cfd). > > > > I'd really really like to see some benchmarking before embarking on a > > more complex scheme. I aesthetically dislike those lseeks, but ... > > I agree. It would be good to see benchmarks on this first. Those > could be as simple as just some crude local backend cache that stuff > the return value of RelationGetNumberOfBlocks in estimate_rel_size > into a hashtable and does not take into account the fact that it might > change. Should be okay to do some read-only benchmarking. Oh, I just found the throw-away patch I wrote ages ago down the back of the sofa. Here's a rebase. It somehow breaks initdb so you have to initdb with unpatched. Unfortunately I couldn't seem to measure any speed-up on a random EDB test lab Linux box using pgbench -S (not "prepared"), but that test doesn't involve many tables, and also it's an older kernel without KPTI mitigations. Attached in case anyone else would like to try it. -- Thomas Munro http://www.enterprisedb.com
Attachment
On Fri, 16 Nov 2018 at 12:06, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Oh, I just found the throw-away patch I wrote ages ago down the back > of the sofa. Here's a rebase. It somehow breaks initdb so you have > to initdb with unpatched. Unfortunately I couldn't seem to measure > any speed-up on a random EDB test lab Linux box using pgbench -S (not > "prepared"), but that test doesn't involve many tables, and also it's > an older kernel without KPTI mitigations. Attached in case anyone > else would like to try it. Over on [1] there's some talk about how when using PREPAREd statements on a table with many partitions where some of the parameters help prune many of the partitions, that on the 6th, and only on the 6th execution of the statement that there's a huge spike in the query latency. This will be down to the fact that GetCachedPlan() builds a generic plan on the 6th execution and most likely discards it due to it appearing too expensive because of lack of any partition pruning. The custom plan's cost is likely much much cheaper, so the generic plan is planned but never used. This may be a good real-life candidate to test this patch with. I know from benchmarks I performed several months ago that the lseek() call to determine the relation size was a large part of the cost of planning with many partitions. [1] https://www.postgresql.org/message-id/flat/25C1C6B2E7BE044889E4FE8643A58BA963D89214%40G01JPEXMBKW03 -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hello,

I also find this proposed feature to be beneficial for performance, especially when we want to extend or truncate large tables.
As mentioned by David, currently there is a query latency spike when we make a generic plan for a partitioned table with many partitions. I tried to apply Thomas' patch for that use case. Aside from measuring the planning and execution time, I also monitored the lseek calls using simple strace, with and without the patch.

Below are the test results.
Setup: 8192 table partitions.

(1) set plan_cache_mode = 'force_generic_plan';

[Without Patch]
prepare select_stmt(int) as select * from t where id = $1;
explain (timing off, analyze) execute select_stmt(8192);
[…]
 Planning Time: 1678.680 ms
 Execution Time: 643.495 ms

$ strace -p [pid] -e trace=lseek -c
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
 100.00    0.017247           1     16385           lseek

[With Patch]
[…]
 Planning Time: 1596.566 ms
 Execution Time: 653.646 ms

$ strace -p [pid] -e trace=lseek -c
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
 100.00    0.009196           1      8192           lseek

It was mentioned in the other thread [1] that creating a generic plan for the first time is very expensive. Although this patch did not seem to reduce the cost of planning time for force_generic_plan, it seems that the number of lseek calls was reduced by half during the first execution of the generic plan.

(2) plan_cache_mode = 'auto'
reset plan_cache_mode;  -- resets to auto / custom plan

[Without Patch]
[…]
 Planning Time: 768.669 ms
 Execution Time: 0.776 ms

$ strace -p [pid] -e trace=lseek -c
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
 100.00    0.015117           2      8193           lseek

[With Patch]
[…]
 Planning Time: 181.690 ms
 Execution Time: 0.615 ms

$ strace -p [pid] -e trace=lseek -c
[…]
NO (zero) lseek calls.

Without the patch, there were around 8193 lseek calls. With the patch applied, there were no lseek calls when creating the custom plan.

(3) set plan_cache_mode = 'force_generic_plan';
-- force it to generic plan again to use the cached plan (no re-planning)

[Without Patch]
[…]
 Planning Time: 14.294 ms
 Execution Time: 601.141 ms

$ strace -p [pid] -e trace=lseek -c
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
   0.00    0.000000           0         1           lseek

[With Patch]
[…]
 Planning Time: 13.976 ms
 Execution Time: 570.518 ms

$ strace -p [pid] -e trace=lseek -c
[…]
NO (zero) lseek calls.

----

If I did the test correctly, I am not sure why the patch did not affect the generic planning performance of a table with many partitions. However, the number of lseek calls was greatly reduced with Thomas' patch.
I also did not get a considerable speed-up in terms of latency average using pgbench -S (read-only, unprepared).

I am assuming this might be applicable to other use cases as well. (I just tested the patch, but haven't dug into the patch details yet.)
Would you like to submit this to the commitfest to get more reviews for possible idea/patch improvement?

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com

Regards,
Kirk Jamison
On Thu, Dec 27, 2018 at 8:00 PM Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote: > I also find this proposed feature to be beneficial for performance, especially when we want to extend or truncate largetables. > As mentioned by David, currently there is a query latency spike when we make generic plan for partitioned table with manypartitions. > I tried to apply Thomas' patch for that use case. Aside from measuring the planning and execution time, > I also monitored the lseek calls using simple strace, with and without the patch. Thanks for looking into this and testing! > Setup 8192 table partitions. > (1) set plan_cache_mode = 'force_generic_plan'; > Planning Time: 1678.680 ms > Planning Time: 1596.566 ms > (2) plan_cache_mode = 'auto’ > Planning Time: 768.669 ms > Planning Time: 181.690 ms > (3) set plan_cache_mode = 'force_generic_plan'; > Planning Time: 14.294 ms > Planning Time: 13.976 ms > If I did the test correctly, I am not sure though as to why the patch did not affect the generic planning performance oftable with many partitions. > However, the number of lseek calls was greatly reduced with Thomas’ patch. > I also did not get considerable speed up in terms of latency average using pgbench –S (read-only, unprepared). > I am assuming this might be applicable to other use cases as well. > (I just tested the patch, but haven’t dug up the patch details yet). The result for (2) is nice. Even though you had to use 8192 partitions to see it. > Would you like to submit this to the commitfest to get more reviews for possible idea/patch improvement? For now I think this still in the experiment/hack phase and I have a ton of other stuff percolating in this commitfest already (and a week of family holiday in the middle of January). But if you have ideas about the validity of the assumptions, the reason it breaks initdb, or any other aspect of this approach (or alternatives), please don't let me stop you, and of course please feel free to submit this, an improved version or an alternative proposal yourself! Unfortunately I wouldn't have time to nurture it this time around, beyond some drive-by comments. Assorted armchair speculation: I wonder how much this is affected by the OS and KPTI, virtualisation technology, PCID support, etc. Back in the good old days, Linux's lseek(SEEK_END) stopped acquiring the inode mutex when reading the size, at least in the generic implementation used by most filesystems (I wonder if our workloads were indirectly responsible for that optimisation?) so maybe it became about as fast as a syscall could possibly be, but now the baseline for how fast syscalls can be has moved and it also depends on your hardware, and it also has external costs that depend on what memory you touch in between syscalls. Also, other operating systems might still acquire a per-underlying-file/vnode/whatever lock (<checks source code>... yes) and the contention for that might depend on what else is happening, so that a single standalone test wouldn't capture that but a super busy DB with a rapidly expanding and contracting table that many other sessions are trying to observe with lseek(SEEK_END) could slow down more. -- Thomas Munro http://www.enterprisedb.com
Hi Thomas,

On Friday, December 28, 2018 6:43 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> [...] if you have ideas about the validity of the assumptions, the reason it breaks initdb, or any other aspect of this approach (or alternatives), please don't let me stop you, and of course please feel free to submit this, an improved version or an alternative proposal [...]

Sure. Thanks. I'd like to try to work on the idea. I also took a look at the code, and I hope you don't mind if I ask for clarifications (explanation/advice/opinions) on the following, since my postgres experience is not substantial enough yet.

(1) I noticed that you used a "relsize_change_counter" stored in MdSharedData. Is my understanding below correct?

The intention is to cache the rel size per-backend (lock-free), with an array of relsize_change_counter used to skip the lseek syscall when the counter does not change. In _mdnblocks(), if the counter did not change, the cached rel size (nblocks) is used and the call to FileSize() (the lseek that gets and caches the rel size) is skipped. That means that in default Postgres master, the lseek syscall (through FileSize()) is called whenever we want to get the rel size (nblocks).

On the other hand, the simplest method I thought could also work is to only cache the file size (nblock) in shared memory, not in the backend process, since both nblock and relsize_change_counter are uint32 data types anyway. If relsize_change_counter can be changed without a lock, then nblock can be changed without a lock, is that right? In that case, nblock can be accessed directly in shared memory. In this case, is it necessary to cache the relation size in the backend?

(2) Is the MdSharedData temporary or permanent in shared memory?

From the patch:

typedef struct MdSharedData
{
    /* XXX could have an array of these, and use rel OID % nelements? */
    pg_atomic_uint32 relsize_change_counter;
} MdSharedData;

static MdSharedData *MdShared;

What I intend to have is a permanent hashtable that will keep the file size (eventually/in future development, including table addresses) in shared memory for faster access by backend processes. The idea is to keep track of the unallocated blocks, based on how much the relation has been extended or truncated. Memory for this hashtable will be dynamically allocated.

Thanks,
Kirk Jamison
>From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com] >On the other hand, the simplest method I thought that could also work is to only cache >the file size (nblock) in shared memory, not in the backend process, since both nblock >and relsize_change_counter are uint32 data type anyway. If relsize_change_counter >can be changed without lock, then nblock can be changed without lock, is it right? In >that case, nblock can be accessed directly in shared memory. In this case, is the >relation size necessary to be cached in backend? (Aside from which idea is better.. ) If you want to put relation size on the shared memory, then I don't think caches in backend is necessary because every time relation_size is updated you need to invalidate cache in backends. At the reference taking shared lock on the cache and at the update taking exclusive lock is simple without backend cache. >(2) Is the MdSharedData temporary or permanent in shared memory? That data structure seems to initialize at the time of InitPostgre, which means it's permanent because postgres-initialized-shared-memory doesn't have a chance to drop it as far as I know. (If you want to use temporary data structure, then other mechanism like dsm/dsa/dshash is a candidate.) Regards, Takeshi Ideriha
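To illustrate the locking protocol being described (shared lock to read the size, exclusive lock only when extending or truncating, with no per-backend cache), a purely hypothetical shared-memory entry could look like this:

/* One hypothetical entry per tracked relation, living in shared memory. */
typedef struct SharedRelSizeEntry
{
    LWLock      lock;
    RelFileNode rnode;
    BlockNumber nblocks;
} SharedRelSizeEntry;

/* Readers take the lock in shared mode. */
static BlockNumber
shared_nblocks_get(SharedRelSizeEntry *entry)
{
    BlockNumber result;

    LWLockAcquire(&entry->lock, LW_SHARED);
    result = entry->nblocks;
    LWLockRelease(&entry->lock);
    return result;
}

/* smgrextend()/smgrtruncate() would update it in exclusive mode. */
static void
shared_nblocks_set(SharedRelSizeEntry *entry, BlockNumber nblocks)
{
    LWLockAcquire(&entry->lock, LW_EXCLUSIVE);
    entry->nblocks = nblocks;
    LWLockRelease(&entry->lock);
}

Whether even the shared-mode acquisition is cheap enough on hot paths is the kind of question the benchmarking discussed earlier would need to answer.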
From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com]
> On the other hand, the simplest method I thought that could also work is
> to only cache the file size (nblock) in shared memory, not in the backend
> process, since both nblock and relsize_change_counter are uint32 data type
> anyway. If relsize_change_counter can be changed without lock, then nblock
> can be changed without lock, is it right? In that case, nblock can be accessed
> directly in shared memory. In this case, is the relation size necessary
> to be cached in backend?

Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relation size in the RelationData structure in relcache.

> (2) Is the MdSharedData temporary or permanent in shared memory?
> from the patch:
> typedef struct MdSharedData
> {
>     /* XXX could have an array of these, and use rel OID % nelements? */
>     pg_atomic_uint32 relsize_change_counter;
> } MdSharedData;
>
> static MdSharedData *MdShared;

Permanent in shared memory.


Regards
Takayuki Tsunakawa
At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB955DF@G01JPEXMBYT05> > From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com] > > On the other hand, the simplest method I thought that could also work is > > to only cache the file size (nblock) in shared memory, not in the backend > > process, since both nblock and relsize_change_counter are uint32 data type > > anyway. If relsize_change_counter can be changed without lock, then nblock > > can be changed without lock, is it right? In that case, nblock can be accessed > > directly in shared memory. In this case, is the relation size necessary > > to be cached in backend? > > Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in sharedmemory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relationsize in the RelationData structure in relcache. Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complex or it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench. > > (2) Is the MdSharedData temporary or permanent in shared memory? > > from the patch: > > typedef struct MdSharedData > > { > > /* XXX could have an array of these, and use rel OID % nelements? > > */ > > pg_atomic_uint32 relsize_change_counter; > > } MdSharedData; > > > > static MdSharedData *MdShared; > > Permanent in shared memory. I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanent beyond a server restart. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > Just one counter in the patch *seems* to give significant gain > comparing to the complexity, given that lseek is so complex or it > brings latency, especially on workloads where file is scarcely > changed. Though I didn't run it on a test bench. I expect so, too. > I'm not sure the duration of the 'permanent' there, but it > disappears when server stops. Anyway it doesn't need to be > permanent beyond a server restart. Right, it exists while the server is running. Regards Takayuki Tsunakawa
On February 6, 2019, 08:25AM +0000, Kyotaro HORIGUCHI wrote: >At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote: >> Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in sharedmemory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relationsize in the RelationData structure in relcache. >Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complexor it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench. > > > (2) Is the MdSharedData temporary or permanent in shared memory? > > Permanent in shared memory. > I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanentbeyond a server restart. Thank you for the insights. I did a simple test in the previous email using simple syscall tracing, the patch significantly reduced the number of lseek syscall. (but that simple test might not be enough to describe the performance benefit) Regarding Tsunakawa-san's comment, in Thomas' patch, he made a place in shared memory that stores the relsize_change_counter, so I am thinking of utilizing the same, but for caching the relsize itself. Perhaps I should explain further the intention for the design. First step, to cache the file size in the shared memory. Similar to the intention or purpose of the patch written by Thomas, to reduce the number of lseek(SEEK_END) by caching the relation size without using additional locks. The difference is by caching a rel size on the shared memory itself. I wonder if there are problems that you can see with this approach. Eventually, the next step is to have a structure in shared memory that caches file addresses along with their sizes (Andres' idea of putting an open relations table in the shared memory). With a structure that group buffers into file relation units, we can get file size directly from shared memory, so the assumption is it would be faster whenever we truncate/extend our relations because we can track the offset of the changes in size and use range for managing the truncation, etc.. The next step is a complex direction that needs serious discussion, but I am wondering if we can proceed with the first step for now if the idea and direction are valid. Regards, Kirk Jamison
On 2019-02-06 08:50:45 +0000, Jamison, Kirk wrote:
> On February 6, 2019, 08:25AM +0000, Kyotaro HORIGUCHI wrote:
> > At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote:
> >> Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relation size in the RelationData structure in relcache.
> >
> > Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complex or it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench.
> >
> > > (2) Is the MdSharedData temporary or permanent in shared memory?
> > > Permanent in shared memory.
> > I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanent beyond a server restart.
>
> Thank you for the insights.
> I did a simple test in the previous email using simple syscall tracing,
> the patch significantly reduced the number of lseek syscalls.
> (but that simple test might not be enough to describe the performance benefit)
>
> Regarding Tsunakawa-san's comment,
> in Thomas' patch, he made a place in shared memory that stores the
> relsize_change_counter, so I am thinking of utilizing the same,
> but for caching the relsize itself.
>
> Perhaps I should explain further the intention for the design.
>
> First step, to cache the file size in the shared memory. Similar to the
> intention or purpose of the patch written by Thomas, to reduce the
> number of lseek(SEEK_END) by caching the relation size without using
> additional locks. The difference is by caching a rel size on the shared
> memory itself. I wonder if there are problems that you can see with
> this approach.
>
> Eventually, the next step is to have a structure in shared memory
> that caches file addresses along with their sizes (Andres' idea of
> putting an open relations table in the shared memory). With a
> structure that group buffers into file relation units, we can get
> file size directly from shared memory, so the assumption is it would
> be faster whenever we truncate/extend our relations because we can
> track the offset of the changes in size and use range for managing
> the truncation, etc.
> The next step is a complex direction that needs serious discussion,
> but I am wondering if we can proceed with the first step for now if
> the idea and direction are valid.

Maybe I'm missing something here, but why is it actually necessary to have the sizes in shared memory, if we're just talking about caching sizes? It's pretty darn cheap to determine the filesize of a file that has been recently stat()/lseek()/ed, and introducing per-file shared data adds *substantial* complexity, because the amount of shared memory needs to be pre-determined.

The reason I want to put per-relation data into shared memory is different, it's about putting the buffer mapping into shared memory, and that, as a prerequisite, also needs per-relation data. And there's a limit on the number of relations that need to be open (one per cached page at max), and space can be freed by evicting pages.

Greetings,

Andres Freund
On February 6, 2019, 8:57 AM +0000, Andres Freund wrote: > Maybe I'm missing something here, but why is it actually necessary to > have the sizes in shared memory, if we're just talking about caching > sizes? It's pretty darn cheap to determine the filesize of a file that > has been recently stat()/lseek()/ed, and introducing per-file shared > data adds *substantial* complexity, because the amount of shared memory > needs to be pre-determined. The reason I want to put per-relation data > into shared memory is different, it's about putting the buffer mapping > into shared memory, and that, as a prerequisite, also need per-relation > data. And there's a limit of the number of relations that need to be > open (one per cached page at max), and space can be freed by evicting > pages. Ahh.. You are right about the logic of putting it in the shared memory. As for Thomas' toy patch, multiple files share one counter in shmem. Although it currently works, it might likely to miss. Though his eventual plan of the idea is to use an array of N counters and map relation OIDs onto them. But as your point about complexity says, in shared memory we cannot share the same area with multiple files, so that needs an area to allocate depending on the number of files. Regarding the allocation of per-relation data in shared memory, I thought it can be a separated component at first so I asked for validity of the idea. But now I consider the point raised. Regards, Kirk Jamison
At Wed, 13 Feb 2019 05:48:28 +0000, "Jamison, Kirk" <k.jamison@jp.fujitsu.com> wrote in <D09B13F772D2274BB348A310EE3027C643880D@g01jpexmbkw24> > On February 6, 2019, 8:57 AM +0000, Andres Freund wrote: > > Maybe I'm missing something here, but why is it actually necessary to > > have the sizes in shared memory, if we're just talking about caching > > sizes? It's pretty darn cheap to determine the filesize of a file that > > has been recently stat()/lseek()/ed, and introducing per-file shared > > data adds *substantial* complexity, because the amount of shared memory > > needs to be pre-determined. The reason I want to put per-relation data > > into shared memory is different, it's about putting the buffer mapping > > into shared memory, and that, as a prerequisite, also need per-relation > > data. And there's a limit of the number of relations that need to be > > open (one per cached page at max), and space can be freed by evicting > > pages. > > Ahh.. You are right about the logic of putting it in the shared memory. > As for Thomas' toy patch, multiple files share one counter in shmem. > Although it currently works, it might likely to miss. > Though his eventual plan of the idea is to use an array of N counters > and map relation OIDs onto them. > But as your point about complexity says, in shared memory we cannot > share the same area with multiple files, so that needs an area to > allocate depending on the number of files. > > Regarding the allocation of per-relation data in shared memory, I > thought it can be a separated component at first so I asked for > validity of the idea. But now I consider the point raised. I still believe that one shared memory element for every non-mapped relation is not only too-complex but also too-much, as Andres (and implicitly I) wrote. I feel that just one flag for all works fine but partitioned flags (that is, relations or files corresponds to the same hash value share one flag) can reduce the shared memory elements to a fixed small number. Note: I'm still not sure how much lseek impacts performance. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, 14 Feb 2019 at 20:41, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 13 Feb 2019 05:48:28 +0000, "Jamison, Kirk" <k.jamison@jp.fujitsu.com> wrote in <D09B13F772D2274BB348A310EE3027C643880D@g01jpexmbkw24>
> On February 6, 2019, 8:57 AM +0000, Andres Freund wrote:
> > Maybe I'm missing something here, but why is it actually necessary to
> > have the sizes in shared memory, if we're just talking about caching
> > sizes? It's pretty darn cheap to determine the filesize of a file that
> > has been recently stat()/lseek()/ed, and introducing per-file shared
> > data adds *substantial* complexity, because the amount of shared memory
> > needs to be pre-determined. The reason I want to put per-relation data
> > into shared memory is different, it's about putting the buffer mapping
> > into shared memory, and that, as a prerequisite, also need per-relation
> > data. And there's a limit of the number of relations that need to be
> > open (one per cached page at max), and space can be freed by evicting
> > pages.
>
> Ahh.. You are right about the logic of putting it in the shared memory.
> As for Thomas' toy patch, multiple files share one counter in shmem.
> Although it currently works, it might likely to miss.
> Though his eventual plan of the idea is to use an array of N counters
> and map relation OIDs onto them.
> But as your point about complexity says, in shared memory we cannot
> share the same area with multiple files, so that needs an area to
> allocate depending on the number of files.
>
> Regarding the allocation of per-relation data in shared memory, I
> thought it can be a separated component at first so I asked for
> validity of the idea. But now I consider the point raised.
I still believe that one shared memory element for every
non-mapped relation is not only too-complex but also too-much, as
Andres (and implicitly I) wrote. I feel that just one flag for
all works fine but partitioned flags (that is, relations or files
corresponds to the same hash value share one flag) can reduce the
shared memory elements to a fixed small number.
Note: I'm still not sure how much lseek impacts performance.
Of course all the "flag"s above actually are "update counter"s :)
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Dec 31, 2019 at 4:43 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I still believe that one shared memory element for every
> non-mapped relation is not only too-complex but also too-much, as
> Andres (and implicitly I) wrote. I feel that just one flag for
> all works fine but partitioned flags (that is, relations or files
> corresponds to the same hash value share one flag) can reduce the
> shared memory elements to a fixed small number.

There is one potentially interesting case that doesn't require any kind of shared cache invalidation AFAICS. XLogReadBufferExtended() calls smgrnblocks() for every buffer access, even if the buffer is already in our buffer pool. I tried to make yet another quick experiment-grade patch to cache the size[1], this time for use in recovery only.

initdb -D pgdata
postgres -D pgdata -c checkpoint_timeout=60min

In another shell:
pgbench -i -s100 postgres
pgbench -M prepared -T60 postgres
killall -9 postgres
mv pgdata pgdata-save

Master branch:

cp -r pgdata-save pgdata
strace -c -f postgres -D pgdata
[... wait for "redo done", then hit ^C ...]

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
...
 18.61   22.492286          26    849396           lseek
  6.95    8.404369          30    277134           pwrite64
  6.63    8.009679          28    277892           pread64
  0.50    0.604037          39     15169           sync_file_range
...

Patched:

rm -fr pgdata
cp -r pgdata-save pgdata
strace -c -f ~/install/bin/postgres -D pgdata
[... wait for "redo done", then hit ^C ...]

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
...
 16.33    8.097631          29    277134           pwrite64
 15.56    7.715052          27    277892           pread64
  1.13    0.559648          39     14137           sync_file_range
...
  0.00    0.001505          25        59           lseek

> Note: I'm still not sure how much lseek impacts performance.

It doesn't seem great that we are effectively making system calls for most WAL records we replay, but, sadly, in this case the patch didn't really make any measurable difference when run without strace on this Linux VM. I suspect there is some workload and stack where it would make a difference (CF the read(postmaster pipe) call for every WAL record that was removed), but this is just something I noticed in passing while working on something else, so I haven't investigated much.

[1] https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks (just a test, unfinished, probably has bugs)
Hi,

On 2019-12-31 17:05:31 +1300, Thomas Munro wrote:
> There is one potentially interesting case that doesn't require any
> kind of shared cache invalidation AFAICS. XLogReadBufferExtended()
> calls smgrnblocks() for every buffer access, even if the buffer is
> already in our buffer pool.

Yea, that's really quite bad*. The bit about doing so even when already in the buffer pool is particularly absurd. Needing to have special handling in mdcreate() for XLogReadBufferExtended() always calling it is also fairly ugly.

> It doesn't seem great that we are effectively making system calls for
> most WAL records we replay, but, sadly, in this case the patch didn't
> really make any measurable difference when run without strace on this
> Linux VM. I suspect there is some workload and stack where it would
> make a difference (CF the read(postmaster pipe) call for every WAL
> record that was removed), but this is just something I noticed in
> passing while working on something else, so I haven't investigated
> much.

I wonder if that's just because your workload is too significantly bottlenecked elsewhere:

> postgres -D pgdata -c checkpoint_timeout=60min
> In another shell:
> pgbench -i -s100 postgres
> pgbench -M prepared -T60 postgres
> killall -9 postgres
> mv pgdata pgdata-save

With scale 100, but the default shared_buffers, you'll frequently hit the OS for reads/writes. Which will require the same metadata in the kernel, but then also memcpys between kernel and userspace.

A word of caution about strace's -c: In my experience the total time measurements are very imprecise somehow. I think it might be that some of the overhead of ptracing will be attributed to the syscalls or such, which means frequent syscalls appear relatively more expensive than they really are.

Greetings,

Andres Freund

* it insults my sense of aesthetics
On Tue, Feb 4, 2020 at 2:23 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-12-31 17:05:31 +1300, Thomas Munro wrote: > > There is one potentially interesting case that doesn't require any > > kind of shared cache invalidation AFAICS. XLogReadBufferExtended() > > calls smgrnblocks() for every buffer access, even if the buffer is > > already in our buffer pool. > > Yea, that's really quite bad*. The bit about doing so even when already > in the buffer pool is particularly absurd. Needing to have special > handling in mdcreate() for XLogReadBufferExtended() always calling it is > also fairly ugly. Yeah. It seems like there are several things to fix there. So now I'm wondering if we should start out by trying to cache the size it in the smgr layer for recovery only, like in the attached, and then later try to extend the scheme to cover the shared case as discussed at the beginning of the thread. > A word of caution about strace's -c: In my experience the total time > measurements are very imprecise somehow. I think it might be that some > of the overhead of ptracing will be attributed to the syscalls or such, > which means frequent syscalls appear relatively more expensive than they > really are. Yeah, those times are large, meaningless tracing overheads. While some systems might in fact be happy replaying a couple of million lseeks per second, (1) I'm pretty sure some systems would not be happy about that (see claims in this thread) and (2) it means you can't practically use strace on recovery because it slows it down to a crawl, which is annoying.
Attachment
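A very rough idea of what such a recovery-only cache at the smgr level could look like (the field name and exact conditions are guesses for illustration and may differ from the attached patch): because only the startup process changes relation sizes during recovery, a value remembered in the SMgrRelation can be trusted there without any shared invalidation.

BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
    BlockNumber result;

    /* In recovery, trust a previously cached value if we have one. */
    if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];

    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);  /* lseek */

    /* smgrextend()/smgrtruncate() would keep this field up to date too. */
    if (InRecovery)
        reln->smgr_cached_nblocks[forknum] = result;

    return result;
}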
On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > ... (1) I'm pretty sure some systems would not be happy > about that (see claims in this thread) ... I poked a couple of people off-list and learned that, although the Linux and FreeBSD systems I tried could do a million lseek(SEEK_END) calls in 60-100ms, a couple of different Windows systems took between 1.1 and 3.4 seconds (worse times when running as non-administrator), which seems to be clearly in the territory that can put a dent in your recovery speeds on that OS. I also learned that GetFileSizeEx() is "only" about twice as fast, which is useful information for that other thread about which syscall to use for this, but it's kind of irrelevant this thread about how we can get rid of these crazy syscalls altogether.
On Fri, Feb 14, 2020 at 1:50 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > ... (1) I'm pretty sure some systems would not be happy > > about that (see claims in this thread) ... > > I poked a couple of people off-list and learned that, although the > Linux and FreeBSD systems I tried could do a million lseek(SEEK_END) > calls in 60-100ms, a couple of different Windows systems took between > 1.1 and 3.4 seconds (worse times when running as non-administrator), > which seems to be clearly in the territory that can put a dent in your > recovery speeds on that OS. I also learned that GetFileSizeEx() is > "only" about twice as fast, which is useful information for that other > thread about which syscall to use for this, but it's kind of > irrelevant this thread about how we can get rid of these crazy > syscalls altogether. I received a report off-list from someone who experimented with the patch I shared earlier on this thread[1], using a crash recovery test similar to one I showed on the WAL prefetching thread[2] (which he was also testing, separately). He observed that the lseek() rate in recovery was actually a significant problem for his environment on unpatched master, showing up as the top sampled function in perf, and by using that patch he got (identical) crash recovery to complete in 41s instead of 65s, with a sane looking perf (= 58% speedup). His test system was an AWS i3.16xlarge running an unspecified version of Linux. I think it's possible that all of the above reports can be explained by variations in the speculative execution bug mitigations that are enabled by default on different systems, but I haven't tried to investigate that yet. [1] https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks [2] https://www.postgresql.org/message-id/CA%2BhUKG%2BOcWr8nHqa3%3DZoPTGgdDcrgjSC4R2sT%2BjrUunBua3rpg%40mail.gmail.com
On Sat, Apr 11, 2020 at 4:10 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I received a report off-list from someone who experimented with the > patch I shared earlier on this thread[1], using a crash recovery test > similar to one I showed on the WAL prefetching thread[2] (which he was > also testing, separately). Rebased. I'll add this to the open commitfest.
Attachment
On Sat, Jun 20, 2020 at 10:32 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Rebased. I'll add this to the open commitfest. I traced the recovery process while running pgbench -M prepared -c16 -j16 -t10000 (= 160,000 transactions). With the patch, the number of lseeks went from 1,080,661 (6.75 per pgbench transaction) to just 85. I went ahead and pushed this patch. There's still the matter of crazy numbers of lseeks in regular backends; looking at all processes while running the above test, I get 1,469,060 (9.18 per pgbench transaction) without -M prepared, and 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that with this approach will require bullet-proof shared invalidation, but I think it's doable, in later work.
On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> There's still the matter of crazy numbers of lseeks in regular
> backends; looking at all processes while running the above test, I get
> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
> 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
> with this approach will require bullet-proof shared invalidation, but
> I think it's doable, in later work.

I couldn't help hacking on this a bit. Perhaps instead of bullet-proof general shared invalidation, we should have a way for localised bits of code to state that they are ok with a "relaxed" value. Then they should explain the theory for why that is safe in each case based on arguments about memory barrier pairs, but leave all other callers making the system call so that we don't torpedo the whole project by making it too hard. For now, the main cases I'm interested in are the ones that show up all the time as the dominant system call in various workloads:

(1) Sequential scan relation-size probe. This should be OK with a relaxed value. You can't miss the invalidation for a truncation, because the read barrier in your lock acquisition on the relation pairs with the write barrier in the exclusive lock release of any session that truncated, and you can't miss any relation extension that your snapshot can see, because the release of the extension lock pairs with the lock involved in snapshot acquisition.

(2) Planner relation-size probe, which should be OK with a relaxed value. Similar arguments give you a fresh enough view, I think.

Or maybe there is a theory along these lines that already covers every use of smgrnblocks(), so a separate mode isn't required... I don't know!

The attached sketch-quality patch shows some of the required elements working, though I ran out of steam trying to figure out how to thread this thing through the right API layers, so for now it always asks for a relaxed value in table_block_relation_size().
Attachment
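One possible API shape for that idea, with entirely made-up names (an extra mode parameter, rather than whatever the sketch patch actually does), might be:

/* Callers state explicitly whether a slightly stale answer is acceptable. */
typedef enum SmgrNblocksMode
{
    SMGR_NBLOCKS_ACCURATE,      /* always ask the kernel (lseek(SEEK_END)) */
    SMGR_NBLOCKS_RELAXED        /* a recently published shared value is OK */
} SmgrNblocksMode;

extern BlockNumber smgrnblocks_ext(SMgrRelation reln, ForkNumber forknum,
                                   SmgrNblocksMode mode);

/*
 * Case (1): sequential scans.  A relaxed value is claimed to be safe here
 * because the caller already holds a lock on the relation (see the
 * barrier-pairing argument above).  Simplified: InvalidForkNumber handling
 * from the real table_block_relation_size() is elided.
 */
uint64
table_block_relation_size(Relation rel, ForkNumber forkNumber)
{
    RelationOpenSmgr(rel);
    return (uint64) BLCKSZ *
        smgrnblocks_ext(rel->rd_smgr, forkNumber, SMGR_NBLOCKS_RELAXED);
}

/* Case (2): estimate_rel_size() in the planner could pass the same mode. */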
On 01.08.2020 00:56, Thomas Munro wrote:
> On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> There's still the matter of crazy numbers of lseeks in regular
>> backends; looking at all processes while running the above test, I get
>> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
>> 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
>> with this approach will require bullet-proof shared invalidation, but
>> I think it's doable, in later work.
> I couldn't help hacking on this a bit. Perhaps instead of
> bullet-proof general shared invalidation, we should have a way for
> localised bits of code to state that they are ok with a "relaxed"
> value. Then they should explain the theory for why that is safe in
> each case based on arguments about memory barrier pairs, but leave all
> other callers making the system call so that we don't torpedo the
> whole project by making it too hard. For now, the main cases I'm
> interested in are the ones that show up all the time as the dominant
> system call in various workloads:
>
> (1) Sequential scan relation-size probe. This should be OK with a
> relaxed value. You can't miss the invalidation for a truncation,
> because the read barrier in your lock acquisition on the relation
> pairs with the write barrier in the exclusive lock release of any
> session that truncated, and you can't miss any relation
> extension that your snapshot can see, because the release of the
> extension lock pairs with the lock involved in snapshot acquisition.
>
> (2) Planner relation-size probe, which should be OK with a relaxed
> value. Similar arguments give you a fresh enough view, I think.
>
> Or maybe there is a theory along these lines that already covers every
> use of smgrnblocks(), so a separate mode isn't required... I don't
> know!
>
> The attached sketch-quality patch shows some of the required elements
> working, though I ran out of steam trying to figure out how to thread
> this thing through the right API layers so for now it always asks for
> a relaxed value in table_block_relation_size().

So in this thread three solutions are proposed:
1. "bullet-proof general shared invalidation"
2. recovery-only solution avoiding shared memory and invalidation
3. "relaxed" shared memory cache with simplified invalidation

If solving such a very important but still specific problem of caching relation size requires so much effort, then maybe it is time to go further in the direction towards a shared catalog? This shared relation cache can easily store relation size as well. In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations O(max_connections*n_relations)
- sophisticated invalidation protocol and related performance issues

Certainly access to shared cache requires extra synchronization. But DDL operations are relatively rare. So in most cases we will have only shared locks. Maybe the overhead of locking will not be too large?
On Mon, 3 Aug 2020 at 17:54, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> On 01.08.2020 00:56, Thomas Munro wrote:
> > On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >> There's still the matter of crazy numbers of lseeks in regular
> >> backends; looking at all processes while running the above test, I get
> >> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
> >> 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
> >> with this approach will require bullet-proof shared invalidation, but
> >> I think it's doable, in later work.
> > I couldn't help hacking on this a bit. Perhaps instead of
> > bullet-proof general shared invalidation, we should have a way for
> > localised bits of code to state that they are ok with a "relaxed"
> > value. Then they should explain the theory for why that is safe in
> > each case based on arguments about memory barrier pairs, but leave all
> > other callers making the system call so that we don't torpedo the
> > whole project by making it too hard. For now, the main cases I'm
> > interested in are the ones that show up all the time as the dominant
> > system call in various workloads:
> >
> > (1) Sequential scan relation-size probe. This should be OK with a
> > relaxed value. You can't miss the invalidation for a truncation,
> > because the read barrier in your lock acquisition on the relation
> > pairs with the write barrier in the exclusive lock release of any
> > session that truncated, and you can't miss any relation
> > extension that your snapshot can see, because the release of the
> > extension lock pairs with the lock involved in snapshot acquisition.
> >
> > (2) Planner relation-size probe, which should be OK with a relaxed
> > value. Similar arguments give you a fresh enough view, I think.
> >
> > Or maybe there is a theory along these lines that already covers every
> > use of smgrnblocks(), so a separate mode isn't required... I don't
> > know!
> >
> > The attached sketch-quality patch shows some of the required elements
> > working, though I ran out of steam trying to figure out how to thread
> > this thing through the right API layers so for now it always asks for
> > a relaxed value in table_block_relation_size().
> So in this thread three solutions are proposed:
> 1. "bullet-proof general shared invalidation"
> 2. recovery-only solution avoiding shared memory and invalidation
> 3. "relaxed" shared memory cache with simplified invalidation
> If solving such a very important but still specific problem of caching
> relation size requires so much effort,
> then maybe it is time to go further in the direction towards a shared
> catalog?
> This shared relation cache can easily store relation size as well.
> In addition it will solve a lot of other problems:
> - noticeable overhead of local relcache warming
> - large memory consumption in case of larger number of relations
> O(max_connections*n_relations)
> - sophisticated invalidation protocol and related performance issues
> Certainly access to shared cache requires extra synchronization. But DDL
> operations are relatively rare.
Some applications use CREATE TEMP TABLE, DROP TEMP TABLE, or CREATE TABLE AS SELECT .. very frequently.
Regards
Pavel
> So in most cases we will have only shared locks. Maybe the overhead of
> locking will not be too large?
On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > So in this thread three solutions are proposed: > 1. "bullet-proof general shared invalidation" > 2. recovery-only solution avoiding shared memory and invalidation > 3. "relaxed" shared memory cache with simplified invalidation Hi Konstantin, By the way, point 2 is now committed (c5315f4f). As for 1 vs 3, I wasn't proposing two different invalidation techniques: in both approaches, I'm calling the cached values "relaxed", meaning that their freshness is controlled by memory barriers elsewhere that the caller has to worry about. I'm just suggesting for idea 3 that it might be a good idea to use relaxed values only in a couple of hot code paths where we do the analysis required to convince ourselves that memory barriers are already in the right places to make it safe. By "bullet-proof" I meant that we could in theory convince ourselves that *all* users of smgrnblocks() can safely use relaxed values, but that's hard. That said, the sketch patch I posted certainly needs more work, and maybe someone has a better idea on how to do it. > If solving such very important by still specific problem of caching > relation size requires so much efforts, > then may be it is time to go further in the direction towards shared > catalog? I wouldn't say it requires too much effort, at least the conservative approach (3). But I also agree with what you're saying, in the long term: > This shared relation cache can easily store relation size as well. > In addition it will solve a lot of other problems: > - noticeable overhead of local relcache warming > - large memory consumption in case of larger number of relations > O(max_connections*n_relations) > - sophisticated invalidation protocol and related performance issues > Certainly access to shared cache requires extra synchronization.But DDL > operations are relatively rare. > So in most cases we will have only shared locks. May be overhead of > locking will not be too large? Yeah, I would be very happy if we get a high performance shared sys/rel/plan/... caches in the future, and separately, having the relation size available in shmem is something that has come up in discussions about other topics too (tree-based buffer mapping, multi-relation data files, ...). I agree with you that our cache memory usage is a big problem, and it will be great to fix that one day. I don't think that should stop us from making small improvements to the existing design in the meantime, though. "The perfect is the enemy of the good." Look at all this trivial stuff: https://wiki.postgresql.org/wiki/Syscall_Reduction I don't have high quality data to report yet, but from simple tests I'm seeing orders of magnitude fewer syscalls per pgbench transaction in recovery, when comparing 11 to 14devel, due to various changes. Admittedly, the size probes in regular backends aren't quite as bad as the recovery ones were, because they're already limited by the rate you can throw request/response cycles at the DB, but there are still some cases like planning for high-partition counts and slow filesystems that can benefit measurably from caching.
On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: > > This shared relation cache can easily store relation size as well. > > In addition it will solve a lot of other problems: > > - noticeable overhead of local relcache warming > > - large memory consumption in case of larger number of relations > > O(max_connections*n_relations) > > - sophisticated invalidation protocol and related performance issues > > Certainly access to shared cache requires extra synchronization.But DDL > > operations are relatively rare. > > So in most cases we will have only shared locks. May be overhead of > > locking will not be too large? > > Yeah, I would be very happy if we get a high performance shared > sys/rel/plan/... caches in the future, and separately, having the > relation size available in shmem is something that has come up in > discussions about other topics too (tree-based buffer mapping, > multi-relation data files, ...). ... After recent discussions about the limitations of relying on SEEK_END in a nearby thread[1], I decided to try to prototype a system for tracking relation sizes properly in shared memory. Earlier in this thread I was talking about invalidation schemes for backend-local caches, because I only cared about performance. In contrast, this new system has SMgrRelation objects that point to SMgrSharedRelation objects (better names welcome) that live in a pool in shared memory, so that all backends agree on the size. The scheme is described in the commit message and comments. The short version is that smgr.c tracks the "authoritative" size of any relation that has recently been extended or truncated, until it has been fsync'd. By authoritative, I mean that there may be dirty buffers in that range in our buffer pool, even if the filesystem has vaporised the allocation of disk blocks and shrunk the file. That is, it's not really a "cache". It's also not like a shared catalog, which Konstantin was talking about... it's more like the pool of inodes in a kernel's memory. It holds all currently dirty SRs (SMgrSharedRelations), plus as many clean ones as it can fit, with some kind of reclamation scheme, much like buffers. Here, "dirty" means the size changed. Attached is an early sketch, not debugged much yet (check undir contrib/postgres_fdw fails right now for a reason I didn't look into), and there are clearly many architectural choices one could make differently, and more things to be done... but it seemed like enough of a prototype to demonstrate the concept and fuel some discussion about this and whatever better ideas people might have... Thoughts? [1] https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com
Attachment
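To make the shape of the proposal easier to follow, here is a rough reconstruction of the shared-memory structures involved, pieced together from the discussion in this thread; the field names and types are approximations, not the exact contents of the attached patch.

    /* Rough sketch reconstructed from the discussion; not the patch itself. */
    typedef struct SMgrSharedRelation
    {
        RelFileNodeBackend rnode;               /* which relation this slot describes */
        pg_atomic_uint32 flags;                 /* SR_LOCKED, SR_VALID, SR_DIRTY(fork), ... */
        uint64      generation;                 /* bumped when the slot is evicted or dropped */
        BlockNumber nblocks[MAX_FORKNUM + 1];   /* authoritative size of each fork */
    } SMgrSharedRelation;

    /*
     * An entry in the hash table that lets us find a pool slot by rnode;
     * index points into SMgrSharedRelationPool->objects[].
     */
    typedef struct SMgrSharedRelationMapping
    {
        RelFileNodeBackend rnode;
        int         index;
    } SMgrSharedRelationMapping;

Each backend's SMgrRelation then keeps a pointer to its slot plus the generation it saw, which is what the fast path discussed later in the thread checks.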
On 16.11.2020 10:11, Thomas Munro wrote: > On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik >> <k.knizhnik@postgrespro.ru> wrote: >>> This shared relation cache can easily store relation size as well. >>> In addition it will solve a lot of other problems: >>> - noticeable overhead of local relcache warming >>> - large memory consumption in case of larger number of relations >>> O(max_connections*n_relations) >>> - sophisticated invalidation protocol and related performance issues >>> Certainly access to shared cache requires extra synchronization.But DDL >>> operations are relatively rare. >>> So in most cases we will have only shared locks. May be overhead of >>> locking will not be too large? >> Yeah, I would be very happy if we get a high performance shared >> sys/rel/plan/... caches in the future, and separately, having the >> relation size available in shmem is something that has come up in >> discussions about other topics too (tree-based buffer mapping, >> multi-relation data files, ...). ... > After recent discussions about the limitations of relying on SEEK_END > in a nearby thread[1], I decided to try to prototype a system for > tracking relation sizes properly in shared memory. Earlier in this > thread I was talking about invalidation schemes for backend-local > caches, because I only cared about performance. In contrast, this new > system has SMgrRelation objects that point to SMgrSharedRelation > objects (better names welcome) that live in a pool in shared memory, > so that all backends agree on the size. The scheme is described in > the commit message and comments. The short version is that smgr.c > tracks the "authoritative" size of any relation that has recently been > extended or truncated, until it has been fsync'd. By authoritative, I > mean that there may be dirty buffers in that range in our buffer pool, > even if the filesystem has vaporised the allocation of disk blocks and > shrunk the file. > > That is, it's not really a "cache". It's also not like a shared > catalog, which Konstantin was talking about... it's more like the pool > of inodes in a kernel's memory. It holds all currently dirty SRs > (SMgrSharedRelations), plus as many clean ones as it can fit, with > some kind of reclamation scheme, much like buffers. Here, "dirty" > means the size changed. I will look at your implementation more precisely latter. But now I just to ask one question: is it reasonable to spent substantial amount of efforts trying to solve the problem of redundant calculations of relation size, which seems to be just part of more general problem: shared relation cache? It is well known problem: private caches of system catalog can cause unacceptable memory consumption for large number of backends. If there are thousands relations in the database (which is not so rare case, especially taken in account ORMs and temporary tables) then size of catalog cache can be several megabytes. If it is multiplies by thousand backends, then we get gigabytes of memory used just for private catalog caches. Thanks to Andres optimizations of taking snapshots, now Postgres can normally handle thousands of connections. So this extremely inefficient use of memory for private catalog caches becomes more and more critical. Also private caches requires sophisticated and error prone invalidation mechanism. If we will have such shared cache, then keeping shared relation size seems to be trivial task, isn't it? 
So maybe, before "diving" into the problem of maintaining a shared pool of dirtied "inodes", we should first discuss and concentrate on solving the more generic problem?
On 16.11.2020 10:11, Thomas Munro wrote: > On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik >> <k.knizhnik@postgrespro.ru> wrote: >>> This shared relation cache can easily store relation size as well. >>> In addition it will solve a lot of other problems: >>> - noticeable overhead of local relcache warming >>> - large memory consumption in case of larger number of relations >>> O(max_connections*n_relations) >>> - sophisticated invalidation protocol and related performance issues >>> Certainly access to shared cache requires extra synchronization.But DDL >>> operations are relatively rare. >>> So in most cases we will have only shared locks. May be overhead of >>> locking will not be too large? >> Yeah, I would be very happy if we get a high performance shared >> sys/rel/plan/... caches in the future, and separately, having the >> relation size available in shmem is something that has come up in >> discussions about other topics too (tree-based buffer mapping, >> multi-relation data files, ...). ... > After recent discussions about the limitations of relying on SEEK_END > in a nearby thread[1], I decided to try to prototype a system for > tracking relation sizes properly in shared memory. Earlier in this > thread I was talking about invalidation schemes for backend-local > caches, because I only cared about performance. In contrast, this new > system has SMgrRelation objects that point to SMgrSharedRelation > objects (better names welcome) that live in a pool in shared memory, > so that all backends agree on the size. The scheme is described in > the commit message and comments. The short version is that smgr.c > tracks the "authoritative" size of any relation that has recently been > extended or truncated, until it has been fsync'd. By authoritative, I > mean that there may be dirty buffers in that range in our buffer pool, > even if the filesystem has vaporised the allocation of disk blocks and > shrunk the file. > > That is, it's not really a "cache". It's also not like a shared > catalog, which Konstantin was talking about... it's more like the pool > of inodes in a kernel's memory. It holds all currently dirty SRs > (SMgrSharedRelations), plus as many clean ones as it can fit, with > some kind of reclamation scheme, much like buffers. Here, "dirty" > means the size changed. > > Attached is an early sketch, not debugged much yet (check undir > contrib/postgres_fdw fails right now for a reason I didn't look into), > and there are clearly many architectural choices one could make > differently, and more things to be done... but it seemed like enough > of a prototype to demonstrate the concept and fuel some discussion > about this and whatever better ideas people might have... > > Thoughts? > > [1] https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com I noticed that there are several fragments like this: if (!smgrexists(rel->rd_smgr, FSM_FORKNUM)) smgrcreate(rel->rd_smgr, FSM_FORKNUM, false); fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM); I wonder if it will be more efficient and simplify code to add "create_if_not_exists" parameter to smgrnblocks? It will avoid extra hash lookup and avoid explicit checks for fork presence in multiple places? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
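A sketch of what that suggestion might look like at such a call site; smgrnblocks_create is a hypothetical name used only for illustration, not an interface that exists in the tree.

    /* today (as quoted above) */
    if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
        smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
    fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);

    /* with a hypothetical create_if_not_exists flag */
    fsm_nblocks_now = smgrnblocks_create(rel->rd_smgr, FSM_FORKNUM, true);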
On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > I will look at your implementation more precisely latter. Thanks! Warning: I thought about making a thing like this for a while, but the patch itself is only a one-day prototype, so I am sure you can find many bugs... Hmm, I guess the smgrnblocks_fast() code really needs a nice big comment to explain the theory about why this value is fresh enough, based on earlier ideas about implicit memory barriers. (Something like: if you're extending, you must have acquired the extension lock; if you're scanning you must have acquired a snapshot that only needs to see blocks that existed at that time and concurrent truncations are impossible; I'm wondering about special snapshots like VACUUM, and whether this is too relaxed for that, or if it's covered by existing horizon-based interlocking...). > But now I just to ask one question: is it reasonable to spent > substantial amount of efforts trying to solve > the problem of redundant calculations of relation size, which seems to > be just part of more general problem: shared relation cache? It's a good question. I don't know exactly how to make a shared relation cache (I have only some vague ideas and read some of Idehira-san's work) but I agree that it would be very nice. However, I think that's an independent and higher level thing, because Relation != SMgrRelation. What is an SMgrRelation? Not much! It holds file descriptors, which are different in every backend so you can't share them, and then some information about sizes and a target block for insertions. With this patch, the sizes become shared and more reliable, so there's *almost* nothing left at all except for the file descriptors and the pointer to the shared object. I haven't looked into smgr_targblock, the last remaining non-shared thing, but there might be an opportunity to do something clever for parallel inserters (the proposed feature) by coordinating target blocks in SMgrSharedRelation; I don't know. > It is well known problem: private caches of system catalog can cause > unacceptable memory consumption for large number of backends. Yeah. > Also private caches requires sophisticated and error prone invalidation > mechanism. > > If we will have such shared cache, then keeping shared relation size > seems to be trivial task, isn't it? > So may be we before "diving" into the problem of maintaining shared pool > of dirtied "inodes" we can better discuss and conerntrate on solving > more generic problem? Sure, we can talk about how to make a shared relation cache (and syscache etc). I think it'll be much harder than this shared SMGR concept. Even if we figure out how to share palloc'd trees of nodes with correct invalidation and interlocking (ie like Ideriha-san's work in [1]), including good performance and space management/replacement, I guess we'll still need per-backend Relation objects to point to the per-backend SMgrRelation objects to hold the per-backend file descriptors. (Until we have threads...). But... do you think sharing SMGR relations to the maximum extent possible conflicts with that? Are you thinking there should be some general component here, perhaps a generic system for pools of shmem objects with mapping tables and a replacement policy? [1] https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04
From: Thomas Munro <thomas.munro@gmail.com> > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: > > I will look at your implementation more precisely latter. > > Thanks! Warning: I thought about making a thing like this for a > while, but the patch itself is only a one-day prototype, so I am sure > you can find many bugs... Hmm, I guess the smgrnblocks_fast() code > really needs a nice big comment to explain the theory about why this > value is fresh enough, based on earlier ideas about implicit memory > barriers. (Something like: if you're extending, you must have > acquired the extension lock; if you're scanning you must have acquired > a snapshot that only needs to see blocks that existed at that time and > concurrent truncations are impossible; I'm wondering about special > snapshots like VACUUM, and whether this is too relaxed for that, or if > it's covered by existing horizon-based interlocking...). We'd like to join the discussion and review, too, so that Kirk's optimization for VACUUM truncation and TRUNCATE can extend beyond recovery to operations during normal operation. In that regard, would you mind becoming a committer for his patch? We think it's committable now. I'm fine with using the unmodified smgrnblocks() as your devil's suggestion. > It's a good question. I don't know exactly how to make a shared > relation cache (I have only some vague ideas and read some of > Idehira-san's work) but I agree that it would be very nice. However, > I think that's an independent and higher level thing, because Relation > != SMgrRelation. > Sure, we can talk about how to make a shared relation cache (and > syscache etc). I think it'll be much harder than this shared SMGR > concept. Even if we figure out how to share palloc'd trees of nodes > with correct invalidation and interlocking (ie like Ideriha-san's work > in [1]), including good performance and space management/replacement, > I guess we'll still need per-backend Relation objects to point to the > per-backend SMgrRelation objects to hold the per-backend file > descriptors. (Until we have threads...). But... do you think sharing > SMGR relations to the maximum extent possible conflicts with that? > Are you thinking there should be some general component here, perhaps > a generic system for pools of shmem objects with mapping tables and a > replacement policy? Yes, Ideriha-san and we have completed the feature called global metacache for placing relation and catalog caches in shared memory and limiting their sizes, and incorporated it in our product to meet our customer's request. I hope we'll submit the patch in the near future when our resources permit. Regards Takayuki Tsunakawa
On Tuesday, November 17, 2020 9:40 AM, Tsunkawa-san wrote: > From: Thomas Munro <thomas.munro@gmail.com> > > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik > > <k.knizhnik@postgrespro.ru> wrote: > > > I will look at your implementation more precisely latter. > > > > Thanks! Warning: I thought about making a thing like this for a > > while, but the patch itself is only a one-day prototype, so I am sure > > you can find many bugs... Hmm, I guess the smgrnblocks_fast() code > > really needs a nice big comment to explain the theory about why this > > value is fresh enough, based on earlier ideas about implicit memory > > barriers. (Something like: if you're extending, you must have > > acquired the extension lock; if you're scanning you must have acquired > > a snapshot that only needs to see blocks that existed at that time and > > concurrent truncations are impossible; I'm wondering about special > > snapshots like VACUUM, and whether this is too relaxed for that, or if > > it's covered by existing horizon-based interlocking...). > > We'd like to join the discussion and review, too, so that Kirk's optimization for > VACUUM truncation and TRUNCATRE can extend beyond recovery to > operations during normal operation. In that regard, would you mind > becoming a committer for his patch? We think it's committable now. I'm > fine with using the unmodified smgrnblocks() as your devil's suggestion. Just saw this thread has been bumped. And yes, regarding the vacuum and truncate optimization during recovery, it would still work even without the unmodified smgrnblocks. We'd just need to remove the conditions that checks the flag parameter. And if we explore your latest prototype patch, it can possibly use the new smgrnblocks_xxxx APIs introduced here. That said, we'd like to explore the next step of extending the feature to normal operations. Regards, Kirk Jamison
At Mon, 16 Nov 2020 20:11:52 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in > After recent discussions about the limitations of relying on SEEK_END > in a nearby thread[1], I decided to try to prototype a system for > tracking relation sizes properly in shared memory. Earlier in this > thread I was talking about invalidation schemes for backend-local > caches, because I only cared about performance. In contrast, this new > system has SMgrRelation objects that point to SMgrSharedRelation > objects (better names welcome) that live in a pool in shared memory, > so that all backends agree on the size. The scheme is described in > the commit message and comments. The short version is that smgr.c > tracks the "authoritative" size of any relation that has recently been > extended or truncated, until it has been fsync'd. By authoritative, I > mean that there may be dirty buffers in that range in our buffer pool, > even if the filesystem has vaporised the allocation of disk blocks and > shrunk the file. > > That is, it's not really a "cache". It's also not like a shared > catalog, which Konstantin was talking about... it's more like the pool > of inodes in a kernel's memory. It holds all currently dirty SRs > (SMgrSharedRelations), plus as many clean ones as it can fit, with > some kind of reclamation scheme, much like buffers. Here, "dirty" > means the size changed. > > Attached is an early sketch, not debugged much yet (check undir > contrib/postgres_fdw fails right now for a reason I didn't look into), > and there are clearly many architectural choices one could make > differently, and more things to be done... but it seemed like enough > of a prototype to demonstrate the concept and fuel some discussion > about this and whatever better ideas people might have... > > Thoughts? > > [1] https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com I was naively thinking that we could just remember the size of all database files in a shared array but I realized that that needs a hash table to translate rnodes into array indexes, which could grow huge.. The proposed way tries to make sure that the next fseek call tells the truth before forgetting cached values. On the other hand a shared-smgrrel allows to defer fsyncing of a (local) smgrrel until someone evicts the entry. That seems to me to be the minimal mechanism that allows us to keep us being able to know the right file size at all times, getting rid of a need to remember the size of all database files in shared memory. I'm afraid that it might cause fsync storms on very-huge systems, though.. However, if we try to do the similar thing in any other way, it seems to be that it's going grown to be more like the pg_smgr catalog. But that seems going to eat more memory and cause invalidation storms. Sorry for the rambling thought, but I think this is basically in the right direction. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Nov 17, 2020 at 4:13 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: > > I will look at your implementation more precisely latter. > > Thanks! Warning: I thought about making a thing like this for a > while, but the patch itself is only a one-day prototype, so I am sure > you can find many bugs... Hmm, I guess the smgrnblocks_fast() code > really needs a nice big comment to explain the theory about why this > value is fresh enough, based on earlier ideas about implicit memory > barriers. (Something like: if you're extending, you must have > acquired the extension lock; if you're scanning you must have acquired > a snapshot that only needs to see blocks that existed at that time and > concurrent truncations are impossible; I'm wondering about special > snapshots like VACUUM, and whether this is too relaxed for that, or if > it's covered by existing horizon-based interlocking...). > Yeah, it is good to verify VACUUM stuff but I have another question here. What about queries having functions that access the same relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function access the relation t1)? Now, here I think because the relation 't1' is already opened, it might use the same value of blocks from the shared cache even though the snapshot for relation 't1' when accessed from func() might be different. Am, I missing something, or is it dealt in some way? -- With Regards, Amit Kapila.
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, it is good to verify VACUUM stuff but I have another question > here. What about queries having functions that access the same > relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function > access the relation t1)? Now, here I think because the relation 't1' > is already opened, it might use the same value of blocks from the > shared cache even though the snapshot for relation 't1' when accessed > from func() might be different. Am, I missing something, or is it > dealt in some way? I think it should be covered by the theory about implicit memory barriers snapshots, but to simplify things I have now removed the lock-free stuff from the main patch (0001), because it was a case of premature optimisation and it distracted from the main concept. The main patch has 128-way partitioned LWLocks for the mapping table, and then per-relfilenode spinlocks for modifying the size. There are still concurrency considerations, which I think are probably handled with the dirty-update-wins algorithm you see in the patch. In short: due to extension and exclusive locks, size changes AKA dirty updates are serialised, but clean updates can run concurrently, so we just have to make sure that clean updates never clobber dirty updates -- do you think that is on the right path?
Attachment
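To illustrate the "clean updates never clobber dirty updates" rule described above, here is a deliberately simplified sketch; smgr_record_size is an invented name, and the real patch spreads this logic across several call sites.

    /*
     * Illustrative sketch only.  A "dirty" update comes from extend/truncate
     * and is authoritative; a "clean" update merely caches what
     * lseek(SEEK_END) reported and must never overwrite a dirty value.
     */
    static void
    smgr_record_size(SMgrSharedRelation *sr, ForkNumber forknum,
                     BlockNumber nblocks, bool dirty)
    {
        uint32      flags = smgr_lock_sr(sr);   /* tiny spinlock on the flags word */

        if (dirty)
        {
            sr->nblocks[forknum] = nblocks;
            flags |= SR_DIRTY(forknum) | SR_JUST_DIRTIED(forknum);
        }
        else if (!(flags & SR_DIRTY(forknum)))
            sr->nblocks[forknum] = nblocks;     /* clean update: only if not dirty */

        smgr_unlock_sr(sr, flags);
    }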
On Thu, Nov 19, 2020 at 1:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, it is good to verify VACUUM stuff but I have another question
> here. What about queries having functions that access the same
> relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
> access the relation t1)? Now, here I think because the relation 't1'
> is already opened, it might use the same value of blocks from the
> shared cache even though the snapshot for relation 't1' when accessed
> from func() might be different. Am, I missing something, or is it
> dealt in some way?
I think it should be covered by the theory about implicit memory
barriers snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept. The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size. There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch. In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?
Thank you for working on it.
I spent one day studying the patch and I want to talk about one question for now.
What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen
if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)?
--
Best Regards
Andy Fan
Hi Andy, On Thu, Dec 17, 2020 at 7:29 PM Andy Fan <zhihui.fan1213@gmail.com> wrote: > I spent one day studying the patch and I want to talk about one question for now. > What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen > if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)? The underlying theory of the patch is that after fsync() succeeds, the new size is permanent, but before that it could change at any time because of asynchronous activities in the kernel*. Therefore, if you want to evict the size from shared memory, you have to fsync() the file first. I used smgrimmedsync() to do that. *I suspect that it can really only shrink with network file systems such as NFS and GlusterFS, where the kernel has to take liberties to avoid network round trips for every file extension, but I don't know the details on all 9 of our supported OSes and standards aren't very helpful for understanding buffered I/O.
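A simplified sketch of the eviction flow being described, distilled from the prototype; the mapping-table locking and the retry bookkeeping are omitted.

    /*
     * Sketch only: 'sr' is the slot being evicted, 'flags' was returned by
     * smgr_lock_sr(sr), and 'rnode' is a local copy of sr->rnode taken earlier.
     */
    flags |= SR_SYNCING(forknum);
    flags &= ~SR_JUST_DIRTIED(forknum);
    smgr_unlock_sr(sr, flags);

    /* Make the current size durable before we allow it to be forgotten. */
    smgrimmedsync(smgropen(rnode.node, rnode.backend), forknum);

    flags = smgr_lock_sr(sr);
    if (flags & SR_JUST_DIRTIED(forknum))
        goto retry;             /* dirtied again while syncing: give up on this slot */
    flags &= ~(SR_SYNCING(forknum) | SR_DIRTY(forknum));    /* this fork is now clean */
    smgr_unlock_sr(sr, flags);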
Hi Thomas,
Thank you for your quick response.
On Thu, Dec 17, 2020 at 3:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Hi Andy,
On Thu, Dec 17, 2020 at 7:29 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> I spent one day studying the patch and I want to talk about one question for now.
> What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen
> if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)?
The underlying theory of the patch is that after fsync() succeeds, the
new size is permanent, but before that it could change at any time
because of asynchronous activities in the kernel*. Therefore, if you
want to evict the size from shared memory, you have to fsync() the
file first. I used smgrimmedsync() to do that.
*I suspect that it can really only shrink with network file systems
such as NFS and GlusterFS, where the kernel has to take liberties to
avoid network round trips for every file extension, but I don't know
the details on all 9 of our supported OSes and standards aren't very
helpful for understanding buffered I/O.
Let me try to understand your point. Suppose process 1 extends a file to
2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
return 1 based on the comments in the ReadBuffer_common ("because
of buggy Linux kernels that sometimes return an seek SEEK_END result
that doesn't account for a recent write"). b). When this patch is introduced,
we would get 2 blocks for sure if we can get the size from cache. c). After
user reads the 2 blocks from cache and then the cache is evicted, and user
reads the blocks again, it is possible to get 1 again.
Do we need to consider case a) in this patch? And is case c) the situation you
want to avoid, and do we have to introduce fsync to achieve this? Basically I
think case a) should not happen in practice, so we don't need to handle case
c), and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flags.
I'm not opposed to adding them, I am just interested in the background in case
you are also willing to discuss it.
I would suggest the change below for smgrnblocks_fast, FYI
@@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
/*
* The generation doesn't match, the shared relation must have been
* evicted since we got a pointer to it. We'll need to do more work.
+ * and invalid the fast path for next time.
*/
+ reln->smgr_shared = NULL;
}
--
Best Regards
Andy Fan
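For readers without the patch handy, the fast path under discussion looks roughly like the following, with Andy's suggestion applied; the exact field names (for example smgr_shared_generation) and the fallback chain are approximations of the posted patch, not its actual code.

    static BlockNumber
    smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
    {
        SMgrSharedRelation *sr = reln->smgr_shared;

        if (sr != NULL)
        {
            pg_read_barrier();

            /* Is the slot still the one we originally attached to? */
            if (reln->smgr_shared_generation == sr->generation)
                return sr->nblocks[forknum];

            /*
             * The generation doesn't match; the shared relation must have
             * been evicted since we got a pointer to it.  Invalidate the
             * fast path for next time and fall through to the slow path.
             */
            reln->smgr_shared = NULL;
        }
        /* Slow path: look up/attach the shared slot, or fall back to lseek. */
        return smgrnblocks_shared(reln, forknum);
    }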
Hi Thomas:
I studied your patch these days and found there might be a problem.
When executing 'drop database', the smgr shared pool will not be removed, because 'smgr_drop_sr' is never called. Function 'dropdb' in dbcommands.c removes the buffers from the buffer pool and unlinks the real files with 'rmtree'; it doesn't call smgrdounlinkall, so the smgr shared cache will not be dropped although the table has been removed. This will cause errors in smgr_alloc_sr -> smgropen, smgrimmedsync. The table file has been removed, so smgropen and smgrimmedsync will get an unexpected result.
Buzhen
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Tue Dec 22 19:57:35 2020
To: Amit Kapila <amit.kapila16@gmail.com>
Cc: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Cache relation sizes?
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, it is good to verify VACUUM stuff but I have another question
> here. What about queries having functions that access the same
> relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
> access the relation t1)? Now, here I think because the relation 't1'
> is already opened, it might use the same value of blocks from the
> shared cache even though the snapshot for relation 't1' when accessed
> from func() might be different. Am, I missing something, or is it
> dealt in some way?
I think it should be covered by the theory about implicit memory
barriers snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept. The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size. There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch. In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan <zhihui.fan1213@gmail.com> wrote: > Let me try to understand your point. Suppose process 1 extends a file to > 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still > return 1 based on the comments in the ReadBuffer_common ("because > of buggy Linux kernels that sometimes return an seek SEEK_END result > that doesn't account for a recent write"). b). When this patch is introduced, > we would get 2 blocks for sure if we can get the size from cache. c). After > user reads the 2 blocks from cache and then the cache is evicted, and user > reads the blocks again, it is possible to get 1 again. > > Do we need to consider case a) in this patch? and Is case c). the situation you > want to avoid and do we have to introduce fsync to archive this? Basically I > think case a) should not happen by practice so we don't need to handle case > c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag. > I'm not opposed to adding them, I am just interested in the background in case > you are also willing to discuss it. Sorry for the lack of background -- there was some discussion on the thread "Optimize dropping of relation buffers using dlist", but it's long and mixed up with other topics. I'll try to summarise here. It's easy to reproduce files shrinking asynchronously on network filesystems that reserve space lazily[1], and we know that there have historically been problems even with local filesystems[2], leading to that comment about buggy kernels. I am only interested in the behaviour I can demonstrate, because I believe Linux is working as intended when it does that (hypothetical/unknown bugs would probably be helped by this approach too, but I'm not interested in speculating about that beyond these parentheses). So, why should we consider this? It's true that commit ffae5cc5a60's change to ReadBuffer_common() already prevents us from eating data by overwriting an existing buffer due to this phenomenon, but there are other problems: 1. A system that in this state can give an incorrect answer to a query: heapam.c calls RelationGetNumberOfBlocks() at the beginning of a scan, so it might not see recently added blocks containing committed data. Hopefully we'll panic when we try to create a checkpoint and see errors, but we could be giving bad results for a while. 2. Commitfest entry #2319 needs an accurate size, or it might leave stray blocks in the buffer pool that will cause failures or corruption later. It's true that we could decide not to worry about this, and perhaps even acknowledge in some official way that PostgreSQL doesn't work reliably on some kinds of filesystems. But in this prototype I thought it was fun to try to fix our very high rate of lseek() syscalls and this reliability problem, at the same time. I also speculate that we'll eventually have other reasons to want a demand-paged per-relation shared memory object. > I would suggest the below change for smgrnblock_fast, FYI > > @@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum) > /* > * The generation doesn't match, the shared relation must have been > * evicted since we got a pointer to it. We'll need to do more work. > + * and invalid the fast path for next time. > */ > + reln->smgr_shared = NULL; > } Thanks, makes sense -- I'll add this to the next version. 
[1] https://www.postgresql.org/message-id/CA%2BhUKGLZTfKuXir9U4JQkY%3Dk3Tb6M_3toGrGOK_fa2d4MPQQ_w%40mail.gmail.com [2] https://www.postgresql.org/message-id/BAY116-F146A38326C5630EA33B36BD12B0%40phx.gbl
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote: > I studied your patch these days and found there might be a problem. > When execute 'drop database', the smgr shared pool will not be removed because of no call 'smgr_drop_sr'. Function 'dropdb'in dbcommands.c remove the buffer from bufferpool and unlink the real files by 'rmtree', It doesn't call smgrdounlinkall,so the smgr shared cache will not be dropped although the table has been removed. This will cause some errorswhen smgr_alloc_str -> smgropen、smgrimmedsync. Table file has been removed, so smgropen and smgrimmedsync will geta unexpected result. Hi Buzhen, Thanks, you're right -- it needs to scan the pool of SRs and forget everything from the database you're dropping.
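A heavily simplified sketch of that idea follows; the real change also has to take the mapping partition locks and wait out any in-progress sync (see the later messages), and the function name and signature here are only assumptions.

    /* Sketch: forget every pool slot belonging to a dropped database. */
    void
    smgrdropdb(Oid dbid)
    {
        for (int i = 0; i < smgr_shared_relations; i++)
        {
            SMgrSharedRelation *sr = &sr_pool->objects[i];
            uint32      flags = smgr_lock_sr(sr);

            if ((flags & SR_VALID) && sr->rnode.node.dbNode == dbid)
            {
                sr->generation++;   /* invalidate any cached fast-path pointers */
                smgr_unlock_sr(sr, flags & ~SR_VALID);
                /* ... and remove the rnode -> index entry from sr_mapping_table */
            }
            else
                smgr_unlock_sr(sr, flags);
        }
    }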
On Thu, Dec 24, 2020 at 6:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Let me try to understand your point. Suppose process 1 extends a file to
> 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
> return 1 based on the comments in the ReadBuffer_common ("because
> of buggy Linux kernels that sometimes return an seek SEEK_END result
> that doesn't account for a recent write"). b). When this patch is introduced,
> we would get 2 blocks for sure if we can get the size from cache. c). After
> user reads the 2 blocks from cache and then the cache is evicted, and user
> reads the blocks again, it is possible to get 1 again.
>
> Do we need to consider case a) in this patch? and Is case c). the situation you
> want to avoid and do we have to introduce fsync to archive this? Basically I
> think case a) should not happen by practice so we don't need to handle case
> c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag.
> I'm not opposed to adding them, I am just interested in the background in case
> you are also willing to discuss it.
Sorry for the lack of background -- there was some discussion on the
thread "Optimize dropping of relation buffers using dlist", but it's
long and mixed up with other topics. I'll try to summarise here.
It's easy to reproduce files shrinking asynchronously on network
filesystems that reserve space lazily[1],
Thanks for the explaintain. After I read that thread, I am more confused now.
In [1], we have the below test result.
$ cat magic_shrinking_file.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
	int fd;
	char buffer[8192] = {0};

	fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	printf("write(...) = %zd\n", write(fd, buffer, sizeof(buffer)));
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	printf("fsync(...) = %d\n", fsync(fd));
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	return EXIT_SUCCESS;
}
$ cc magic_shrinking_file.c
$ ./a.out
lseek(..., SEEK_END) = 9670656
write(...) = 8192
lseek(..., SEEK_END) = 9678848
fsync(...) = -1
lseek(..., SEEK_END) = 9670656
I get two pieces of information from the above: a) before the fsync, lseek(fd, 0, SEEK_END)
returns a correct value, but after the fsync it returns a wrong value; b)
the fsync failed (return value is -1) in this case. I'm more confused
because of point a). Since the fsync can't correct the wrong value, what
is the point of calling fsync in this patch? (I agree that it is good to fix this
reliability problem within this patch, but I'm doubtful that fsync can help in
this case.) Am I missing something obvious?
I read your patch in detail some days ago and feel it is in pretty good shape,
and I think you understand the existing issues very well, e.g. a) smgr_alloc_sr uses a
FIFO design; b) SR_PARTITION_LOCK doesn't really use a partition lock; c)
you seem to have some concerns about the concurrency issues, though I didn't find
anything wrong; d) how to handle a DIRTY sr during eviction. I planned to recheck
item c) before this reply, but time didn't permit. May I know what
your plan is for this patch?
Best Regards
Andy Fan
Hi Thomas
I found some other problems; I want to share my changes with you so you can confirm them.
<1> I changed the code in smgr_alloc_sr to avoid a deadlock.
LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
flags = smgr_lock_sr(sr);
Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
if (flags & SR_JUST_DIRTIED(forknum))
{
/*
* Someone else dirtied it while we were syncing, so we can't mark
* it clean. Let's give up on this SR and go around again.
*/
smgr_unlock_sr(sr, flags);
LWLockRelease(mapping_lock);
goto retry;
}
/* This fork is clean! */
- flags &= ~SR_SYNCING(forknum);
flags &= ~SR_DIRTY(forknum);
}
In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is no longer SR_SYNCING. But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it will retry with another sr; although it has been synced by smgrimmedsync, the SR_SYNCING flag isn't cleared. This might cause a deadlock. So I changed your patch as above.
<2> I changed the code in smgr_drop_sr to avoid some corner cases
/* Mark it invalid and drop the mapping. */
smgr_unlock_sr(sr, ~SR_VALID);
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ sr->nblocks[forknum] = InvalidBlockNumber;
hash_search_with_hash_value(sr_mapping_table,
&reln->smgr_rnode,
hash,
HASH_REMOVE,
NULL);
smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't remove the entry from sr_pool. Since smgrnblocks_fast gets the sr straight from sr_pool, I added the code above to avoid corner cases where smgrnblocks_fast returns an unexpected result. Is it necessary? I'd also like some advice from you.
Thanks a lot.
Buzhen
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Tue Dec 29 22:52:59 2020
To: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Re: Cache relation sizes?
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote:
> I studied your patch these days and found there might be a problem.
> When execute 'drop database', the smgr shared pool will not be removed because of no call 'smgr_drop_sr'. Function 'dropdb' in dbcommands.c remove the buffer from bufferpool and unlink the real files by 'rmtree', It doesn't call smgrdounlinkall, so the smgr shared cache will not be dropped although the table has been removed. This will cause some errors when smgr_alloc_str -> smgropen、smgrimmedsync. Table file has been removed, so smgropen and smgrimmedsync will get a unexpected result.
Hi Buzhen,
Thanks, you're right -- it needs to scan the pool of SRs and forget
everything from the database you're dropping.
Hi, On 2020-11-19 18:01:14 +1300, Thomas Munro wrote: > From ac3c61926bf947a3288724bd02cf8439ff5c14bc Mon Sep 17 00:00:00 2001 > From: Thomas Munro <thomas.munro@gmail.com> > Date: Fri, 13 Nov 2020 14:38:41 +1300 > Subject: [PATCH v2 1/2] WIP: Track relation sizes in shared memory. > > Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC > smgr_shared_relation controls how many can exist at once, and they are > evicted as required. As noted over IM, I think it'd be good if we could avoid the need for a new GUC here and instead derive it from other settings. E.g. max_files_per_process * MaxBackends or maybe max_files_per_process * log(MaxBackends) (on the basis that it's unlikely that each backend uses completely separate files). Given the size of the structs it seems that the memory overhead of this should be acceptable? > XXX smgrimmedsync() is maybe too blunt an instrument? > > XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via > some new interface, but it doesn't actually know if it's cleaning a fork > that extended a relation... > > XXX perhaps bgwriter should try to clean them? > As I suggested on the same call, I think it might make sense to integrate this with the checkpointer sync queue. Have checkpointer try to keep some a certain fraction of the entries clean (or even make them unused?), by processing all the sync requests for that relfilnode. That'd allow to a) only sync segments we actually need to sync b) syncing those segments again at the end of the checkpoint. Another use-case I have is that I'd like to make file-extension better than it currently is. There's two main challenges: - Having bulk extension add pages to the FSM isn't actually a good solution, as that has a high likelihood of multiple backends ending up trying to insert into the same page. If we had, in SMgrSharedRelation, a 'physical' relation size and a 'logical' relation size, we could do better, and hand out pre-allocated pages to exactly one backend. - Doing bulk-extension when there's no other space left leads to large pile-ups of processes needing extension lock -> a large latency increase. Instead we should opportunistically start to do extension when we've used-up most of the bulk allocated space. > +/* > + * An entry in the hash table that allows us to look up objects in the > + * SMgrSharedRelation pool by rnode (+ backend). > + */ > +typedef struct SMgrSharedRelationMapping > +{ > + RelFileNodeBackend rnode; > + int index; > +} SMgrSharedRelationMapping; Maybe add a note that index points into SMgrSharedRelationPool->objects[]? And why you chose to not store SMgrSharedRelation directly in the hashtable (makes sense to me, still worth commenting on imo). > +/* Flags. */ > +#define SR_LOCKED 0x01 > +#define SR_VALID 0x02 > + > +/* Each forknum gets its own dirty, syncing and just dirtied bits. */ > +#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0)) > +#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1)) > +#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2)) Hm - do we really want all that stored in a single flag value? > +/* > + * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for > + * only a few instructions. The return value is the current set of flags, > + * which may be modified and then passed to smgr_unlock_sr() to be atomically > + * when the lock is released. 
> + */ > +static uint32 > +smgr_lock_sr(SMgrSharedRelation *sr) > +{ > + > + for (;;) > + { > + uint32 old_flags = pg_atomic_read_u32(&sr->flags); > + uint32 flags; > + > + if (!(old_flags & SR_LOCKED)) > + { > + flags = old_flags | SR_LOCKED; > + if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags)) > + return flags; > + } > + } > + return 0; /* unreachable */ > +} Since this is a spinlock, you should use the spin delay infrastructure (c.f. LWLockWaitListLock()). Otherwise it'll react terribly if there ever ended up contention. What's the reason not to use a per-entry lwlock instead? Naming: I find it weird to have smgr_ as a prefix and _sr as a postfix. I'd go for smgr_sr_$op. > +/* > + * Unlock a SMgrSharedRelation, atomically updating its flags at the same > + * time. > + */ > +static void > +smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags) > +{ > + pg_write_barrier(); > + pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED); > +} This assumes there never are atomic modification of flags without holding the spinlock (or a cmpxchg loop testing whether somebody is holding the lock).Is that right / good? > +/* > + * Allocate a new invalid SMgrSharedRelation, and return it locked. > + * > + * The replacement algorithm is a simple FIFO design with no second chance for > + * now. > + */ > +static SMgrSharedRelation * > +smgr_alloc_sr(void) > +{ > + SMgrSharedRelationMapping *mapping; > + SMgrSharedRelation *sr; > + uint32 index; > + LWLock *mapping_lock; > + uint32 flags; > + RelFileNodeBackend rnode; > + uint32 hash; > + > + retry: > + /* Lock the next one in clock-hand order. */ > + index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations; > + sr = &sr_pool->objects[index]; > + flags = smgr_lock_sr(sr); Modulo is pretty expensive - I'd force smgr_shared_relations to be a power-of-two and then do a & smgr_shared_relations_mask initialized to (smgr_shared_relations - 1). > + /* If it's unused, can return it, still locked, immediately. */ > + if (!(flags & SR_VALID)) > + return sr; > + > + /* > + * Copy the rnode and unlock. We'll briefly acquire both mapping and SR > + * locks, but we need to do it in that order, so we'll unlock the SR > + * first. > + */ > + rnode = sr->rnode; > + smgr_unlock_sr(sr, flags); > + > + hash = get_hash_value(sr_mapping_table, &rnode); > + mapping_lock = SR_PARTITION_LOCK(hash); > + > + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); > + mapping = hash_search_with_hash_value(sr_mapping_table, > + &rnode, > + hash, > + HASH_FIND, > + NULL); > + if (!mapping || mapping->index != index) > + { > + /* Too slow, it's gone or now points somewhere else. Go around. */ > + LWLockRelease(mapping_lock); > + goto retry; > + } > + > + /* We will lock the SR for just a few instructions. */ > + flags = smgr_lock_sr(sr); > + Assert(flags & SR_VALID); So the replacement here is purely based on when the relation was accessed first, not when it was accessed last. Probably good enough, but seems worth a comment. > 3. The nblocks value must be free enough for scans, extension, > truncatation and dropping buffers, because the those operations all > executed memory barriers when it acquired a snapshot to scan (which > doesn't need to see blocks added after that) or an exclusive heavyweight > lock to extend, truncate or drop. XXX right? "free enough"? I am not sure this holds for non-mvcc scans? 
> @@ -699,7 +748,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln) > bool > smgrexists(SMgrRelation reln, ForkNumber forknum) > { > - if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber) > + if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber) > return true; ISTM that all these places referencing _fast is leaking an implementation detail (but not very far) - seems like that should be an implementation detail for smgrnblocks() which can then call out to _fast/_shared or such. Greetings, Andres Freund
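Two of those review points can be sketched concretely; this assumes smgr_shared_relations is rounded up to a power of two at startup and a smgr_shared_relations_mask variable is initialized to smgr_shared_relations - 1 (neither exists in the posted patch).

    /* Acquire the per-SR flag lock using the spin delay infrastructure. */
    static uint32
    smgr_lock_sr(SMgrSharedRelation *sr)
    {
        SpinDelayStatus delay;
        uint32      flags = 0;

        init_local_spin_delay(&delay);
        for (;;)
        {
            uint32      old_flags = pg_atomic_read_u32(&sr->flags);

            if (!(old_flags & SR_LOCKED))
            {
                flags = old_flags | SR_LOCKED;
                if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
                    break;
            }
            perform_spin_delay(&delay);
        }
        finish_spin_delay(&delay);
        return flags;
    }

    /* ... and inside smgr_alloc_sr: victim selection with a mask instead of '%'. */
    index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) & smgr_shared_relations_mask;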
On Mon, Dec 28, 2020 at 5:24 PM Andy Fan <zhihui.fan1213@gmail.com> wrote: > lseek(..., SEEK_END) = 9670656 > write(...) = 8192 > lseek(..., SEEK_END) = 9678848 > fsync(...) = -1 > lseek(..., SEEK_END) = 9670656 > > I got 2 information from above. a) before the fsync, the lseek(fd, 0, SEEK_END) > returns a correct value, however after the fsync, it returns a wrong value. b). > the fsync failed(return values is -1) in the above case. I'm more confused > because of point a). Since the fsync can't correct some wrong value, what > is the point to call fsync in this patch (I agree that it is good to fix this > reliability problem within this patch, but I'm doubtful if fsync can help in > this case). Am I missing something obviously? The point of the fsync() call isn't to correct the value, it's to find out if it is safe to drop the SR object from shared memory because the operating system has promised that it won't forget about the new size. If that fails, we'll report an ERROR or PANIC (depending on data_sync_retry), but not drop the SR. By the way, you don't need fsync(fd) for the size to change, as shown by the other experiment in that other thread that just did sleep(60). It might also happen asynchronously. fsync(fd) forces it, but also tells you about errors. > I read your patch with details some days ago and feel it is in pretty good shape. > and I think you are clear about the existing issues very well. like a). smgr_alloc_sr use a > FIFO design. b). SR_PARTITION_LOCK doesn't use a partition lock really. c). and Yeah, it probably should have its own bank of locks (I wish they were easier to add, via lwlocknames.txt or similar). > looks like you have some concern about the concurrency issue, which I didn't find > anything wrong. d). how to handle the DIRTY sr during evict. I planned to recheck So yeah, mostly this discussion has been about not trusting lseek() and using our own tracking of relation size instead. But there is a separate question about how "fresh" the value needs to be. The question is: under what circumstances could the unlocked read of nblocks from shared memory give you a value that isn't fresh enough for your snapshot/scan? This involves thinking about implicit memory barriers. The idea of RECENTLY_DIRTIED is similar to what we do for buffers: while you're trying to "clean" it (in this case by calling fsync()) someone else can extend the relation again, and in that case we don't know whether the new extension was included in the fsync() or not, so we can't clear the DIRTY flag when it completes. > item c) before this reply, but looks like the time is not permitted. May I know what > Is your plan about this patch? Aside from details, bugs etc discussed in this thread, one major piece remains incomplete: something in the background to "clean" SRs so that foreground processes rarely have to block.
On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote: > I found some other problems which I want to share my change with you to make you confirm. > <1> I changed the code in smgr_alloc_sr to avoid dead lock. > > LWLockAcquire(mapping_lock, LW_EXCLUSIVE); > flags = smgr_lock_sr(sr); > Assert(flags & SR_SYNCING(forknum)); > + flags &= ~SR_SYNCING(forknum); > if (flags & SR_JUST_DIRTIED(forknum)) > { > /* > * Someone else dirtied it while we were syncing, so we can't mark > * it clean. Let's give up on this SR and go around again. > */ > smgr_unlock_sr(sr, flags); > LWLockRelease(mapping_lock); > goto retry; > } > /* This fork is clean! */ > - flags &= ~SR_SYNCING(forknum); > flags &= ~SR_DIRTY(forknum); > } > > In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is not SR_SYNCING. But in smgr_alloc_sr, if the sris SR_JUST_DIRTIED, it will retry to get another sr, although it has been synced by smgrimmedsync, the flag SR_SYNCING doesn't changed. This might cause dead lock. So I changed your patch as above. Right. Thanks! I also added smgrdropdb() to handle DROP DATABASE (the problem you reported in your previous email). While tweaking that code, I fixed it so that it uses a condition variable to wait (instead of the silly sleep loop) when it needs to drop an SR that is being sync'd. Also, it now releases the mapping lock while it's doing that, and requires it on retry. > <2> I changed the code in smgr_drop_sr to avoid some corner cases > /* Mark it invalid and drop the mapping. */ > smgr_unlock_sr(sr, ~SR_VALID); > + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) > + sr->nblocks[forknum] = InvalidBlockNumber; > hash_search_with_hash_value(sr_mapping_table, > &reln->smgr_rnode, > hash, > HASH_REMOVE, > NULL); > > smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't remove the sr_pool. But in smgrnblocks_fast,it just get sr from sr_pool, so I add some codes as above to avoid some corner cases get an unexpected resultfrom smgrnblocks_fast. Is it necessary, I also want some advice from you. Hmm. I think it might be better to increment sr->generation. That was already done in the "eviction" code path, where other processes might still have references to the SR object, and I don't think it's possible for anyone to access a dropped relation, but I suppose it's more consistent to do that anyway. Fixed. Thanks for the review!
Attachment
On Wed, Dec 30, 2020 at 5:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > and requires it on retry s/requires/reacquires/
Hi, Thomas
Are there other scenarios that need the same treatment as dropdb for the smgr cache?
1. Drop tablespace on the master: should a smgrdroptbs function be added to DropTableSpace?
2. Drop database on a standby: smgrdropdb in dbase_redo.
3. Drop tablespace on a standby: smgrdroptbs in tblspc_redo.
Buzhen
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Fri Jan 8 00:56:17 2021
To: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Cache relation sizes?
On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote:
> I found some other problems which I want to share my change with you to make you confirm.
> <1> I changed the code in smgr_alloc_sr to avoid dead lock.
>
> LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
> flags = smgr_lock_sr(sr);
> Assert(flags & SR_SYNCING(forknum));
> + flags &= ~SR_SYNCING(forknum);
> if (flags & SR_JUST_DIRTIED(forknum))
> {
> /*
> * Someone else dirtied it while we were syncing, so we can't mark
> * it clean. Let's give up on this SR and go around again.
> */
> smgr_unlock_sr(sr, flags);
> LWLockRelease(mapping_lock);
> goto retry;
> }
> /* This fork is clean! */
> - flags &= ~SR_SYNCING(forknum);
> flags &= ~SR_DIRTY(forknum);
> }
>
> In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is not SR_SYNCING. But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it will retry to get another sr, although it has been synced by smgrimmedsync, the flag SR_SYNCING doesn't changed. This might cause dead lock. So I changed your patch as above.
Right. Thanks!
I also added smgrdropdb() to handle DROP DATABASE (the problem you
reported in your previous email).
While tweaking that code, I fixed it so that it uses a condition
variable to wait (instead of the silly sleep loop) when it needs to
drop an SR that is being sync'd. Also, it now releases the mapping
lock while it's doing that, and requires it on retry.
> <2> I changed the code in smgr_drop_sr to avoid some corner cases
> /* Mark it invalid and drop the mapping. */
> smgr_unlock_sr(sr, ~SR_VALID);
> + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
> + sr->nblocks[forknum] = InvalidBlockNumber;
> hash_search_with_hash_value(sr_mapping_table,
> &reln->smgr_rnode,
> hash,
> HASH_REMOVE,
> NULL);
>
> smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't remove the sr_pool. But in smgrnblocks_fast, it just get sr from sr_pool, so I add some codes as above to avoid some corner cases get an unexpected result from smgrnblocks_fast. Is it necessary, I also want some advice from you.
Hmm. I think it might be better to increment sr->generation. That
was already done in the "eviction" code path, where other processes
might still have references to the SR object, and I don't think it's
possible for anyone to access a dropped relation, but I suppose it's
more consistent to do that anyway. Fixed.
Thanks for the review!
Hi Thomas
I want to share a patch with you: I changed the replacement algorithm from FIFO to a simple LRU.
Buzhen
Attachment
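For anyone skimming, the difference can be illustrated roughly like this (an illustration only, not Buzhen's patch; last_used is an assumed per-slot atomic stamp, and a real implementation would likely approximate LRU rather than scan the whole pool on every allocation).

    /* FIFO/clock hand as in the prototype: evict slots strictly in order. */
    static uint32
    smgr_pick_victim_fifo(void)
    {
        return pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
    }

    /* Simple LRU: prefer the slot with the oldest usage stamp. */
    static uint32
    smgr_pick_victim_lru(void)
    {
        uint32      best = 0;
        uint64      best_stamp = PG_UINT64_MAX;

        for (uint32 i = 0; i < (uint32) smgr_shared_relations; i++)
        {
            uint64      stamp = pg_atomic_read_u64(&sr_pool->objects[i].last_used);

            if (stamp < best_stamp)
            {
                best_stamp = stamp;
                best = i;
            }
        }
        return best;
    }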
No change yet, just posting a rebase to keep cfbot happy. One thing I'm wondering about is whether it'd be possible, and if so, a good idea, to make a kind of tiny reusable cache replacement algorithm, something modern, that can be used to kill several birds with one stone (SLRUs, this object pool, ...). Or if the interlocking requirements make it too integrated.
Attachment
On Wed, Jun 16, 2021 at 9:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
No change yet, just posting a rebase to keep cfbot happy.
I think that the proposed feature is pretty cool not only because it fixes some old issues with lseek() performance and reliability, but also because it opens the door to some new features, such as disk size quota management [1]. Cheap up-to-date size tracking is one of the major problems for it.
I've only read the 1st patch so far. Here are some thoughts about it:
- SMgrSharedRelation - what about calling it SMgrShmemRelation. It looks weird too, but "...SharedRelation" makes me think of shared catalog tables.
- We can exclude temp relations from consideration as they are never shared and use RelFileNode instead of RelFileNodeBackend in SMgrSharedRelationMapping.
Not only will it save us some space, but it will also help to avoid situations where temp relations occupy the whole cache (which may easily happen in some workloads).
I assume you were trying to make a generic solution, but in practice, temp rels differ from regular ones and may require different optimizations. Local caching will be enough for them if ever needed, for example, we can leave smgr_cached_nblocks[] for temp relations.
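A minimal sketch of that point, assuming temp relations keep using the existing backend-local smgr_cached_nblocks path and simply never enter the shared pool:

    /* Keyed by RelFileNode only: temp relations are excluded from the pool. */
    typedef struct SMgrSharedRelationMapping
    {
        RelFileNode rnode;      /* no backend ID needed for regular relations */
        int         index;      /* slot in SMgrSharedRelationPool->objects[] */
    } SMgrSharedRelationMapping;

    /* ... and at the shared-pool entry points: */
    if (RelFileNodeBackendIsTemp(reln->smgr_rnode))
        return;                 /* fall back to the backend-local cache */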
> int smgr_shared_relations = 1000;
- How bad would it be to keep the cache for all relations? Let's test the memory overhead on some realistic datasets. I suppose this 1000 default was chosen just for the WIP patch.
I wonder if we can use DSM instead of regular shared memory?
- I'd expect that CREATE DATABASE and ALTER DATABASE SET TABLESPACE require special treatment, because they bypass smgr, just like dropdb. Don't see it in the patch.
[1] https://github.com/greenplum-db/diskquota
--
Best regards,
Lubennikova Anastasia