Hello All,
I was looking through the source and thought it would be worth seeking opinions on
this proposal.
From what I have understood so far, the core shared memory handling is done in
pgsql/src/backend/port/sysv_shmem.c, which configure links in according to the
runtime environment.
So I would need to write another source file which exports the same API as the one
above (i.e. all the non-static functions in that file) but implemented with mmap,
and that would be enough to use anonymous mmap instead of SysV shared memory.
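Roughly, I imagine the replacement boiling down to something like this (just a
sketch: the real PGSharedMemoryCreate takes more arguments and sets up a header
in the region, and MAP_ANONYMOUS is spelled MAP_ANON on some platforms):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Sketch only: the real function in sysv_shmem.c has a
     * different signature and initialises a PGShmemHeader at
     * the start of the region. */
    void *
    PGSharedMemoryCreate(size_t size)
    {
        void *region;

        /* An anonymous, shared mapping is inherited across
         * fork(), which is how all the backends would see it. */
        region = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED)
            return NULL;        /* caller reports the error */
        return region;
    }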
It might seem unnecessary to provide mmap-based shared memory, but this is just
the first step I was thinking of.
All the shared memory allocations are done in pgsql/src/backend/storage/ipc/shmem.c.
I was thinking of gathering the global variables in that file into a structure.
The global variables would still be in place so that existing code would not
break, but the structure would hold database-specific buffering information.
Let's call that structure the database context.
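Something along these lines (the field names are invented; they would stand in
for whatever globals shmem.c actually keeps):

    typedef unsigned int Oid;    /* as in postgres_ext.h */

    /* Hypothetical layout, for illustration only. */
    typedef struct DatabaseContext
    {
        Oid     dbOid;           /* database this context serves */
        void   *shmemBase;       /* start of the mmap'd region */
        void   *shmemEnd;        /* end of the region */
        void   *shmemFreeStart;  /* current allocation pointer */
        /* ... per-database buffer pool bookkeeping ... */
    } DatabaseContext;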
That way we can assign a different mmapped (anonymous, of course) region per
database. In the backend, we could just switch database contexts, i.e. assign the
global variables from the database context, and let the backend write to the
appropriate shared memory region. Every database would need at least two shared
memory regions: one for operating on its own buffers, and a system one through
which it could write to shared catalogs etc. At startup it could close the shared
memory regions belonging to other databases.
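The switch itself would be cheap; with invented names again, something like:

    /* The existing globals stay put; switching contexts just
     * repoints them, so code that reads them is unaffected. */
    static DatabaseContext *CurrentDbContext;
    void *ShmemBase;             /* stand-ins for shmem.c globals */
    void *ShmemEnd;

    void
    SwitchDatabaseContext(DatabaseContext *ctx)
    {
        CurrentDbContext = ctx;
        ShmemBase = ctx->shmemBase;
        ShmemEnd  = ctx->shmemEnd;
    }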
Of course, buffer management alone would not cover database contexts entirely.
WAL needs to be lumped in as well (not necessarily, though: if all WAL buffering
goes through the system shared region, everything will still work). I don't know
whether clog and data file handling are affected by this. If WAL goes into the
database context, we could probably provide per-database WAL, which would go well
with tablespaces too. With WAL per database, an operation done on a shared catalog
from a backend would need to flush both the system WAL and the database WAL to
make the transaction commit durable; otherwise flushing only the database WAL
would do.
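The commit-time rule would look something like this (every name here is invented
for illustration):

    #include <stdbool.h>

    typedef struct WalStream WalStream;  /* opaque per-stream state */

    extern WalStream *SystemWAL;     /* WAL covering shared catalogs */
    extern WalStream *DatabaseWAL;   /* WAL for the current database */
    extern void WalFlush(WalStream *wal);

    /* A transaction that touched a shared catalog must make
     * both streams durable before the commit is acknowledged;
     * a purely local one flushes only its database WAL. */
    static void
    FlushWALForCommit(bool touched_shared_catalog)
    {
        if (touched_shared_catalog)
            WalFlush(SystemWAL);
        WalFlush(DatabaseWAL);
    }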
This way we could provide a background writer process per database and a separate
buffer pool per database, minimising the impact of cross-database load
significantly: e.g. a vacuum full on one database would not hog another database
through buffer cache pollution. (I/O can still saturate, though.) That would let
us push the hardware to its limit, which might not be possible right now in some
cases.
I was originally looking for the reason a large number of buffers degrades
performance, and the source code browsing spiralled into this thought. So far I
haven't figured out any reason why a large number of buffers can degrade
performance; still looking for it.
Comments?
Shridhar