Thread: [Patch] Make block and file size for WAL and relations defined at cluster creation

[Patch] Make block and file size for WAL and relations defined at cluster creation

From
Remi Colinet
Date:
Hello,

So far, the relation's block and file sizes have been defined statically at server build time.
This is also the case for the WAL block size.

This means that a single PostgreSQL binary cannot be shared across server instances (clusters) that use different block and file sizes for the WAL and the relations.


Recently, the WAL file size was changed from a build-time setting to a cluster-creation-time setting. The current patch goes further in this direction and does the same for the relation block and file sizes and for the WAL block size. More could be done, with LOBLKSIZE for instance.


The patch below makes block and file sizes defined at cluster creation for both the WAL and the relations. This avoids needing a different server build for each possible combination of block and file sizes.


With the patch, the block and file sizes are still kept in the control file (as has been the case so far), and they are provided to initdb when creating the cluster. If no value is specified, the defaults are used.

Values which can be defined at cluster creation time are:

- the WAL block size
- the WAL file size
- the relation block size
- the relation file size
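
As an illustration of what accepting these four values at cluster creation implies, here is a minimal C sketch of a sanity check on them. The function names are hypothetical and are not taken from the patch. The bounds are the ones the stock build options and initdb's WAL segment size option accept as far as I know (relation block 1 kB to 32 kB, WAL block 1 kB to 64 kB, WAL file 1 MB to 1 GB, all powers of two); the patch may well use different ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v is a non-zero power of two. */
static bool
is_power_of_two(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical check an initdb-like tool could run before writing the
 * values to the control file.  All sizes are in bytes.  The bounds are
 * assumptions, not values read from the patch.
 */
static bool
sizes_are_valid(uint32_t rel_blck_size, uint32_t wal_blck_size,
                uint64_t wal_file_size, uint64_t rel_file_size)
{
    if (!is_power_of_two(rel_blck_size) ||
        rel_blck_size < 1024 || rel_blck_size > 32768)
        return false;

    if (!is_power_of_two(wal_blck_size) ||
        wal_blck_size < 1024 || wal_blck_size > 65536)
        return false;

    /* WAL file: power of two between 1 MB and 1 GB. */
    if (!is_power_of_two(wal_file_size) ||
        wal_file_size < (UINT64_C(1) << 20) ||
        wal_file_size > (UINT64_C(1) << 30))
        return false;

    /* A relation file must hold a whole, non-zero number of blocks. */
    if (rel_file_size == 0 || rel_file_size % rel_blck_size != 0)
        return false;

    return true;
}
```

With the usual defaults (8 kB blocks, 16 MB WAL files, 1 GB relation files) the check passes; a 3000-byte block size or a 3 MB WAL file is rejected.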

I noticed that the names of these parameters vary slightly throughout the source code, in the name itself, the unit, and the letter case used.

Such names are:

BLCKSZ: the relation block size in bytes
RELSEG_SIZE: maximum number of blocks allowed in one disk file
XLOG_BLCKSZ: the WAL block size in bytes
XLOG_SEG_SIZE: the WAL file size in bytes
blcksz (in control file): the relation block size in bytes (same as BLCKSZ)
relseg_size (in control file): the relation file size in blocks (same as RELSEG_SIZE)
xlog_blcksz (in control file): WAL block size in bytes (same as XLOG_BLCKSZ)
xlog_seg_size (in control file): the WAL file size in bytes (same as XLOG_SEG_SIZE)
WalSegSz (in pg_resetwal.c): the WAL segment size in bytes
wal_segment_size (in xlog.c): the WAL segment size in bytes
segment_size (in guc.c): the relation segment size
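
To make the units concrete: since RELSEG_SIZE is counted in blocks while BLCKSZ is in bytes, locating a block on disk takes one division and one multiplication. The following C sketch shows that arithmetic with both values passed as run-time parameters. The struct and function names are made up for the example; this is not code from md.c or from the patch.

```c
#include <stdint.h>

/* Where a relation block lives on disk (hypothetical helper type). */
typedef struct SegmentLocation
{
    uint32_t    segno;      /* segment file index: 0, 1, 2, ... */
    uint64_t    offset;     /* byte offset of the block in that file */
} SegmentLocation;

/*
 * blcksz is the block size in bytes, relseg_size the number of blocks
 * per segment file (the same units as BLCKSZ and RELSEG_SIZE).
 */
static SegmentLocation
locate_block(uint32_t blocknum, uint32_t blcksz, uint32_t relseg_size)
{
    SegmentLocation loc;

    loc.segno = blocknum / relseg_size;
    /* Widen before multiplying so large segments do not overflow. */
    loc.offset = (uint64_t) (blocknum % relseg_size) * blcksz;
    return loc;
}
```

With the default 8192-byte blocks and 131072 blocks per file (1 GB files), block 131073 is the second block of segment 1, at byte offset 8192.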

For the current patch, I defined common names to be used throughout the source code, both in the server and in the various utilities, with units in both blocks and bytes.

These are:

- wal_blck_size: replaces XLOG_BLCKSZ
- wal_file_blck: the number of blocks in one WAL file
- wal_file_size: wal_blck_size * wal_file_blck; replaces XLOG_SEG_SIZE and wal_segment_size

- rel_blck_size: replaces BLCKSZ
- rel_file_blck: replaces RELSEG_SIZE and segment_size
- rel_file_size: rel_blck_size * rel_file_blck

Lower-case letters are used as a reminder that these values are not statically defined at compile time.
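
The relationship between the six names can be sketched in C as follows. Only the field names come from the list above; the struct and the constructor function are hypothetical and are not how the patch stores the values.

```c
#include <stdint.h>

/* Run-time sizes, using the names proposed above (struct is hypothetical). */
typedef struct ClusterSizes
{
    uint32_t    wal_blck_size;  /* bytes per WAL block */
    uint32_t    wal_file_blck;  /* blocks per WAL file */
    uint64_t    wal_file_size;  /* derived: bytes per WAL file */
    uint32_t    rel_blck_size;  /* bytes per relation block */
    uint32_t    rel_file_blck;  /* blocks per relation file */
    uint64_t    rel_file_size;  /* derived: bytes per relation file */
} ClusterSizes;

static ClusterSizes
make_cluster_sizes(uint32_t wal_blck_size, uint32_t wal_file_blck,
                   uint32_t rel_blck_size, uint32_t rel_file_blck)
{
    ClusterSizes s;

    s.wal_blck_size = wal_blck_size;
    s.wal_file_blck = wal_file_blck;
    /* Multiply in 64 bits: the products can exceed 32 bits. */
    s.wal_file_size = (uint64_t) wal_blck_size * wal_file_blck;
    s.rel_blck_size = rel_blck_size;
    s.rel_file_blck = rel_file_blck;
    s.rel_file_size = (uint64_t) rel_blck_size * rel_file_blck;
    return s;
}
```

For example, 8192-byte WAL blocks with 2048 blocks per file give 16 MB WAL files, and 8192-byte relation blocks with 131072 blocks per file give 1 GB relation files.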

So far, this is a POC to show that the change is not very big and is worth the extra code needed. The patch consists mostly of small changes, except for a few files that require more work with palloc/pfree.

The patch is rather simple, even though it modifies many different files.

I've tested the patch with different combinations of block and file sizes for the WAL and the relations.

Feel free to comment.

Regards
Remi


Patch diffstat:

[root@rco v1]# diffstat blkfilesizes_v1.patch
 TODO                                            |   56 ++
 configure.in                                    |   94 ---
 contrib/amcheck/verify_nbtree.c                 |    4
 contrib/bloom/blinsert.c                        |   14
 contrib/bloom/bloom.h                           |   26 -
 contrib/bloom/blutils.c                         |    6
 contrib/bloom/blvacuum.c                        |    6
 contrib/file_fdw/file_fdw.c                     |    6
 contrib/pageinspect/brinfuncs.c                 |    8
 contrib/pageinspect/btreefuncs.c                |    6
 contrib/pageinspect/rawpage.c                   |   12
 contrib/pg_prewarm/pg_prewarm.c                 |    4
 contrib/pg_standby/pg_standby.c                 |    7
 contrib/pgstattuple/pgstatapprox.c              |    6
 contrib/pgstattuple/pgstatindex.c               |    4
 contrib/pgstattuple/pgstattuple.c               |   10
 contrib/postgres_fdw/deparse.c                  |    2
 contrib/postgres_fdw/postgres_fdw.c             |    2
 param.sh                                        |    1
 src/backend/access/brin/brin_pageops.c          |    4
 src/backend/access/common/bufmask.c             |    4
 src/backend/access/common/reloptions.c          |    8
 src/backend/access/gin/ginbtree.c               |   12
 src/backend/access/gin/gindatapage.c            |   18
 src/backend/access/gin/ginentrypage.c           |    2
 src/backend/access/gin/ginfast.c                |    6
 src/backend/access/gin/ginget.c                 |    6
 src/backend/access/gin/ginvacuum.c              |    2
 src/backend/access/gin/ginxlog.c                |    4
 src/backend/access/gist/gistbuild.c             |    8
 src/backend/access/gist/gistbuildbuffers.c      |   10
 src/backend/access/gist/gistscan.c              |    1
 src/backend/access/hash/hash.c                  |    7
 src/backend/access/hash/hashpage.c              |    4
 src/backend/access/heap/README.HOT              |    2
 src/backend/access/heap/heapam.c                |   17
 src/backend/access/heap/pruneheap.c             |   39 +
 src/backend/access/heap/rewriteheap.c           |    4
 src/backend/access/heap/syncscan.c              |    2
 src/backend/access/heap/visibilitymap.c         |    8
 src/backend/access/nbtree/nbtpage.c             |    2
 src/backend/access/nbtree/nbtree.c              |   18
 src/backend/access/nbtree/nbtsearch.c           |    5
 src/backend/access/nbtree/nbtsort.c             |   10
 src/backend/access/spgist/spgdoinsert.c         |    4
 src/backend/access/spgist/spginsert.c           |    2
 src/backend/access/spgist/spgscan.c             |    1
 src/backend/access/spgist/spgtextproc.c         |   10
 src/backend/access/spgist/spgutils.c            |    4
 src/backend/access/transam/README               |    2
 src/backend/access/transam/clog.c               |   10
 src/backend/access/transam/commit_ts.c          |    4
 src/backend/access/transam/generic_xlog.c       |   44 +
 src/backend/access/transam/multixact.c          |   12
 src/backend/access/transam/slru.c               |   22
 src/backend/access/transam/subtrans.c           |    5
 src/backend/access/transam/timeline.c           |    2
 src/backend/access/transam/twophase.c           |    2
 src/backend/access/transam/xlog.c               |  603 ++++++++++++++----------
 src/backend/access/transam/xlogarchive.c        |   12
 src/backend/access/transam/xlogfuncs.c          |   10
 src/backend/access/transam/xloginsert.c         |   48 +
 src/backend/access/transam/xlogreader.c         |  141 +++--
 src/backend/access/transam/xlogutils.c          |   34 -
 src/backend/bootstrap/bootstrap.c               |   33 -
 src/backend/commands/async.c                    |   15
 src/backend/commands/tablecmds.c                |    2
 src/backend/commands/vacuumlazy.c               |    4
 src/backend/executor/execGrouping.c             |    1
 src/backend/nodes/tidbitmap.c                   |  152 +++++-
 src/backend/optimizer/path/costsize.c           |   10
 src/backend/optimizer/util/plancat.c            |    2
 src/backend/postmaster/checkpointer.c           |    4
 src/backend/replication/basebackup.c            |   30 -
 src/backend/replication/logical/logical.c       |    2
 src/backend/replication/logical/reorderbuffer.c |   18
 src/backend/replication/slot.c                  |    2
 src/backend/replication/walreceiver.c           |   14
 src/backend/replication/walreceiverfuncs.c      |    4
 src/backend/replication/walsender.c             |   30 -
 src/backend/storage/buffer/buf_init.c           |    4
 src/backend/storage/buffer/bufmgr.c             |    8
 src/backend/storage/buffer/freelist.c           |    6
 src/backend/storage/buffer/localbuf.c           |    6
 src/backend/storage/file/buffile.c              |   20
 src/backend/storage/file/copydir.c              |    2
 src/backend/storage/freespace/README            |    8
 src/backend/storage/freespace/freespace.c       |   36 -
 src/backend/storage/freespace/indexfsm.c        |    7
 src/backend/storage/lmgr/predicate.c            |    2
 src/backend/storage/page/bufpage.c              |   27 -
 src/backend/storage/smgr/md.c                   |  104 ++--
 src/backend/tcop/postgres.c                     |    2
 src/backend/utils/adt/selfuncs.c                |    2
 src/backend/utils/init/globals.c                |   20
 src/backend/utils/init/miscinit.c               |    6
 src/backend/utils/init/postinit.c               |   23
 src/backend/utils/misc/guc.c                    |  175 ++++--
 src/backend/utils/misc/pg_controldata.c         |    4
 src/backend/utils/sort/logtape.c                |   49 -
 src/backend/utils/sort/tuplesort.c              |    6
 src/bin/initdb/initdb.c                         |  304 +++++++++---
 src/bin/pg_basebackup/pg_basebackup.c           |   18
 src/bin/pg_basebackup/pg_receivewal.c           |   26 -
 src/bin/pg_basebackup/pg_recvlogical.c          |   11
 src/bin/pg_basebackup/receivelog.c              |   28 -
 src/bin/pg_basebackup/streamutil.c              |   76 +--
 src/bin/pg_basebackup/streamutil.h              |    6
 src/bin/pg_basebackup/walmethods.c              |   14
 src/bin/pg_controldata/pg_controldata.c         |   16
 src/bin/pg_resetwal/pg_resetwal.c               |  125 +++-
 src/bin/pg_rewind/copy_fetch.c                  |    9
 src/bin/pg_rewind/filemap.c                     |   11
 src/bin/pg_rewind/libpq_fetch.c                 |    7
 src/bin/pg_rewind/parsexlog.c                   |   26 -
 src/bin/pg_rewind/pg_rewind.c                   |   33 -
 src/bin/pg_test_fsync/pg_test_fsync.c           |   71 +-
 src/bin/pg_upgrade/controldata.c                |    7
 src/bin/pg_upgrade/file.c                       |   15
 src/bin/pg_upgrade/pg_upgrade.c                 |    3
 src/bin/pg_waldump/pg_waldump.c                 |   69 +-
 src/common/controldata_utils.c                  |   98 +++
 src/include/access/brin_page.h                  |    2
 src/include/access/ginblock.h                   |    6
 src/include/access/gist_private.h               |   20
 src/include/access/hash.h                       |    5
 src/include/access/htup_details.h               |   11
 src/include/access/itup.h                       |    2
 src/include/access/nbtree.h                     |   10
 src/include/access/relscan.h                    |    7
 src/include/access/slru.h                       |    2
 src/include/access/spgist_private.h             |   22
 src/include/access/tuptoaster.h                 |    2
 src/include/access/xlog_internal.h              |    8
 src/include/access/xlogreader.h                 |    9
 src/include/access/xlogrecord.h                 |    6
 src/include/common/controldata_utils.h          |    4
 src/include/lib/simplehash.h                    |   12
 src/include/nodes/execnodes.h                   |    1
 src/include/nodes/nodes.h                       |    1
 src/include/pg_config.h.in                      |   31 -
 src/include/pg_config_manual.h                  |    8
 src/include/pg_control_def.h                    |   44 +
 src/include/storage/bufmgr.h                    |    4
 src/include/storage/bufpage.h                   |    5
 src/include/storage/checksum_impl.h             |    2
 src/include/storage/fsm_internals.h             |    5
 src/include/storage/large_object.h              |    4
 src/include/storage/md.h                        |   12
 src/include/storage/off.h                       |    2
 src/include/utils/rel.h                         |    4
 src/interfaces/libpq/libpq-int.h                |    5
 152 files changed, 2234 insertions(+), 1284 deletions(-)
[root@rco v1]# 



On Sun, Dec 31, 2017 at 12:00 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
> The patch below makes block and file sizes defined at cluster creation for
> both the WAL and the relations. This avoids needing a different server build
> for each possible combination of block and file sizes.

The email thread where we discussed making the WAL segment size
configurable at initdb time contained a detailed rationale, explaining
why it was useful to be able to make such a change.  The very short
version is that, if a system is generating WAL at a very high rate,
being able to group that WAL into fewer, larger files makes life
easier since, for example, the latency requirements for
archive_command are not as tight, and "ls pg_wal" doesn't have to go
into the tank just trying to read the directory contents.
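
To put rough numbers on the directory-size point, here is a small C sketch that counts how many segment files a given volume of WAL occupies. The function name is made up for the illustration, and the 100 GB figure in the note below is an arbitrary assumption, not a number from that thread.

```c
#include <stdint.h>

/* Number of segment files needed to hold wal_bytes of WAL, rounded up. */
static uint64_t
wal_segments_needed(uint64_t wal_bytes, uint64_t wal_segment_size)
{
    return (wal_bytes + wal_segment_size - 1) / wal_segment_size;
}
```

Under these assumptions, 100 GB of WAL is 6400 files with 16 MB segments but only 100 files with 1 GB segments, which is also 64 times fewer archive_command invocations.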

Your email doesn't seem to contain a rationale explaining why the
block and file sizes should be run-time configurable.  There may be a
very good reason, but can you explain what it is?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert,

Justifications are:

- We may want to test different combinations of file and block sizes, for the relations and the WAL, in order to get the best server performance.
Avoiding a compilation for each combination of values seems to make sense.

- The same binary can be used on the same host with several database instances/clusters, each using different values for block and file sizes.
This is what I did to test the patch. I created about 20 different combinations of values for the file and block sizes of the relation and WAL files.

- Linux distributions deliver PostgreSQL as a binary already compiled with the default values.
This means DBAs need to rebuild the binary for each combination of block and file sizes, whether for the WAL or the relations.

- Selecting the correct values for file and block sizes is a DBA task, not a developer task.
For instance, when someone wants to create a Linux filesystem with a given block size, they are not forced to accept a value chosen by the developer of the filesystem driver when the latter was compiled.

- The file and block sizes should depend mostly on the physical server and physical storage.
Not on the database software itself.

Regarding the cost of using run-time configurable values for the file and block sizes of the WAL and relations, this cost is low both:

- from a developer point of view: the source code changes are spread over many files, but only a few of them have significant changes.
tidbitmap.c is the file mainly affected by the change. Other changes are minor.

- from a run-time point of view: the overhead is only at the start of the database instance.
Moreover, even that overhead at server start is very low, with only a few more dynamic memory allocations.

Regards
Remi

 


2018-01-02 22:26 GMT+01:00 Robert Haas <robertmhaas@gmail.com>:
> On Sun, Dec 31, 2017 at 12:00 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
> > The patch below makes block and file sizes defined at cluster creation for
> > both the WAL and the relations. This avoids needing a different server build
> > for each possible combination of block and file sizes.
>
> The email thread where we discussed making the WAL segment size
> configurable at initdb time contained a detailed rationale, explaining
> why it was useful to be able to make such a change.  The very short
> version is that, if a system is generating WAL at a very high rate,
> being able to group that WAL into fewer, larger files makes life
> easier since, for example, the latency requirements for
> archive_command are not as tight, and "ls pg_wal" doesn't have to go
> into the tank just trying to read the directory contents.
>
> Your email doesn't seem to contain a rationale explaining why the
> block and file sizes should be run-time configurable.  There may be a
> very good reason, but can you explain what it is?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi,

On 2018-01-03 21:43:51 +0100, Remi Colinet wrote:
> - We may want to test different combinations of file and block sizes, for
> the relations and the WAL, in order to get the best server performance.
> Avoiding a compilation for each combination of values seems to make sense.

That's something you need to prove to be beneficial *before* we make this
change.


> - Selecting the correct values for file and block sizes is a DBA task, not a
> developer task.
> For instance, when someone wants to create a Linux filesystem with a given
> block size, they are not forced to accept a value chosen by the developer of
> the filesystem driver when the latter was compiled.

I'm unconvinced there's as much value in syncing up the filesystem and
PostgreSQL block sizes as some conventional wisdom says.


> - The file and block sizes should depend mostly on the physical server and
> physical storage.
> Not on the database software itself.

Citation needed.


> Regarding the cost of using run-time configurable values for the file and
> block sizes of the WAL and relations, this cost is low both:
>
> - from a developer point of view: the source code changes are spread over
> many files, but only a few of them have significant changes.
> tidbitmap.c is the file mainly affected by the change. Other changes are
> minor.
>
> - from a run-time point of view: the overhead is only at the start of the
> database instance.
> Moreover, even that overhead at server start is very low, with only a few
> more dynamic memory allocations.

That's to some degree because you rely on stack allocation of
variable-sized amounts of data - we can't rely on that. E.g. you allocate
stack variables sized by rel_block_size, that's unfortunately not
ok. Additionally some of the size calculations will have some
performance impact.
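
One way to picture the size-calculation point is the C sketch below. It is only an illustration with made-up names: the message above does not say which calculations are meant, and the shift-based variant is one possible mitigation, not something the patch is known to do.

```c
#include <stdint.h>

#define BLCKSZ_CONST 8192       /* stand-in for a compile-time BLCKSZ */

/* Divisor known at compile time: compilers typically emit a shift. */
static uint64_t
block_of_offset_const(uint64_t offset)
{
    return offset / BLCKSZ_CONST;
}

/* Divisor only known at run time: generally a real division. */
static uint64_t
block_of_offset_runtime(uint64_t offset, uint32_t rel_blck_size)
{
    return offset / rel_blck_size;
}

/* log2 of a power of two, computed once when the sizes are loaded. */
static unsigned
log2_of_power_of_two(uint32_t v)
{
    unsigned    shift = 0;

    while (v > 1)
    {
        v >>= 1;
        shift++;
    }
    return shift;
}

/* Possible mitigation: keep the shift around and avoid the division. */
static uint64_t
block_of_offset_shift(uint64_t offset, unsigned blck_shift)
{
    return offset >> blck_shift;
}
```

All three functions return the same block number for a power-of-two block size; they differ only in what the compiler can know ahead of time.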


- Andres


On Wed, Jan 3, 2018 at 3:43 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
> Justifications are:

I think this is all missing the point.  If varying the block size (or
whatever) is beneficial, then having it configurable at initdb is
clearly useful.  But, as Andres says, you haven't submitted any
evidence that this is the case.  You need to describe scenarios in
which (1) a non-default blocksize performs better and (2) there's no
reasonable way to obtain the same performance improvement without
changing the block size.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Hello,

2018-01-03 21:51 GMT+01:00 Andres Freund <andres@anarazel.de>:
> Hi,
>
> On 2018-01-03 21:43:51 +0100, Remi Colinet wrote:
> > - We may want to test different combinations of file and block sizes, for
> > the relations and the WAL, in order to get the best server performance.
> > Avoiding a compilation for each combination of values seems to make sense.
>
> That's something you need to prove to be beneficial *before* we make this
> change.
 
Performance is only one argument for choosing block/file sizes at run time.

A DBA may just want larger files for relations and WAL in order to reduce the number of files. Why would this be an unacceptable wish? Just because a developer decided to choose a value for the whole world?

What about the fact that storage gets larger every year? OK, at some point a developer may change the default value in the source code and rebuild. But this is not very handy. For instance, we do not need to rebuild a kernel when we want to change just one parameter.

By the way, when someone installs PostgreSQL, they may not want to rebuild it, only to use it.



> > - Selecting the correct values for file and block sizes is a DBA task, not
> > a developer task.
> > For instance, when someone wants to create a Linux filesystem with a given
> > block size, they are not forced to accept a value chosen by the developer
> > of the filesystem driver when the latter was compiled.
>
> I'm unconvinced there's as much value in syncing up the filesystem and
> PostgreSQL block sizes as some conventional wisdom says.

The argument is that visible parameters should be set by users or DBAs. This is an admin task. For instance, someone using storage with 4K sectors may need to set the block size to 4K for both WAL and relations, without having to rebuild the binaries. Building binaries is not an easy task for everybody.



> > - The file and block sizes should depend mostly on the physical server and
> > physical storage.
> > Not on the database software itself.
>
> Citation needed.

Someone using a large database will probably want larger files. This is a matter of personal perception. Some companies may also have defined policies regarding databases in order to avoid having too many files.

When using storage with 4K blocks, it may be better to use a 4K block size for PostgreSQL. But then, what about storage with 16K blocks? Rebuild again...? And then you need a build for each block and file size combination. You may end up with a lot of builds to manage.



> > Regarding the cost of using run-time configurable values for the file and
> > block sizes of the WAL and relations, this cost is low both:
> >
> > - from a developer point of view: the source code changes are spread over
> > many files, but only a few of them have significant changes.
> > tidbitmap.c is the file mainly affected by the change. Other changes are
> > minor.
> >
> > - from a run-time point of view: the overhead is only at the start of the
> > database instance.
> > Moreover, even that overhead at server start is very low, with only a few
> > more dynamic memory allocations.
>
> That's to some degree because you rely on stack allocation of
> variable-sized amounts of data - we can't rely on that. E.g. you allocate
> stack variables sized by rel_block_size, that's unfortunately not
> ok. Additionally some of the size calculations will have some
> performance impact.

Data structures depending on BLCKSZ and allocated on the stack are migrated to palloc/pfree management in the patch. A few files are affected by this change, the most noticeable being tidbitmap.c. That one is a bit more difficult to change because it directly includes the header file simplehash.h (not nice for gdb). Anyway, I could perform the conversion to run-time values with a minimal change, even for tidbitmap.c.
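
The kind of conversion described here can be sketched in plain C as follows. This is not code from the patch: malloc/free stand in for palloc/pfree so the example is self-contained, and the function names and the byte-summing workload are invented for the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ_CONST 8192       /* stand-in for a compile-time BLCKSZ */

/* Dummy workload: sum the bytes of a buffer. */
static uint32_t
sum_bytes(const unsigned char *p, size_t n)
{
    uint32_t    total = 0;

    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* Before: the size is a constant, so a fixed-size stack array works. */
static uint32_t
scratch_sum_fixed(const unsigned char *page)
{
    unsigned char scratch[BLCKSZ_CONST];

    memcpy(scratch, page, sizeof(scratch));
    return sum_bytes(scratch, sizeof(scratch));
}

/*
 * After: the size is only known at run time, so the scratch copy is
 * heap-allocated.  Returns 0 on success, -1 if allocation fails.
 */
static int
scratch_sum_runtime(const unsigned char *page, size_t rel_blck_size,
                    uint32_t *result)
{
    unsigned char *scratch = malloc(rel_blck_size);

    if (scratch == NULL)
        return -1;
    memcpy(scratch, page, rel_blck_size);
    *result = sum_bytes(scratch, rel_blck_size);
    free(scratch);
    return 0;
}
```

The run-time variant pays for an allocation and a free on every call, which is the trade-off behind the objection to variable-sized stack data quoted above.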

Regards
Remi




> - Andres



2018-01-03 22:04 GMT+01:00 Robert Haas <robertmhaas@gmail.com>:
> On Wed, Jan 3, 2018 at 3:43 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
> > Justifications are:
>
> I think this is all missing the point.  If varying the block size (or
> whatever) is beneficial, then having it configurable at initdb is
> clearly useful.  But, as Andres says, you haven't submitted any
> evidence that this is the case.  You need to describe scenarios in
> which (1) a non-default blocksize performs better and (2) there's no
> reasonable way to obtain the same performance improvement without
> changing the block size.

Block size does not boil down to performance only.

For instance, a larger block size allows:
- avoiding toasting tuples. Rows larger than the default block size can justify larger block sizes.
- reducing fragmentation in relations.
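
To give a sense of scale for the toasting point: as far as I know, the size at which a row becomes a candidate for TOAST is derived from the block size by aiming for about four tuples per page. The C sketch below reproduces that arithmetic under stated assumptions (24-byte page header, 4-byte line pointers, 8-byte maximum alignment, 4 tuples per page, as on a typical 64-bit build). The macro and function names are local to the example.

```c
#include <stdint.h>

#define PAGE_HEADER_SIZE 24u    /* assumed page header size in bytes */
#define ITEM_ID_SIZE      4u    /* assumed line pointer size in bytes */
#define MAX_ALIGN         8u    /* assumed maximum alignment */
#define TUPLES_PER_PAGE   4u    /* assumed target tuples per page */

static uint32_t
align_up(uint32_t v)
{
    return (v + MAX_ALIGN - 1) & ~(MAX_ALIGN - 1);
}

static uint32_t
align_down(uint32_t v)
{
    return v & ~(MAX_ALIGN - 1);
}

/* Approximate row size above which toasting is attempted. */
static uint32_t
toast_threshold(uint32_t blcksz)
{
    uint32_t    overhead;

    overhead = align_up(PAGE_HEADER_SIZE + TUPLES_PER_PAGE * ITEM_ID_SIZE);
    return align_down((blcksz - overhead) / TUPLES_PER_PAGE);
}
```

Under these assumptions the threshold is 2032 bytes for 8 kB blocks, 4080 bytes for 16 kB blocks and 8176 bytes for 32 kB blocks, so it grows roughly in proportion to the block size.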

If changing the block size at initdb is useless, then why allow developers to set the block size at compile time?
The patch only shifts the block size choice from compile time to run time.

Regards
Remi


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

On Wed, January 3, 2018 4:04 pm, Robert Haas wrote:
> On Wed, Jan 3, 2018 at 3:43 PM, Remi Colinet <remi.colinet@gmail.com>
> wrote:
>> Justifications are:
>
> I think this is all missing the point.  If varying the block size (or
> whatever) is beneficial, then having it configurable at initdb is
> clearly useful.  But, as Andres says, you haven't submitted any
> evidence that this is the case.  You need to describe scenarios in
> which (1) a non-default blocksize performs better and (2) there's no
> reasonable way to obtain the same performance improvement without
> changing the block size.

(Note: I gather that "block size" here is the page size, but I'm not
entirely sure. So this detail might make my arguments moot :)

First, I do think there is a chicken-and-egg problem here. If you can't
vary the values except by re-compiling, almost no-one does. So you don't
get any benchmarks, or data.

And even if the author of the patch does this, he can only provide very
spotty data, which might not be enough to draw the right conclusions.

Plus, if the goal is to "select the optimal size" (as opposed to Remi's
goal, which seems to me to be "make it easier to select the optimal size for
one system at hand"), you might not be able to do this, as the "optimal"
size is different for different conditions, but you can't try different
values if the value is compiled in...

Plus, isn't almost all advancement in computing that you replace "1 + 1"
with "x + y"? :)

So, I do think the patch has some merits, because at least it allows much
easier benchmarking. And if the value was configurable at initdb time, I
think more people would experiment and settle for their optimum. So for
me the question here isn't "does it benefit everyone", but "does it
benefit some, and what is the cost for others".

Plus, I stumbled just by accident over a blog post from Tomas Vondra[0]
from 2015, who seems to agree that different page sizes can be useful:

https://blog.pgaddict.com/posts/postgresql-on-ssd-4kb-or-8kB-pages

Methinks the next steps would be that more benchmarks are done on more
distinct hardware to see what benefits we can see, and weigh this against
the complexity the patch introduces into the code base.

Hope that does make sense,

Tels

[0]: Tomas, great blog, btw!


On Thu, Jan 4, 2018 at 5:15 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
> Block size does not boil down to performance only.
>
> For instance, a larger block size allows:
> - avoiding toasting tuples. Rows larger than the default block size can
> justify larger block sizes.
> - reducing fragmentation in relations.

Well, I think those are things that you do to improve performance.
So, ultimately, I would argue that it does come down to performance.

> If changing the block size at initdb is useless, then why allow developers
> to set the block size at compile time?

In my view, right now, changing BLCKSZ is only marginally supported.
It's there so that you can experiment, but it's not really something
we expect users to do.  I think we have no buildfarm coverage of
different block sizes.  I am not sure that we even consistently fix
regression test failures with other block sizes even if someone
reports them.  If there's a bug in some index AM that only manifests
with some non-default block size, we might not know about it.  I think
that if we make this an initdb-time option, we're committing to fix
all of those issues: add tests, fix bugs, and of course document how
to set the parameter properly (which means we have to know how it
should be set, which means we have to know what the effects of
changing it are on different systems and workloads).  You may or may
not be willing to do some of that work, but I suspect there's a good
chance that it will require effort from other people as well -- e.g.
if we turn up a bug in BRIN, are you going to dive into that and fix
it, or are you going to hope Alvaro does something about it?  He'll
probably have to review and commit your patch, at the least.

Of course, it's possible there are no such problems and everything
will just work.

I looked around a little for previous tests that had been run in this
area and found these:

https://blog.pgaddict.com/posts/postgresql-on-ssd-4kb-or-8kB-pages
https://www.cybertec-postgresql.com/en/postgresql-block-sizes-getting-started/
http://blog.coelho.net/database/2014/08/08/postgresql-page-size-for-SSD.html

All of those seem to agree that smaller block sizes can help
performance, sometimes significantly, and larger block sizes hurt
performance, which is sort of surprising to me since that also means
that your database will get bigger: at a 4kB page size, you have
to store at least twice as many page headers as you would with an 8kB
page size.  Some of them also mention reasons why you might want a
larger block size. I believe I recall a mailing-list discussion some
years back about how index pages might need some kind of page-internal
indexing for efficiency with large block sizes, because a simple
binary search might touch too many cache lines.  It seems like if we
want to have good performance at a variety of block sizes -- rather
than just having it technically work -- we might need to do a fair
amount of investigation of what factors account for good and bad
performance at a variety of settings and consider whether there are
design changes that might mitigate some of the problems.
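
The page-header remark can be checked with a few lines of C. This sketch assumes a 24-byte page header and ignores line pointers, tuple headers and free space, so it only covers the fixed per-page cost; the function name is made up for the example.

```c
#include <stdint.h>

#define PAGE_HEADER_SIZE 24u    /* assumed page header size in bytes */

/* Bytes spent on page headers when storing data_bytes worth of pages. */
static uint64_t
page_header_bytes(uint64_t data_bytes, uint32_t blcksz)
{
    uint64_t    pages = (data_bytes + blcksz - 1) / blcksz;

    return pages * PAGE_HEADER_SIZE;
}
```

For 1 GB of pages this gives 3 MB of headers at 8 kB and 6 MB at 4 kB: twice as much, as stated above, though still well under 1% of the total in either case.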

I think that if you're interested in making non-default block sizes
more supported in PostgreSQL, some good first steps would be:

- run make check-world at all the supposedly-supported block sizes and
see if it passes.

- set up some buildfarm critters that run with various non-default
block sizes on various hardware and software platforms.  Ideally we
should have various combinations of 32-bit and 64-bit; Linux, Windows,
and other; and the whole range of block sizes but especially the more
extreme ones.

- run performance tests with a variety of workloads, not just pgbench,
at various block sizes and on various hardware, and post or blog about
the results

If it's clear that non-default block sizes (a) work and (b) are good,
then at least IMHO it's quite likely that we would want this patch.
Maybe those things are already clear to you, but they're not
completely clear to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Fri, Jan 5, 2018 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> - run make check-world at all the supposedly-supported block sizes and
> see if it passes.

Last time I tried that with a 16kB-size block, which was some months
back, make check complained about some plan inconsistencies. Perhaps
that's something that could be improved to begin with. There is always
some margin in this area.
-- 
Michael


On Fri, Jan 5, 2018 at 7:54 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jan 5, 2018 at 9:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> - run make check-world at all the supposedly-supported block sizes and
>> see if it passes.
>
> Last time I tried that with a 16kB-size block, which was some months
> back, make check complained about some plan inconsistencies. Perhaps
> that's something that could be improved to begin with. There is always
> some margin in this area.

Yeah, that's exactly the kind of thing that needs to be ironed out
before we can think about actually recommending that users run with
other block sizes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [Patch] Make block and file size for WAL and relations defined at cluster creation

From
Alexander Korotkov
Date:
On Wed, Jan 3, 2018 at 12:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Dec 31, 2017 at 12:00 PM, Remi Colinet <remi.colinet@gmail.com> wrote:
> > The patch below makes block and file sizes defined at cluster creation for
> > both the WAL and the relations. This avoids needing a different server build
> > for each possible combination of block and file sizes.
>
> The email thread where we discussed making the WAL segment size
> configurable at initdb time contained a detailed rationale, explaining
> why it was useful to be able to make such a change.  The very short
> version is that, if a system is generating WAL at a very high rate,
> being able to group that WAL into fewer, larger files makes life
> easier since, for example, the latency requirements for
> archive_command are not as tight, and "ls pg_wal" doesn't have to go
> into the tank just trying to read the directory contents.
>
> Your email doesn't seem to contain a rationale explaining why the
> block and file sizes should be run-time configurable.  There may be a
> very good reason, but can you explain what it is?

I'd like to add my 2 cents regarding larger relation file sizes.  When dealing
with large multi-terabyte databases, a user may operate with a number of files
greater than max_files_per_process.  In this case, fetching blocks from a
relation may turn out to be quite inefficient.  For example, fetching another
relation block may frequently cause the opening of one file descriptor and the
eviction and closing of another.  Given that modern servers can have multiple
terabytes of RAM, that may happen even while dealing with data that fits in the
OS cache.  Thus, this overhead is really quite noticeable.
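
A quick C sketch of the arithmetic behind this concern follows. The function names are invented for the example, the 4 TB relation size is an arbitrary assumption, and 1000 is the default value of max_files_per_process as far as I know; forks, indexes and other open files are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of segment files a relation of relation_bytes occupies. */
static uint64_t
relation_segment_files(uint64_t relation_bytes, uint64_t rel_file_size)
{
    return (relation_bytes + rel_file_size - 1) / rel_file_size;
}

/* True if one backend cannot keep all of those files open at once. */
static bool
exceeds_fd_budget(uint64_t files, uint64_t max_files_per_process)
{
    return files > max_files_per_process;
}
```

Under these assumptions, a 4 TB relation is 4096 files with 1 GB segments, well above a budget of 1000 descriptors per backend, but only 256 files with 16 GB segments.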

A possible solution might be to increase the max_files_per_process parameter.
However, assuming that hundreds or even thousands of backends are
running, the user may easily hit the OS limit on the number of open file
descriptors.  Even if the user increases that limit, it may cause performance
degradation, because the kernel doesn't handle such a large number of file
descriptors efficiently.

This problem will go away if we switch to a threaded model with pread/pwrite.
Then we wouldn't have a per-backend file descriptor for every file.
However, that doesn't seem likely in the near future.  This is why larger file
sizes seem to be a valid approach to mitigate this problem in the meantime.
Experimental research on this subject is required before considering
committing any patches, though.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company