Thread: Re: Read/Write block sizes (Was: Caching by Postgres)
> Does that include increasing the size of read/write blocks? I've
> noticed that with a large enough table it takes a while to do a
> sequential scan, even if it's cached; I wonder if the fact that it
> takes a million read(2) calls to get through an 8G table is part of
> that.

The OS already does some of that via readahead and by coalescing and
throttling reads and writes, but it's not enough for these kinds of
workloads. Here is what I think will help:

* Support for a different block size per TABLESPACE without recompiling
the code (at least support for a different block size for the whole
database without recompiling).

* Support for WAL files bigger than 16MB WITHOUT recompiling the code.
It should be a tunable if you ask me (with checkpoint_segments at 256
you have too many 16MB files in the log directory). This will help OLTP
benchmarks more, since they then don't spend time rotating log files.

* Introduce a multiblock or extent tunable where you can define a
multiple of 8K (or of the block-size tunable) to read a bigger chunk
and store it in the buffer pool (maybe for writes too). Most devices
now support chunks of up to 1MB for reads and writes.

* There should be a way to preallocate files for TABLES in TABLESPACES;
otherwise writing multiple tables in the same filesystem ends up with
fragmented files, which causes poor "READS" from those files. (See the
sketch after this message.)

* With 64-bit builds the 1GB file-chunk limit is also moot. Maybe it
should be tunable too, say 100GB, without recompiling the code.

Why is recompiling bad? Most companies that support Postgres will
support their own binaries, and they won't want different versions of
the binaries for different block sizes, different WAL file sizes, and
so on. More functionality from the same set of binaries is more
desirable in enterprise environments.

Regards,
Jignesh
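A minimal sketch of the preallocation idea above, assuming a POSIX
system with posix_fallocate() available; the file name and size are
illustrative, and this is not how PostgreSQL actually creates its
segment files:

    /* Sketch: reserve a table segment's space up front so the
     * filesystem can hand out a large contiguous extent instead of
     * growing the file 8K at a time as rows arrive. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "16385.prealloc"; /* hypothetical */
        long long size = 1LL << 30;            /* reserve a full 1GB segment */

        int fd = open(path, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }

        int err = posix_fallocate(fd, 0, size);  /* returns an errno value */
        if (err != 0) { fprintf(stderr, "posix_fallocate: error %d\n", err); return 1; }

        close(fd);
        return 0;
    }

Whether the filesystem actually lays the reservation out contiguously
depends on the filesystem, which is part of why this is listed as a
wish rather than a sure win.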
J.K.Shah@Sun.COM (Jignesh Shah) writes:
> [list of proposed tunables snipped; quoted in full upthread]
>
> Why is recompiling bad? Most companies that support Postgres will
> support their own binaries, and they won't want different versions of
> the binaries for different block sizes, different WAL file sizes, and
> so on.

Every single one of these still begs the question of whether the
changes will have a *material* impact on performance.

What we have been finding, as RAID controllers get smarter, is that it
is getting increasingly futile to try to attach knobs to 'disk stuff;'
it is *way* more effective to add a few more spindles to an array than
it is to fiddle with which disks are to be allocated to what database
'objects.'

The suggested 'knobs' above are all going to add complexity, and it is
NOT evident that any of them will necessarily help. I could be wrong;
code contributions combined with actual benchmarking would be the real
proof of the merits of these ideas.

But it also suggests another question, namely...

Will these represent more worthwhile improvements to speed than working
on other optimizations that are on the TODO list?

If someone spends 100h working on one of these items, and gets a 2%
performance improvement, that's almost certain to be less desirable
than spending 50h on something else that gets a 4% improvement.

And we might discover that memory management improvements in Linux
2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
improvements "for free" behind our backs without *any* need to write
database code. :-)
--
let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;;
http://www3.sympatico.ca/cbbrowne/postgresql.html
Wiener's Law of Libraries:
There are no answers, only cross references.
On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote:
>The OS already does some of that via readahead and by coalescing and
>throttling reads and writes.

Note that I specified the fully cached case--even with the workload in
RAM the system still has to process a heck of a lot of read calls.

>* Introduce a multiblock or extent tunable where you can define a
>multiple of 8K (or of the block-size tunable) to read a bigger chunk
>and store it in the buffer pool (maybe for writes too). Most devices
>now support chunks of up to 1MB for reads and writes.

Yeah. The problem with relying on OS readahead is that the OS doesn't
know whether you're doing a sequential scan or an index scan; if you
have the OS aggressively read ahead you'll kill your seek performance.
OTOH, if you don't do readahead you'll kill your sequential scan
performance. At the app level you know which makes sense for each
operation.

Mike Stone
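A minimal sketch of the kind of per-operation hint Mike is describing,
using posix_fadvise() on the file descriptor. The path and the
placement of the calls are illustrative assumptions, not how PostgreSQL
is actually wired up:

    /* Sketch: tell the kernel which access pattern to expect for one
     * file, per operation, instead of leaving it to guess. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("base/16384/16385", O_RDONLY);  /* hypothetical heap file */
        if (fd < 0) { perror("open"); return 1; }

        /* About to do a sequential scan: ask for aggressive readahead
         * over the whole file (len = 0 means "to end of file"). */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        /* ... read the file front to back ... */

        /* About to do a burst of index probes: the pattern is random,
         * so readahead would only waste bandwidth and seeks. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
        /* ... pread() individual 8K blocks ... */

        close(fd);
        return 0;
    }

madvise(), mentioned downthread, provides the analogous hints for
memory-mapped regions.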
On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
>What we have been finding, as RAID controllers get smarter, is that it
>is getting increasingly futile to try to attach knobs to 'disk stuff;'
>it is *way* more effective to add a few more spindles to an array than
>it is to fiddle with which disks are to be allocated to what database
>'objects.'

That statement doesn't say anything about trying to maximize
performance to or from a disk array. Yes, controllers are getting
smarter--but they aren't omniscient. IME an I/O-bound sequential table
scan doesn't get data moving off the disk nearly as fast as, say, a dd
with a big ibs. Why? There are obviously a lot of factors at work, but
one of them is that the RAID controller can optimize "grab this meg" a
lot more than it can optimize "grab this 8k".

Mike Stone
On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
> J.K.Shah@Sun.COM (Jignesh Shah) writes:
> > [list of proposed tunables snipped; quoted in full upthread]
>
> Every single one of these still begs the question of whether the
> changes will have a *material* impact on performance.

How many of these things are currently easy to change with a recompile?
I should be able to start testing some of these ideas in the near
future, if they only require minor code or configure changes.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      512-569-9461
Chris,

Unless I am wrong, you're making the assumption that the amount of time
spent and the ROI are known. Maybe those who've been down this path
know how to get that additional 2-4% in 30 minutes or less?

While each person's and business's performance gains (or not) will
vary, wouldn't spending 50-100h to gain 2-4% seem worth the investment
for a 24x7 operation over the course of a month?

I would assume that dbt2 with STP helps minimize the amount of hours
someone has to invest to determine performance gains with configurable
options?

Steve Poe

> If someone spends 100h working on one of these items, and gets a 2%
> performance improvement, that's almost certain to be less desirable
> than spending 50h on something else that gets a 4% improvement.
>
> And we might discover that memory management improvements in Linux
> 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
> improvements "for free" behind our backs without *any* need to write
> database code. :-)
On Tue, 2005-08-23 at 19:12 -0400, Michael Stone wrote:
> [...]
> Yeah. The problem with relying on OS readahead is that the OS doesn't
> know whether you're doing a sequential scan or an index scan; if you
> have the OS aggressively read ahead you'll kill your seek performance.
> OTOH, if you don't do readahead you'll kill your sequential scan
> performance. At the app level you know which makes sense for each
> operation.

This is why we have madvise() with MADV_RANDOM and MADV_SEQUENTIAL.

-jwb
Hi Jim,

| How many of these things are currently easy to change with a recompile?
| I should be able to start testing some of these ideas in the near
| future, if they only require minor code or configure changes.

The following:

* data file segment size of 1GB
* WAL file size of 16MB
* block size of 8K

are very easy to change with a recompile. A tunable would still be
greatly preferred, as it would allow one binary for different tunings.
(A sketch of where these constants live follows this message.)

* Multiblock read/write

is not available, but would greatly help in reducing the number of
system calls, which will only increase as the size of the database
increases if nothing is done about it.

* Pregrown files... maybe not important at this point, since TABLESPACE
can currently work around it a bit (you just need to create a different
filesystem for each tablespace).

But if you really think hardware & OS are the answer for all the small
things... then I think we should now start to look at how to make
Postgres multi-threaded or multi-processed for each connection. With
the influx of "dual-core" and "multi-core" CPUs being the fad, Postgres
could have the cutting edge if it somehow exploited multiple cores.

Somebody mentioned that adding a CPU to a Postgres workload halved the
average CPU usage... YEAH... PostgreSQL uses only 1 CPU per connection
(assuming 100% usage), so if you add another CPU it is idle anyway and
the system will report only 50% :-) But the important thing to measure
is whether the query time was cut down or not. (No flames--I am sure
you were talking about a multi-connection, multi-user environment :-) )
But my point is that this approach is then worth the ROI and the time
and effort spent to solve this problem.

I actually vote for a multi-threaded solution within each connection
while still maintaining a separate process for each connection. This
way the fundamental architecture of Postgres doesn't change, but a
multi-threaded connection can then start to exploit multiple cores
(maybe with tunables for the number of threads reading data files, who
knows). If somebody is interested in actually working on a design,
contact me and I will be interested in assisting with this work.

Regards,
Jignesh

Jim C. Nasby wrote:
> [full quote of upthread messages snipped]
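The three compile-time values listed above are preprocessor constants
in the source tree. A sketch of what they look like, with illustrative
alternate values in the comments; the exact file
(src/include/pg_config_manual.h in sources of this era) and the names
can vary between versions:

    /* Sketch of the compile-time knobs in question (illustrative; the
     * real definitions live in the PostgreSQL source tree, and names
     * and locations vary by version). */

    /* Size of a disk block and buffer-pool page; must be a power of 2. */
    #define BLCKSZ          8192                 /* e.g. 32768 for 32K pages */

    /* Blocks per table-file segment: 0x40000000 / BLCKSZ = 1GB segments. */
    #define RELSEG_SIZE     (0x40000000 / BLCKSZ)

    /* Size of each WAL segment file. */
    #define XLOG_SEG_SIZE   (16 * 1024 * 1024)   /* e.g. 64MB to cut file churn */

Changing any of these means rebuilding (and re-initdb'ing), which is
exactly the one-binary-for-all-tunings problem described upthread.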
Steve,

> I would assume that dbt2 with STP helps minimize the amount of hours
> someone has to invest to determine performance gains with
> configurable options?

Actually, these I/O operation issues show up mainly with DW workloads,
so the STP isn't much use there. If I can ever get some of these
machines back from the build people, I'd like to start testing some
stuff.

One issue with testing this is that currently PostgreSQL doesn't
support block sizes above 128K. We've already done testing on that
(well, Mark has) and the performance gains aren't even worth the hassle
of remembering you're on a different block size (like, +4%).

What the Sun people have done with other DB systems is show that
substantial performance gains are possible on large databases (>100G)
using block sizes of 1MB. I believe that's possible (and that it
probably makes more of a difference on Solaris than on BSD), but we
can't test it without some hackery first.

--
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus wrote:
> One issue with testing this is that currently PostgreSQL doesn't
> support block sizes above 128K. We've already done testing on that
> (well, Mark has) and the performance gains aren't even worth the
> hassle of remembering you're on a different block size (like, +4%).

What size database was this on?

> What the Sun people have done with other DB systems is show that
> substantial performance gains are possible on large databases (>100G)
> using block sizes of 1MB. I believe that's possible (and that it
> probably makes more of a difference on Solaris than on BSD), but we
> can't test it without some hackery first.

We're running a 100+GB database, with long streams of 8KB reads and the
occasional _llseek(). I've been thinking about running with a larger
block size with the expectation that we'd see fewer system calls and a
bit more throughput.

read() calls are a very expensive way to get 8KB of memory (that we
know is already resident) during scans. One has to trap into the
kernel, do the usual process state accounting, find the block, copy the
memory to userspace, return from the kernel to user space reversing all
the process accounting, pick out the bytes one needs, and repeat all
over again. That's quite a few sacrificial cache lines for 8KB. Yeah,
sure, Linux syscalls are fast, but they aren't that fast, and other
operating systems (Windows and Solaris) have a bit more syscall
overhead.

Regarding large block sizes on Solaris: the Solaris folks can also use
large memory pages and avoid a lot of the TLB overhead from the VM
system. The various trapstat and cpustat commands can be quite
interesting to look at when running any large application on a Solaris
system.

It should be noted that having a large shared memory segment can be a
performance loser just from the standpoint of TLB thrashing. O(GB)
memory access patterns can take a huge performance hit in user space
with 4K pages compared to the kernel, which would be mapping the
"segmap" (in Solaris parlance) with 4MB pages.

Anyway, I guess my point is that the balance between kernel-managed and
postgresql-managed buffers isn't obvious at all.

-- Alan
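A crude way to see the per-call cost Alan describes is to stream the
same (already cached) file with two different read sizes and compare
the elapsed time; a sketch, with the file name purely illustrative:

    /* Sketch: stream a cached file once in 8K read() calls and once in
     * 1MB read() calls; the difference is mostly per-syscall overhead. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    static void stream_file(const char *path, size_t bufsz)
    {
        char *buf = malloc(bufsz);
        int fd = open(path, O_RDONLY);
        if (fd < 0 || buf == NULL) { perror("open/malloc"); exit(1); }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, bufsz)) > 0)
            total += n;

        gettimeofday(&t1, NULL);
        close(fd);
        free(buf);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%8zu-byte reads: %lld bytes in %.3f s\n", bufsz, total, secs);
    }

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "bigtable.dat";  /* hypothetical */
        stream_file(path, 8 * 1024);      /* what an 8K-block seq scan does      */
        stream_file(path, 1024 * 1024);   /* what a 1MB multiblock read would do */
        return 0;
    }

Run it twice so both passes hit the cache; the remaining gap between
the two read sizes is the kernel-crossing cost being discussed.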
On Tue, 2005-08-23 at 19:31 -0700, Josh Berkus wrote:
> What the Sun people have done with other DB systems is show that
> substantial performance gains are possible on large databases (>100G)
> using block sizes of 1MB. I believe that's possible (and that it
> probably makes more of a difference on Solaris than on BSD), but we
> can't test it without some hackery first.

To get decent I/O you need 1MB fundamental units all the way down the
stack. You need a filesystem that can take a 1MB write well, you need
an I/O scheduler that will keep it together, and you need a storage
controller that can eat a 1MB request at once. Ideally you'd like an
architecture with a 1MB page (Itanium has this, and AMD64 Linux will
soon have this). The Lustre people have done some work in this area,
opening up the datapaths in the kernel so they can keep the hardware
really working. They even modified the QLogic SCSI/FC driver so it
supports such large transfers. Their work has shown that you can get a
significant performance boost on Linux just by thinking in terms of
larger transfers.

Unfortunately I'm really afraid that this conversation is about trees
when the forest is the problem. PostgreSQL doesn't even have an async
reader, which is the sort of thing that could double or triple its
performance. You're talking about block sizes and such, but the kinds
of improvements you can get there are in the tens of percents at most.

-jwb
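A minimal sketch of what an "async reader" buys: start the read of the
next chunk before processing the current one, so disk and CPU overlap.
This uses POSIX AIO (link with -lrt on Linux); the file name and chunk
size are illustrative, and it is not a proposed PostgreSQL patch:

    /* Sketch: double-buffered asynchronous reads with POSIX AIO. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1024 * 1024)             /* 1MB transfer unit */

    static char bufs[2][CHUNK];

    int main(void)
    {
        int fd = open("bigtable.dat", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = bufs[0];
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = 0;
        aio_read(&cb);                      /* prime the pipeline */

        off_t offset = 0;
        int cur = 0;

        for (;;)
        {
            /* Wait for the outstanding read to finish. */
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
            ssize_t n = aio_return(&cb);
            if (n <= 0)
                break;

            int done = cur;
            cur = 1 - cur;
            offset += n;

            /* Kick off the next read immediately... */
            memset(&cb, 0, sizeof(cb));
            cb.aio_fildes = fd;
            cb.aio_buf = bufs[cur];
            cb.aio_nbytes = CHUNK;
            cb.aio_offset = offset;
            aio_read(&cb);

            /* ...and process the chunk we already have while it runs. */
            (void) bufs[done];              /* process n bytes here */
        }

        close(fd);
        return 0;
    }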
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > To get decent I/O you need 1MB fundamental units all the way down the > stack. It would also be a good idea to have an application that isn't likely to change a single bit in a 1MB range and then expect you to record that change. This pretty much lets Postgres out of the picture. regards, tom lane
> Unfortunately I'm really afraid that this conversation is about trees
> when the forest is the problem. PostgreSQL doesn't even have an async
> reader, which is the sort of thing that could double or triple its
> performance. You're talking about block sizes and such, but the kinds
> of improvements you can get there are in the tens of percents at most.

Not 100% sure, but I'm fairly certain we were seeing significant
performance degradation from too much _scheduled_ I/O activity, i.e.
too much work being submitted to the kernel, due to excessive
parallelism already!! The classic example of this is a seqscan being
interleaved with an index scan, so the disks end up doing nothing but
seek activity.

Out of all the stuff talked about on this thread so far, only tweaking
the block size (and the madvise() stuff) makes any real-world sense, as
it's the only thing talked about that increases the _work_per_seek_.

As for async IO, sure, you might think "oh, async IO would be so
cool!!" and I did, once, too. But then I sat down and _thought_ about
it, and decided, well, no, actually, there are _very_ few areas it
could actually help, and in most cases it just makes it easier to drive
your box into lseek()-induced IO collapse.

Don't forget that already in postgres you have a process per
connection, and all the processes take care of their own I/O.

Somebody mentioned having threaded backends too, but the only benefit
would be reduced memory footprint (a backend consumes 1-2MB of RAM,
which is almost enough to be a concern for largish systems with a lot
of backends), and personally I _know_ the complexities introduced
through threading are usually not worth it.

IMVVHO (naive experience) what is needed is a complete architecture
change (probably infeasible and only useful as a thought experiment),
where:

* a network I/O process deals with client connections
* a limited pool of worker processes deals with statements (perhaps
  related to the number of spindles somehow)

so that when a client issues a statement, the net-IO process simply
forwards the connection state to a worker process and says "deal with
this". (Clearly the state object needs to contain all user and
transaction state the connection is involved in; a sketch of the
descriptor hand-off follows this message.)

- Guy Thornley
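The hand-off step in that hypothetical architecture is usually done by
passing the client socket's file descriptor to a worker over a
Unix-domain socket. A sketch of the sending side, with the function
name and the one-byte framing invented for illustration:

    /* Sketch: pass an accepted client socket from a network-I/O process
     * to a worker process using SCM_RIGHTS over an AF_UNIX socket. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send client_fd across the already-connected Unix socket worker_fd.
     * Returns 0 on success, -1 on error.  (Hypothetical helper name.) */
    int hand_off_connection(int worker_fd, int client_fd)
    {
        char dummy = 'c';                     /* must carry >= 1 data byte */
        struct iovec iov = { &dummy, 1 };

        union {                               /* aligned ancillary buffer */
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } ctrl;
        memset(&ctrl, 0, sizeof(ctrl));

        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = sizeof(ctrl.buf);

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_RIGHTS;
        cm->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &client_fd, sizeof(int));

        return (sendmsg(worker_fd, &msg, 0) < 0) ? -1 : 0;
    }

The worker receives the descriptor with the matching recvmsg() call;
shipping the user and transaction state alongside it is the hard part
Guy alludes to.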
On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote:
> As for async IO, sure, you might think "oh, async IO would be so
> cool!!" and I did, once, too. But then I sat down and _thought_ about
> it, and decided, well, no, actually, there are _very_ few areas it
> could actually help, and in most cases it just makes it easier to
> drive your box into lseek()-induced IO collapse.
>
> Don't forget that already in postgres you have a process per
> connection, and all the processes take care of their own I/O.

That's the problem. Instead you want 1 or 4 or 10 I/O slaves
coordinating the I/O of all the backends optimally. For instance, with
synchronous scanning.

-jwb
"Jeffrey W. Baker" <jwbaker@acm.org> writes: > On Wed, 2005-08-24 at 17:20 +1200, Guy Thornley wrote: >> Dont forget that already in postgres, you have a process per connection, and >> all the processes take care of their own I/O. > That's the problem. Instead you want 1 or 4 or 10 i/o slaves > coordinating the I/O of all the backends optimally. For instance, with > synchronous scanning. And why exactly are we going to do a better job of I/O scheduling than the OS itself can do? There's a fairly basic disconnect in viewpoint involved here. The old-school viewpoint (as embodied in Oracle and a few other DBMSes) is that the OS is too stupid to be worth anything, and the DB should bypass the OS to the greatest extent possible, doing its own caching, disk space layout, I/O scheduling, yadda yadda. That might have been defensible twenty-odd years ago when Oracle was designed. Postgres prefers to lay off to the OS anything that the OS can do well --- and that definitely includes caching and I/O scheduling. There are a whole lot of smart people working on those problems at the OS level. Maybe we could make marginal improvements on their results after spending a lot of effort reinventing the wheel ... but our time will be repaid much more if we work at levels that the OS cannot have knowledge of, such as join planning and data statistics. There are some things we could do to reduce the impedance between us and the OS --- for instance, the upthread criticism that a seqscan asks the OS for only 8K at a time is fair enough. But that doesn't translate to a conclusion that we should schedule the I/O instead of the OS. regards, tom lane
On Wed, 2005-08-24 at 01:56 -0400, Tom Lane wrote:
> And why exactly are we going to do a better job of I/O scheduling than
> the OS itself can do?
> ...
> There are some things we could do to reduce the impedance between us
> and the OS --- for instance, the upthread criticism that a seqscan
> asks the OS for only 8K at a time is fair enough. But that doesn't
> translate to a conclusion that we should schedule the I/O instead of
> the OS.

Synchronous scanning is a fairly huge and obvious win. If you have two
processes 180 degrees out of phase in a linear read, neither process is
going to get anywhere near the throughput it would get from a single
scan.

I think you're being deliberately obtuse with regard to file I/O and
the operating system. The OS isn't magical. It has to strike a balance
between a reasonable read latency and a reasonable throughput. As far
as the kernel is concerned, a busy postgresql server is
indistinguishable from 100 unrelated activities. All backends will be
served equally, even if in this case "equally" means "quite badly all
around."

An I/O slave process could be a big win in Postgres for many kinds of
reads. Instead of opening and reading files, the backends would connect
to the I/O slave and request that the file be read. If a scan of that
file were already underway, the new backends would be attached.
Otherwise a new scan would commence. In either case, the slave process
can issue (sometimes non-dependent) reads well ahead of the needs of
the backend. You may think the OS can do this for you, but it can't:
only postgres knows that it needs the whole file from beginning to end.
The OS can only guess.

Ask me sometime about my replacement for GNU sort. It uses the same
sorting algorithm, but it's an order of magnitude faster due to better
I/O strategy. Someday, in my infinite spare time, I hope to demonstrate
that kind of improvement with a patch to pg.

-jwb
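A toy model of the "attach to a scan already underway" idea: each new
sequential scan starts at the position the current pack of scanners has
reached, then wraps, so concurrent scans share one forward-moving read
stream. The names and the single-process framing are invented for
illustration; this is not how anything is implemented in PostgreSQL:

    /* Sketch: synchronized sequential scans via a shared scan position. */
    #include <stdio.h>

    #define NBLOCKS 1000000L          /* hypothetical table size in blocks */

    /* Imagine this living in shared memory, protected by a lock. */
    static long scan_position = 0;    /* block most recently read by anyone */

    /* A backend starting a seq scan joins wherever the pack is now. */
    static long scan_start_block(void)
    {
        return scan_position % NBLOCKS;
    }

    /* Backends publish their progress so later arrivals can join in. */
    static void scan_report_block(long blk)
    {
        scan_position = blk;
    }

    int main(void)
    {
        long start = scan_start_block();

        /* Visit every block once, starting at 'start' and wrapping. */
        for (long i = 0; i < NBLOCKS; i++)
        {
            long blk = (start + i) % NBLOCKS;
            /* read_block(blk) and process it here */
            scan_report_block(blk);
        }
        return 0;
    }

The payoff is that a scan started halfway through someone else's scan
reads blocks that are already streaming into cache, instead of forcing
the disk to alternate between two distant positions.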
> of effort reinventing the wheel ... but our time will be repaid much
> more if we work at levels that the OS cannot have knowledge of, such
> as join planning and data statistics.

Considering a global budget of man-hours, which is best?

1. Spend it reimplementing half of VFS in postgres, half of Windows in
postgres, half of FreeBSD in postgres, half of Solaris in Postgres,
only to discover you gain a meagre speed increase and a million and a
half bugs, or

2. Spend 5% of that time lowering the impedance between the OS and
Postgres, another 5% annoying kernel people and helping them tweak
stuff for database use, and the rest on useful features that give
useful speedups, like bitmap indexes, skip scans, and other features
that enhance power and usability?

If you're Oracle and have almost unlimited resources, maybe option 1.
But even Microsoft opted for option 2: they implemented ReadFileScatter
and WriteFileGather to lower the syscall overhead, and that's it. And
point 2 will benefit many other apps, whereas 1 would benefit only
postgres, and then only in certain cases.

I do believe there is something interesting to uncover with reiser4,
though (it definitely fits point 2).

I'm happy that the pg team chose point 2 and that new versions keep
coming with new features at an unbelievable rate these days. Do you
guys sleep?
This thread covers several performance ideas. First is the idea that
more parameters should be configurable. While this seems like a noble
goal, we try to make parameters auto-tuning, or, if users have to
configure a parameter, it should be useful for a significant number of
users.

In the commercial software world, if you can convince your boss that a
feature/knob is useful, it usually gets into the product.
Unfortunately, this leads to the golden doorknob on a shack, where some
features are out of sync with the rest of the product in terms of
usefulness and utility. With open source, if a feature cannot be
auto-tuned, or has significant overhead, the feature has to be
implemented and then proven to be a benefit.

In terms of adding async I/O, threading, and other things, it might
make sense to explore how these could be implemented in a way that fits
the above criteria.

---------------------------------------------------------------------------

Jignesh K. Shah wrote:
> [full message quoted upthread, snipped]
--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
spoe@sfnet.cc (Steve Poe) writes:
> Chris,
>
> Unless I am wrong, you're making the assumption that the amount of
> time spent and the ROI are known. Maybe those who've been down this
> path know how to get that additional 2-4% in 30 minutes or less?
>
> While each person's and business's performance gains (or not) will
> vary, wouldn't spending 50-100h to gain 2-4% seem worth the
> investment for a 24x7 operation?

What we *do* know is that adding these "knobs" would involve a
significant amount of effort, as the values are widely used throughout
the database engine. Making them dynamic (e.g. so they could be tuned
on a tablespace-by-tablespace basis) would undoubtedly require rather a
lot of development effort. They are definitely NOT 30-minute changes.

Moreover, knowing how to adjust them is almost certainly also NOT a
30-minute configuration change; significant benchmarking effort for the
individual application is almost sure to be needed.

It's not much different from the reason why PostgreSQL doesn't use
threading...

The problem with using threading is that introducing it to the code
base would require a pretty enormous amount of effort (I'll bet
multiple person-years), and it wouldn't provide *any* benefit until you
get rather a long way down the road.

Everyone involved in development seems to me to have a reasonably keen
understanding of what the potential benefits of threading are; the
value is that there fall out plenty of opportunities to parallelize the
evaluation of portions of queries. Alas, it wouldn't be until *after*
all the effort goes in that we would get any idea as to what kinds of
speedups this would provide.

In effect, there has to be a year invested in *breaking* PostgreSQL
(because this would initially break a lot, since thread programming is
a really tough skill) during which you don't actually see any benefits.

> I would assume that dbt2 with STP helps minimize the amount of hours
> someone has to invest to determine performance gains with
> configurable options?

That's going to help in constructing a "default" knob value. And if we
find an "optimal default," that encourages sticking with the current
approach of using #define to apply that value...

>> If someone spends 100h working on one of these items, and gets a 2%
>> performance improvement, that's almost certain to be less desirable
>> than spending 50h on something else that gets a 4% improvement.
>>
>> And we might discover that memory management improvements in Linux
>> 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
>> improvements "for free" behind our backs without *any* need to write
>> database code. :-)
--
let name="cbbrowne" and tld="ntlug.org" in
String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/lisp.html
"For those of you who are into writing programs that are as obscure and
complicated as possible, there are opportunities for... real fun here"
-- Arthur Norman
Tom, Gavin,

> > To get decent I/O you need 1MB fundamental units all the way down
> > the stack.
>
> It would also be a good idea to have an application that isn't likely
> to change a single bit in a 1MB range and then expect you to record
> that change. This pretty much lets Postgres out of the picture.

We're looking at this pretty much just for data warehousing, where you
constantly have gigabytes of data which don't change from month to
month or even year to year. I agree that it would *not* be an
optimization for OLTP systems. Which is why a build-time option would
be fine.

> Ummm... I don't see anything here which will be a win for Postgres.
> The transactional semantics we're interested in are fairly complex:
>
> 1) Modifications to multiple objects can become visible to the system
> atomically
> 2) On error, a series of modifications which had been grouped
> together within a transaction can be rolled back
> 3) Using object version information, determine which version of which
> object is visible to a given session
> 4) Using version information and locking, detect and resolve
> read/write and write/write conflicts

I wasn't thinking of database transactions. I was thinking specifically
of using Reiser4 transactions (and other transactional filesystems) to
do things like eliminate the need for full-page writes in the WAL.
Filesystems are low-level things which should take care of low-level
needs, like making sure an 8K page got written to disk even in the
event of a system failure.

--
Josh Berkus
Aglio Database Solutions
San Francisco
On Wed, Aug 24, 2005 at 12:12:22PM -0400, Chris Browne wrote:
> Everyone involved in development seems to me to have a reasonably
> keen understanding of what the potential benefits of threading are;
> the value is that there fall out plenty of opportunities to
> parallelize the evaluation of portions of queries. Alas, it wouldn't
> be until *after* all the effort goes in that we would get any idea as
> to what kinds of speedups this would provide.

My understanding is that the original suggestion was to use threads
within individual backends to allow for parallel query execution, not
switching to a completely thread-based model. In any case, there are
other ways to enable parallelism without using threads, such as handing
actual query execution off to a set of processes.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com      512-569-9461
Jeff,

> Ask me sometime about my replacement for GNU sort. It uses the same
> sorting algorithm, but it's an order of magnitude faster due to
> better I/O strategy. Someday, in my infinite spare time, I hope to
> demonstrate that kind of improvement with a patch to pg.

Since we desperately need some improvements in sort performance, I do
hope you follow up on this.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
At 03:45 PM 8/25/2005, Josh Berkus wrote:
>Jeff,
>
>> Ask me sometime about my replacement for GNU sort. It uses the same
>> sorting algorithm, but it's an order of magnitude faster due to
>> better I/O strategy. Someday, in my infinite spare time, I hope to
>> demonstrate that kind of improvement with a patch to pg.
>
>Since we desperately need some improvements in sort performance, I do
>hope you follow up on this.

I'll generalize that. IMO we desperately need any and all improvements
in IO performance. Even more so than we need improvements in sorting or
sorting IO performance.

Ron
rjpeace@earthlink.net (Ron) writes:
> At 03:45 PM 8/25/2005, Josh Berkus wrote:
>> > Ask me sometime about my replacement for GNU sort. It uses the
>> > same sorting algorithm, but it's an order of magnitude faster due
>> > to better I/O strategy. Someday, in my infinite spare time, I
>> > hope to demonstrate that kind of improvement with a patch to pg.
>>
>> Since we desperately need some improvements in sort performance, I
>> do hope you follow up on this.
>
> I'll generalize that. IMO we desperately need any and all
> improvements in IO performance. Even more so than we need
> improvements in sorting or sorting IO performance.

That's frankly a step backwards. Feel free to "specialise" that
instead.

A patch that improves some specific aspect of performance is a thousand
times better than any sort of "desperate desire for any and all
improvements in I/O performance." The latter is unlikely to provide any
usable result.

The "specialized patch" is also pointedly better in that a *confidently
submitted* patch is likely to be way better than any sort of "desperate
clutching at whatever may come to hand."

Far too often, I see people trying to address performance problems via
the "desperate clutching at whatever seems near to hand," and that
generally turns out very badly as a particular result of the whole
"desperate clutching" part.

If you can get a sort improvement submitted, that's a concrete
improvement...
--
select 'cbbrowne' || '@' || 'ntlug.org';
http://www3.sympatico.ca/cbbrowne/lisp.html
Appendium to the Rules of the Evil Overlord #1: "I will not build
excessively integrated security-and-HVAC systems. They may be Really
Cool, but are far too vulnerable to breakdowns."
At 04:49 PM 8/25/2005, Chris Browne wrote:
>rjpeace@earthlink.net (Ron) writes:
>> I'll generalize that. IMO we desperately need any and all
>> improvements in IO performance. Even more so than we need
>> improvements in sorting or sorting IO performance.
>
>That's frankly a step backwards. Feel free to "specialise" that
>instead.

We can agree to disagree, I'm cool with that. I'm well aware that a
systems approach to SW architecture is not always popular in the open
source world. Nonetheless, my POV is that if we want to be taken
seriously and beat "the big boys", we have to do everything smarter and
faster, as well as cheaper, than they do. You are not likely to be able
to do that consistently without using some of the "icky" stuff one is
required to study as part of formal training in the Comp Sci and SW
Engineering fields.

>A patch that improves some specific aspect of performance is a
>thousand times better than any sort of "desperate desire for any and
>all improvements in I/O performance."

Minor twisting of my words: substituting "desire" for "need". The need
is provable. Just put "the big 5" (SQL Server, Oracle, DB2, MySQL, and
PostgreSQL) into some realistic benchmarks to see that.

Major twisting of my words: the apparent implication by you that I
don't appreciate improvements in the IO behavior of specific things
like sorting as much as I'd appreciate more "general" IO performance
improvements.

Performance optimization is best done as an iterative improvement
process that starts with measuring where the need is greatest, then
improving that greatest need by the most you can, then repeating the
whole cycle. _Every_ improvement in such a process is a specific
improvement, even if the improvement is a decision to re-architect the
entire product to solve the current biggest issue. Improving sorting IO
is cool. OTOH, if pg's biggest IO problems are elsewhere, then the
overall benefit we will get from improving sorting IO is going to be
limited until we improve the bigger problem(s). Amdahl's Law.

>The "specialized patch" is also pointedly better in that a
>*confidently submitted* patch is likely to be way better than any sort
>of "desperate clutching at whatever may come to hand."

Another distortion of my statement and POV. I never suggested nor
implied any sort of "desperate clutching...". We have _measurable_ IO
issues that need to be addressed in order for pg to be a better
competitor in the marketplace. Just as we do with sorting performance.

>Far too often, I see people trying to address performance problems via
>the "desperate clutching at whatever seems near to hand," and that
>generally turns out very badly as a particular result of the whole
>"desperate clutching" part.
>
>If you can get a sort improvement submitted, that's a concrete
>improvement...

As I said, I'm all in favor of concrete, measurable improvement. I do
not think I ever stated I was in favor of anything else.
You evidently are mildly ranting because you've seen some examples of
poor SW engineering discipline/practice by people with perhaps
inadequate skills for the issues they were trying to address. We all
have. "90% of everything is dreck (i.e. of too low a quality)."

OTOH, I do not think I've given you any reason to think I lack such
clue, nor do I think my post was advocating such thrashing. My post was
intended to say that we need an overall systems approach to pg
optimization rather than just applying what compiler writers call
"peephole optimizations" to pg. No more, no less. I apologize if I
somehow misled you.

Ron Peacetree