Thread: Benchmark Data requested

Benchmark Data requested

From
Simon Riggs
Date:
Can I ask for some help with benchmarking?

There are some results here that show PostgreSQL is slower in some cases
than Monet and MySQL. Of course these results were published immediately
prior to 8.2 being released, plus run out-of-the-box, so without even
basic performance tuning.

Would anybody like to repeat these tests with the latest production
versions of these databases (i.e. with PGSQL 8.3), and with some
sensible tuning settings for the hardware used? It will be useful to get
some blind tests with more sensible settings.

http://monetdb.cwi.nl/projects/monetdb//SQL/Benchmark/TPCH/

Multiple runs from different people/different hardware are useful since
they help to iron out differences in hardware and test methodology. So
don't worry if you see somebody else doing this also.
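
As a purely illustrative sketch (sensible values depend entirely on the
hardware used; these are assumptions, not recommendations), the kind of
non-default postgresql.conf settings I mean would be along these lines:

  shared_buffers = 1GB              # rather than the tiny default
  work_mem = 256MB                  # TPC-H queries sort and hash heavily
  effective_cache_size = 6GB        # tell the planner about the OS cache
  checkpoint_segments = 64          # bulk loading generates a lot of WAL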

Thanks,

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Luke Lonergan
Date:
Hi Simon,

Note that MonetDB/X100 does not have a SQL optimizer; they ran raw
hand-coded plans.  As a consequence, these comparisons should be taken as an
"executor-executor" test and we/you should be sure that the PG planner has
generated the best possible plan.

That said, we've already done the comparisons internally and they've got a
good point to make about L2 cache use and removal of unnecessary
abstractions in the executor.  We've been aware of this since 2005/6 and
have been slowly working these ideas into our/PG executor.

Bottom line: it's a good thing to work to get close to the X100/Monet
executor with a more general purpose DB.  PG is a looong way from being
comparable, mostly due to poor L2 D-cache locality and I-cache thrashing in
the executor.  The only way to cure this is to work on more rows than one at
a time.

- Luke


On 2/4/08 10:37 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

> Can I ask for some help with benchmarking?
>
> There are some results here that show PostgreSQL is slower in some cases
> than Monet and MySQL. Of course these results were published immediately
> prior to 8.2 being released, plus run out-of-the-box, so without even
> basic performance tuning.
>
> Would anybody like to repeat these tests with the latest production
> versions of these databases (i.e. with PGSQL 8.3), and with some
> sensible tuning settings for the hardware used? It will be useful to get
> some blind tests with more sensible settings.
>
> http://monetdb.cwi.nl/projects/monetdb//SQL/Benchmark/TPCH/
>
> Multiple runs from different people/different hardware is useful since
> they help to iron-out differences in hardware and test methodology. So
> don't worry if you see somebody else doing this also.
>
> Thanks,


Re: Benchmark Data requested

From
"Claus Guttesen"
Date:
> There are some results here that show PostgreSQL is slower in some cases
> than Monet and MySQL. Of course these results were published immediately
> prior to 8.2 being released, plus run out-of-the-box, so without even
> basic performance tuning.
>
> Would anybody like to repeat these tests with the latest production
> versions of these databases (i.e. with PGSQL 8.3), and with some
> sensible tuning settings for the hardware used? It will be useful to get
> some blind tests with more sensible settings.
>
> http://monetdb.cwi.nl/projects/monetdb//SQL/Benchmark/TPCH/
>
> Multiple runs from different people/different hardware is useful since
> they help to iron-out differences in hardware and test methodology. So
> don't worry if you see somebody else doing this also.

Here is another graph: http://tweakers.net/reviews/649/7

Without monetdb though.

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentlest gamester is the soonest winner.

Shakespeare

Re: Benchmark Data requested

From
Simon Riggs
Date:
On Mon, 2008-02-04 at 10:47 -0800, Luke Lonergan wrote:

> Note that MonetDB/X100 does not have a SQL optimizer, they ran raw
> hand-coded plans.  As a consequence, these comparisons should be taken as an
> "executor-executor" test and we/you should be sure that the PG planner has
> generated the best possible plan.

If it doesn't then I'd regard that as a performance issue in itself.

> That said, we've already done the comparisons internally and they've got a
> good point to make about L2 cache use and removal of unnecessary
> abstractions in the executor.  We've been aware of this since 2005/6 and
> have been slowly working these ideas into our/PG executor.
>
> Bottom line: it's a good thing to work to get close to the X100/Monet
> executor with a more general purpose DB.  PG is a looong way from being
> comparable, mostly due to poor L2 D-cache locality and I-cache thrashing in
> the executor.

You may be right, but I want to see where it hurts us the most.

> The only way to cure this is to work on more rows than one at a time.

Do you have any results to show that's true, or are you just referring
to the Cray paper? (Which used fixed length tuples and specific vector
hardware).

(With regard to benchmarks, I'd rather not download Monet at all. Helps
avoid legal issues around did-you-look-at-the-code questions.)

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Greg Smith
Date:
On Mon, 4 Feb 2008, Simon Riggs wrote:

> Would anybody like to repeat these tests with the latest production
> versions of these databases (i.e. with PGSQL 8.3)

Do you have any suggestions on how people should run TPC-H?  It looked
like a bit of work to sort through how to even start this exercise.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested

From
Simon Riggs
Date:
On Mon, 2008-02-04 at 15:09 -0500, Greg Smith wrote:
> On Mon, 4 Feb 2008, Simon Riggs wrote:
>
> > Would anybody like to repeat these tests with the latest production
> > versions of these databases (i.e. with PGSQL 8.3)
>
> Do you have any suggestions on how people should run TPC-H?  It looked
> like a bit of work to sort through how to even start this exercise.

The link referred to a few different scale factors, so you could try
those. But anything that uses the hardware you have to its full
potential is valuable.

Everybody's test method is going to be different, whatever I say...

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Luke Lonergan
Date:
Hi Simon,

On 2/4/08 11:07 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

>> "executor-executor" test and we/you should be sure that the PG planner has
>> generated the best possible plan.
>
> If it doesn't then I'd regard that as a performance issue in itself.

Agreed, though that's two problems to investigate - I think the Monet/X100
stuff is clean in that it's a pure executor test.

> You maybe right, but I want to see where it hurts us the most.

You'll see :-)

>> The only way to cure this is to work on more rows than one at a time.
>
> Do you have any results to show that's true, or are you just referring
> to the Cray paper? (Which used fixed length tuples and specific vector
> hardware).

No paper referenced, just inference from the results and their (and others')
conclusions about locality and re-use.  It's a similar enough situation to
scientific programming with vector processors versus cache-based superscalar
processors that these are the right conclusions.  We've done the profiling to
look at cache misses and have some data to back it up as well.

> (With regard to benchmarks, I'd rather not download Monet at all. Helps
> avoid legal issues around did-you-look-at-the-code questions.)

None of us have looked at the code or downloaded it.  There are a number of
presentations out there for Monet/X100 to see what their results are.

- Luke


Re: Benchmark Data requested

From
Luke Lonergan
Date:
Hi Greg,

On 2/4/08 12:09 PM, "Greg Smith" <gsmith@gregsmith.com> wrote:

> Do you have any suggestions on how people should run TPC-H?  It looked
> like a bit of work to sort through how to even start this exercise.

To run "TPC-H" requires a license to publish, etc.

However, I think you can use their published data and query generation kit
to run the queries, which aren't the benchmark per se.  That's what the
Monet/X100 people did.
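
If memory serves, the kit's dbgen program writes pipe-delimited .tbl files,
so loading them is roughly the following sketch (table and path names are
illustrative, and the trailing '|' that dbgen puts on each line may need to
be stripped before COPY will accept the file):

  COPY lineitem FROM '/data/tpch/lineitem.tbl' WITH DELIMITER '|';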

- Luke


Re: Benchmark Data requested

From
"Jignesh K. Shah"
Date:
Hi Simon,

I have some insight into TPC-H and how it works.

First of all, I think it is a violation of TPC rules to publish numbers
without auditing them first.  So even if I do the test to show the
better performance of PostgreSQL 8.3, I cannot post it here or on any
public forum without going through the "process". (Even though it is a
partial benchmark, as they are just doing the equivalent of the
PowerRun of TPC-H.) Maybe the PostgreSQL PR team should email
info@tpc.org about them and see what they have to say about that comparison.

On the technical side:

Remember that all the TPC-H queries, when run sequentially on PostgreSQL, use
only one core or virtual CPU, so it is a very poor use of a multi-core
system (same for MySQL too).

Also very important: unless you are running the UPDATE FUNCTIONS, which
are separate queries, all of the Q1-Q22 queries are pure "READ-ONLY"
queries. Traditionally I think PostgreSQL lacks "READ-SPEED", especially
since it is bottlenecked by the size of the reads it does (BLOCKSIZE).
Major databases provide multi-block parameters to do reads/writes in
multiples of the blocksize to reduce IOPS, and for read-only workloads
they also have READ-AHEAD or prefetch sizes, generally bigger than the
multi-block or extent sizes, to aid reads.

The scale factor is in terms of gigabytes, and hence using a maximum scale
of 5 (5GB) is pretty useless since most of the rows can be cached in
modern-day systems. And comparing with 0.01 is what, 10MB? The L2 cache
of a recent Intel CPU is probably bigger than that.

If you are doing tuning for the TPC-H queries, focus on a few of them.
For example, Query 1 is very join intensive, and if your CPU is not 100%
used then you have an IO problem to solve before tuning it.

Another example is Query 16, which is literally IO scan speed; many people
use it to see whether the database can scan at the "line speed" of the
storage. Ending up at 100% CPU means the database cannot process rows that
fast (just to bring them in).

In essence, each query exercises some combination of system features to
highlight performance. However, since it is an old benchmark, database
companies end up "re-engineering" their technologies to gain an advantage
in this benchmark (hence its successor, TPC-DS, which will have more than
100 such queries, is in the works).

A few of the technologies that have really helped gain ground in the TPC-H
world:
* Hash and/or range partitioning of tables (PostgreSQL 8.3 can do that,
but the setup cost of writing the schema is high, especially since data has
to be loaded into separate tables)
* Automated aggregated views - used by optimizers - database technology
to maintain the more frequently used aggregations in smaller views
* Cube view indexes - like bitmap but multidimensional (I think... but not
sure)

That said, is it useful for regression testing in the PostgreSQL build
farms?  I would think yes... especially Q16.

Hope this helps.
Regards,
Jignesh




Simon Riggs wrote:
> Can I ask for some help with benchmarking?
>
> There are some results here that show PostgreSQL is slower in some cases
> than Monet and MySQL. Of course these results were published immediately
> prior to 8.2 being released, plus run out-of-the-box, so without even
> basic performance tuning.
>
> Would anybody like to repeat these tests with the latest production
> versions of these databases (i.e. with PGSQL 8.3), and with some
> sensible tuning settings for the hardware used? It will be useful to get
> some blind tests with more sensible settings.
>
> http://monetdb.cwi.nl/projects/monetdb//SQL/Benchmark/TPCH/
>
> Multiple runs from different people/different hardware is useful since
> they help to iron-out differences in hardware and test methodology. So
> don't worry if you see somebody else doing this also.
>
> Thanks,
>
>

Re: Benchmark Data requested

From
Greg Smith
Date:
On Mon, 4 Feb 2008, Luke Lonergan wrote:

> However, I think you can use their published data and query generation kit
> to run the queries, which aren't the benchmark per-se.  That's what the
> Monet/X100 people did.

Right; I was just hoping someone might suggest some relatively
standardized way to do that via PostgreSQL.  I read Simon's original note
and was afraid that multiple people might end up duplicating some
non-trivial amount of work just to get the kits set up and running, or get
frustrated not expecting that part and just give up on the whole idea.

I'm very interested in this particular topic (non-trivial database
micro-benchmarks) but have no time to spare this week to hack on this one
myself.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested

From
"Jignesh K. Shah"
Date:
Doing it at low scales is not attractive.

Commercial databases are publishing at scale factors of 1000 (about 1TB)
to 10000 (10TB), with one at 30TB. So ideally, right now, tuning
should start at scale factor 1000.

Unfortunately I have tried that before with PostgreSQL; a few of the
problems are as follows:

Single stream loader of PostgreSQL takes hours to load data. (Single
stream load... wasting all the extra cores out there)

Multiple table loads (1 per table) spawned via script are a bit better,
but hit WAL problems.

To avoid the WAL problem, I created the tables and ran the load statements
within the same transaction: faster, but you cannot create an index before
the load or it starts writing to WAL... AND if the indexes are created after
the load, it takes about a day or so to create all the indexes required.
(It's single threaded, and creating multiple indexes at the same time could
result in running out of temporary "DISK" space since the tables are so
big.) Which means one thread at a time is the answer for tables that are
really big. It is slow.
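
(For reference, the create-and-load-in-one-transaction pattern is roughly
the following sketch; the table, columns and file path are illustrative,
and the COPY-without-WAL optimization only applies when WAL archiving is
not enabled:)

  BEGIN;
  CREATE TABLE lineitem_demo (        -- cut-down illustrative table
      l_orderkey  bigint,
      l_quantity  numeric,
      l_shipdate  date
  );
  -- COPY into a table created in the same transaction can skip WAL
  COPY lineitem_demo FROM '/data/lineitem_demo.dat' WITH DELIMITER '|';
  COMMIT;
  -- indexes are built afterwards, one at a time (index build is single-threaded)
  CREATE INDEX i_ld_orderkey ON lineitem_demo (l_orderkey);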

Boy, by this time most people doing TPC-H at the high end give up on
PostgreSQL.

I have not even started partitioning the tables yet, since with the
current framework you have to load the data separately into each child
table, which means for the TPC-H data you need "extra logic" to take
the table data and split it into each partition child table. Not stuff
that many people want to do by hand.
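
(The "extra logic" would look roughly like this sketch, assuming a parent
table orders and a staging table orders_staging already exist; the names
and date ranges are illustrative only:)

  -- one child per partition, with a CHECK constraint (8.3 inheritance style)
  CREATE TABLE orders_1992 (
      CHECK (o_orderdate >= DATE '1992-01-01'
         AND o_orderdate <  DATE '1993-01-01')
  ) INHERITS (orders);

  -- dbgen gives one flat file per table, so each child has to be filled
  -- separately, e.g. by staging the whole file and then splitting it:
  COPY orders_staging FROM '/data/orders.tbl' WITH DELIMITER '|';
  INSERT INTO orders_1992
      SELECT * FROM orders_staging
      WHERE o_orderdate >= DATE '1992-01-01'
        AND o_orderdate <  DATE '1993-01-01';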

Then the power run, which is essentially running one query at a time,
should be able to utilize the full system (especially on multi-core
systems); unfortunately PostgreSQL can use only one core. (Plus, since this
is read-only and there is no separate disk reader, all other processes are
idle.) So the system is running at 1/Nth capacity (where N is the number
of cores/threads).

(I am not sure here: with partitioned tables, do you get N processes
running in the system when you scan the partitioned table?)

Even off-loading work like "fetching the data into the buffer pool" into
separate processes would pay off big time with this type of workload.

I would be happy to help out if folks here want to do work related to
it. In fact, if you have time, I can request a project in one of the Sun
benchmarking centers to see what we can learn with community members
interested in understanding where PostgreSQL performs well and where it fails.

Regards,
Jignesh

Greg Smith wrote:
> On Mon, 4 Feb 2008, Simon Riggs wrote:
>
>> Would anybody like to repeat these tests with the latest production
>> versions of these databases (i.e. with PGSQL 8.3)
>
> Do you have any suggestions on how people should run TPC-H?  It looked
> like a bit of work to sort through how to even start this exercise.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested

From
Greg Smith
Date:
On Mon, 4 Feb 2008, Jignesh K. Shah wrote:

> Doing it at low scales is not attractive.  Commercial databases are
> publishing at scale factor of 1000(about 1TB) to 10000(10TB) with one in
> 30TB space. So ideally right now tuning should start at 1000 scale
> factor.

I think what Simon was trying to get at is some sort of debunking of
Monet's benchmarks which were running in-memory while not giving
PostgreSQL any real memory to work with.  What you're talking about is a
completely separate discussion which is well worth having in addition to
that.

I'm well aware of how painful it is to generate+load+index even a single TB
worth of data with PostgreSQL right now, because I've been doing just that
for weeks now (there are two processing phases in there as well for me that
take even longer, but the raw operations are still a significant portion
of the total time).

> I would be happy to help out if folks here want to do work related to
> it. Infact if you have time, I can request a project in one of the Sun
> Benchmarking center to see what we can learn with community members
> interested in understanding where PostgreSQL performs and fails.

Sounds like a good 8.4 project.  Maybe pick this topic back up at the East
convention next month; we could talk about it then.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested

From
Gregory Stark
Date:
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:

> Then for the power run that is essentially running one query at a time should
> essentially be able to utilize the full system (specially multi-core systems),
> unfortunately PostgreSQL can use only one core. (Plus since this is read only
> and there is no separate disk reader all other processes are idle) and system
> is running at 1/Nth capacity (where N is the number of cores/threads)

Is the whole benchmark like this or is this just one part of it?

Is the i/o system really able to saturate the cpu though?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: Benchmark Data requested

From
Gregory Stark
Date:
"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:

> Also very important unless you are running the UPDATE FUNCTIONS which are
> separate queries, all these Q1-Q22 Queries are pure "READ-ONLY" queries.
> Traditionally I think PostgreSQL does lack "READ-SPEED"s specially since it is
> bottlenecked by the size of the reads it does (BLOCKSIZE). Major database
> provides multi-block parameters to do multiple of reads/writes in terms of
> blocksizes to reduce IOPS and also for read only they also have READ-AHEAD or
> prefetch sizes which is generally bigger than multi-block or extent sizes to
> aid reads.

Note that all of these things are necessitated by those databases using direct
i/o of some form or another. The theory is that PostgreSQL doesn't have to
worry about these things because the kernel is taking care of it.

How true that is is a matter of some debate and it varies from OS to OS. But
it's definitely true that the OS will do read-ahead for sequential reads, for
example.

Incidentally we found some cases that Solaris was particularly bad at. Is
there anybody in particular that would be interested in hearing about them?
(Not meant to be a knock on Solaris, I'm sure there are other cases Linux or
BSD handle poorly too)


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: Benchmark Data requested

From
"Jignesh K. Shah"
Date:
TPC-H has two runs:
the PowerRun, which is a single stream (Q1-Q22, RF1, RF2),
and the Throughput Run, which has "N" streams (depending on scale) running
simultaneously in a mixed sequence of the same queries and the two
update functions. During the throughput run you can expect to max out the
CPUs... But commercial databases generally have PowerRuns running quite
well even on multi-core systems (Oracle, without RAC, has published with
144 cores on Solaris).

As for the IO system saturating the CPU, it is twofold:
the kernel fetching in the data, which saturates at some value,
and in this case PostgreSQL reading the data and putting it in its
buffer pool.

An example of how I use it is as follows:
do a SELECT query on a table such that it results in a table scan without
actually returning any rows back, and keep throwing hardware (better
storage) at it till it saturates the CPU. That's the practical maximum you
can get out of the CPU/OS combination (assuming unlimited storage
bandwidth). This is primarily used for guessing how fast one of the
queries in TPC-H will complete.
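
(Concretely, something along these lines; just a sketch against the TPC-H
lineitem table, where l_quantity is never negative so no rows come back:)

  \timing
  -- forces a full sequential scan but returns zero rows, so the elapsed
  -- time is essentially scan plus qualification cost
  SELECT * FROM lineitem WHERE l_quantity < 0;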

In my tests with PostgreSQL, I generally reach the CPU limit without
even reaching the bandwidth of the underlying storage.
Just to give numbers:
a single 2Gb Fibre Channel port can practically go up to 180 MB/sec,
and single 4Gb ports have proven to go up to 360-370 MB/sec.
So to saturate a FC port, PostgreSQL has to be able to scan at 370 MB/sec
without saturating the CPU.
Then comes software striping, which allows multiple ports to be striped
over, increasing the available bandwidth... Now scanning has to be
able to drive Nx370 MB/sec (all on a single core).

I had some numbers, and some limits based on CPU frequency, blocksize,
etc., but those were from the 8.1 days or so.

I think to take PostgreSQL a bit more high-end, we have to first scale out
these numbers.
Doing some sort of test in the PostgreSQL build farms for every release
would actually help people see the amount of data that it can drive through...

We can also work on some database operation metrics to gauge
how much each release is improving over older releases. I have ideas
for a few of them.

Regards,
Jignesh


Gregory Stark wrote:
> "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
>
>
>> Then for the power run that is essentially running one query at a time should
>> essentially be able to utilize the full system (specially multi-core systems),
>> unfortunately PostgreSQL can use only one core. (Plus since this is read only
>> and there is no separate disk reader all other processes are idle) and system
>> is running at 1/Nth capacity (where N is the number of cores/threads)
>>
>
> Is the whole benchmark like this or is this just one part of it?
>
> Is the i/o system really able to saturate the cpu though?
>
>

Re: Benchmark Data requested

From
"Jignesh K. Shah"
Date:

Gregory Stark wrote:
> Incidentally we found some cases that Solaris was particularly bad at. Is
> there anybody in particular that would be interested in hearing about them?
> (Not meant to be a knock on Solaris, I'm sure there are other cases Linux or
> BSD handle poorly too)
>
>
>

Send me the details; I can file bugs for Solaris on behalf of the
community. Since I am involved in a lot of PostgreSQL testing on Solaris
this year, I have a small list myself (mostly related to file system
stuff though).

I know of one regarding bonnie rewriting blocks that you sent out. (I still
haven't done anything about it yet, but I finally have some test machines
for such work instead of using my workstation to test it out. :-)

But I am really interested in seeing which ones hit PostgreSQL
performance/usability.

Thanks in advance.

Regards,
Jignesh


Re: Benchmark Data requested

From
Simon Riggs
Date:
On Mon, 2008-02-04 at 17:33 -0500, Jignesh K. Shah wrote:

> First of all I think it is a violation of TPC rules to publish numbers
> without auditing them first.  So even if I do the test to show the
> better performance of PostgreSQL 8.3, I cannot post it here or any
> public forum without doing going through the "process".

I'm not interested in the final results, pricing, etc., just query-by-query
elapsed times.

Can you show which part of the rules precludes this? I can't find it.

This is a developer list, so "publishing" things here is what we do for
discussion; it's hardly breaking the spirit of the TPC rules to
publish results here in the hope of focusing development effort.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Simon Riggs
Date:
On Mon, 2008-02-04 at 17:55 -0500, Jignesh K. Shah wrote:
> Doing it at low scales is not attractive.
>
> Commercial databases are publishing at scale factor of 1000(about 1TB)
> to 10000(10TB) with one in 30TB space. So ideally right now tuning
> should start at 1000 scale factor.

I don't understand this. Sun is currently publishing results at 100GB,
300GB, etc. Why would we ignore those and go for much higher numbers,
especially when you explain why we wouldn't be able to? There isn't any
currently valid result above 10TB.

If anybody is going to run tests in response to my request, then *any*
scale factor is interesting, on any hardware. If that means Scale Factor
1, 3, 10 or 30 then that's fine by me.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Dimitri Fontaine
Date:
Hi,

On Monday 04 February 2008, Jignesh K. Shah wrote:
> Single stream loader of PostgreSQL takes hours to load data. (Single
> stream load... wasting all the extra cores out there)

I wanted to work on this at the pgloader level, so the CVS version of pgloader
is now able to load data in parallel, with a Python thread per configured
section (1 section = 1 data file = 1 table is often the case).
It's not configurable at the moment, but I plan on providing a "threads" knob
which will default to 1, and could be -1 for "as many threads as sections".

> Multiple table loads ( 1 per table) spawned via script  is bit better
> but hits wal problems.

pgloader will hit the WAL problem too, but it still may have its benefits;
at least we will soon (you already can if you take it from CVS) be able to
measure whether parallel loading at the client side is a good idea perf-wise.

[...]
> I have not even started Partitioning of tables yet since with the
> current framework, you have to load the tables separately into each
> tables which means for the TPC-H data you need "extra-logic" to take
> that table data and split it into each partition child table. Not stuff
> that many people want to do by hand.

I'm planning to add ddl-partitioning support to pgloader:
  http://archives.postgresql.org/pgsql-hackers/2007-12/msg00460.php

The basic idea is for pgloader to ask PostgreSQL about constraint_exclusion,
pg_inherits and pg_constraint, and if pgloader recognizes both the CHECK
expression and the datatypes involved, and we can implement the CHECK in
Python without having to resort to querying PostgreSQL, then we can run a
thread per partition, with as many COPY FROM commands running in parallel as
there are partitions involved (when threads = -1).
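
(The catalog lookup I have in mind is roughly this sketch, with 'orders'
standing in for whatever parent table the configuration names:)

  SHOW constraint_exclusion;
  -- list the child partitions of a parent table with their CHECK expressions
  SELECT c.relname AS partition,
         pg_get_constraintdef(con.oid) AS check_clause
    FROM pg_inherits i
    JOIN pg_class c        ON c.oid = i.inhrelid
    JOIN pg_constraint con ON con.conrelid = c.oid AND con.contype = 'c'
   WHERE i.inhparent = 'orders'::regclass;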

I'm not sure this will be quicker than relying on the PostgreSQL triggers or
rules currently used for partitioning, but ISTM the paragraph quoted from
Jignesh is just about that.

Comments?
--
dim


Re: Benchmark Data requested

From
Simon Riggs
Date:
On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
> Hi,
>
> On Monday 04 February 2008, Jignesh K. Shah wrote:
> > Single stream loader of PostgreSQL takes hours to load data. (Single
> > stream load... wasting all the extra cores out there)
>
> I wanted to work on this at the pgloader level, so CVS version of pgloader is
> now able to load data in parallel, with a python thread per configured
> section (1 section = 1 data file = 1 table is often the case).
> Not configurable at the moment, but I plan on providing a "threads" knob which
> will default to 1, and could be -1 for "as many thread as sections".

That sounds great. I was just thinking of asking for that :-)

I'll look at COPY FROM internals to make this faster. I'm looking at
this now to refresh my memory; I already had some plans on the shelf.

> > Multiple table loads ( 1 per table) spawned via script  is bit better
> > but hits wal problems.
>
> pgloader will too hit the WAL problem, but it still may have its benefits, or
> at least we will soon (you can already if you take it from CVS) be able to
> measure if the parallel loading at the client side is a good idea perf. wise.

Should be able to reduce lock contention, but not overall WAL volume.

> [...]
> > I have not even started Partitioning of tables yet since with the
> > current framework, you have to load the tables separately into each
> > tables which means for the TPC-H data you need "extra-logic" to take
> > that table data and split it into each partition child table. Not stuff
> > that many people want to do by hand.
>
> I'm planning to add ddl-partitioning support to pgloader:
>   http://archives.postgresql.org/pgsql-hackers/2007-12/msg00460.php
>
> The basic idea is for pgloader to ask PostgreSQL about constraint_exclusion,
> pg_inherits and pg_constraint and if pgloader recognize both the CHECK
> expression and the datatypes involved, and if we can implement the CHECK in
> python without having to resort to querying PostgreSQL, then we can run a
> thread per partition, with as many COPY FROM running in parallel as there are
> partition involved (when threads = -1).
>
> I'm not sure this will be quicker than relying on PostgreSQL trigger or rules
> as used for partitioning currently, but ISTM Jignesh quoted § is just about
> that.

Much better than triggers and rules, but it will be hard to get it to
work.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Matthew
Date:

Re: Benchmark Data requested

From
Richard Huxton
Date:
Simon Riggs wrote:
> On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
>>
>> On Monday 04 February 2008, Jignesh K. Shah wrote:

>>> Multiple table loads ( 1 per table) spawned via script  is bit better
>>> but hits wal problems.
>> pgloader will too hit the WAL problem, but it still may have its benefits, or
>> at least we will soon (you can already if you take it from CVS) be able to
>> measure if the parallel loading at the client side is a good idea perf. wise.
>
> Should be able to reduce lock contention, but not overall WAL volume.

In the case of a bulk upload to an empty table (or partition?) could you
not optimise the WAL away? That is, shouldn't the WAL basically be a
simple transformation of the on-disk blocks? You'd have to explicitly
sync the file(s) for the table/indexes of course, and you'd need some
work-around for WAL shipping, but it might be worth it for you chaps
with large imports.

--
   Richard Huxton
   Archonet Ltd

Re: Benchmark Data requested

From
Matthew
Date:
Apologies for the blank email - mailer problems. I lost all my nicely
typed stuff, too.

On Tue, 5 Feb 2008, Dimitri Fontaine wrote:
>> Multiple table loads ( 1 per table) spawned via script  is bit better
>> but hits wal problems.
>
> pgloader will too hit the WAL problem, but it still may have its benefits, or
> at least we will soon (you can already if you take it from CVS) be able to
> measure if the parallel loading at the client side is a good idea perf. wise.

You'll have to be careful here. Depending on the filesystem, writing large
amounts of data to two files simultaneously can result in the blocks
being interleaved to some degree on the disc, which can cause performance
problems later on.

As for the WAL, I have a suggestion, but I'm aware that I don't know how
PG actually does it, so you'll have to tell me if it is valid.

My impression is that the WAL is used to store writes in a transactional
manner, for writes that can't be written in a transactional manner
directly to the data files. Hence the suggestion for restoring database
dumps to run the whole restore in one transaction, which means that the
table creation is in the same transaction as loading the data into it.
Since the table is not visible to other backends, the writes to it do not
need to go to the WAL, and PG is clever enough to do this.

My suggestion is to extend that slightly. If there is a large chunk of
data to be written to a table, which will go entirely to empty pages or
be appended to the end of the data file, then there is no risk of corruption
of existing data, and that write could be made directly to the table. You
would have to write a WAL entry reserving the space in the data file, and
then write the data to the file. Then when that WAL entry is checkpointed,
no work would be required.

This would improve the performance of database restores and large writes
which expand the table's data file. So, would it work?

Matthew

--
If pro is the opposite of con, what is the opposite of progress?

Re: Benchmark Data requested

From
Matthew
Date:
On Tue, 5 Feb 2008, Richard Huxton wrote:
> In the case of a bulk upload to an empty table (or partition?) could you not
> optimise the WAL away?

Argh. If I hadn't had to retype my email, I would have suggested that
before you.

;)

Matthew

--
Unfortunately, university regulations probably prohibit me from eating
small children in front of the lecture class.
                                        -- Computer Science Lecturer

Re: Benchmark Data requested

From
Simon Riggs
Date:
On Tue, 2008-02-05 at 14:43 +0000, Richard Huxton wrote:
> Simon Riggs wrote:
> > On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
> >>
> >> On Monday 04 February 2008, Jignesh K. Shah wrote:
>
> >>> Multiple table loads ( 1 per table) spawned via script  is bit better
> >>> but hits wal problems.
> >> pgloader will too hit the WAL problem, but it still may have its benefits, or
> >> at least we will soon (you can already if you take it from CVS) be able to
> >> measure if the parallel loading at the client side is a good idea perf. wise.
> >
> > Should be able to reduce lock contention, but not overall WAL volume.
>
> In the case of a bulk upload to an empty table (or partition?) could you
> not optimise the WAL away? That is, shouldn't the WAL basically be a
> simple transformation of the on-disk blocks? You'd have to explicitly
> sync the file(s) for the table/indexes of course, and you'd need some
> work-around for WAL shipping, but it might be worth it for you chaps
> with large imports.

Only by locking the table, which serializes access, which then slows you
down or at least restricts other options. Plus if you use pg_loader then
you'll find only the first few rows optimized and all the rest not.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Richard Huxton
Date:
Simon Riggs wrote:
> On Tue, 2008-02-05 at 14:43 +0000, Richard Huxton wrote:
>> Simon Riggs wrote:
>>> On Tue, 2008-02-05 at 15:06 +0100, Dimitri Fontaine wrote:
>>>> On Monday 04 February 2008, Jignesh K. Shah wrote:
>>>>> Multiple table loads ( 1 per table) spawned via script  is bit better
>>>>> but hits wal problems.
>>>> pgloader will too hit the WAL problem, but it still may have its benefits, or
>>>> at least we will soon (you can already if you take it from CVS) be able to
>>>> measure if the parallel loading at the client side is a good idea perf. wise.
>>> Should be able to reduce lock contention, but not overall WAL volume.
>> In the case of a bulk upload to an empty table (or partition?) could you
>> not optimise the WAL away? That is, shouldn't the WAL basically be a
>> simple transformation of the on-disk blocks? You'd have to explicitly
>> sync the file(s) for the table/indexes of course, and you'd need some
>> work-around for WAL shipping, but it might be worth it for you chaps
>> with large imports.
>
> Only by locking the table, which serializes access, which then slows you
> down or at least restricts other options. Plus if you use pg_loader then
> you'll find only the first few rows optimized and all the rest not.

Hmm - the table-locking requirement is true enough, but why would
pg_loader cause problems after the first few rows?

--
   Richard Huxton
   Archonet Ltd

Re: Benchmark Data requested

From
Simon Riggs
Date:
On Tue, 2008-02-05 at 15:05 +0000, Richard Huxton wrote:

> > Only by locking the table, which serializes access, which then slows you
> > down or at least restricts other options. Plus if you use pg_loader then
> > you'll find only the first few rows optimized and all the rest not.
>
> Hmm - the table-locking requirement is true enough, but why would
> pg_loader cause problems after the first few rows?

It runs a stream of COPY statements, so only the first would be optimized
with the "empty table optimization".

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Matthew
Date:
On Tue, 5 Feb 2008, Simon Riggs wrote:
>> In the case of a bulk upload to an empty table (or partition?) could you
>> not optimise the WAL away? That is, shouldn't the WAL basically be a
>> simple transformation of the on-disk blocks? You'd have to explicitly
>> sync the file(s) for the table/indexes of course, and you'd need some
>> work-around for WAL shipping, but it might be worth it for you chaps
>> with large imports.
>
> Only by locking the table, which serializes access, which then slows you
> down or at least restricts other options. Plus if you use pg_loader then
> you'll find only the first few rows optimized and all the rest not.

Why would you need to lock the table?

Matthew

--
Picard: I was just paid a visit from Q.
Riker:  Q! Any idea what he's up to?
Picard: No. He said he wanted to be "nice" to me.
Riker:  I'll alert the crew.

Re: Benchmark Data requested

From
Richard Huxton
Date:
Matthew wrote:
> On Tue, 5 Feb 2008, Simon Riggs wrote:
>>> In the case of a bulk upload to an empty table (or partition?) could you
>>> not optimise the WAL away? That is, shouldn't the WAL basically be a
>>> simple transformation of the on-disk blocks? You'd have to explicitly
>>> sync the file(s) for the table/indexes of course, and you'd need some
>>> work-around for WAL shipping, but it might be worth it for you chaps
>>> with large imports.
>>
>> Only by locking the table, which serializes access, which then slows you
>> down or at least restricts other options. Plus if you use pg_loader then
>> you'll find only the first few rows optimized and all the rest not.
>
> Why would you need to lock the table?

Because you're not really writing the WAL, which means you can't let
anyone else get their data into any of the blocks you are writing into.
You'd basically want to write the disk blocks then "attach" them in some
way.

--
   Richard Huxton
   Archonet Ltd

Re: Benchmark Data requested

From
Matthew
Date:
On Tue, 5 Feb 2008, Richard Huxton wrote:
>> Why would you need to lock the table?
>
> Because you're not really writing the WAL, which means you can't let anyone
> else get their data into any of the blocks you are writing into. You'd
> basically want to write the disk blocks then "attach" them in some way.

So what's wrong with "reserving" the space using the WAL? Then everyone
else will know. After all, when you write the data to the WAL, you must
have an idea of where it is meant to end up. My suggestion is that you go
through all the motions of writing the data to the WAL, just without the
data bit.

Matthew

--
Failure is not an option. It comes bundled with your Microsoft product.
                                                 -- Ferenc Mantfeld

Re: Benchmark Data requested

From
Richard Huxton
Date:
Simon Riggs wrote:
> On Tue, 2008-02-05 at 15:05 +0000, Richard Huxton wrote:
>
>>> Only by locking the table, which serializes access, which then slows you
>>> down or at least restricts other options. Plus if you use pg_loader then
>>> you'll find only the first few rows optimized and all the rest not.
>> Hmm - the table-locking requirement is true enough, but why would
>> pg_loader cause problems after the first few rows?
>
> It runs a stream of COPY statements, so only first would be optimized
> with the "empty table optimization".

Ah, if you're allowing multiple commands during the process I can see
how it could get fiddly.

--
   Richard Huxton
   Archonet Ltd

Re: Benchmark Data requested

From
Richard Huxton
Date:
Matthew wrote:
> On Tue, 5 Feb 2008, Richard Huxton wrote:
>>> Why would you need to lock the table?
>>
>> Because you're not really writing the WAL, which means you can't let
>> anyone else get their data into any of the blocks you are writing
>> into. You'd basically want to write the disk blocks then "attach" them
>> in some way.
>
> So what's wrong with "reserving" the space using the WAL, then everyone
> else will know. After all, when you write the data to the WAL, you must
> have an idea of where it is meant to end up. My suggestion is that you
> go through all the motions of writing the data to the WAL, just without
> the data bit.

Well, now you're looking at page-level locking for the data blocks, or
at least something very similar. Not sure what you'd do with indexes
though - don't see a simple way of avoiding a large lock on a btree index.

If you reserved the space in advance that could work. But you don't know
how much to reserve until you've copied it in.

You could of course have a set of co-operating processes all
bulk-loading while maintaining a table-lock outside of those. It
feels like things are getting complicated then though.

--
   Richard Huxton
   Archonet Ltd

Re: Benchmark Data requested

From
"Jignesh K. Shah"
Date:
One of the problems with the "empty table optimization" is that if indexes
have been created then the table is no longer considered empty.

Commercial databases have options like an "IRRECOVERABLE" clause, along
with DISK PARTITIONS and CPU partitions, for their bulk loaders.

So one option turns off logging; disk partitions create multiple
processes to read various lines/blocks from the input file and other
processes to flush the buffer pools to disk; and CPU partitions process
the various blocks/lines read, checking their format and putting the rows
into the buffer pool if successful.

Regards,
Jignesh

Simon Riggs wrote:
> On Tue, 2008-02-05 at 15:05 +0000, Richard Huxton wrote:
>
>
>>> Only by locking the table, which serializes access, which then slows you
>>> down or at least restricts other options. Plus if you use pg_loader then
>>> you'll find only the first few rows optimized and all the rest not.
>>>
>> Hmm - the table-locking requirement is true enough, but why would
>> pg_loader cause problems after the first few rows?
>>
>
> It runs a stream of COPY statements, so only first would be optimized
> with the "empty table optimization".
>
>

Re: Benchmark Data requested

From
Matthew
Date:
On Tue, 5 Feb 2008, Richard Huxton wrote:
>> So what's wrong with "reserving" the space using the WAL, then everyone
>> else will know. After all, when you write the data to the WAL, you must
>> have an idea of where it is meant to end up. My suggestion is that you go
>> through all the motions of writing the data to the WAL, just without the
>> data bit.
>
> Well, now you're looking at page-level locking for the data blocks, or at
> least something very similar. Not sure what you'd do with indexes though -
> don't see a simple way of avoiding a large lock on a btree index.

Yeah, indexes would be a lot more difficult I guess, if writes to them
involve changing lots of stuff around. We do most of our loads without the
indexes present though.

> If you reserved the space in advance that could work. But you don't know how
> much to reserve until you've copied it in.

What does the WAL do? When do you allocate space in the file for written
rows? Is it when you write the WAL, or when you checkpoint it? If it's
when you write the WAL, then you can just use the same algorithm.

> You could of course have a set of co-operating processes all bulk-loading
> while maintaining a table-lock outside of the those. It feels like things are
> getting complicated then though.

That does sound a bit evil.

You could have different backends, each running a single transaction where
they create one table and load the data for it. That wouldn't need any
change to the backend, but it would only work for dump restores, and would
require the client to be clever. I'm all for allowing this kind of
optimisation while writing normally to the database, and for not requiring
the client to think too hard.

Matthew

--
All of this sounds mildly turgid and messy and confusing... but what the
heck. That's what programming's all about, really
                                        -- Computer Science Lecturer

Re: Benchmark Data requested

From
Dimitri Fontaine
Date:
On Tuesday 05 February 2008, Simon Riggs wrote:
> It runs a stream of COPY statements, so only first would be optimized
> with the "empty table optimization".

The number of rows per COPY statement is configurable, so provided you have
an estimate of the volume to import (wc -l), you could tweak this number to
shorten the stream (down to a single COPY, maybe)...

But basically each COPY batch has to be kept in memory (and we're talking
about high volumes here), and for error processing you'd want it not that
huge after all...

--
dim


Re: Benchmark Data requested

From
Dimitri Fontaine
Date:
On Tuesday 05 February 2008, Simon Riggs wrote:
> I'll look at COPY FROM internals to make this faster. I'm looking at
> this now to refresh my memory; I already had some plans on the shelf.

Maybe stealing some ideas from pg_bulkload could somewhat help here?
  http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf

IIRC it's mainly about how to optimize index updating while loading data, and
I've heard complaints along the lines of "this external tool has to know too
much about PostgreSQL internals to be trustworthy as non-core code"... so...

> > The basic idea is for pgloader to ask PostgreSQL about
> > constraint_exclusion, pg_inherits and pg_constraint and if pgloader
> > recognize both the CHECK expression and the datatypes involved, and if we
> > can implement the CHECK in python without having to resort to querying
> > PostgreSQL, then we can run a thread per partition, with as many COPY
> > FROM running in parallel as there are partition involved (when threads =
> > -1).
> >
> > I'm not sure this will be quicker than relying on PostgreSQL trigger or
> > rules as used for partitioning currently, but ISTM Jignesh quoted § is
> > just about that.
>
> Much better than triggers and rules, but it will be hard to get it to
> work.

Well, I'm thinking about providing a somewhat modular approach where pgloader
code is able to recognize CHECK constraints, load a module registered to the
operator and data types, then use it.
The modules and their registration should be done at the configuration level;
I'll provide some defaults, and users will be able to add their own code, the
same way on-the-fly reformat modules are handled now.

This means I'll (hopefully) be able to provide the basic cases quickly
(CHECK on dates >= x and < y, numeric ranges, etc.), and users will be able
to take care of more complex setups.

When a constraint doesn't match any configured pgloader exclusion module, the
trigger/rule code will get used (COPY will go to the main table), and when
the Python CHECK implementation is wrong (worst case) PostgreSQL will
reject the data and pgloader will fill your reject data and log files. And
you're back to debugging your Python CHECK implementation...

All of this is only a braindump as of now, and maybe quite an optimistic
one... but barring any 'I know this can't work' objection, that's what I'm
going to try to implement for the next pgloader version.

Thanks for the comments; input is really appreciated!
--
dim


Re: Benchmark Data requested

From
Simon Riggs
Date:
On Tue, 2008-02-05 at 18:15 +0100, Dimitri Fontaine wrote:
> On Tuesday 05 February 2008, Simon Riggs wrote:
> > I'll look at COPY FROM internals to make this faster. I'm looking at
> > this now to refresh my memory; I already had some plans on the shelf.
>
> Maybe stealing some ideas from pg_bulkload could somewhat help here?
>   http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf

> IIRC it's mainly about how to optimize index updating while loading data, and
> I've heard complaints on the line "this external tool has to know too much
> about PostgreSQL internals to be trustworthy as non-core code"... so...

Yeh, the batch index updates are a cool feature. Should be able to do
that internally also.

Not going to try the no-WAL route again though. If we can get it running
efficiently and in parallel, then that will be OK.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
"Jignesh K. Shah"
Date:
That sounds cool to me too..

How much work is it to make pg_bulkload work on 8.3? An integrated
version is certainly more beneficial.

I think it will especially help other setups too, like TPC-E, where this
is a problem.

Regards,
Jignesh



Simon Riggs wrote:
> On Tue, 2008-02-05 at 18:15 +0100, Dimitri Fontaine wrote:
>
>> On Tuesday 05 February 2008, Simon Riggs wrote:
>>
>>> I'll look at COPY FROM internals to make this faster. I'm looking at
>>> this now to refresh my memory; I already had some plans on the shelf.
>>>
>> Maybe stealing some ideas from pg_bulkload could somewhat help here?
>>   http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf
>>
>
>
>> IIRC it's mainly about how to optimize index updating while loading data, and
>> I've heard complaints on the line "this external tool has to know too much
>> about PostgreSQL internals to be trustworthy as non-core code"... so...
>>
>
> Yeh, the batch index updates are a cool feature. Should be able to do
> that internally also.
>
> Not going to try the no-WAL route again though. If we can get it running
> efficiently and in parallel, then that will be OK.
>
>

Re: Benchmark Data requested

From
Simon Riggs
Date:
On Tue, 2008-02-05 at 13:47 -0500, Jignesh K. Shah wrote:
> That sounds cool to me too..
>
> How much work is to make pg_bulkload to work on 8.3? An Integrated
> version is certainly more beneficial.

> Specially I think it will also help for other setups like TPC-E too
> where this is a problem.

If you don't write WAL then you can lose all your writes in a crash.
That issue is surmountable on a table with no indexes, or even
conceivably with one monotonically ascending index. With other indexes
if we crash then we have a likely corrupt index.

For most production systems I'm aware of, losing an index on a huge
table is not anything you'd want to trade for performance. Assuming
you've ever been knee-deep in it on a real server.

Maybe we can have a "load mode" for a table where we skip writing any
WAL, but if we crash we just truncate the whole table to nothing? Issue
a WARNING if we enable this mode while there is any data in the table. I'm
nervous of it, but maybe people really want it?

I don't really want to invent ext2 all over again, where we have to run an
fsck on a table if we crash while loading. My concern is that many
people would choose that and then blame us for delivering unreliable
software. E.g. the direct path loader on Oracle used to corrupt a PK index
if you loaded duplicate rows with it (whether it still does I couldn't
care less). That kind of behaviour is simply incompatible with production
usage, even if it does make for a good benchmark.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
"Heikki Linnakangas"
Date:
Dimitri Fontaine wrote:
> On Tuesday 05 February 2008, Simon Riggs wrote:
>> I'll look at COPY FROM internals to make this faster. I'm looking at
>> this now to refresh my memory; I already had some plans on the shelf.
>
> Maybe stealing some ideas from pg_bulkload could somewhat help here?
>   http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf
>
> IIRC it's mainly about how to optimize index updating while loading data, and
> I've heard complaints on the line "this external tool has to know too much
> about PostgreSQL internals to be trustworthy as non-core code"... so...

I've been thinking of looking into that as well. The basic trick
pg_bulkload is using is to populate the index as the data is being
loaded. There's no fundamental reason why we couldn't do that internally
in COPY. Triggers or constraints that access the table being loaded
would make it impossible, but we should be able to detect that and fall
back to what we have now.

What I'm basically thinking about is to modify the indexam API of
building a new index, so that COPY would feed the tuples to the indexam,
instead of the indexam opening and scanning the heap. The b-tree indexam
would spool the tuples into a tuplesort as the COPY progresses, and
build the index from that at the end as usual.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Benchmark Data requested

From
"Jignesh K. Shah"
Date:
Commercial DB bulk loaders work the same way: they give you an option
for a fast load provided that, in case of error, the whole table is
truncated.  I think this also has real-life advantages where PostgreSQL
is used for data marts, which are recreated every now and then from other
systems and want fast loaders. So it's not just benchmarking
folks like me who would take advantage of such features. In fact I have
seen them force a "REPLACE TABLE" clause, in the sense that it will
in fact truncate the table before loading, so there is no confusion about
what happens to the original data in the table, and only then does it
avoid the logs.
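
(As a point of comparison, PostgreSQL already skips WAL for a COPY that
follows a TRUNCATE in the same transaction, when WAL archiving is off;
roughly, as a sketch with illustrative table and file names:)

  BEGIN;
  TRUNCATE datamart_facts;            -- "REPLACE TABLE" style semantics
  COPY datamart_facts FROM '/stage/facts.dat' WITH DELIMITER '|';
  COMMIT;                             -- the COPY above can skip WAL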


To be honest, it's not the WAL writes to the disk that I am worried
about. According to my tests, async_commit comes pretty close to
sync=off and solves the WALWriteLock contention. We should maybe just
focus on making it more efficient, which I think also involves
WALInsertLock, which may not be entirely efficient.
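
(For reference, the asynchronous commit setting I am testing with is just
the 8.3 GUC, e.g.:)

  -- per-session asynchronous commit (8.3): the commit returns before the
  -- WAL flush, trading a small window of possible transaction loss on a
  -- crash for much less waiting on WAL writes
  SET synchronous_commit = off;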


Also, all changes have to be add-on options and not replacements for
existing load behaviour; I totally agree with that point. The guys in
production support don't even like optimizer query plan changes, let
alone a corrupt index. (I have spent two days in a previous role trying to
figure out why a particular query plan on another database changed in
production.)






Simon Riggs wrote:
> On Tue, 2008-02-05 at 13:47 -0500, Jignesh K. Shah wrote:
>
>> That sounds cool to me too..
>>
>> How much work is to make pg_bulkload to work on 8.3? An Integrated
>> version is certainly more beneficial.
>>
>
>
>> Specially I think it will also help for other setups like TPC-E too
>> where this is a problem.
>>
>
> If you don't write WAL then you can lose all your writes in a crash.
> That issue is surmountable on a table with no indexes, or even
> conceivably with one monotonically ascending index. With other indexes
> if we crash then we have a likely corrupt index.
>
> For most production systems I'm aware of, losing an index on a huge
> table is not anything you'd want to trade for performance. Assuming
> you've ever been knee-deep in it on a real server.
>
> Maybe we can have a "load mode" for a table where we skip writing any
> WAL, but if we crash we just truncate the whole table to nothing? Issue
> a WARNING if we enable this mode while any data in table. I'm nervous of
> it, but maybe people really want it?
>
> I don't really want to invent ext2 all over again, so we have to run an
> fsck on a table of we crash while loading. My concern is that many
> people would choose that then blame us for delivering unreliable
> software. e.g. direct path loader on Oracle used to corrupt a PK index
> if you loaded duplicate rows with it (whether it still does I couldn't
> care). That kind of behaviour is simply incompatible with production
> usage, even if it does good benchmark.
>
>

Re: Benchmark Data requested

From
"Jignesh K. Shah"
Date:
Hi Heikki,

Is there a way such an operation could be spawned as a worker process?
Generally during such loading - which most people will do during
"off-peak" hours - I expect additional CPU resources to be available. By
delegating such additional work to worker processes, we should be able
to capitalize on the additional cores in the system.

Even if it is a single core, given that the loading process will
eventually wait on reads from the input file, which cannot be
non-blocking, the OS can timeslice it well enough for the second process
to use those wait times for the index population work.

What do you think?


Regards,
Jignesh


Heikki Linnakangas wrote:
> Dimitri Fontaine wrote:
>> On Tuesday 05 February 2008, Simon Riggs wrote:
>>> I'll look at COPY FROM internals to make this faster. I'm looking at
>>> this now to refresh my memory; I already had some plans on the shelf.
>>
>> Maybe stealing some ideas from pg_bulkload could somewhat help here?
>>
>> http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf
>>
>>
>> IIRC it's mainly about how to optimize index updating while loading
>> data, and I've heard complaints on the line "this external tool has
>> to know too much about PostgreSQL internals to be trustworthy as
>> non-core code"... so...
>
> I've been thinking of looking into that as well. The basic trick
> pg_bulkload is using is to populate the index as the data is being
> loaded. There's no fundamental reason why we couldn't do that
> internally in COPY. Triggers or constraints that access the table
> being loaded would make it impossible, but we should be able to detect
> that and fall back to what we have now.
>
> What I'm basically thinking about is to modify the indexam API of
> building a new index, so that COPY would feed the tuples to the
> indexam, instead of the indexam opening and scanning the heap. The
> b-tree indexam would spool the tuples into a tuplesort as the COPY
> progresses, and build the index from that at the end as usual.
>
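
As a rough illustration of that spool-then-build flow, here is a toy Python
sketch (purely illustrative; the real work would live in the b-tree indexam
and a tuplesort, and all names here are made up):

# "Load" rows while spooling index keys, then build the index in one
# sorted pass at the end instead of inserting into it key by key.
rows = [("b", "row 2 data"), ("a", "row 1 data"), ("c", "row 3 data")]

heap = []    # stands in for the table being COPYed into
spool = []   # stands in for the tuplesort spool fed by COPY

for key, payload in rows:
    heap.append(payload)                  # the COPY part
    spool.append((key, len(heap) - 1))    # feed the would-be indexam

spool.sort()                              # one big sort at the end...
index = {key: pos for key, pos in spool}  # ...then build the "index" from it

assert heap[index["a"]] == "row 1 data"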

Re: Benchmark Data requested

From
"Heikki Linnakangas"
Date:
Jignesh K. Shah wrote:
> Is there a way such an operation can be spawned as a worker process?
> Generally during such loading - which most people will do during
> "offpeak" hours I expect additional CPU resources available. By
> delegating such additional work to worker processes, we should be able
> to capitalize on additional cores in the system.

Hmm. You do need access to shared memory, locks, catalogs, and to run
functions etc, so I don't think it's significantly easier than using
multiple cores for COPY itself.

> Even if it is a single core, the mere fact that the loading process will
> eventually wait for a read from the input file which cannot be
> non-blocking, the OS can timeslice it well for the second process to use
> those wait times for the index population work.

That's an interesting point.

> What do you think?
>
>
> Regards,
> Jignesh
>
>
> Heikki Linnakangas wrote:
>> Dimitri Fontaine wrote:
>>> Le mardi 05 février 2008, Simon Riggs a écrit :
>>>> I'll look at COPY FROM internals to make this faster. I'm looking at
>>>> this now to refresh my memory; I already had some plans on the shelf.
>>>
>>> Maybe stealing some ideas from pg_bulkload could somewhat help here?
>>>
>>> http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf
>>>
>>>
>>> IIRC it's mainly about how to optimize index updating while loading
>>> data, and I've heard complaints on the line "this external tool has
>>> to know too much about PostgreSQL internals to be trustworthy as
>>> non-core code"... so...
>>
>> I've been thinking of looking into that as well. The basic trick
>> pg_bulkload is using is to populate the index as the data is being
>> loaded. There's no fundamental reason why we couldn't do that
>> internally in COPY. Triggers or constraints that access the table
>> being loaded would make it impossible, but we should be able to detect
>> that and fall back to what we have now.
>>
>> What I'm basically thinking about is to modify the indexam API of
>> building a new index, so that COPY would feed the tuples to the
>> indexam, instead of the indexam opening and scanning the heap. The
>> b-tree indexam would spool the tuples into a tuplesort as the COPY
>> progresses, and build the index from that at the end as usual.
>>


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Benchmark Data requested

From
Simon Riggs
Date:
On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote:

> Is there a way such an operation can be spawned as a worker process?
> Generally during such loading - which most people will do during
> "offpeak" hours I expect additional CPU resources available. By
> delegating such additional work to worker processes, we should be able
> to capitalize on additional cores in the system.
>
> Even if it is a single core, the mere fact that the loading process will
> eventually wait for a read from the input file which cannot be
> non-blocking, the OS can timeslice it well for the second process to use
> those wait times for the index population work.

If Dimitri is working on parallel load, why bother?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: Benchmark Data requested

From
Greg Smith
Date:
On Tue, 5 Feb 2008, Simon Riggs wrote:

> On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote:
>>
>> Even if it is a single core, the mere fact that the loading process will
>> eventually wait for a read from the input file which cannot be
>> non-blocking, the OS can timeslice it well for the second process to use
>> those wait times for the index population work.
>
> If Dimitri is working on parallel load, why bother?

pgloader is a great tool for a lot of things, particularly if there's any
chance that some of your rows will get rejected.  But the way things pass
through the Python/psycopg layer made it uncompetitive (more than 50%
slowdown) against the straight COPY path from a rows/second perspective
the last time (V2.1.0?) I did what I thought was a fair test of it (usual
caveat of "with the type of data I was loading").  Maybe there's been some
gigantic improvement since then, but it's hard to beat COPY when you've
got an API layer or two in the middle.

I suspect what will end up happening is that a parallel loading pgloader
will scale something like this:

1 CPU:  Considerably slower than COPY
2-3 CPUs: Close to even with COPY
4+ CPUs:  Faster than COPY

Maybe I'm wrong, but I wouldn't abandon looking into another approach
until that territory is mapped out a bit better.

Given the very large number of dual-core systems out there now relative to
those with more, optimizing the straight COPY path in any way that lets even
one more core help with things like index building is well worth doing.
Heikki's idea sounded good to me regardless, and if that can be separated out
enough to get a second core into the index building at the same time, so much
the better.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested

From
NikhilS
Date:
Hi,

On Feb 6, 2008 9:05 AM, Greg Smith <gsmith@gregsmith.com> wrote:
On Tue, 5 Feb 2008, Simon Riggs wrote:

> On Tue, 2008-02-05 at 15:50 -0500, Jignesh K. Shah wrote:
>>
>> Even if it is a single core, the mere fact that the loading process will
>> eventually wait for a read from the input file which cannot be
>> non-blocking, the OS can timeslice it well for the second process to use
>> those wait times for the index population work.
>
> If Dimitri is working on parallel load, why bother?

pgloader is a great tool for a lot of things, particularly if there's any
chance that some of your rows will get rejected.  But the way things pass
through the Python/psycopg layer made it uncompetative (more than 50%
slowdown) against the straight COPY path from a rows/second perspective
the last time (V2.1.0?) I did what I thought was a fair test of it (usual
caveat of "with the type of data I was loading").  Maybe there's been some
gigantic improvement since then, but it's hard to beat COPY when you've
got an API layer or two in the middle.

I think it's time now that we jazz COPY up a bit to include all the discussed functionality. Heikki's batch-indexing idea is pretty useful too. Another thing that pg_bulkload does is load the tuples directly into the relation: it constructs the tuples and writes them straight to the physical file corresponding to the involved relation, bypassing the engine completely (of course the limitations that arise out of this are no support for rules, triggers, constraints, default expression evaluation, etc.). ISTM we could optimize the COPY code to try to do direct loading too (not necessarily as done by pg_bulkload) to speed it up further in certain cases.

Another thing we should add to COPY is the ability to continue a data load across errors, as was discussed on -hackers some time back.

Regards,
Nikhils
--
EnterpriseDB               http://www.enterprisedb.com

Re: Benchmark Data requested

From
Dimitri Fontaine
Date:
Le mercredi 06 février 2008, Greg Smith a écrit :
> pgloader is a great tool for a lot of things, particularly if there's any
> chance that some of your rows will get rejected.  But the way things pass
> through the Python/psycopg layer made it uncompetative (more than 50%
> slowdown) against the straight COPY path from a rows/second perspective
> the last time (V2.1.0?)

I've yet to add in the psycopg wrapper Marko wrote for skytools: at the moment
I'm using the psycopg1 interface even when psycopg2 is used, and it seems the
new version has some great performance improvements. I just didn't bother
until now, thinking this wouldn't affect COPY.

> I did what I thought was a fair test of it (usual
> caveat of "with the type of data I was loading").  Maybe there's been some
> gigantic improvement since then, but it's hard to beat COPY when you've
> got an API layer or two in the middle.

Did you compare to COPY or \copy? I'd expect the psycopg COPY API not to be
that much more costly than the psql one, after all.
Where pgloader is really left behind (in terms of tuples inserted per second)
compared to COPY is when it has to juggle the data a lot, I'd say (reformat,
reorder, add constants, etc.). But I've tried to design it so that when it is
not configured to arrange (massage?) the data, the code path is the simplest
possible.

Do you want to test pgloader again with Marko's psycopg wrapper code to see if
this helps? If so, I'll arrange to push it to CVS ASAP.

Maybe at the end of this the PostgreSQL backend code will be smarter than
pgloader (wrt error handling and data massaging) and we'll be able to drop the
project, but in the meantime I'll try my best to make pgloader as fast as
possible :)
--
dim


Re: Benchmark Data requested --- pgloader CE design ideas

From
Dimitri Fontaine
Date:
Hi,

I've been thinking about this topic some more, and as I don't know when I'll
be able to go and implement it I'd want to publish the ideas here. This way
I'll be able to find them again :)

Le mardi 05 février 2008, Dimitri Fontaine a écrit :
> Le mardi 05 février 2008, Simon Riggs a écrit :
> > Much better than triggers and rules, but it will be hard to get it to
> > work.
>
> Well, I'm thinking about providing a somewhat modular approach where
> pgloader code is able to recognize CHECK constraints, load a module
> registered to the operator and data types, then use it.

Here's how I think I'm gonna implement it:

User level configuration
-=-=-=-=-=-=-=-=-=-

At the user level, you will have to add a constraint_exclusion = on parameter
to the pgloader section configuration for it to bother checking whether the
destination table has children etc.
You'll also need to provide a global ce_path parameter (where to find user
python constraint exclusion modules) and a ce_modules parameter for each
section where constraint_exclusion = on:
  ce_modules = columnA:module:class, columnB:module:class

As ce_path could point to any number of modules, and a single type might be
supported by several of them, I'll let the user choose which module to use.
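
A minimal sketch of how that ce_modules setting might be turned into classes
(illustrative only; the helper name and the lack of error handling are
assumptions):

import importlib

def load_ce_classes(ce_modules):
    """Map column name -> CE class from a 'col:module:class, ...' setting."""
    classes = {}
    for entry in ce_modules.split(','):
        column, module_name, class_name = (s.strip() for s in entry.split(':'))
        # In the real tool the module would be looked up under ce_path.
        module = importlib.import_module(module_name)
        classes[column] = getattr(module, class_name)
    return classes

# load_ce_classes("columnA:myce:MyCE") -> {'columnA': <class 'myce.MyCE'>}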

Constraint exclusion modules
-=-=-=-=-=-=-=-=-=-=-=-=-

The modules will provide one or several classes (kind of a packaging issue);
each one will have to register which datatypes and operators it knows about.
Here's some pseudo-code of a module, which is certainly the best way to
express a code design idea:

class MyCE:
   def __init__(self, operator, constant, cside='r'):
      """ CHECK ( col operator constant ) => cside = 'r', could be 'l' """
      # Keep what we need to evaluate the check later on.
      self.operator = operator
      self.constant = constant
      self.cside = cside

   @classmethod
   def support_type(cls, type):
      return type in ['integer', 'bigint', 'smallint', 'real', 'double']

   @classmethod
   def support_operator(cls, op):
      return op in ['=', '>', '<', '>=', '<=', '%']

   def check(self, op, data):
      if op == '>': return self.gt(data)
      ...  # and so on for the other supported operators

   def gt(self, data):
      if self.cside == 'l':
         return self.constant > data
      elif self.cside == 'r':
         return data > self.constant

This way pgloader will be able to support any datatype (user datatype like
IP4R included) and operator (@@, ~<= or whatever). For pgloader to handle a
CHECK() constraint, though, it'll have to be configured to use a CE class
supporting the used operators and datatypes.

PGLoader constraint exclusion support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

A CHECK() constraint being a tree of check expressions[*] linked by logical
operators, pgloader will have to build a matching logic tree of MyCE instances
(user CE modules) and evaluate all the checks in order to choose the partition
for each input line.

[*]: check((a % 10) = 1) makes an expression tree containing 2 check nodes

After having parsed pg_constraint.consrc (not conbin, which seems too much of
an internal dump to be used from user code) and built a CHECK tree for each
partition, pgloader will try to decide whether it's range partitioning (the
most common case).

If each partition's CHECK tree is AND(a>=b, a<c) or a variation of it, we have
range partitioning. Then surely we can optimize the code that chooses the
partition to COPY the data into while still using the module operator
implementation, e.g. by doing a binary search on a tree of partition limits.
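
For the range-partitioning case, that partition choice could boil down to a
bisect over the sorted lower bounds; a minimal sketch with made-up boundaries
and table names:

import bisect

# Hypothetical partitions: t_p0 holds [0,10), t_p1 holds [10,20), and so on.
lower_bounds = [0, 10, 20, 30]
partitions = ['t_p0', 't_p1', 't_p2', 't_p3']

def choose_partition(value):
    """Return the child table whose range contains value."""
    i = bisect.bisect_right(lower_bounds, value) - 1
    if i < 0:
        raise ValueError("no partition accepts %r" % value)
    # A real implementation would also check the last partition's upper bound.
    return partitions[i]

# choose_partition(17) -> 't_p1'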

If you want some other widely used (or not) partitioning scheme to be
recognized and optimized by pgloader, just tell me and we'll see about it :)
Having this step as a user module seems overkill at the moment, though.

Multi-Threading behavior and CE support
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Now, pgloader will be able to run N threads, each one loading some data to a
partitioned child-table target. N will certainly be configured depending on
the number of server cores and not on the number of partitions...

So what do we do when we read a tuple we want to store in a partition which
has no dedicated thread started yet, and we already have N threads running?
I'm thinking about some LRU(Thread) scheme to choose a thread to terminate
(launch COPY with its current buffer and quit) and start a new one for the
current partition target.
Hopefully there won't be such high values of N that the LRU is a bad choice
per se, and the input data won't be so messy that we have to stop/start
threads at each new line.

Comments welcome, regards,
--
dim


Re: Benchmark Data requested --- pgloader CE design ideas

From
Simon Riggs
Date:
On Wed, 2008-02-06 at 12:27 +0100, Dimitri Fontaine wrote:
> Multi-Threading behavior and CE support
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>
> Now, pgloader will be able to run N threads, each one loading some
> data to a
> partitionned child-table target. N will certainly be configured
> depending on
> the number of server cores and not depending on the partition
> numbers...
>
> So what do we do when reading a tuple we want to store in a partition
> which
> has no dedicated Thread started yet, and we already have N Threads
> running?
> I'm thinking about some LRU(Thread) to choose a Thread to terminate
> (launch
> COPY with current buffer and quit) and start a new one for the
> current
> partition target.
> Hopefully there won't be such high values of N that the LRU is a bad
> choice
> per see, and the input data won't be so messy to have to stop/start
> Threads
> at each new line.

For me, it would be good to see a --parallel=n parameter that would
allow pg_loader to distribute rows in "round-robin" manner to "n"
different concurrent COPY statements. i.e. a non-routing version. Making
that work well, whilst continuing to do error-handling seems like a
challenge, but a very useful goal.

Adding intelligence to the row distribution may be technically hard but
may also simply move the bottleneck onto pg_loader. We may need multiple
threads in pg_loader, or we may just need multiple sessions from
pg_loader. Experience from doing the non-routing parallel version may
help in deciding whether to go for the routing version.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com
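
To make the non-routing, round-robin idea concrete, here is a minimal
client-side sketch (psycopg2 assumed; the connection string, table and file
names are made up, and a real loader would stream in chunks rather than
buffer the whole input in memory):

import io
import threading
import psycopg2

N = 4                              # number of concurrent COPY sessions (arbitrary)
DSN = "dbname=test"                # hypothetical connection string
TABLE = "bigtable"                 # hypothetical target table

def copy_buffer(buf):
    """Run one COPY FROM STDIN over its own connection."""
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    buf.seek(0)
    cur.copy_from(buf, TABLE)      # tab-separated input assumed
    conn.commit()
    conn.close()

# Distribute input lines round-robin over N buffers -- no routing logic at all.
buffers = [io.StringIO() for _ in range(N)]
with open("bigtable.tsv") as f:    # hypothetical input file
    for i, line in enumerate(f):
        buffers[i % N].write(line)

threads = [threading.Thread(target=copy_buffer, args=(b,)) for b in buffers]
for t in threads:
    t.start()
for t in threads:
    t.join()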


Re: Benchmark Data requested --- pgloader CE design ideas

From
Dimitri Fontaine
Date:
Le mercredi 06 février 2008, Simon Riggs a écrit :
> For me, it would be good to see a --parallel=n parameter that would
> allow pg_loader to distribute rows in "round-robin" manner to "n"
> different concurrent COPY statements. i.e. a non-routing version.

What happens when you want at most N parallel threads and have several
sections configured: do you want pgloader to serialize section loading (often
there's one section per table, sometimes different sections target the same
table) but parallelise the loading of each section?

I'm thinking we should have a global max_threads knob *and* a per-section
max_threads one if we want to go this way, but then multi-threaded sections
will somewhat fight against other sections (multi-threaded or not) for
threads to use.

So I'll also add a parameter to configure how many (max) sections to load in
parallel at any time.

We'll then have (default values presented):
max_threads = 1
max_parallel_sections = 1
section_threads = -1

The section_threads parameter would be overridable at the section level but
would need to stay <= max_threads (if not, it is discarded and a warning is
issued). When section_threads is -1, pgloader tries to use the highest number
of threads possible, still within the global max_threads limit.
If max_parallel_sections is -1, pgloader starts a new thread for each new
section, maxing out at max_threads, then waits for a thread to finish before
launching the next section's loading.

If you have N max_threads and max_parallel_sections = section_threads = -1,
then we'll see some kind of fight between new section threads and in-section
threads (the parallel non-routing COPY behaviour). But then it's a user
choice.

Adding the constraint exclusion support into this would not mess it up, but it
will only be interesting when section_threads != 1 and max_threads > 1.

> Making
> that work well, whilst continuing to do error-handling seems like a
> challenge, but a very useful goal.

Quick tests showed me the python threading model allows for easy sharing of
objects between several threads, so I don't think I'll need to adjust my
reject code when going per-section multi-threaded. I'll just have to use a
semaphore object to keep writing rejected lines one at a time. Not that
complex, if reliable.

> Adding intelligence to the row distribution may be technically hard but
> may also simply move the bottleneck onto pg_loader. We may need multiple
> threads in pg_loader, or we may just need multiple sessions from
> pg_loader. Experience from doing the non-routing parallel version may
> help in deciding whether to go for the routing version.

If non-routing per-section multi-threading is a user request and not that hard
to implement (thanks to python), that sounds like a good enough reason for me
to provide it :)

I'll keep you (and the list) informed as soon as I have the code to play
with.
--
dim


Re: Benchmark Data requested

From
Greg Smith
Date:
On Wed, 6 Feb 2008, Dimitri Fontaine wrote:

> Did you compare to COPY or \copy?

COPY.  If you're loading a TB, if you're smart the file is going onto the
server itself if at all possible and you're loading directly from there.  You
would probably get a closer comparison against psql \copy, but recognize
you're always going to be compared against the best server-side copy method
available.

> Do you want to test pgloader again with Marko psycopgwrapper code to see if
> this helps?

Wouldn't have time myself for at least a month (probably not until after
the East convention) so don't go making commits on my behalf.

> Maybe at the end of this PostgreSQL backend code will be smarter than pgloader
> (wrt error handling and data massaging) and we'll be able to drop the
> project

There are way too many data-massaging cases that pgloader does a great job of
right now and that I never expect the backend to handle, and I think there
will always be a niche for a tool like this.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested --- pgloader CE design ideas

From
Greg Smith
Date:
On Wed, 6 Feb 2008, Simon Riggs wrote:

> For me, it would be good to see a --parallel=n parameter that would
> allow pg_loader to distribute rows in "round-robin" manner to "n"
> different concurrent COPY statements. i.e. a non-routing version.

Let me expand on this.  In many of these giant COPY situations the
bottleneck is plain old sequential I/O to a single process.  You can
almost predict how fast the rows will load using dd.  Having a process
that pulls rows in and distributes them round-robin is good, but it won't
crack that bottleneck.  The useful approaches I've seen for other
databases all presume that the data files involved are large enough that
on big hardware, you can start multiple processes running at different
points in the file and beat anything possible with a single reader.

If I'm loading a TB file, odds are good I can split that into 4 or more
vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
at once, and get way more than 1 disk worth of throughput reading.  You
have to play with the exact number because if you push the split too far
you introduce seek slowdown instead of improvements, but that's the basic
design I'd like to see one day.  It's not parallel loading that's useful
for the cases I'm thinking about until something like this comes around.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested --- pgloader CE design ideas

From
Luke Lonergan
Date:
Hi Greg,

On 2/6/08 7:56 AM, "Greg Smith" <gsmith@gregsmith.com> wrote:

> If I'm loading a TB file, odds are good I can split that into 4 or more
> vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
> at once, and get way more than 1 disk worth of throughput reading.  You
> have to play with the exact number because if you push the split too far
> you introduce seek slowdown instead of improvements, but that's the basic
> design I'd like to see one day.  It's not parallel loading that's useful
> for the cases I'm thinking about until something like this comes around.

Just load 4 relfiles.  You have to be able to handle partial relfiles, which
changes the storage mgmt a bit, but the benefits are easier to achieve.

- Luke


Re: Benchmark Data requested --- pgloader CE design ideas

From
"Jignesh K. Shah"
Date:

Greg Smith wrote:
> On Wed, 6 Feb 2008, Simon Riggs wrote:
>
>> For me, it would be good to see a --parallel=n parameter that would
>> allow pg_loader to distribute rows in "round-robin" manner to "n"
>> different concurrent COPY statements. i.e. a non-routing version.
>
> Let me expand on this.  In many of these giant COPY situations the
> bottleneck is plain old sequential I/O to a single process.  You can
> almost predict how fast the rows will load using dd.  Having a process
> that pulls rows in and distributes them round-robin is good, but it
> won't crack that bottleneck.  The useful approaches I've seen for
> other databases all presume that the data files involved are large
> enough that on big hardware, you can start multiple processes running
> at different points in the file and beat anything possible with a
> single reader.
>
> If I'm loading a TB file, odds are good I can split that into 4 or
> more vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start
> 4 loaders at once, and get way more than 1 disk worth of throughput
> reading.  You have to play with the exact number because if you push
> the split too far you introduce seek slowdown instead of improvements,
> but that's the basic design I'd like to see one day.  It's not
> parallel loading that's useful for the cases I'm thinking about until
> something like this comes around.
>

Some food for thought here: most BI-type applications which do data
conversion/cleansing may also end up sorting the data before it is loaded
into a database, so starting parallel loaders at totally different points
ruins that effort. A more pragmatic approach would be to read the next rows
from the input file: if there are N parallel streams, then each one is offset
by 1 row from its neighbour and jumps by N rows, so the seeks are pretty much
narrowed down to a few rows (ideally 1) instead of jumping 1/Nth of the file
every time a read happens.
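
A small Python sketch of that interleaved scheme (file name and stream count
are arbitrary; each stream would feed its own COPY session):

from itertools import islice

N = 4    # number of parallel streams (arbitrary)

def stream_reader(path, stream_id, nstreams=N):
    """Reader `stream_id` handles lines stream_id, stream_id+N, stream_id+2N, ...

    Each stream is offset by one line from its neighbour and jumps ahead by N
    lines, so any pre-sorted ordering in the input is roughly preserved per
    stream and the per-read seek distance stays small.
    """
    with open(path) as f:
        for line in islice(f, stream_id, None, nstreams):
            yield line

# Example: what stream 2 of 4 would load (file name hypothetical).
for row in stream_reader("sorted_input.csv", 2):
    pass    # hand `row` to this stream's COPY buffer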

For example, to replicate this with dd and see the impact, use a big file
with the seek option and various block sizes. Somebody out here once did that
test and showed that the "seek time" on the file being read is reduced
significantly, and if the file system does intelligent prefetching (which
unfortunately UFS in Solaris does not do best by default), all the reads for
the next stream will already be in memory.



Regards,
Jignesh


Re: Benchmark Data requested

From
Dimitri Fontaine
Date:
Le mercredi 06 février 2008, Greg Smith a écrit :
> COPY.  If you're loading a TB, if you're smart it's going onto the server
> itself if it all possible and loading directly from there.  Would probably
> get a closer comparision against psql \copy, but recognize you're always
> going to be compared against the best server-side copy method available.

Fair enough on your side, even if I can't expect an external tool using the
network protocol to compete with the backend reading a local file. I wanted to
make sure the 50% slowdown was not just due to my code being that bad.

> There are way too many data massaging cases I never expect the backend
> will handle that pgloader does a great job of right now, and I think there
> will always be a niche for a tool like this.

Let's try to continue improving the tool then!
--
dim


Re: Benchmark Data requested --- pgloader CE design ideas

From
Dimitri Fontaine
Date:
Le mercredi 06 février 2008, Greg Smith a écrit :
> If I'm loading a TB file, odds are good I can split that into 4 or more
> vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4 loaders
> at once, and get way more than 1 disk worth of throughput reading.

pgloader already supports starting at any input file line number, and limiting
itself to any number of reads:

  -C COUNT, --count=COUNT
                        number of input lines to process
  -F FROMCOUNT, --from=FROMCOUNT
                        number of input lines to skip

So you could already launch 4 pgloader processes with the same configuration
file but different command-line arguments. If there's interest/demand, it's
easy enough for me to add those parameters as file configuration knobs too.
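
As an illustration, launching such a split load could look roughly like this
(a sketch that assumes the total line count is already known and leaves out
whatever other pgloader arguments a real setup needs):

import subprocess

TOTAL_LINES = 60_000_000    # assumed to be known in advance
NPROCS = 4
CHUNK = TOTAL_LINES // NPROCS

procs = []
for i in range(NPROCS):
    start = i * CHUNK
    count = CHUNK if i < NPROCS - 1 else TOTAL_LINES - start
    # --from skips `start` input lines and --count stops after `count` lines,
    # as documented above; any other required options are omitted here.
    procs.append(subprocess.Popen(
        ["pgloader", "--from", str(start), "--count", str(count)]))

for p in procs:
    p.wait()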

Still, you have to pay for client-to-server communication instead of having
the backend read the file locally, but now maybe we begin to compete?

Regards,
--
dim


Re: Benchmark Data requested --- pgloader CE design ideas

From
Dimitri Fontaine
Date:
Le Wednesday 06 February 2008 18:37:41 Dimitri Fontaine, vous avez écrit :
> Le mercredi 06 février 2008, Greg Smith a écrit :
> > If I'm loading a TB file, odds are good I can split that into 4 or more
> > vertical pieces (say rows 1-25%, 25-50%, 50-75%, 75-100%), start 4
> > loaders at once, and get way more than 1 disk worth of throughput
> > reading.
>
> pgloader already supports starting at any input file line number, and limit
> itself to any number of reads:

In fact, the -F option works by having pgloader read the given number of lines
but skip processing them, which is not at all what Greg is talking about here
I think.

Plus, I think it would be easier for me to code some stat() then lseek() then
read() into the pgloader readers machinery than to change the code
architecture to support a separate thread for the file reader.

Greg, what would you think of a pgloader which will separate file reading
based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of
threads: we split into as many pieces as section_threads section config
value.

This behaviour won't be available for sections where type = text and
field_count(*) is given, because in this case I don't see how pgloader could
reliably recognize the beginning of a new logical line and start processing
there.
In other cases, a logical line is a physical line, so we start after the first
newline found after the given lseek start position, and continue reading past
the last lseek position until a newline.

*:http://pgloader.projects.postgresql.org/#_text_format_configuration_parameters

Comments?
--
dim


Re: Benchmark Data requested --- pgloader CE design ideas

From
Greg Smith
Date:
On Wed, 6 Feb 2008, Dimitri Fontaine wrote:

> In fact, the -F option works by having pgloader read the given number of lines
> but skip processing them, which is not at all what Greg is talking about here
> I think.

Yeah, that's not useful.

> Greg, what would you think of a pgloader which will separate file reading
> based on file size as given by stat (os.stat(file)[ST_SIZE]) and number of
> threads: we split into as many pieces as section_threads section config
> value.

Now you're talking.  Find a couple of split points that way, fine-tune the
boundaries a bit so they rest on line termination points, and off you go.
Don't forget that the basic principle here implies you'll never know until
you're done just how many lines were really in the file.  When thread#1 is
running against chunk#1, it will never have any idea what line chunk#2
really started at until it reaches there, at which point it's done and
that information isn't helpful anymore.

You have to stop thinking in terms of lines for this splitting; all you
can do is split the file into useful byte sections and then count the
lines within them as you go.  Anything else requires a counting scan of
the file and such a sequential read is exactly what can't happen
(especially not more than once), it just takes too long.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested --- pgloader CE design ideas

From
Dimitri Fontaine
Date:
Le jeudi 07 février 2008, Greg Smith a écrit :
>Le mercredi 06 février 2008, Dimitri Fontaine a écrit :
>> In other cases, a logical line is a physical line, so we start after first
>> newline met from given lseek start position, and continue reading after the
>> last lseek position until a newline.
>
> Now you're talking.  Find a couple of split points that way, fine-tune the
> boundaries a bit so they rest on line termination points, and off you go.

I was thinking of not even reading the file content from the controller
thread: just decide the splitting points in bytes (0..ST_SIZE/4,
ST_SIZE/4+1..2*ST_SIZE/4, etc.) and let each reading thread fine-tune by
beginning to process input only after having read the first newline, etc.

And while we're still at the design board, I'm also thinking of adding a
per-section parameter (with a global default value possible),
split_file_reading, which defaults to False and which you'll have to set to
True for pgloader to behave the way we're talking about.

When split_file_reading = False and section_threads != 1, pgloader will have
to manage several processing threads per section but only one file-reading
thread, handing the read input to processing threads in a round-robin fashion.
In the future the choice of processing thread will possibly (another knob) be
smarter than that, as soon as we get CE support into pgloader.

When split_file_reading = True and section_threads != 1, pgloader will have to
manage several processing threads per section, each one responsible for
reading its own part of the file, with processing boundaries to be discovered
at reading time. Adding CE support in this case means managing two separate
thread pools per section, one responsible for split file reading and another
responsible for data buffering and routing (COPY to the partition instead of
to the parent table).

In both cases, maybe it would also be needed for pgloader to be able to have a
separate thread for COPYing the buffer to the server, allowing it to continue
preparing next buffer in the meantime?

This will need some re-architecting of pgloader, but it seems worth it
(I'm not entirely sold on the two-thread-pools idea, though, and this last
continue-reading-while-copying idea still has to be examined).
Some of the work that needs to be done is by now quite clear to me, but a part
of it still needs its design-time share. As usual though, the really hard part
is knowing exactly what we want to get done, and we're showing good progress
here :)

Greg's behavior:
max_threads           = N
max_parallel_sections = 1
section_threads       = -1
split_file_reading    = True

Simon's behaviour:
max_threads           = N
max_parallel_sections = 1   # I don't think Simon wants parallel sections
section_threads       = -1
split_file_reading    = False

Comments?
--
dim


Re: Benchmark Data requested --- pgloader CE design ideas

From
Greg Smith
Date:
On Thu, 7 Feb 2008, Dimitri Fontaine wrote:

> I was thinking of not even reading the file content from the controller
> thread, just decide splitting points in bytes (0..ST_SIZE/4 -
> ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
> beginning to process input after having read first newline, etc.

The problem I was pointing out is that if chunk#2 moved forward a few bytes
before it started reading in search of a newline, how will chunk#1 know
that it's supposed to read up to that further point?  You have to stop #1
from reading further when it catches up with where #2 started.  Since the
start of #2 is fuzzy until some reading is done, what you're describing
will need #2 to send some feedback to #1 after they've both started, and
that sounds bad to me.  I like designs where the boundaries between
threads are clearly defined before any of them start and none of them ever
talk to the others.

> In both cases, maybe it would also be needed for pgloader to be able to have a
> separate thread for COPYing the buffer to the server, allowing it to continue
> preparing next buffer in the meantime?

That sounds like a V2.0 design to me.  I'd only chase after that level of
complexity if profiling suggests that's where the bottleneck really is.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Benchmark Data requested --- pgloader CE design ideas

From
Matthew
Date:
On Thu, 7 Feb 2008, Greg Smith wrote:
> The problem I was pointing out is that if chunk#2 moved foward a few bytes
> before it started reading in search of a newline, how will chunk#1 know that
> it's supposed to read up to that further point?  You have to stop #1 from
> reading further when it catches up with where #2 started.  Since the start of
> #2 is fuzzy until some reading is done, what you're describing will need #2
> to send some feedback to #1 after they've both started, and that sounds bad
> to me.

It doesn't have to be fuzzy at all. Both threads will presumably be able
to use the same algorithm to work out where the boundary is, therefore
they'll get the same result. No need to pass back information.

Matthew

--
There is something in the lecture course which may not have been visible so
far, which is reality                   -- Computer Science Lecturer

Re: Benchmark Data requested --- pgloader CE design ideas

From
Kenneth Marshall
Date:
On Thu, Feb 07, 2008 at 12:06:42PM -0500, Greg Smith wrote:
> On Thu, 7 Feb 2008, Dimitri Fontaine wrote:
>
>> I was thinking of not even reading the file content from the controller
>> thread, just decide splitting points in bytes (0..ST_SIZE/4 -
>> ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
>> beginning to process input after having read first newline, etc.
>
> The problem I was pointing out is that if chunk#2 moved foward a few bytes
> before it started reading in search of a newline, how will chunk#1 know
> that it's supposed to read up to that further point?  You have to stop #1
> from reading further when it catches up with where #2 started.  Since the
> start of #2 is fuzzy until some reading is done, what you're describing
> will need #2 to send some feedback to #1 after they've both started, and
> that sounds bad to me.  I like designs where the boundaries between threads
> are clearly defined before any of them start and none of them ever talk to
> the others.
>

As long as both processes understand the start condition, there
is not a problem. p1 starts at the beginning and processes through the chunk2
offset until it reaches the start condition. p2 starts loading from the
chunk2 offset plus the amount needed to reach the start condition, ...

DBfile|---------------|--x--------------|x----------------|-x--|
      x chunk1----------->
                         x chunk2-------->
                                         x chunk3----------->...

As long as both pieces use the same test, they will each process
non-overlapping segments of the file and still process 100% of the
file.

Ken

>> In both cases, maybe it would also be needed for pgloader to be able to
>> have a
>> separate thread for COPYing the buffer to the server, allowing it to
>> continue
>> preparing next buffer in the meantime?
>
> That sounds like a V2.0 design to me.  I'd only chase after that level of
> complexity if profiling suggests that's where the bottleneck really is.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

Re: Benchmark Data requested --- pgloader CE design ideas

From
Mark Lewis
Date:
> > I was thinking of not even reading the file content from the controller
> > thread, just decide splitting points in bytes (0..ST_SIZE/4 -
> > ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
> > beginning to process input after having read first newline, etc.
>
> The problem I was pointing out is that if chunk#2 moved foward a few bytes
> before it started reading in search of a newline, how will chunk#1 know
> that it's supposed to read up to that further point?  You have to stop #1
> from reading further when it catches up with where #2 started.  Since the
> start of #2 is fuzzy until some reading is done, what you're describing
> will need #2 to send some feedback to #1 after they've both started, and
> that sounds bad to me.  I like designs where the boundaries between
> threads are clearly defined before any of them start and none of them ever
> talk to the others.

I don't think that any communication is needed beyond the beginning of
the threads.  Each thread knows that it should start at byte offset X
and end at byte offset Y, but if Y happens to be in the middle of a
record then just keep going until the end of the record.  As long as the
algorithm for reading past the end marker is the same as the algorithm
for skipping past the beginning marker then all is well.

-- Mark Lewis
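
Putting the rule both posts describe into a minimal Python sketch (file name
and chunk count are arbitrary): each worker skips the partial line at its
start offset unless it starts at byte 0, and keeps reading past its end
offset until the line in progress is finished, so applying the same test on
both sides gives non-overlapping, complete coverage of the file.

import os

def split_points(path, nchunks):
    """Evenly sized byte ranges covering the whole file."""
    size = os.path.getsize(path)
    return [(i * size // nchunks, (i + 1) * size // nchunks)
            for i in range(nchunks)]

def read_chunk(path, start, end):
    """Yield the complete lines that belong to the byte range [start, end].

    A line belongs to the worker in whose range it starts; the partial line
    at `start` is finished by the previous worker, and the line in progress
    at `end` is finished by this one.
    """
    with open(path, 'rb') as f:
        f.seek(start)
        if start != 0:
            f.readline()    # skip: the previous worker owns this line
        while f.tell() <= end:
            line = f.readline()
            if not line:    # end of file
                break
            yield line

# Each (start, end) range would be handed to its own loading thread/process:
# for start, end in split_points("bigfile.tbl", 4):
#     feed_copy(read_chunk("bigfile.tbl", start, end))
#     # feed_copy = whatever hands the lines to a COPY session (placeholder)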

Re: Benchmark Data requested

From
Mark Wong
Date:
On Mon, 04 Feb 2008 17:33:34 -0500
"Jignesh K. Shah" <J.K.Shah@Sun.COM> wrote:

> Hi Simon,
>
> I have some insight into TPC-H on how it works.
>
> First of all I think it is a violation of TPC rules to publish numbers
> without auditing them first.  So even if I do the test to show the
> better performance of PostgreSQL 8.3, I cannot post it here or any
> public forum without doing going through the "process". (Even though it
> is partial benchmark as they are just doing the equivalent of the
> PowerRun of TPCH) Maybe the PR of PostgreSQL team should email
> info@tpc.org about them and see what they have to say about that comparison.

I think I am qualified enough to say it is not a violation of TPC
fair-use policy if we scope the data as a measure of how PostgreSQL has
changed from 8.1 to 8.3 and refrain from comparing these results to what
any other database is doing.

The point is to measure PostgreSQL's progress, not to market it, correct?

Regards,
Mark

Re: Benchmark Data requested

From
Mark Wong
Date:
On Mon, 4 Feb 2008 15:09:58 -0500 (EST)
Greg Smith <gsmith@gregsmith.com> wrote:

> On Mon, 4 Feb 2008, Simon Riggs wrote:
>
> > Would anybody like to repeat these tests with the latest production
> > versions of these databases (i.e. with PGSQL 8.3)
>
> Do you have any suggestions on how people should run TPC-H?  It looked
> like a bit of work to sort through how to even start this exercise.

If you mean you want to get your hands on a kit, the one that Jenny and
I put together is here:

http://sourceforge.net/project/showfiles.php?group_id=52479&package_id=71458

I hear it still works. :)

Regards,
Mark

Dell Perc/6

From
Craig James
Date:
Does anyone have performance info about the new Dell Perc/6 controllers?  I found a long discussion ("Dell vs HP")
about the Perc/5, but nothing about Perc/6.  What's under the covers?

Here is the (abbreviated) info from Dell on this machine:

PowerEdge 1950 III        Quad Core Intel® Xeon® E5405, 2x6MB Cache, 2.0GHz, 1333MHz FSB
Additional Processors     Quad Core Intel® Xeon® E5405, 2x6MB Cache, 2.0GHz, 1333MHz FSB
Memory                    8GB 667MHz (4x2GB), Dual Ranked DIMMs
Hard Drive Configuration  Integrated SAS/SATA RAID 5, PERC 6/i Integrated

Thanks,
Craig

Re: Dell Perc/6

From
Greg Smith
Date:
On Tue, 12 Feb 2008, Craig James wrote:

> Does anyone have performance info about the new Dell Perc/6 controllers?  I
> found a long discussion ("Dell vs HP") about the Perc/5, but nothing about
> Perc/6.  What's under the covers?

The Perc/6i has an LSI Logic MegaRAID SAS 1078 chipset under the hood.  I
know the Linux drivers for the card seemed to stabilize around October,
there's a good sized list of compatible distributions on LSI's site.
FreeBSD support has some limitations but basically works.  I haven't seen
any benchmarks for the current version of the card yet.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Dell Perc/6

From
Tore Halset
Date:
Hello.

I think I started that discussion. We ended up buying a Dell 2900 with
PERC 6/i and 10 * 145GB SAS 3,5" 15KRpm discs. 6 of the SAS discs are
in a raid 10 for the database, 2 in a mirror for the wal and the last
2 in a mirror for the OS. We get 350MB/s writing and 380MB/s reading
to/from the raid 10 area using dd. The OS is Ubuntu and the filesystem
for the raid 10 is ext3.

The box is still under testing, but we plan to set it in production
this week.

Regards,
  - Tore.

On Feb 12, 2008, at 17:32 , Craig James wrote:

> Does anyone have performance info about the new Dell Perc/6
> controllers?  I found a long discussion ("Dell vs HP") about the
> Perc/5, but nothing about Perc/6.  What's under the covers?
>
> Here is the (abbreviated) info from Dell on this machine:
> PowerEdge 1950 III        Quad Core Intel® Xeon® E5405, 2x6MB Cache,
> 2.0GHz, 1333MHz FSB
> Additional Processors     Quad Core Intel® Xeon® E5405, 2x6MB Cache,
> 2.0GHz, 1333MHz FSB
> Memory                    8GB 667MHz (4x2GB), Dual Ranked DIMMs
> Hard Drive Configuration  Integrated SAS/SATA RAID 5, PERC 6/i
> Integrated
>
> Thanks,
> Craig
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>      subscribe-nomail command to majordomo@postgresql.org so that your
>      message can get through to the mailing list cleanly
>


Re: Dell Perc/6

From
Dave Cramer
Date:
On 13-Feb-08, at 5:02 AM, Tore Halset wrote:

> Hello.
>
> I think I started that discussion. We ended up buying a Dell 2900
> with PERC 6/i and 10 * 145GB SAS 3,5" 15KRpm discs. 6 of the SAS
> discs are in a raid 10 for the database, 2 in a mirror for the wal
> and the last 2 in a mirror for the OS. We get 350MB/s writing and
> 380MB/s reading to/from the raid 10 area using dd. The OS is Ubuntu
> and the filesystem for the raid 10 is ext3.
>
Wow that's fantastic. Just to be sure, did you make sure that you read
and wrote 2x memory to take the cache out of the measurement ?

Dave

Re: Dell Perc/6

From
"Merlin Moncure"
Date:
On Feb 13, 2008 5:02 AM, Tore Halset <halset@pvv.ntnu.no> wrote:
> Hello.
>
> I think I started that discussion. We ended up buying a Dell 2900 with
> PERC 6/i and 10 * 145GB SAS 3,5" 15KRpm discs. 6 of the SAS discs are
> in a raid 10 for the database, 2 in a mirror for the wal and the last
> 2 in a mirror for the OS. We get 350MB/s writing and 380MB/s reading
> to/from the raid 10 area using dd. The OS is Ubuntu and the filesystem
> for the raid 10 is ext3.

Those are decent numbers.  Can you do a bonnie++ run and post the
results (specifically interested in seeks)?

merlin

Re: Dell Perc/6

From
Tore Halset
Date:
On Feb 13, 2008, at 12:06, Dave Cramer wrote:

>
> On 13-Feb-08, at 5:02 AM, Tore Halset wrote:
>
>> Hello.
>>
>> I think I started that discussion. We ended up buying a Dell 2900
>> with PERC 6/i and 10 * 145GB SAS 3,5" 15KRpm discs. 6 of the SAS
>> discs are in a raid 10 for the database, 2 in a mirror for the wal
>> and the last 2 in a mirror for the OS. We get 350MB/s writing and
>> 380MB/s reading to/from the raid 10 area using dd. The OS is Ubuntu
>> and the filesystem for the raid 10 is ext3.
>>
> Wow that's fantastic. Just to be sure, did you make sure that you
> read and wrote 2x memory to take the cache out of the measurement ?
>
> Dave
>


The box has 16GB of RAM, but my original test file was only 25GB.
Sorry. Going to 33GB lowered the numbers for writing. Here you have
some samples.

% sh -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
32768000000 bytes (33 GB) copied, 103.722 seconds, 316 MB/s
32768000000 bytes (33 GB) copied, 99.669 seconds, 329 MB/s

% time dd if=bigfile of=/dev/null bs=8k
32768000000 bytes (33 GB) copied, 85.4235 seconds, 384 MB/s
32768000000 bytes (33 GB) copied, 85.4628 seconds, 383 MB/s

Regards,
  - Tore.

Re: Dell Perc/6

From
Tore Halset
Date:
On Feb 13, 2008, at 20:45, Tore Halset wrote:

> The box have 16GB of ram, but my original test file was only 25GB.
> Sorry. Going to 33GB lowered the numbers for writing. Here you have
> some samples.
>
> % sh -c "dd if=/dev/zero of=bigfile bs=8k count=4000000 && sync"
> 32768000000 bytes (33 GB) copied, 103.722 seconds, 316 MB/s
> 32768000000 bytes (33 GB) copied, 99.669 seconds, 329 MB/s
>
> % time dd if=bigfile of=/dev/null bs=8k
> 32768000000 bytes (33 GB) copied, 85.4235 seconds, 384 MB/s
> 32768000000 bytes (33 GB) copied, 85.4628 seconds, 383 MB/s
>
> Regards,
> - Tore.

And here are the bonnie++ numbers. I am a bonnie++ newbie so I ran it
with no options.

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
harteigen    32136M 83983  97 221757  40 106483  19 89603  97 268787  22 886.1   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
harteigen,32136M,83983,97,221757,40,106483,19,89603,97,268787,22,886.1,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++


Regards,
  - Tore.