Thread: Re: Configuration Advice

Re: Configuration Advice

From: Adam Rich
Doesn't sound like you want postgres at all.... Try mysql.



-----Original Message-----
From: "Steve" <cheetah@tanabi.org>
To: pgsql-performance@postgresql.org
Sent: 1/17/2007 2:41 PM
Subject: [PERFORM] Configuration Advice

Hey there;

I've been lurking on this list awhile, and I've been working with postgres
for a number of years so I'm not exactly new to this.  But I'm still
having trouble getting a good balance of settings and I'd like to see what
other people think.  We may also be willing to hire a contractor to help
tackle this problem if anyone is interested.

I've got an application here that runs large queries -- large in the
sense that they have a lot of conditions in them -- that can potentially
return millions of rows but on average probably return tens of thousands
of rows.  It's read-only for most of the day, and pretty much all the
queries except one are really fast.

However, each night we load data from a legacy COBOL system into the SQL
system and then summarize that data to make the reports faster.  This
load process is intensely insert/update driven but has a hefty amount of
selects as well, and it is taking ever longer to complete.


SO ... our goal here is to make this load process take less time.  It
seems the big part is building the big summary table; this big summary
table is currently 9 million rows big.  Every night, we drop the table,
re-create it, build the 9 million rows of data (we use COPY to put the
data in when it's prepared, not INSERT), and then build the indexes on it
-- of which there are many.  Unfortunately this table gets queried
in a lot of different ways and needs these indexes; also unfortunately, we
have operator class indexes to support both ASC and DESC sorting on
columns so these are for all intents and purposes duplicate but required
under Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this still
a requirement?)
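
For reference, the rebuild is essentially this shape (the table and
column names here are made up for illustration; the real table is much
wider and has many more indexes):

DROP TABLE summary;
CREATE TABLE summary (
    acct_id   integer,
    rpt_date  date,
    amount    numeric
);

-- the data is prepared outside the DB, then bulk-loaded;
-- COPY is much faster here than row-by-row INSERTs
COPY summary FROM '/tmp/summary_prepared.dat';

-- the (many) indexes are built afterwards, on the loaded table
CREATE INDEX summary_acct_idx ON summary (acct_id);
CREATE INDEX summary_date_idx ON summary (rpt_date);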

Building these indexes takes forever!  It's a long grind through the
inserts, and then building the indexes takes a hefty amount of time too
(about 9 hours).  Now, the application is likely partly at fault, and
we're working to make it more efficient, but that has nothing to do with
the index building time.  I'm wondering what we can do to make this
better, if anything; would it be better to leave the indexes on?  It
doesn't seem to be.  Would it be better to use INSERTs instead of COPY?
Doesn't seem to be.


Anyway -- ANYTHING we can do to make this go faster is appreciated :)
Here's some vital statistics:

- Machine is a 16 GB, 4-socket dual-core Opteron system using SCSI
discs.  The disc configuration seems to be a good one; it's the best of
all the ones we've tested so far.

- The load process itself takes about 6 gigs of memory; the rest is free
for postgres because this is basically all the machine does.

- If this was your machine and situation, how would you lay out the memory
settings?  What would you set the FSM to?  Would you leave the bgwriter on
or off?  We've already got FSYNC off because "data integrity" doesn't
matter -- this stuff is religiously backed up and we've got no problem
reinstalling it.  Besides, if anything ever takes this machine down, data
integrity of the DB will be the least of the worries :)

Do wal_buffers/full_page_writes matter if FSYNC is off?  If so, what
settings?  What about checkpoints?

And finally, any ideas on planner constants?  Here's what I'm using:

seq_page_cost = 0.5                     # measured on an arbitrary scale
random_page_cost = 1.0                  # same scale as above
cpu_tuple_cost = 0.001                  # same scale as above
cpu_index_tuple_cost = 0.0001           # same scale as above
cpu_operator_cost = 0.00025             # same scale as above
effective_cache_size = 679006

I really don't remember how I came up with that effective_cache_size
number (it's in 8 kB pages, so it works out to roughly 5 GB)....


Anyway... any advice would be appreciated :)


Steve



Re: Configuration Advice

From: Bricklen Anderson
Adam Rich wrote:
> Doesn't sound like you want postgres at all.... Try mysql.

Could you explain your reason for suggesting mysql? I'm simply curious
why you would offer that as a solution.

Re: Configuration Advice

From: "Joshua D. Drake"
Bricklen Anderson wrote:
> Adam Rich wrote:
>> Doesn't sound like you want postgres at all.... Try mysql.
>
> Could you explain your reason for suggesting mysql? I'm simply curious
> why you would offer that as a solution.

He sounds a little trollish to me. I would refer to the other, actually
helpful, posts on the topic.

Sincerely,

Joshua D. Drake




--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


Re: Configuration Advice

From: Bruno Wolff III
> From: "Steve" <cheetah@tanabi.org>
> To: pgsql-performance@postgresql.org
> Sent: 1/17/2007 2:41 PM
> Subject: [PERFORM] Configuration Advice
>
> SO ... our goal here is to make this load process take less time.  It
> seems the big part is building the big summary table; this big summary
> table is currently 9 million rows big.  Every night, we drop the table,
> re-create it, build the 9 million rows of data (we use COPY to put the
> data in when it's prepared, not INSERT), and then build the indexes on it
> -- of which there are many.  Unfortunately this table gets queried
> in a lot of different ways and needs these indexes; also unfortunately, we
> have operator class indexes to support both ASC and DESC sorting on
> columns so these are for all intents and purposes duplicate but required
> under Postgres 8.1 (we've recently upgraded to Postgres 8.2, is this still
> a requirement?)

Note that you only need to have the ASC and DESC versions of opclasses when
you are going to use multicolumn indexes with some columns in ASC order and
some in DESC order. For columns used by themselves in an index, you don't
need to do this, no matter which order you are sorting on.
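For example (a sketch on a made-up table; int4_desc_ops stands in for
whatever reverse-sort opclass you have defined):

-- a single-column index can be scanned in either direction,
-- so one index serves both orderings:
CREATE INDEX t_a_idx ON t (a);
SELECT * FROM t ORDER BY a ASC;   -- forward scan of t_a_idx
SELECT * FROM t ORDER BY a DESC;  -- backward scan of the same index

-- a mixed-order multicolumn sort is where the reverse opclass is needed:
CREATE INDEX t_a_b_idx ON t (a, b int4_desc_ops);
SELECT * FROM t ORDER BY a ASC, b DESC;  -- matches t_a_b_idx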

Re: Configuration Advice

From: "Adam Rich"
Sorry if this came off sounding trollish....  All databases have their
strengths & weaknesses, and I feel the responsible thing to do is
exploit those strengths where possible, rather than expend significant
time and effort coaxing one database to do something it wasn't designed
to.  There's just no such thing as "one size fits all".

I have professional experience with MS-SQL, Oracle, MySQL, and Postgres,
and the scenario described sounds more ideal for MySQL & MyISAM than
anything else:

1) No concerns for data loss (turning fsync & full_page_writes off)
since the data can be reloaded

2) No need for MVCC or transactions, since the database is read-only

3) No worries about lock contention

4) Complex queries that might take advantage of the MySQL "Query Cache"
since the base data never changes

5) Queries that might obtain data directly from indexes without having
to touch tables (again, no need for MVCC)

If loading in the base data and creating the summary table is taking
a lot of time, using MySQL with MyISAM tables (and binary logging
disabled) should provide significant time savings, and it doesn't
sound like there are any concerns about the downsides.
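
For instance, the nightly load could look something like this on the
MySQL side (just a sketch; the table definition is made up):

CREATE TABLE summary (
    acct_id  INT,
    rpt_date DATE,
    amount   DECIMAL(12,2)
) ENGINE=MyISAM;

-- MyISAM can defer non-unique index maintenance during the load
-- and rebuild those indexes in one pass afterwards:
ALTER TABLE summary DISABLE KEYS;
LOAD DATA INFILE '/tmp/summary_prepared.dat' INTO TABLE summary;
ALTER TABLE summary ENABLE KEYS;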

Yes, postgresql holds an edge over MySQL for heavy OLTP applications;
I use it for that and I love it.  But for the scenario the original
poster is asking about, MySQL/MyISAM is ideal.




-----Original Message-----
From: Bricklen Anderson [mailto:banderson@presinet.com]
Sent: Wednesday, January 17, 2007 3:29 PM
To: Adam Rich
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Configuration Advice


Adam Rich wrote:
> Doesn't sound like you want postgres at all.... Try mysql.

Could you explain your reason for suggesting mysql? I'm simply curious
why you would offer that as a solution.


Re: Configuration Advice

From: Steve
> Note that you only need to have the ASC and DESC versions of opclasses when
> you are going to use multicolumn indexes with some columns in ASC order and
> some in DESC order. For columns used by themselves in an index, you don't
> need to do this, no matter which order you are sorting on.
>

Yeah, I assumed the people 'in the know' on this kind of stuff would know
the details of why I have to have those, and therefore I wouldn't have to
go into detail as to why -- but you put your finger right on it. :)
Unfortunately the customer this is for wants certain columns joined at the
hip for querying and sorting, and this method was a performance godsend
when we implemented it (with a C .so library, not using SQL in our
opclasses or anything like that).


Steve

Re: Configuration Advice

From: Steve
This would probably also be another last-ditch option. :)  Our stuff is
designed to pretty much work on any DB, but there are postgres-specific
things in there... not to mention ramp-up time on MySQL.  I mean, I know
MySQL from a user point of view and in a very limited way
administratively, but I'd be back to square one on learning performance
stuff :)

Anyway -- I'll listen to what people have to say, and keep this in mind.
It would be an interesting test to take parts of the process and compare
at least, if not converting the whole thing.

talk to you later,

Steve

On Wed, 17 Jan 2007, Adam Rich wrote:

>
> Sorry if this came off sounding trollish....  All databases have their
> strengths & weaknesses, and I feel the responsible thing to do is
> exploit those strengths where possible, rather than expend significant
> time and effort coaxing one database to do something it wasn't designed
> to.  There's just no such thing as "one size fits all".
>
> I have professional experience with MS-SQL, Oracle, MySQL, and Postgres,
> and the scenario described sounds more ideal for MySQL & MyISAM than
> anything else:
>
> 1) No concerns for data loss (turning fsync & full_page_writes off)
> since the data can be reloaded
>
> 2) No need for MVCC or transactions, since the database is read-only
>
> 3) No worries about lock contention
>
> 4) Complex queries that might take advantage of the MySQL "Query Cache"
> since the base data never changes
>
> 5) Queries that might obtain data directly from indexes without having
> to touch tables (again, no need for MVCC)
>
> If loading in the base data and creating the summary table is taking
> a lot of time, using MySQL with MyISAM tables (and binary logging
> disabled) should provide significant time savings, and it doesn't
> sound like there are any concerns about the downsides.
>
> Yes, postgresql holds an edge over MySQL for heavy OLTP applications;
> I use it for that and I love it.  But for the scenario the original
> poster is asking about, MySQL/MyISAM is ideal.
>
>
>
>
> -----Original Message-----
> From: Bricklen Anderson [mailto:banderson@presinet.com]
> Sent: Wednesday, January 17, 2007 3:29 PM
> To: Adam Rich
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Configuration Advice
>
>
> Adam Rich wrote:
>> Doesn't sound like you want postgres at all.... Try mysql.
>
> Could you explain your reason for suggesting mysql? I'm simply curious
> why you would offer that as a solution.
>
>
>

Re: Configuration Advice

From: Arjen van der Meijden
On 18-1-2007 0:37 Adam Rich wrote:
> 4) Complex queries that might take advantage of the MySQL "Query Cache"
> since the base data never changes

Have you ever compared MySQL's performance with complex queries to
PostgreSQL's? I once had a query which would operate on a record list and
see whether there were any gaps larger than 1 between consecutive
primary keys.

Normally that information isn't very useful, but this time it was.
Since the data was in MySQL I tried several variations of queries in
MySQL... After ten minutes or so I gave up waiting, but left my last
version running. In the meantime I dumped the data, reloaded it in
PostgreSQL and ran some test queries there. I soon came up with a query
that took only 0.5 seconds on Postgres. The query on MySQL still wasn't
finished...
In my experience it is (even with the 5.0 release) easier to get good
performance from complex queries in postgresql. And postgresql gives you
more useful information on why a query takes a long time when using
explain (analyze). There are some drawbacks too of course, but while we
in our company use mysql, I switched to postgresql for some read-only
complex query stuff just for its performance...

Besides that, mysql rewrites the entire table for most table-altering
statements you do (including indexes). For small tables that's no issue,
but if you somehow can't add all your indexes in a single statement to a
table, you'll be waiting much longer for new indexes than with
postgresql. And that situation isn't so unusual if you think of a query
which needs an index that isn't there yet. That's apart from the fact
that it doesn't have functional indexes and such.

Long story short: MySQL still isn't the best performer when looking at
the more complex queries. I've seen performance which made me assume it
can't optimise sequential scans (when it is forced to loop using a seq
scan, it appears to do a new seq scan for each round in the loop...) and
various other cases that PostgreSQL can execute much more efficiently.

So unless you run the same queries many times and know of a way to get
them fast enough the first time, the query cache is not much of a help.

Best regards,

Arjen

Re: Configuration Advice

From: Scott Marlowe
On Thu, 2007-01-18 at 04:24, Arjen van der Meijden wrote:
> On 18-1-2007 0:37 Adam Rich wrote:
> > 4) Complex queries that might take advantage of the MySQL "Query Cache"
> > since the base data never changes
>
> Have you ever compared MySQL's performance with complex queries to
> PostgreSQL's? I once had a query which would operate on a record list and
> see whether there were any gaps larger than 1 between consecutive
> primary keys.
>
> Normally that information isn't very useful, but this time it was.
> Since the data was in MySQL I tried several variations of queries in
> MySQL... After ten minutes or so I gave up waiting, but left my last
> version running. In the meantime I dumped the data, reloaded it in
> PostgreSQL and ran some test queries there. I soon came up with a query
> that took only 0.5 seconds on Postgres. The query on MySQL still wasn't
> finished...

I have had similar experiences in the past.  Conversely, I've had
similar things happen the other way around.  The biggest difference?  If
I report something like that happening in postgresql, it's easier to get
a fix or workaround, and if it's a code bug, the fix is usually released
as a patch within a day or two.  With MySQL, if it's a common problem,
then I can find it on the internet with google, otherwise it might take
a while to get a good workaround / fix.  And if it's a bug, it might
take much longer to get a working patch.

> In my experience it is (even with the 5.0 release) easier to get good
> performance from complex queries in postgresql.

Agreed.  For data warehousing / OLAP stuff, postgresql is generally
better than mysql.

> Besides that, mysql rewrites the entire table for most table-altering
> statements you do (including indexes). For small tables that's no issue,
> but if you somehow can't add all your indexes in a single statement to a
> table, you'll be waiting much longer for new indexes than with
> postgresql. And that situation isn't so unusual if you think of a query
> which needs an index that isn't there yet. That's apart from the fact
> that it doesn't have functional indexes and such.

Note that this applies to the myisam table type.  innodb works quite
differently.  It is more like pgsql in behaviour, and is an mvcc storage
engine.  Like all storage engines, it's a collection of compromises.  In
some areas it's better than pgsql, in some areas worse.  Sadly, it lives
under the hood of a database that can do some pretty stupid things, like
ignore column-level constraint definitions without telling you.
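
For example, this is one of the classic gotchas (MySQL 5.x parses the
CHECK clause but silently throws it away):

CREATE TABLE t (age INT CHECK (age >= 0)) ENGINE=InnoDB;
INSERT INTO t VALUES (-5);  -- succeeds; no error, no warning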

> Long story short: MySQL still isn't the best performer when looking at
> the more complex queries.

agreed.  And those are the queries that REALLY kick your ass.  Or your
server's ass, I guess.

Re: Configuration Advice

From: Arjen van der Meijden
On 18-1-2007 17:20 Scott Marlowe wrote:
>> Besides that, mysql rewrites the entire table for most table-altering
>> statements you do (including indexes).
>
> Note that this applies to the myisam table type.  innodb works quite
> differently.  It is more like pgsql in behaviour, and is an mvcc storage

Afaik this is not engine-specific and also applies to InnoDB. Here is
what the MySQL manual says about it:
"In most cases, ALTER TABLE works by making a temporary copy of the
original table. The alteration is performed on the copy, and then the
original table is deleted and the new one is renamed. While ALTER TABLE
  is executing, the original table is readable by other clients. Updates
and writes to the table are stalled until the new table is ready, and
then are automatically redirected to the new table without any failed
updates."

http://dev.mysql.com/doc/refman/5.0/en/alter-table.html

If it were myisam-only, they sure would've mentioned that. Besides, this
is the behaviour we've seen on our site as well.

Since 'create index' is also an alter table statement for mysql, this
also applies to adding indexes.
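
Which is why, if you need several new indexes on a big mysql table, it
pays to batch them into one statement so the table is rewritten only
once (a sketch; the column names are made up):

ALTER TABLE members
   ADD INDEX members_a_idx (a),
   ADD INDEX members_b_idx (b),
   ADD INDEX members_c_idx (c);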

Best regards,

Arjen


Re: Configuration Advice

From: "Jeremy Haile"
> I once had a query which would operate on a record list and
> see whether there were any gaps larger than 1 between consecutive
> primary keys.

Would you mind sharing the query you described?  I am attempting to do
something similar now.

Re: Configuration Advice

From: Arjen van der Meijden
On 18-1-2007 18:28 Jeremy Haile wrote:
>> I once had a query which would operate on a record list and
>> see whether there were any gaps larger than 1 between consecutive
>> primary keys.
>
> Would you mind sharing the query you described?  I am attempting to do
> something similar now.


Well, it was over a year ago, so I don't know what I did back then. But
since it was a query adjusted from what I did in MySQL, there were no
subqueries involved. I think it was something like this:
select a.id, min(b.id)
  from
   members a
   join members b on a.id < b.id
   left join members c on a.id +1 = c.id
where c.id IS NULL
group by a.id;

Rewriting it to this one halves the execution time, though:

select a.id, min(b.id)
  from
   members a
   left join members c on a.id +1 = c.id
   join members b on a.id < b.id
where c.id IS NULL
group by a.id;

Although this query seems to be much faster with 150k records:

select aid, bid
from (
   select a.id as aid,
          (select min(b.id) from members b where b.id > a.id) as bid
     from members a
    group by a.id
) as foo
where bid > aid + 1;

The first one takes about 16 seconds on my system with PG 8.2, the
second about 1.8 seconds. But back then the list was much shorter, so it
could have been the first one or a variant on that. On MySQL the first
takes much more than the 16 seconds PostgreSQL uses, and after editing
this e-mail it still isn't finished... The second one made EXPLAIN hang
in my 5.0.32-bk, so I didn't try that for real.

Best regards,

Arjen

PS, In case any of the planner-hackers are reading, here are the plans
of the first two queries, just to see if something can be done to
decrease the differences between them. The main difference seems to be
the GroupAggregate vs. the HashAggregate?

  GroupAggregate  (cost=34144.16..35144.38 rows=50011 width=8) (actual time=17653.401..23881.320 rows=71 loops=1)
    ->  Sort  (cost=34144.16..34269.19 rows=50011 width=8) (actual time=17519.274..21423.128 rows=7210521 loops=1)
          Sort Key: a.id
          ->  Nested Loop  (cost=11011.41..30240.81 rows=50011 width=8) (actual time=184.412..10945.189 rows=7210521 loops=1)
                ->  Hash Left Join  (cost=11011.41..28739.98 rows=1 width=4) (actual time=184.384..1452.467 rows=72 loops=1)
                      Hash Cond: ((a.id + 1) = c.id)
                      Filter: (c.id IS NULL)
                      ->  Seq Scan on members a  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.009..71.463 rows=150033 loops=1)
                      ->  Hash  (cost=9903.33..9903.33 rows=150033 width=4) (actual time=146.040..146.040 rows=150033 loops=1)
                            ->  Seq Scan on members c  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.002..77.066 rows=150033 loops=1)
                ->  Index Scan using members_pkey on members b  (cost=0.00..875.69 rows=50011 width=4) (actual time=0.025..78.971 rows=100146 loops=72)
                      Index Cond: (a.id < b.id)
  Total runtime: 23882.511 ms
(13 rows)

  HashAggregate  (cost=30240.82..30240.83 rows=1 width=8) (actual time=12870.440..12870.504 rows=71 loops=1)
    ->  Nested Loop  (cost=11011.41..30240.81 rows=1 width=8) (actual time=168.658..9466.644 rows=7210521 loops=1)
          ->  Hash Left Join  (cost=11011.41..28739.98 rows=1 width=4) (actual time=168.630..865.690 rows=72 loops=1)
                Hash Cond: ((a.id + 1) = c.id)
                Filter: (c.id IS NULL)
                ->  Seq Scan on members a  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.012..70.612 rows=150033 loops=1)
                ->  Hash  (cost=9903.33..9903.33 rows=150033 width=4) (actual time=140.432..140.432 rows=150033 loops=1)
                      ->  Seq Scan on members c  (cost=0.00..9903.33 rows=150033 width=4) (actual time=0.003..76.709 rows=150033 loops=1)
          ->  Index Scan using members_pkey on members b  (cost=0.00..875.69 rows=50011 width=4) (actual time=0.023..73.317 rows=100146 loops=72)
                Index Cond: (a.id < b.id)
  Total runtime: 12870.756 ms
(11 rows)

Re: Configuration Advice

From: Tom Lane
Arjen van der Meijden <acmmailing@tweakers.net> writes:
> PS, In case any of the planner-hackers are reading, here are the plans
> of the first two queries, just to see if something can be done to
> decrease the differences between them.

Increase work_mem?  It's not taking the hash because it thinks it won't
fit in memory ...

There is a bug here, I'd say: the rowcount estimate ought to be the same
either way.  Dunno why it's not, but will look --- I see the same
misbehavior with a toy table here.

            regards, tom lane

Re: Configuration Advice

From: Arjen van der Meijden
On 18-1-2007 23:11 Tom Lane wrote:
> Increase work_mem?  It's not taking the hash because it thinks it won't
> fit in memory ...

When I increase it to 128MB in the session (an arbitrarily selected,
relatively large value), it indeed uses the other plan.
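
I.e. simply:

SET work_mem = '128MB';  -- session-local; the server-wide default is untouched

and re-running the EXPLAIN ANALYZE then shows the HashAggregate plan.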

Best regards,

Arjen