Thread: Speed Question
Hello,

We are considering switching our systems over from MySQL to PostgreSQL. Speed is one of our major concerns, so before switching we've decided to perform some speed tests. From what I understand, PostgreSQL is NOT as fast as MySQL, but should be close enough.

We've installed the software and have run some basic insert, index and query tests that seem ridiculously slow. I can't help thinking that we are doing something wrong, or don't have things configured for optimal performance. We've performed these same tests on MySQL and they run dramatically faster.

Here are the initial performance test results and issues...

Table configuration:
  speedtest( prop1 integer, prop2 integer, prop3 integer, prop4 integer);
  indexes on each of the four individual property fields

Each record consists of four random integers, uniformly distributed, between 0 and 1000. The integers are computed in the perl script used to populate the table, not using an SQL random() function.

Hardware configuration: P3-500, 384MB RAM, *unloaded* system.
Software configuration: Linux 2.4.20, reiserfs, standard Slackware install.

Issue #1: Speed of inserts is relatively slow. 100000 inserts is taking roughly 10 minutes. This isn't EVIL, but MySQL appears to be about ten times faster here. Is there something we could do to the indexes differently? Disable transactions? Is there a more "raw" insert, which may not set off triggers?

Issue #2: It doesn't appear as though multiple indexes are being used. i.e.:

  select count(*) from speedtest
  where (prop1 between 100 and 200)
    and (prop2 between 100 and 200)
    and (prop3 between 100 and 200)
    and (prop4 between 100 and 200)

formulates a query plan that only uses one index. The following is pasted from the 'explain select' ---

 Aggregate  (cost=17.16..17.16 rows=1 width=0)
   ->  Index Scan using p4 on speedtest  (cost=0.00..17.16 rows=1 width=0)
         Index Cond: ((prop4 >= 100) AND (prop4 <= 200))
         Filter: ((prop1 >= 100) AND (prop1 <= 200) AND (prop2 >= 100) AND (prop2 <= 200) AND (prop3 >= 100) AND (prop3 <= 200))
(4 rows)

It appears as though the index on prop4 is being used to determine a subset of records to fetch, which are subsequently filtered with the other conditions. Unfortunately, since the index condition matches 10% of the table (due to the random uniform integers from 0-1000), this results in a large number of record fetches and examinations the db engine must make. This query takes at least a second to execute, whereas we would like to be able to drop this into the sub-0.1 second range, and preferably into the millisecond range. While this would run faster on the production machines than on my workstation, it is still a fundamental flaw that multiple indexes aren't being combined to restrict the record set to fetch. OTOH, if we could do index combining, we could fetch 10% of 10% of 10% of the initial 10% of records... resulting in a microscopic number of items to retrieve and examine.

Can anybody give me some ideas as to what I am doing wrong???

Thanks,

-Noah
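For reference, the setup described above corresponds roughly to the following DDL. This is a sketch: only the index name p4 appears in the EXPLAIN output, so the other index names are assumed to follow the same pattern.

  CREATE TABLE speedtest (
      prop1 integer,
      prop2 integer,
      prop3 integer,
      prop4 integer
  );

  -- one single-column index per property field
  CREATE INDEX p1 ON speedtest (prop1);
  CREATE INDEX p2 ON speedtest (prop2);
  CREATE INDEX p3 ON speedtest (prop3);
  CREATE INDEX p4 ON speedtest (prop4);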
On Fri, 20 Dec 2002, Noah Silverman wrote:

> Issue #1: Speed of inserts is relatively slow. 100000 inserts is taking
> roughly 10 minutes. This isn't EVIL, but MySQL appears to be about
> ten times faster here. Is there something we could do to the indexes
> differently? Disable transactions? Is there a more "raw" insert, which
> may not set off triggers?

Are you doing these in a transaction? If not, then try adding a begin;end; pair around your inserts, i.e.:

begin;
insert 100000 rows
end;

That should help.

Reading the rest of your message, it appears there are two issues here.

One is you might get some help from a multi-column index. Further, have you run analyze on your database? Have you read the administrative docs yet? There's lots more good stuff in there too. These are the basics.

The other issue is the assumption that indexes are ALWAYS faster, which they aren't. If the query planner thinks it's gonna grab some significant portion of a table, it will just grab the whole thing instead of using an index, which makes a certain amount of sense. To reduce the likelihood of the planner picking a sequential scan, change random_page_cost from the default 4 to something lower. A 1 means that the cost of grabbing a page randomly is the same as grabbing it sequentially, which shouldn't be possible, but is, if the data is all in memory.

Next, use EXPLAIN ANALYZE to get an output of both what the query planner THOUGHT it was going to do, and what the query actually did, in terms of time to execute.

Let us know how it all turns out.
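A minimal sketch of those suggestions in a psql session against the speedtest table; the inserted values and the random_page_cost setting of 2 are only illustrations, to be tuned against your own hardware:

  BEGIN;
  INSERT INTO speedtest VALUES (1, 2, 3, 4);
  -- ... many more inserts in the same transaction ...
  COMMIT;

  -- refresh the planner's statistics after the bulk load
  ANALYZE speedtest;

  -- make random page fetches look cheaper to the planner (default is 4)
  SET random_page_cost = 2;

  -- compare the planner's estimates with the actual execution times
  EXPLAIN ANALYZE
  SELECT count(*) FROM speedtest
  WHERE prop1 BETWEEN 100 AND 200
    AND prop2 BETWEEN 100 AND 200;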
"scott.marlowe" <scott.marlowe@ihs.com> writes: > On Fri, 20 Dec 2002, Noah Silverman wrote: >> Issue #1: Speed of inserts is relatively slow. 100000 inserts is > Are you doing these in a transaction? If not, then try adding a > begin;end; pair around your inserts. i.e. > begin; > insert 100000 rows > end; Or use a COPY command instead of retail inserts. See also the tips at http://www.ca.postgresql.org/users-lounge/docs/7.3/postgres/populate.html > One is you might get some help from a multi-column index. Yes, I'd recommend a multi-column index when no single column is particularly selective. regards, tom lane
Noah,

> Speed is one of our major concerns, so before switching we've decided
> to perform some speed tests.
> From what I understand, PostgreSQL is NOT as fast as MySQL,

This is a PR myth spread by MySQL AB. The truth is:

1) PostgreSQL, unconfigured and not optimized, is indeed slower than MySQL out-of-the-box. MySQL is meant to be idiot-proof; PostgreSQL is not, intentionally.

2) Nobody has yet come up with a database benchmark that both MySQL AB and the PostgreSQL team are willing to accept; depending on whose benchmark you use, either could be faster -- and neither benchmark may approximate your setup.

> We've installed the software and have run some basic insert, index and
> query tests that seem ridiculously slow. I can't help thinking that we
> are doing something wrong, or don't have things configured for optimal
> performance.

Almost undoubtedly. Have you modified the postgresql.conf file at all? Where are your database files located on disk? How are you constructing your queries?

> We've performed these same tests on MySQL and they run dramatically
> faster.

Without transactions? Sure. Turn off transaction logging, and PostgreSQL runs faster, too.

> Here are the initial performance test results and issues...
>
> Table configuration:
> speedtest( prop1 integer, prop2 integer, prop3 integer, prop4 integer);
> indexes on each of the four individual property fields
>
> Each record consists of four random integers, uniformly distributed,
> between 0 and 1000. The integers are computed in the perl script
> used to populate the table, not using an SQL random() function.
>
> Hardware configuration: P3-500, 384MB RAM, *unloaded* system.
> Software configuration: Linux 2.4.20, reiserfs, standard Slackware install.

You haven't mentioned your PostgreSQL memory settings, by which I assume that you haven't configured them. This is very important.

> Issue #1: Speed of inserts is relatively slow. 100000 inserts is taking
> roughly 10 minutes. This isn't EVIL, but MySQL appears to be about
> ten times faster here. Is there something we could do to the indexes
> differently? Disable transactions? Is there a more "raw" insert, which
> may not set off triggers?

Bundle them in a single transaction. Move pg_xlog to a separate drive from the database.

> Issue #2: It doesn't appear as though multiple indexes are being used.
> i.e.: select count(*) from speedtest where (prop1 between 100 and 200)
> and (prop2 between 100 and 200) and (prop3 between 100 and 200)
> and (prop4 between 100 and 200) formulates a query plan that only
> uses one index. The following is pasted from the 'explain select' ---

That's correct; Postgres will only use a single index on this query. If you want to reference all columns, create a multi-column index. Note, however, that Postgres is likely to reject the index, as it is just as large as the table; in this way, your test is insufficiently like real data.

Good luck. Why not use the Open Database Benchmark for testing, instead of inventing your own?
http://www.sf.net/projects/osdb

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
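A multi-column index on this test table would look something like the sketch below; the index name is made up, and whether the planner can actually use it depends on having selective conditions on the leading columns:

  -- one composite index covering all four properties
  CREATE INDEX speedtest_all_props ON speedtest (prop1, prop2, prop3, prop4);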
First: THANK YOU everyone for all your suggestions. I've discovered the "copy from" command and it helps a lot.

Right now, we just ran a test on 1MM rows with 4 columns and it is very fast with a 4 column index. Works well.

Now we are creating more of a real world example: 10MM rows with 32 columns of integers. I'm loading up the data now, and will create a multi-column index (on all 32) after the data is loaded.

From everyone's responses I understand that we really need to tune the system to get optimal performance. I would love to do this, but don't really know where to start. Below are our system stats if anyone wants to suggest some settings:

2x AMD 2100MP CPU
2 GB RAM
Data - 350GB on a RAID5 card

Note: We will probably NEVER use transactions, so turning off that feature would be fine if it would help, and we knew how. Our data is probably only going to take up 20% MAXIMUM of our RAID. Consequently, we have no problem trading a little extra space for better performance.

BTW - is there any kind of "describe table" and/or "show index" function in pgsql? I've gotten very used to them in MySQL, but they don't work here. There must be some way. I've RTFM, but can't find anything. Help.

THANKS AGAIN,

-Noah
> BTW - is there any kind of "describe table" and/or "show index"
> function in pgsql? I've gotten very used to them in MySQL, but they
> don't work here. There must be some way. I've RTFM, but can't find
> anything. Help.

In psql use "\d tablename". Do a "\?" for a quick overview and "man psql" for lots of stuff.

-philip
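For example, from a psql session (output omitted):

  -- columns, types, and indexes for one table, roughly MySQL's DESCRIBE plus SHOW INDEX
  \d speedtest

  -- list all indexes in the current database
  \di

  -- summary of every backslash command
  \?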
On Fri, 20 Dec 2002 19:10:49 -0500, Noah Silverman <noah@allresearch.com> wrote:
>Now we are creating more of a real world example: 10MM rows with 32
>columns of integers. I'm loading up the data now, and will create a
>multi-column index (on all 32) after the data is loaded.

If a table with a 32 column key and no dependent attributes is a real world example, I'd like to see your use case ;-)

An index on c1, c2, ..., cn will only help if your search criteria contain (strict) conditions on the leading index columns, e.g.
	WHERE c1 = ... AND c2 = ... AND c3 BETWEEN ... AND ...

It won't help for
	WHERE c22 = ...

> From everyone's responses I understand that we really need to tune [...]
>2x AMD 2100MP CPU
>2 GB RAM
>Data - 350GB on a RAID5 card

It all depends on your application, but looking at SHARED_BUFFERS, EFFECTIVE_CACHE_SIZE, SORT_MEM, MAX_FSM_RELATIONS, and MAX_FSM_PAGES might be a good start. Later you might want to use CPU_*_COST, RANDOM_PAGE_COST, and various WAL settings to fine tune your system.

>Note: We will probably NEVER use transactions,

Oh yes, you will. You have no other choice. If you don't enclose (several) statements between BEGIN and COMMIT, every statement is automatically wrapped into its own transaction.

It helps performance and consistency if *you* control transactions.

Servus
 Manfred
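As a rough illustration only, a starting point for those parameters on a dedicated 2 GB machine might look like the postgresql.conf lines below; the specific numbers are assumptions to be benchmarked against your own workload, and the units are those of the 7.3-era configuration file:

  # postgresql.conf -- illustrative values, not a recommendation
  shared_buffers = 16384          # 8 kB buffers, roughly 128 MB
  effective_cache_size = 131072   # 8 kB pages, assumes ~1 GB of OS file cache
  sort_mem = 8192                 # kB per sort operation, per backend
  max_fsm_relations = 1000
  max_fsm_pages = 100000
  random_page_cost = 2            # default is 4; lower favors index scans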
Thanks for the help. We've been using MySQL for the last 4 years, so PgSQL is a whole new world for me. Lots to learn.

Actually the "real world" test we are performing is an exact duplication of our intended use. Without divulging too many company secrets, we create a 32 key profile of an object. We then have to be able to search the database to find "similar" objects. In reality, we will probably have 20MM to 30MM rows in our table. I need to very quickly find the matching records on a "test" object. If you're really curious as to more details, let me know (I don't want to bore the group with our specifics).

Since this machine is solely a database server, I want to utilize a ton of RAM to help things along. Probably at least 1.5 Gigs worth. I guess my next step is to try and figure out what all the various memory settings are and where to set them.

Thanks,

-N

On Saturday, December 21, 2002, at 07:21 AM, Manfred Koizar wrote:

> On Fri, 20 Dec 2002 19:10:49 -0500, Noah Silverman
> <noah@allresearch.com> wrote:
>> Now we are creating more of a real world example: 10MM rows with 32
>> columns of integers. I'm loading up the data now, and will create a
>> multi-column index (on all 32) after the data is loaded.
>
> If a table with a 32 column key and no dependent attributes is a real
> world example, I'd like to see your use case ;-)
>
> An index on c1, c2, ..., cn will only help if your search criteria
> contain (strict) conditions on the leading index columns, e.g.
> WHERE c1 = ... AND c2 = ... AND c3 BETWEEN ... AND ...
>
> It won't help for
> WHERE c22 = ...
>
>> From everyone's responses I understand that we really need to tune
>> [...]
>> 2x AMD 2100MP CPU
>> 2 GB RAM
>> Data - 350GB on a RAID5 card
>
> It all depends on your application, but looking at SHARED_BUFFERS,
> EFFECTIVE_CACHE_SIZE, SORT_MEM, MAX_FSM_RELATIONS, and MAX_FSM_PAGES
> might be a good start. Later you might want to use CPU_*_COST,
> RANDOM_PAGE_COST, and various WAL settings to fine tune your system.
>
>> Note: We will probably NEVER use transactions,
>
> Oh yes, you will. You have no other choice. If you don't enclose
> (several) statements between BEGIN and COMMIT, every statement is
> automatically wrapped into its own transaction.
>
> It helps performance and consistency if *you* control transactions.
>
> Servus
> Manfred
On Sat, 21 Dec 2002 13:46:05 -0500, Noah Silverman <noah@allresearch.com> wrote:
>Without divulging too many company
>secrets, we create a 32 key profile of an object. We then have to be
>able to search the database to find "similar" objects.

... where "similar" means that the value of each attribute lies within a small range around the value of the corresponding attribute of the reference object?

I fear a multicolumn b-tree index is not the optimal solution to this problem, unless you have some extremely selective attributes you can put at the start of the index. But then again I doubt that it makes sense to include even the last attribute (or the last few attributes) into the index.

>In reality, we
>will probably have 20MM to 30MM rows in our table. I need to very
>quickly find the matching records on a "test" object.

This seems to be a nice case for utilizing bitmaps for index scans. Thus you would scan several single column indices and combine the bitmaps before accessing the heap tuples. This has been discussed on -hackers and I believe it is a todo item.

I don't know whether GiST or R-Tree could help. Is anybody listening who knows?

>If you're really curious as to more details, let me know (I don't want
>to bore the group with our specifics)

The group is patient :-)

Servus
 Manfred
You are correct. "Similar" means within a small range. Below is a sample query:

select count(*) from speedtest where
(p1 between 209 and 309) and (p2 between 241 and 341) and
(p3 between 172 and 272) and (p4 between 150 and 250) and
(p5 between 242 and 342) and (p6 between 222 and 322) and
(p7 between 158 and 258) and (p8 between 249 and 349) and
(p9 between 162 and 262) and (p10 between 189 and 289) and
(p11 between 201 and 301) and (p12 between 167 and 267) and
(p13 between 167 and 267) and (p14 between 229 and 329) and
(p15 between 235 and 335) and (p16 between 190 and 290) and
(p17 between 240 and 340) and (p18 between 156 and 256) and
(p19 between 150 and 250) and (p20 between 171 and 271) and
(p21 between 241 and 341) and (p22 between 244 and 344) and
(p23 between 219 and 319) and (p24 between 198 and 298) and
(p25 between 196 and 296) and (p26 between 243 and 343) and
(p27 between 160 and 260) and (p28 between 151 and 251) and
(p29 between 226 and 326) and (p30 between 168 and 268) and
(p31 between 153 and 253) and (p32 between 218 and 318)

Currently, on an un-tuned installation, this query takes about 1 second. Much too slow for our needs. We need to be able to execute about 30-50 per second.

I'm not a database expert. There is probably a better way to do this, but I have no idea how. The general use of this table is as an index for document storage. When we come across a new document, we have to know if we already have something close to it. Exact checksums don't work because two documents with only a few different words are still "the same" for our intended use. We calculate 32 separate checksums on parts of each document. By storing all 32, we have a good representation of each document. A new document can then very quickly be checked against the table to see if we already have something close to it.

If anybody has any better ideas, I would love to hear it...

-N

On Saturday, December 21, 2002, at 03:02 PM, Manfred Koizar wrote:

> On Sat, 21 Dec 2002 13:46:05 -0500, Noah Silverman
> <noah@allresearch.com> wrote:
>> Without divulging too many company
>> secrets, we create a 32 key profile of an object. We then have to be
>> able to search the database to find "similar" objects.
>
> ... where "similar" means that the value of each attribute lies within
> a small range around the value of the corresponding attribute of the
> reference object?
>
> I fear a multicolumn b-tree index is not the optimal solution to this
> problem, unless you have some extremely selective attributes you can
> put at the start of the index. But then again I doubt that it makes
> sense to include even the last attribute (or the last few attributes)
> into the index.
>
>> In reality, we
>> will probably have 20MM to 30MM rows in our table. I need to very
>> quickly find the matching records on a "test" object.
>
> This seems to be a nice case for utilizing bitmaps for index scans.
> Thus you would scan several single column indices and combine the
> bitmaps before accessing the heap tuples. This has been discussed on
> -hackers and I believe it is a todo item.
>
> I don't know whether GiST or R-Tree could help. Is anybody listening
> who knows?
>
>> If you're really curious as to more details, let me know (I don't want
>> to bore the group with our specifics)
>
> The group is patient :-)
>
> Servus
> Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> ... where "similar" means that the value of each attribute lies within
> a small range around the value of the corresponding attribute of the
> reference object?

> I don't know whether GiST or R-Tree could help.

If the problem is multidimensional range search then GIST might be just the ticket. I am not sure if you'd need to do any coding though. It looks like contrib/btree_gist provides the necessary operator class, but only for int4 and timestamp datatypes.

I think that our r-tree code is restricted to two-dimensional indexing, so it wouldn't help.

			regards, tom lane
Does anyone know how/where I can find the contrib/btree_gist stuff, how to use it, and whether there are docs for it?

Thanks,

-N

On Saturday, December 21, 2002, at 03:28 PM, Tom Lane wrote:

> Manfred Koizar <mkoi-pg@aon.at> writes:
>> ... where "similar" means that the value of each attribute lies within
>> a small range around the value of the corresponding attribute of the
>> reference object?
>
>> I don't know whether GiST or R-Tree could help.
>
> If the problem is multidimensional range search then GIST might be just
> the ticket. I am not sure if you'd need to do any coding though. It
> looks like contrib/btree_gist provides the necessary operator class,
> but only for int4 and timestamp datatypes.
>
> I think that our r-tree code is restricted to two-dimensional indexing,
> so it wouldn't help.
>
> regards, tom lane
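The contrib modules ship inside the PostgreSQL source tarball under contrib/, and the README in each module's directory is its documentation. The steps below are a sketch of the usual 7.3-era routine; file locations, the database name, and the index/column names are assumptions, and the README.btree_gist file is the authority on whether CREATE INDEX needs an explicit operator class name:

  # from the top of the PostgreSQL source tree
  cd contrib/btree_gist
  make            # gmake on some platforms
  make install

  # load the module's operator class into your database
  # (the btree_gist.sql script is generated here and also installed under share/contrib)
  psql -d yourdb -f btree_gist.sql

  -- then, in psql, build a GiST index on an int4 column, e.g.:
  CREATE INDEX speedtest_p1_gist ON speedtest USING gist (p1);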