Thread: Sorting performance vs. MySQL

Sorting performance vs. MySQL

From
Yang Zhang
Date:
I have the exact same table of data in both MySQL and Postgresql. In Postgresql:

tpcc=# \d metarelcloud_transactionlog
                                       Table
"public.metarelcloud_transactionlog"
       Column        |         Type          |
       Modifiers

---------------------+-----------------------+--------------------------------------------------------------------------
 id                  | integer               | not null default
nextval('metarelcloud_transactionlog_id_seq'::regclass)
 transactionid       | integer               | not null
 queryid             | smallint              | not null
 tableid             | character varying(30) | not null
 tupleid             | integer               | not null
 querytype           | character varying     | not null
 graphpartition      | smallint              |
 replicatedpartition | smallint              |
 justifiedpartition  | smallint              |
 hashpartition       | smallint              |
 nodeid              | integer               |
 manualpartition     | smallint              |
Indexes:
    "metarelcloud_transactionlog_pkey" PRIMARY KEY, btree (id)
Check constraints:
    "metarelcloud_transactionlog_graphpartition_check" CHECK
(graphpartition >= 0)
    "metarelcloud_transactionlog_hashpartition_check" CHECK (hashpartition >= 0)
    "metarelcloud_transactionlog_justifiedpartition_check" CHECK
(justifiedpartition >= 0)
    "metarelcloud_transactionlog_manualpartition_check" CHECK
(manualpartition >= 0)
    "metarelcloud_transactionlog_querytype_check" CHECK
(querytype::text = ANY (ARRAY['select'::character varying,
'insert'::character varying, 'delete'::character varying,
'update'::character varying]::text[]))
    "metarelcloud_transactionlog_replicatedpartition_check" CHECK
(replicatedpartition >= 0)

In MySQL:

CREATE TABLE `metarelcloud_transactionlog` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `transactionid` int(11) NOT NULL,
  `queryid` tinyint(4) NOT NULL,
  `tableid` varchar(30) NOT NULL,
  `tupleid` int(11) NOT NULL,
  `querytype` enum('select','insert','delete','update') NOT NULL,
  `graphpartition` tinyint(3) unsigned DEFAULT NULL,
  `replicatedpartition` tinyint(3) unsigned DEFAULT NULL,
  `justifiedpartition` tinyint(3) unsigned DEFAULT NULL,
  `hashpartition` tinyint(3) unsigned DEFAULT NULL,
  `nodeid` int(11) DEFAULT NULL,
  `manualpartition` tinyint(3) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `transactionid` (`transactionid`),
  KEY `tableid` (`tableid`,`tupleid`),
  KEY `nodeid` (`nodeid`)
) ENGINE=MyISAM AUTO_INCREMENT=50410166 DEFAULT CHARSET=latin1

I'm running:

  select * from metarelcloud_transactionlog order by transactionid;

It takes MySQL 6 minutes, but Postgresql is still running after 70
minutes. Is there something like a glaring misconfiguration that I'm
overlooking? Thanks in advance.
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Pavel Stehule
Date:
hello

the speed depends on setting of working_memory. Try to increase a working_memory

set working_memory to '10MB';

Regards
Pavel Stehule

2010/2/22 Yang Zhang <yanghatespam@gmail.com>:
> I have the exact same table of data in both MySQL and Postgresql. In Postgresql:
>
> tpcc=# \d metarelcloud_transactionlog
>                                       Table
> "public.metarelcloud_transactionlog"
>       Column        |         Type          |
>       Modifiers
>
---------------------+-----------------------+--------------------------------------------------------------------------
>  id                  | integer               | not null default
> nextval('metarelcloud_transactionlog_id_seq'::regclass)
>  transactionid       | integer               | not null
>  queryid             | smallint              | not null
>  tableid             | character varying(30) | not null
>  tupleid             | integer               | not null
>  querytype           | character varying     | not null
>  graphpartition      | smallint              |
>  replicatedpartition | smallint              |
>  justifiedpartition  | smallint              |
>  hashpartition       | smallint              |
>  nodeid              | integer               |
>  manualpartition     | smallint              |
> Indexes:
>    "metarelcloud_transactionlog_pkey" PRIMARY KEY, btree (id)
> Check constraints:
>    "metarelcloud_transactionlog_graphpartition_check" CHECK
> (graphpartition >= 0)
>    "metarelcloud_transactionlog_hashpartition_check" CHECK (hashpartition >= 0)
>    "metarelcloud_transactionlog_justifiedpartition_check" CHECK
> (justifiedpartition >= 0)
>    "metarelcloud_transactionlog_manualpartition_check" CHECK
> (manualpartition >= 0)
>    "metarelcloud_transactionlog_querytype_check" CHECK
> (querytype::text = ANY (ARRAY['select'::character varying,
> 'insert'::character varying, 'delete'::character varying,
> 'update'::character varying]::text[]))
>    "metarelcloud_transactionlog_replicatedpartition_check" CHECK
> (replicatedpartition >= 0)
>
> In MySQL:
>
> CREATE TABLE `metarelcloud_transactionlog` (
>  `id` int(11) NOT NULL AUTO_INCREMENT,
>  `transactionid` int(11) NOT NULL,
>  `queryid` tinyint(4) NOT NULL,
>  `tableid` varchar(30) NOT NULL,
>  `tupleid` int(11) NOT NULL,
>  `querytype` enum('select','insert','delete','update') NOT NULL,
>  `graphpartition` tinyint(3) unsigned DEFAULT NULL,
>  `replicatedpartition` tinyint(3) unsigned DEFAULT NULL,
>  `justifiedpartition` tinyint(3) unsigned DEFAULT NULL,
>  `hashpartition` tinyint(3) unsigned DEFAULT NULL,
>  `nodeid` int(11) DEFAULT NULL,
>  `manualpartition` tinyint(3) unsigned DEFAULT NULL,
>  PRIMARY KEY (`id`),
>  KEY `transactionid` (`transactionid`),
>  KEY `tableid` (`tableid`,`tupleid`),
>  KEY `nodeid` (`nodeid`)
> ) ENGINE=MyISAM AUTO_INCREMENT=50410166 DEFAULT CHARSET=latin1
>
> I'm running:
>
>  select * from metarelcloud_transactionlog order by transactionid;
>
> It takes MySQL 6 minutes, but Postgresql is still running after 70
> minutes. Is there something like a glaring misconfiguration that I'm
> overlooking? Thanks in advance.
> --
> Yang Zhang
> http://www.mit.edu/~y_z/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

Re: Sorting performance vs. MySQL

From
Frank Heikens
Date:
There is no index on the column transactionid in your PostgreSQL-
table, as there is in your MySQL-table. This explains the difference.

CREATE INDEX i_transactionid ON public.metarelcloud_transactionlog
(transactionid);


Op 22 feb 2010, om 19:10 heeft Yang Zhang het volgende geschreven:

> I have the exact same table of data in both MySQL and Postgresql. In
> Postgresql:
>
> tpcc=# \d metarelcloud_transactionlog
>                                       Table
> "public.metarelcloud_transactionlog"
>       Column        |         Type          |
>       Modifiers
> ---------------------+-----------------------
> +
> --------------------------------------------------------------------------
> id                  | integer               | not null default
> nextval('metarelcloud_transactionlog_id_seq'::regclass)
> transactionid       | integer               | not null
> queryid             | smallint              | not null
> tableid             | character varying(30) | not null
> tupleid             | integer               | not null
> querytype           | character varying     | not null
> graphpartition      | smallint              |
> replicatedpartition | smallint              |
> justifiedpartition  | smallint              |
> hashpartition       | smallint              |
> nodeid              | integer               |
> manualpartition     | smallint              |
> Indexes:
>    "metarelcloud_transactionlog_pkey" PRIMARY KEY, btree (id)
> Check constraints:
>    "metarelcloud_transactionlog_graphpartition_check" CHECK
> (graphpartition >= 0)
>    "metarelcloud_transactionlog_hashpartition_check" CHECK
> (hashpartition >= 0)
>    "metarelcloud_transactionlog_justifiedpartition_check" CHECK
> (justifiedpartition >= 0)
>    "metarelcloud_transactionlog_manualpartition_check" CHECK
> (manualpartition >= 0)
>    "metarelcloud_transactionlog_querytype_check" CHECK
> (querytype::text = ANY (ARRAY['select'::character varying,
> 'insert'::character varying, 'delete'::character varying,
> 'update'::character varying]::text[]))
>    "metarelcloud_transactionlog_replicatedpartition_check" CHECK
> (replicatedpartition >= 0)
>
> In MySQL:
>
> CREATE TABLE `metarelcloud_transactionlog` (
>  `id` int(11) NOT NULL AUTO_INCREMENT,
>  `transactionid` int(11) NOT NULL,
>  `queryid` tinyint(4) NOT NULL,
>  `tableid` varchar(30) NOT NULL,
>  `tupleid` int(11) NOT NULL,
>  `querytype` enum('select','insert','delete','update') NOT NULL,
>  `graphpartition` tinyint(3) unsigned DEFAULT NULL,
>  `replicatedpartition` tinyint(3) unsigned DEFAULT NULL,
>  `justifiedpartition` tinyint(3) unsigned DEFAULT NULL,
>  `hashpartition` tinyint(3) unsigned DEFAULT NULL,
>  `nodeid` int(11) DEFAULT NULL,
>  `manualpartition` tinyint(3) unsigned DEFAULT NULL,
>  PRIMARY KEY (`id`),
>  KEY `transactionid` (`transactionid`),
>  KEY `tableid` (`tableid`,`tupleid`),
>  KEY `nodeid` (`nodeid`)
> ) ENGINE=MyISAM AUTO_INCREMENT=50410166 DEFAULT CHARSET=latin1
>
> I'm running:
>
>  select * from metarelcloud_transactionlog order by transactionid;
>
> It takes MySQL 6 minutes, but Postgresql is still running after 70
> minutes. Is there something like a glaring misconfiguration that I'm
> overlooking? Thanks in advance.
> --
> Yang Zhang
> http://www.mit.edu/~y_z/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Frank Heikens
frankheikens@mac.com




Re: Sorting performance vs. MySQL

From
Richard Broersma
Date:
On Mon, Feb 22, 2010 at 10:17 AM, Frank Heikens <frankheikens@mac.com> wrote:

> There is no index on the column transactionid in your PostgreSQL-table, as
> there is in your MySQL-table. This explains the difference.
>
> CREATE INDEX i_transactionid ON public.metarelcloud_transactionlog
> (transactionid);

Does an index help a sort operation in PostgreSQL?


--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 1:13 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> hello
>
> the speed depends on setting of working_memory. Try to increase a working_memory
>
> set working_memory to '10MB';

It's already at

tpcc=# show work_mem;
 work_mem
----------
 20000kB
(1 row)

I also wouldn't have imagined an external merge-sort as being very
memory-intensive--wouldn't it only enough buffer space to read 2x and
write 1x in big-enough chunks for mostly-sequential access?
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 1:30 PM, Richard Broersma
<richard.broersma@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 10:17 AM, Frank Heikens <frankheikens@mac.com> wrote:
>
>> There is no index on the column transactionid in your PostgreSQL-table, as
>> there is in your MySQL-table. This explains the difference.
>>
>> CREATE INDEX i_transactionid ON public.metarelcloud_transactionlog
>> (transactionid);
>
> Does an index help a sort operation in PostgreSQL?

I also share the same doubt. An external merge-sort needs to make
complete passes over the entire dataset, with no index-directed
accesses.
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Frank Heikens
Date:
Op 22 feb 2010, om 19:30 heeft Richard Broersma het volgende geschreven:

> On Mon, Feb 22, 2010 at 10:17 AM, Frank Heikens
> <frankheikens@mac.com> wrote:
>
>> There is no index on the column transactionid in your PostgreSQL-
>> table, as
>> there is in your MySQL-table. This explains the difference.
>>
>> CREATE INDEX i_transactionid ON public.metarelcloud_transactionlog
>> (transactionid);
>
> Does an index help a sort operation in PostgreSQL?

Yes it does, see the manual: http://www.postgresql.org/docs/8.4/interactive/indexes-ordering.html

Example without index:
"Sort  (cost=804.39..829.39 rows=10000 width=4) (actual
time=16.006..17.171 rows=10000 loops=1)"
"  Sort Key: bar"
"  Sort Method:  quicksort  Memory: 491kB"
"  ->  Seq Scan on bla  (cost=0.00..140.00 rows=10000 width=4) (actual
time=0.015..2.236 rows=10000 loops=1)"
"Total runtime: 18.098 ms"

Same query with index (btree):
"Index Scan Backward using i_bar on bla  (cost=0.00..406.25 rows=10000
width=4) (actual time=0.093..4.408 rows=10000 loops=1)"
"Total runtime: 5.381 ms"

>
>
> --
> Regards,
> Richard Broersma Jr.
>
> Visit the Los Angeles PostgreSQL Users Group (LAPUG)
> http://pugs.postgresql.org/lapug


Regards,
Frank Heikens




Re: Sorting performance vs. MySQL

From
Alban Hertroys
Date:
On 22 Feb 2010, at 19:35, Yang Zhang wrote:

> I also wouldn't have imagined an external merge-sort as being very


Where's that external merge-sort coming from? Can you show an explain analyze?

If your work-mem is too low there's a good chance that Postgres has to use your disks for sorting, which will obviously
bequite slow. 

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4b82d18510442035320951!



Re: Sorting performance vs. MySQL

From
Alvaro Herrera
Date:
Yang Zhang escribió:

> I'm running:
>
>   select * from metarelcloud_transactionlog order by transactionid;
>
> It takes MySQL 6 minutes, but Postgresql is still running after 70
> minutes. Is there something like a glaring misconfiguration that I'm
> overlooking? Thanks in advance.

How large is the table, and have you vacuumed it?  Did you analyze it?
What Pg version is this?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 1:48 PM, Alban Hertroys
<dalroi@solfertje.student.utwente.nl> wrote:
> On 22 Feb 2010, at 19:35, Yang Zhang wrote:
>
>> I also wouldn't have imagined an external merge-sort as being very
>
>
> Where's that external merge-sort coming from? Can you show an explain analyze?

I just assumed that the "Sort" in the EXPLAIN output meant an external
merge-sort, given that the table has over 50 million tuples and is
over 3GB, *and* there is no index on the sort key:

tpcc=# explain select * from metarelcloud_transactionlog order by transactionid;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=8408637.34..8534662.95 rows=50410244 width=17)
   Sort Key: a.transactionid
   ->  Seq Scan on metarelcloud_transactionlog a
(cost=0.00..925543.44 rows=50410244 width=17)
(3 rows)

Anyway, I added the INDEX as suggested by Frank, but it's been 20
minutes and it's still running. With the index, EXPLAIN says:

tpcc=# explain select * from metarelcloud_transactionlog order by transactionid;
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Scan using i_transactionid on metarelcloud_transactionlog
(cost=0.00..4453076.81 rows=50410164 width=44)
(1 row)

> If your work-mem is too low there's a good chance that Postgres has to use your disks for sorting, which will
obviouslybe quite slow. 

Relative to the non-terminating 80-minute-so-far sort, Unix sort runs
much faster (on the order of several minutes).
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 2:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Yang Zhang escribió:
>
>> I'm running:
>>
>>   select * from metarelcloud_transactionlog order by transactionid;
>>
>> It takes MySQL 6 minutes, but Postgresql is still running after 70
>> minutes. Is there something like a glaring misconfiguration that I'm
>> overlooking? Thanks in advance.
>
> How large is the table, and have you vacuumed it?  Did you analyze it?
> What Pg version is this?

The table has 50.4M tuples. It's been vacuumed and analyzed. I'm using
PG 8.3.8 on Fedora 10 x86_64.
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Frank Heikens
Date:
Op 22 feb 2010, om 20:07 heeft Yang Zhang het volgende geschreven:

> On Mon, Feb 22, 2010 at 1:48 PM, Alban Hertroys
> <dalroi@solfertje.student.utwente.nl> wrote:
>> On 22 Feb 2010, at 19:35, Yang Zhang wrote:
>>
>>> I also wouldn't have imagined an external merge-sort as being very
>>
>>
>> Where's that external merge-sort coming from? Can you show an
>> explain analyze?
>
> I just assumed that the "Sort" in the EXPLAIN output meant an external
> merge-sort, given that the table has over 50 million tuples and is
> over 3GB, *and* there is no index on the sort key:
>
> tpcc=# explain select * from metarelcloud_transactionlog order by
> transactionid;
>                                                   QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
> Sort  (cost=8408637.34..8534662.95 rows=50410244 width=17)
>   Sort Key: a.transactionid
>   ->  Seq Scan on metarelcloud_transactionlog a
> (cost=0.00..925543.44 rows=50410244 width=17)
> (3 rows)
>
> Anyway, I added the INDEX as suggested by Frank, but it's been 20
> minutes and it's still running. With the index, EXPLAIN says:
>
> tpcc=# explain select * from metarelcloud_transactionlog order by
> transactionid;
>                                                   QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
> Index Scan using i_transactionid on metarelcloud_transactionlog
> (cost=0.00..4453076.81 rows=50410164 width=44)
> (1 row)
>

Use EXPLAIN ANALYZE to see how the query is executed, gives you more
details.


>> If your work-mem is too low there's a good chance that Postgres has
>> to use your disks for sorting, which will obviously be quite slow.
>
> Relative to the non-terminating 80-minute-so-far sort, Unix sort runs
> much faster (on the order of several minutes).

Make sure your index does fit into memory, what's the size of the index?


> --
> Yang Zhang
> http://www.mit.edu/~y_z/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Frank Heikens




Re: Sorting performance vs. MySQL

From
Alvaro Herrera
Date:
Yang Zhang escribió:
> I have the exact same table of data in both MySQL and Postgresql. In Postgresql:

I just noticed two things:

[snip lots of stuff]

1.

> ) ENGINE=MyISAM AUTO_INCREMENT=50410166 DEFAULT CHARSET=latin1

You're doing a comparison to MyISAM.


2.

>   select * from metarelcloud_transactionlog order by transactionid;

You're reading the whole table.

This is unlikely to fly very far.  I suggest you try some query that's
actually going to be used in the real world.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 2:15 PM, Frank Heikens <frankheikens@mac.com> wrote:
>
> Op 22 feb 2010, om 20:07 heeft Yang Zhang het volgende geschreven:
>
>> On Mon, Feb 22, 2010 at 1:48 PM, Alban Hertroys
>> <dalroi@solfertje.student.utwente.nl> wrote:
>>>
>>> On 22 Feb 2010, at 19:35, Yang Zhang wrote:
>>>
>>>> I also wouldn't have imagined an external merge-sort as being very
>>>
>>>
>>> Where's that external merge-sort coming from? Can you show an explain
>>> analyze?
>>
>> I just assumed that the "Sort" in the EXPLAIN output meant an external
>> merge-sort, given that the table has over 50 million tuples and is
>> over 3GB, *and* there is no index on the sort key:
>>
>> tpcc=# explain select * from metarelcloud_transactionlog order by
>> transactionid;
>>                                                  QUERY PLAN
>>
>> -----------------------------------------------------------------------------------------------------------------
>> Sort  (cost=8408637.34..8534662.95 rows=50410244 width=17)
>>  Sort Key: a.transactionid
>>  ->  Seq Scan on metarelcloud_transactionlog a
>> (cost=0.00..925543.44 rows=50410244 width=17)
>> (3 rows)
>>
>> Anyway, I added the INDEX as suggested by Frank, but it's been 20
>> minutes and it's still running. With the index, EXPLAIN says:
>>
>> tpcc=# explain select * from metarelcloud_transactionlog order by
>> transactionid;
>>                                                  QUERY PLAN
>>
>> -----------------------------------------------------------------------------------------------------------------
>> Index Scan using i_transactionid on metarelcloud_transactionlog
>> (cost=0.00..4453076.81 rows=50410164 width=44)
>> (1 row)
>>
>
> Use EXPLAIN ANALYZE to see how the query is executed, gives you more
> details.

As mentioned, this would take a very long time to complete running --
I have not yet seen one successful query completion yet. I'd need to
let it run overnight.

>
>
>>> If your work-mem is too low there's a good chance that Postgres has to
>>> use your disks for sorting, which will obviously be quite slow.
>>
>> Relative to the non-terminating 80-minute-so-far sort, Unix sort runs
>> much faster (on the order of several minutes).
>
> Make sure your index does fit into memory, what's the size of the index?

How might I find out the size and whether it's being fit in memory?
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 2:27 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Yang Zhang escribió:
>> I have the exact same table of data in both MySQL and Postgresql. In Postgresql:
>
> I just noticed two things:
>
> [snip lots of stuff]
>
> 1.
>
>> ) ENGINE=MyISAM AUTO_INCREMENT=50410166 DEFAULT CHARSET=latin1
>
> You're doing a comparison to MyISAM.

We've actually been using innodb as well; it exhibits similar
execution times to MyISAM.

>
>
> 2.
>
>>   select * from metarelcloud_transactionlog order by transactionid;
>
> You're reading the whole table.
>
> This is unlikely to fly very far.  I suggest you try some query that's
> actually going to be used in the real world.

This isn't some microbenchmark. This is part of our actual analytical
application. We're running large-scale graph partitioning algorithms.
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Scott Marlowe
Date:
On Mon, Feb 22, 2010 at 12:30 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
> This isn't some microbenchmark. This is part of our actual analytical
> application. We're running large-scale graph partitioning algorithms.

It's important to see how it runs if you can fit more / most of the
data set into memory by cranking up work_mem to something really big
(like a gigabyte or two) and if the query planner can switch to some
sort of hash algorithm.

Also, can you cluster the table on transactionid ?

Re: Sorting performance vs. MySQL

From
Frank Heikens
Date:
Op 22 feb 2010, om 20:28 heeft Yang Zhang het volgende geschreven:
>
>
>>
>>
>>>> If your work-mem is too low there's a good chance that Postgres
>>>> has to
>>>> use your disks for sorting, which will obviously be quite slow.
>>>
>>> Relative to the non-terminating 80-minute-so-far sort, Unix sort
>>> runs
>>> much faster (on the order of several minutes).
>>
>> Make sure your index does fit into memory, what's the size of the
>> index?
>
> How might I find out the size and whether it's being fit in memory?

SELECT pg_size_pretty(pg_relation_size('i_transactionid'));


> --
> Yang Zhang
> http://www.mit.edu/~y_z/
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Frank Heikens




Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 2:39 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 12:30 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
>> This isn't some microbenchmark. This is part of our actual analytical
>> application. We're running large-scale graph partitioning algorithms.
>
> It's important to see how it runs if you can fit more / most of the
> data set into memory by cranking up work_mem to something really big
> (like a gigabyte or two) and if the query planner can switch to some
> sort of hash algorithm.

We're actually using a very small dataset right now. Being bounded by
memory capacity is not a scalable approach for our application.

>
> Also, can you cluster the table on transactionid ?
>

We can, but that's not really addressing the core issue, which matters
to us since the sort itself is only for performing a self merge join
on transactionid, and the *very next step* is a group by a.tableid,
a.tupleid, b.tableid, b.tupleid (i.e. requiring another sort for the
group-agg).
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Scott Marlowe
Date:
On Mon, Feb 22, 2010 at 11:10 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
> I have the exact same table of data in both MySQL and Postgresql. In Postgresql:

Just wondering, are these on the same exact machine?

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 2:52 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 11:10 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
>> I have the exact same table of data in both MySQL and Postgresql. In Postgresql:
>
> Just wondering, are these on the same exact machine?
>

Yes, on the same disk.
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Scott Marlowe
Date:
On Mon, Feb 22, 2010 at 12:50 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 2:39 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> On Mon, Feb 22, 2010 at 12:30 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
>>> This isn't some microbenchmark. This is part of our actual analytical
>>> application. We're running large-scale graph partitioning algorithms.
>>
>> It's important to see how it runs if you can fit more / most of the
>> data set into memory by cranking up work_mem to something really big
>> (like a gigabyte or two) and if the query planner can switch to some
>> sort of hash algorithm.
>
> We're actually using a very small dataset right now. Being bounded by
> memory capacity is not a scalable approach for our application.

But the more you can fit into work_mem the faster it will go anyway.
So it's still worth a try.

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 2:41 PM, Frank Heikens <frankheikens@mac.com> wrote:
>
> Op 22 feb 2010, om 20:28 heeft Yang Zhang het volgende geschreven:
>>
>>
>>>
>>>
>>>>> If your work-mem is too low there's a good chance that Postgres has to
>>>>> use your disks for sorting, which will obviously be quite slow.
>>>>
>>>> Relative to the non-terminating 80-minute-so-far sort, Unix sort runs
>>>> much faster (on the order of several minutes).
>>>
>>> Make sure your index does fit into memory, what's the size of the index?
>>
>> How might I find out the size and whether it's being fit in memory?
>
> SELECT pg_size_pretty(pg_relation_size('i_transactionid'));

 pg_size_pretty
----------------
 1080 MB
(1 row)
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Yeb Havinga
Date:
Scott Marlowe wrote:
> On Mon, Feb 22, 2010 at 11:10 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
>
>> I have the exact same table of data in both MySQL and Postgresql. In Postgresql:
>>
>
> Just wondering, are these on the same exact machine?
>
>
Just reading up on this interesting thread. WFIW, 2 years ago I and a
collegue of mine did a hardware comparison of early Intel and AMD
desktop quadcore processors to run postgres database, with most other
parts comparable. The intel processor was 20 to 30 % faster in cpu
operations to the (first generation) Phenom at almost everything, except
at index creation. The test that the AMD finished in a few minutes, we
had to stop on the Intel because it simply didn't finish. We double
checked configuration settings and could not find explainable
differences. I hesitate to post this information here, because its hard
to believe that an actual big difference between the processors exists,
and it more likely was something in our test setup. Still: the
difference was *only* in index creation, which is kindoff like a qsort yes?

egards
Yeb Havinga





Re: Sorting performance vs. MySQL

From
Tom Lane
Date:
Yang Zhang <yanghatespam@gmail.com> writes:
> On Mon, Feb 22, 2010 at 1:13 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> the speed depends on setting of working_memory. Try to increase a working_memory

> It's already at
>  20000kB

According to your original posting, you're trying to sort something like
a gigabyte of data.  20MB is peanuts.  I wouldn't recommend increasing
the value across-the-board, but setting it to several hundred meg for
this particular query might help.  How much RAM in your machine anyway?

Also, the fact that mysql is faster suggests that having an index does help.
Possibly the data is nearly ordered by transactionid, in which case an
indexscan would not have random-access problems and would be much faster
than an explicit sort.

            regards, tom lane

Re: Sorting performance vs. MySQL

From
Scott Marlowe
Date:
On Mon, Feb 22, 2010 at 12:53 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 2:52 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> On Mon, Feb 22, 2010 at 11:10 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
>>> I have the exact same table of data in both MySQL and Postgresql. In Postgresql:
>>
>> Just wondering, are these on the same exact machine?
>>
>
> Yes, on the same disk.

I'm wondering how much of this could be caching effects.  Is the MySQL
database "warmed up" before you started, and the pgsql database is
"cold" and no caching has taken place?

What do things like vmstat 10 say while the query is running on each
db?  First time, second time, things like that.

Also, just curios, what's shared_buffers set to on the pgsql instance?

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 3:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yang Zhang <yanghatespam@gmail.com> writes:
>> On Mon, Feb 22, 2010 at 1:13 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>> the speed depends on setting of working_memory. Try to increase a working_memory
>
>> It's already at
>>  20000kB
>
> According to your original posting, you're trying to sort something like
> a gigabyte of data.  20MB is peanuts.  I wouldn't recommend increasing
> the value across-the-board, but setting it to several hundred meg for
> this particular query might help.  How much RAM in your machine anyway?

We have 16GB of RAM, but again, Unix sort (and even our own
hand-rolled merge-sort) can operate zippily while avoiding consuming
additional memory.

All the same, we increased work_mem to 1GB, and still the query is not
completing.

>
> Also, the fact that mysql is faster suggests that having an index does help.
> Possibly the data is nearly ordered by transactionid, in which case an
> indexscan would not have random-access problems and would be much faster
> than an explicit sort.

Note that earlier in the thread I tried running this query with an
index scan, but it's still much slower.
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Tom Lane
Date:
Yeb Havinga <yebhavinga@gmail.com> writes:
> Just reading up on this interesting thread. WFIW, 2 years ago I and a
> collegue of mine did a hardware comparison of early Intel and AMD
> desktop quadcore processors to run postgres database, with most other
> parts comparable. The intel processor was 20 to 30 % faster in cpu
> operations to the (first generation) Phenom at almost everything, except
> at index creation. The test that the AMD finished in a few minutes, we
> had to stop on the Intel because it simply didn't finish. We double
> checked configuration settings and could not find explainable
> differences. I hesitate to post this information here, because its hard
> to believe that an actual big difference between the processors exists,
> and it more likely was something in our test setup. Still: the
> difference was *only* in index creation, which is kindoff like a qsort yes?

Interesting.  Yes, btree index creation is essentially a sort ...

            regards, tom lane

Re: Sorting performance vs. MySQL

From
"Igor Neyman"
Date:
When in doubt - test.
Why not remove index in MySQL (or create index in PostgreSQL) and see
what happens.
Why trying compare "apples and oranges"?

Igor Neyman

> -----Original Message-----
> From: Yang Zhang [mailto:yanghatespam@gmail.com]
> Sent: Monday, February 22, 2010 1:37 PM
> To: Richard Broersma
> Cc: Frank Heikens; pgsql-general@postgresql.org
> Subject: Re: Sorting performance vs. MySQL
>
> On Mon, Feb 22, 2010 at 1:30 PM, Richard Broersma
> <richard.broersma@gmail.com> wrote:
> > On Mon, Feb 22, 2010 at 10:17 AM, Frank Heikens
> <frankheikens@mac.com> wrote:
> >
> >> There is no index on the column transactionid in your
> >> PostgreSQL-table, as there is in your MySQL-table. This
> explains the difference.
> >>
> >> CREATE INDEX i_transactionid ON public.metarelcloud_transactionlog
> >> (transactionid);
> >
> > Does an index help a sort operation in PostgreSQL?
>
> I also share the same doubt. An external merge-sort needs to
> make complete passes over the entire dataset, with no
> index-directed accesses.
> --
> Yang Zhang
> http://www.mit.edu/~y_z/
>

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 5:31 PM, Igor Neyman <ineyman@perceptron.com> wrote:
> When in doubt - test.
> Why not remove index in MySQL (or create index in PostgreSQL) and see
> what happens.
> Why trying compare "apples and oranges"?

Continue reading this thread -- I also tried using an index in Postgresql.
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Scott Marlowe
Date:
On Mon, Feb 22, 2010 at 3:33 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 5:31 PM, Igor Neyman <ineyman@perceptron.com> wrote:
>> When in doubt - test.
>> Why not remove index in MySQL (or create index in PostgreSQL) and see
>> what happens.
>> Why trying compare "apples and oranges"?
>
> Continue reading this thread -- I also tried using an index in Postgresql.

But have you tried cranking up work_mem to say 1G?

Re: Sorting performance vs. MySQL

From
Scott Marlowe
Date:
On Mon, Feb 22, 2010 at 3:33 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 5:31 PM, Igor Neyman <ineyman@perceptron.com> wrote:
>> When in doubt - test.
>> Why not remove index in MySQL (or create index in PostgreSQL) and see
>> what happens.
>> Why trying compare "apples and oranges"?
>
> Continue reading this thread -- I also tried using an index in Postgresql.

And oh yeah, what was shared_buffers set to?  I'm not sure we ever got
an answer to that.

Re: Sorting performance vs. MySQL

From
Alex Hunsaker
Date:
On Mon, Feb 22, 2010 at 11:10, Yang Zhang <yanghatespam@gmail.com> wrote:
> I have the exact same table of data in both MySQL and Postgresql. In ?> Postgresql:

FWIW on a stock (unchanged postgresql.conf) 8.3.9 I get (best of 3
runs) 79 seconds, 26 using an index and 27 seconds with it clustered.
Now yes it goes a lot faster because im skipping all the overhead of
sending the data to the client...  But still that means it has almost
nothing with the sort or indexs.  Leaves pretty much your cpu, disk,
filesystem and network...  BTW the first time I did it it had to write
out the hint bits so that took (a bit) longer... Is this freshly
loaded data?

---

# create table metarelcould_transactionlog (
        id serial primary key,
        transactionid integer not null,
        queryid smallint not null,
        tableid varchar(30) not null,
        tupleid integer not null,
        querytype varchar not null,
        graphpartition smallint,
        replicatedpartition smallint,
        justifiedpartition smallint,
        hashpartition smallint,
        modeid integer,
        manualpartition smallint
);

# insert into metarelcould_transactionlog (transactionid, queryid,
tableid, tupleid, querytype, graphpartition, replicatedpartition,
justifiedpartition, hashpartition, modeid, manualpartition) select
foo,  1, 'sometable', 1, 's', 1, 1, 1, 1, 1, 1 from generate_series(1,
50000000) as foo;

# select count(1) from (SELECT * from metarelcould_transactionlog
order by transactionid) as foo;
  count
----------
 50000000
(1 row)

Time: 79017.186 ms

# create index idx on metarelcould_transactionlog (transactionid);
# select count(1) from (SELECT * from metarelcould_transactionlog
order by transactionid) as foo;
  count
----------
 50000000
(1 row)

Time: 26230.534 ms

# cluster metarelcould_transactionlog USING  idx;
CLUSTER
Time: 342381.535 ms

# select count(1) from (SELECT * from metarelcould_transactionlog
order by transactionid) as foo;
  count
----------
 50000000
(1 row)

Time: 27704.794 ms

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
nnnnnOn Mon, Feb 22, 2010 at 3:45 PM, Scott Marlowe
<scott.marlowe@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 12:53 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
>> On Mon, Feb 22, 2010 at 2:52 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>> On Mon, Feb 22, 2010 at 11:10 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
>>>> I have the exact same table of data in both MySQL and Postgresql. In Postgresql:
>>>
>>> Just wondering, are these on the same exact machine?
>>>
>>
>> Yes, on the same disk.
>
> I'm wondering how much of this could be caching effects.  Is the MySQL
> database "warmed up" before you started, and the pgsql database is
> "cold" and no caching has taken place?
>
> What do things like vmstat 10 say while the query is running on each
> db?  First time, second time, things like that.

Awesome -- this actually led me to discover the problem.

When running the query in MySQL InnoDB:

$ vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system--
-----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0 13 13733604  83020   5648 2193884    3    3   936   168    2    1
4  2 89  5  0
 1 12 13749952  80164   5600 2178032    0 4354   908  4379 3586 2638
0  1 38 60  0
 0 19 13762228  80576   5556 2145220  208 3527  1280  3690 3668 2635
1  1 39 59  0
 0 19 13778632  79420   5560 2135228   52 4186  1046  4191 3682 2418
0  1 37 62  0
 0 19 13792964  77336   5592 2082520   41 3731  1698  3804 4102 2686
1  1 53 45  0
 0 14 13810356  84036   5556 2049836   36 4241   797  4246 3913 2603
0  1 68 31  0
 1 14 13825640  81336   5520 2001920    0 4212   958  4220 3848 2736
1  1 73 25  0
 0 17 13844952  78036   5476 1976956    8 4685   923  4689 3832 2547
0  1 69 29  0
 2 13 13863828  79812   5448 1954952    3 4627   692  4634 3744 2505
0  1 70 28  0
 0 15 13883828  77764   5440 1920528  249 4544   972  4548 4345 2506
0  1 70 28  0
 1 20 13898900  79132   5456 1890192   28 4341   723  4438 4982 3030
0  3 64 33  0
 0 11 13915252  85184   5624 1865260   79 3668   752  3764 4472 2765
0  3 57 40  0
 0 12 13933964  78448   5700 1832640  120 4327  1066  4434 4484 2777
1  3 52 45  0
 0 19 13951748  77640   5816 1795720   94 4005  1159  4091 4580 2762
1  3 48 49  0
 0 16 13972748  79884   5780 1753676    0 4737   787  4746 4385 2766
1  3 51 45  0
 0 25 13988108  78936   5884 1726068  547 3954  1468  4116 4976 3502
0  4 44 52  0
 1 20 14011500  77676   5868 1689136  161 4980   843  5506 5218 3131
0  3 34 62  0
 0 22 14032472  81348   5816 1647884  270 4198   943  4369 4521 2826
1  3 40 56  0
 0 23 14055220  81712   5804 1626872  193 4774  1408  4856 4617 2754
1  3 38 58  0
 0 21 14075848  81844   5696 1576836    0 4738   974  4742 4528 2704
1  3 40 56  0
 0 25 14097260  79788   5628 1536336  213 4512   922  4639 4726 2929
1  3 27 69  0
 0 24 14123900  80820   5616 1488460  319 5033  1059  5128 4895 2780
2  3 17 78  0
 1 26 14142632  77276   5660 1445592  445 4605  1434  4727 5401 3364
1  3 16 79  0
 0 31 14165668  83736   5976 1387048  362 4288  1444  4428 4739 2963
2  3 17 78  0
 1 28 14180104  77564   6324 1369232  387 4526  2222  4677 5748 3559
1  3 16 80  0

I'm guessing the swap numbers are because MySQL uses mmap?

Anyway, when running the query in Postgresql:

$ vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system--
-----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  2 13866336 1574540  25024 7878980    3    3   936   168    2    1
4  2 89  5  0
 1  3 13861520 1163596  25328 8128360 1046    0 25429   433 4368 4175
4  2 80 14  0
 0  3 13856484 803024  25600 8343220 1117    0 22062   688 4492 4590
4  2 73 20  0
 0  6 13855304 678868  26052 8435540  160    0  9239   598 5195 7141
1  5 70 24  0
 0  6 13853644 513568  26332 8563680  401    0 12480  7100 4775 4248
3  3 68 26  0
 2  2 13851804 166228  26624 8775304  634    0 21466  1497 4680 4550
6  2 64 28  0
 0  5 13861556  81896  26740 8825360  860 3547  6100  3847 5142 3386
6  2 57 35  0
 0  6 13867688  91368  26808 8832712  653 3326  1835  3604 4738 2762
5  2 61 32  0
 0  5 13870676  88524  26872 8849392  638 3272  2578  3517 4864 2909
4  2 55 39  0
 0  5 13872748  79512  27004 8864456  629 1788  2086  2949 4337 2921
1  3 55 41  0
 0  7 13876760  83124  27136 8867272 1018 2253  1713  2409 4321 2889
0  3 63 33  0
 0  6 13878964  82876  27240 8874540  792 2119  1854  2314 4288 2813
2  2 72 24  0
 3  4 13883204  81224  27280 8887068  661 3067  2995  3385 4558 2899
4  2 72 22  0
 0  6 13886636  82036  27352 8905628  594 3726  2628  4013 4744 2765
4  2 69 25  0
 0  8 13899852  85604  27400 8925800  638 4423  2689  4658 4903 2808
4  2 55 40  0
 1  4 13905184  80544  27484 8940040  676 3501  3006  3799 4805 2932
4  2 66 28  0
 0  9 13908480  80100  27516 8948476  668 2996  1720  3192 4594 2799
4  2 60 35  0

vmstat showed no swapping-out for a while, and then suddenly it
started spilling a lot. Checking psql's memory stats showed that it
was huge -- apparently, it's trying to store its full result set in
memory. As soon as I added a LIMIT 10000, everything worked
beautifully and finished in 4m (I verified that the planner was still
issuing a Sort).

I'm relieved that Postgresql itself does not, in fact, suck, but
slightly disappointed in the behavior of psql. I suppose it needs to
buffer everything in memory to properly format its tabular output,
among other possible reasons I could imagine.

Now it all almost seems obvious. My only solace is that it wasn't obv.
to the list either (thanks to everyone for their feedback!). Should've
just started with system profiling -- saves the day again.

>
> Also, just curios, what's shared_buffers set to on the pgsql instance?

 shared_buffers
----------------
 320000kB
(1 row)
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Mon, Feb 22, 2010 at 9:30 PM, Alex Hunsaker <badalex@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 11:10, Yang Zhang <yanghatespam@gmail.com> wrote:
>> I have the exact same table of data in both MySQL and Postgresql. In ?> Postgresql:
>
> FWIW on a stock (unchanged postgresql.conf) 8.3.9 I get (best of 3
> runs) 79 seconds, 26 using an index and 27 seconds with it clustered.
> Now yes it goes a lot faster because im skipping all the overhead of
> sending the data to the client...

Last sentence also contributed to my realizing the problem (the client
I was using was psql), but there's one oddity....

> # select count(1) from (SELECT * from metarelcould_transactionlog
> order by transactionid) as foo;

Does it strike anyone else that the query optimizer/rewriter should be
able to toss out the sort from such a query altogether?
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Tom Lane
Date:
Yang Zhang <yanghatespam@gmail.com> writes:
>> # select count(1) from (SELECT * from metarelcould_transactionlog
>> order by transactionid) as foo;

> Does it strike anyone else that the query optimizer/rewriter should be
> able to toss out the sort from such a query altogether?

It could, if it knew that the aggregate function didn't care about input
row order.  We don't have that knowledge about aggregates ATM.

            regards, tom lane

Re: Sorting performance vs. MySQL

From
Tom Lane
Date:
Yang Zhang <yanghatespam@gmail.com> writes:
> I'm relieved that Postgresql itself does not, in fact, suck, but
> slightly disappointed in the behavior of psql. I suppose it needs to
> buffer everything in memory to properly format its tabular output,
> among other possible reasons I could imagine.

That's half of it, and the other half is not wanting to present a
portion of query output if the query fails partway through.  You could
certainly write a client that disregarded these issues (as I suppose
mysql must be doing).

            regards, tom lane

Re: Sorting performance vs. MySQL

From
Scott Marlowe
Date:
On Mon, Feb 22, 2010 at 10:51 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
> nnnnnOn Mon, Feb 22, 2010 at 3:45 PM, Scott Marlowe
> <scott.marlowe@gmail.com> wrote:
>>
>> What do things like vmstat 10 say while the query is running on each
>> db?  First time, second time, things like that.
>
> Awesome -- this actually led me to discover the problem.
>
> vmstat showed no swapping-out for a while, and then suddenly it
> started spilling a lot. Checking psql's memory stats showed that it
> was huge -- apparently, it's trying to store its full result set in
> memory. As soon as I added a LIMIT 10000, everything worked
> beautifully and finished in 4m (I verified that the planner was still
> issuing a Sort).
>
> I'm relieved that Postgresql itself does not, in fact, suck, but
> slightly disappointed in the behavior of psql. I suppose it needs to
> buffer everything in memory to properly format its tabular output,
> among other possible reasons I could imagine.

It's best when working with big sets to do so with a cursor and fetch
a few thousand rows at a time.  It's how we handle really big sets at
work and it works like a charm in keeping the client from bogging down
with a huge memory footprint.

Re: Sorting performance vs. MySQL

From
Alex Hunsaker
Date:
On Mon, Feb 22, 2010 at 22:51, Yang Zhang <yanghatespam@gmail.com> wrote:
> vmstat showed no swapping-out for a while, and then suddenly it
> started spilling a lot. Checking psql's memory stats showed that it
> was huge -- apparently, it's trying to store its full result set in
> memory. As soon as I added a LIMIT 10000, everything worked
> beautifully and finished in 4m (I verified that the planner was still
> issuing a Sort).

Well im half surprised no one has recommend using a cursor.  Have you
looked in to that?  I bet that would fix most of your problems here.
>
> I'm relieved that Postgresql itself does not, in fact, suck, but
> slightly disappointed in the behavior of psql. I suppose it needs to
> buffer everything in memory to properly format its tabular output,
> among other possible reasons I could imagine.

Well AFAIK it will dump everything you asked for.  So if you said
select * from 1G table;  It should take at least 1G and potentially
quite a bit more formatting and overhead.

Re: Sorting performance vs. MySQL

From
Yang Zhang
Date:
On Tue, Feb 23, 2010 at 1:48 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Feb 22, 2010 at 10:51 PM, Yang Zhang <yanghatespam@gmail.com> wrote:
>> nnnnnOn Mon, Feb 22, 2010 at 3:45 PM, Scott Marlowe
>> <scott.marlowe@gmail.com> wrote:
>>>
>>> What do things like vmstat 10 say while the query is running on each
>>> db?  First time, second time, things like that.
>>
>> Awesome -- this actually led me to discover the problem.
>>
>> vmstat showed no swapping-out for a while, and then suddenly it
>> started spilling a lot. Checking psql's memory stats showed that it
>> was huge -- apparently, it's trying to store its full result set in
>> memory. As soon as I added a LIMIT 10000, everything worked
>> beautifully and finished in 4m (I verified that the planner was still
>> issuing a Sort).
>>
>> I'm relieved that Postgresql itself does not, in fact, suck, but
>> slightly disappointed in the behavior of psql. I suppose it needs to
>> buffer everything in memory to properly format its tabular output,
>> among other possible reasons I could imagine.
>
> It's best when working with big sets to do so with a cursor and fetch
> a few thousand rows at a time.  It's how we handle really big sets at
> work and it works like a charm in keeping the client from bogging down
> with a huge memory footprint.
>

Thing is, this is how I got here:

- ran complex query that does SELECT INTO.
- that never terminated, so killed it and tried a simpler SELECT (the
subject of this thread) from psql to see how long that would take.

I.e., my original application doesn't receive the entire dataset.
--
Yang Zhang
http://www.mit.edu/~y_z/

Re: Sorting performance vs. MySQL

From
Greg Stark
Date:
On Tue, Feb 23, 2010 at 6:48 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> I'm relieved that Postgresql itself does not, in fact, suck, but
>> slightly disappointed in the behavior of psql. I suppose it needs to
>> buffer everything in memory to properly format its tabular output,
>> among other possible reasons I could imagine.
>
> It's best when working with big sets to do so with a cursor and fetch
> a few thousand rows at a time.  It's how we handle really big sets at
> work and it works like a charm in keeping the client from bogging down
> with a huge memory footprint.
>

You can do \set FETCH_COUNT to have psql use a cursor automatically.


--
greg

Re: Sorting performance vs. MySQL

From
John Gage
Date:
I am under the impression that MySQL does not have anything resembling
Postgres' support for regular expressions.  Though some might think
that regular expressions are a sort of poor man's SQL, in any
application which manages large amounts of text they are crucial.
Postgres definitely does not suck.

Is this the authoratative webpage for "Snowball" (which I never
realized was a play on "Snobol")?

http://snowball.tartarus.org/

Thanks,

John


On Feb 23, 2010, at 6:51 AM, Yang Zhang wrote:

> I'm relieved that Postgresql itself does not, in fact, suck,


Re: Sorting performance vs. MySQL

From
Yeb Havinga
Date:
Greg Stark wrote:
>
> You can do \set FETCH_COUNT to have psql use a cursor automatically.
>
It seems like a big win in this case. What would be the downside of
having a fetch_count set default in psql?

regards
Yeb Havinga





Re: Sorting performance vs. MySQL

From
Greg Stark
Date:
On Tue, Feb 23, 2010 at 9:48 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:
> Greg Stark wrote:
>>
>> You can do \set FETCH_COUNT to have psql use a cursor automatically.
>>
>
> It seems like a big win in this case. What would be the downside of having a
> fetch_count set default in psql?

They were mentioned previously in this thread:

1) If an error occurs partway through the execution of the query you
might receive part of the result set.

2) psql won't be able to align the columns properly

--
greg

Re: Sorting performance vs. MySQL

From
Alex Hunsaker
Date:
On Tue, Feb 23, 2010 at 00:02, Yang Zhang <yanghatespam@gmail.com> wrote:
> Thing is, this is how I got here:
>
> - ran complex query that does SELECT INTO.
> - that never terminated, so killed it and tried a simpler SELECT (the
> subject of this thread) from psql to see how long that would take.

You might have better luck if you paste the EXPLAIN ANALYZE of the
SELECT INTO here (Maybe a new thread? Maybe on -performance? use your
judgement...).  But I bet if its doing something with transactionid
like your straight select was, an index would help.  If you are just
using SELECT INTO to copy all of the data into a new table... COPY
might be faster or CREATE TABLE AS.

Re: Sorting performance vs. MySQL

From
Baron Schwartz
Date:
Hi,

On Tue, Feb 23, 2010 at 12:51 AM, Yang Zhang <yanghatespam@gmail.com> wrote:
> When running the query in MySQL InnoDB:
>
> $ vmstat 10
> procs -----------memory---------- ---swap-- -----io---- --system--
> -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  0 13 13733604  83020   5648 2193884    3    3   936   168    2    1
> 4  2 89  5  0
>  1 12 13749952  80164   5600 2178032    0 4354   908  4379 3586 2638
> 0  1 38 60  0
>  0 19 13762228  80576   5556 2145220  208 3527  1280  3690 3668 2635
> 1  1 39 59  0
>  0 19 13778632  79420   5560 2135228   52 4186  1046  4191 3682 2418
> 0  1 37 62  0
[snip]
> I'm guessing the swap numbers are because MySQL uses mmap?

InnoDB doesn't use mmap.

Baron