Thread: about multiprocessing mass data

about multiprocessing mass data

From
superman0920
Date:
Hi list!
 
I have a table with 8,500,000 rows, and I wrote a Java program to update
these records.
I use 100 threads to do the updates. For example, thread-1 updates records
1~85000, thread-2 updates records 85001~170000, and so on.
The update SQL just removes the spaces from one column, so it is simple:
update poi set py=replace(py,' ','') where id=?;
 
From the program log, I can see the database is processing the data very
slowly: each thread needs about 260s to update 1000 rows.
 
BTW: the PG server is running on a PC. The PC has 2GB of total memory
and an "Intel(R) Core(TM)2 Duo E7500 2.93GHz" CPU.
While the program is running, CPU usage is only about 1% and only 112MB
of memory is free.
 
Could the PC's low configuration be the cause of the problem?
 
Please help ~~
 
 
Thanks for any tips,

superman0920

Re: about multiprocessing mass data

From
Tomas Vondra
Date:
On 4.4.2012 17:52, superman0920 wrote:
> Hi list!
>
> I have a table with 8,500,000 rows, and I wrote a Java program to
> update these records.
> I use 100 threads to do the updates. For example, thread-1 updates
> records 1~85000, thread-2 updates records 85001~170000, and so on.
> The update SQL just removes the spaces from one column, so it is simple:
> update poi set py=replace(py,' ','') where id=?;

That's a very naive approach. It's very likely each thread will do an
index scan for each update (to evaluate the 'id=?' condition), and
that's going to cost you much more than you gain, because index scans
are quite CPU and I/O intensive.
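
For illustration, each single-row update plans roughly like this (the
index name here is hypothetical, assuming a btree on poi.id):

   EXPLAIN UPDATE poi SET py = replace(py, ' ', '') WHERE id = 1;
   -- Update on poi
   --   ->  Index Scan using poi_id_idx on poi
   --         Index Cond: (id = 1)

and you pay that index lookup 8.5 million times.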

Simply update the whole table by

   UPDATE poi SET py = replace(py, ' ','');

Have you actually tried how this performs or did you guess 'it's
definitely going to be very slow so I'll use multiple threads to make
that faster'?

If you really need to parallelize this, you need to do it differently
- e.g. use 'ctid' to update one page range at a time, like this:

   UPDATE poi SET py = replace(py, ' ','')
    WHERE ctid >= '(n,0)'::tid AND ctid < '(n+1,0)'::tid;

where 'n' ranges between 0 and the number of pages in the table (see
relpages in pg_class).
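
A minimal sketch of the bookkeeping (the 1000-page chunk size is just an
example, not a tuned value):

   -- total number of pages in the table:
   SELECT relpages FROM pg_class WHERE relname = 'poi';

   -- worker 1 then takes pages 0..999, worker 2 pages 1000..1999, etc.:
   UPDATE poi SET py = replace(py, ' ', '')
    WHERE ctid >= '(0,0)'::tid AND ctid < '(1000,0)'::tid;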

But try the simple UPDATE first, my guess is it's going to be much
faster than you expect.

Tomas

Re: about multiprocessing mass data

From
Tomas Vondra
Date:
On 4.4.2012 18:49, superman0920 wrote:
> Thank you for your reply.
> I tried executing "UPDATE poi SET py = replace(py, ' ','');" and it
> took a very long time (20+ hours) with no errors reported, as if the
> table were locked.


OK, that's weird. So we need a bit more details - what PostgreSQL
version is this?

How much space does the table actually occupy? Try this:

SELECT relname, relpages, reltuples FROM pg_class WHERE relname = 'poi';

And finally we need the EXPLAIN output for both UPDATE commands. Don't
post it here directly - put it on explain.depesz.com and post just the
link.

Further, we need to see the actual table definition, especially whether
there are any triggers or foreign keys on the table.
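
In psql, for example (assuming the table really is named 'poi'):

   \d poi
   SELECT pg_size_pretty(pg_total_relation_size('poi'));

The \d output includes columns, indexes, triggers and foreign keys, and
pg_total_relation_size counts indexes and TOAST too.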

Tomas


Re: about multiprocessing mass data

From
Tomas Vondra
Date:
On 5.4.2012 15:44, superman0920 wrote:
> Sure, I will post that tomorrow.
>
> Today I installed PG and MySQL on a server. I inserted 850000 rows
> into each db.
> I executed "select count(*) from poi_all_new" in both DBs.
> MySQL takes 0.9s
> PG takes 364s

First of all, keep the list (pgsql-performance@postgresql.org) on the
CC. You keep responding to me directly, therefore others can't respond
to your messages (and help you).

Are you sure the comparison was fair, i.e. both machines containing the
same amount of data (not number of rows, amount of data), configured
properly etc.? Have you used the same table structure (how did you
represent geometry data type in MySQL)?

For example I bet you're using MyISAM. In that case, it's comparing
apples to oranges (or maybe cats, so different it is). MyISAM does not
do any MVCC stuff (visibility checking, ...) and simply reads the number
of rows from a catalogue. PostgreSQL actually has to scan the whole
table - that's a big difference. This is probably the only place where
MySQL (with MyISAM) beats PostgreSQL. But once you switch to a proper
storage engine (e.g. InnoDB) it'll have to scan the data just like
PostgreSQL - try that.
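
You can see that scan right in the plan (costs elided):

   EXPLAIN SELECT count(*) FROM poi_all_new;
   -- Aggregate
   --   ->  Seq Scan on poi_all_new

whereas MyISAM just returns its stored row count without touching the
data at all.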

Anyway, this benchmark is rubbish because you're not going to do this
query often - use queries that actually make sense for the application.

Nevertheless, it seems there's something seriously wrong with your
machine or the environment (OS), probably I/O.

I've done a quick test - I've created the table (without the 'geometry'
column because I don't have postgis installed), filled it with one
million rows and executed 'select count(*)'. See this:

    http://pastebin.com/42cAcCqu

This is what I get:

======================================================================
test=# SELECT pg_size_pretty(pg_relation_size('test_table'));
 pg_size_pretty
----------------
 1302 MB
(1 row)

test=#
test=# \timing on
Timing is on.
test=#
test=# SELECT count(*) from test_table;
  count
---------
 1000000
(1 row)

Time: 2026,695 ms
======================================================================

so it's running the 'count(*)' in two seconds. If I run it again, I get
this:

======================================================================
test=# SELECT count(*) from test_table;
  count
---------
 1000000
(1 row)

Time: 270,020 ms
======================================================================

Yes, that's 0,27 seconds. And this is *only* my workstation - Core i5 (4
cores), 8GB of RAM, nothing special.

These results obviously depend on the data being available in page
cache. If that's not the case, PostgreSQL needs to read them from the
drive (and then it's basically i/o bound) - I can get about 250 MB/s
from my drives, so I get this:

======================================================================
test=# SELECT count(*) from test_table;
  count
---------
 1000000
(1 row)

Time: 5088,739 ms
======================================================================

If you have slower drives, the dependency is about linear (half the
speed -> twice the time). So either your drives are very slow, or
there's something rotten.

I still haven't seen iostat / vmstat output ... that'd tell us much more
about the causes.

Tomas

Re: about multiprocessing mass data

From
"Kevin Grittner"
Date:
Tomas Vondra <tv@fuzzy.cz> wrote:
> On 5.4.2012 15:44, superman0920 wrote:

>> Today I installed PG and MySQL on a server. I inserted 850000 rows
>> into each db.
>> I executed "select count(*) from poi_all_new" in both DBs.
>> MySQL takes 0.9s
>> PG takes 364s

> Are you sure the comparison was fair, i.e. both machines
> containing the same amount of data (not number of rows, amount of
> data), configured properly etc.?

Don't forget the "hint bits" issue -- if the count(*) was run
immediately after the load (without a chance for autovacuum to get
in there), all the data was rewritten in place to set the hint bits.
I remember how confusing that was for me the first time I saw it.
It's very easy to get a false impression of overall PostgreSQL
performance from that type of test, and it's the sort of test a lot
of people will do on an ad hoc basis.
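
A simple way to take that rewrite out of the measurement after a bulk
load (a plain VACUUM sets the hint bits; ANALYZE is just a good habit
here):

   VACUUM ANALYZE poi_all_new;
   SELECT count(*) FROM poi_all_new;  -- now reads without rewriting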

-Kevin

Re: about multiprocessing mass data

From
Merlin Moncure
Date:
On Thu, Apr 5, 2012 at 9:47 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> [...]
>
> If you have slower drives, the dependency is about linear (half the
> speed -> twice the time). So either your drives are very slow, or
> there's something rotten.
>
> I still haven't seen iostat / vmstat output ... that'd tell us much more
> about the causes.

The geometry column can potentially be quite wide. One thing we need to
see is whether the table has any indexes -- in particular gist/gin on
the geometry column.
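
For instance, pg_indexes shows that in one query:

   SELECT indexname, indexdef
     FROM pg_indexes
    WHERE tablename = 'poi_all_new';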

merlin

Re: about multiprocessing mass data

From
Tomas Vondra
Date:
On 10.4.2012 00:37, Merlin Moncure wrote:
> On Thu, Apr 5, 2012 at 9:47 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> If you have slower drives, the dependency is about linear (half the
>> speed -> twice the time). So either your drives are very slow, or
>> there's something rotten.
>>
>> I still haven't seen iostat / vmstat output ... that'd tell us much more
>> about the causes.
>
>> The geometry column can potentially be quite wide. One thing we need to
>> see is whether the table has any indexes -- in particular gist/gin on
>> the geometry column.

Yeah, but in one of the previous posts the OP posted this:

   relname   | relpages |  reltuples
-------------+----------+-------------
 poi_all_new |  2421133 | 6.53328e+06

which means the table has ~19GB for 6.5 million rows, i.e. about 2.8GB
per million rows, or ~3kB per row. I've been working with 1 million
rows and 1.3GB of data, so about half that row size.

But this does not explain why the SELECT COUNT(*) takes 364 seconds on
that machine. That'd mean ~8MB/s.
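
The arithmetic behind those estimates (8kB pages; the throughput figure
assumes the count ran on the 850k-row test table):

   SELECT 2421133 * 8192.0 / 1024^3    AS table_gb,       -- ~18.5
          2421133 * 8192.0 / 6533280   AS bytes_per_row,  -- ~3036
          850000 * 3036.0 / 364 / 1e6  AS mb_per_sec;     -- ~7.1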

Regarding the indexes, the OP already posted a description of the
table and apparently there are these indexes:

Indexes:
    "poi_all_new_pk" PRIMARY KEY, btree (ogc_fid)
    "poi_all_new_flname_idx" btree (flname)
    "poi_all_new_geom_idx" btree (wkb_geometry)
    "poi_all_new_ogc_fid_idx" btree (ogc_fid)
    "poi_all_new_pinyin_idx" btree (pinyin)

So none of them is GIN/GIST, although one of them is on the geometry
column.

T.

Re: about multiprocessing mass data

From
Merlin Moncure
Date:
On Mon, Apr 9, 2012 at 6:50 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
> On 10.4.2012 00:37, Merlin Moncure wrote:
>> [...]
>>
>> The geometry column can potentially be quite wide. One thing we need to
>> see is whether the table has any indexes -- in particular gist/gin on
>> the geometry column.
>
> Yeah, but in one of the previous posts the OP posted this:
>
>    relname   | relpages |  reltuples
> -------------+----------+-------------
>  poi_all_new |  2421133 | 6.53328e+06
>
> which means the table has ~19GB for 6.5 million rows, i.e. about 2.8GB
> per million rows, or ~3kB per row. I've been working with 1 million
> rows and 1.3GB of data, so about half that row size.
>
> But this does not explain why the SELECT COUNT(*) takes 364 seconds on
> that machine. That'd mean ~8MB/s.
>
> Regarding the indexes, the OP already posted a description of the
> table and apparently there are these indexes:
>
> Indexes:
>    "poi_all_new_pk" PRIMARY KEY, btree (ogc_fid)
>    "poi_all_new_flname_idx" btree (flname)
>    "poi_all_new_geom_idx" btree (wkb_geometry)
>    "poi_all_new_ogc_fid_idx" btree (ogc_fid)
>    "poi_all_new_pinyin_idx" btree (pinyin)
>
> So none of them is GIN/GIST, although one of them is on the geometry
> column.

hm. well, there's a duplicate index in there: ogc_fid is indexed twice.
How much bloat is on the table (let's see an ANALYZE VERBOSE)? What's
the storage for this database?
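
A sketch of the obvious cleanup (assuming the primary key
poi_all_new_pk stays, the standalone btree on ogc_fid is redundant):

   DROP INDEX poi_all_new_ogc_fid_idx;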

merlin