Thread: Parallel Scaling of a plpgsql problem

Parallel Scaling of a plpgsql problem

From
Venki Ramachandran
Date:
Hi all:
Can someone please guide me on how to solve this problem? If this is the wrong forum, please let me know which one to post in. I am new to Postgres (about 3 months into it).

I have a PostgreSQL 9.0 database on an AWS server (x-large) and a plpgsql function that does some computation. It takes in a date range and, for one pair of personnel (two employees in a company), it calculates some values over that period. It takes about 40ms to complete and give me the answer. All good so far.

Now I have to run the same plpgsql function on all possible combinations of employees, and with 542 employees that is about 300,000 unique pairs.

So (300000 * 40)/(1000 * 60 * 60) = 3.33 hours, and I have to rank the results and show them on a screen. No user wants to wait for 3 hours; they can probably wait 10 minutes (even that is too much for a UI application). How do I solve this scaling problem? Can I have multiple parallel sessions, each with multiple processes that do a pair each at 40 ms, and then collate the results? Does Postgres or plpgsql have any parallel computing capability?

Thanks, Venki

Re: Parallel Scaling of a plpgsql problem

From
Samuel Gendler
Date:


On Wed, Apr 25, 2012 at 11:52 AM, Venki Ramachandran <venki_ramachandran@yahoo.com> wrote:
> I have a PostgreSQL 9.0 database on an AWS server (x-large) and a plpgsql
> function that does some computation. [...] It takes about 40ms to complete
> and give me the answer. [...] Now I have to run the same plpgsql function
> on all possible combinations of employees, and with 542 employees that is
> about 300,000 unique pairs. [...] Does Postgres or plpgsql have any
> parallel computing capability?

The question is: how much of that 40ms is spent performing the calculation, how much is spent querying, how much is function call overhead, and how much is the round trip between client and server with the query and results?  Depending on the breakdown, it is entirely possible that the actual per-record multiplier can be kept down to a couple of milliseconds if you restructure things to query the data in bulk and call a single function to do the work.  If you get it down to 4ms, that's a 20-minute query.  Get it down to 1ms and you're looking at only 5 minutes for what appears to be a fairly compute-intensive report over a relatively large dataset.
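
For illustration, a hedged sketch of the bulk restructuring Samuel describes: enumerate all pairs server-side and invoke the function once per pair inside a single statement, so the client pays one round trip instead of one per pair. The employees table is an assumption; the calc_comp signature is taken from a later message in this thread.

-- One statement, one round trip: all unique pairs, one function call each.
-- The employees table and its emp_id column are assumptions.
SELECT a.emp_id,
       b.emp_id,
       calc_comp(a.emp_id, b.emp_id, 'DAM-RTM',
                 to_date('2012-01-01','yyyy-mm-dd'),
                 to_date('2012-02-01','yyyy-mm-dd'),
                 'DOnPk', 'ptp-obligations', 1.00, 1) AS result
FROM   employees a
JOIN   employees b ON b.emp_id > a.emp_id;   -- each unordered pair once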


Re: Parallel Scaling of a plpgsql problem

From
"Kevin Grittner"
Date:
Venki Ramachandran <venki_ramachandran@yahoo.com> wrote:

> I have a PostgreSQL 9.0 database on an AWS server (x-large) and a
> plpgsql function that does some computation. It takes in a date
> range and, for one pair of personnel (two employees in a company),
> it calculates some values over that period. It takes about 40ms
> to complete and give me the answer. All good so far.

Maybe; maybe not.  If you wrote out *how to do it* in your code, it
probably won't scale well.  The trick to scaling is to write
*declaratively*: say *what you want* rather than *how to get it*.

Aggregates, window functions, CTEs, and/or the generate_series()
function may be useful.  It's hard to give more specific advice
without more detail about the problem.

-Kevin
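
For illustration, a minimal sketch of the declarative style Kevin recommends, assuming an hourly price table prices(emp_id, ts, price) guessed from the log output later in the thread: one set-based statement can produce an aggregate for every pair at once and leave the looping to the planner.

-- All-pairs aggregation in one declarative statement; the prices table
-- and its columns are assumptions, not from the original thread.
SELECT a.emp_id AS src,
       b.emp_id AS snk,
       avg(b.price - a.price) AS avg_diff,
       sum(b.price - a.price) AS total_diff
FROM   prices a
JOIN   prices b ON b.ts = a.ts AND b.emp_id > a.emp_id
WHERE  a.ts >= '2012-01-01' AND a.ts < '2012-02-01'
GROUP  BY a.emp_id, b.emp_id
ORDER  BY avg_diff DESC;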

Re: Parallel Scaling of a plpgsql problem

From
Pavel Stehule
Date:
Hello

2012/4/25 Venki Ramachandran <venki_ramachandran@yahoo.com>:
> Now I have to run the same plpgsql function on all possible combinations
> of employees, and with 542 employees that is about 300,000 unique pairs.
>
> So (300000 * 40)/(1000 * 60 * 60) = 3.33 hours [...] Can I have multiple
> parallel sessions, each with multiple processes that do a pair each at
> 40 ms, and then collate the results? Does Postgres or plpgsql have any
> parallel computing capability?

No, PostgreSQL doesn't support parallel processing of a single query. You
can use some hardcore tricks and implement cooperating functions in C,
but that is hard work for a beginner.  The simplest solution is
parallelism at the application level.

Regards

Pavel Stehule

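A minimal sketch of the application-level parallelism Pavel means, assuming the work is driven by a pair-enumerating query like those shown elsewhere in this thread: each of N client sessions runs the same statement with a different slice number, and a hash-modulo filter deals every pair to exactly one session. All names are illustrative.

-- Run this in 8 separate sessions, substituting slice numbers 0..7
-- (shown: slice 0). hashtext() is PostgreSQL's internal text hash.
SELECT calc_comp(a.emp_id, b.emp_id, 'DAM-RTM',
                 to_date('2012-01-01','yyyy-mm-dd'),
                 to_date('2012-02-01','yyyy-mm-dd'),
                 'DOnPk', 'ptp-obligations', 1.00, 1)
FROM   employees a
JOIN   employees b ON b.emp_id > a.emp_id
WHERE  abs(hashtext(a.emp_id || b.emp_id)) % 8 = 0;  -- slice 0 of 8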

Re: Parallel Scaling of a plpgsql problem

From
Venki Ramachandran
Date:
Another question (probably a silly mistake) while debugging this problem:
I put insert statements into the plpgsql code to collect current_timestamp and see where the code was spending most of its time. The output is as follows:

                                                                log_text                                                                 |        insertdatetime
------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------
 INPUT VARIABLES,Src,emp_id_1,emp_id_100,StartDate,2012-01-01,EndDate,2012-02-01,Tou,DOnPk,CRR Type,ptp-obligations,MinPositivePirce,1.00  | 2012-04-25 20:56:42.691965+00
 Starting get_DAM_Value function                                                                                                           | 2012-04-25 20:56:42.691965+00
 StartDate,2012-01-01 01:00:00,EndDate,2012-02-02 00:00:00                                                                                 | 2012-04-25 20:56:42.691965+00
 StartHr,vSrcprice,vSinkprice,vDiff,vAvgValue,vTotalDAMValue,vPkTotal,vOffPkTotal                                                          | 2012-04-25 20:56:42.691965+00
 2012-01-01 01:00:00,18.52,16.15,-2.370000,-2.370000,0.000000,0.000000,-2.370000                                                           | 2012-04-25 20:56:42.691965+00
 2012-01-01 02:00:00,17.22,15.60,-1.620000,-1.620000,0.000000,0.000000,-3.990000                                                           | 2012-04-25 20:56:42.691965+00
 2012-01-01 03:00:00,18.22,17.55,-0.670000,-0.670000,0.000000,0.000000,-4.660000                                                           | 2012-04-25 20:56:42.691965+00
 2012-01-01 04:00:00,18.53,18.13,-0.400000,-0.400000,0.000000,0.000000,-5.060000                                                           | 2012-04-25 20:56:42.691965+00
 ...
 ...
 ...
 ...
 2012-02-02 00:00:00,2,17.13,17.13,0.000000,0.000000,0.000000,-7940.250000,-5216.290000                                                    | 2012-04-25 20:56:42.691965+00
 2012-02-02 00:00:00,3,16.54,16.54,0.000000,0.000000,0.000000,-7940.250000,-5216.290000                                                    | 2012-04-25 20:56:42.691965+00
 2012-02-02 00:00:00,4,16.27,16.27,0.000000,0.000000,0.000000,-7940.250000,-5216.290000                                                    | 2012-04-25 20:56:42.691965+00
 TotalPEAKMVal,-7940.250000,NumIntervals,2034,AvgRTMVAlue,-3.903761                                                                        | 2012-04-25 20:56:42.691965+00
 OUTPUT VARIABLES,AvgPurValue,-2.84,AvgSaleValue,-3.90,Profit,-1.06,ProfitPercentage,-106.00                                               | 2012-04-25 20:56:42.691965+00
(3832 rows)

Why doesn't the current_timestamp value change within the plpgsql code? For every select from a table to compute, I insert a row into my debug_log dummy table. All 3832 rows have the same current_timestamp value, which I was hoping to capture with an insert statement like the following:
insert into debug_log (log_text, insertdatetime) values ('some log text', current_timestamp);

But when I run this procedure from the psql command line, it takes 11 seconds. (I used a bigger date range than the one behind the 40 ms figure in my first post, hence the 11 seconds.)

dev=> select calc_comp('emp_id_1','emp_id_100','DAM-RTM',to_date('2012-01-01','yyyy-mm-dd'),to_date('2012-02-01','yyyy-mm-dd'),'DOnPk','ptp-obligations',1.00, 0);
         crr_valuation         
-------------------------------
 0||-2.84|-3.90|-1.06|-106.00|
(1 row)

Time: 11685.828 ms

The last input value (1/0) in the call above controls whether log messages are inserted: 0 = insert, 1 = do not insert. When I toggle the flag, the overall timing does not change. Shouldn't writing 3832 rows to a table take at least a few milliseconds?
And why is my current_timestamp value not changing? I was expecting the difference between the last row's timestamp and the first row's timestamp to equal my 11.685 seconds. What is wrong here?


-Venki


Re: Parallel Scaling of a plpgsql problem

From
Pavel Stehule
Date:
Hello

http://www.postgresql.org/docs/8.1/static/functions-datetime.html

CURRENT_TIME and CURRENT_TIMESTAMP deliver values with time zone;
LOCALTIME and LOCALTIMESTAMP deliver values without time zone.

CURRENT_TIME, CURRENT_TIMESTAMP, LOCALTIME, and LOCALTIMESTAMP can
optionally take a precision parameter, which causes the result to be
rounded to that many fractional digits in the seconds field. Without a
precision parameter, the result is given to the full available
precision.

Some examples:

SELECT CURRENT_TIME;
Result: 14:39:53.662522-05

SELECT CURRENT_DATE;
Result: 2001-12-23

SELECT CURRENT_TIMESTAMP;
Result: 2001-12-23 14:39:53.662522-05

SELECT CURRENT_TIMESTAMP(2);
Result: 2001-12-23 14:39:53.66-05

SELECT LOCALTIMESTAMP;
Result: 2001-12-23 14:39:53.662522

Since these functions return the start time of the current
transaction, their values do not change during the transaction. This
is considered a feature: the intent is to allow a single transaction
to have a consistent notion of the "current" time, so that multiple
modifications within the same transaction bear the same time stamp.

    Note: Other database systems might advance these values more frequently.

PostgreSQL also provides functions that return the start time of the
current statement, as well as the actual current time at the instant
the function is called. The complete list of non-SQL-standard time
functions is:

transaction_timestamp()
statement_timestamp()
clock_timestamp()
timeofday()
now()

Regards

Pavel Stehule
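
A minimal psql check of the distinction, run inside one transaction (pg_sleep() stands in for real work):

BEGIN;
SELECT current_timestamp, clock_timestamp();
SELECT pg_sleep(2);
SELECT current_timestamp, clock_timestamp();
-- current_timestamp is identical in both rows; clock_timestamp() has
-- advanced by roughly the two seconds spent in pg_sleep().
COMMIT;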


Re: Parallel Scaling of a plpgsql problem

From
Venki Ramachandran
Date:
Replacing current_timestamp with transaction_timestamp() and statement_timestamp() did not help.

My timestamp values are still the same. I can't believe this is not possible in PG. In Oracle you can use 'select sysdate from dual', insert that value, and see which SQL statement or function is taking a long time during development. All I want to do is find out where my plpgsql code is spending its 11 seconds. The overall pseudo-code is as follows:

while end_date >= start_date loop
    select some_columns into vars1 from tbl_1 where emp_id = 'first emp';
    select some_columns into vars2 from tbl_1 where emp_id = 'second emp';
    -- do some computation with vars1 and vars2
    select calc_comp(emp_1, emp_2) into v_profit;
end loop;

I want to see, for each select and each call to the function, what the system time is (or anything else that tells me where the 11 seconds are being spent). How do I do that?

Thanks, Venki


Re: Parallel Scaling of a plpgsql problem

From
Tom Lane
Date:
Venki Ramachandran <venki_ramachandran@yahoo.com> writes:
> Replacing current_timestamp with transaction_timestamp() and statement_timestamp() did not help.

You did not read the documentation you were pointed to.  Use
clock_timestamp().

            regards, tom lane
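
A minimal sketch of the step-timing this enables inside PL/pgSQL (DO blocks exist as of 9.0, which Venki has; the pg_sleep() calls are stand-ins for the real work):

DO $$
DECLARE
    t0 timestamptz := clock_timestamp();
BEGIN
    PERFORM pg_sleep(0.3);   -- stand-in for the first SELECT ... INTO
    RAISE NOTICE 'step 1 took %', clock_timestamp() - t0;

    t0 := clock_timestamp();
    PERFORM pg_sleep(0.2);   -- stand-in for the computation
    RAISE NOTICE 'step 2 took %', clock_timestamp() - t0;
END
$$;

The same applies to the debug_log inserts earlier in the thread: store clock_timestamp() instead of current_timestamp and the per-row timestamps will advance.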

Re: Parallel Scaling of a plpgsql problem

From
Merlin Moncure
Date:
On Wed, Apr 25, 2012 at 1:52 PM, Venki Ramachandran
<venki_ramachandran@yahoo.com> wrote:
> So (300000 * 40)/(1000 * 60 * 60) = 3.33 hours, and I have to rank the
> results and show them on a screen. [...] Can I have multiple parallel
> sessions [...] and then collate the results? Does Postgres or plpgsql
> have any parallel computing capability?

What's the final output of the computation -- are you inserting into a
table? If so, it should be trivially parallelizable any number of ways;
it's pretty easy to do via a bash script, for example (I can give you
some code to do that if you want).

merlin

Re: Parallel Scaling of a plpgsql problem

From
Venki Ramachandran
Date:
Thanks Tom, clock_timestamp() worked. Appreciate it! And sorry -- I was hurrying to get this done at work and did not read the docs through.

Can you comment on how you would solve the original problem? Even if I can get the 11 seconds down to 500 ms for one pair, running it for 300k pairs will still take multiple hours. How can one write a combination of bash script and plpgsql code so as to use all 8 cores of a server? I am seeing that this just executes in one session/process.

thanks and regards, Venki




Re: Parallel Scaling of a plpgsql problem

From
Jan Nielsen
Date:
On Wed, Apr 25, 2012 at 12:52 PM, Venki Ramachandran <venki_ramachandran@yahoo.com> wrote:
> So (300000 * 40)/(1000 * 60 * 60) = 3.33 hours, and I have to rank the
> results and show them on a screen. No user wants to wait for 3 hours;
> they can probably wait 10 minutes (even that is too much for a UI
> application). How do I solve this scaling problem?

Setting aside the database concurrency question, have you considered application-level solutions?

How often does a user expect their rank to change? If a daily rank change is fine, trigger the (lengthy) ranking calculation nightly and cache the results in a materialized view for all users; you could continuously rebuild the view to improve freshness to within 4 hours. To go faster with an application-level solution, you will have to reduce your calculation to *what's* likely to be most important to the individual which, again, you can cache; or, if you can predict *who's* most likely to request a ranking, calculate these first; or, both.

These are likely good things to consider regardless of any improvements you make to the back-end ranking calculation, though you will hit a point of diminishing returns once the calculation drops below some "tolerable" wait. In the web world, "tolerable" is about 3 seconds for the general public and about 30 seconds for a captive audience, e.g., employees. YMMV.


Cheers,

Jan
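
A rough sketch of the nightly cache Jan describes. PostgreSQL 9.0 has no built-in materialized views, so the "view" here is a plain table rebuilt by a scheduled job; pair_profits stands in for however the 300k per-pair results are produced, and all names are hypothetical.

-- Rebuild the ranking cache in one transaction; readers block briefly on
-- the TRUNCATE rather than ever seeing a half-built result set.
BEGIN;
TRUNCATE pair_rank_cache;
INSERT INTO pair_rank_cache (emp_a, emp_b, profit, rnk)
SELECT emp_a, emp_b, profit,
       rank() OVER (ORDER BY profit DESC)
FROM   pair_profits;
COMMIT;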

Re: Parallel Scaling of a plpgsql problem

From
Yeb Havinga
Date:
On 2012-04-26 04:40, Venki Ramachandran wrote:
> Thanks Tom, clock_timestamp() worked. [...] Even if I can get the 11
> seconds down to 500 ms for one pair, running it for 300k pairs will
> still take multiple hours. How can one write a combination of bash
> script and plpgsql code so as to use all 8 cores of a server?

You want to compute a calculation on the cross product 'employee x employee'. If employee is partitioned into emp1, emp2, ..., emp8, the cross product is equal to the union of emp1 x employee, emp2 x employee, ..., emp8 x employee. Each of these eight partition cross products can be executed in parallel. I'd look into dblink to run the eight cross products in parallel and then union all of the results.

http://www.postgresql.org/docs/9.1/static/contrib-dblink-connect.html

regards,
Yeb
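
A rough sketch of the dblink approach with two of the eight partitions: dblink_send_query() is asynchronous, so the partitions run concurrently, and dblink_get_result() then blocks until each finishes. The connection strings, the part column, and the result shape are all illustrative.

SELECT dblink_connect('w0', 'dbname=dev');
SELECT dblink_connect('w1', 'dbname=dev');

-- Fire both partitions without waiting for results.
SELECT dblink_send_query('w0',
       'SELECT emp_a, emp_b, profit FROM pair_calc WHERE part = 0');
SELECT dblink_send_query('w1',
       'SELECT emp_a, emp_b, profit FROM pair_calc WHERE part = 1');

-- Collect; total wall time is roughly the slowest partition, not the sum.
SELECT * FROM dblink_get_result('w0') AS t(emp_a text, emp_b text, profit numeric)
UNION ALL
SELECT * FROM dblink_get_result('w1') AS t(emp_a text, emp_b text, profit numeric);

SELECT dblink_disconnect('w0');
SELECT dblink_disconnect('w1');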

Re: Parallel Scaling of a plpgsql problem

From
Greg Spiegelberg
Date:
On Wed, Apr 25, 2012 at 12:52 PM, Venki Ramachandran <venki_ramachandran@yahoo.com> wrote:

> Now I have to run the same plpgsql function on all possible combinations
> of employees, and with 542 employees that is about 300,000 unique pairs.
>
> So (300000 * 40)/(1000 * 60 * 60) = 3.33 hours, and I have to rank the
> results and show them on a screen. [...] How do I solve this scaling
> problem?

Interesting problem.  

How frequently does the data change?  Hourly, daily, monthly?
How granular are the time frames in the typical query?  Seconds, minutes, hours, days, weeks?

I'm thinking that if you can prepare the data ahead of time, as it changes, via a trigger or client-side code, then your problem will go away pretty quickly.

-Greg
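
As one hedged sketch of that prepare-ahead idea: roll prices up per employee per hour as rows arrive, so the pair calculation later reads small pre-aggregated rows instead of raw data. All table and column names are assumptions, and the update-then-insert upsert is the simple 9.0-era pattern (it ignores the concurrent-insert race a production version would need to handle).

CREATE TABLE emp_hourly (
    emp_id text,
    hr     timestamptz,
    total  numeric,
    n      bigint,
    PRIMARY KEY (emp_id, hr)
);

CREATE OR REPLACE FUNCTION roll_up_price() RETURNS trigger AS $$
BEGIN
    -- Try to bump the existing hourly bucket first.
    UPDATE emp_hourly
       SET total = total + NEW.price, n = n + 1
     WHERE emp_id = NEW.emp_id AND hr = date_trunc('hour', NEW.ts);
    IF NOT FOUND THEN
        INSERT INTO emp_hourly (emp_id, hr, total, n)
        VALUES (NEW.emp_id, date_trunc('hour', NEW.ts), NEW.price, 1);
    END IF;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER prices_rollup
    AFTER INSERT ON prices
    FOR EACH ROW EXECUTE PROCEDURE roll_up_price();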