Thread: PGSQL 9.3 - Materialized View - multithreading

PGSQL 9.3 - Materialized View - multithreading

From
Nicolas Paris
Date:
Hello,

My question is about multiprocessing and materialized views.
http://www.postgresql.org/docs/9.3/static/sql-creatematerializedview.html
I (will) have something like 3600 materialized views, and I would like to know how to refresh them in a multithreaded way
(understand: 8 CPU cores -> 8 refresh processes at the same time).


Thanks a lot,

Re: PGSQL 9.3 - Materialized View - multithreading

From
Thom Brown
Date:
On 4 April 2014 17:29, Nicolas Paris <niparisco@gmail.com> wrote:
> Hello,
>
> My question is about multiprocessing and materialized views.
> http://www.postgresql.org/docs/9.3/static/sql-creatematerializedview.html
> I (will) have something like 3600 materialized views, and I would like to
> know how to refresh them in a multithreaded way
> (understand: 8 CPU cores -> 8 refresh processes at the same time).

The only thing that immediately comes to mind would be running a
rather hacky DO function in 4 separate sessions:

DO $$
DECLARE
  -- which of the 8 parallel sessions this is: 0 for the first, 7 for the last
  session CONSTANT BIGINT := 0;
  rec RECORD;
BEGIN
  -- pick out the materialized views whose oid falls into this session's slot
  FOR rec IN
    SELECT quote_ident(nspname) || '.' || quote_ident(relname) AS mv
    FROM pg_class c
    INNER JOIN pg_namespace n ON c.relnamespace = n.oid
    WHERE relkind = 'm'
      AND c.oid::bigint % 8 = session
  LOOP
    RAISE NOTICE 'Refreshing materialized view: %', rec.mv;
    EXECUTE 'REFRESH MATERIALIZED VIEW ' || rec.mv || ';';
  END LOOP;
END$$ LANGUAGE plpgsql;

Where you would set session to 0 for the first session, 1 for the next, and
so on until you reach 7 for the last.  These would each be run in a separate
parallel session, although someone may come up with a better solution.
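
A minimal sketch of launching the eight sessions from a shell (assuming the DO
block above is saved as refresh_slot.sql; the file name and the database name
"mydb" are only examples, not from the original mail):

(In BASH):
  # substitute the slot number into each copy and run eight psql sessions in parallel
  for s in $(seq 0 7); do
    sed "s/:= 0;/:= $s;/" refresh_slot.sql | psql mydb &
  done
  wait   # return only once all eight refresh sessions have finished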

--
Thom


Re: PGSQL 9.3 - Materialized View - multithreading

From
PARIS Nicolas
Date:
Thanks,

"The only thing that immediately comes to mind would be running a
 rather hacky DO function in 4 separate sessions:"
You mean 8 sessions I guess.

8 separate sessions?
Do you have any idea how to manage the sessions? Is it possible to create
separate sessions internally?
Or do I have to make 8 external connections to the database to get 8 processes?
It would be great if I could manage the sessions internally, in PL/SQL for
example.




Re: PGSQL 9.3 - Materialized View - multithreading

From
Thom Brown
Date:
On 4 April 2014 20:49, PARIS Nicolas <niparisco@gmail.com> wrote:
> Thanks,
>
> "The only thing that immediately comes to mind would be running a
>  rather hacky DO function in 4 separate sessions:"
> You mean 8 sessions I guess.

Yes, typo.

> 8 separate sessions?
> Do you have any idea how to manage the sessions? Is it possible to create
> separate sessions internally?
> Or do I have to make 8 external connections to the database to get 8 processes?
> It would be great if I could manage the sessions internally, in PL/SQL for
> example.

Well you can't have multiple sessions per connection, so yes, you'd
need to issue each of them in separate connections.

I can't think of a more convenient way of doing it, but the solution
I've proposed isn't particularly elegant anyway.
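
For what it's worth, a sketch of driving several connections from one
controlling session with the dblink extension (dblink is not mentioned in this
thread; the connection string and view names below are only illustrative):

  CREATE EXTENSION IF NOT EXISTS dblink;

  SELECT dblink_connect('worker0', 'dbname=mydb');
  SELECT dblink_connect('worker1', 'dbname=mydb');

  -- dblink_send_query returns immediately, so the two refreshes run concurrently
  SELECT dblink_send_query('worker0', 'REFRESH MATERIALIZED VIEW mv_a');
  SELECT dblink_send_query('worker1', 'REFRESH MATERIALIZED VIEW mv_b');

  -- poll until both report 0 (finished), then disconnect
  SELECT dblink_is_busy('worker0'), dblink_is_busy('worker1');

  SELECT dblink_disconnect('worker0');
  SELECT dblink_disconnect('worker1');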

--
Thom


Re: PGSQL 9.3 - Materialized View - multithreading

From
PARIS Nicolas
Date:
OK, thanks.

And what about triggers? Wouldn't 8 triggers based on the same event be run
in a multithreaded way?







Re: PGSQL 9.3 - Materialized View - multithreading

From
Thom Brown
Date:
On 4 April 2014 21:07, PARIS Nicolas <niparisco@gmail.com> wrote:
> OK, thanks.
>
> And what about triggers? Wouldn't 8 triggers based on the same event be run
> in a multithreaded way?

I'm not clear on how triggers come into this.  You can't have triggers
on materialized views, and they don't fire triggers on tables or views
that they are based on.
--
Thom


Re: PGSQL 9.3 - Materialized View - multithreading

From
PARIS Nicolas
Date:
This PostgreSQL documentation:
http://www.postgresql.org/docs/9.3/static/ecpg-connect.html
says it is actually possible to manage connections in a C stored procedure.

I may be wrong...


On 04/04/2014 22:14, Thom Brown wrote:
> I'm not clear on how triggers come into this.  You can't have triggers
> on materialized views, and they don't fire triggers on tables or views
> that they are based on.



Re: PGSQL 9.3 - Materialized View - multithreading

From
"ktm@rice.edu"
Date:
On Fri, Apr 04, 2014 at 10:26:22PM +0200, PARIS Nicolas wrote:
> This PostgreSQL documentation:
> http://www.postgresql.org/docs/9.3/static/ecpg-connect.html
> says it is actually possible to manage connections in a C stored procedure.
>
> I may be wrong...

Hi,

I do not know if it can be used in this fashion, but could pl/proxy be
used by defining a cluster that points back to the same server and using a
partitioned remote call? Someone with pl/proxy experience may have more information.

Regards,
Ken


Re: PGSQL 9.3 - Materialized View - multithreading

From
Thom Brown
Date:
On 4 April 2014 21:26, PARIS Nicolas <niparisco@gmail.com> wrote:
> This PostgreSQL documentation:
> http://www.postgresql.org/docs/9.3/static/ecpg-connect.html
> says it is actually possible to manage connections in a C stored procedure.
>
> I may be wrong...

That page doesn't refer to triggers at all, so I'm still not sure what you mean.

--
Thom


Re: PGSQL 9.3 - Materialized View - multithreading

From
PARIS Nicolas
Date:
Right, it does not refer to triggers; it seems to be a kind of compiled C/SQL mix (i.e. external).
To conclude, the options are:
- pl/proxy: it appears difficult, and not designed for this.
- pgAgent (supposed to run jobs in a multithreaded way)
- bash (xargs does the job, as sketched below)
- external scripts (R, Python, Perl...)
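
A rough illustration of the xargs option (requires GNU xargs; the file name and
database name are only examples):

  # refresh_commands.txt holds one "REFRESH MATERIALIZED VIEW ..." statement per line;
  # -P 8 runs up to 8 psql processes at a time, one statement each
  xargs -P 8 -I {} psql mydb -c "{}" < refresh_commands.txt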

So I will test pgAgent and report back.


Thanks




Re: PGSQL 9.3 - Materialized View - multithreading

From
"Graeme B. Bell"
Date:
On 04 Apr 2014, at 18:29, Nicolas Paris <niparisco@gmail.com> wrote:

> Hello,
>
> My question is about multiprocessing and materialized views.
> http://www.postgresql.org/docs/9.3/static/sql-creatematerializedview.html
> I (will) have something like 3600 materialized views, and I would like to know how to refresh them in a multithreaded way
> (understand: 8 CPU cores -> 8 refresh processes at the same time).

Hi Nick,

out of DB solution:

1. Produce a text file which contains the 3600 refresh commands you want to run in parallel. You can do that with
SELECT and format() if you don't have a list already (see the sketch after this list).

2. I'm going to simulate your 3600 'refresh' commands here with some select and sleep statements that finish at unknown times.

(In BASH):
  for i in {1..3600} ; do echo "echo \"select pg_sleep(1+random()::int*10); select $i\" | psql mydb" ; done > 3600commands

3. Install GNU Parallel and type:

parallel < 3600commands

4. Parallel will automatically work out the appropriate number of cores/threads for your CPUs, or you can control it manually with -j.
It will also give you a live progress report if you use --progress.
e.g. this command balances 8 jobs at a time, prints a dynamic progress report and dumps stdout to /dev/null

parallel -j 8 --progress  < 3600commands > /dev/null

5. If you want to make debugging easier use the parameter --tag to tag output for each command.
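
For step 1, a minimal sketch (not from the original mail) of generating the
command file from inside psql; "3600commands" and the database name "mydb"
follow the example above:

(In psql):
  \t on
  \o 3600commands
  -- each output row becomes one shell command for GNU Parallel to run
  SELECT format('psql mydb -c ''REFRESH MATERIALIZED VIEW %I.%I;''',
                n.nspname, c.relname)
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  WHERE c.relkind = 'm';
  \o
  \t off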

Of course it would be much more elegant if someone implemented something like GNU Parallel inside postgres or psql ... :-)

Hope this helps & have a nice day,

Graeme.







Re: PGSQL 9.3 - Materialized View - multithreading

From
Nicolas Paris
Date:
Hello,
Thanks for this clear explanation!

Then I have a sub-question:
Suppose I have 3600 materialized views, say 600 mat views from each of 6 main tables (A, B, C, D, E, F are each repeated 600 times with some differences).
Is it faster to:
1) refresh the 600 views on A in parallel, then the 600 on B, etc.,
OR
2) refresh views on A, B, C, D, E, F in parallel, mixed together?

I guess 1) is faster because there are 600 accesses to the same table, which is already loaded in memory? But does parallel access to the same table imply concurrency problems and bad performance?

Thanks

Nicolas PARIS








Re: PGSQL 9.3 - Materialized View - multithreading

From
"Graeme B. Bell"
Date:
Hi again Nick.

Glad it helped.

Generally, I would expect that doing all the A's first, then all the B's, and so on, would be fastest since you can re-use the data from cache.

Concurrency when reading isn't generally a problem. Lots of things can read at the same time and it will be nice and fast.
It's concurrent writes or concurrent read/write of the same data item that causes problems with locking. That shouldn't be happening here, judging by your description.

If possible, try to make sure nothing is modifying those source tables A/B/C/D/E/F when you are doing your view
refresh.

Graeme.




Re: PGSQL 9.3 - Materialized View - multithreading

From
Nicolas Paris
Date:
Excellent.

Maybe the last sub-question:

Those 3600 mat views do have indexes.
I guess I will get better performance by dropping the indexes first, then refreshing, then re-creating the indexes.

Are there other ways to improve performance (like mat view storage parameters)? This routine will run at night and needs to finish quickly.

Thanks

Nicolas PARIS




Re: PGSQL 9.3 - Materialized View - multithreading

From
"Graeme B. Bell"
Date:
- http://wiki.postgresql.org/wiki/Performance_Optimization
- run it on the most powerful machine you can find
- get some more memory
- get a big (512 GB - 1 TB) SSD drive
- avoid recalculating the same things over and over. If your views have many similar elements, then calculate those first into a partial result, then build the final views from the partial result (see the sketch below).
- make sure your source tables are fully indexed and have good statistics
- run all the views once with \timing and keep track of how long they took. Fix the slow ones.
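
A tiny illustration of the "partial result" idea (the view, table and column
names below are hypothetical, not from this thread):

  -- expensive common part over source table a, computed once
  CREATE MATERIALIZED VIEW a_base AS
    SELECT grp, key, sum(amount) AS total   -- hypothetical columns
    FROM a
    GROUP BY grp, key;

  -- each of the 600 views on a becomes a cheap specialization of a_base
  CREATE MATERIALIZED VIEW a_view_001 AS
    SELECT key, total FROM a_base WHERE grp = 'group_001';

  -- at refresh time, refresh a_base first, then the views built on it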

G

