Thread: suggestions to improve postgresql suitability for data-mining

suggestions to improve postgresql suitability for data-mining

From: Fabien COELHO
Date: Tue, 22 Jul 2003 18:39:33 +0200
Dear PostgreSQL developers,


I have considered postgresql for data-mining, and I would like to
share some comments on the issues involved, which result in several
specific suggestions for improvements in various areas. I've noticed
that some of the issues seem to be listed in the todo list, so this may
be an encouragement to implement them earlier ;-)


I) Data-Mining
--------------

First, I call data-mining the bulk processing of a large amount of
data to compute some small summary for analysis.

Let us say you have a "client" table with some data about 10M clients
and an "invoice" table with 120M invoices for these clients. It may
look like this (I put INTEGER everywhere to simplify):

CREATE TABLE client (
  id INTEGER PRIMARY KEY,
  area INTEGER, -- from 0 to 99
  type INTEGER  -- from 0 to 3
);

CREATE TABLE invoice (
  client INTEGER REFERENCES client,
  amount INTEGER, -- how much money was asked for
  month INTEGER   -- the month the invoice was sent, from 0 to 11
);

You want to process all invoices to count them and to sum up the amounts
on a per month/area/type basis. The initial data size is in GB,
but the size of the expected result is in KB (namely 2 values, a sum and
a count, for each of the 100 areas * 12 months * 4 types = 4800 groups).


II) SQL
-------

The first idea is to ask SQL to do the job with a 'group by' clause:

SELECT area, type, month, SUM(amount), COUNT(*)
FROM client AS c, invoice AS i
WHERE c.id=i.client
GROUP BY area, type, month;

As I am just interested in reading the data, without any transaction, I
tuned the database parameters a little (fsync=false, larger
shared_buffers and sort_mem).

It works, but it is quite slow and it requires a lot of disk space.
Indeed, the result of the join is big, and the aggregation seems to
require an external sort step so as to sum up data one group after the
other.

As the resulting table is very small, I wish the optimizer had skipped
the sort phase, so as to aggregate the data as they come out of the
join. All could be done on the fly without much additional storage
(well, with some implementation effort). Maybe this is the "hash
evaluation of group by aggregates" item listed in the todo list.
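
To illustrate, here are the two plan shapes I have in mind, sketched by
hand (node names are indicative only, not actual EXPLAIN output):

EXPLAIN SELECT area, type, month, SUM(amount), COUNT(*)
FROM client AS c, invoice AS i
WHERE c.id=i.client
GROUP BY area, type, month;

-- what I observe (sort-based aggregation):
--   GroupAggregate
--     -> Sort on (area, type, month)  -- external sort, spills to disk
--       -> Join on (c.id = i.client)
--
-- what I wish for (hash-based aggregation):
--   HashAggregate  -- one small in-memory table, ~4800 groups
--     -> Join on (c.id = i.client)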


III) PL/pgSQL
-------------

Ok, if postgresql does not want to do it my way, let us make it do it.
Thus I wrote some PL/pgSQL function for my purpose, something like:

CREATE TEMPORARY TABLE tmp (
  area INTEGER,
  type INTEGER,
  month INTEGER,
  amount INTEGER,
  count INTEGER,
  PRIMARY KEY(area, type, month)
);
-- initialize tmp
FOR i IN 0..99 LOOP FOR j IN 0..3 LOOP FOR k IN 0..11 LOOP
  INSERT INTO tmp VALUES(i,j,k,0,0);
END all LOOPs;
-- fill tmp
FOR tuple IN
  SELECT area, type, month, amount FROM client, invoice WHERE id=client
LOOP
  UPDATE tmp SET amount=amount+tuple.amount, count=count+1
    WHERE area=tuple.area AND type=tuple.type AND month=tuple.month
END LOOP;
...
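
Spelled out as an actual function, the sketch above would be roughly
the following (the function name is mine, details may vary):

CREATE FUNCTION compute_summary() RETURNS INTEGER AS '
DECLARE
  tuple RECORD;
BEGIN
  -- initialize tmp with one all-zero row per (area, type, month)
  FOR i IN 0..99 LOOP
    FOR j IN 0..3 LOOP
      FOR k IN 0..11 LOOP
        INSERT INTO tmp VALUES (i, j, k, 0, 0);
      END LOOP;
    END LOOP;
  END LOOP;
  -- fill tmp: one UPDATE per tuple of the join
  FOR tuple IN
    SELECT area, type, month, amount
    FROM client, invoice WHERE id = client
  LOOP
    UPDATE tmp SET amount = amount + tuple.amount, count = count + 1
      WHERE area = tuple.area AND type = tuple.type
        AND month = tuple.month;
  END LOOP;
  RETURN 0;
END;
' LANGUAGE plpgsql;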

It is very SLOOOOOOOOOOOWWWWWWWWW... 10 to 100 times slower than the
previous one. Exit PL/pgSQL.


IV) Basic client side (JDBC, DBI, libpq)
----------------------------------------

Then I wrote the same stuff on the client side in Java with JDBC, Perl
with DBI and C with libpq, by browsing the above SELECT in a simple
loop and aggregating the data directly in the language. In all 3
cases, the process attempts to allocate the full result of the client
and invoice join in memory... a **very** bad idea indeed!

I checked that the postgres client-server protocol does not allow
chunking the result of a select, as only one response is sent for one
query.

I suggest that this behavior should be changed, as the ODBC/DBI/JDBC
interfaces are designed to allow the client to process data as they
come out of the database, even if the query is not finished yet.

The library should do the chunking on its own automatically, either by
doing CURSOR/FETCHes internally in the library implementation of
SELECT, or by changing the protocol so that results are sent in chunks
when required.
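
In other words, on a plain SELECT the library could transparently run
something like the following (cursor name and chunk size are mine):

BEGIN;
DECLARE mine_cursor CURSOR FOR
  SELECT area, type, month, amount
  FROM client AS c, invoice AS i
  WHERE c.id = i.client;
-- repeat until no more rows come back, handing each chunk
-- to the application as it arrives
FETCH 1000 FROM mine_cursor;
CLOSE mine_cursor;
COMMIT;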

This is listed in the todo list of the JDBC interface, but there is
nothing about the Perl interface or the libpq interface.


V) Less basic client side (DBI, libpq)
--------------------------------------

I've redone the previous stuff, but with an explicit CURSOR and a
FETCH loop. It worked better, but it is still slow and still requires
a lot of disk space. Indeed, the database seems to first generate the
join in a temporary table on disk (I need twice as much disk space
available as the original database), which is then sent back to the
client. Thus I pay a read/write/read of the whole tables, although
I had hoped that reading the data only once would have been enough.

I would suggest that processing data on the fly be done really
on the fly, without intermediate storage, rather than providing just
an on-the-fly interface without the real thing behind it. I haven't
seen any item in the todo list about this issue. I'm not sure it is
really easy to implement.


Conclusion
----------

I have not succeeded in getting from postgresql the performance
I was expecting for data-mining.

I could get them if postgresql could be improved on some or all
of the following items:

(1) the execution engine may aggregate grouped data without a sort in
    some cases.

(2) the PL/pgSQL interpreter would be a great deal faster.

(3) the client programming interfaces would provide a real on-the-fly
    (without intermediate storage) fetching mechanism.

(4) Also, I noticed that temporary tables/indexes created by postgresql
    when processing a request are stored in the same partition as the
    database in use. What about "/tmp" or other partitions? Maybe
    a set of other directories could be designated for this purpose?

Hope this helps... at least to add new items to the postgresql todo list ;-)

Have a nice day,

-- 
Fabien.


Re: suggestions to improve postgresql suitability for data-mining

From: Bruno Wolff III
On Tue, Jul 22, 2003 at 18:39:33 +0200, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> 
> As the resulting table is very small, I wish the optimizer had skipped
> the sort phase, so as to aggregate the data as they come out of the
> join. All could be done on the fly without much additional storage
> (well, with some implementation effort). Maybe this is the "hash
> evaluation of group by aggregates" item listed in the todo list.

Yes, the hash aggregate addition may help you. This has been implemented
in 7.4, which will be in beta shortly.


Re: suggestions to improve postgresql suitability for data-mining

From: "Darren King"
> You want to process all invoices to count them
> and to sum up the amounts on a per month/area/type
> basis. The initial data size is in GB, but the
> size of the expected result is in KB (namely 2 values,
> a sum and a count, for each of the 100 areas * 12
> months * 4 types = 4800 groups).

The key to handling large datasets for data mining is pre-aggregation based on the smallest time frame needed for
details.

I'd suggest running these large queries and storing the results in other
tables, and then writing a set of functions to work with those aggregate
tables.

No sense in summing up the same set of static data more than once if you can help it.
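
For instance, something along these lines (table and column names are
only illustrative):

-- build the summary once from the raw data
CREATE TABLE invoice_summary AS
  SELECT area, type, month, SUM(amount) AS amount, COUNT(*) AS count
  FROM client AS c, invoice AS i
  WHERE c.id = i.client
  GROUP BY area, type, month;

-- later analyses hit the small summary, not the 120M-row invoice table
SELECT area, SUM(amount) AS total
FROM invoice_summary
GROUP BY area;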

Darren


Re: suggestions to improve postgresql suitability for data-mining

From: "Nigel J. Andrews"
On Tue, 22 Jul 2003, Fabien COELHO wrote:

> ...
>
> III) PL/pgSQL
> -------------
> 
> Ok, if postgresql does not want to do it my way, let us make it do it.
> Thus I wrote some PL/pgSQL function for my purpose, something like:
> 
> CREATE TEMPORARY TABLE tmp (
>   area INTEGER,
>   type INTEGER,
>   month INTEGER,
>   amount INTEGER,
>   count INTEGER,
>   PRIMARY KEY(area, type, month)
> );
> -- initialize tmp
> FOR i IN 0..99 LOOP FOR j IN 0..3 LOOP FOR k IN 0..11 LOOP
>   INSERT INTO tmp VALUES(i,j,k,0,0);
> END all LOOPs;
> -- fill tmp
> FOR tuple IN
>   SELECT area, type, month, amount FROM client, invoice WHERE id=client
> LOOP
>   UPDATE tmp SET amount=amount+tuple.amount, count=count+1
>     WHERE area=tuple.area AND type=tuple.type AND month=tuple.month
> END LOOP;
> ...
> 
> It is very SLOOOOOOOOOOOWWWWWWWWW... 10 to 100 times slower than the
> previous one. Exit PL/pgSQL.

It will be. First, you're doing the same join that generates the large result
set you were complaining about in the plain SQL example, and then you're
looping over it, generating an update (effectively a delete/insert under
MVCC) for every tuple in that result set.

> 
> IV) Basic client side (JDBC, DBI, libpq)
> ----------------------------------------
> 
> Then I wrote the same stuff on the client side in Java with JDBC, Perl
> with DBI and C with libpq, by browsing the above SELECT in a simple
> loop and aggregating the data directly in the language. In all 3
> cases, the process attempts to allocate the full result of the client
> and invoice join in memory... a **very** bad idea indeed!

But what about doing that in the server?
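
For instance, a single INSERT ... SELECT lets the server do the whole
aggregation in one pass (a sketch; it skips the zero-row initialization
and only inserts the groups that actually occur):

INSERT INTO tmp (area, type, month, amount, count)
  SELECT area, type, month, SUM(amount), COUNT(*)
  FROM client, invoice
  WHERE id = client
  GROUP BY area, type, month;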


> I checked that the postgres client-server protocol does not allow
> chunking the result of a select, as only one response is sent for one
> query.
> 
> I suggest that this behavior should be changed, as the ODBC/DBI/JDBC
> interfaces are designed to allow the client to process data as they
> come out of the database, even if the query is not finished yet.
> 
> The library should do the chunking on its own automatically, either by
> doing CURSOR/FETCHes internally in the library implementation of
> SELECT, or by changing the protocol so that results are sent in chunks
> when required.
> 
> This is listed in the todo list of the JDBC interface, but there is
> nothing about the Perl interface or the libpq interface.
> 
> 
> V) Less basic client side (DBI, libpq)
> --------------------------------------
> 
> I've redone the previous stuff, but with an explicit CURSOR and a
> FETCH loop. It worked better, but it is still slow and still requires
> a lot of disk space. Indeed, the database seems to first generate the
> join in a temporary table on disk (I need twice as much disk space
> available as the original database), which is then sent back to the
> client. Thus I pay a read/write/read of the whole tables, although
> I had hoped that reading the data only once would have been enough.
> 
> I would suggest that processing data on the fly be done really
> on the fly, without intermediate storage, rather than providing just
> an on-the-fly interface without the real thing behind it. I haven't
> seen any item in the todo list about this issue. I'm not sure it is
> really easy to implement.

I thought it was necessary for the result set to be generated before any
data can be returned, in the general case and in your GROUP BY example
specifically. The latter if only because, if you're not using the hash
aggregates, the sort is required, and that of course requires all the
result data to be known.


> 
> Conclusion
> ----------
> 
> I have not succeeded in getting from postgresql the performance
> I was expecting for data-mining.
> 
> I could get them if postgresql could be improved on some or all
> of the following items:
> 
> (1) the execution engine may aggregate grouped data without a sort in
>     some cases.

As others have said, this is in 7.4.

> (2) the PL/pgSQL interpreter would be a great deal faster.

It did what you told it to do.

> 
> (3) the client programming interfaces would provide a real on-the-fly
>     (without intermediate storage) fetching mechanism.
> 
> (4) Also, I noticed that temporary tables/indexes created by postgresql
>     when processing a request are stored in the same partition as the
>     database in use. What about "/tmp" or other partitions? Maybe
>     a set of other directories could be designated for this purpose?
> 
> Hope this helps... at least to add new items to the postgresql todo list ;-)
> 
> Have a nice day,
> 
> 

-- 
Nigel J. Andrews



Re: suggestions to improve postgresql suitability for data-mining

From: "Christopher Kings-Lynne"
> II) SQL
> -------
>
> The first idea is to ask SQL to do the job with a 'group by' clause:
>
> SELECT area, type, month, SUM(amount), COUNT(*)
> FROM client AS c, invoice AS i
> WHERE c.id=i.client
> GROUP BY area, type, month;
>
> As I am just interested in reading the data, without any transaction, I
> tuned the database parameters a little (fsync=false, larger
> shared_buffers and sort_mem).
>
> It works, but it is quite slow and it requires a lot of disk space.
> Indeed, the result of the join is big, and the aggregation seems to
> require an external sort step so as to sum up data one group after the
> other.
>
> As the resulting table is very small, I wish the optimizer had skipped
> the sort phase, so as to aggregate the data as they come out of the
> join. All could be done on the fly without much additional storage
> (well, with some implementation effort). Maybe this is the "hash
> evaluation of group by aggregates" item listed in the todo list.

As of 7.4 CVS it does do this.  You will find this much faster in the
7.4 release.

Chris