Thread: Statistical Analysis

Statistical Analysis

From
"Nathan Barnett"
Date:
I have to run a large data analysis query fairly frequently and the
execution time is not acceptable, so I was looking at taking a statistical
sample of the data to get fairly accurate results.  Is there a way to
perform a query on a set number of random rows instead of the whole dataset?
I have looked through the documentation for a function that would do this,
but I haven't found one.  If this is an RTFM type question, then feel free to
tell me so and point me in the right direction, because I just haven't been
able to find any information on it.

Thanks ahead of time.


---------------
Nathan Barnett


Re: Statistical Analysis

From
"Mitch Vincent"
Date:
I don't think it's random (well, I'm sure it's not), but you could use LIMIT
to get a smaller number of results...

*shrug* just an idea.

Good luck!

-Mitch
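
As a minimal sketch of that suggestion (the table name is made up), plain
LIMIT on its own looks like this; note that it returns the first rows the
scan happens to produce, not a random sample:

    -- First 1000 rows in whatever order the scan returns them.
    SELECT * FROM mytable LIMIT 1000;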


Re: Statistical Analysis

From
"Stephan Szabo"
Date:
Are you grabbing a set of rows to work on in an outside app?

You may be able to get a smaller random set with:
select <whatever> from <table> order by random() limit <number>
But this will pretty much force a sort step [and, if you're not limiting the
rows with a where clause, probably a full sequential scan] and could
be very expensive depending on the number of matching rows for any
limiting clauses you have.  You'd have to play with it in practice to see
if it's any faster.
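
A concrete form of that suggestion, with made-up table and column names,
would be:

    -- A random sample of 1000 rows, at the cost of sorting every
    -- matching row by a random key.
    SELECT customer_id, amount
    FROM   orders
    ORDER  BY random()
    LIMIT  1000;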


RE: Statistical Analysis

From
"Nathan Barnett"
Date:
Stephan,
    The sort is exactly what I'm trying to avoid.  I was using a GROUP BY to
grab all the data in the groups I needed, but the GROUP BY requires a sort,
and that was the bottleneck in the query.  I really just wanted to grab a
sample of all the rows in the table and then run the GROUP BY on that subset,
to avoid the overhead of sorting the whole table.  My query has no WHERE
clauses, so it has to work through all of the data being analyzed.  It then
aggregates the data into a table that is used by the realtime queries.  The
analysis must be able to run every hour.

----------------
Nathan Barnett
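
The hourly roll-up Nathan describes would be along these lines (purely a
sketch; the real table and column names aren't given in the thread):

    -- Aggregate the raw data into a summary table read by the realtime
    -- queries.  The GROUP BY is what forced the sort of the whole table
    -- on the servers of that era.
    INSERT INTO hourly_summary (grp, n_rows, total)
    SELECT grp, count(*), sum(amount)
    FROM   raw_events
    GROUP  BY grp;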


RE: Statistical Analysis

From
"Nathan Barnett"
Date:
Tim,
    Hmm... this might just work, because I could actually filter with
myrandfunc() < .2 and then do a LIMIT on it for 10% or whatnot.  That would
almost guarantee the exact number of rows.

-----------------
Nathan Barnett
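
Concretely, the combination Nathan is describing would be something like this
(myrandfunc() is the placeholder function from Tim's message quoted below; the
table name and numbers are made up):

    -- Random filter to thin the data, plus LIMIT to cap the sample size.
    SELECT grp, amount
    FROM   raw_events
    WHERE  myrandfunc() < 0.2
    LIMIT  100000;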


-----Original Message-----
From: keitt@ulysses.nceas.ucsb.edu
[mailto:keitt@ulysses.nceas.ucsb.edu]On Behalf Of Timothy H. Keitt
Sent: Monday, July 24, 2000 3:41 PM
To: Nathan Barnett
Subject: Re: [GENERAL] Statistical Analysis


You would need to add a pseudorandom number function to postgresql.  If
your function returns numbers on [0, 1), then you could do:

    select * from mytable where myrandfunc() < 0.1;

and get back (asymptotically) 10% of the rows.  If you want exactly n
randomly chosen rows, it's a bit more expensive computationally.

Another more involved approach would be to implement random cursors.
This would be great for bootstrapping analysis.

Tim


--
Timothy H. Keitt
National Center for Ecological Analysis and Synthesis
735 State Street, Suite 300, Santa Barbara, CA 93101
Phone: 805-892-2519, FAX: 805-892-2510
http://www.nceas.ucsb.edu/~keitt/
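
A wrapper of the sort Tim describes could be sketched like this on a
reasonably current PostgreSQL, where a user function can be declared VOLATILE
so it is re-evaluated for every row (an assumption about later releases, not
the 7.0-era server in this thread, where the messages below show random()
itself being folded into a constant):

    -- Hypothetical wrapper; on current PostgreSQL the built-in random()
    -- is already volatile and can be used in a WHERE clause directly.
    CREATE FUNCTION myrandfunc() RETURNS double precision
        LANGUAGE sql VOLATILE
        AS 'SELECT random()';

    -- Keeps roughly 10% of the rows; the fraction is approximate, not exact.
    SELECT * FROM mytable WHERE myrandfunc() < 0.1;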


Re: Statistical Analysis

From
Andrew McMillan
Date:
What about using random() in the WHERE clause?

Regards,
                Andrew.
--
_____________________________________________________________________
            Andrew McMillan, e-mail: Andrew@cat-it.co.nz
Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington
Me: +64 (21) 635 694, Fax: +64 (4) 499 5596, Office: +64 (4) 499 2267

Re: Statistical Analysis

From
"Timothy H. Keitt"
Date:
Oops.  Just tried that; the random() call only gets evaluated once.  What
you need is a column type "random" that calls random() each time it's
evaluated.

T.


--
Timothy H. Keitt
National Center for Ecological Analysis and Synthesis
735 State Street, Suite 300, Santa Barbara, CA 93101
Phone: 805-892-2519, FAX: 805-892-2510
http://www.nceas.ucsb.edu/~keitt/

Re: Statistical Analysis

From
Tom Lane
Date:
"Timothy H. Keitt" <keitt@nceas.ucsb.edu> writes:
> Oops.  Just tried that; the random() call only gets evaluated once.

Hmm, you're right.  That's a bug.  The system knows that random()
can give different results on each call, but there's one piece of
the planner that hasn't gotten the word :-(

            regards, tom lane

Re: Statistical Analysis

From
Tom Lane
Date:
"Timothy H. Keitt" <keitt@nceas.ucsb.edu> writes:
> Oops.  Just tried that; the random() call only gets evaluated once.

Sigh, forgot to mention the solution.  There's an undocumented function:

 * bool oidrand (oid o, int4 X)-
 *      takes in an oid and a int4 X, and will return 'true'
 *    about 1/X of the time.

typically used like this:

-- select roughly 1/10 of the tuples
SELECT * FROM onek WHERE oidrand(onek.oid, 10);

This doesn't get collapsed by the overly aggressive constant-qual
recognizer because it takes a table column as input.  (The function
doesn't actually *use* the OID, mind you, but the planner doesn't
know that.  What a kluge... but it gets the job done.)

Note that this isn't necessarily going to fix your performance problem,
since a scan of the whole input table is still going to be required.
But if the expensive processing was somewhere downstream of that basic
scan, it should help.

            regards, tom lane
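
Tying Tom's trick back to the hourly roll-up discussed earlier, the sampled
aggregation could look roughly like this (again with made-up table and column
names; oidrand() was an undocumented 7.x-era function and relies on the table
having OIDs):

    -- Aggregate roughly a 1-in-10 sample instead of the whole table.
    INSERT INTO hourly_summary (grp, n_rows, total)
    SELECT grp, count(*), sum(amount)
    FROM   raw_events
    WHERE  oidrand(raw_events.oid, 10)
    GROUP  BY grp;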

Re: Statistical Analysis

From
"Timothy H. Keitt"
Date:
Tom Lane wrote:
>
> Note that this isn't necessarily going to fix your performance problem,
> since a scan of the whole input table is still going to be required.
> But if the expensive processing was somewhere downstream of that basic
> scan, it should help.
>

The only way that I know of to do this fast is to insert the items in
random order.  I've done this (in C++, not postgres) using a map
(b-tree) data structure: insert key-value pairs with the key being a
random number; then pop entries off the tail of the map as needed.

Tim

--
Timothy H. Keitt
National Center for Ecological Analysis and Synthesis
735 State Street, Suite 300, Santa Barbara, CA 93101
Phone: 805-892-2519, FAX: 805-892-2510
http://www.nceas.ucsb.edu/~keitt/
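
In SQL terms, Tim's precompute-the-randomness idea might look something like
this sketch (hypothetical names; note the keys are assigned once, so the same
sample keeps coming back until they are refreshed):

    -- Store a random key with each row and index it, so sampling needs
    -- no sort at query time.
    ALTER TABLE raw_events ADD COLUMN rand_key double precision DEFAULT random();
    UPDATE raw_events SET rand_key = random();   -- assign keys to existing rows
    CREATE INDEX raw_events_rand_key_idx ON raw_events (rand_key);

    -- About 1000 rows in the precomputed random order, fetched via the index.
    SELECT * FROM raw_events ORDER BY rand_key LIMIT 1000;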

Re: Statistical Analysis

From
Steve Heaven
Date:
It doesn't seem to work as you explain.
For a value of 1 I'd expect (nearly) all the tuples, and 2 should return
half, but that's not what I'm finding.

galore=> select count(*) from topten where room='HI';
count
-----
   14
(1 row)

galore=> SELECT * FROM topten WHERE room='HE' and oidrand(topten.oid, 1);
type|data                     |norder|room  |grp
----+-------------------------+------+------+---
B   |0764552503               |     1|HE    |
B   |0751327190               |     1|HE    |
B   |0718144392               |      |HE    |
B   |0500280754               |      |HE    |
B   |0028610091               |     1|HE    |
(5 rows)
galore=> SELECT * FROM topten WHERE room='HE' and oidrand(topten.oid, 2);
type|data                     |norder|room  |grp
----+-------------------------+------+------+---
B   |0764552503               |     1|HE    |
B   |0751327190               |     1|HE    |
B   |0718144392               |      |HE    |
(3 rows)
galore=> SELECT * FROM topten WHERE room='HE' and oidrand(topten.oid, 7);
type|data                     |norder|room  |grp
----+-------------------------+------+------+---
B   |0751327190               |     1|HE    |
B   |0718144392               |      |HE    |
(2 rows)
--
thorNET  - Internet Consultancy, Services & Training
Phone: 01454 854413
Fax:   01454 854412
http://www.thornet.co.uk

Re: Statistical Analysis

From
Guillaume Perréal
Date:
Isn't it because oidrand evaluates as 'random() < 1/X', or maybe 'random() <
1/(X+1)'?

--
Guillaume Perréal - Stagiaire MIAG
Cemagref (URH), Lyon, France
Tél: (+33) 4.72.20.87.64