Thread: Re: Update table with random values from another table

Re: Update table with random values from another table

From
Rory Campbell-Lange
Date:
On 12/02/09, Rory Campbell-Lange (rory@campbell-lange.net) wrote:
> I have a test system for which I need to replace actual user's data (in
> 'users') with anonymised data from another table ('testnames') on
> postgres 8.3.
>
> The tricky aspect is that each row from testnames must be randomised to
> avoid each row in users being updated with the same value.
>
> I've been trying to avoid a correlated subquery for each column I am trying
> to update, and instead trying the following, which never returns. There are
> 2000 records in testnames and about 200 in users.

I'm obviously doing something badly wrong because:

    UPDATE
            users
    SET t_firstname =
                    (select firstname from testnames order by random() limit 1),
        t_surname =
                    (select lastname from testnames order by random() limit 1)
    WHERE
            n_role IN (2,3);

Doesn't return either after 60 seconds on a 8 core machine with 8GB of
RAM and 15K disks in R10 (no swap in use).

Rory

Re: Update table with random values from another table

From
Grzegorz Jaśkiewicz
Date:
On Thu, Feb 12, 2009 at 1:10 PM, Rory Campbell-Lange
<rory@campbell-lange.net> wrote:
>    UPDATE
>            users
>    SET t_firstname =
>                    (select firstname from testnames order by random() limit 1),
>        t_surname =
>                    (select lastname from testnames order by random() limit 1)
>    WHERE
>            n_role IN (2,3);
>
> Doesn't return either after 60 seconds on a 8 core machine with 8GB of
> RAM and 15K disks in R10 (no swap in use).

That would be because, for every row in users table, postgres has to
run two subselects, with order by random() (which in it self is quite
expensive).

How many roles it should update ? or otherwise: what's the return of
select count(*) from users where n_role IN (2,3) ?

try that sort of approach (modify it for your use):



test2=# create table foo(a int, b int);

.. insert some test data to foo(), and ziew(a) ...

test2=# update foo set a=n1.a , b=n2.a from (select
generate_series(1,100) id, a from ziew order by random() limit 100)
n1, (select generate_series(1,100) id, a from ziew order by random()
limit 100) n2;



--
GJ

Re: Update table with random values from another table

From
Grzegorz Jaśkiewicz
Date:
actually forget about that generate_series() in sub queries, I just
realized that it won't do.
I don't have too much time to analyse and find solution, but
essentially you need to do it like in my example.

See, that's where normalization would help a lot. Ie, having a
separate table for name, and surname - and than third one to connect
them into full name.

Re: Update table with random values from another table

From
Tom Lane
Date:
=?UTF-8?Q?Grzegorz_Ja=C5=9Bkiewicz?= <gryzman@gmail.com> writes:
> On Thu, Feb 12, 2009 at 1:10 PM, Rory Campbell-Lange
> <rory@campbell-lange.net> wrote:
>> UPDATE
>> users
>> SET t_firstname =
>> (select firstname from testnames order by random() limit 1),
>> t_surname =
>> (select lastname from testnames order by random() limit 1)
>> WHERE
>> n_role IN (2,3);
>>
>> Doesn't return either after 60 seconds on a 8 core machine with 8GB of
>> RAM and 15K disks in R10 (no swap in use).

> That would be because, for every row in users table, postgres has to
> run two subselects, with order by random() (which in it self is quite
> expensive).

Well, no, because those subselects are independent of the parent query;
I'd expect PG to do them just once.  Do they show up as "SubPlans" or
"InitPlans" in EXPLAIN?

            regards, tom lane

Re: Update table with random values from another table

From
Rory Campbell-Lange
Date:
Hi Grzegorz

Many thanks for your reply.

On 12/02/09, Grzegorz Ja??kiewicz (gryzman@gmail.com) wrote:
> actually forget about that generate_series() in sub queries, I just
> realized that it won't do.
> I don't have too much time to analyse and find solution, but
> essentially you need to do it like in my example.
>
> See, that's where normalization would help a lot. Ie, having a
> separate table for name, and surname - and than third one to connect
> them into full name.

I realise that for every row in my users table (which has a unique
integer field) I can update it if I construct a matching id field
against a random row from the testnames table.

Something like this:

    UPDATE
        users
    SET ....
    FROM
        (SELECT
            dynamic_id, firstname, lastname
        FROM
            testnames
         ORDER BY
             random()
         ) x
    WHERE
        users.id = x.id;

However I'm not sure how to generate a dynamic_id for testnames. If I
use generate_series() I get a full join, rather than

    1  firstname1 lastname1
    2  firstname2 lastname2

Rory

Re: Update table with random values from another table

From
Rory Campbell-Lange
Date:
On 12/02/09, Tom Lane (tgl@sss.pgh.pa.us) wrote:
> =?UTF-8?Q?Grzegorz_Ja=C5=9Bkiewicz?= <gryzman@gmail.com> writes:
> > On Thu, Feb 12, 2009 at 1:10 PM, Rory Campbell-Lange
> > <rory@campbell-lange.net> wrote:
> >> UPDATE
> >> users
> >> SET t_firstname =
> >> (select firstname from testnames order by random() limit 1),
> >> t_surname =
> >> (select lastname from testnames order by random() limit 1)
> >> WHERE
> >> n_role IN (2,3);
> >>
> >> Doesn't return either after 60 seconds on a 8 core machine with 8GB of
> >> RAM and 15K disks in R10 (no swap in use).
>
> > That would be because, for every row in users table, postgres has to
> > run two subselects, with order by random() (which in it self is quite
> > expensive).
>
> Well, no, because those subselects are independent of the parent query;
> I'd expect PG to do them just once.  Do they show up as "SubPlans" or
> "InitPlans" in EXPLAIN?

Hi Tom

I don't know what the problem was, but I restarted my psql session and the
query runs in 2.181 ms. The plan is an InitPlan. Apologies for the specious
post.

I'm still unable to work out how to update some columns in my user's table each
with a unique record from my testnames table :).

Rory


Re: Update table with random values from another table

From
Rory Campbell-Lange
Date:
On 12/02/09, Rory Campbell-Lange (rory@campbell-lange.net) wrote:
> I realise that for every row in my users table (which has a unique
> integer field) I can update it if I construct a matching id field
> against a random row from the testnames table.

I can make my join table pretty well by using the ranking procedures
outlined here: http://www.barik.net/archive/2006/04/30/162447/

    CREATE TEMPORARY SEQUENCE rank_seq;
    select nextval('rank_seq') AS id, firstname, lastname from testnames;

or

    SELECT
        firstname, lastname,
            (SELECT
                count(*)
            FROM
                testnames t2
            WHERE
                t2.firstname < t1.firstname) + 2 AS id
    FROM
        testnames t1
    ORDER BY
        id;

The second method skips some ids (probably because I haven't got an
integer column in testnames)? It looks like I will have to go for the
first procedure or write a function with a loop and counter.

Any other ideas?

Rory

Re: Update table with random values from another table

From
Sam Mason
Date:
On Thu, Feb 12, 2009 at 05:39:49PM +0000, Rory Campbell-Lange wrote:
> On 12/02/09, Rory Campbell-Lange (rory@campbell-lange.net) wrote:
> > I realise that for every row in my users table (which has a unique
> > integer field) I can update it if I construct a matching id field
> > against a random row from the testnames table.
>
> I can make my join table pretty well by using the ranking procedures
> outlined here: http://www.barik.net/archive/2006/04/30/162447/
>
>     CREATE TEMPORARY SEQUENCE rank_seq;
>     select nextval('rank_seq') AS id, firstname, lastname from testnames;
[...]
> Any other ideas?

The first is similar to the best I could come up with as well.  Your
problem is difficult to express in SQL because what you're trying to do
doesn't seem very relational in nature.  I'd do something like:

  BEGIN;
  ALTER TABLE users ADD COLUMN num SERIAL;
  CREATE TEMP SEQUENCE s1;
  UPDATE users u SET name = x.name
    FROM (
      SELECT name, nextval('s1') AS id
      FROM (
        SELECT name FROM testnames ORDER BY random() OFFSET 0) x) x
    WHERE u.id = x.id;
  ALTER TABLE users DROP COLUMN num;
  COMMIT;

If your existing unique integer field runs from 1 to a number less than
the number of testuser names then you won't need to add the "num" column
first.  The inner selects are about making sure that things are ordered
randomly before we assign a sequence value to them, not sure if it's
strictly needed but shouldn't hurt.

--
  Sam  http://samason.me.uk/