Thread: Batching up data into groups of n rows

Batching up data into groups of n rows

From: Andy Chambers
Hi,

I need to write a query that batches the rows of a table into
groups of n records.  I feel like this should be possible using the
existing window functions, but I can't figure out how to do it from
the examples.

So I have some table, let's say

create table addresses (
  line_1 text,
  line_2 text,
  city text,
  state text,
  zip text);

...and I want to select the data from that table, adding a "group_id"
column, and a "record_id" column.  The "group_id" would start at 1,
and increment by 1 every 100 rows, and the "record_id" would increment
by 1 every row, but restart at 1 for each new group_id.

Thanks,
Andy

Re: Batching up data into groups of n rows

From: Marti Raudsepp
On Mon, Sep 26, 2011 at 18:59, Andy Chambers <achambers@mcna.net> wrote:
> ...and I want to select the data from that table, adding a "group_id"
> column, and a "record_id" column.  The "group_id" would start at 1,
> and increment by 1 every 100 rows, and the "record_id" would increment
> by 1 every row, but restart at 1 for each new group_id.

I can't see why you would want this, but regardless...

Since you didn't list a primary key for the table, I'm using "ctid" as
a substitute. But that's a PostgreSQL-specific hack; normally you
would use the real primary key instead.

update addresses set group_id = (nr / 100) + 1, record_id = (nr % 100) + 1
  from (select ctid, row_number() over () - 1 as nr from addresses) as subq
  where subq.ctid = addresses.ctid;

This isn't going to be fast for a large table as it will effectively
make 3 passes over the table, but it will get the job done.

If you want the numbers to be ordered by certain columns, you'd use
row_number() OVER (ORDER BY col1, col2)
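
For example, to batch in (state, zip) order rather than in physical
order, the same query becomes:

update addresses set group_id = (nr / 100) + 1, record_id = (nr % 100) + 1
  from (select ctid,
               -- numbering now follows the sort order, not the heap order
               row_number() over (order by state, zip) - 1 as nr
          from addresses) as subq
  where subq.ctid = addresses.ctid;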

Regards,
Marti

Re: Batching up data into groups of n rows

From: Andy Chambers
On Mon, Sep 26, 2011 at 4:22 PM, Marti Raudsepp <marti@juffo.org> wrote:
> On Mon, Sep 26, 2011 at 18:59, Andy Chambers <achambers@mcna.net> wrote:
>> ...and I want to select the data from that table, adding a "group_id"
>> column, and a "record_id" column.  The "group_id" would start at 1,
>> and increment by 1 every 100 rows, and the "record_id" would increment
>> by 1 every row, but restart at 1 for each new group_id.
>
> I can't see why you would want this, but regardless...

The addresses need to be sent to a third-party web service for
canonicalization.  The web service accepts batches of up to 100
addresses.  I was wondering how I'd get Postgres to generate the XML
for sending 100 addresses at a time to this web service.
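
I'm imagining something like the following (a rough, untested sketch;
the "batch" and "address" element names are placeholders, since the
real service will dictate its own format):

-- one row of XML per batch of 100 addresses
select group_id,
       xmlelement(name batch,
                  xmlagg(xmlelement(name address,
                                    xmlforest(line_1, line_2, city,
                                              state, zip)))) as batch_xml
  from (select a.*,
               (row_number() over () - 1) / 100 + 1 as group_id
          from addresses a) as numbered
 group by group_id;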

> Since you didn't list a primary key for the table, I'm using "ctid" as
> a substitute. But that's a PostgreSQL-specific hack, normally you
> would use the real primary key instead.
>
> update addresses set group_id = (nr / 100) + 1, record_id = (nr % 100) + 1
>   from (select ctid, row_number() over () - 1 as nr from addresses) as subq
>   where subq.ctid = addresses.ctid;

Cool!  I don't actually need to store these ids in the database; they
just need to be generated on the fly and forgotten.  But I think I can
adapt the example to do what I need.
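
Something like this, I think (untested, just adapting your subquery
into a plain select):

select (row_number() over () - 1) / 100 + 1 as group_id,  -- new group every 100 rows
       (row_number() over () - 1) % 100 + 1 as record_id, -- restarts at 1 per group
       a.*
  from addresses a;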

Thanks,
Andy