Thread: Having a plpgsql function return multiple rows that indicate its progress in a cursor like fashion

Hello,

At the moment my pg client application (running 8.4) transfers data
from several remote DBs (that run 8.3) via dblink, using cursors where
appropriate, and reporting back progress to users using a progress bar
and brief messages.

I thought it would be great to change my code to have all this done
within a single plpgsql function, and have all the benefits that that
brings.

It's easy enough to do all this, by writing a function that RETURNS
TABLE(progress integer, message text), and RETURNing NEXT when time
comes to display a new message or increment the progress bar. However,
that approach has the considerable drawback of not actually returning
any rows until it finally returns all of them. I'm not willing to give
up giving the user those messages and having their progress bar
updated in real-time though. I would like to have the function behave
as a cursor, and return one row at a time when control reaches each
RETURN NEXT statement.

Is it possible to somehow achieve what I've described, perhaps by
doing something with a function that returns refcursor?

Thanks,
Peter Geoghegan

On 16 Feb 2010, at 1:04, Peter Geoghegan wrote:

> Hello,
>
> At the moment my pg client application (running 8.4) transfers data
> from several remote DBs (that run 8.3) via dblink, using cursors where
> appropriate, and reporting back progress to users using a progress bar
> and brief messages.
>
> I thought it would be great to change my code to have all this done
> within a single plpgsql function, and have all the benefits that that
> brings.
>
> It's easy enough to do all this, by writing a function that RETURNS
> TABLE(progress integer, message text), and RETURNing NEXT when time
> comes to display a new message or increment the progress bar. However,
> that approach has the considerable drawback of not actually returning
> any rows until it finally returns all of them. I'm not willing to give
> up giving the user those messages and having their progress bar
> updated in real-time though. I would like to have the function behave
> as a cursor, and return one row at a time when control reaches each
> RETURN NEXT statement.


I'd think RETURN NEXT would behave the way you want it to. There's probably something in your function causing your
functionto behave like it does now. I suspect the problem lies in the way you determine how far you've progressed, but
youdidn't tell us anything about your function, so I'm just guessing. 

I don't know much about dblink, so it is possible your problem is related to that. I imagine it may batch "small"
resultsets and send them over all at once to reduce traffic. Do you see this problem with larger result sets (say >10k
rows)?

Alban Hertroys

--
Screwing up is the best way to attach something to the ceiling.


!DSPAM:737,4b7a5d9710441627593049!



2010/2/16 Alban Hertroys <dalroi@solfertje.student.utwente.nl>:
> On 16 Feb 2010, at 1:04, Peter Geoghegan wrote:
>
>> Hello,
>>
>> At the moment my pg client application (running 8.4) transfers data
>> from several remote DBs (that run 8.3) via dblink, using cursors where
>> appropriate, and reporting back progress to users using a progress bar
>> and brief messages.
>>
>> I thought it would be great to change my code to have all this done
>> within a single plpgsql function, and have all the benefits that that
>> brings.
>>
>> It's easy enough to do all this, by writing a function that RETURNS
>> TABLE(progress integer, message text), and RETURNing NEXT when time
>> comes to display a new message or increment the progress bar. However,
>> that approach has the considerable drawback of not actually returning
>> any rows until it finally returns all of them. I'm not willing to give
>> up giving the user those messages and having their progress bar
>> updated in real-time though. I would like to have the function behave
>> as a cursor, and return one row at a time when control reaches each
>> RETURN NEXT statement.
>
>
> I'd think RETURN NEXT would behave the way you want it to. There's probably something in your function causing your
functionto behave like it does now. I suspect the problem lies in the way you determine how far you've progressed, but
youdidn't tell us anything about your function, so I'm just guessing. 
>
> I don't know much about dblink, so it is possible your problem is related to that. I imagine it may batch "small"
resultsets and send them over all at once to reduce traffic. Do you see this problem with larger result sets (say >10k
rows)?

if you have a large dataset (more than 1 M rows), then better is C
coding. PLpgSQL (return next, return query) push result to memory, and
it can be a problem.

Regards
Pavel Stehule

>
> Alban Hertroys
>
> --
> Screwing up is the best way to attach something to the ceiling.
>
>
> !DSPAM:737,4b7a5d9710441627593049!
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

> I'd think RETURN NEXT would behave the way you want it to. There's probably something in your function causing your
functionto behave like it does now. I suspect the problem lies in the way you determine how far you've progressed, but
youdidn't tell us anything about your function, so I'm just guessing. 
>

Why do you think that? At the moment, the function merely sends two
messages along the lines of "beginning downloading...",  "connecting
to first db...", before connecting to the first remote DB, which, in
my perfunctory testing was unavailable. The connection blocks, throws
an exception, is handled in an EXECPTION block (by once again sending
a message in the usual way). However, I get all 3 messages at once,
only when the dblink function finishes blocking and throws its
exception, which takes about 7 seconds because I like to use a timeout
in my connection string.

> I don't know much about dblink, so it is possible your problem is related to that. I imagine it may batch "small"
resultsets and send them over all at once to reduce traffic. Do you see this problem with larger result sets (say >10k
rows)?
>

I'll attempt to devise a useful test, but I'm a little doubtful that
I'll get anywhere with my current approach, given that my initial,
very simple test failed.

Regards,
Peter Geoghegan

On 16 Feb 2010, at 10:34, Peter Geoghegan wrote:

>> I'd think RETURN NEXT would behave the way you want it to. There's probably something in your function causing your
functionto behave like it does now. I suspect the problem lies in the way you determine how far you've progressed, but
youdidn't tell us anything about your function, so I'm just guessing. 
>>
>
> Why do you think that? At the moment, the function merely sends two
> messages along the lines of "beginning downloading...",  "connecting
> to first db...", before connecting to the first remote DB, which, in
> my perfunctory testing was unavailable. The connection blocks, throws
> an exception, is handled in an EXECPTION block (by once again sending
> a message in the usual way). However, I get all 3 messages at once,
> only when the dblink function finishes blocking and throws its
> exception, which takes about 7 seconds because I like to use a timeout
> in my connection string.

I assumed you were generating the progress indicator from query results in a remote DB. It turns out that's entirely
notwhat you're doing, but how were we supposed to know that? 

You say you want your function to behave like a cursor.
A function can't behave like a cursor, just as a select can't; both result in a result set. What RETURN NEXT and RETURN
QUERYdo is make it possible to access the function results row by row. You still need a cursor to get the behaviour of
acursor out of that. 

If that doesn't answer your question, I'm pretty sure that to help you with this problem people will need quite a bit
moreinformation than you're giving us. What are you doing? What do you see vs. what do you expect? Some examples would
help.

>> I don't know much about dblink, so it is possible your problem is related to that. I imagine it may batch "small"
resultsets and send them over all at once to reduce traffic. Do you see this problem with larger result sets (say >10k
rows)?
>>
>
> I'll attempt to devise a useful test, but I'm a little doubtful that
> I'll get anywhere with my current approach, given that my initial,
> very simple test failed.


From your description it turns out dblink isn't involved yet, so it can't cause the issue you were asking about.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4b7a71b410441921015512!



> I assumed you were generating the progress indicator from query results in a remote DB. It turns out that's entirely
notwhat you're doing, but how were we supposed to know that? 
>

Well, it made sense to leave dblink mostly out of things (it's a
useful way to get a function to block though) until I'd figured out if
it was feasible to do this with a function that RETURNS TABLE(...).
Occam's razor and all that. It wasn't. I was trying to give background
information, an indication of my intent - there may have been an
entirely different approach that had not yet occurred to me, so that
information may have been pertinent (although probably not - the fact
that dblink is involved in incrementing the function probably doesn't
matter at all. It probably might as well be anything else for our
purposes).

> You say you want your function to behave like a cursor.
> A function can't behave like a cursor, just as a select can't; both result in a result set. What RETURN NEXT and
RETURNQUERY do is make it possible to access the function results row by row. You still need a cursor to get the
behaviourof a cursor out of that. 

I understand the distinction (after all, I suggested that the solution
was likely to involve returning a refcursor in my original post) -
let's not get bogged down in semantics though. I guess my question
boils down to: can I return a cursor, but not to return the result of
a single select, but of multiple different selects in succession, to
report progress as described, or, alternatively, do something that
will produce similar results? Here's my admittedly very rough stab at
this, using a function that RETURNS TABLE, an approach that evidently
doesn't work (which is not to say that I ever had a reasonable
expectation of this working. In fact, I was almost certain it wouldn't
work, but I needed to start somewhere):

CREATE OR REPLACE FUNCTION download_sales(download_date_arg date,
rem_arg integer[]) RETURNS TABLE (progress integer, message text) AS
$BODY$
DECLARE
    cur_progress integer DEFAULT 0;
    tup rems%rowtype;
BEGIN
    progress := cur_progress;
    cur_progress := cur_progress + 1;
    message := 'Beginning downloading sales...';
    RETURN NEXT;


    FOR tup IN SELECT * FROM rems WHERE is_active AND  id = ANY(rem_arg)
    LOOP
        DECLARE
            conn_str text;
            query_str text;
        BEGIN
            -- connection will timeout after 7 seconds.
            conn_str = 'hostaddr=' || tup.ip || ' port=' || tup.port ||
'dbname=remote_db user=' || tup.username || ' password=' ||
tup.password || ' connect_timeout=7';
            -- open persistent connection to rem DB
            progress := -1;
            cur_progress := 0;
            message := 'Connecting to rem ''' || tup.description || '''';
            RETURN NEXT;
            SELECT dblink_connect(conn_str);
                        -- TODO: Actually transfer data. INSERT it
into local tables, and indicate progress to user as above
                        -- The remote DB stores how many tuples there
are for the day, so I don't have to do a count(*) first
                        -- we send a magic number (say -1), which
indicates that number of tuples follows, then we send number of tuples
                        -- then we send which tuple we're currently
on, for each and every remote DB
        EXCEPTION
            WHEN CONNECTION_EXCEPTION THEN
                message := 'Could not connect to rem ''' || tup.description || '''';
                RETURN NEXT;
                RETURN;

        END;
    END LOOP;

    RETURN;

END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE SECURITY DEFINER;

As I've already said, the problem with this approach is that I see all
3 messages at once, when the CONNECTION_EXCEPTION is thrown and we
finally RETURN, after about 7 seconds (which is undoubtedly how
RETURNS TABLE is documented to behave). I want (although, as I've
said, don't expect) to see the first two messages immediately, and
only the third when the connection fails, so I know what's happening
in real-time.

Regards,
Peter Geoghegan

On 16 Feb 2010, at 12:35, Peter Geoghegan wrote:
> As I've already said, the problem with this approach is that I see all
> 3 messages at once, when the CONNECTION_EXCEPTION is thrown and we
> finally RETURN, after about 7 seconds (which is undoubtedly how
> RETURNS TABLE is documented to behave). I want (although, as I've
> said, don't expect) to see the first two messages immediately, and
> only the third when the connection fails, so I know what's happening
> in real-time.


It seems you're right, I built a simple test-case (see attachment) using timeofday(). The numbers from fetching from a
cursorover the set-returning function run away from the selects that directly call timeofday() in between. 
In my case I pause the _client_ between calls, but the results are the same.
Peculiar...

I'm running PostgreSQL 8.4.1 on i386-apple-darwin10.0.0, compiled by GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1
(AppleInc. build 5646), 64-bit 



!DSPAM:737,4b7a925f10448503891907!

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.



!DSPAM:737,4b7a925f10448503891907!

Attachment
> It seems you're right, I built a simple test-case (see attachment) using timeofday(). The numbers from fetching from
acursor over the set-returning function run away from the selects that directly call timeofday() in between. 
> In my case I pause the _client_ between calls, but the results are the same.
> Peculiar...
>

Cursors simply address the problem of "impedance mismatch" (the use of
that term in this way probably pre-dates ORM, so please don't tell me
that that term refers exclusively to an ORM problem within an RDBMS
context). You don't have to fetch the result set all at once where
that is impractical. However, the entire result set is available on
the server from the first fetch.

What I want to do is to fetch the beginning of the result set before
the end has even been determined, with the fetching potentially
blocking in my app to let the plpgsql function catch up and generate a
row that didn't even exist on the server before, and for a time after
the most recent fetch. That is fundamentally different to how cursors
are typically used, and I hope that there is a way to do it.

Thanks,
Peter Geoghegan

Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> Cursors simply address the problem of "impedance mismatch" (the use of
> that term in this way probably pre-dates ORM, so please don't tell me
> that that term refers exclusively to an ORM problem within an RDBMS
> context). You don't have to fetch the result set all at once where
> that is impractical. However, the entire result set is available on
> the server from the first fetch.

The above statements are not true as far as the cursor mechanism is
concerned (at least not in Postgres).  However, they are true as far as
the result of a plpgsql function returning set is concerned.  The
function runs to completion, dumping RETURN NEXT outputs into a
"tuplestore", and then returns the tuplestore as its result.

It is possible to code set-returning functions in a suspend-and-resume
style, but none of the available PLs do that; you have to get down to
the C level.

            regards, tom lane

> It is possible to code set-returning functions in a suspend-and-resume
> style, but none of the available PLs do that; you have to get down to
> the C level.

Aren't my requirements sufficiently common to justify developing a
mechanism to report progress back to client applications during batch
operations and the like? This seems like a nice thing to have, that
would be broadly useful, perhaps as a contrib module. Obviously I have
no idea of the amount of effort developing such a feature would
entail.

Regards,
Peter Geoghegan

Peter Geoghegan escribió:
> > It is possible to code set-returning functions in a suspend-and-resume
> > style, but none of the available PLs do that; you have to get down to
> > the C level.
>
> Aren't my requirements sufficiently common to justify developing a
> mechanism to report progress back to client applications during batch
> operations and the like? This seems like a nice thing to have, that
> would be broadly useful, perhaps as a contrib module. Obviously I have
> no idea of the amount of effort developing such a feature would
> entail.

We've had several requests for progress reporting thingies of all kinds.
The amount of effort is not insignificant, which is probably why nothing
has gotten done yet ...

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

> We've had several requests for progress reporting thingies of all kinds.
> The amount of effort is not insignificant, which is probably why nothing
> has gotten done yet ...

What do you mean by "reporting thingies of all kinds"? It isn't as if
I'm asking for some esoteric feature, some regional variation of
reporting on batch operations. I'd like to be able to report things
back to client apps. That would perhaps take the form of returning a
message string, and an integer at each RETURN NEXT (or whatever
mechanism is decided upon). It would probably still be very useful to
a lot of people if the feature was completely inflexible, not allowing
you to vary the RETURN NEXT datatype, for example.

Can't someone think of a way of doing this as a neat adjunct to Postgres?

Thanks,
Peter Geoghegan

On 16/02/10 20:40, Peter Geoghegan wrote:
>
> Can't someone think of a way of doing this as a neat adjunct to Postgres?

Lots of people can think of ways of doing this. Not so many are prepared
to devote the time and effort to making it happen. In fact, by
definition, so far there have been zero such people.

Don't suppose you're volunteering? :-)

--
   Richard Huxton
   Archonet Ltd

> Don't suppose you're volunteering? :-)

I've already looked at the SRF documentation. It doesn't look all that
intimidating, and I might be willing to have a go if we can first
build some consensus on what this ought to look like. Can we?

Regards,
Peter Geoghegan

Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> Aren't my requirements sufficiently common to justify developing a
> mechanism to report progress back to client applications during batch
> operations and the like?

Have you experimented with RAISE NOTICE?  Using it this way is a bit of
a hack maybe, but I think you are entirely unaware of what would be
involved in producing something that's less of a hack.

            regards, tom lane

>> Aren't my requirements sufficiently common to justify developing a
>> mechanism to report progress back to client applications during batch
>> operations and the like?
>
> Have you experimented with RAISE NOTICE?  Using it this way is a bit of
> a hack maybe, but I think you are entirely unaware of what would be
> involved in producing something that's less of a hack.

That assessment is accurate. Should I take it that you aren't
particularly interested in pursuing something like this at present?

I'll see what I can come up with by playing with RAISE NOTICE

Thanks,
Peter Geoghegan

Tom Lane <tgl@sss.pgh.pa.us> writes:
> Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
>> Aren't my requirements sufficiently common to justify developing a
>> mechanism to report progress back to client applications during batch
>> operations and the like?
>
> Have you experimented with RAISE NOTICE?  Using it this way is a bit of
> a hack maybe, but I think you are entirely unaware of what would be
> involved in producing something that's less of a hack.

Would returning a refcursor then using fetch in the application be
another solution?

As far as hacking is concerned, I think it boils down to materialise
against value-per-call implementation, right? Not saying it's easy to
implement value-per-call support in plpgsql, but should the OP think
about what's involved, is that the track to follow?

  http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/utils/fmgr/README;hb=HEAD#l380
  http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=d9a319525591bc437e4770b4e796a7517844a784

The first link is the fmgr/README explaining the concepts, and the
second one is a recent enough patch dealing with materialise and
value-per-call in the context of SQL functions.

Regards,
--
dim

Hi dim,

> Would returning a refcursor then using fetch in the application be
> another solution?

I assume not, since nobody stepped forward and offered a way, even
though I suggested that returning a refcursor may be the way to go
(you'll recall that you suggested that to me in IRC - I'm sternocera
there. I believe we met in Paris too). As I said, "I guess my question
boils down to: can I return a cursor, but not to return the result of
a single select, but of multiple different selects in succession, to
report progress as described, or, alternatively, do something that
will produce similar results?"

I also said:

"Cursors simply address the problem of "impedance mismatch"...You
don't have to fetch the result set all at once where that is
impractical. However, the entire result set is available on the server
from the first fetch."

Tom contradicted this, but I believe he just meant that my statement
was technically inaccurate, and not that it was conceptually
inaccurate. My (perhaps incorrect) understanding is that once you open
a cursor, you cannot change that which will later be fetched from the
same cursor - What rows will be returned when everything is fetched is
determined when the cursor is opened. Also, I cannot very well open a
cursor twice, because, as the docs say, "the cursor cannot be open
already" when opening a cursor. So, unless I'm mistaken, I don't think
returning a refcursor helps me here.

> As far as hacking is concerned, I think it boils down to materialise
> against value-per-call implementation, right? Not saying it's easy to
> implement value-per-call support in plpgsql, but should the OP think
> about what's involved, is that the track to follow?
>
>  http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/utils/fmgr/README;hb=HEAD#l380
>  http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=d9a319525591bc437e4770b4e796a7517844a784
>
> The first link is the fmgr/README explaining the concepts, and the
> second one is a recent enough patch dealing with materialise and
> value-per-call in the context of SQL functions.

I'll investigate. To be perfectly frank, I am probably overextending
myself in doing so, because a) I am not familiar with the PostgreSQL
codebase and b) Tom's admonition about the likely difficulty of doing
this indicates that it is probably quite an involved task.

I think it would be useful to articulate, in broad strokes, what this
feature should look like, if not for my benefit, then for the benefit
of whoever will eventually implement it (because, given the
aspirations and momentum of the postgres community, and the obvious
utility of what I've described, I think it's inevitable that *someone*
will).

Regards,
Peter Geoghegan