Thread: More of a SQL question, I guess.

More of a SQL question, I guess.

From
Howard Eglowstein
Date:
I have a database with 150 fields per row. For historical reasons, it's
broken into three databases, data_a, data_b and data_c. One of the
fields is a timestamp and one of them is a serial number that came in
from a remote machine that reports its status every day.

If I use a command like 'SELECT MAX(Timestamp) FROM data_a WHERE field1
= '0001', I can get back the time of the latest report. I can then issue
the command 'SELECT * from data_a, data_b, data_c WHERE
data_a.id=data_b.id AND data_a.id=data_c.id and field1='0001' to get the
150 fields for that report. It works fine, but it takes a while to
respond when the database is the better part of a million records.

If it were just one serial number, the two queries would be okay because
It really only takes a minute. The problem I have is that we're hoping
to have thousands of machine in the field (this is a just a test
database) and clearly this approach won't work for thousands of serial
numbers. The second method I did was to simply use 'SE:ECT * from
data_a.....' to get all million records, and have my C code look for
serial numbers in each line and keep the latest by timestamp. That takes
about as long as doing the first procedure 3 times, but it gives me the
latest data for all of the serial numbers in the system. That's
perfectly cool, except that it won't scale nicely. If the web code that
does the search isn't on the same machine that holds the data, we'll
have to ship gigabytes of data over the network for each search.

What I'd *like* is something that uses groups and MAX() to do this in
one SQL command. It would group the data by the serial number (field1),
find the record in each group with the maximum timestamp value and
return all 150 fields. Is this possible?  I thought of using unions, but
I think I have to issue a pretty long command for each group and the PG
buffers probably will max out after some relatively small number, no?

Any thoughts would be appreciated.

Howard

Re: More of a SQL question, I guess.

From
Frank Bax
Date:
Howard Eglowstein wrote:
> I have a database with 150 fields per row. For historical reasons, it's
> broken into three databases, data_a, data_b and data_c. One of the
> fields is a timestamp and one of them is a serial number that came in
> from a remote machine that reports its status every day.
>
> If I use a command like 'SELECT MAX(Timestamp) FROM data_a WHERE field1
> = '0001', I can get back the time of the latest report. I can then issue
> the command 'SELECT * from data_a, data_b, data_c WHERE
> data_a.id=data_b.id AND data_a.id=data_c.id and field1='0001' to get the
> 150 fields for that report. It works fine, but it takes a while to
> respond when the database is the better part of a million records.
>
> If it were just one serial number, the two queries would be okay because
> It really only takes a minute. The problem I have is that we're hoping
> to have thousands of machine in the field (this is a just a test
> database) and clearly this approach won't work for thousands of serial
> numbers. The second method I did was to simply use 'SE:ECT * from
> data_a.....' to get all million records, and have my C code look for
> serial numbers in each line and keep the latest by timestamp. That takes
> about as long as doing the first procedure 3 times, but it gives me the
> latest data for all of the serial numbers in the system. That's
> perfectly cool, except that it won't scale nicely. If the web code that
> does the search isn't on the same machine that holds the data, we'll
> have to ship gigabytes of data over the network for each search.
>
> What I'd *like* is something that uses groups and MAX() to do this in
> one SQL command. It would group the data by the serial number (field1),
> find the record in each group with the maximum timestamp value and
> return all 150 fields. Is this possible?  I thought of using unions, but
> I think I have to issue a pretty long command for each group and the PG
> buffers probably will max out after some relatively small number, no?
>
> Any thoughts would be appreciated.


Start with

SELECT field1,MAX(timestamp) as ts FROM data_a GROUP BY field1

to get a timestamp for each value of field1; then use self join to get
the serial number from these rows.  At this point you have a problem if
there are any duplicate timestamp values.

SELECT id, a.f1, ts from data_a join (select f,max(timestamp) as ts from
data_a group by f1) as a on a.ts=data_a.timestamp;

Call this select SQL1 and join itto your longer sql statement:

SELECT * from data_a, data_b, data_c
JOIN (SQL1) as aa on aa.id = data_a.id
WHERE data_a.id=data_b.id AND data_a.id=data_c.id



It seems to me the underlying question here comes up often.  Might I
suggest that we add a new question to section 4 of FAQ:
Q: How can I retrieve other fields in same row as result returned by an
aggregate function such as max()?
A:  Is my "SQL1" the best/only answer to this question?


Re: More of a SQL question, I guess.

From
Tom Lane
Date:
Frank Bax <fbax@sympatico.ca> writes:
> It seems to me the underlying question here comes up often.  Might I
> suggest that we add a new question to section 4 of FAQ:
> Q: How can I retrieve other fields in same row as result returned by an
> aggregate function such as max()?
> A:  Is my "SQL1" the best/only answer to this question?

DISTINCT ON is also a commonly suggested answer, viz

SELECT DISTINCT ON (f1) * FROM data_a ORDER BY f1, timestamp DESC;

One point about your SQL1 is that if there are multiple rows sharing
the max timestamp, the join will return all of them.  This might or
might not be what's wanted.  The DISTINCT ON method gives the other
answer: you get just one row per f1 value.  As given it'd be a random
one of the max-timestamp rows, but you could add additional ORDER BY
columns to prioritize them.

            regards, tom lane