Re: More of a SQL question, I guess. - Mailing list pgsql-novice

From Frank Bax
Subject Re: More of a SQL question, I guess.
Date
Msg-id 46C1BDEA.6050209@sympatico.ca
Whole thread Raw
In response to More of a SQL question, I guess.  (Howard Eglowstein <howard@yankeescientific.com>)
Responses Re: More of a SQL question, I guess.  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-novice
Howard Eglowstein wrote:
> I have a database with 150 fields per row. For historical reasons, it's
> broken into three databases, data_a, data_b and data_c. One of the
> fields is a timestamp and one of them is a serial number that came in
> from a remote machine that reports its status every day.
>
> If I use a command like 'SELECT MAX(Timestamp) FROM data_a WHERE field1
> = '0001', I can get back the time of the latest report. I can then issue
> the command 'SELECT * from data_a, data_b, data_c WHERE
> data_a.id=data_b.id AND data_a.id=data_c.id and field1='0001' to get the
> 150 fields for that report. It works fine, but it takes a while to
> respond when the database is the better part of a million records.
>
> If it were just one serial number, the two queries would be okay because
> It really only takes a minute. The problem I have is that we're hoping
> to have thousands of machine in the field (this is a just a test
> database) and clearly this approach won't work for thousands of serial
> numbers. The second method I did was to simply use 'SE:ECT * from
> data_a.....' to get all million records, and have my C code look for
> serial numbers in each line and keep the latest by timestamp. That takes
> about as long as doing the first procedure 3 times, but it gives me the
> latest data for all of the serial numbers in the system. That's
> perfectly cool, except that it won't scale nicely. If the web code that
> does the search isn't on the same machine that holds the data, we'll
> have to ship gigabytes of data over the network for each search.
>
> What I'd *like* is something that uses groups and MAX() to do this in
> one SQL command. It would group the data by the serial number (field1),
> find the record in each group with the maximum timestamp value and
> return all 150 fields. Is this possible?  I thought of using unions, but
> I think I have to issue a pretty long command for each group and the PG
> buffers probably will max out after some relatively small number, no?
>
> Any thoughts would be appreciated.


Start with

SELECT field1,MAX(timestamp) as ts FROM data_a GROUP BY field1

to get a timestamp for each value of field1; then use self join to get
the serial number from these rows.  At this point you have a problem if
there are any duplicate timestamp values.

SELECT id, a.f1, ts from data_a join (select f,max(timestamp) as ts from
data_a group by f1) as a on a.ts=data_a.timestamp;

Call this select SQL1 and join itto your longer sql statement:

SELECT * from data_a, data_b, data_c
JOIN (SQL1) as aa on aa.id = data_a.id
WHERE data_a.id=data_b.id AND data_a.id=data_c.id



It seems to me the underlying question here comes up often.  Might I
suggest that we add a new question to section 4 of FAQ:
Q: How can I retrieve other fields in same row as result returned by an
aggregate function such as max()?
A:  Is my "SQL1" the best/only answer to this question?


pgsql-novice by date:

Previous
From: "Patrick Lindeman"
Date:
Subject: Re: Postmaster start up problems (can't create lock file )
Next
From: "Jasbinder Singh Bali"
Date:
Subject: Re: Postmaster start up problems (can't create lock file )