Re: median query causes disk to fill up

From: "richyen3@gmail.com"
Date:
Sorry, I forgot to also mention that I am running this on a machine
with 80GB free disk space, and 1GB RAM (but I wouldn't think that this
would be the problem, would it?)

--Richard



On Apr 13, 9:25 am, "richy...@gmail.com" <richy...@gmail.com> wrote:
> Hi,
>
> I'm trying to run a query to find the median value, organized by
> date.  However, when I run the query, it runs for about 4 hours, and
> then quits with the following message:
>
> > ERROR:  could not write block 10447102 of temporary file: No space left on device
> > HINT:  Perhaps out of disk space?
>
> My table has 512327 rows and is defined as follows:
>
> >                 Table "public.m_uop_times"
> >      Column     |            Type             | Modifiers
> > ----------------+-----------------------------+-----------
> >  objectid       | integer                     |
> >  date_part      | double precision            |
> >  date_start     | timestamp without time zone |
> >  date_processed | timestamp without time zone |
> >  gen_time       | integer                     |
> > Indexes:
> >     "m_uop_date_idx" btree (date_processed)
> >     "m_uop_epoch_idx" btree (date_part)
> >     "m_uop_gen_idx" btree (gen_time)
> >     "m_uop_objectid_idx" btree (objectid)
> >     "m_uop_start_idx" btree (date_start)
>
> The date_part column is actually simply "EXTRACT (EPOCH FROM
> date_start::date)" so that I could put an index on that date, and the
> gen_time column is actually "date_processed-date_start" so that there
> could be an index on that difference as well.
>
> My median query is copied from http://book.itzero.com/read/others/0602/OReilly.SQL.Cookbook.Dec.2005...
>
> Here it is:
>
> > select date_start, avg(gen_time)
> >   from (
> > select a.date_start::date, a.gen_time
> >   from m_uop_times a, m_uop_times b
> >  where a.date_part = b.date_part
> >  group by a.date_start::date, a.gen_time
> > having sum(case when a.gen_time=b.gen_time then 1 else 0 end)
> >                           >= abs(sum(sign(a.gen_time - b.gen_time)))
> >        ) as foo
> > group by date_start;
>
> Basically, I want to find the median gen_time for each day.
>
> Would anyone know a better way to do this, or have suggestions on how
> I can make this work without dying?
>
> Any help appreciated!
> --Richard



Re: median query causes disk to fill up

From: Alban Hertroys
Date:
richyen3@gmail.com wrote:
> Sorry, I forgot to also mention that I am running this on a machine
> with 80GB free disk space, and 1GB RAM (but I wouldn't think that this
> would be the problem, would it?)

Running out of disk space during a select sounds like the query ran out
of memory, so the sort spilled to temporary files on disk until those
filled it up. Maybe other applications on your machine are using up a
lot of memory, or maybe you didn't configure enough memory for postgres?
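If it's the per-sort memory, the setting to look at is work_mem in
postgresql.conf. The numbers below are purely illustrative for a 1GB
machine (parameter names as in current PostgreSQL releases), not
recommendations:

```
# postgresql.conf -- illustrative values only, for a machine with 1GB RAM
shared_buffers = 128MB   # buffer cache shared by all backends
work_mem = 16MB          # per-sort/per-hash memory before spilling
                         # to temporary files on disk
```

Note that a sort much bigger than work_mem will spill to disk
regardless; raising it only delays the point at which that happens.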

As for finding a median: if memory serves me correctly, that's the
middle value of an ordered set, right? I'm not sure which value to pick
if there's an even number of results, but something like the following
sounds like a logical approach to me:

SELECT a.value
  FROM tbl a, tbl b
 GROUP BY a.value
HAVING SUM(CASE WHEN a.value <= b.value THEN 1 ELSE 0 END)
     = SUM(CASE WHEN a.value >= b.value THEN 1 ELSE 0 END)

I suppose the AVG() in your query is there to solve the case where
there are two candidate values for the median. I don't know how fast
this is, or whether it even yields a correct answer; it's just an idea.
It probably makes sense to add ordering to both result sets, but I'm
kind of hoping the planner can handle that.
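Applied to the table from the original message (one median per day,
joining on the indexed date_part column just like the original query,
column names taken from the \d output above), that idea might look
something like this; completely untested:

```sql
SELECT date_start, AVG(gen_time) AS median_gen_time
  FROM (
    SELECT a.date_start::date AS date_start, a.gen_time
      FROM m_uop_times a, m_uop_times b
     WHERE a.date_part = b.date_part   -- restrict the self-join to one day
     GROUP BY a.date_start::date, a.gen_time
    -- a value is a median candidate when at least half of that day's
    -- values lie on each side of it
    HAVING SUM(CASE WHEN a.gen_time <= b.gen_time THEN 1 ELSE 0 END)
               >= COUNT(*) / 2.0
       AND SUM(CASE WHEN a.gen_time >= b.gen_time THEN 1 ELSE 0 END)
               >= COUNT(*) / 2.0
  ) AS candidates
 GROUP BY date_start;
```

Like the original query, the self-join is still restricted per day, so
the size of the intermediate result is the sum of the squares of the
per-day row counts; whether that fits in 80GB depends on how many rows
the busiest day has.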

Having an ordered index on those values would help a lot, of course,
but IIRC that's not yet possible...

And now I ran out of the limited amount of time I can spend on stuff
like this :P

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //