Thread: Finding bogus dates

Finding bogus dates

From
Scott Ribe
Date:
Suppose that I have a varchar column which contains dates, and some of them
are bogus, such as '1968-02-31'. Is it possible to specify a query condition
"cannot be cast to a valid date". (It's legacy data, and I'm trying to clean
it up before importing into the final schema.) There probably aren't that
many, and I can just keep hitting pg errors one row at a time until I work
through this, but...

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: Finding bogus dates

From
Tomas Vondra
Date:
> Suppose that I have a varchar column which contains dates, and some of them
> are bogus, such as '1968-02-31'. Is it possible to specify a query condition
> "cannot be cast to a valid date". (It's legacy data, and I'm trying to clean
> it up before importing into the final schema.) There probably aren't that
> many, and I can just keep hitting pg errors one row at a time until I work
> through this, but...

I guess you can use something like this procedure:

 create or replace function is_date(d varchar) returns boolean as $$
 declare
   tmp date;
 begin
   tmp := d::date;
   return true;
 exception
   when others then
     return false;
 end;
 $$ language plpgsql;

Re: Finding bogus dates

From
Scott Ribe
Date:
> I suppose you could create a boolean function that does a cast, and catches
> the
> execption, returning NULL.

Yes, I was puzzling over query syntax and didn't think about a function.
That would be fine.


--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: Finding bogus dates

From
"A. Kretschmer"
Date:
am  Thu, dem 18.01.2007, um 11:26:03 -0700 mailte Scott Ribe folgendes:
> Suppose that I have a varchar column which contains dates, and some of them
> are bogus, such as '1968-02-31'. Is it possible to specify a query condition
> "cannot be cast to a valid date". (It's legacy data, and I'm trying to clean
> it up before importing into the final schema.) There probably aren't that
> many, and I can just keep hitting pg errors one row at a time until I work
> through this, but...

You can write a function for this, read this chapter in the docu about
trapping errors:

http://www.postgresql.org/docs/8.2/interactive/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING



Andreas
--
Andreas Kretschmer
Kontakt:  Heynitz: 035242/47150,   D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID:   0x3FFF606C, privat 0x7F4584DA   http://wwwkeys.de.pgp.net

Re: Finding bogus dates

From
"Brandon Aiken"
Date:
Add a date field to the table.

Run UPDATE "foo" SET "newDate" = to_date("oldDate","MM/DD/YYYY").

Bad dates like 02/31/2006 will be converted to sane dates.  02/31/2006
--> 03/03/2006.

Now run SELECT * FROM "foo" WHERE to_char("newDate","MM/DD/YYYY") <>
"oldDate".  If a date got changed for sanity reasons, it'll be
different.

That should get most of 'em.

--
Brandon Aiken
CS/IT Systems Engineer
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Scott Ribe
Sent: Thursday, January 18, 2007 1:26 PM
To: pgsql-general postgresql.org
Subject: [GENERAL] Finding bogus dates

Suppose that I have a varchar column which contains dates, and some of
them
are bogus, such as '1968-02-31'. Is it possible to specify a query
condition
"cannot be cast to a valid date". (It's legacy data, and I'm trying to
clean
it up before importing into the final schema.) There probably aren't
that
many, and I can just keep hitting pg errors one row at a time until I
work
through this, but...

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to majordomo@postgresql.org so that your
       message can get through to the mailing list cleanly

Re: Finding bogus dates

From
Scott Ribe
Date:
I didn't know to_date would do that. It's better anyway. I just continued
with the "fix and try again" approach and they're only 2 bad dates out
94,000+, so I don't have a huge problem here. I can try to do some research
and find the correct date, but failing that, the to_date approximation is
probably no worse than using null.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: Finding bogus dates

From
"Brandon Aiken"
Date:
Actually, now that I think about it a second you can find them really
easy just by doing:

SELECT * FROM "foo"
WHERE to_char(to_date("oldDate",'MM/DD/YYYY'),'MM/DD/YYYY') <>
"oldDate";

--
Brandon Aiken
CS/IT Systems Engineer

-----Original Message-----
From: Scott Ribe [mailto:scott_ribe@killerbytes.com]
Sent: Thursday, January 18, 2007 3:48 PM
To: Brandon Aiken; pgsql-general postgresql.org
Subject: Re: [GENERAL] Finding bogus dates

I didn't know to_date would do that. It's better anyway. I just
continued
with the "fix and try again" approach and they're only 2 bad dates out
94,000+, so I don't have a huge problem here. I can try to do some
research
and find the correct date, but failing that, the to_date approximation
is
probably no worse than using null.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: Finding bogus dates

From
Scott Ribe
Date:
> Bad dates like 02/31/2006 will be converted to sane dates.

Actually, that's the core of the direct query!

select * from foo where to_date(olddate,'YYYY-MM-DD')::text <> olddate;

The format of the exported dates matches the default date::text format
because I specified the export that way. If not, the query would be a little
more complicated by having to specify the format, but the same basic idea
still would work.

Which is good, since I'm now on to a table that has 108 bogus dates. I'd
hate to find & fix them one at a time...

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: Finding bogus dates

From
"John D. Burger"
Date:
Scott Ribe wrote:

> Actually, that's the core of the direct query!
>
> select * from foo where to_date(olddate,'YYYY-MM-DD')::text <>
> olddate;
>
> The format of the exported dates matches the default date::text format
> because I specified the export that way. If not, the query would be
> a little
> more complicated by having to specify the format, but the same
> basic idea
> still would work.

But this won't work if one had a text column of dates in various
formats, right?  You could use a disjunction with lots of formats in
your query - or better yet, put all the formats you can think of in a
temp table and join against it:

   select * from foo
     where not exists (select format from dateformats
    where to_char(to_date(olddate, format), format) = olddate);

But the date casting code seems to be even more general, or at least
seems to know about many more formats than I'd be likely to dream up
on my own.

- John D. Burger
   MITRE



Re: Finding bogus dates

From
Scott Ribe
Date:
> But this won't work if one had a text column of dates in various
> formats, right?

Right. In my case I have bad data from a source I didn't control, exported
via code that I do control which happens to output YYYY-MM-DD. Well, except
that I don't do what I need to when MM or DD are more than 2 digits, but I'm
going back to look at that again ;-)

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



PG not rejecting bad dates (was Re: Finding bogus dates)

From
Ron Johnson
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/18/07 17:22, Scott Ribe wrote:
>> But this won't work if one had a text column of dates in various
>> formats, right?
>
> Right. In my case I have bad data from a source I didn't control, exported
> via code that I do control which happens to output YYYY-MM-DD. Well, except
> that I don't do what I need to when MM or DD are more than 2 digits, but I'm
> going back to look at that again ;-)

Why didn't the PG engine reject these bad-date records at INSERT
time.  This smacks of something that MySQL would do...

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFsAX+S9HxQb37XmcRAg9NAJ4mdQXdp0tLDpiLcTaOVVrZgUMdSACgjTti
9iVMxYAWNIKOwtfm3T38Aac=
=Rpbe
-----END PGP SIGNATURE-----

Re: PG not rejecting bad dates (was Re: Finding bogus dates)

From
David Fetter
Date:
On Thu, Jan 18, 2007 at 05:42:54PM -0600, Ron Johnson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 01/18/07 17:22, Scott Ribe wrote:
> >> But this won't work if one had a text column of dates in various
> >> formats, right?
> >
> > Right. In my case I have bad data from a source I didn't control, exported
> > via code that I do control which happens to output YYYY-MM-DD. Well, except
> > that I don't do what I need to when MM or DD are more than 2 digits, but I'm
> > going back to look at that again ;-)
>
> Why didn't the PG engine reject these bad-date records at INSERT
> time.  This smacks of something that MySQL would do...

I'm pretty sure it didn't accept these as bad dates, but as text
strings.  As you point out, it's a MySQLism to take "we are all here
to go into space" as a valid date.

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: PG not rejecting bad dates (was Re: Finding bogus dates)

From
"Chad Wagner"
Date:
On 1/18/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
> Right. In my case I have bad data from a source I didn't control, exported
> via code that I do control which happens to output YYYY-MM-DD. Well, except
> that I don't do what I need to when MM or DD are more than 2 digits, but I'm
> going back to look at that again ;-)

Why didn't the PG engine reject these bad-date records at INSERT
time.  This smacks of something that MySQL would do...

The original poster mentioned that the data type that the "date" was stored in was a varchar, not really much it can do there if don't use the right data type :(.

Not to mention how misleading it probably is to use a varchar for a data to the optimizer for calculating selectivity.

--
Chad
http://www.postgresqlforums.com/

Re: PG not rejecting bad dates (was Re: Finding bogus dates)

From
Ron Johnson
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/18/07 17:52, David Fetter wrote:
> On Thu, Jan 18, 2007 at 05:42:54PM -0600, Ron Johnson wrote:
>>
>> On 01/18/07 17:22, Scott Ribe wrote:
>>>> But this won't work if one had a text column of dates in various
>>>> formats, right?
>>> Right. In my case I have bad data from a source I didn't control, exported
>>> via code that I do control which happens to output YYYY-MM-DD. Well, except
>>> that I don't do what I need to when MM or DD are more than 2 digits, but I'm
>>> going back to look at that again ;-)
>> Why didn't the PG engine reject these bad-date records at INSERT
>> time.  This smacks of something that MySQL would do...
>
> I'm pretty sure it didn't accept these as bad dates, but as text
> strings.  As you point out, it's a MySQLism to take "we are all here
> to go into space" as a valid date.

Ah, the relevant snippet from OP:
> I have a varchar column

That would explain everything.  Except why it's a VARCHAR instead of
DATE.  But that's a whole 'nother discussion.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFsAtRS9HxQb37XmcRArUSAJ9p519CtEa10tHeOGmr83lXoaRZ1ACcDpl5
VOHvbmhM+kiSARaXKPg5ZfU=
=jz33
-----END PGP SIGNATURE-----

Re: PG not rejecting bad dates (was Re: Finding bogus dates)

From
David Fetter
Date:
On Thu, Jan 18, 2007 at 06:05:37PM -0600, Ron Johnson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 01/18/07 17:52, David Fetter wrote:
> > On Thu, Jan 18, 2007 at 05:42:54PM -0600, Ron Johnson wrote:
> >>
> >> On 01/18/07 17:22, Scott Ribe wrote:
> >>>> But this won't work if one had a text column of dates in various
> >>>> formats, right?
> >>> Right. In my case I have bad data from a source I didn't control, exported
> >>> via code that I do control which happens to output YYYY-MM-DD. Well, except
> >>> that I don't do what I need to when MM or DD are more than 2 digits, but I'm
> >>> going back to look at that again ;-)
> >> Why didn't the PG engine reject these bad-date records at INSERT
> >> time.  This smacks of something that MySQL would do...
> >
> > I'm pretty sure it didn't accept these as bad dates, but as text
> > strings.  As you point out, it's a MySQLism to take "we are all here
> > to go into space" as a valid date.
>
> Ah, the relevant snippet from OP:
> > I have a varchar column
>
> That would explain everything.  Except why it's a VARCHAR instead of
> DATE.  But that's a whole 'nother discussion.

As I understood OP, it's a staging table :)

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!

Re: Finding bogus dates

From
Adam Rich
Date:
I suppose you could create a boolean function that does a cast, and catches the
execption, returning NULL. 
 
If that doesn't work, a perl stored procedure using Date::Calc and check_date() ?

 


----- Original Message ----
From: Scott Ribe <scott_ribe@killerbytes.com>
To: pgsql-general postgresql.org <pgsql-general@postgresql.org>
Sent: Thursday, January 18, 2007 12:26:03 PM
Subject: [GENERAL] Finding bogus dates

Suppose that I have a varchar column which contains dates, and some of them
are bogus, such as '1968-02-31'. Is it possible to specify a query condition
"cannot be cast to a valid date". (It's legacy data, and I'm trying to clean
it up before importing into the final schema.) There probably aren't that
many, and I can just keep hitting pg errors one row at a time until I work
through this, but...

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to majordomo@postgresql.org so that your
       message can get through to the mailing list cleanly

Re: PG not rejecting bad dates (was Re: Finding bogus

From
Scott Ribe
Date:
>> That would explain everything.  Except why it's a VARCHAR instead of
>> DATE.  But that's a whole 'nother discussion.
>
> As I understood OP, it's a staging table :)

Right. And it's exactly because the original source has bogus data that I
need a staging table to load it up and study it and fix it.

--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice