Thread: Post-2018 messages in archives

Post-2018 messages in archives

From
Noah Misch
Date:
At some point in the last few months, the archives of many mailing lists added
messages dated far in the future.  For example, pgsql-hackers archives gained
four messages from years 2030, 2032 and 2036:

https://www.postgresql.org/list/pgsql-hackers/since/203011010000/

This disrupts my use of the "Next" link.  If you're looking at the last page
of messages and click "Next", you'll get a page with just the latest one
message.  Normally, if you refresh that page later, you'll see messages added
after you clicked "Next".  With the far-future messages in there, "Next"
brings one to https://www.postgresql.org/list/pgsql-hackers/since/203602080620
which won't get new messages regularly for another 18 years.

Perhaps the fix is to set the archive date to the archives ingest time when
the message asserts a date substantially (15min?) earlier or later.  Would
that be an improvement?
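
A minimal sketch of that rule in Python, assuming the archiver knows its own ingest time when it parses a message. The function name `archive_date`, its parameters, and the default tolerance are hypothetical and not taken from the archive code:

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def archive_date(date_header, ingest_time, tolerance=timedelta(minutes=15)):
    """Return the Date header's time, or ingest_time if the header is implausible.

    ingest_time must be a timezone-aware datetime.
    """
    try:
        claimed = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return ingest_time  # missing or unparseable Date header
    if claimed.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        claimed = claimed.replace(tzinfo=timezone.utc)
    if abs(claimed - ingest_time) > tolerance:
        return ingest_time  # too far in the past or the future
    return claimed
```

The tolerance is a parameter because the right value is an open question; 15 minutes is only the figure floated above.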


Re: Post-2018 messages in archives

From
Magnus Hagander
Date:
On Mon, Dec 3, 2018 at 2:40 AM Noah Misch <noah@leadboat.com> wrote:
> At some point in the last few months, the archives of many mailing lists added
> messages dated far in the future.  For example, pgsql-hackers archives gained
> four messages from years 2030, 2032 and 2036:
>
> https://www.postgresql.org/list/pgsql-hackers/since/203011010000/
>
> This disrupts my use of the "Next" link.  If you're looking at the last page
> of messages and click "Next", you'll get a page with just the latest one
> message.  Normally, if you refresh that page later, you'll see messages added
> after you clicked "Next".  With the far-future messages in there, "Next"
> brings one to https://www.postgresql.org/list/pgsql-hackers/since/203602080620
> which won't get new messages regularly for another 18 years.
>
> Perhaps the fix is to set the archive date to the archives ingest time when
> the message asserts a date substantially (15min?) earlier or later.  Would
> that be an improvement?


I wonder what caused this. I did a full reparse of the archives last week. Perhaps we had this problem before and cleaned it up manually at some point, and the reparse overwrote that manual cleanup.

Unfortunately we don't keep the ingest time separately. But for the future, doing so would probably be a good idea, for other reasons as well.  I think 15 minutes might be pushing it a bit given the kind of times we see around, in particular with incorrectly configured timezones. But something like 24h would probably work.

Luckily, it's not too terribly bad: 

archives=# select count(*) from messages where date > now();
 count
-------
    10
(1 row)

(out of about 1.3M messages).

So short-term I will go process those messages manually.

Re: Post-2018 messages in archives

From
Noah Misch
Date:
On Mon, Dec 03, 2018 at 10:08:20AM +0100, Magnus Hagander wrote:
> On Mon, Dec 3, 2018 at 2:40 AM Noah Misch <noah@leadboat.com> wrote:
> > At some point in the last few months, the archives of many mailing lists
> > added
> > messages dated far in the future.  For example, pgsql-hackers archives
> > gained
> > four messages from years 2030, 2032 and 2036:
> >
> > https://www.postgresql.org/list/pgsql-hackers/since/203011010000/

> > Perhaps the fix is to set the archive date to the archives ingest time when
> > the message asserts a date substantially (15min?) earlier or later.  Would
> > that be an improvement?

> Unfortunately we don't keep the ingest time separately. But for the future,
> doing so would probably be a good idea, for other reasons as well.  I think
> 15 minutes might be pushing it a bit given the kind of times we see around,
> in particular with incorrectly configured timezones. But something like 24h
> would probably work.
> 
> Luckily, it's not too terribly bad:
> 
> archives=# select count(*) from messages where date > now();
>  count
> -------
>     10
> (1 row)
> 
> (out of about 1.3M messages).
> 
> So short-term I will go process those messages manually.

Data looks clean now.  Thanks.  If the problem remains as rare as it has been,
the automated fix I was contemplating is premature.


Re: Post-2018 messages in archives

From
Magnus Hagander
Date:
On Wed, Dec 5, 2018 at 2:53 AM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Dec 03, 2018 at 10:08:20AM +0100, Magnus Hagander wrote:
> > On Mon, Dec 3, 2018 at 2:40 AM Noah Misch <noah@leadboat.com> wrote:
> > > At some point in the last few months, the archives of many mailing lists
> > > added
> > > messages dated far in the future.  For example, pgsql-hackers archives
> > > gained
> > > four messages from years 2030, 2032 and 2036:
> > >
> > > https://www.postgresql.org/list/pgsql-hackers/since/203011010000/
>
> > > Perhaps the fix is to set the archive date to the archives ingest time when
> > > the message asserts a date substantially (15min?) earlier or later.  Would
> > > that be an improvement?
>
> > Unfortunately we don't keep the ingest time separately. But for the future,
> > doing so would probably be a good idea, for other reasons as well.  I think
> > 15 minutes might be pushing it a bit given the kind of times we see around,
> > in particular with incorrectly configured timezones. But something like 24h
> > would probably work.
> >
> > Luckily, it's not too terribly bad:
> >
> > archives=# select count(*) from messages where date > now();
> >  count
> > -------
> >     10
> > (1 row)
> >
> > (out of about 1.3M messages).
> >
> > So short-term I will go process those messages manually.
>
> Data looks clean now.  Thanks.  If the problem remains as rare as it has been,
> the automated fix I was contemplating is premature.

Thanks for confirming.

I think it's still needed, in case either (1) it happens again, or (2) we fully reparse the archives again, which will reset it all. It's not urgent at this point, but I've left it on my TODO list to make sure we have a safeguard in there.


Re: Post-2018 messages in archives

From
Noah Misch
Date:
On Wed, Dec 05, 2018 at 09:39:18AM +0100, Magnus Hagander wrote:
> On Wed, Dec 5, 2018 at 2:53 AM Noah Misch <noah@leadboat.com> wrote:
> > On Mon, Dec 03, 2018 at 10:08:20AM +0100, Magnus Hagander wrote:
> > > On Mon, Dec 3, 2018 at 2:40 AM Noah Misch <noah@leadboat.com> wrote:
> > > > At some point in the last few months, the archives of many mailing
> > lists
> > > > added
> > > > messages dated far in the future.  For example, pgsql-hackers archives
> > > > gained
> > > > four messages from years 2030, 2032 and 2036:
> > > >
> > > > https://www.postgresql.org/list/pgsql-hackers/since/203011010000/
> >
> > > > Perhaps the fix is to set the archive date to the archives ingest time
> > when
> > > > the message asserts a date substantially (15min?) earlier or later.
> > Would
> > > > that be an improvement?
> >
> > > Unfortunately we don't keep the ingest time separately. But for the
> > future,
> > > doing so would probably be a good idea, for other reasons as well.  I
> > think
> > > 15 minutes might be pushing it a bit given the kind of times we see
> > around,
> > > in particular with incorrectly configured timezones. But something like
> > 24h
> > > would probably work.
> > >
> > > Luckily, it's not too terribly bad:
> > >
> > > archives=# select count(*) from messages where date > now();
> > >  count
> > > -------
> > >     10
> > > (1 row)
> > >
> > > (out of about 1.3M messages).
> > >
> > > So short-term I will go process those messages manually.
> >
> > Data looks clean now.  Thanks.  If the problem remains as rare as it has
> > been,
> > the automated fix I was contemplating is premature.
> >
> 
> Thanks for confirming.
> 
> I think it's still needed, in case either (1) it happens again, or (2) we
> reparse the archives fully again which will reset it all. It's not too
> urgent at this point though, but I've left it on my  TODO list to make sure
> we have a safeguard in there.

Works for me.  Pondering it more, the timestamp that matters most for archive
purposes is the timestamp at which list subscribers started to receive their
copies of the message.  Based on that, I'm thinking we should ignore the Date
header and always use the timestamp from a particular "Received ... by
HOSTNAME.postgresql.org" header.  Before settling on that, I'd want to check
how many messages change timestamp by more than ~100s, and I'd want to spot
check a few messages to see whether the change looks like an improvement.
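
A rough sketch of that check in Python, using only the standard `email` package. The function names, the host suffix match, and the choice of the topmost matching header are assumptions for illustration; which "Received" header to trust is exactly what would need settling:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def received_time(msg, host_suffix=".postgresql.org"):
    """Return the time from the topmost Received header added by a matching host."""
    for header in msg.get_all("Received", []):
        # A Received header has the form "<tokens>; <date-time>".
        tokens, sep, stamp = header.rpartition(";")
        if not sep:
            continue
        words = tokens.split()
        # The host that added the header follows the "by" keyword.
        hosts = [words[i + 1] for i, w in enumerate(words[:-1]) if w.lower() == "by"]
        if any(h.lower().endswith(host_suffix) for h in hosts):
            try:
                return parsedate_to_datetime(stamp.strip())
            except (TypeError, ValueError):
                continue  # unparseable timestamp, try the next header
    return None


def date_shift_seconds(msg):
    """Seconds by which the Received time differs from the Date header, or None."""
    received = received_time(msg)
    if received is None or msg["Date"] is None:
        return None
    try:
        return (received - parsedate_to_datetime(msg["Date"])).total_seconds()
    except (TypeError, ValueError):
        return None  # unparseable Date, or one side lacks a usable UTC offset
```

Running `date_shift_seconds` over an mbox and counting results whose absolute value exceeds 100 would give the number of messages that change timestamp by more than ~100s.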


Re: Post-2018 messages in archives

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Wed, Dec 05, 2018 at 09:39:18AM +0100, Magnus Hagander wrote:
>>> Unfortunately we don't keep the ingest time separately. But for the future,
>>> doing so would probably be a good idea, for other reasons as well.

> Works for me.  Pondering it more, the timestamp that matters most for archive
> purposes is the timestamp at which list subscribers started to receive their
> copies of the message.  Based on that, I'm thinking we should ignore the Date
> header and always use the timestamp from a particular "Received ... by
> HOSTNAME.postgresql.org" header.  Before settling on that, I'd want to check
> how many messages change timestamp by more than ~100s, and I'd want to spot
> check a few messages to see whether the change looks like an improvement.

Another point worth considering here is moderation queue delays, which
are not infrequently measured in days :-(.  I am not quite sure whether
it'd be better to tag a moderation-delayed message with the timestamp
when it entered the queue or the time when it exited.  But either one
would be better than believing the Date: header.

            regards, tom lane


Re: Post-2018 messages in archives

From
Noah Misch
Date:
On Wed, Dec 05, 2018 at 11:31:39PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Wed, Dec 05, 2018 at 09:39:18AM +0100, Magnus Hagander wrote:
> >>> Unfortunately we don't keep the ingest time separately. But for the future,
> >>> doing so would probably be a good idea, for other reasons as well.
> 
> > Works for me.  Pondering it more, the timestamp that matters most for archive
> > purposes is the timestamp at which list subscribers started to receive their
> > copies of the message.  Based on that, I'm thinking we should ignore the Date
> > header and always use the timestamp from a particular "Received ... by
> > HOSTNAME.postgresql.org" header.  Before settling on that, I'd want to check
> > how many messages change timestamp by more than ~100s, and I'd want to spot
> > check a few messages to see whether the change looks like an improvement.
> 
> Another point worth considering here is moderation queue delays, which
> are not infrequently measured in days :-(.  I am not quite sure whether
> it'd be better to tag a moderation-delayed message with the timestamp
> when it entered the queue or the time when it exited.  But either one
> would be better than believing the Date: header.

Good point.  I'd prefer to use the time when it exited the queue, which
conforms to "timestamp at which list subscribers started to receive their
copies of the message" mentioned above.  I usually download November's mbox in
the first few days of December.  If we use the timestamp of entering the queue
(or the Date header), there's no particular upper bound on when the November
mbox stops accruing new messages.


Re: Post-2018 messages in archives

From
Magnus Hagander
Date:
On Thu, Dec 6, 2018 at 7:14 AM Noah Misch <noah@leadboat.com> wrote:
> On Wed, Dec 05, 2018 at 11:31:39PM -0500, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Wed, Dec 05, 2018 at 09:39:18AM +0100, Magnus Hagander wrote:
> > >>> Unfortunately we don't keep the ingest time separately. But for the future,
> > >>> doing so would probably be a good idea, for other reasons as well.
> >
> > > Works for me.  Pondering it more, the timestamp that matters most for archive
> > > purposes is the timestamp at which list subscribers started to receive their
> > > copies of the message.  Based on that, I'm thinking we should ignore the Date
> > > header and always use the timestamp from a particular "Received ... by
> > > HOSTNAME.postgresql.org" header.  Before settling on that, I'd want to check
> > > how many messages change timestamp by more than ~100s, and I'd want to spot
> > > check a few messages to see whether the change looks like an improvement.
> >
> > Another point worth considering here is moderation queue delays, which
> > are not infrequently measured in days :-(.  I am not quite sure whether
> > it'd be better to tag a moderation-delayed message with the timestamp
> > when it entered the queue or the time when it exited.  But either one
> > would be better than believing the Date: header.
>
> Good point.  I'd prefer to use the time when it exited the queue, which
> conforms to "timestamp at which list subscribers started to receive their
> copies of the message" mentioned above.  I usually download November's mbox in
> the first few days of December.  If we use the timestamp of entering the queue
> (or the Date header), there's no particular upper bound on when the November
> mbox stops accruing new messages.

Given that this has happened 10 times across 1.25 million messages, I really can't get excited about building any form of complicated solution for it. :)

So for this, just using the automatic timestamp assigned to the row when it enters the archives should do. Normally it will differ by only a second or a few from the suggestions above, and it would only grow larger if the archives server was temporarily down or there were other delivery issues.
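
As an illustration of that approach, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the real database. The `messages` table and `date` column names come from the query earlier in the thread; the `ingested` column, its name, and the one-day cutoff are hypothetical. In PostgreSQL the equivalent column would be a `timestamptz` with `DEFAULT now()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        date TEXT NOT NULL,                               -- from the Date header
        ingested TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- set on insert
    )
    """
)
# One message claiming a far-future date, one with a sane date.
conn.execute("INSERT INTO messages (date) VALUES ('2036-02-08 06:20:00')")
conn.execute("INSERT INTO messages (date) VALUES (CURRENT_TIMESTAMP)")

# Messages whose claimed date is more than a day ahead of their ingest time.
suspect = conn.execute(
    "SELECT id FROM messages WHERE date > datetime(ingested, '+1 day')"
).fetchall()
```

With the ingest time stored per row, a full reparse could compare against it instead of `now()`, so a manual cleanup would no longer be the only safeguard.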
