The continuing saga of my 5000/200 [was Re: DECstation 5000/200 timekeeping]

Mouse

3 years ago

So, about that 5000/200 that's been suffering from inability to keep
even vaguely decent time: the saga continues.

Lacking much else to try, I decided to try for 1.4T on the 5000/200, in
the hope that that would work better.

1.4T appears to have a moderately mature pmax port, but it has next to
no cross-build support, so it's a bit more complicated. I installed
stock 1.4 and timekeeping is _much_ improved - but now, the SCSI driver
is flaky. It's not using the MI SCSI subsystem; disks show up as, for
example,

asc0 at tc0 slot 5 offset 0x0 (bus speed: 25 MHz) : target 7
Beginning old-style SCSI device autoconfiguration
rz1 at asc0 drive 1 slave 0 SEAGATE ST373307LC rev 0003
rz1: 69809MB, 49855 cyl, 4 head, 204 sec, 512 bytes/sect x 142969680 sectors

and, unless the hardware decided to break just at the same time as I
started running 1.4T, this rz stuff is depressingly flaky. I see
things like

rz1: Illegal request

which typically gets reflected as EIO to userland. Sometimes there are
additional messages as well, such as

asc_get_status: cmdreg 11, fifo cnt 3

I've got the machine running pseudo-diskless (booted off a disk, but
NFS-mounted my house NFS server and chrooted to that mount point) to
sidestep the SCSI issues and am trying a build of 1.4T into /altroot,
in the hope that that will produce something more useful. If not, I
may experiment with other versions; I've got an archive of, I _think_,
every release from 1.0 through 6.1.2, though it remains to be seen
whether there's a version that doesn't suffer from either of the above
issues (flaky SCSI or flaky timekeeping), and any version other than
1.4T, 4.0.1, or 5.2 would mean a significant effort to port the most
important of my changes over.

I also may have a stab at moving it to the MI SCSI subsystem, if 1.4T
doesn't bring that in.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Mouse

3 years ago

Post by Mouse
I've got the machine running pseudo-diskless (booted off a disk, but
NFS-mounted my house NFS server and chrooted to that mount point)
[...]

A further very odd symptom: the DS 5000/200 can ping the NFS server,
but the NFS server can't ping the DS. Even though NFS works fine.
(The NFS server is NetBSD/i386, my modified 5.2.)

tcpdump on the NFS server sees both packets for pings from the DS.
When I ping the DS from the NFS server, tcpdump on the DS sees both
request and reply, but the NFS server never sees the reply. I
currently have no clue whatever why not.


Jonathan Stone

3 years ago

Mouse,

Can you repeat your NTP "experiment" on a sun4c instead of the 5000/200? Sun4c has the same MI ASC driver and IIRC they don't have a cycle-counter. If the sun4c keeps decent time, that points to something pmax-specific (or possibly mips-specific). If it doesn't, that tends to confirm Maciej's hypothesis.

i haven't had access to a sun4c in .... almost 25 years, or i'd try myself.

Mouse

3 years ago

Post by Mouse
So, about that 5000/200 that's been suffering from inability to keep
even vaguely decent time: the saga continues.
[...] I installed stock 1.4 and timekeeping is _much_ improved [...]

Post by Jonathan Stone
Can you repeat your NTP "experiment" on a sun4c instead of the 5000/200? Sun$

I installed 5.2 - absolutely stock 5.2 - on a SPARCstation-2. I'm
fairly sure that's sun4c; if nothing else, when netbooting its TFTP
request name ends in ".SUN4C".

Configured and started ntp. Gave it some 5-10 minutes without doing
anything else, to let it settle down; it reported itself happily
synced.

I then started hitting the disk and network: pulling over tarballs and
untarring them, very much the sort of thing I was doing on the 5000/200
when the clock started drifting so drastically.

The clock refused to drift. Or, rather, since it was already drifting
slightly (10-100 ms per 64-second sample), it continued drifting, at
close enough to the same speed that I would have to keep careful
records to tell whether the load made a difference. It definitely did
not do the ten-seconds-per-minute sort of drift I saw on the 5000/200.
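To put the two drift figures on one scale, here is a rough conversion to parts per million. It is only a sketch: it reads "10-100 ms per 64-second sample" as a change in offset per polling interval, which is an assumption about how the numbers were taken, and the function name is made up for illustration.

```python
def drift_ppm(offset_change_s, interval_s):
    """Convert an offset change over an interval into parts per million."""
    return offset_change_s / interval_s * 1_000_000


# SS2 under load: 10-100 ms per 64-second sample (figures from the post)
ss2_low = drift_ppm(0.010, 64)
ss2_high = drift_ppm(0.100, 64)

# 5000/200: roughly ten seconds per minute (figure from the post)
ds_rate = drift_ppm(10.0, 60)

print(f"SS2:      {ss2_low:.2f} to {ss2_high:.2f} ppm")
print(f"5000/200: {ds_rate:.2f} ppm")
print(f"ratio:    {ds_rate / ss2_high:.0f}x to {ds_rate / ss2_low:.0f}x")
```

Under that reading, the 5000/200 is off by a double-digit percentage of elapsed time, while the SS2 stays in the hundreds-of-ppm range.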

I don't know whether I have any other MIPS machines, to help tell
whether this is a 3MAX issue or a pmax issue or a MIPS issue or what.
I do not have any at ready hand, certainly; if I do have any more, they
are buried somewhere in my storage unit.


Mouse

3 years ago

[Replying to content scraped from mail-index, hence suboptimal threading]

Post by Jonathan Stone
[[ sun4c does not show ~10-sec-per-min drift seen on DECstation 5000/200 ]]
Thanks very much! Can you confirm that there are no timecounter
sources on either machine, except for the RTC?

On the 5000/200, yes. As I wrote on 2021-10-26, I saw

kern.clockrate: tick = 3906, tickadj = 15, hz = 256, profhz = 256, stathz = 256

kern.timecounter.choice = clockinterrupt(q=0, f=256 Hz) dummy(q=-1000000, f=1000000 Hz)
kern.timecounter.hardware = clockinterrupt

On the SS2, I forgot to check that at the time. I just now booted that
setup again (I haven't had occasion to write to that disk since) and I
see

kern.clockrate: tick = 10000, tickadj = 40, hz = 100, profhz = 100, stathz = 100

kern.timecounter.choice = clockinterrupt(q=0, f=100 Hz) timer-counter(q=100, f=1000000 Hz) dummy(q=-1000000, f=1000000 Hz)
kern.timecounter.hardware = timer-counter
kern.timecounter.timestepwarnings = 0
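The tick and tickadj values in both kern.clockrate lines can be reproduced from hz alone. The sketch below assumes the traditional BSD definitions (tick is 1000000/hz truncated to an integer; tickadj allows 240 ms of adjustment per 60 seconds, with a floor of 1). Those formulas are an assumption, not something quoted in this thread, though they do match both outputs.

```python
def tick_us(hz):
    """Microseconds per clock interrupt, truncated like C integer division."""
    return 1_000_000 // hz


def tickadj_us(hz):
    """Assumed traditional formula: 240 ms of slew per 60 s, at least 1."""
    return max(1, 240_000 // (60 * hz))


def remainder_us_per_s(hz):
    """Microseconds per second lost to truncating tick to an integer."""
    return 1_000_000 - tick_us(hz) * hz


for name, hz in (("5000/200", 256), ("SS2", 100)):
    print(name, "tick =", tick_us(hz), "tickadj =", tickadj_us(hz),
          "remainder =", remainder_us_per_s(hz), "us/s")
```

The 256 Hz clock does not divide a second into a whole number of microseconds, so a kernel that advances time by whole ticks has a small remainder to account for. That remainder is far too small to explain a ten-seconds-per-minute drift.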

dmesg includes

timer0 at mainbus0 ioaddr 0xf3000000 ipl 10: delay constant 17, frequency = 1000000 Hz
timecounter: Timecounter "timer-counter" frequency 1000000 Hz quality 100

and sys/arch/sparc/sparc/timer.c (timer_get_timecount) looks, at a
quick glance, as though it's accessing some kind of free-running
counter hardware.
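The two kern.timecounter.choice lines can be compared mechanically. This is a minimal sketch that parses the sysctl text as shown above and applies what I understand to be the kernel's default rule (highest quality wins, ties go to the higher frequency, negative-quality counters are never picked automatically). The rule as coded here is an assumption, and the helper names are made up.

```python
import re

# Matches entries such as "timer-counter(q=100, f=1000000 Hz)"
CHOICE_RE = re.compile(r"([\w-]+)\(q=(-?\d+), f=(\d+) Hz\)")


def parse_choice(line):
    """Return (name, quality, frequency_hz) tuples from a choice string."""
    return [(name, int(q), int(f)) for name, q, f in CHOICE_RE.findall(line)]


def default_pick(counters):
    """Assumed default: best non-negative quality, then highest frequency."""
    usable = [c for c in counters if c[1] >= 0]
    return max(usable, key=lambda c: (c[1], c[2]))[0]


ds_choice = ("clockinterrupt(q=0, f=256 Hz) "
             "dummy(q=-1000000, f=1000000 Hz)")
ss2_choice = ("clockinterrupt(q=0, f=100 Hz) "
              "timer-counter(q=100, f=1000000 Hz) "
              "dummy(q=-1000000, f=1000000 Hz)")

print("5000/200:", default_pick(parse_choice(ds_choice)))
print("SS2:     ", default_pick(parse_choice(ss2_choice)))
```

This agrees with the kern.timecounter.hardware values reported on both machines, so the two tests were not like for like: the SS2 had a real 1 MHz counter available and the 5000/200 did not.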

I'll see if I can dig up other sun4c (SS1 or SS1+, most likely) to see
what they have.

Post by Jonathan Stone
i'm currently not well,

Oh, that's not good to hear. Here's hoping you get better soon and
thoroughly!

Post by Jonathan Stone
but i'll look into this as and when I can.

Thank you! But, please, take care of yourself first.


Jonathan Stone

3 years ago

On Tuesday, November 2, 2021, 03:40:59 PM PDT, Mouse <***@rodents-montreal.org> wrote:

[[ sun4c does not show ~10-sec-per-min drift seen on DECstation 5000/200 ]]

Thanks very much! Can you confirm that there are no timecounter sources on either machine, except for the RTC?
If so, then the problem is definitely in mips- or pmax-specific code. (Quite possibly the changes to use asc/sd, perhaps in interrupt handling?)

i'm currently not well, but i'll look into this as and when I can.

Mouse

3 years ago

Post by Jonathan Stone
Can you confirm that there are no timecounter sources on either
machine, except for the RTC?

On the 5000/200, yes. [...]
On the SS2, [...] I see
kern.clockrate: tick = 10000, tickadj = 40, hz = 100, profhz = 100, stathz = 100
kern.timecounter.choice = clockinterrupt(q=0, f=100 Hz) timer-counter(q=100, f=1000000 Hz) dummy(q=-1000000, f=1000000 Hz)
kern.timecounter.hardware = timer-counter
kern.timecounter.timestepwarnings = 0

I brought the SS2 back up again and manually set
kern.timecounter.hardware=clockinterrupt. I then repeated the test - start NTP
(same broadcastclient configuration as all my tests), wait some 5-10
minutes for it to stabilize, then hit the disk and network.

It stabilized less stably - a little less stably. After waiting eight
minutes (well, 8*64 seconds), the offset values from a remote xntpdc (x
because I'm running that on one of my 1.4T SPARCs) jump around, but are
all reasonably small. In milliseconds (ie, multiplied by 1000 from
what xntpdc prints), consecutive values were -.481, -1.202, -4.756,
-3.786, -10.985, -8.872, -2.558, -6.672. Not the best timekeeping, but
certainly not as bad as what I saw on the 5000/200.

Then I started hitting the disk and network (tar up a bunch of stuff on
a much faster machine, then ship it over the net and untar it on the
SS2).

The resulting offset values: -11.367, -7.422, -4.247, -7.454, -16.282.
At that point I killed the test, because it wasn't drifting anything
like as severely as the 5000/200. (Remember, these numbers are
milliseconds. The 5000/200 was drifting by over 10 seconds, at least
three orders of magnitude worse than what I saw here.)
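As a sanity check on the "three orders of magnitude" comparison, the offsets quoted above can be summarized directly. This sketch uses only the numbers from this message. Taking the mean absolute offset as the summary and treating the 5000/200 figure as a flat 10 seconds are my choices, not anything xntpdc reports.

```python
from math import log10
from statistics import mean

# Offsets in milliseconds, as listed in the message
idle_ms = [-0.481, -1.202, -4.756, -3.786, -10.985, -8.872, -2.558, -6.672]
load_ms = [-11.367, -7.422, -4.247, -7.454, -16.282]

# "drifting by over 10 seconds" on the 5000/200; 10 s is a lower bound
ds_drift_ms = 10_000.0


def mean_abs(values):
    """Mean of the absolute values of a sequence of offsets."""
    return mean(abs(v) for v in values)


ratio = ds_drift_ms / mean_abs(load_ms)

print(f"idle mean |offset|: {mean_abs(idle_ms):.3f} ms")
print(f"load mean |offset|: {mean_abs(load_ms):.3f} ms")
print(f"5000/200 vs SS2 under load: {ratio:.0f}x (10^{log10(ratio):.2f})")
```

Load roughly doubles the typical offset on the SS2, but both sets stay within the same order of magnitude as the 10 ms tick implied by hz = 100.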

I have tried various things to be surer the counter-timer isn't being
used. So far I have failed. Booting -c simply doesn't work. userconf
isn't entered at all. I have no idea why not; config -x says the
kernel is built with USERCONF turned on, and, based on booting with a
flags string containing unrecognized flags, the code in bootpath_build
that handles flags is running. I tried binary-patching the netbsd in
question to turn the "counter-timer" string used by timermatch_mainbus
into "xounter-timer" instead. That kernel exploded because it "could
not find xounter-timer in OPENPROM" - apparently string merging is
aggressive enough to share that string with code outside timer.c. So I
backed that out and instead patched timermatch_mainbus to change the
address it uses, to make it compare against "ounter-timer" instead.
Then it fails to attach ("counter-timer at mainbus0 ioaddr 0xf3000000
ipl 10 not configured"), but the next line is "panic: counter-timer",
because it turns out to be a panic-level error for any of the devices
in (the first section of) openboot_special4c to fail to attach. (So
using -c probably would have panicked even if userconf had worked.)
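The difference between the two patching attempts can be shown on a toy image. This is only a sketch: the byte string below is a synthetic stand-in, not a real kernel, and "other_user" is a hypothetical second reference standing in for whatever code outside timer.c shares the merged string.

```python
def find_cstrings(image, text):
    """Offsets where `text` occurs followed by a NUL terminator."""
    needle = text + b"\0"
    hits, start = [], 0
    while (pos := image.find(needle, start)) != -1:
        hits.append(pos)
        start = pos + 1
    return hits


def cstring_at(image, offset):
    """Read the NUL-terminated string starting at `offset`."""
    end = image.index(b"\0", offset)
    return bytes(image[offset:end])


# Synthetic image: one merged copy of the string among other data
image = bytearray(b"\x00\x01zs\0counter-timer\0auxiliary-io\0")
(offset,) = find_cstrings(image, b"counter-timer")

# Two users pointing at the same merged string (second one is hypothetical)
refs = {"timermatch_mainbus": offset, "other_user": offset}

# Attempt 1: overwrite the string's first byte; every user sees the change
by_bytes = bytearray(image)
by_bytes[offset] = ord("x")
seen_1 = {name: cstring_at(by_bytes, ptr) for name, ptr in refs.items()}

# Attempt 2: leave the bytes alone and advance one user's pointer by one
by_pointer = dict(refs, timermatch_mainbus=offset + 1)
seen_2 = {name: cstring_at(image, ptr) for name, ptr in by_pointer.items()}

print(seen_1)
print(seen_2)
```

Patching the bytes changes the name for everyone who shares the string, while patching the address changes it for one caller only, which is why the second attempt got as far as the attach failure.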

This then makes me think every sun4c has - must have - a counter-timer
node, or one of the above errors would trip at boot. Being certain
it's not being used would be significantly more work.

So I'm not as sure as I'd like that it's using just clock interrupts.
