List of Archived Posts

1993 Newsgroup Postings

360/67, was Re: IBM's Project F/S ?
360/67, was Re: IBM's Project F/S ?
360/67, was Re: IBM's Project F/S ?
Self-virtualization and CPUs
360/67, was Re: IBM's Project F/S ?
360/67, was Re: IBM's Project F/S ?
Self-virtualization and CPUs
HELP: Algorithm for Working Sets (Virtual Memory)
PowerPC Architecture (was: Re: PowerPC priced very low!)
Where did the hacker ethic go?
PowerPC Architecture (was: Re: PowerPC priced very low!)
Where did the hacker ethic go?
managing large amounts of vm
managing large amounts of vm
S/360 addressing
unit record & other controllers
unit record & other controllers
unit record & other controllers
location 50
1st non-ibm 360 controller
Most Embarrasing Misposting
Too much data on an actuator (was: 3.5 inch *9GB* )
Assembly language program for RS600 for mutual exclusion
MTS & LLMPS?
Multicpu's ready for the masses yet?
MTS & LLMPS?
MTS & LLMPS?
crabby, stu, initials, etc
Log Structured filesystems -- think twice
Log Structured filesystems -- think twice
/etc/sendmail.cf UUCP routing
Big I/O or Kicking the Mainframe out the Door
Big I/O or Kicking the Mainframe out the Door

360/67, was Re: IBM's Project F/S ?

From: lynn@netcom2.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: 9 Apr 93 16:58:47 GMT

The machine started out as a 360/65 with .75microsecond memory (NO
processor cache) ... to which was added relocation hardware (8 entry
fully-associative look-up hardware) which added .15microseconds to
memory/address operations (.9microsecond total). Typical instructions
operated in 2-3 memory cycles.

CP/67 ran on the 360/67 and was the predecessor to the VM/370 product
... done by the IBM Cambridge Scientific Center. The original version
was a project implemented on a (one of a kind) custom modified 360/40
that had hardware address relocation added to it. Some of the people
had worked on CTSS and at the start there were a number of
similarities. In some sense CP/67 implementation can be considered a
slightly-older "brother" to Unix (same heritage, slightly older).
One of the more interesting aspects of cp/40 was that it implemented
virtual machines with the ctss-like user interface implemented as
a single user operating system (called CMS ... for cambridge monitor
system) running in a virtual machine under CP/40.

In '68 & '69, I added to CP/67 a clock-like global LRU replacement
algorithm, dynamic adaptive feed-back fair-share scheduling, ordered
seek queueing, something akin to working-set controls to prevent page
thrashing, fast-path for dispatch, scheduling, interrupt handling and
numerous other functions. Part of the pseudo-working-set page
thrashing controls was dynamic adaptation of the MPL
(multiprogramming level) to the efficiency of the page I/O subsystem,
i.e. higher MPL and more memory contention for faster page I/O
subsystems, lower MPL and less memory contention for slower page I/O
subsystems ... in effect, MPL was limited to the sum of the
pseudo-working-sets that could fit in the available pageable pages
... but part of the pseudo-working-set
calculation took into account the effectiveness of the page I/O
subsystem.
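
For readers unfamiliar with the term, here is a minimal sketch of a
generic clock (second-chance) replacement pass in Python. It is the
textbook form of the idea and not the CP/67 code; the class name,
method names and the software "referenced" flags (standing in for
hardware reference bits) are assumptions made for illustration.

```python
class ClockReplacer:
    """Generic clock (second-chance) replacement over a fixed set of frames."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames        # page held by each frame
        self.referenced = [False] * num_frames   # stand-in for hardware reference bits
        self.where = {}                          # page -> frame index
        self.hand = 0                            # position of the clock hand

    def touch(self, page):
        """Reference a page; return True if this caused a page fault."""
        if page in self.where:
            self.referenced[self.where[page]] = True
            return False
        victim = self._select_frame()
        old = self.frames[victim]
        if old is not None:
            del self.where[old]                  # evict the previous occupant
        self.frames[victim] = page
        self.where[page] = victim
        self.referenced[victim] = True
        return True

    def _select_frame(self):
        # Sweep the hand: a referenced frame has its bit reset and is
        # skipped; the first empty or unreferenced frame is taken.
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % len(self.frames)
            if self.frames[i] is None or not self.referenced[i]:
                return i
            self.referenced[i] = False


clock = ClockReplacer(num_frames=3)
faults = [clock.touch(p) for p in [1, 2, 3, 1, 4, 1, 2]]
```

The sweep costs only a few operations per frame examined, which is
the property that makes this family of algorithms attractive as a
cheap approximation to LRU.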

In early '70s (20+ years ago) this system running on a single cpu
768kbyte '67 (104 4k pageable pages) would provide subsecond
interactive response with 80+ users in a dynamic mixed-mode
(interactive/batch) ... with the processor operating at 100%
utilization. Under this load the page I/O rate would typically average
around 150 4k-pages/sec.

Part of the secret was doing near optimal, dynamically-adaptive
page-replacement algorithm as well as near optimal
dynamically-adaptive page thrashing controls in almost no
instructions. To take the page-fault interrupt, (near-optimally)
select the page-replacement, schedule the page I/O, execute the page
I/O, do the process-switch, later take the page i/o interrupt, and
task-switch back to the original process would occur on the average in
under 500 instructions (this average included a pro-rated percentage
of the instructions needed to do page-writes for replacing
changed/modified pages).

The Grenoble Scientific Center used a similar base to implement a
"real" working-set scheduling algorithm that ran on a 1mbyte
single-cpu '67 (154 pageable pages, i.e. amount of storage left after
fixed/pinned kernel storage requirements). They provided approximately
similar performance for 35 users on their machine ... that we were
providing on our machine (80+ users, similar workload and 1/3rd less
available memory).

We also did a SMP 2-cpu version of the implementation ... and out of
that work a co-worker originated a MP synchronization instruction. A
couple months were spent coming up with a mnemonic that was his
initials ... finally hit on Compare&Swap ... although the mnemonic
today frequently is shortened to CS. The SMP/'67 was somewhat unique
in having a totally independent I/O subsystem with independent path to
memory. A "half-duplex" (i.e. a SMP with SMP I/O controller and only
one processor) '67 processor would run a simple numerical intensive
workload slightly slower than a "simplex" '67. However, in most
real-world environments (with heavy I/O load), a half-duplex '67 would
sustain a higher instruction/second rate than a simplex '67.

As an aside, the system was heavily instrumented and there were a
great deal of performance tests done ... in some of my crazed moments
I would run series of automated tests that were all at the edge of the
envelope (150 active users, mixed-mode workload running in 104
pageable-page configuration) ... which is actually harder than it
might look at first since a large percentage of things that crash a system
had to be fixed (including eliminating ALL causes of things like
zombie processes).

Note that 12 years later (over 10 years ago), similar software on a
processor with 50 times the MIP rate and 50 times the number of
pageable pages was typically only supporting 3-4 times the number of
users (in theory in a similar mixed-mode workload environment). The
graphs that I was drawing at the time indicated a growth in workload
capacity (at least in terms of online, active users) proportional to
the change in the disk I/O subsystem thruput over the 12 year period.

360/67, was Re: IBM's Project F/S ?

From: lynn@netcom4.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: 9 Apr 93 21:20:51 GMT

the cp/67 fastpath made a big difference on being able to run
production operating system (OS/360 MFT) in one virtual machine while
supporting other virtual machines running the CMS single-user
interactive operating system (on the same hardware).

On the day I got CP/67 ... the elapsed time it took our MFT to execute
our benchmark jobstream in a CP/67 virtual machine was 2.5 times the
elapsed time it took to run it stand-alone (running the '67 in '65
mode). This was a hand-crafted MFT. The standard MFT "out-of-the-box"
took twice the elapsed time to execute the benchmark jobstream.

When I was done with the CP/67 fastpath code, the elapsed time for the
jobstream benchmark was 1.15 times the "stand alone" time (part of
this was the difference between the standalone '65 mode running
real with 750nsec memory cycle ... and relocate virtual machine
mode adding 150nsec to the memory cycle time).

360/67, was Re: IBM's Project F/S ?

From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: 10 Apr 93 17:12:41 GMT

when we got cp/67 in early '68 ... it only had 2741 & 1050 support
and our account had a number of tty ... so I had to add tty support
to cp/67. At the same time, I rewrote the terminal interface driver
so that it would dynamically recognize whether the attached device
was 2741, 1050, or tty (i.e. single rotary dial-up could be used
regardless of the device type).

unfortunately the ibm hardware people told me later that I shouldn't
be able to do that. While the hardware controller actually supported
being able to dynamically change the line-scanner associated with
any line ... somebody had taken a short-cut and hardwired the oscillator
frequency to in-coming lines. It was an "accident" that tty bit
frequency and 2741 bit frequency were close enuf that something that
wasn't supposed to work ... actually did.

In response to that ... we formed a four-man team that built the first
ibm "oem" controller. it included support for strobing the initial
incoming bits on a connection in order to dynamically determine the
terminal speed (i.e. dynamically determine both terminal speed and
terminal type). Unfortunately while most "other" ibm mainframe software
would "tolerate" dynamic speed determination ... they wouldn't
tolerate dynamically changing the terminal type on a line.
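
A rough Python sketch of the speed-detection idea: oversample the
line, time the first pulse, and pick the closest known speed. This is
an illustration only and not the controller's logic. The function
names, the sampling representation, the assumption that the first low
pulse is exactly one bit wide, and the candidate speeds (110 and
134.5 bits/sec) are all assumptions that do not come from the post.

```python
def estimate_bit_rate(samples, sample_rate_hz):
    """Estimate bits/sec from the width of the first low pulse.

    samples: list of 0/1 line levels, idle line assumed to be 1.
    Assumes the first low pulse is a single start bit (i.e. the bit
    that follows it is a 1), which is a simplification.
    """
    if 0 not in samples:
        return None                      # nothing arrived on the line
    start = samples.index(0)
    end = start
    while end < len(samples) and samples[end] == 0:
        end += 1
    width = end - start                  # pulse width in samples, >= 1
    return sample_rate_hz / width


def nearest_speed(rate, candidates):
    """Pick the candidate line speed closest to the measured rate."""
    return min(candidates, key=lambda c: abs(c - rate))


# Hypothetical line sampled at 1100 Hz carrying a 10-sample-wide start bit.
line = [1] * 5 + [0] * 10 + [1] * 10
measured = estimate_bit_rate(line, sample_rate_hz=1100)
speed = nearest_speed(measured, candidates=[110, 134.5])
```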

In order that the mainline IBM software didn't feel completely left
out ... we later started a project for terminal support in MVT/HASP
(this was MVT OSr18 and I believe HASP-III). I ripped out the "2780"
remote device support and replaced it with my terminal driver from
CP/67. I then re-implemented the CMS syntax-directed editor's command
set (had to be all re-entrant code, original CMS code wasn't). It gave
MVT/HASP a powerful conversational remote job entry capability.

Self-virtualization and CPUs

From: lynn@netcom4.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: Self-virtualization and CPUs
Date: 11 Apr 93 19:54:45 GMT

note that self-virtualization includes the ability to operate the
hypervisor/vm-monitor under itself ... where the "2nd-level"
hypervisor might also be running copies of itself and/or other
applications in virtual machines (I've seen a 3-level stack for
specific application environment ... i.e. 3-level nested hypervisors
under which something else was running doing productive work ... this
is aside from various experiments that just toyed to see how deep the
stack could be made).

there are cases of virtual machine hypervisors running in hardware
environments that are not self-virtualizable ... i.e. the hypervisor
might be able to provide virtual machines for other purposes ... but
is unable to "hypervise" itself.

360/67, was Re: IBM's Project F/S ?

From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: 12 Apr 93 01:49:50 GMT

Work on clock-like algorithm (design, implementation, deployment) was
done in '68 while sysprog at university. Enhancement for more
dynamic switching between random & LRU was done >20 years ago while I
was at IBM. No publications from that period were made ... just
deployed the code. I did include description in talk I gave at 10/86
SEAS meeting on the Isle of Jersey (it was taken from an early '83
white paper I did on "Performance History" ...  spanning 15 year
period from '68 to '83 ... which also highlighted the fact that
relative performance of disk I/O subsystems had declined by up to an
order of magnitude in the period).

at various times over the past 20+ years I've suggested it as
technology for things like software caching & hardware disk caching
controllers and other things that have a tendency to use LRU
algorithms.

.........

some related page replacement:

L. Belady, A Study of Replacement Algorithms for a Virtual Storage
Computer, IBM Systems Journal, v5n2, 1966

L. Belady, The IBM History of Memory Management Technology, IBM
Journal of R&D, v25n5

R. Carr, Virtual Memory Management, Stanford University,
STAN-CS-81-873 (1981)

R. Carr and J. Hennessy, WSClock, A Simple and Effective Algorithm
for Virtual Memory Management, ACM SIGOPS, v15n5, 1981

P. Denning, Working sets past and present, IEEE Trans Softw Eng, SE6,
jan80

J. Rodriguez-Rosell, The design, implementation, and evaluation of a
working set dispatcher, cacm16, apr73

............

related paper using cluster analysis to restructure programs for vmm
environment:

D. Hatfield & J. Gerald, Program Restructuring for Virtual Memory,
IBM Systems Journal, v10n3, 1971

..........

also try Pat O'Neil (poneil@cs.ubm.edu) ... I was talking to him about
a year ago ... he was doing a section on replacement algorithms for a book.

360/67, was Re: IBM's Project F/S ?

From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: Wed, 14 Apr 1993 05:09:56 GMT

note that by the mid-70s ... a significant number of environments
(running vm/370) were severely I/O constrained ... all types of
I/O  .... page i/o, file i/o, etc.

In terms of working set, by 1978 ... a significant number of
configurations had three times as much real memory as would nominally
be needed to maintain MPL level with CPU saturation. However, many of
these same systems had paging bottlenecks. It wasn't the page
thrashing bottlenecks from the '60s ... that led to solutions like
working-set scheduling to try and address real storage contention. It was
that there was insufficient page I/O capacity. Even if you took as the
value for "workingset" ... the total number of virtual pages touched
during a reasonable execution period (potentially several cpu seconds)
... all the pending processes would still be allowed in the dispatch
queue. However, at the same time there was insufficient page I/O
capacity just to get the pages into memory ... there was a paging
bottleneck but not because of real memory contention (like the '60s)
... it was a page I/O bottleneck. This is part of the basis for my
statement that in the 15 years between 68 and 83, the relative
performance of disk subsystems had declined by up to a factor of ten
(processor complex got 50 times faster while i/o subsystems only got
five times faster).
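
The arithmetic behind the "factor of ten" can be spelled out in a few
lines of Python. The two speedup figures are the ones quoted above;
the variable names are only illustrative.

```python
cpu_speedup = 50   # processor complex, '68 to '83 (figure quoted in the post)
io_speedup = 5     # disk i/o subsystem over the same period (figure quoted in the post)

# Disk thruput measured relative to the processor it has to feed.
relative_io = io_speedup / cpu_speedup

# Equivalently, how far the disk subsystem fell behind the processor.
decline_factor = cpu_speedup / io_speedup

print(relative_io, decline_factor)
```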

Part of my solution in '78 was block paging ... which looked an awful
lot like the old time "swapping" from the '60s. My original scheduler
attempted to implement "scheduling to the bottleneck" ... i.e. if the
CPU was the constrained resource ... "fair share" was based on cpu
resource consumption. If I/O &/or memory became more of the bottleneck
... then an attempt was made to shift the fair share calculations so
that they were based more on working set size or I/O utilization (i.e.
default cpu scheduler in a page thrashing environment would nominally
attempt to give the best preferential treatment to the process
generating the most paging activity ... since it would be the one most
likely using the least amount of CPU). By '78, the set of
constrained/saturated resources for typical environments were
frequently NOT THE CPU. However, many of the environments had
progressed to the point where just scheduling based on I/O usage
(rather than CPU usage) was not adequate. Policies analogous to
workingset sizes (real storage utilization) were needed for I/O
resources.

A partial solution to the opportunity was the program restructuring
work (referenced in prior append). I produced a sparse bit map of
virtual address space locations (in 32-byte increments) that were
referenced during a 1000 instruction interval. Hatfield ran "cluster
analysis" (see some recent postings on the subject of cluster analysis
in comp.programming) against the storage traces. He produced a storage
reordering that would attempt to provide an improved ordering/packing
of objects in virtual memory (strong/minimal working set).
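
A small Python sketch of what such a trace reduction might look like.
The 32-byte granule and 1000-instruction interval are the figures
given above; the function name, the (instruction number, address)
trace format, and the use of sets as the "sparse bit map" are
assumptions made for illustration, not the original tooling.

```python
def reference_bitmaps(trace, interval=1000, granule=32):
    """Reduce a storage-reference trace to one sparse map per interval.

    trace: iterable of (instruction_number, address) pairs.
    Returns {interval_index: set of granule numbers referenced}, where
    each set plays the role of a sparse bit map.
    """
    maps = {}
    for instr_no, addr in trace:
        maps.setdefault(instr_no // interval, set()).add(addr // granule)
    return maps


# Hypothetical trace: three references in the first interval, one in the second.
trace = [(0, 0), (5, 31), (10, 32), (1000, 4096)]
bitmaps = reference_bitmaps(trace)
```

Maps of this kind are the sort of input a clustering step could work
from, grouping objects that tend to be referenced in the same
intervals.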

Storage utilization maps were also printed out. With respect to
earlier comments about apl/360 ... in the straight forward apl/360
port to CMS ... the code was effectively ported over intact. The
original OS code essentially swapped a 64kbyte "real" region between
disk and memory.  The code running against the 64kbyte workspace
treated it as "real memory". The storage management was very
primitive, sequentially using locations of memory until all memory was
exhausted ... and then doing garbage collection back to minimal
storage. This produced a saw-tooth storage utilization map. It wasn't
really too bad when dealing with a 64kbyte region ... but it became
really unfriendly when talking about a 1meg workspace ... even for a
1kbyte application. The apl/360 storage management would proceed to
(effectively) touch every location in the 1meg region and then do
garbage collection collapsing it back down to 1kbyte. A q&d fix was to
activate garbage collection at more frequent intervals (not just at
bumping the top of the workspace area).
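
The effect can be illustrated with a toy Python model of a
sequential ("bump") allocator. It is only a sketch under stated
assumptions: the 4k page size, the 1kbyte allocation size, the
function name and the "collect every N allocations" policy are all
made up for the example and are not taken from apl/360 or the CMS
port.

```python
PAGE = 4096  # assumed page size in bytes


def pages_touched(workspace, live, alloc, num_allocs, gc_every=None):
    """Count distinct pages touched by a bump allocator.

    Storage is handed out sequentially above the live data. Garbage
    collection collapses back down to the live data, either only when
    the top of the workspace is hit (gc_every=None) or additionally
    after every gc_every allocations.
    """
    touched = set(range(0, (live - 1) // PAGE + 1))   # the live data itself
    top = live
    since_gc = 0
    for _ in range(num_allocs):
        hit_top = top + alloc > workspace
        early_gc = gc_every is not None and since_gc >= gc_every
        if hit_top or early_gc:
            top = live            # collapse back to minimal storage
            since_gc = 0
        first = top // PAGE
        last = (top + alloc - 1) // PAGE
        touched.update(range(first, last + 1))
        top += alloc
        since_gc += 1
    return len(touched)


# 1meg workspace, 1kbyte of live data, 2000 allocations of 1kbyte each.
only_at_top = pages_touched(1 << 20, 1024, 1024, 2000)
frequent_gc = pages_touched(1 << 20, 1024, 1024, 2000, gc_every=16)
```

In this toy model, collecting only at the top of the workspace
touches every page of the 1meg region, while the more frequent
collection keeps the touched set to a handful of pages.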

There was a recent comment that some of the larger cache sized
machines (1-4mbyte cpu caches) were providing opportunities for the
re-emergence of old '60s optimization technology associated with
storage packing/optimization.

An associated issue that has been lurking in the back of my mind has
to do with various OOPs technologies. Given an (non-OOPs) optimized
program with "packed" instructions as well as "packed" data (variables
that are packed together in small total storage space and can be
accessed in a single reference) ... what happens when it is translated
into an OOPs environment. Does the OOPs storage allocation (for data)
spread out into lots of diffuse locations (multiple storage references
across multiple different cache lines)? Do methods result in a
similar diffusion of code?

Words of wisdom from Zippy:
It don't mean a THING if you ain't got that SWING!!

Self-virtualization and CPUs

From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: Self-virtualization and CPUs
Date: 14 Apr 93 19:13:06 GMT

there is an annoying problem running VMM opsys (hypervisor or
otherwise) under a VMM hypervisor .. when the opsys is attempting
production work and its storage isn't pinned. Typically the hypervisor
is executing a LRU-like replacement algorithm ... as is the opsys.

the annoyance is that the first level system has a tendency to page
out the page that the 2nd level system would like to use next for
paging. situation can become quite pathological.

this applies to other types of scenarios also ... like DBMS systems
managing caches in virtual memory. If the DBMS doesn't have a hook into
the opsys' VMM ... the DBMS can repeatedly be picking a cache page
that the opsys has removed from real storage. The scenario then is
the opsys has to page the virtual page back in before the DBMS can
schedule the contents for replacement.

HELP: Algorithm for Working Sets (Virtual Memory)

From: lynn@netcom4.UUCP (Lynn Wheeler)
Newsgroups: comp.arch.storage
Subject: Re: HELP: Algorithm for Working Sets (Virtual Memory)
Date: Sat, 24 Apr 1993 15:38:19 GMT

see my posting on 4/12 to comp.arch (360/67, was Re: IBM's Project
F/S?)  ... also see my post on 4/9 to comp.arch ... same subject. The
Rodriguez-Rosell article mentioned in the 4/12 posting is a description
of the Grenoble Scientific Center project mentioned in the 4/9 posting.

PowerPC Architecture (was: Re: PowerPC priced very low!)

From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: PowerPC Architecture (was: Re: PowerPC priced very low!)
Date: Thu, 6 May 1993 15:37:27 GMT

32bit, 48bit, 52bit, 64bit, etc.

Typically 32bit applies to the number of bits in the virtual
address.

The hardware takes that virtual 32bit address and typically looks it
up in some sort of table-look-aside buffer to find the real address
(although in prior discussion on 360/67, it was a fully-associative
look-up, instead of a set-associative index). The TLB lookup is
typically composed of three parts:

a) which virtual address space the virtual address is
   associated with
b) the virtual address
c) the real address

Using the two components of the lookup (which virtual address space
identifier and which virtual address), a "real-address" is produced.
The number of bits in the real-address is typically associated with
the total amount of real hardware it is possible to configure on the
machine (which can either be greater or less than the associated
virtual address size).

On machines with page-tables, the virtual address space identifier is
typically somehow associated with the page-table real address (either
the actual address of the page table ... or possibly a page table
origin stack index). On machines with a limited page table origin
stack index (say 2-4), process switches that also involve address
space switch ... can incur a performance penalty if a typical scenario
requires more address space switches than the depth of the stack (say
a micro-kernel implementation with different address spaces for each
component in the kernel, and there are typically more kernel components
than there is space in the page-table stack index). When the pagetable
stack is exceeded, some sort of LRU algorithm is used to invalidate
a pagetable stack entry ... which then requires invalidating all the
associated entries in the TLB.
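
The lookup and the purge described above can be sketched as a small
Python model. It is not a description of any particular machine: the
class and method names, the 4k page size and the use of a dictionary
in place of associative hardware are all assumptions made for
illustration.

```python
PAGE_BITS = 12  # assumed 4k pages


class TLB:
    """Toy TLB keyed on (address space identifier, virtual page)."""

    def __init__(self):
        self.entries = {}  # (space_id, virtual_page) -> real_page

    def load(self, space_id, virtual_page, real_page):
        self.entries[(space_id, virtual_page)] = real_page

    def translate(self, space_id, vaddr):
        """Return the real address, or None on a TLB miss."""
        virtual_page = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        real_page = self.entries.get((space_id, virtual_page))
        if real_page is None:
            return None
        return (real_page << PAGE_BITS) | offset

    def purge_space(self, space_id):
        """Drop every entry that belongs to one address space identifier."""
        self.entries = {k: v for k, v in self.entries.items()
                        if k[0] != space_id}


tlb = TLB()
tlb.load(space_id=1, virtual_page=0x10, real_page=0x2A)
hit = tlb.translate(1, 0x10123)
other_space = tlb.translate(2, 0x10123)   # same virtual address, different space
tlb.purge_space(1)
after_purge = tlb.translate(1, 0x10123)
```

The purge is the expensive part in the scenario above: reusing an
identifier for a different address space means every translation
tagged with it has to go.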

Some machines that have segmented virtual address space, may implement
either a segment table stack associated TLB or a page table stack
associated TLB (i.e. entries are associated with an address space
identifier, i.e. the segment table real address origin OR a virtual
segment associated TLB, the page table real address origin for a
specific virtual segment within the virtual address space).

In any case, the TLB entry typically will have some sort of
virtual address space identifier PLUS the virtual address
identifier ... which then maps to a real address. The number of
bits used in the TLB for a virtual address space identifier
is dependent on whether the real page (or segment) table address
is stored ... or whether there is a separate address space stack,
and the TLB entry only contains an index into the address space
identifier stack (where the address space identifier may be
the unique address space, or possibly unique segment within
the address space). In a segmented virtual address space architecture,
with a system design point that would include large numbers
of shared segments, there is a slight performance advantage
to use segment-associated identifier rather than virtual address
space identifier (i.e. with virtual address space identification,
information regarding a shared segment's pages would potentially
occur multiple times in the TLB).

Power architecture implements inverted page tables, there are NO page
and segment tables found in some other architectures.  The power
architecture does not have a virtual address space register that
points to the virtual->real page table mapping ...  since no such
table exists. To provide the TLB hardware with a mechanism for
uniquely identifying which virtual address space a virtual page exists
in, the Power architecture defines a logical segment identifier.
Effectively this logical segment identifier takes the place of a
pointer to a real storage location of page tables (found in other
architectures). The logical segment identifier is used by the TLB
hardware to uniquely distinguish which virtual address space a
specific virtual page belongs to (in a page-table architecture, the
virtual page number is used to index into a real page table that
exists in storage someplace, the address of that specific real page
table is what is used to distinguish one program's virtual address
space from some other program's virtual address space).

In the Power architecture with inverted pagetables, there are no
pagetables ... and therefore there is no real page table address
that can be used to distinguish one program's virtual address
space from any other program's virtual address space. The Power
architecture, in place of a real page table address, uses a logical
segment identifier to distinguish between one virtual address space
and another virtual address space.

The Power architecture implements a virtual segment architecture using
16 segment table registers (many machines use a single virtual
register, which points to a real storage location that contains the
page-table ... for flat virtual address space, or a segment table, for
segmented virtual address space). When decoding a virtual address in
the power architecture, the high 4 bits of the address are used to
index one of the segment registers (in other segmented architectures,
the segment part of the virtual address is used to index an entry in
some real segment table, and picks out a specific page-table). The
specific segment table register selected, yields the logical segment
identifier. This is roughly equivalent to a PTO-associative (i.e.
Page table origin) TLB, in a segmented virtual address space
architecture.
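
The decode step just described can be sketched in a few lines of
Python. The 32-bit address and the use of the high 4 bits come from
the text above; the function name and the made-up register contents
are assumptions for illustration only.

```python
SEGMENT_SHIFT = 28   # 32-bit address, high 4 bits select a segment register


def segment_lookup(segment_registers, vaddr):
    """Split a 32-bit virtual address into (logical segment id, offset)."""
    index = (vaddr >> SEGMENT_SHIFT) & 0xF        # which of the 16 registers
    offset = vaddr & ((1 << SEGMENT_SHIFT) - 1)   # displacement within the segment
    return segment_registers[index], offset


# Hypothetical register contents: 16 arbitrary logical segment identifiers.
registers = list(range(100, 116))
segment_id, offset = segment_lookup(registers, 0x30001234)
```

The (logical segment id, page within segment) pair is then what a
TLB entry would be tagged with, in place of a page table address.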

The number of bits in this logical segment identifier roughly
determines the total number of different virtual segments that
might have virtual page address entries in a TLB at any one time.
It is roughly equivalent to the maximum number of different page
tables that can exist in real memory (in other architectures that have
segment/page tables).

In a Power implementation to talk about the total number of bits in
the combined virtual address and the logical segment id, is roughly
equivalent to talking about the number of simultaneous different
virtual address spaces that can exist in other systems (i.e. maximum
virtual address space size times the total number of different
concurrent address spaces).

Where did the hacker ethic go?

From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: alt.internet.services,alt.unix.wizards,comp.ai,comp.graphics
Subject: Re: Where did the hacker ethic go?
Date: Sat, 8 May 1993 02:27:35 GMT

it seems more likely that there is a relatively small supply of
hackers, regardless of the number of programmers. There are now a
significantly larger number of people doing programming ... w/o a
corresponding increase in the number of the old style hackers.
They may still be there, just not as proportionally significant.

as an aside, one of my favorite questions along this line has to do
with language proficiency ... using the guideline that a measure
of proficiency in learning a foreign language is when the person
stops "translating" and starts to "think" in the language ... how
many people find themselves proficient in a programming language?

PowerPC Architecture (was: Re: PowerPC priced very low!)

From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: PowerPC Architecture (was: Re: PowerPC priced very low!)
Date: Sat, 8 May 1993 02:33:27 GMT

for a segmented virtual memory architecture that uses (only) the
high four bits of a 32-bit virtual address to select a segment
number ... implies that there are at most 16 "segments" in the
virtual memory segmented architecture ... and that the segment
size is bounded by 0 and 2**(32-4).

Where did the hacker ethic go?

From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: alt.internet.services,comp.ai,comp.graphics
Subject: Re: Where did the hacker ethic go?
Date: Thu, 13 May 1993 16:38:56 GMT

ok, given that computer language proficiency works for programming in
the small, what works for programming in the large? programming is
a relatively young human endeavor, so there is little or no natural
language vocabulary/lexicon for it.

The "book" paradigm implies the use of craft/artistic side of the
brain, rather than the analytical side. Is the switch because of the
lack of words/vocabulary? Would it be possible to do programming in
the large using the analytical side if a person could generate symbol
abstractions on the fly? Given that neither the artistic approach nor
"abstraction on the fly" translates to natural language well (i.e. can
one describe the brain processes analytically associated with programming
in the large?) is it possible to tell the difference?

managing large amounts of vm

From: lynn@netcom2.com (Lynn Wheeler)
Newsgroups: comp.arch
Subject: managing large amounts of vm
Date: Tue, 29 Jun 1993 15:34:02 GMT

there are a number of optimizations for managing large
amounts of virtual memory.

an obvious one that dates from the late '60s is to not allocate
backing store for virtual memory until process/application actually
touches&changes a page.

in the early '70s, i implemented paging page tables (similar
in concept to paging virtual memory ... but applied to "inactive"
virtual memory control tables).

with the advent of (relatively) large real memory ... there is another
optimization technique. traditionally when a page is read into memory,
the backing store location is left allocated.  when the page is
selected for replacement ... if it has not been changed ... there is
no need to "write" the page to disk (the copy on disk is left
"intact").  in 1980, i implemented a scheme which monitors depletion
of backing store space (disk paging space) and switches the management
of backing store space from a "dup" algorithm to a "no-dup" algorithm.

A "dup" algorithm leaves the backing store slot allocated when a
virtual page is read in. A "no-dup" algorithm deallocates the backing
store slot when a virtual page is read in (i.e. there is no
"duplicate" virtual memory page left on the paging disk). In a "dup"
situation, the max/total number of allocated virtual pages is
effectively the size of the space allocated on disk for paging. In a
"no-dup" situation, the max/total number of allocated virtual pages
becomes the combination of real memory and paging disk space.

The "dup" algorithm achieves some performance benefit in not having
to "write" replaced pages that haven't been changed (disk i/o load).
The "no-dup" algorithm trades off additional disk i/o load for a potential
increase in the total number of extant virtual memory pages.

A simple example is high-end workstation with 512mbytes of real memory
and 512mbytes of disk paging space. In a "dup" scenario, the total
amount of extant virtual memory effectively becomes 512mbytes. In the
"no-dup" case, the total amount of virtual memory can grow to
512mbytes+512mbytes.

The switch back&forth between "dup" and "no-dup" operational modes
is based on total number of allocated virtual memory pages compared
to the total number of disk paging page slots.
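
The capacity difference and the mode switch can be sketched in
Python. The 512mbyte figures are the ones from the example above;
the function names, the 4k page size and in particular the 90%
switch-over threshold are assumptions made for illustration, since
the actual trigger point is not given here.

```python
PAGE = 4096  # assumed page size in bytes


def max_virtual_pages(real_pages, disk_slots, mode):
    """Upper bound on allocated virtual pages under each strategy."""
    if mode == "dup":
        return disk_slots                 # every page needs a slot on disk
    return real_pages + disk_slots        # resident pages hold no disk slot


def choose_mode(allocated_virtual_pages, disk_slots, threshold=0.9):
    """Switch to "no-dup" as paging space nears depletion.

    The threshold is a hypothetical tuning knob, not a value from the post.
    """
    if allocated_virtual_pages > threshold * disk_slots:
        return "no-dup"
    return "dup"


real_pages = 512 * 1024 * 1024 // PAGE    # 512mbytes of real memory
disk_slots = 512 * 1024 * 1024 // PAGE    # 512mbytes of disk paging space

dup_limit = max_virtual_pages(real_pages, disk_slots, "dup")
no_dup_limit = max_virtual_pages(real_pages, disk_slots, "no-dup")
light_load = choose_mode(100_000, disk_slots)
heavy_load = choose_mode(120_000, disk_slots)
```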

            lynn

managing large amounts of vm

From: lynn@netcom.com (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: managing large amounts of vm
Date: Tue, 29 Jun 1993 18:21:33 GMT

when all slots are allocated (whether using a dup or no-dup strategy)
then lock-up can occur if the process requesting additional page slots
is not terminated. Systems that I worked on since the late '60s always
implemented the termination strategy (as opposed to system lock-up,
... oh and with strategies for biasing termination towards non-system
processes).

however, there is a difference in the operating region between "dup"
and "no-dup" strategy when the termination strategy takes effect.  In
the "dup" case, it occurs when all page slots on disk are allocated,
in the no-dup case, it occurs when all slots in the combined paging
disk area plus real memory are allocated.

The issue regarding implementing the dynamic dup/no-dup strategy
presents itself if there are a significant number of operating regions
where the allocated virtual pages are >(disk page slots) ... and
<(disk page slots)+(real memory slots).

A sample scenario, a user tunes an application so that it takes all
real memory slots ... i.e. little or no paging activity during the
execution of the application. Assuming this is a workstation, single
user environment ... while such an application is running, the
majority of extraneous demon & other system process virtual memory
pages are rolled out to disk. In the "dup" case, there has to be enuf
page-slots available on disk for all the demons plus all of the
application. In the "no-dup" case, there only has to be enuf
page-slots available on disk to handle all the demon/system process
virtual memory pages. A non-trivial number of workstations have
real-memory configurations that can contain all demon/system process
virtual memory pages.

As a result, some number of large memory workstation environments
could actually be configured with fewer disk page slots than there is
real memory (assuming no-dup strategy) .... i.e. none of the pages in
real-memory have allocated slots on disk. If a demon "goes off" during
the execution of the application ... its pages can be "paged-into real
memory", its disk page-slots released, and an equivalent number of the
application pages "page-out" to the newly released disk page-slots
(that had belonged to the demon before it was brought in).

... oh, btw, in the transition from "dup" to "no-dup" strategy,
the code also had to run thru pages in real-memory and release
any associated disk page slots.

               lynn

S/360 addressing

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom2.com (Lynn Wheeler)
Newsgroups: comp.sys.next.advocacy,comp.arch
Subject: Re: S/360 addressing
Date: Thu, 16 Sep 1993 17:22:58 GMT

... except for 360/67 (mid to late 60s) which had two virtual address
modes, 24bit and 32bit (not 31bit). 360/67mp also had full-blown
channel controller, something that wasn't seen again (in IBM, except
possibly for the 125 IOPs) until 3033 days (late 70s). late models of
the 3033 also had a kind-of "real-mode" 26-bit addressing (i.e.
64mbyte storage) ... which was not specifiable in instructions but the
real page number could be specified in the virtual address
page-table-entry.

trivia fact: one of the results of mp operating system work on the
360/67 at IBM/CSC in the late 60s & early '70s was the architecting
of the compare and swap instruction. the choice of the mnemonic "CAS"
was because it is the initials of the person that did the primary work
(i.e. the designation compare & swap was based on requirement to find
word combination to match the initials) although the mnemonic is
frequently corrupted to CS. A requirement (placed on the group) for
getting CAS implemented in machine hardware involved architecting a
use that wasn't MP-specific (i.e.  atomic storage update by
application non-disabled, interruptible regions of code).
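
A rough sketch of that non-MP use, in Python: an application updates a
shared word with a retry loop instead of disabling interrupts. The
Word class and atomic_add are made-up names, and the lock inside
compare_and_swap only stands in for the atomicity that the hardware
instruction provides by itself.

```python
import threading

class Word:
    """One storage word; the lock stands in for the hardware's atomicity."""

    def __init__(self, value=0):
        self.value = value
        self._atomic = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Store new only if the word still holds expected; report success."""
        with self._atomic:
            if self.value == expected:
                self.value = new
                return True
            return False

def atomic_add(word, delta):
    """Retry loop: the caller holds no lock and stays interruptible."""
    while True:
        old = word.value
        if word.compare_and_swap(old, old + delta):
            return old + delta
        # somebody else changed the word in between; fetch and try again

counter = Word(0)
workers = [
    threading.Thread(target=lambda: [atomic_add(counter, 1) for _ in range(1000)])
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

If the update is interrupted between the fetch and the swap, the swap
simply fails and the loop goes around again, so no update is lost.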

              lynn

unit record & other controllers

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom6.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: unit record & other controllers
Date: Thu, 23 Sep 1993 06:22:59 GMT

my first programming job was undergraduate summer job. university had
a 709 with a 1401 front-end for handling unit record (all card input
went to 7track tape on the 1401, carried over to 709 and jobs run ...
output went to tape ... which was carried back to the 1401 tape drive
and would produce printer/punch output).

The university was going to replace the 709 with a 360 .. as an
interim step the 1401 was replaced with a 360/30 ... and I was given
the job of implementing the 1401 MPIO utility on the 360. I was
supplied with 360 assembler & machine code manuals as well as 1401,
2540printer/punch and tape drive manuals. The program needed to be able
to handle both input & output tapes simultaneously (i.e. card->tape and
tape->printer/punch running at the same time). It needed to be able to
both run under os/360 as well as stand-alone managing its own
interrupts&I/O as well as error recovery.

the 2540 reader/punch had five output pockets ... two pockets that
could be fed from the reader, two pockets that could be fed from the
punch, and a center pocket that could be fed from either. I only
worked on one application that used the center pocket ... does anybody
else have applications that used the center pocket?

The 2540 card reader could be operated in two ways,

1) a single I/O operation which would read/feed in a single operation,
2) separate commands for "feeding" a card and for "reading" a card

## trivia question: what was one of the primary reasons for the
separate feed/read mode of operation?

+++++
Lynn Wheeler           | internet: lynn@netcom.com

         answer follows

2540 reader answer:

The ibm 80-col card has 12 rows ... for BCD (& EBCDIC) a col. encodes
a single (6bit) character (8bit for ebcdic) mapping into a single
byte. However, for "binary", two six-bit bytes were encoded/punched
using all 12 rows in each column. Typically a card was fed & then
read using "BCD" read operation (which would read a single column into
a single byte). If the card had invalid "BCD" punch codes (i.e. col.
binary), the I/O operation would result in an error. It was then
necessary to reread the card using col. binary I/O operation (which
read the 80 columns into 160 byte positions).
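
As a rough illustration of the 80-column to 160-byte expansion, here is
a Python sketch. The row-to-bit assignment (upper six rows into the
first byte, lower six rows into the second, top row as the high-order
bit) is an assumption made for the example, as are the function names.

```python
# card rows from top to bottom
ROWS = ["12", "11", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

def column_to_bytes(punched_rows):
    """Split one 12-row column into two 6-bit values (upper rows, lower rows)."""
    bits = 0
    for row in ROWS:
        bits = (bits << 1) | (1 if row in punched_rows else 0)
    return (bits >> 6) & 0x3F, bits & 0x3F

def read_column_binary(card):
    """card: list of 80 sets of punched row names -> 160 byte values."""
    out = bytearray()
    for column in card:
        upper, lower = column_to_bytes(column)
        out += bytes([upper, lower])
    return bytes(out)

card = [set() for _ in range(80)]
card[0] = {"12", "2", "9"}          # a 12-2-9 punch in col. 1
image = read_column_binary(card)
```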

several years later I got a copy of the stand-alone LLMPS which had a
whole set of various types of unit record routines. I was also
informed that LLMPS formed the original core/nucleus for the MTS
system. Can anybody verify that?

my 2540 center pocket application:

doing class scheduling application using cards, the class schedule
cards were read into the center pocket. As each card was read, it was
analyzed. If some error was found in the card, a card was
(blanked) punched behind it (the cards from the punch side had
different colored top strip). Postprocessing cards in error just
involved locating all the cards in the tray that had a colored strip
card following them.

... hasp misc.

somewhere in the dim past I had a project where I replaced the HASP
2780 driver code with TTY & 2741 drivers along with a CMS-like context
editor ... to provide a HASP-based CRJE support.

I had programmed the 2702 telecommunications software so that the
software could automatically recognize whether there was a 2741 or
a TTY coming in on the dial-in line (and update the appropriate fields).
Supposedly the 2702 had all the command "control" functions that allowed
it to work. After it was all up and working (demo'ing that both TTY
and 2741 could dial into the same base rotary) ... the IBM hardware people
told me it wouldn't work. The 2702 design & implementation had all the
necessary logic to "switch" line-scanners on a per-line basis. The hitch
was that somewhere along the way, they took a hardware shortcut and the
frequency oscillator was hardwired to each line. While the line-scanner
could be changed ... the line speed couldn't. For whatever reason, there
was enuf slop between the 2741 rate and the TTY rate that it worked anyway.

I think it was not too long after that we started a project to build our
own telecommunications controller (I believe the machine eventually became
the first OEM IBM control unit).

unit record & other controllers

Refed: **, - **, - **, - **, - **, - **
From: lynn@netcom6.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: unit record & other controllers
Date: Thu, 23 Sep 1993 07:55:37 GMT

oh yes, this first oem telecommunications controller that we did ...
it not only handled terminal type recognition but also did automatic
speed recognition ... on initial connection ... it effectively did
something like 10* oversampling on the in-coming bits to get an
estimate of bit duration.
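
One way such an estimate could work, sketched in Python (this is a
guess at the general idea, not the controller's actual logic): sample
the line much faster than the highest expected bit rate and take the
shortest run of identical samples as one bit time. The sample rate and
the sample data are made up.

```python
def shortest_run(samples):
    """Length of the shortest completed run of identical samples."""
    shortest = None
    run = 1
    for prev, cur in zip(samples, samples[1:]):
        if cur == prev:
            run += 1
        else:
            shortest = run if shortest is None else min(shortest, run)
            run = 1
    return shortest

def estimate_baud(samples, sample_rate_hz):
    """Bit rate implied by the shortest run seen on the line."""
    return sample_rate_hz / shortest_run(samples)

# idle line, a start bit, then some data bits, sampled at 1100 samples/sec
line = [1] * 25 + [0] * 10 + [1] * 10 + [0] * 20 + [1] * 30
baud = estimate_baud(line, 1100)
```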

when we first attached it to the ibm byte-mux channel, it red-lighted
the CPU (i.e. hardware failure). It turns out that we were holding
op-in for >13mics. The cpu timer on the 360 was located in main
memory and "ticked" every 13mics. If the timer was locked out of
updating the memory location for two ticks ... it generated a hardware
failure.

After that was overcome ... we started transmitting lots of data to
memory of the 360 ... however on closer examination it looked as if it
was all garbage. the test case we were using was tty ... and the
"controller" had a small task monitor that a tty could interact with
(i.e. the core machine was an ascii minicomputer). It took me several
hours to realize the problem. As it turns out we were just transmitting
straight ascii into the ibm mainframe ... which it finds to be
garbage. The "problem" was that (at least) the 2702 line-scanner took
the in-coming leading bit and placed it into the low-order bit
position in a byte. Effectively from "straight" ascii byte standpoint
the bits in a byte had been reversed. In order for the ibm mainframe
to think we were transmitting it valid ascii ... we had to invert the
bit order in the byte before sending it to ibm mainframe memory.
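
The bit-order fix can be sketched as a 256-entry translate table in
Python (the function names are made up for the example):

```python
def reverse_bits(byte):
    """Reverse the order of the 8 bits in one byte."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (byte & 1)
        byte >>= 1
    return result

# build the table once, then translate whole buffers
REVERSE = bytes(reverse_bits(b) for b in range(256))

def to_line_scanner_order(data):
    """Invert the bit order of every byte in data."""
    return bytes(data).translate(REVERSE)

wire = to_line_scanner_order(b"A")   # ascii 'A' is 0x41 = 0100 0001
```

Applying the same table twice gives back the original bytes, so the
same routine serves for both directions.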

trivia question: what does the following 360 mainframe instruction do:

      d207004c0050

--
Lynn Wheeler                | lynn@netcom.com

the instruction is a memory to memory move of 8 bytes from
hex location 50 to hex location 4c (overlapping move). Since
the "timer" was 4 bytes at location hex 50 in low memory, in
a single atomic instruction, it saved the current value of
the timer and reloaded it with a new value.

A frequent convention was to keep a (large) standard value at
hex 54 (the location to be moved into 50). The difference between
the value at 4c and 54 (after the move) was effectively the
elapsed time since the previous reset of the timer. This delta
value was used to accurately update time used by different
processes (at least to 13microsecond resolution).
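
The move can be traced with a small Python sketch. It decodes the six
instruction bytes (opcode, length-minus-one, then two base/displacement
addresses) and does the byte-at-a-time, left-to-right move. The timer
and reload values are made up for the example.

```python
def decode_ss(instr_hex):
    """Decode a 6-byte storage-to-storage instruction with one length field."""
    raw = bytes.fromhex(instr_hex)
    opcode = raw[0]
    length = raw[1] + 1                       # field holds byte count - 1
    b1, d1 = raw[2] >> 4, ((raw[2] & 0x0F) << 8) | raw[3]
    b2, d2 = raw[4] >> 4, ((raw[4] & 0x0F) << 8) | raw[5]
    return opcode, length, (b1, d1), (b2, d2)

def mvc(memory, dest, src, length):
    """Move bytes one at a time, left to right."""
    for i in range(length):
        memory[dest + i] = memory[src + i]

opcode, length, (b1, d1), (b2, d2) = decode_ss("d207004c0050")

STANDARD = 0x7FFFFF00                         # hypothetical reload value at hex 54
memory = bytearray(256)
memory[0x50:0x54] = (STANDARD - 0x1234).to_bytes(4, "big")   # timer has counted down
memory[0x54:0x58] = STANDARD.to_bytes(4, "big")

mvc(memory, d1, d2, length)                   # base fields are 0: absolute addresses

saved = int.from_bytes(memory[0x4C:0x50], "big")      # old timer value
reloaded = int.from_bytes(memory[0x50:0x54], "big")   # timer after the move
elapsed = int.from_bytes(memory[0x54:0x58], "big") - saved
```

Because the target is below the source, the overlap does no harm: each
source byte is read before the move gets around to overwriting it.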

unit record & other controllers

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom6.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: unit record & other controllers
Date: Thu, 23 Sep 1993 16:05:12 GMT

a new trivia question: what is the 12-2-9 punch in col. 1 of a punch
card?

....

"hex" decks in 360 ebcdic were identified with 12-2-9 (0x02) in col1
followed by three character card-type identifier.

my 1st programming job, a 1401-MPIO replacement for the program that
did front-end unit record I/O for the 709 ... quickly grew to an unwieldy
size of over 2,000 cards. it took 30 minutes to assemble (translate
from symbolic machine language to a "hex" card deck). This was the
"stand-alone" version ... The version with OS data management macros
(to leave OS executing while it ran, rather than take over the whole
machine) took significantly longer. A single "DCB" macro took 6
minutes (elapsed time) to assemble. 6-7 DCB macros & other misc. stuff
pushed the assemble time for my program to well over an hour.

it took a couple months ... but it soon became faster for me to
repunch "hex" cards to patch a program ... than it was to re-assemble
the whole program ... in doing so, I had to learn to read hex punch
cards (i.e. translate the punch holes into hexadecimal equivalent).

The card key-punch had two functions necessary for this,

	"duplicate"  - i.e. old card in copy position, new card in punch
	position ... and hold down the duplicate key

	"multi-punch" - hold down a key that stopped the automatic
	card col. advancement and allowed the rows (in a col) to
	be individually selected

... that was before I learned about REP cards.

Note that keypunches had the equivalent of carriage-control tapes ...
which were standard punch cards, appropriately encoded (with commands)
and wrapped around a small drum located in the top middle of the
keypunch (026s & 029s). A relatively simple function was to
"interpret" (read the punch-holes and print the corresponding
character on the top of the card) a card deck (at least non-hex card
deck that had been punched on say a 2540 punch). Load the hopper with
your (punched) card deck, set up the control card to automatically
interpret, feed the first card ... and it could automatically process
the rest of the deck.

+++++
Lynn Wheeler           | internet: lynn@netcom.com

location 50

Refed: **, - **, - **, - **, - **, - **
From: lynn@netcom6.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: re: location 50
Date: Thu, 23 Sep 1993 21:39:34 GMT

code would "mess" up various stuff in MVT if you were running MVT
(like the CVT update) ... but not if you were running a different
operating system like cp/67 ... and this was the code it used to do
timer maintenance as well as cpu accounting for all activity
(accounting by virtual machine for both time spent in virtual machine
mode ... as well as accounting for all time spent in
supervisor/kernel mode & for which process the time in kernel mode was
on behalf of).

mft/mvt (& vs1/svs) had a harder time of storage integrity even with
standard 360 storage protect because everything resided in the same
address space ... and all kernel code expected to execute in key0
(rather large-grained storage protect). With lots of bugs in the
kernel code from which the kernel had little or no protection ... the
most serious was various types of low-core overlay of the first 128
bytes of real storage. The hardware interrupt vector for "program"
interrupts (various instruction failures) was in the first 128 bytes
of storage. Other "undefined" locations in low storage were also used
by low-level kernel routines (depending on operating system). Given
the gross granularity of protection and the frequency that programming
bugs resulted in loading a zero/null value into an address/base
register ... there was a significant percentage of software errors
that resulted in critical storage corruptions.

note there is a current thread running over in comp.arch on dealing
with NULL storage pointers. Part of the problem was that a 360
instruction could address the 1st 4k of real memory in two ways:

1) not specifying an address register (i.e. 0) and
   using address encoded in the instruction
2) specifying an address register ... and having a null
   value in the register

in general, instructions using the 1st mode were "valid" and
instructions using the 2nd mode tended to be software errors.
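
A small Python sketch of why the two cases are indistinguishable to
the storage hardware. A base field of 0 means "no base register", and
the displacement is 12 bits, so both forms land in the 1st 4k. The
register numbers and values are made up for the example.

```python
def effective_address(regs, base, displacement):
    """Base field 0 means "no base register", not "contents of register 0"."""
    base_value = regs[base] if base != 0 else 0
    return (base_value + displacement) & 0xFFFFFF     # 24-bit addressing

regs = [0] * 16
regs[0] = 0x123456     # ignored when the base field is 0
regs[12] = 0           # a pointer register that was never set up

intended = effective_address(regs, 0, 0x050)    # mode 1: deliberate low-core reference
accident = effective_address(regs, 12, 0x050)   # mode 2: null base register
```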

1st non-ibm 360 controller

From: lynn@netcom.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: 1st non-ibm 360 controller
Date: Sun, 26 Sep 1993 19:31:17 GMT
Lines: 12

i believe that the manufacturer of the minicomputer that we used for
360 telecommunications controller project (dynamic terminal identification,
dynamic speed recognition, etc) ... also built the minicomputer that
was used for the non-dec/pdp unix port (some 10+ years later).

--
Lynn Wheeler                | lynn@netcom.com

Most Embarrasing Misposting

From: lynn@netcom6.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: Most Embarrasing Misposting
Date: Thu, 7 Oct 1993 21:37:14 GMT

it wasn't a posting ... but I once pointed at the wrong file as a cc:
list and sent out email to 15,000+ people

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

Too much data on an actuator (was: 3.5 inch *9GB* )

From: lynn@netcom6.com (Lynn Wheeler)
Newsgroups: comp.arch.storage
Subject: Re: Too much data on an actuator (was: 3.5 inch *9GB* )
Date: Thu, 7 Oct 1993 21:49:38 GMT

there is not only non-uniform arrival rates but also non-uniform
access characteristics and/or even bursty access patterns.

assuming some uniformity of access patterns, it is possible to do
access/profiles for various data clusters in terms of
accesses/mbyte/sec.

Not only can data-clusters with certain (low) access rate profiles be
migrated to slower speed devices ... but it is also logically possible
to partition a large (9gb?) drive ... where possibly some small
amount of data with very high access/mbyte/sec activity was positioned
in one partition and other partitions had much lower activity data
(in theory the allocation could be done such that the sum of the
accesses/mbyte/sec for all the various data clusters were within
the performance envelope of the drive).

this is harder to do for really bursty profiles ... unless the
aggregation is large enuf such that some reasonable predictable
statistical average is meaningful.
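
A sketch of the allocation check described above, in Python. The
drive capacity, the drive's sustainable access rate and the cluster
profiles are all invented numbers, used only to show the bookkeeping.

```python
def fits_on_drive(clusters, drive_mbytes, drive_accesses_per_sec):
    """clusters: list of (mbytes, accesses_per_mbyte_per_sec) profiles."""
    total_mbytes = sum(mbytes for mbytes, _ in clusters)
    # each cluster contributes size * access density to the drive's load
    total_rate = sum(mbytes * density for mbytes, density in clusters)
    return total_mbytes <= drive_mbytes and total_rate <= drive_accesses_per_sec

hot = (100, 0.3)        # small, very active cluster: about 30 accesses/sec
cold = (8000, 0.003)    # large, quiet cluster: about 24 accesses/sec

ok = fits_on_drive([hot, cold], drive_mbytes=9000, drive_accesses_per_sec=60)
too_busy = fits_on_drive([hot, hot, cold], drive_mbytes=9000, drive_accesses_per_sec=60)
```

In the second case the drive still has space left over, but the summed
access rate is outside what the single actuator can deliver.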

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

Assembly language program for RS600 for mutual exclusion

Refed: **, - **, - **, - **, - **, - **, - **
From: lynn@netcom5.com (Lynn Wheeler)
Newsgroups: comp.unix.aix
Subject: Re: Assembly language program for RS600 for mutual exclusion
Date: Mon, 11 Oct 1993 15:36:58 GMT

rs/6000 has no equivalent atomic instruction. aix has system
call that emulates compare&swap semantics ... and there is
c library definition for "CS" mnemonic. it is special cased
in the svc interrupt handler so that emulation is done within
8 instructions. since the interrupt handler is running disabled,
there is no possibility of interruption and to all intents and
purposes it is atomic from the standpoint of your program.

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

MTS & LLMPS?

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom5.com (Lynn Wheeler)
Newsgroups: alt.fan.mts,alt.folklore.computers
Subject: Re: MTS & LLMPS?
Date: Mon, 11 Oct 1993 20:18:44 GMT

cp/67 was a rewritten version of "cp/40". csc did a custom design of
relocation hardware that was prpq implemented on a 360/40. That was
the hardware that was used for the implementation of cp/40 ...  most
of the stuff translated from ctss. cp/40 was then ported to the 360/67
when that hardware became available.

the cp/40 and initial version of cp/67 had a scheduler that looked
like it might have been right out of ctss. One of the people at LL
(around summer of '68) replaced that with a significantly simpler
mechanism that included a mechanism for controlling page thrashing
(number of tasks allowed to run simultaneously was a function of real
storage independent of the execution characteristics of those tasks).

I don't believe that cp/40 had any llmps code.

llmps was similar in implementation to my first programming job which
was to do a version of the 1401 MPIO (card-tape/tape-printer spooler)
running on a 360/30 (so the 360/30 could be used as "front-end" for
709).

I didn't see any cp/67 code until january of '68 ... which was a couple
months after a version was made available to LL.

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

Multicpu's ready for the masses yet?

From: lynn@netcom5.com (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: Multicpu's ready for the masses yet?
Date: Mon, 11 Oct 1993 22:48:35 GMT

in the past when there has been significant underutilization of fixed
and floating point units .... usually because of instruction decoder
stalls (branches or cache misses) ... an approach to boost the
instruction feed into the fixed & floating units was to add one (or
more) independent instruction streams. From the programming
point-of-view the architectures were smp ... but the hardware
underneath may or may not be fully replicated.

superscalar and multiple fixed & floating point units are further
starting to blur the distinction. On a single chip with multiple fixed
& floating point units ... a design trade-off issue would be the
difficulty in having a single pool of (multiple) fixed & floating
point units being fed by two (or more) independent instruction
streams.

A side question is (for any specific application) what is the off-chip
cache and memory bus utilization for a single i-stream ... and whether
or not there is sufficient excess capacity for the intended
application.

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

MTS & LLMPS?

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: MTS & LLMPS?
Date: Tue, 12 Oct 1993 18:05:09 GMT

i'm somewhat vague on the details but the virtual address space I
believe was 256k bytes (64 4k pages) and the real machine was 256k
bytes (64 4k pages). Somewhere around the storage-protect key array
(standard on 360) they implemented the translation address stuff (i.e.
each real page had a hardware entry giving the virtual address for
that page). Task switch required reloading the translation address
array. This was custom implemented on a 360/40 (only one that
existed). Some of the people that had been involved in ctss worked on
cp/40 and cms.

The 360/44 was a somewhat standard IBM product (360/40) that had extra
hardware for numerical intensive workloads.

Current day virtual address translation involves table look aside
buffer (effectively cache of virtual->real address translations) ...
also typically because of high level of multiprogramming ... TLB
entries also tend to have some sort of "ownership" information (i.e.
translation for multiple, different address spaces can co-exist in the
TLB simultaneously).

The 360/67 had an 8-way fully associative "look-aside" buffer (i.e.
lookup was simultaneously executed against all 8 entries), w/o any
ownership identification ... i.e. virtual address space switch
required that all eight entries were always flushed. '67 also had a
virtual address mode bit that selected between 24-bit virtual address
and 32-bit (not 31-bit) virtual address modes.
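
The behavior of a small look-aside buffer with no ownership
information can be sketched in Python. The class name and the
first-in-first-out replacement are assumptions for the example; the
sequential loop stands in for what the hardware did in parallel.

```python
class LookAside:
    """8-entry fully associative buffer with no address-space tag."""

    SIZE = 8

    def __init__(self):
        self.entries = []            # (virtual_page, real_page), oldest first

    def lookup(self, vpage):
        for v, r in self.entries:    # hardware compares all 8 at once
            if v == vpage:
                return r
        return None                  # miss: go walk the tables

    def load(self, vpage, rpage):
        if len(self.entries) == self.SIZE:
            self.entries.pop(0)      # assumed FIFO replacement
        self.entries.append((vpage, rpage))

    def switch_address_space(self):
        # entries carry no owner, so none of them can be trusted any more
        self.entries.clear()

tlb = LookAside()
for vpage in range(9):               # one more than the buffer holds
    tlb.load(vpage, 100 + vpage)
```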

when the '67 came along, the csc group ported cp/40 to the '67. btw,
csc was located on the 4th floor of 545 tech. sq (project mac & i
believe the ge645 were a couple floors up)

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

MTS & LLMPS?

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom4.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: MTS & LLMPS?
Date: Wed, 13 Oct 1993 18:20:58 GMT

32-bit was standard on all '67s ... in fact i believe that some number
of customers with memory-mapped '67 applications migrated to multics
when the 360/67 was discontinued (and the 370 line only had 24-bit
addressing). I ran across a copy of one of the early 360 documents
which described the model "60" and the model "62" ... which became the
model 65 and the model 67. I've never seen any reference to model 66
tho. The model 62 was described as coming in 1 cpu, 2-way smp, and
4-way smp models ... '67 only had 1 & 2 cpu models (other than the
custom triplex for the manned orbital lab). Charlie Salisbury worked
on the triplex at lockheed and then later on cp/67 duplex support.
Charlie is also responsible for the C&S instruction (actually the C&S
mnemonic was chosen because they are his initials and then words were
generated that went with the letters).

LL had defined the search list instruction as a RPQ to the '67 ...  and
modified CP/67 to use it for various types of queue searches
(especially free storage allocation). cp/67 had a SLT
simulator for '67s w/o the RPQ. It eventually came into disuse when
other types of paradigms were developed for managing the task (i.e.
push/pop free-storage subpool allocation operated in 12-14
instructions for 90+ percent of free-storage requests, beat the SLT
instruction which still had to do the memory accesses).
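
The push/pop subpool idea can be sketched in Python as per-size free
lists sitting in front of a slower general allocator. The class name,
the subpool sizes and the use of bytearray as the "general allocator"
are assumptions for the example, not the cp/67 code.

```python
class SubpoolAllocator:
    """Per-size free lists in front of a slower general allocator."""

    def __init__(self, sizes, general_alloc):
        self.sizes = sorted(sizes)
        self.free = {size: [] for size in self.sizes}
        self.general_alloc = general_alloc   # hypothetical slow path
        self.fast_hits = 0

    def _subpool_for(self, nbytes):
        for size in self.sizes:
            if nbytes <= size:
                return size
        return None                          # too big for any subpool

    def get(self, nbytes):
        size = self._subpool_for(nbytes)
        if size is not None and self.free[size]:
            self.fast_hits += 1
            return self.free[size].pop()     # pop: the common, short path
        return self.general_alloc(size if size is not None else nbytes)

    def put(self, block, nbytes):
        size = self._subpool_for(nbytes)
        if size is not None:
            self.free[size].append(block)    # push back for quick reuse
        # blocks larger than any subpool would go back to the general
        # allocator; that path is left out of this sketch

pool = SubpoolAllocator([32, 64, 128], general_alloc=bytearray)
a = pool.get(40)        # miss: comes from the general allocator
pool.put(a, 40)
b = pool.get(50)        # hit: the same 64-byte block popped from its subpool
```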

------

you mean in the hey day of protests ... and somebody called the boston
office of the FBI to say that there was a bomb planted in the offices
of a certain government intelligence agency ... and then went
roaming to see what building got evacuated? also ... while it wasn't
on the office doors ... in the telephone room for that floor, the
telephone company had written the agency initials on the board next to
their punch-down blocks.

------

there was also a "bug" in the translation hardware in the '67 ... that
as far as I know never got fixed. Whenever the address space control
register was loaded, the process invalidated all the entries in
the translation look-aside buffer. However, there was an interesting
bug in the page-fault interrupt hardware ... that zero'ed all
entries in the associative array ... but forgot to invalidate
the entries. The problem only showed up if attempting to execute
in relocation mode ... after a page fault ... without having done
a control register reload in the interim.

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

crabby, stu, initials, etc

From: lynn@netcom6.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: crabby, stu, initials, etc
Date: Thu, 14 Oct 1993 19:12:26 GMT

stu also had a lot of "MAD" macros ... which were a lot harder to
"fix-up" than just changing comments. Some of them were not only used
by system code ... but also (user) application code.

they cleaned out all the initials ... i had identified a lot of
stuff (like all the tty/ascii support) with the initials of the
university i was working at.

last time i ran across crabby he was in atlanta & had something to
do with as/400 application code.

Log Structured filesystems -- think twice

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom4.com (Lynn Wheeler)
Newsgroups: comp.arch.storage
Subject: Re: Log Structured filesystems -- think twice
Date: Fri, 15 Oct 1993 18:57:45 GMT

I built a filesystem in '86 that was similar in function to many of
the LFS stuff w/o the cleaner function. This was done originally
to try and keep high-speed links fed.

I had a project in early to mid 80s that implemented an
internet-backbone like pilot running satellite tdma Ku-band T2 links.
bandwidth was re-allocatable on superblock boundary (round-trip about
800mills) and the stations supported frequency hopping (designed for
reed-solomon & viterbi fec & a variation on double-key DES ... which
gave us some problems in one or two quarters). One of the early
problems I had to do was to enhance the protocol to do rate-based
pacing ...  since windowing was pretty ineffective ...  especially
taking into account bursty traffic and the round-trip propagation
delay (this was single-hop ... some of the double-hop systems are
twice as bad).

Anyway, after that ... one of the main problems were filesystems being
able to sustain T2 thruput to keep the links fed (sustained,
bidirectional T2 can be harder than it looks).

Around 86, I finally redid a filesystem for some of the attached nodes
that had a lot of performance enhancements (i.e.  bit-maps, late binding,
contiguous writes, contiguous allocation, locality of data, indirects
and metadata). Besides the thruput functional issues there are some
integrity issues.

For recoverable/consistent, the metadata has to be carefully written
and/or logged.  For "transaction" operations there are a whole class
of failure modes associated with atomic transactions involving multi-record
writes. In the implementation I did in '86, I used careful update with
some shadowing ... however the logical (metadata) records were >1
physical record.  I used 8-byte log sequence number at the start of
the logical record and at the end of the logical record. If (when
reading) the two values weren't the same ... there was an incomplete
write (and therefore the record was inconsistent). The controllers in
this case didn't support out of order writes ... or in-order writes
would have to have been forced for meta-data writes.
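
The begin/end sequence number check can be sketched in Python. The
physical record size, the field layout and the function names are
assumptions for the example. As noted above, the check only works if
the physical records of a logical record are written in order.

```python
import struct

RECORD = 512            # assumed physical record size in bytes

def pack_logical(lsn, payload, nrecords):
    """Build a logical record with the same 8-byte LSN at both ends."""
    body_len = nrecords * RECORD - 16
    if len(payload) > body_len:
        raise ValueError("payload too large for logical record")
    stamp = struct.pack(">Q", lsn)
    return stamp + payload.ljust(body_len, b"\0") + stamp

def is_complete(image):
    """A logical record is consistent only if both stamps agree."""
    return image[:8] == image[-8:]

old = pack_logical(1, b"old metadata", nrecords=4)
new = pack_logical(2, b"new metadata", nrecords=4)

# power fails after two of the four physical records reached the disk
torn = new[:2 * RECORD] + old[2 * RECORD:]
```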

RAID-5 typically has a similar inconsistency where both a single
record write and the parity record update have to be performed as a
single atomic operation. One solution is for the controller to have
battery backed ram (possibly duplexed) that logs pending parity
updates and forces completion on recovery after things like power
failures. This gets a little trickier in a no-single-point-of-failure
design ... if power failure occurs and controller with the log doesn't
come back up ... the other controller needs access to the ram log(s)
in order to force consistency.

However, the only piece of all of this that made it into any product
was the rfc 1044 support.

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

Log Structured filesystems -- think twice

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom6.com (Lynn Wheeler)
Newsgroups: comp.arch.storage
Subject: Re: Log Structured filesystems -- think twice
Date: Sun, 17 Oct 1993 21:42:02 GMT

comparisons ... different case than earlier one I posted, it was '74
(instead of '86) ... base system ran in virtual memory ... filesystem
ran thru "real I/O" but "real" actually involved v->r translation and
indirection. Base Filesystem had bit-map but didn't take advantage of
searching for contiguous (although sometimes it worked out). Logical
I/O was (effectively) synchronous ... and could be done direct to/from
target application address (unless logical request was order of magnitude
difference between an index search I/O operation that lasted 250millis
and a simple block read/write I/O that took 16mills. Also, the
performance statistics were processor complex specific and the
serialization was occurring across several processor complexes all
sharing the same disk (and the same program library). I (effectively)
had to manually aggregate the individual processor complex stats into
a single "system" summary.

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

/etc/sendmail.cf UUCP routing

From: lynn@netcom4.netcom.com (Lynn Wheeler)
Newsgroups: comp.unix.aix
Subject: Re: /etc/sendmail.cf UUCP routing
Date: Fri, 24 Dec 1993 23:49:18 GMT

i've had the same/similar problem for going on a year. I've reworked
sendmail.cf to send out all mail over uucp ... even tho some may be
tcp/ip. problem i'm left with is that i code $f as valid "from" tcp/ip
domain name ... even tho the immediate outgoing link is
uucp. Everything works fine ... goes out correctly, arrives correctly,
etc ... EXCEPT I get a gratuitous error mail saying that the mail
wasn't sent (even tho it was) because of "no bang" (i.e. "!") in the
from address.

I got the latest o'reilly sendmail book and it claims that
sendmail.8.4.1 uses different rules for the wrapper and the header
(i.e. in theory, in my case, sendmail is complaining about the from
address in the wrapper being w/o a bang). However, the list of
standard macros for sendmail.8.4.1 only shows $f ... so I don't see
how to encode one kind of from address in the wrapper and a different
address in the header (which i believe would get rid of the error,
i.e.  if I had a bang-from address in the uucp wrapper ... while
maintaining a DNS-from address in the header).

... i.e. the point of this is that not only do i want outgoing tcp/ip
mail to go out with the standard dns domain addressing ... but to also
have my from address appear in standard domain format.

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

Big I/O or Kicking the Mainframe out the Door

Refed: **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **, - **
From: lynn@netcom.netcom.com (Lynn Wheeler)
Newsgroups: comp.unix.large,comp.arch.storage
Subject: Re: Big I/O or Kicking the Mainframe out the Door
Date: Wed, 29 Dec 1993 18:31:42 GMT

The following is from a presentation made at the fall '86 SEAS meeting
based on a report done in '83 (based on some work from the late '70s
but with tables updated to reflect 1983 mainframe hardware
configuration ... i.e. the presentation was made 7-8 years after the
work). It compares two "mainframes" (used for same/similar multi-user
timesharing environment), one circa 1968-1970 and one circa 1983. The
"Current Performance" heading refers to '1983'.

The term "drums" refers to mainframe fixed-head disks (i.e. one disk
r/w head per track). The value for drums is the total number of mbytes
available in a typical configuration.

"Pageable pages" refers to the number of 4kbyte pages available for
paging (total real memory minus fixed kernel requirements).

The "user" level corresponds to a reasonably performing time-sharing
service providing 90th percentile sub-second response for interactive
requests.

Page I/O refers to the number of 4kbyte page transfers per second.

User I/O refers to the number of distinct user file I/O operations per
second (in the '67 system a single user file I/O operation ranged from
800bytes to 64kbytes transferred, avg.  around 4k-8k; in the 3081K
system a single user file I/O operation would range from 4kbytes to
64kbytes transferred).

The transfer rate is the standard file I/O disk transfer. On the '67,
"drums" had a transfer rate of 1.5mbyte/second and disks had a
transfer rate of 330kbyte/second. On the 3081K, both drums and disks
had a transfer rate of 3mbyte/second.
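
As a rough illustration of what these transfer rates mean for a single
file I/O, the sketch below adds a positioning delay to the transfer
time. It borrows the "avg. arm access" figures (60mill and 16mill) from
the comparison table further down in this post. Treating that figure as
the whole positioning delay, taking kbyte/mbyte as powers of ten, and
the helper name io_service_ms are all assumptions made for the example.

```python
def io_service_ms(access_ms, nbytes, rate_bytes_per_sec):
    """Positioning delay plus transfer time for one I/O, in milliseconds."""
    return access_ms + 1000.0 * nbytes / rate_bytes_per_sec

# (avg. arm access in ms, disk transfer rate in bytes/sec), figures from this post
DISKS = {
    "2314 on 360/67": (60.0, 330_000),
    "3380 on 3081K": (16.0, 3_000_000),
}

for name, (access_ms, rate) in DISKS.items():
    for nbytes in (4096, 65536):  # one 4kbyte and one 64kbyte transfer
        total = io_service_ms(access_ms, nbytes, rate)
        print(f"{name}: {nbytes:6d} bytes -> {total:6.1f} ms")
```

Under these assumptions the 3081K configuration gains far more on
large transfers than on small ones, since the access time shrinks by
less than 4* while the transfer rate grows by roughly 10*.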

The 360/67 is a single processor system. The 3081K is a 2-way
symmetrical multiprocessor (each processor rated at 7mips).

The reference to "PAM" is work that I had done in the early '70s on
page-mapped filesystems.

A current day "mainframe" (i.e. '93 instead of '83) would typically
have values 4-10* larger (except for things like "drum" space, avg.
arm access, and transfer rate).

Current Performance.

For a real look at current performance and where the problems may be,
it is helpful to place a current environment side by side with the
3.1L system. The following table shows a CP/67 3.1L system on a
360/67 with 768K of storage, three 2301 drums and 45 2314 drives
(numbers given are for a 3.1L system w/o any PAM minidisks).  It is
compared with a typical 3081K HPO system with 32megs of storage, six
2305 drums, and 32 3380 actuators assumed to be running a workload
with similar execution characteristics.

system          3.1L            HPO     change
machine         360/67          3081K

mips            .3              14       47*
pageable pages  105             7000     66*
users           80              320      4*
channels        6               24       4*
drums           12meg           72meg    6*
page I/O        150             600      4*
user I/O        100             300      3*
disk arms       45              32       4*?perform.
bytes/arm       29meg           630meg   23*
avg. arm access 60mill          16mill   3.7*
transfer rate   .3meg           3meg     10*
total data      1.2gig          20.1gig  18*

Comparison of 3.1L 67 and HPO 3081k

If we compare the resources that are traditionally considered critical
- CPU and memory - we see an increase of 45 to 65 times between the
67 and the 3081.  However we only see an increase by roughly a
factor of four in the number of users supported.  Even at that
we see performance problems in supporting that many users.  There
appears to be ten times as much resource per user on the 3081 compared
to the 67, however there are still performance problems.  Why?  An
even more interesting comparison is to show the same information as a
function of raw MIPS as given in following table.

system          3.1L    HPO     change
machine         360/67  3081K
mips            .3      14

pageable pages  350     500      1.4*
users           266     22.8     .086*
channels        20      1.5      .075*
drums           40meg   5meg     .125*
page I/O        500     43       .086*
user I/O        333     21       .063*
disk arms       150     1.1      .0068*

67/3081 Comparison by resource per MIPS

system          3.1L    HPO     change
machine         360/67  3081K
users           80      320

mips            .0038   .044     11.58*
pageable pages  1.3     21.88    16.8*
channels        .075    .075     1*
drums           .15meg  .225meg  1.5*
page I/O        1.88    1.87     1*
user I/O        1.25    0.94     .75*
disk arms       .56     0.1      .18*

67/3081 Comparison by resource per User
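
The per-MIPS and per-user tables are the first table divided through by
one of its own rows. A minimal sketch of that normalization follows;
the dictionary layout and the function name normalize are assumptions
for illustration. It divides the raw figures directly, so a few results
will not match the printed entries exactly (the report's figures look
rounded, and a couple of the per-MIPS values may have been derived
differently).

```python
# Raw figures from the first table: (3.1L on 360/67, HPO on 3081K)
RAW = {
    "mips":           (0.3, 14),
    "pageable pages": (105, 7000),
    "users":          (80, 320),
    "channels":       (6, 24),
    "drums (meg)":    (12, 72),
    "page I/O":       (150, 600),
    "user I/O":       (100, 300),
    "disk arms":      (45, 32),
}

def normalize(raw, by):
    """Divide every resource by the `by` resource, separately for each system."""
    old_base, new_base = raw[by]
    out = {}
    for name, (old, new) in raw.items():
        if name == by:
            continue
        old_n, new_n = old / old_base, new / new_base
        out[name] = (old_n, new_n, new_n / old_n)  # last field is the change factor
    return out

for base in ("mips", "users"):
    print(f"per {base}")
    for name, (old_n, new_n, change) in normalize(RAW, base).items():
        print(f"  {name:15s} {old_n:10.4g} {new_n:10.4g} {change:8.3g}*")
```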

Major problems can easily be seen in the data concerning I/O rates:
system hardware has changed from being CPU and real storage
constrained to being I/O constrained.  The page I/O capacity (distinct
from the page capacity, i.e., real storage) has increased by a factor
of four.  In addition, the user I/O capacity in terms of accesses per
second has increased by a factor of four to eight.  However, the rest
of the hardware in the system (CPU, real storage) has increased by
factors of 40 to 60.

The third table contains the same resource information on a per user
basis.  A plausible hypothesis is that the current factor limiting the
average number of users supported is I/O capacity.  Variations in the
number of users on a system to system basis would then be explained as
a function of

•       system I/O capacity
•       average I/O requirements per user on that system
•       ratio of disk arms to MPL and probability of
        disk arm thrashing

During the late '60s & early '70s I had done work on dynamic adaptive
feedback scheduling (doing "fair-share" and something that I
characterized as "scheduling to the bottleneck" resource scheduling),
page replacement algorithms, pathlength I/O work and other forms of
optimization. The 3.1L system could:

	take a page fault
	select near optimal page replacement
	perform page I/O (both read and any writes)
	near optimal scheduling task switch
	take page I/O interrupt (completion signal)
	update paging tables
	task switch back to original process

in an average kernel pathlength of 500 instructions (or less, i.e.
<1.5mills of '67 processing time). The '67 paging activity averaged
2/3rds reads/faults and 1/3rd writes (i.e. 150 page I/Os per second
represented approximately 100 reads/faults per second, 50
writes/second, 200 task-switches/second and the total "overhead"
involved was on the order of 150mills processor time).
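
The overhead arithmetic above can be sketched as follows, using only
numbers quoted in this post. The sketch takes the 500-instruction upper
bound and the .3 mips rating, and assumes two task switches per fault
(away from the faulting task and back again). At exactly 500
instructions the cost per fault comes out a little above 1.5 mills, so
the total lands on the same order as the 150 mills quoted rather than
matching it exactly.

```python
MIPS_67 = 0.3            # 360/67 processor rating from the table
PATHLENGTH = 500         # kernel instructions per fault (the quoted upper bound)
PAGE_IO_PER_SEC = 150    # page I/O rate from the table

def fault_cost_ms(pathlength, mips):
    """Processor time for one page fault, in milliseconds."""
    return pathlength / (mips * 1_000_000) * 1000.0

reads_per_sec = PAGE_IO_PER_SEC * 2 / 3      # 2/3rds of page I/O are reads/faults
writes_per_sec = PAGE_IO_PER_SEC / 3         # 1/3rd are writes
task_switches_per_sec = 2 * reads_per_sec    # assumption: two switches per fault
per_fault_ms = fault_cost_ms(PATHLENGTH, MIPS_67)
overhead_ms_per_sec = reads_per_sec * per_fault_ms

print(f"faults/sec:        {reads_per_sec:.0f}")
print(f"writes/sec:        {writes_per_sec:.0f}")
print(f"task switches/sec: {task_switches_per_sec:.0f}")
print(f"cost per fault:    {per_fault_ms:.2f} ms")
print(f"paging overhead:   {overhead_ms_per_sec:.0f} ms of processor time per second")
```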

The investigation done in the late '70s was to show that the
"scheduling to the bottleneck" algorithm from the late '60s that
dynamically adapted the scheduling algorithm based on cpu, page I/O,
and real storage resource consumption was inadequate (and frequently
ineffective) since the primary constrained resource had become user
file I/O.

One big difference between the mainframe system and traditional Unix
systems is that the mainframe default/standard I/O paradigm operates
directly between user space and physical I/O interface ...  while the
Unix paradigm frequently involves kernel calls for doing buffer moves.
The "buffer move" constraint/bottleneck was highlighted in the
gigabyte router talk by Partridge at an '89 IETF meeting (as well as
RAID/striping papers).

Compared to a typical Unix system, most mainframe systems tend to be
configured with a significantly larger number of disk arms/drives
(especially when calculated in terms of arms/MIP).  The
latency/performance advantage that mainframes had with expensive
fixed-head disks is now pretty much a level playing field with the use
of large electronic stores(/caches).

Other work done during the '80s concentrated on optimizing the file
I/O opportunity:

*       being able to profile various data clusters in terms of
	accesses/second/mbyte and attempting to load-balance
	the clusters across the available disk arms (assumes
	some sort of uniform access patterns, frequently tho
	access patterns are bursty rather than uniform)

*       file organization for large block transfer (taking into
	account that transfer rate has increased much more
	significantly than access rate ... also more applicable
	to bursty access patterns).

*      	more caching of high-use data (tends to minimize the
        advantage of data cluster load balancing).
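
The first item, spreading profiled data clusters across disk arms, can
be illustrated with a simple greedy heuristic: take clusters from
busiest to least busy and put each one on the arm with the lowest load
so far. This is a swapped-in textbook technique (longest-processing-time
first), not necessarily what was actually done in the '80s work. The
cluster names and rates are hypothetical, and the sketch balances only
accesses/second, ignoring capacity (mbytes) and burstiness.

```python
import heapq

def balance_clusters(cluster_rates, n_arms):
    """Greedy placement: busiest cluster first, onto the least-loaded arm."""
    heap = [(0.0, arm) for arm in range(n_arms)]   # (accesses/sec so far, arm index)
    heapq.heapify(heap)
    placement = {arm: [] for arm in range(n_arms)}
    ordered = sorted(cluster_rates.items(), key=lambda kv: kv[1], reverse=True)
    for name, rate in ordered:
        load, arm = heapq.heappop(heap)            # arm with the lowest load
        placement[arm].append(name)
        heapq.heappush(heap, (load + rate, arm))
    loads = {arm: load for load, arm in heap}
    return placement, loads

# Hypothetical profile: cluster name -> measured accesses/second
profile = {"A": 30.0, "B": 25.0, "C": 20.0, "D": 10.0, "E": 10.0, "F": 5.0}
placement, loads = balance_clusters(profile, n_arms=3)
for arm in sorted(placement):
    print(f"arm {arm}: {placement[arm]} -> {loads[arm]:.0f} accesses/sec")
```

A static placement like this only helps to the extent that the measured
rates stay representative, which is the caveat about bursty access
patterns noted in the list.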
--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us

Big I/O or Kicking the Mainframe out the Door

From: lynn@netcom4.netcom.com (Lynn Wheeler)
Newsgroups: comp.unix.large,comp.arch.storage
Subject: Re: Big I/O or Kicking the Mainframe out the Door
Date: Wed, 29 Dec 1993 20:11:30 GMT

... oops finger check ... '89 "gigabyte router" reference should have
been "gigabit router" ... if i remember correctly, it assumed 50mip
processor, smart outboard I/O controllers, at least three one gigabit
links, no buffer copies, 128(?) total instruction pathlength per
packet (in & out), bimodal distribution of packet sizes (64 & 1500)
with an avg. packet size around 512 (250k packets/sec @128 .. 32m
instructions/sec).
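
The instruction budget above works out as in the sketch below, using
the figures as recalled in this post (several of which are marked as
uncertain). Reading the packet sizes as bytes is an assumption, as is
the comparison against a single one gigabit link. The variable names
are made up for the example.

```python
PACKETS_PER_SEC = 250_000
INSTR_PER_PACKET = 128        # total pathlength per packet, in & out
CPU_MIPS = 50
SMALL, LARGE, AVG = 64, 1500, 512   # packet sizes, assumed to be bytes

# Instructions needed per second, and the share of a 50mip processor that is
instr_per_sec = PACKETS_PER_SEC * INSTR_PER_PACKET
cpu_share = instr_per_sec / (CPU_MIPS * 1_000_000)

# Fraction of large packets that gives the stated average in a two-size mix
large_fraction = (AVG - SMALL) / (LARGE - SMALL)

# Packets/sec that one gigabit link carries at the average size
link_pps = 1_000_000_000 / (AVG * 8)

print(f"instructions/sec:        {instr_per_sec:,}")
print(f"share of 50mip cpu:      {cpu_share:.0%}")
print(f"fraction of 1500s:       {large_fraction:.3f}")
print(f"packets/sec per gigabit: {link_pps:,.0f}")
```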

note that instruction based buffer copies tend to have a very adverse
effect on processor thruput ... since both the "input" buffer and
"output" buffer locations tend to all be cache misses ... as
well as "flushing" useful entries out of the cache.

--
Lynn Wheeler                |  lynn@netcom.com, lhw@well.sf.ca.us