List of Archived Posts
1994 Newsgroup Postings
- Big I/O or Kicking the Mainframe out the Door
- Big I/O or Kicking the Mainframe out the Door
- Register to Memory Swap
- Multitasking question
- Multitasking question
- Schedulers
- Schedulers
- Schedulers
- Schedulers
- link indexes first
- IBM 7090 (360s, 370s, apl, etc)
- scheduling & dynamic adaptive ... long posting warning
- talk to your I/O cache
- lru, clock, random & dynamic adaptive
- REXX
- 360 "OS" & "TSS" assemblers
- talk to your I/O cache
- lru, clock, random & dynamic adaptive ... addenda
- cp disk story
- Dual-ported disks?
- Dual-ported disks?
- CP/67 & OS MFT14
- Dual-ported disks?
- CP/67 & OS MFT14
- 370 ECPS VM microcode assist
- CP spooling & programming technology
- CP spooling & programming technology
- CP spooling & programming technology
- CP spooling & programming technology
- Misc. more on bidirectional links
- 370 ECPS VM microcode assist
- 370 ECPS VM microcode assist
- CP spooling & programming technology
- CP spooling & programming technology
- High Speed Data Transport (HSDT)
- painting computers
- short CICS story
- High Speed Data Transport (HSDT)
- High Speed Data Transport (HSDT)
- Failover and MAC addresses (was: Re: Dual-p
- mainframe CKD disks & PDS files (looong... warning)
- Failover and MAC addresses (was: Re: Dual-p
- SIE instruction (S/390)
- IBM 370/195
- IBM 370/195
- process sleeping?
- baddest workstation
- bloat
- Bloat, elegance, simplicity and other irrelevant concepts
- bloat
- SMP, Spin Locks and Serialized Access
- Rethinking Virtual Memory
- Rethinking Virtual Memory
- Rethinking Virtual Memory
- Rethinking Virtual Memory
- Rethinking Virtual Memory
- Rethinking Virtual Memory
- Measuring Virtual Memory
- How Do the Old Mainframes
- How Do the Old Mainframes
- How Do the Old Mainframes Compare to Today's Micros?
Newsgroups: comp.unix.large,comp.arch.storage
From: lynn@netcom4.netcom.com (Lynn Wheeler)
Subject: Re: Big I/O or Kicking the Mainframe out the Door
Date: Wed, 5 Jan 1994 16:31:59 GMT
typical unix has two issues:
synchronous I/O
&
buffer copies
standard system strategies for read-ahead and write-behind somewhat
compensate for synchronous I/O (although it contributes to a level of
unpredictability associated with various failure modes).
memory mapped strategies aren't a complete panacea. Sequential access
with mmap'ed files (i.e. page-faults) can actually be worse than freads
unless the kernel also has a mmap'ed "read-ahead" strategy (similar to
buffered I/O read-ahead).
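As a rough present-day illustration of the sequential-access case, the sketch below reads the same file once through ordinary buffered reads and once through a read-only mapping, passing a sequential-access hint where the platform offers one. It uses Python's standard mmap module; the helper names, the chunk size, and the throwaway sample file are assumptions made for the example, and nothing here says which path is faster on any particular kernel.

```python
import mmap
import os
import tempfile

CHUNK = 64 * 1024  # illustrative transfer size, not taken from the post


def checksum_buffered(path):
    """Sum every byte using ordinary buffered reads (fread-style)."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            total += sum(buf)
    return total


def checksum_mapped(path):
    """Sum every byte through a read-only mapping, touching pages in order."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Hint that access is sequential so the kernel may read ahead.
            # madvise and MADV_SEQUENTIAL exist only on some platforms.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            for off in range(0, len(m), CHUNK):
                total += sum(m[off:off + CHUNK])
    return total


def make_sample_file(size=1 << 20):
    """Create a throwaway file of random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path
```

Timing the two functions on a file larger than memory, with and without the hint, is one way to see whether a given kernel does mmap'ed read-ahead.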
memory-mapped I/O can actually come in several flavors. In the early
'70s I did a variation on memory-mapped I/O integrated in with a
mainframe filesystem (which was never shipped as a product, but I
maintained at various installations over the next 10-15 years).
Repeat-Note: the numbers/examples posted previously were for systems
w/o the mmap'ed filesystem.
advantages of the mmap'ed filesystem modification (even for a mainframe
system that already had an I/O schema supporting asynchronous I/O
AND direct I/O ... noncopy ... transfer):
1) eliminated certain duplicated kernel function activity (in file
I/O) ... net effect was a cut of 60-90% in kernel file I/O pathlength
2) mmap'ed interface allowed specification of mapping an arbitrary
number of disk blocks to an arbitrary span of virtual pages ... along
with advisory flags indicating Synchronous, Asynchronous, Deferred
• Deferred - effectively equivalent to performing mmap function
w/o any data transfer (left for the application to page
fault the data)
• Synchronous (on reads) - data will be transferred before
returning to the application
• Asynchronous (on reads) - schedule data transfer but return
immediately to application
the synchronization of asynchronous activity relied on controlling
whether the application had access to the virtual page. It was even
possible to take a multi-buffered, asynchronous I/O application and
simply have the file I/O subsystem replace the standard I/O kernel
calls with mmap kernel calls (assuming buffers are page-aligned). Thus
it was possible to emulate an existing file I/O, direct transfer,
multi-buffer, asynchronous paradigm (with the interface) as well as
the more common full-file mmap'ing.
3) large mainframe (multi-block) I/Os were scheduled in virtual
address order. The page I/O subsystem had the capability for
out-of-order transfer which reduced the latency ... effectively
similar to out-of-order transfer supported by some of the
sophisticated caching controllers (i.e. controller can begin transfer
at current head position ... rather than waiting for some specific
"starting" point).
4) direct mapped (mainframe) I/O reads tended to leave the virtual
pages dirty and direct mapped I/O writes did nothing for the
page dirty status. Running the I/O thru the paging system resulted in
not having the dirty bit on (for reads) and cleared the dirty bit (on
writes). The result was a much lower ratio of dirty pages (which would
have to be written if selected for replacement).
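The three advisory flags in item 2 can be mimicked with a toy model. This is only a sketch under stated assumptions: the class, the `read_block` callback, and the use of a thread plus one event per page to stand in for "the application has access to the virtual page" are all invented for illustration and are not the interface described above.

```python
import threading
from enum import Enum


class Mode(Enum):
    DEFERRED = "deferred"
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


class ToyMapping:
    """Toy model: maps a list of disk block ids onto page slots."""

    def __init__(self, blocks, mode, read_block):
        self._blocks = list(blocks)
        self._mode = mode
        self._read_block = read_block      # hypothetical block reader
        self._data = {}
        self._lock = threading.Lock()
        # One event per page stands in for "application may access page".
        self._ready = [threading.Event() for _ in self._blocks]
        if mode is Mode.SYNCHRONOUS:
            self._load_all()               # transfer before returning
        elif mode is Mode.ASYNCHRONOUS:
            # Schedule the transfer and return immediately.
            threading.Thread(target=self._load_all, daemon=True).start()
        # DEFERRED: map only; data moves when a page is first touched.

    def _load(self, i):
        with self._lock:
            if i not in self._data:
                self._data[i] = self._read_block(self._blocks[i])
        self._ready[i].set()

    def _load_all(self):
        for i in range(len(self._blocks)):
            self._load(i)

    def page(self, i):
        if self._mode is Mode.DEFERRED:
            self._load(i)                  # stands in for a page fault
        self._ready[i].wait()              # access gated until data arrives
        return self._data[i]
```

In the asynchronous case the caller gets control back at once and only stalls if it touches a page whose transfer has not finished, which is the gating idea described above.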
In the previous posting giving 1970/1983 comparison, the '83 system
had extensive (but pretty non-intrusive) instrumentation of I/O
activities. All I/O requests (including page I/O) were time-stamped to
measure both queueing time and service time. The queueing & service
times were accumulated (by category) for individual processes,
individual disks, and total overall system. Other instrumentation made
it possible to calculate (with resolution of several microseconds) for
each process: total elapsed active time, total cpu service time,
total cpu queueing time, total page I/O queueing time, total page I/O
service time, total file I/O queueing time, total file I/O service
time, and total "blocked" time (waiting for some service). For the '83
3081 system example, the file I/O queueing+service times ran
significantly higher than the page I/O queueing+service times.
As an aside, a measure of contention (i.e. scheduling to the
bottleneck) is not the measure of the use of a resource, but the time
spent queued/waiting for the resource to be available (i.e. contention
is queueing time ... not service time or queue+service time).
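A minimal sketch of that distinction, assuming each request carries three time stamps (queued, service started, service finished). The record layout and function names are made up for the example; the instrumentation described above is not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    resource: str       # e.g. a disk or an I/O category
    queued_at: float    # time the request was queued
    started_at: float   # time service began
    finished_at: float  # time service completed


def accumulate(requests):
    """Return {resource: (total queueing time, total service time)}."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for r in requests:
        totals[r.resource][0] += r.started_at - r.queued_at
        totals[r.resource][1] += r.finished_at - r.started_at
    return {k: tuple(v) for k, v in totals.items()}


def most_contended(requests):
    """Pick the resource with the largest queueing time, not the busiest."""
    totals = accumulate(requests)
    return max(totals, key=lambda k: totals[k][0])
```

With sample requests where one resource is busy for longer but another has requests waiting longer, `most_contended` picks the second one.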
In any case, for a moderately heavy file I/O workload (circa '83), the
mmap'ed changes could reduce the measured file I/O queueing time by a
factor of three (as well as some other measurable performance
improvements).
As a separate aside, I briefly saw a posting referencing caching
strategies and "Mattson". I didn't catch the context. I've run across
hardcopy of '78 kernel code that was installed on several large
mainframe systems to capture the live-load I/O activity data ... & was
the real-time feed into Mattson's caching model. I'm not sure the
posting was referencing the paper written on those results ... or was
looking for some other reference.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From lynn@netcom4.netcom.com Tue Oct 19 07:45:23 1999
Newsgroups: comp.unix.large,comp.arch.storage
From: lynn@netcom4.netcom.com (Lynn Wheeler)
Subject: Re: Big I/O or Kicking the Mainframe out the Door
Date: Wed, 5 Jan 1994 16:34:26 GMT
One of the potential pitfalls of simple mmap implementations for full
I/O mapping is the "cleaning" of pages once they have been used. In a
buffer'ed paradigm, the application explicitly declares that it is done
with some data by requesting it to be overlayed. This has the net
effect of "reducing" the application's working-set size (as opposed to
a straight file mmap'ing and waiting for the system to discover the
pages are no longer in use).
I had done a fair bit with various variations of clock-like global LRU
replacement algorithms (and working set controls) during the late
'60s. During the early '80s there was detailed modeling work comparing
them with both "true/exact" LRU replacement and "optimal"
replacement. The global clock replacement algorithms all tended to
operate within <15% of the performance of true/exact LRU-replacement
... including getting worse when LRU got worse.
About that time, I stumbled on an interesting twist to clock global LRU (I
was originally trying to find a way of improving cache performance and
reduce inter-processor contention in a SMP environment).
There are times when LRU-replacement appears to be "chasing its tail",
i.e. it is choosing for replacement the exact virtual page that would
be needed next. In such scenarios, it would be very desirable to
switch to a near MRU (most recently used) replacement algorithm.
The net result of the algorithm variation was that it tended to
operate as a LRU-replacement algorithm in the part of the envelope
where LRU did well ... but would effectively switch to a random
replacement algorithm in parts of the envelope where LRU was doing
poorly (i.e. random was significantly better than LRU).
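The "chasing its tail" case is easy to reproduce in a small simulation. The sketch below is not the algorithm variation described above (its details are not given here); it only compares exact LRU with plain random replacement on a reference string that loops over one more page than there are frames. The function names, frame count, and seed are arbitrary choices for the example.

```python
import random
from collections import OrderedDict


def lru_faults(refs, nframes):
    """Count page faults under exact LRU replacement."""
    resident = OrderedDict()
    faults = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)        # now most recently used
        else:
            faults += 1
            if len(resident) >= nframes:
                resident.popitem(last=False)  # evict least recently used
            resident[page] = True
    return faults


def random_faults(refs, nframes, seed=0):
    """Count page faults when the victim is chosen at random."""
    rng = random.Random(seed)
    resident = []
    faults = 0
    for page in refs:
        if page not in resident:
            faults += 1
            if len(resident) >= nframes:
                resident.pop(rng.randrange(len(resident)))
            resident.append(page)
    return faults


# A loop over 5 pages with only 4 frames: LRU always evicts the page
# that is needed next, so every reference faults.
refs = [i % 5 for i in range(1000)]
```

On this pattern `lru_faults(refs, 4)` faults on every reference, while `random_faults(refs, 4)` faults on only a fraction of them.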
It was interesting that the implementation had no code to explicitly
recognize the condition and change the behavior ... it was somewhat
the fallout of the way it treated the distribution of reference &
non-reference page patterns. The implementation code is very close
to standard clock global LRU replacement, with equivalent pathlength,
but does have better cache hit profile as well as better SMP
operational characteristics.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom4.netcom.com Tue Oct 19 07:45:24 1999
Newsgroups: comp.arch
From: lynn@netcom4.netcom.com (Lynn Wheeler)
Subject: Re: Register to Memory Swap
Date: Mon, 17 Jan 1994 00:24:25 GMT
Test&set has been around in smp since at least the early/mid-60s (i.e.
test a location for zero/non-zero at the same time setting it to
non-zero).
Atomic update (based on the existing value) is (I'm reasonably sure)
result of CAS's work in 70/71 time-frame at 545 Tech Sq. (which led to
the original mnemonic compare&swap ... i.e. his initials). The use of
atomic compare&swap for multi-threaded, "enabled" applications
(whether SMP or non-SMP) dates from the same period.
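For readers who have not met the two primitives, here is a small emulation of their semantics. Python has no user-level atomic instruction, so a lock stands in for the hardware guarantee; the `Cell` class and `atomic_add` helper are invented for the illustration and say nothing about how any particular machine implements them.

```python
import threading


class Cell:
    """A memory word with emulated atomic test&set and compare&swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        return self._value

    def test_and_set(self):
        """Return the old value and leave the word non-zero."""
        with self._guard:
            old = self._value
            self._value = 1
            return old

    def compare_and_swap(self, expected, new):
        """Store new only if the word still holds expected."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def atomic_add(cell, delta):
    """Update based on the existing value; retry if another thread won."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta):
            return old + delta
```

The retry loop in `atomic_add` is the usual pattern for an update that depends on the existing value: read, compute, and swap only if nobody changed the word in between.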
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Multitasking question
Date: Mon, 7 Mar 1994 06:59:22 GMT
in '68/'69 i completely replaced cp/67 implementation with dynamic
adaptive feedback stuff. The implementation i replaced was for all
intents and purposes the same as described in the 4.3bsd book. my
understanding from the cp/67 developers was that several of them had
come over from ctss/7094. It is possible/plausible that both cp/67 and
unix can trace a common ancestry back to ctss/7094.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: Re: Multitasking question
Date: Wed, 9 Mar 1994 23:50:32 GMT
The initial version (1.0, 1967) of the cp/67 scheduler (that I saw
sometime in 68) was like your CTSS description, I believe 10 queues,
top queue having very short "time-slice" and each lower queue having
progressively larger time-slice. If a process went to t/s-end, it was
moved to the tail of the next lower queue. If a process became blocked
(during execution) it would move to the next higher queue.
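The queue transitions just described can be sketched as follows. The count of 10 queues comes from the description above; the doubling of the time-slice per level, the slice units, and the top-down scan in `pick` are assumptions made for the example, since the actual values and dispatch order are not given.

```python
from collections import deque

NUM_QUEUES = 10   # "I believe 10 queues" in the description above
BASE_SLICE = 1    # assumed units; the real values are not given


def time_slice(level):
    """Assumed doubling per level; the post only says 'progressively larger'."""
    return BASE_SLICE * (2 ** level)


def next_level(level, event):
    """Queue transition: level 0 is the top (shortest slice) queue."""
    if event == "slice_end":
        return min(level + 1, NUM_QUEUES - 1)   # move down one queue
    if event == "blocked":
        return max(level - 1, 0)                # move up one queue
    raise ValueError(event)


class FeedbackQueues:
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_QUEUES)]

    def add(self, proc, level=0):
        self.queues[level].append(proc)         # tail of the queue

    def pick(self):
        """Scan from the top queue down; return (proc, level, slice)."""
        for level, q in enumerate(self.queues):
            if q:
                return q.popleft(), level, time_slice(level)
        return None

    def requeue(self, proc, level, event):
        self.add(proc, next_level(level, event))
```

A process that keeps running to time-slice end drifts toward the bottom queue, while one that blocks repeatedly stays near the top.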
Two biggest problems with it were lack of any page-thrashing controls
and high cpu overhead in the implementation (including non-linear
increase with respect to increasing numbers).
Somebody out at Lincoln Labs (summer '68?) did a 2-level queue
replacement that had a table (table values set proportional to machine
real storage size) that limited the number of "active" processes in
each queue. "queue-slices" were on the order of a second and
"blocking" didn't result in queue transition. There was a daemon that
once a second recalculated queue position based on cpu use & aging
(this bore more similarity to bsd description).
Problem with this implementation was that the page-thrashing control
didn't take into account process/program behavior ... and the cpu
overhead for dispatching/scheduling function was still non-linear
(although significantly reduced from ctss look-alike).
The Dynamic-adaptive changes (late '68, '69):
1) eliminated cycling through processes & therefore any non-linear
& "scaling" problems
2) cpu process overhead per dispatch/schedule (proportional to work
done & not number of users) was reduced to near zero
3) dynamic adaptive page-thrashing controls based on program behavior,
real-storage availability, AND efficiency of paging subsystem (also
originated clock global LRU replacement, existing algorithm was close to
FIFO). Also redid most of the paging code so the pathlength was as
close to zero as possible.
4) execution order was based on an "advisory-deadline" calculation.
the actual calculations took into account a variety of factors,
individual program behavior (like cpu use, paging behavior, etc) as
well as overall system behavior (cpu bottlenecked, paging
bottlenecked), interactive, etc. Future advisory deadlines were
proportional to program characteristics and granularity of cpu
allocation. "Interactive" programs were allocated small cpu
granularities frequently (frequency interval was proportional to cpu
granularity), however being interactive (or non-interactive) didn't
affect aggregate cpu allocation (just granularity with frequency
proportional to granularity size).
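The point in item 4 that granularity affects frequency but not aggregate allocation can be shown with a toy deadline-ordered dispatcher. This is a sketch under simple assumptions, not the actual calculation: each process has a fixed granularity and a fixed share, its next advisory deadline is pushed out by granularity / share, and the earliest deadline runs next. None of the program-behavior or system-bottleneck factors are modeled, and all names and numbers are illustrative.

```python
import heapq


def simulate(procs, total_time):
    """procs maps name -> (cpu granularity, share of the cpu).

    Each dispatch runs a process for one granule and pushes its next
    advisory deadline out by granularity / share, so small granules
    recur often and large granules recur rarely.
    """
    used = {name: 0.0 for name in procs}
    dispatches = {name: 0 for name in procs}
    heap = [(g / share, name) for name, (g, share) in procs.items()]
    heapq.heapify(heap)
    now = 0.0
    while now < total_time:
        deadline, name = heapq.heappop(heap)
        g, share = procs[name]
        now += g
        used[name] += g
        dispatches[name] += 1
        heapq.heappush(heap, (deadline + g / share, name))
    return used, dispatches
```

Running it with an "interactive" process at granularity 1 and a "batch" process at granularity 10, both with share 0.5, gives the two about the same total cpu, with the interactive one dispatched roughly ten times as often.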
In general:
1) near zero pathlengths (& elimination of non-linear scaling)
2) dynamic adaptive ("scheduling to the bottleneck")
3) consistent resource control regardless of factors like
interactive/non-interactive ... such factors could affect
granularity of allocation but not magnitude of allocation
Grenoble Science Center published a paper in CACM sometime in the early
'70s describing their implementation of a "working-set" dispatcher on
CP/67 (effectively a faithful implementation of what is described in
Denning's article). They had a 1mbyte 360/67 (which left 154 4k
"pageable-pages" after fixed kernel requirements). They provided about
the same level of performance for 30 users as we did for 70 users on a
768k 360/67 (104 4k pageable pages).
The differences (circa '70-71):
                 grenoble        cambridge
machine          360/67          360/67
# users          30-35           70-75
real store       1mbyte          768k
p'pages          154 4k          104 4k
replacement      local LRU       "clock" global LRU
thrashing        working-set     dynamic adaptive
priority         cpu aging       dynamic adaptive
There were also big differences in "straightline" pathlength as well as
non-linear scaling pathlength effects.
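A bit of arithmetic on the figures quoted above puts the comparison in per-user terms. The only inputs are the real storage sizes, the pageable page counts, and the lower ends of the user ranges; the function and field names are just for the example.

```python
PAGE_K = 4  # page size in kbytes


def summarize(real_k, pageable_pages, users):
    pageable_k = pageable_pages * PAGE_K
    return {
        "pageable_k": pageable_k,
        "fixed_k": real_k - pageable_k,   # left for fixed kernel use
        "pages_per_user": pageable_pages / users,
    }


grenoble = summarize(real_k=1024, pageable_pages=154, users=30)
cambridge = summarize(real_k=768, pageable_pages=104, users=70)
```

That works out to 616k pageable (408k fixed) and about 5.1 pages per user at grenoble, against 416k pageable (352k fixed) and about 1.5 pages per user at cambridge, i.e. roughly 3.5 times fewer pageable pages per user for about the same level of performance.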
Some of the pathlength stuff I came to regret. I would rearrange a
couple hundred lines of code in half-dozen modules so that the
sequence of events came out implicitly like I wanted them to ...
rather than to have to explicitly write code to make it happen (zero
pathlength implementation). Some of the code made it into the product
and possibly years later got modified (and things stopped working the way
they should ... it is hard to explain about "implicit" workings).
Note that the '67 had a 900 nanosecond cycle time and no cache. Most
instructions were slightly less than 2 machine cycles to over
3-4. Compute bound with no I/O, machine might clock .5-.7
MIPS. However, heavy I/O could result in some severe memory bus
contention (with the instruction unit) cutting MIP rate significantly.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Schedulers
Date: Fri, 11 Mar 1994 10:36:17 GMT
CP/67 & VM/370 Scheduling
... long ago, and far away ... hopefully this doesn't sound too much
like a biography.
As an undergraduate in '68 & '69 I did a lot of cp/67 modifications,
high-performance fastpath, dynamic adaptive dispatching, dynamic
adaptive scheduling & thrashing controls, page replacement algorithm
(clock global LRU), teletype support, and numerous other things. Also was
one of four people responsible for designing/building the first
non-IBM control unit for IBM mainframe. Also did some MFT/MVT hacking,
ripped out the 2780 support from HASP-III and replaced it with 2741 &
TTY support along with an editor that implemented the CMS/CTSS syntax
(early '69). The university I was at also had an IBM 2250 (high
performance graphic display) and I hooked the backend of the CMS
editor to 2250 character library to create a full-screen editor (late
'68).
In '70, I graduated and joined IBM/CSC at tech sq. I was able to get
some number of my changes incorporated into various cp/67 product
releases.
With the advent of the IBM 370, the "product" group split off, did a
somewhat ground-up rewrite for vm/370 and moved out to a bldg. in
Burlington Mall.
When CSC finally got a 370 in 73, I ported a lot of my changes from
cp/67 to vm/370. I had at the time two BU students that were on
work/study program. Much of the scheduling/thrashing descriptions I've
already posted earlier in this thread. On what was called VM/370
"release 2 plc 15" we had done a whole lot of changes including the
following:
• automated operator procedures (done to support some
  sophisticated benchmarking methodology)
• dynamic shared library support (changes to both cp and cms)
• high performance page-mapped file system (both cp & cms)
• dynamic adaptive feedback dispatching/scheduling
  (including thrashing controls)
• page replacement (clock global LRU & something better than LRU)
• page "migration" (moving pages around between high
  performance fixed-head disk and movable arm disks)
• various disk & page I/O subsystem optimization
• more "fastpath" optimization
The three of us ran an "internal IBM" product group supporting the
modifications on the CSC mainframe and also packaging and shipping the
modified product to internal IBM sites (system was operational on some
100 mainframes at internal IBM sites). Through a special agreement it
was also shipped/supported for some AT&T sites. There were approximately
30,000 lines of new &/or changed code.
The benchmarking methodology included a lot of synthetic workload
stuff. There was extensive monitoring and profiling of production
systems, synthetic workloads were created to simulate production
characteristics and then validated/tuned. Along with this was heavy
instrumentation of the kernel (some of which was also necessary for
being able to perform dynamic adaptive calculations).
Typical benchmarking process would build a specific kernel, boot the
kernel, initialize parameters as specified, run a specific synthetic
workload, gather all the data, kill all synthetic workload processes
and then go on to the next benchmark. We would typically
(automatically) kick this off at midnight on Friday and it would run
totally automated until 8am Monday morning when it would rebuild the
"production" kernel and bring up the machine in "production" mode for
normal users. We could sometimes get 100 separate synthetic workload
benchmarks run over the 56hr weekend.
A subset of the automated operator support and the dynamic shared
library support was picked up by the vm/370 development group for the
"basic" release 3 product.
About the time that the base "release 3" was shipped a decision was
made to make an "add-on" software product release of the CSC
"performance enhancements" (I believe the SHARE scheduler white paper
referred to them as the "Wheeler Scheduler"). Unfortunately at the
time, both of the BU students had gone on to other things and I was
doing a scalable SMP project.
In any case, I went to work part time on turning out this "Resource
Manager PRPQ". Besides the technical work (see following) this was
going to be the first IBM SCP (system control program) software that
was charged for. As a result I spent possibly more time on various
"business" stuff than on the technical things (helping formulate how
IBM was going to charge for SCP software).
A set of benchmarks was established for validating the
design/implementation, with systematic variations in the synthetic
workload, in the hardware configuration, in the kernel parameters and
characteristics. The benchmarks were designed to cover all conceivable
operational environments that the software might be used for. There
were also "clunkers" ... if a nominal heavy load for a particular
configuration was 100 users ... several tests at 800 synthetic users
were run. In order to perform such stress tests, numerous
timing-dependent bugs in the base system had to be found and fixed
... as well as a redesign of the kernel synchronization mechanism to
eliminate all possible cases of zombie processes.
On the order of 2000 (new) separate benchmark tests were run in the
process of (re)validating the RM-PRPQ. Included in the tests were
priority changes (nice'ing) of numerous kinds. The default dynamic
adaptive mode was to assume "fair share" resource allocation ....
however administrative priority changes (nice'ing) were defined to have
very specific effects as to process resource allocation. This had to
be verified across a wide combination of possible configurations and
workloads (as well as demonstrating that each nice'ing increment
exactly resulted in the defined resource allocation change ...
administrative controls allowed specification of either more or less
resources than fair share ... or specific percentage of total system
resources). The instrumentation/monitoring of a large percentage of
the "internal IBM" sites running the code was also used to help
calibrate/validate the dynamic adaptability of the code.
The RM-PRPQ did not contain the page-mapped filesystem changes, but
included several of the performance enhancements. The RM-PRPQ had a
couple of new modules and 6500 lines of code (which included
modifications to 60-some existing CP kernel modules).
The (IBM) CP kernel module naming convention was a three letter prefix
("DMK" for all modules in the CP kernel) followed by a three letter
module designation. The RM-PRPQ module responsible for most of the
dynamic adaptive resource logic was named DMKSTP. I believe that
the number of mainframes licensed to run the RM-PRPQ went over 1000
(mid-70s). I also got the job of being 1st, 2nd, & 3rd level field
support for the product for the first 9 months after it shipped.
Come 1983, I had almost forgotten the R2PLC15 system that we had been
supporting at AT&T back in '74. The IBM branch office called and
wanted help in getting AT&T off the system. Apparently as each new
processor came out AT&T would migrate the software to the new
generation of machines. What was interesting/gratifying was that while
the dynamic adaptive implementation was dynamically adaptive ... there
was well over an order of magnitude difference (>20* the performance)
between the 1974 machines (that R2PLC15 had been calibrated on) and the
1983 machines.
In any case, the cp/67 and the vm/370 (at least after the RM-PRPQ)
thrashing controls essentially became the same.
See Melinda's paper for more/other details.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Schedulers
Date: Fri, 11 Mar 1994 10:45:08 GMT
thrashing footnote ... actually for the stress tests the thrashing
controls worked too well. To really get some of the stress tests
working I had to build special kernels with the thrashing code
crippled. With 5* to 10* the nominal number of users expected to be
found in a heavily loaded system, AND the thrashing controls crippled
... could get some real stress going in various parts of the system
(like 5-15 seconds elapsed time to service a page fault ... when the
system was paging at 300/sec).
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: Re: Schedulers
Organization: NETCOM On-line services
Date: Fri, 11 Mar 1994 20:34:32 GMT
more stories out of school?
Basically global clock runs around all pages in real storage resetting
reference & testing reference bits. When it finds a page that doesn't
have its reference bit set on ... the page is selected for
replacement.
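A minimal sketch of that sweep, in the textbook "clock" (second-chance) form. The class and attribute names are invented for the example, software flags stand in for the hardware reference bits, and none of the dynamic adaptive refinements mentioned elsewhere in this thread are included.

```python
class ClockReplacer:
    """Global clock over a fixed set of real-storage frames."""

    def __init__(self, nframes):
        self.pages = [None] * nframes   # which virtual page is in each frame
        self.ref = [False] * nframes    # software stand-in for reference bits
        self.where = {}                 # virtual page -> frame index
        self.hand = 0

    def access(self, page):
        """Reference a page; return True if it caused a page fault."""
        if page in self.where:
            self.ref[self.where[page]] = True
            return False
        # Sweep: reset reference bits that are on, stop at the first
        # frame whose bit is already off and select it for replacement.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        if victim is not None:
            del self.where[victim]
        self.pages[self.hand] = page
        self.where[page] = self.hand
        self.ref[self.hand] = True      # the faulting reference touches it
        self.hand = (self.hand + 1) % len(self.pages)
        return True
```

Note that the interval between a bit being reset and the hand coming back to test it shrinks as the demand for pages goes up, without any code that measures that demand.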
A "local" LRU algorithm is limited on a process by process basis for
running around virtual address space resetting & testing reference
bits.
The original CP/67 "algorithm" just cycled thru real storage looking
for a page that didn't belong to an active process (no resetting &/or
testing of the page's hardware reference bits). If it didn't find one,
it would pick the first available page. Assuming all real storage was
"occupied" this algorithm effectively approximated FIFO. It also had
pathlength performance penalties since all real storage had to be
scanned prior to deciding it was just doing FIFO.
The original VM/370 algorithm had a threaded list of virtual pages
that was supposedly scan'ed by a "clock" global LRU algorithm. However
there was other code that was constantly removing pages from one
portion of the threaded list and re-inserting them elsewhere. The
removal & re-insertion had some bad side effects ...
1) the "effective" ordering no longer was the average time since a
page had its reference bit reset ... this "average time" needed to be
relatively uniform for ALL pages for the "global scan" for the
algorithm to approximate real LRU. A supposedly minor "implementation"
change totally negated what made clock global LRU an approximation to
real LRU.
2) removal and reinsertion of pages were process specific, this
had the effect of repeatedly "clustering" all virtual pages
for a process together ... the result would be that the
"replacement algorithm" would be raiding pages from specific
processes in "bursts" ... with long bursts of not raiding any
pages for a specific process
3) there was very inconsistent treatment of "shared" pages which were
located in virtual address space of multiple processes simultaneously.
Prior to "release 3" shared library changes the maximum number of
these pages were well capped (nominally 16 max). After the "release 3"
changes, the maximum number of "shared" pages started to explode (and
so did the side-effects of not correctly handling them).
4) various hardware technology & configuration evolutions resulted in
numerous workload environments where the virtual page mean elapsed
resident lifetime exceeded the mean threaded list shuffle interval.
The "clock" global LRU implicitly assumes a uniform "reset interval" ...
creating something of a discrimination function between 1) all pages
that get used more frequently than the "reset interval" and 2) all
pages that were used longer in the past than the reset interval. I
came up with the "clock" global LRU implementation in the '60s
specifically because it had the advantage of dynamically adapting the
"reset interval" proportional to the demand for pages (w/o explicit
code being required ... it also has one or two other implicit "dynamic
adaptive" characteristics). The threaded list shuffle wiped out all
"memory" of page reference history. When the mean virtual-page
residence lifetime exceeded the mean threaded-list shuffle interval
... all relationship to a LRU-replacement algorithm disappeared.
Another example of "implementation" optimization invalidating
"algorithm" architecture was the mainstream IBM operating system
effort. It was no secret that the VM/370 product was viewed by
"strategic" IBM as an orphan child. During the '70s, customers were
frequently told that the last VM/370 release had already been shipped.
Internally, people were told that if they ever wanted a career path
and/or promotion that they had to transfer to the "mainstream,
strategic, operating system product".
In any case, in the early '70s the "mainstream" product was preparing
to embrace "virtual". The non-virtual design implemented a single real
address space for all of the kernel as well as all executing
processes/applications. The translation of the "real" system to "virtual"
effectively created a single large simulated real address space with
the virtual hardware (changing the name from MVT to SVS or "single"
virtual system). There were some tricks behind the scenes to map (&
fix/pin) various kernel pages to the same real address as their
virtual address.
I did some consulting to them regarding LRU replacement algorithms ...
however their OR simulation group "discovered" that system performance
would be better if the replacement algorithm was biased towards
selecting "non-changed" virtual pages prior to "changed" virtual pages.
They were undeterred by arguments about the replacement algorithm no
longer approximating LRU-replacement. They shipped the product with the
"discovered" performance improvement by the OR simulation group. It
was relatively late in the SVS product cycle that somebody observed
that the "majority" of non-changed virtual pages were "shared" library
code used by all applications and the "majority" of changed virtual
pages were private application data pages (i.e. the implementation was
biased to replace high-usage pages that were commonly referenced by
all applications before private pages ... replacing high-usage shared
pages had two down-sides ... individual process page fault rates ...
plus tending to serialize/block multiple applications simultaneously).
The original VM/370 replacement algorithm had the "shared" page
problem of not being able to discriminate between "high-usage" shared
pages and "low-usage" shared pages that hadn't been touched in an
extended period of time. The mainstream SVS implementation went to
the other extreme and was actively biased towards replacing shared
pages (regardless of high or low usage).
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Schedulers
Date: Fri, 11 Mar 1994 21:28:31 GMT
Newsgroups: alt.folklore.computers
RM-PRPQ footnote:
one other feature that went out in the RM-PRPQ product was "pageable
pagetables". VM/370 required 23.5 double words (188bytes) fixed real
storage for every 64k bytes of virtual memory (about
12k-real/1mbyte-virtual). A hypothetical configuration with 300
processes, each with 16mbytes of virtual memory, would have required
on the order of 57 mbytes of fixed real storage for tables in the
standard vm/370.
benchmark footnote:
the benchmark suite started with a synthetic workload. First a large
number of real live workloads were profiled ... and then some
composite synthetic workloads were put together. Profiles from running
the synthetic workloads were then cross-checked against the live-load
profiles for final calibration.
An operational envelope profile was put together from data regarding
"heavy load" operation of a large number of real live systems.
Approximately [...] selected from the
outer edges of the heavy-load operational envelope as well as
representative (&/or common) points within the envelope. Twenty-four
composite synthetic workloads were created whose execution profile
matched the selected operational points ([...]
utilization, resource utilization distribution were all profile
factors).
A typical "validation" for some code change typically required running
the complete suite of 24 operating point benchmarks with 4-5 different
tuning options (i.e. 96-120 total benchmarks).
trivial response footnote:
There was a paper by some group in the late '70s claiming that they
had the best performing (vm/370) timesharing service with 300+ logged
on users, 100% processor utilization and 90th percentile trivial
interactive response of .24 seconds.
I had a guinea-pig production installation at the time with similar
workload and configuration profile. The major difference was that I
had deployed my high-performance page-mapped filesystem (that never
was included in the product) along with several additional
dispatching, scheduling, and paging enhancements (including
"remembering" members of a previous working set and "block" paging
when process reentered the queue). With a similar configuration and
workload, this configuration had a 90th percentile trivial interactive
response of .11 seconds.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom7.netcom.com (Lynn Wheeler)
Newsgroups: alt.hypertext,comp.infosystems.interpedia
Subject: Re: link indexes first
Date: Fri, 11 Mar 1994 19:58:03 GMT
we've looked at some "bi-directional" link challenges. One scenario is
a departmental CD-ROM server. The CD-ROM can be "pressed" with an
arbitrary number of bidirectional links ... but departmental "views",
individual "views" and/or overlaid individual/departmental "views"
require r/w update capability. We addressed the opportunity with
"virtual" subjects (stored in a r/w database) that had a "one-way"
pointer to the "real" subject (possibly located on the r/o
cd-rom). Departmental, individual, and overlaid
individual/departmental views involve transparently merging the
"virtual" and the "real" subject along with all the associated
relations.
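The virtual/real subject arrangement can be sketched roughly as follows. This is only an illustration of the idea, not the actual implementation; the class names, the `added_links` field and the sample subjects are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RealSubject:
    """A subject pressed onto the read-only CD-ROM (never updated)."""
    name: str
    links: frozenset  # names of linked subjects, fixed at pressing time


@dataclass
class VirtualSubject:
    """A record in the r/w database with a one-way pointer to a real subject."""
    real: RealSubject                              # one-way: the CD-ROM knows nothing of it
    added_links: set = field(default_factory=set)  # hypothetical: extra links for this view


def merged_links(real, *overlays):
    """Merge a real subject with its overlays (e.g. departmental, then individual)."""
    links = set(real.links)
    for overlay in overlays:
        if overlay.real is not real:
            raise ValueError("overlay points at a different real subject")
        links |= overlay.added_links
    return links


paging = RealSubject("paging", frozenset({"clock", "lru"}))
dept = VirtualSubject(paging, added_links={"working-set"})
mine = VirtualSubject(paging, added_links={"my-notes"})

dept_view = merged_links(paging, dept)            # departmental view
personal_view = merged_links(paging, dept, mine)  # overlaid individual/departmental view
```

The read-only data is never touched; each view is computed by following the one-way pointers from the r/w records and merging on the fly.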
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: IBM 7090 (360s, 370s, apl, etc)
Organization: NETCOM On-line services
Date: Wed, 23 Mar 1994 19:27:16 GMT
For cp/67 .... csc did the original port of apl/360 to cms/apl. It
included some extensions for doing "cms file i/o". Also there was a
peculiar problem with apl memory garbage collection.
apl would nominally not re-use any storage location ... almost any
value "modification" involved allocating the next available memory
location for the new value (and ignoring the "old" location). When
memory allocation reached the end of storage it would garbage collect,
compress variables down to low-memory addresses and restart. csc had
some extensive monitoring tools to do full instruction and storage
reference monitoring (some of the technology was eventually released
as the VS/Repack product in the mid-70s ... among other things, if
given a load-map it would use "cluster-analysis" to do program
re-arrangement for improved virtual memory operation). One of the
tools could produce a printed chart of memory references (we had 6'
high outputs scotch-taped together along 10-15 feet worth of wall)
against time. APL operation would typically have this very pronounced
saw-tooth effect with a sharp rise using all of (virtual) memory and
then a straight-line collapse when end-of-memory was reached and
garbage collection would run.
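The saw-tooth can be reproduced with a toy model. This is a minimal sketch, assuming every assignment takes the next free location and that garbage collection compacts the live values back to the bottom of the workspace; the function name and the sizes are made up for illustration.

```python
def simulate_allocation(workspace_size, live_size, assignments, value_size=1):
    """Return (trace, collections): the allocation pointer after each
    assignment and the number of garbage collections that were needed."""
    next_free = live_size          # live values start out compacted at low addresses
    trace = []
    collections = 0
    for _ in range(assignments):
        if next_free + value_size > workspace_size:
            next_free = live_size  # end of storage: compress live data down and restart
            collections += 1
        next_free += value_size    # the new value goes in the next free location
        trace.append(next_free)
    return trace, collections


# Only 10 units are ever live, yet the whole 100-unit workspace gets touched.
trace, collections = simulate_allocation(workspace_size=100, live_size=10, assignments=250)
```

Plotting `trace` against time gives the sharp rise and collapse described above; the peak is always the full workspace size, regardless of how little data is actually live.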
this wasn't too bad a virtual memory characteristic with 32k-100k byte
APL workspaces ... but it turned out that a lot of the people using
cms/apl on the csc machine were doing it because they could get 1mbyte
and larger (virtual memory) workspaces (in addition to cms file i/o).
Effectively apl would utilize all available virtual memory regardless
of the size of the apl application/program running. In order to handle
that, csc developed an optimized virtual memory garbage collector for
apl.
Some of the IBM 370 machines also had an interesting virtual memory,
cache implementation. The 370 architecture had a mode-bit that
selected between 2k & 4k virtual pages. The dos/vs & vs1 operating
systems ran with 2k virtual pages and svs/mvs ran with 4k virtual
pages. vm/370 nominally ran 4k virtual memory ... but if emulating a
virtual machine in relocate mode ... it would use whatever page-mode
the virtual machine specified. The caches were "real-mapped", but some
machines would start cache line selection using "low-bits" of the page
displacement. On the 64k cache 370/168 it would use the "11th bit"
(i.e. 2k) as part of the cache-line selection. However, when switching
from 4k->2k virtual relocate mode ... the 168 would invalidate all
cache-line entries and switch to being only a 32kbyte cache machine.
Going from 2k->4k mode it would again invalidate all the cache line
entries and then switch back to being a 64kbyte cache machine. There
was at least one case where a dos/vs/vm "shop" upgraded from a
32kcache 168 to a 64kcache 168 only to find their performance
significantly degrade.
The 370/168 had a 7 deep tlb sto (i.e. each tlb entry had a 3bit
identifier, "invalid" and seven possible address-space "ownerships").
It was somewhat "tuned" for SVS/MVS. The SVS/MVS design reserved the
1st 8mbytes of virtual memory for kernel code which left the 2nd
8mbytes (in 16mbyte virtual address space) for application code. On
the 370/168, one of the tlb index bits was the virtual address 24bit
(8mbyte). This worked out well for SVS/MVS ... but hampered cms, dos,
vs1 running on the machine since in most nominal environments, all
virtual addresses were <8mbyte (with effectively no virtual addresses
>8mbyte, half of the tlb entries always went unused).
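The effect of using the 8mbyte address bit as a tlb index bit can be shown with a small sketch. The number of index bits and the choice of the remaining bits are assumptions made purely for illustration (they are not taken from the 168 hardware documentation); only the use of the 8mbyte bit comes from the description above.

```python
PAGE_SHIFT = 12        # 4k pages
HIGH_BIT = 1 << 23     # the 8mbyte bit of a 24-bit virtual address
LOW_INDEX_BITS = 6     # hypothetical: low-order page-number bits used in the index
TLB_SLOTS = 1 << (LOW_INDEX_BITS + 1)


def tlb_index(vaddr):
    """Index = the 8mbyte address bit followed by low-order page-number bits."""
    high = 1 if vaddr & HIGH_BIT else 0
    low = (vaddr >> PAGE_SHIFT) & ((1 << LOW_INDEX_BITS) - 1)
    return (high << LOW_INDEX_BITS) | low


def slots_touched(addresses):
    """Set of distinct tlb index values that the given addresses map to."""
    return {tlb_index(a) for a in addresses}


# Every 4k page of a full 16mbyte space (kernel below 8mbyte, application above).
full_used = len(slots_touched(range(0, 1 << 24, 1 << PAGE_SHIFT)))
# Every 4k page below 8mbyte (typical for a cms, dos or vs1 virtual machine).
low_used = len(slots_touched(range(0, 1 << 23, 1 << PAGE_SHIFT)))
```

With these assumed parameters the full address space reaches all 128 index values, while addresses below 8mbyte can only ever reach 64 of them.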
When our scalable SMP projects got canned (first a 5-way and then a
revived 16-way) ... we adapted the design/implementation to a standard
370 2-way (actually when the 2-way support was released as part of the
base VM product, they had to do some interesting product "repackaging"
since the implementation was dependent on a large part of the code in
the RM PRPQ) ... we had done some optimizations for cache affinity
management and pre-emptive task-switching ... which (on 2-way)
resulted in situations where the "MIP" rate on one of the processors
was effectively near the nominal uni-value ... but the other
processor would hit a "MIP" rate nearly 50% higher. Several of the vm
"performance" monitors only paid attention to % kernel cpu utilization
and % process cpu utilization ... in some cases running at a 50%
higher MIP rate would superficially appear as if less work was being
performed.
In late '77 & early '78 I helped put together a cluster system of
eight 2-way MPs (initially 168s but upgraded to 3033s) all sharing the
same disk farm. It was used to provide primarily APL-based application
service (with typical apl workspaces running around 1mbyte or larger)
... at the time it was the largest "single system image" operation.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: scheduling & dynamic adaptive ... long posting warning
Date: Thu, 24 Mar 1994 00:18:25 GMT
With respect to scheduling & dynamic adaptive feedback control system
postings a couple weeks ago (hot button of mine) ... although not
directly related to computers (might be better placed in
alt.folklore.military? ... anyway) ...
Col (ret) John Boyd has had some fascinating things to say about
operating inside your opponent's OODA-loop (observe, orient,
decide, act; feedback loop). A lot of his thoughts about increasing
feedback loop performance seemed to originate from his background as a
fighter pilot during the Korean War, coming right out of "plane turn
radius" in dog fights. I've had the privilege of sponsoring his talks
several times. I had done a lot in the late '60s and early '70s with
dynamic adaptive control systems using feedback loops and operating
envelopes so I was quite taken with Boyd's OODA-loop concept and plane
operating envelopes.
At one time, Boyd was in charge of lightweight fighter plane R&D at
the pentagon ... he had also run a "skunk-works" responsible for the
early F16 design. Prior to that he had developed something called
"Boyd's Laws" which he evolved into a 300 or so page fighter pilot
training manual. Basically it involves looking at different planes'
performance envelopes along several axes simultaneously (graphs look
like lopsided circles). You overlap two of these envelope/graphs for
two different planes ... and it suggests where your plane operates
best vis-a-vis an opponent ... and conversely the same for them. He
says that later the CIA translated a Russian fighter pilot training
manual ... and it was word-for-word his document except for simple
changes like feet & miles to meters & kilometers. The use of these
performance envelopes then evolved into being used for plane design
for things like in what areas do you want improvements ... and what
areas are you willing to take sacrifices/trade-offs.
Boyd also has seemingly hours of stories about technology/science "not
working" and/or at least being used incorrectly. I've suspected that
he also had a hand in the F20/Tigershark (since it conformed with lots
of his statements about designing a plane that had a long MTBF and a
typical enlisted person could repair/service quickly, i.e. flying time
was much greater than down/service time).
US News & World Report had a short article on him during Desert Storm titled
"The Fight To Change How America Fights" ... also mentioning the "Jedi
Knights". I remember a briefing (on CNN) given by some Col. two days
into the war that talked about how strategy & tactics had changed ...
using phrases that Boyd was using/advocating at least 10 years
earlier.
Boyd had a two part talk, 1) Patterns of Conflict, and 2) Organic
Design For Command & Control. Patterns of Conflict is the longer of
the two talks. The last four foils list over 200 references. By
comparison, 'Organic design for command & control' has less than 1/5th
as many foils. While both talks draw heavily on historical examples
from warfare, the real focus of the talks was fundamental principles of
how to be successful in a competitive environment. Any typos in the
attached excerpts are mine.
... from "Patterns Of Conflict":
• Sun Tzu (around 400 BC)
Probe enemy to unmask his strengths, weaknesses, patterns of movement
and intentions. Shape enemy's perception of world to
manipulate/undermine his plans and actions. Employ Cheng/Ch'i
maneuvers to quickly and unexpectedly hurl strength against
weaknesses.
• Bourcet (1764-71)
"A plan ought to have several branches. ...One should...mislead the
enemy and make him imagine that the main effort is coming at some
other part. And...one must be ready to profit by a second or third
branch of the plan without giving one's enemy time to consider it."
• Napoleon (early 1800's)
"Strategy is the art of making use of time and space. I am less chary
of the latter than the former. Space we can recover, time never." "...I
may lose a battle, but I shall never lose a minute." "The whole art of
war consists in a well reasoned and circumspect defensive, followed by
rapid and audacious attack."
• Clausewitz (1832)
Friction (which includes the interaction of many factors, such as
uncertainty, psychological/moral forces and effects, etc.) impedes
activity. "Friction is the only concept that more or less corresponds
to the factors that distinguish real war from war on paper." In this
sense, friction represents the climate or atmosphere of war.
• Jomini (1836)
By free and rapid movements carry bulk of forces (successively)
against fractions of the enemy.
• N.B. Forrest (1860's)
"Git thar the fustest with the mostest."
• Blumentritt (1947)
"The entire operational and tactical leadership method hinged upon...
RAPID, concise assessment of situations, ...QUICK decisions and QUICK
execution", on the principle: "each minute ahead of the enemy is an
advantage."
• Balck (1980)
Emphasis upon creation of "implicit connections or bonds" based upon
"trust, not mistrust", that permit wide freedom for subordinates to
exercise initiative and imagination -- yet, harmonize within intent of
superior commanders. Benefit: internal simplicity that permits rapid
adaptability.
• Yours truly (i.e. John Boyd)
Operate inside adversary's observation-orientation-decision-action
loops to enmesh adversary in a world of uncertainty, confusion,
disorder, fear, panic, chaos, ...or fold adversary back inside
himself, so that he cannot cope with events/efforts as they unfold.
xxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxx
Quite frequently, Boyd's foils are "black". A few foils from "organic
design for command & control":
foil 25
comment
-------
Up to this point we have shown orientation as being a critical element
in command and control -- implying that without orientation there is
no command and control worthy of the name.
? - raises question - ?
-----------------------
what do we mean by command and control?
foil 26
Some Historical Snapshots
------------------------
Before attempting to respond to this question let us take a look at
some evidence (provided by Martin Van Creveld as well as myself) that
may help in this regard:
* Napoleon's use of staff officers for personal reconnaissance
* Moltke's message "directives" of few words
* British tight control at the Battle of the Somme in 1916
* British GHQ "phantom" recce regiment in WW II
* Patton's "household cavalry"
* My use of "legal eagle" and comptroller at NKP
foil 27
A Richer View
-------------
(a la Martin Van Creveld -- "Command" -- 1982)
In the June 1967 War, "... General Yeshayahu Gavish spent most of his
time either 'accompanying' units down to brigade level -- by which,
according to his own definition, he meant staying at that unit's
command post and observing developments at first hand -- or else
helicoptering from one unit to another; again, in his own words,
'there is no alternative to looking into a subordinate's eyes,
listening to his tone of voice'. Other sources of information at his
disposal included the usual reporting system; a radio network linking
him with the three divisional commanders, which also served to link
those commanders with each other; a signals staff whose task it was to
listen to the divisional communications networks, working around the
clock and reporting to Gavish in writing; messages passed from the
rear, i.e., from General Headquarters in Tel Aviv, linked to Gavish by
'private' radio-telephone circuit; and the results of air
reconnaissance forwarded by the Air Force and processed by Rear
Headquarters. Gavish did not depend on these sources exclusively,
however; not only did he spend some time personally listening in to
the radio networks of subordinate units (on one occasion, Gavish says,
he was thereby able to correct an 'entirely false' impression of the
battle being formed at Brigadier Gonen's headquarters) but he also had
a 'directed telescope' in the form of elements of his staff, mounted
on half tracks, following in the wake of the two northernmost divisions
and constantly reporting on developments."
foil 28
Point
-----
The previous discussion once again reveals our old friend -- the
many-sided implicit cross-referencing process of projection,
correlation, and rejection.
? - Raises Question - ?
-----------------------
Where does this lead us?
foil 29
Epitome of "Command and Control"
--------------------------------
Nature
------
* Command and control must permit one to direct and shape what is to
be done as well as permit one to modify that direction and shaping by
assessing what is being done
What does this mean?
--------------------
* Command must give directions in terms of what is to be done in a
clear unambiguous way. In this sense, command must interact with
system to shape character or nature of that system in order to realize
what is to be done;
whereas
* Control must provide assessment of what is being done also in a
clear unambiguous way. In this sense, control must not interact nor
interfere with system but must determine (not shape) the
character/nature of what is being done
Implication
-----------
* Direction and shaping, hence "command", should be evident while
assessment and determination hence "control", should be invisible and
should not interfere -- otherwise "command and control" does not exist
as an effective means to improve our fitness to shape and cope with
unfolding circumstances.
foil 30
Illumination
------------
* Reflection upon the statements associated with the Epitome of
"Command and Control" leaves one unsettled as to the accuracy of these
statements. Why? Command, by definition, means to direct, order, or
compel while control means to regulate, restrain, or hold to a certain
standard as well as to direct or command.
* Against these standards it seems that the command and control (C&C)
we are speaking of is different than the kind that is being applied.
In this sense, the C&C we are speaking of seems more closely aligned
to 'leadership' (rather than command) and to some kind of 'monitoring'
ability (rather than control) that permits leadership to be effective.
* In other words, leadership with monitoring, rather than C&C, seems
to be a better way to cope with the multi-faceted aspects of
uncertainty, change, and stress. On the other hand, monitoring, per
se, does not appear to be an adequate substitute for control. Instead,
after some sorting and reflection, the idea of 'appreciation' seems
better. Why? First of all, appreciation includes the recognition of
worth or value and the idea of clear perception as well as the ability
to monitor. Moreover, next, it is difficult to believe that leadership
can even exist without appreciation.
* Pulling these threads together suggests that 'appreciation and
leadership' offer a more appropriate and richer means than C&C for
shaping and adapting to circumstances.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: Re: talk to your I/O cache
Date: Fri, 25 Mar 1994 16:01:26 GMT
There is another type of I/O "cache" optimization that I worked on in
the late '70s. It basically was an extension of the "page migration"
work that I had done earlier (moving virtual memory pages around at
different levels in a disk-performance hierarchy ... analogous to file
HSM migration).
In traditional "cache" operations, reads are "non-destructive".
However, in situations where the I/O cache line size and read
record size are the same and there is a large processor real memory
cache .. then a case can be made for doing "destructive" cache reads
(which effectively makes everything in the real memory cache
"changed").
In the "non-destructive" read situation, there will be "duplicate"
records in the I/O cache and the real memory cache. The total number
of cached records is equal to the number of records in the I/O caches
plus the number of records in the real memory cache MINUS the number
of duplicates (i.e. the same record exists in both the I/O cache and
the real memory cache).
For configurations with large real memory caches, it is possible that
the number of these "duplicates" can approach the total I/O cache
capacity. In such a situation the I/O cache is typically reduced to
doing little more than optimization associated with rotational latency
(and doesn't really need to be any larger than the number of bytes on
a track). Even in that situation, some of the "hit" numbers can be
deceptive, i.e. a program is sequentially reading a single record at a
time, and say there are 10 records per track and there is full-track
buffering ... the first record read brings in the track which counts
as a "miss" but the next 9 record reads will be counted as "hits" for
a theoretical cache-hit ratio of 90%. In this environment, "cache"
sizes larger than a track will not increase the hit-ratio beyond 90%.
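The 90% figure can be checked with a few lines of simulation. This is a minimal sketch of the sequential case only, assuming a buffer that stages whole tracks and discards the oldest track when full; the function and parameter names are illustrative.

```python
def sequential_hit_ratio(records_per_track, tracks_read, cache_tracks):
    """Hit ratio for a strictly sequential, one-record-at-a-time read through
    a full-track buffer able to hold `cache_tracks` tracks."""
    cached = []                  # track numbers currently buffered, oldest first
    hits = misses = 0
    for record in range(records_per_track * tracks_read):
        track = record // records_per_track
        if track in cached:
            hits += 1
        else:
            misses += 1          # first touch of a track stages the whole track
            cached.append(track)
            if len(cached) > cache_tracks:
                cached.pop(0)    # discard the oldest buffered track
    return hits / (hits + misses)


one_track = sequential_hit_ratio(10, 100, cache_tracks=1)
many_tracks = sequential_hit_ratio(10, 100, cache_tracks=50)
```

Both runs give the same ratio: for this access pattern a buffer of fifty tracks does no better than a buffer of one.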
Sorry for the side-track, back to the "dup"/"no-dup" scenario. As long
as the number of "duplicates" is small with respect to the size of
the I/O cache, then non-destructive reads are fine. However, when the
number of duplicates approaches a significant percentage of the I/O
cache size, then some optimization can be achieved by switching to
"destructive" reads and a "no-dup" policy.
The page-migration scenario from the late '70s came about because of
the growth in real memory sizes that approached or exceeded the size
of high-speed fixed-head disk paging devices. Say that there was
128mbytes of page disk capacity and 64mbytes of real memory ... in a
"dup" strategy the largest total amount of virtual pages is 128mbytes.
However, with a logical "destructive" read (in the page scenario
anytime a page is read into memory, its disk backing store location is
deallocated; if the page is ever "replaced" in the future, it must be
written) and a "no-dup" algorithm, the total amount of virtual memory
increases to 192mbytes=128mbytes+64mbytes. In this scenario there is a
tradeoff between I/O activity and available virtual memory capacity
(note there must be at least 1 disk page slot held in reserve to
avoid a deadlock scenario).
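The capacity arithmetic for the two strategies, worked in 4k pages. This is a sketch under the assumptions stated in the text (128mbytes of paging disk, 64mbytes of real memory); the function name and the `reserve_slots` parameter are illustrative.

```python
PAGE = 4096
MBYTE = 1 << 20


def max_virtual_pages(disk_slots, memory_frames, no_dup, reserve_slots=1):
    """Largest number of distinct virtual pages the configuration can hold."""
    if not no_dup:
        # "dup": every page keeps its disk slot whether or not it is in memory
        return disk_slots
    # "no-dup": a page in memory has given up its disk slot, so memory adds
    # capacity; some slots are held back so a page can always be written out
    return (disk_slots - reserve_slots) + memory_frames


disk_slots = 128 * MBYTE // PAGE       # 32768 page slots
memory_frames = 64 * MBYTE // PAGE     # 16384 page frames

dup_pages = max_virtual_pages(disk_slots, memory_frames, no_dup=False)
no_dup_pages = max_virtual_pages(disk_slots, memory_frames, no_dup=True)
no_dup_ideal = max_virtual_pages(disk_slots, memory_frames, no_dup=True, reserve_slots=0)
```

Ignoring the reserve slot, the "no-dup" total is exactly the sum of the disk and memory sizes; with one slot held in reserve it is one page less.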
I/O caches represent a similar opportunity ... assuming the cache-line
size and the record transfer size are the same. There is an optimization
which requires that there be two types of "writes" tho:
1) standard write which implies place the record in the
cache as well as force it to disk
2) cache-only write ... which just places the record in
the cache ... but doesn't force it to disk (and allows it
to be discarded if selected for replacement).
as well as "destructive" and "non-destructive" reads (i.e.
non-destructive read leaves the record in the cache, destructive read
invalidates the cache line and makes the space available). Note the
cache-only write also handles the scenario of indicating to the I/O
cache that the information is already out at the specified disk record
location.
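The four operations can be sketched as a toy interface. This is an illustration only, with a dict standing in for the disk; the class and method names are hypothetical and not those of any real controller or operating system.

```python
from collections import OrderedDict


class IOCache:
    """Toy record cache in front of a dict that stands in for the disk."""

    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.lines = OrderedDict()   # record id -> data, oldest first

    def _install(self, rid, data):
        self.lines[rid] = data
        self.lines.move_to_end(rid)
        while len(self.lines) > self.capacity:
            self.lines.popitem(last=False)   # discard oldest; disk already has the data

    def write(self, rid, data):
        """Standard write: place the record in the cache and force it to disk."""
        self.disk[rid] = data
        self._install(rid, data)

    def write_cache_only(self, rid, data):
        """Cache-only write: caller asserts the disk copy is already current."""
        self._install(rid, data)

    def read(self, rid, destructive=False):
        """Destructive reads leave no copy of the record in the cache."""
        if rid in self.lines:
            return self.lines.pop(rid) if destructive else self.lines[rid]
        data = self.disk[rid]
        if not destructive:
            self._install(rid, data)
        return data


disk = {"r1": b"one"}
cache = IOCache(disk, capacity=2)
first = cache.read("r1", destructive=True)   # real memory now holds the only cached copy
in_cache_after_read = "r1" in cache.lines
cache.write_cache_only("r1", first)          # unchanged record pushed back out of real memory
in_cache_after_evict = "r1" in cache.lines
cache.write("r2", b"two")                    # changed record: must also reach disk
```

The destructive read followed later by a cache-only write is what keeps the same record from occupying space in both caches at once.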
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch,alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: lru, clock, random & dynamic adaptive
Date: Fri, 25 Mar 1994 16:52:42 GMT
This is a followup to both my i/o cache posting in comp.arch and my
(earlier) scheduler/dynamic adapting posting in
alt.folklore.computers.
I had originated CLOCK in the late '60s ... but in the early '70s
stumbled across a very interesting variation on clock. From our
instruction/storage traces and replacement simulator [...] like 10% of
true-LRU.
However, neither clock nor true-LRU handled the case well where the
working-set was slightly larger than the cache (memory) size ... or an
application effectively exhibited page-reference activity much larger
than total available memory. In a global environment this not only had
the effect of wiping any of the local application's re-use ... but
also would wipe all other application pages from memory. In effect
both clock and true-LRU degenerated to FIFO under such stress
conditions and had no-page-reuse characteristics.
The variation that I stumbled across in the early '70s had the
characteristics of operating like clock under "normal" conditions but
had the interesting characteristic of automatically "degenerating" to
RANDOM (rather than FIFO) under stress conditions (the pathlength was
also effectively the same as clock in UP configurations and actually
better than clock in SMP configurations).
In scenarios where page-reference patterns were strictly sequential
with no re-use, RANDOM, FIFO, and LRU would all perform the same. In
scenarios where the page-reference patterns involved "loops" larger
than available cache/memory, FIFO/LRU guaranteed that there would
be no page reuse ... whereas under the same stress conditions,
RANDOM allowed for a high-probability of page-reuse.
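The loop case is easy to demonstrate with a small simulation. This is a minimal sketch comparing plain LRU, FIFO and RANDOM replacement (it does not implement clock or the adaptive variation); the frame count, loop size and seed are arbitrary choices for illustration.

```python
import random
from collections import OrderedDict


def simulate(policy, frames, references, seed=0):
    """Count page hits for 'lru', 'fifo' or 'random' replacement."""
    rng = random.Random(seed)
    resident = OrderedDict()     # page -> None, ordered oldest to newest
    hits = 0
    for page in references:
        if page in resident:
            hits += 1
            if policy == "lru":
                resident.move_to_end(page)      # a reference renews the page's age
            continue
        if len(resident) >= frames:
            if policy == "random":
                del resident[rng.choice(list(resident))]
            else:
                resident.popitem(last=False)    # evict oldest (by load for fifo, by use for lru)
        resident[page] = None
    return hits


FRAMES = 10
loop = [p for _ in range(100) for p in range(FRAMES + 1)]   # loop one page larger than memory
sequential = list(range(1100))                               # strictly sequential, no re-use

policies = ("lru", "fifo", "random")
loop_hits = {name: simulate(name, FRAMES, loop) for name in policies}
sequential_hits = {name: simulate(name, FRAMES, sequential) for name in policies}
```

On the loop, LRU and FIFO always evict exactly the page that will be needed next and never get a hit, while RANDOM usually evicts some other page and so gets re-use on most references.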
In normal scenarios, normal clock exhibits large degrees of natural
dynamic adaptation to all sorts of load & configuration along with
relatively short/minimal pathlength. However, it still suffers from the
clock global LRU tendency to degenerate to FIFO (and no page-reuse) under
stress conditions. The "adaptive-variation" clock that I stumbled
across in the early '70s avoided this pitfall (and could significantly
outperform true-LRU in the simulator involving "stress" workloads).
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: REXX
Date: Sun, 27 Mar 1994 01:28:25 GMT
This is a REXX story from the early '80s.
In 1982 REXX was still in its early incarnations and there were
efforts to get it released to the world as a product. Some of the
nay-sayers were claiming that it was just another batch command
language ... which the world already had plenty. Being part of the
true-believers I wanted to do a demonstration that showed that it was
significantly more than another batch command language.
I selected as a demonstration a replacement of a VM product component
that was currently implemented in 370 assembler. The existing product
was called DUMPSCAN and it contained >20k lines of assembler code and
was used to view CP and CMS postmortem storage image dumps (and had a
full-time department of 5-10 people supporting it).
My demonstration was that in 3 months elapsed, working half-time, I
would create a REXX replacement for DUMPSCAN that had 5* the function
and ran 5* faster (REXX is an interpreted language). The initial part
of the demonstration was completed in a little over 2 months ... it
had a very small assembler stub module (couple hundred lines of code)
that provided some low-level primitive functions for "DUMPRX". The
actual replacement was 2200 lines of REXX code that implemented a
large superset of the DUMPSCAN function and would operate 5* faster
(a side-effect, for those familiar with the OCO issue, was that
effectively nearly all source code had to be shipped). Some of the
enhancements:
• "opcodes" formatted storage display
• display storage as addresses with respect to
kernel symbol table.
• some simple pseudo-assembler code written in REXX could
process source include files and perform "source" formatted
display of storage locations
• handle not only postmortem storage dumps but also work
against live cp & cms kernel
• parse the GML source file for messages&codes manual and
display information of interest.
• save/log complete session
• sophisticated high-level "help"
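Displaying addresses with respect to a symbol table comes down to finding the nearest preceding entry point. This is a rough sketch of that lookup (in Python rather than REXX, and not the DUMPRX code); the symbol names and addresses are hypothetical.

```python
import bisect


def build_symbol_index(entries):
    """entries: iterable of (name, address) pairs, e.g. from a loader symbol table."""
    ordered = sorted(entries, key=lambda e: e[1])
    return [addr for _, addr in ordered], [name for name, _ in ordered]


def symbolic(address, index):
    """Format an address as the nearest preceding entry point plus a hex offset."""
    addrs, names = index
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return f"{address:06X}"      # below the first symbol: leave as a raw address
    offset = address - addrs[i]
    return names[i] if offset == 0 else f"{names[i]}+{offset:X}"


# Hypothetical entry points, for illustration only.
index = build_symbol_index([("MODA", 0x1000), ("MODB", 0x1A40), ("MODC", 0x2200)])
exact = symbolic(0x1A40, index)
inside = symbolic(0x1A5C, index)
below = symbolic(0x0800, index)
```

The lookup is only as good as the table it is given, which is why a symbol table that matches the running kernel matters.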
Since I still had almost a month left on the project ... I produced
nearly another 800 lines of REXX code that implemented
expert-system-like analysis of postmortem storage images.
It became relatively successful ... although never released to
customers as a product. I directly distributed the application to over
100 internal locations world-wide and at least at one time was in use
by all internal locations as well as all (VM) field service people.
In support of this, I also made a minor modification to CP kernel to
maintain symbol entry-point table. The "standard" DUMPSCAN process was
to merge a "saved" loadmap (generated when the kernel was built) with
the dump storage image.
The CP kernel build process (not really changed since 1967) was to use
the "BPS" loader to load into memory all the kernel binaries. Then a
kernel application would receive control from the BPS-loader and write
the storage-image to a special disk boot-location.
In 1969, I started playing around with some enhancements to the CP
kernel to allow part of it to be "unpinned" and allow it to page. As
part of this I also modified the boot-build routine. When the
BPS-loader exits to the loaded program, it passes the address of the
loader symbol table as well as the count of entries in the table. The
modification I made to the boot-build routine was to copy the
BPS-loader symbol table to the end of the kernel core image and write
it out as part of the boot-image. Since it was located in the
"pageable", unpinned portion of the kernel it wasn't going to take up
runtime storage (there were some 360/67s out there that only had
512kbytes of real storage ... fixed kernel size was becoming
critical).
In any case, I went back and resurrected the 1969 modifications and
re-applied them to the '82 cp kernel (appending the BPS-loader symbol
table to the end of the CP kernel boot-image). I then added the
appropriate switch to DUMPRX to utilize the real symbol table if
available. This eliminated the problem of getting entry symbols from a
"load-map" that didn't match the boot-file.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: 360 "OS" & "TSS" assemblers
Date: Sun, 27 Mar 1994 17:04:33 GMT
CP, CMS, and much of CMS application code was written in 360/370
assembler for both CP/67 (360) and VM/370 (and used a ported version
of the mainstream "OS" assembler ... in fact, when our location
received its first version of CP/67, the CP source was still being
assembled and built on OS PCP).
In '74, I had written a PLI program to analyse 370 assembler listings.
One of my pet peeves at the time was system failures involving
"uninitialized" address registers. The PLI program parsed the
assembler listing into individual machine instructions and interpreted
the instructions ... including register load, store, & ref'ed
activities.
The analysis created "simulated" code blocks from the parsed listing
information. A "code block" was defined as a set of sequentially
executed instructions which was started by a non-branch that was the
next instruction after a branch instruction (branch target or
fall-thru). A code block was terminated by a branch-instruction or
because the following instruction was the target of a branch
instruction. For each code-block a register usage map was created
that showed:
• non-addressing usage of the register
• addressing usage of the register
• alteration of the register
• register used in the code-block w/o 1st being altered
The last item implied that a code block was dependent on the register
value being established by some preceding code.
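The code-block splitting and the per-block register map can be sketched as below. This is a simplified illustration in Python, not the PLI program: the `Instr` record and the sample instructions are hypothetical, and the addressing/non-addressing distinction is left out.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    addr: int
    op: str
    reads: tuple = ()      # registers referenced
    writes: tuple = ()     # registers altered
    target: int = None     # branch target address, if this is a branch


def split_code_blocks(instrs):
    """Close a block after every branch and before every branch target."""
    targets = {i.target for i in instrs if i.target is not None}
    blocks, current = [], []
    for ins in instrs:
        if current and ins.addr in targets:
            blocks.append(current)
            current = []
        current.append(ins)
        if ins.target is not None:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def register_map(block):
    """Registers altered in the block, and registers used w/o 1st being altered."""
    altered, used_first = set(), set()
    for ins in block:
        used_first.update(r for r in ins.reads if r not in altered)
        altered.update(ins.writes)
    return {"altered": altered, "used_before_altered": used_first}


program = [
    Instr(0x00, "L", reads=(12,), writes=(3,)),     # load r3, addressed via r12
    Instr(0x04, "LTR", reads=(3,), writes=(3,)),
    Instr(0x06, "BZ", target=0x0E),                 # branch past the add
    Instr(0x0A, "A", reads=(3, 12), writes=(3,)),
    Instr(0x0E, "ST", reads=(3, 4)),                # store r3, addressed via r4
]
blocks = split_code_blocks(program)
maps = [register_map(b) for b in blocks]
```

The last block's map shows r3 and r4 as used without first being altered, so it depends on preceding code having established both.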
The analysis code would then follow all possible paths through the
code blocks creating summary register activity maps for each path. It
would also identify "dead-code" (code-blocks that were never gotten
to).
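Path following over the code blocks can be sketched along these lines. This is a simplified illustration, assuming each block has already been summarized as (registers used before altered, registers altered, successor blocks); the block names and the sample graph are made up.

```python
def analyze(blocks, entry, initially_set=frozenset()):
    """blocks: name -> (used_before_altered, altered, successors).
    Returns (dead, problems): blocks never reached, and (block, register)
    pairs where some path arrives with the register never having been set."""
    problems, reached, seen = set(), set(), set()
    stack = [(entry, frozenset(initially_set))]
    while stack:
        name, have = stack.pop()
        if (name, have) in seen:
            continue                 # already walked this block with these registers set
        seen.add((name, have))
        reached.add(name)
        needs, alters, successors = blocks[name]
        for reg in needs - have:
            problems.add((name, reg))
        after = have | alters        # registers established once this block has run
        for nxt in successors:
            stack.append((nxt, after))
    return set(blocks) - reached, problems


blocks = {
    "entry":  ({12}, {3}, ["add", "store"]),
    "add":    ({3, 12}, {3}, ["store"]),
    "store":  ({3, 4}, set(), []),
    "orphan": ({5}, {5}, ["store"]),   # nothing branches here
}
dead, problems = analyze(blocks, "entry", initially_set={12})
```

Here "orphan" is reported as dead code, and r4 in "store" is flagged as used on a path where nothing ever loaded it.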
There were numerous possible assembler coding techniques that the
code-block building couldn't handle ... but these were relatively rare
in the CP and CMS routines (and which I handled by "fixing" the code and
then rerunning). Most of the incorrect handling would result in
mis-identifying sections as "dead-code".
Post-processing involved generating a pseudo-code representation of
the original assembler program (looked something like C).
If/then/else/while/until/etc. structures could effectively be
determined directly from the code-blocks. The post-processing could
handle nested conditional structures to any depth, but in most cases
going more than 4-5 deep resulted in less-readable ... rather than
more-readable code.
Except for the conditional control structures, it was somewhat a
trivial one-for-one translation between a machine-op instruction and
something that looked like pseudo code. It did attempt to maintain a
stack of operations for each register and defer generating the
pseudo-code. For instance a load-a/add-b/store-a could turn into
a+=b ... rather than r=a/r=r+b/a=r.
The hard part on the pseudo-code generation was making sure the
symbolics were correct. A standard 360 assembler output line looked
something like:
address instruction addr1 addr2 original statement
yyyy ooiiiiii a1a1 a2a2 a b c d e f g h g
Giving the relative address of the instruction in the program, the
actual instruction (360/370 instructions could be 2bytes, 4bytes, or
6bytes), and the addresses of locations used by the instructions
(360/370 instructions could be register/register, register/storage, or
storage/storage, i.e. 0, 1, or 2 storage location addresses). The
parsing of the address fields and the original statement in an attempt
to formulate a reasonable pseudo-instruction was somewhat
problematical.
All 360/370 instruction storage addresses are displacements with
respect to the value in an address register. Things to be addressed are
defined symbolically to the assembler with something called "csects"
(typically program code) and "dsects" (typically data
structure definitions). The assembler is informed about possible
address register contents with a "using" statement. The fields "addr1"
and "addr2" are (non-displacement) addresses within some csect or
dsect structure. The problem was that which csect/dsect the address
came from was not identified.
The previous output format was for the line of mainline "OS" 360/370
assemblers. However, generating pseudo-code from the "TSS"
(time-sharing system, the "official" operating system support for the
360/67) assembler was less problematical. The TSS assembler would
prefix the addr1 and addr2 fields with a "space" identifier which
uniquely mapped to a specific csect/dsect. As long as a reasonable
convention of symbolic field usage was followed, the TSS
assembler-output eliminated ambiguities as to what to generate for
symbolic parameter names in the pseudo-code.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch
From: lynn@netcom7.netcom.com (Lynn Wheeler)
Subject: Re: talk to your I/O cache
Organization: NETCOM On-line services
Date: Sun, 27 Mar 1994 20:19:25 GMT
note that the architecture can somewhat be reduced to that of a
multi-level storage implementation (real storage, I/O caches, and
storage). Manager in the processor would have capability of specifying
logical "copy" operations (read from disk, thru cache, write to disk
thru cache) or logical "move" operations (discard copy once transfer
is complete).
Transfer operations would have flavors of copy/move as pertaining to
specific levels in the storage hierarchy (i.e. like read from disk or
cache and discard cache copy).
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Newsgroups: comp.arch,alt.folklore.computers
Subject: Re: lru, clock, random & dynamic adaptive ... addenda
Date: Sun, 27 Mar 1994 23:53:20 GMT
in response to number of inquiries regarding additional details:
global LRU work i did as an undergraduate in late '60s was basically a
1bit, 1hand clock. work in the 70-72 time-frame involved
• >1bit
• 1hand & 2hand clock
• variation on clock ... lets call it clock-v
since the hardware only had 1bit, additional 8bits were simulated in
software (1 byte per real page).
clock-v variation required at least 2bits ... although greater than
2bits didn't seem to make much difference.
the csc simulator could handle a number of different algorithms (for
which it was calibrated against live implementations) as well as
STRICT/TRUE LRU (i.e. maintain strict LRU page ordering ...
clock only approximates strict page ordering).
both 1hand & 2hand clock-v had the characteristic of degenerating to
random under stress (compared to standard clock global LRU that effectively
reduces to FIFO). Under typical conditions, 1-hand clock and 1-hand
clock-v were nearly identical (i.e. somewhat worse than strict, true
LRU). In addition (at least in the simulator), it turned out that for
any given load ... it seemed like it was always possible to find a
variation on 2hand clock-v that would outperform strict/true LRU (even
under normal loads).
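For readers unfamiliar with the baseline, a 1bit, 1hand clock can be sketched as below. This is the textbook form and only an illustration: it is not the CP/67 code, and it says nothing about how clock-v works, which the posting does not describe. Setting the reference bit when a page is loaded is one common choice among several.

```python
class Clock:
    """Textbook 1-bit, 1-hand clock page replacement (illustrative only)."""

    def __init__(self, nframes):
        self.pages = [None] * nframes    # page held by each real frame
        self.ref = [False] * nframes     # one reference bit per frame
        self.where = {}                  # page -> frame index
        self.hand = 0

    def access(self, page):
        """Touch a page; return True if the access faulted."""
        if page in self.where:
            self.ref[self.where[page]] = True
            return False
        # Sweep: clear reference bits until an unreferenced frame turns up.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        if victim is not None:
            del self.where[victim]
        self.pages[self.hand] = page
        self.ref[self.hand] = True
        self.where[page] = self.hand
        self.hand = (self.hand + 1) % len(self.pages)
        return True

clock = Clock(3)
faults = [clock.access(p) for p in [1, 2, 3, 1, 4, 1, 2]]
```

In this trace every reference bit is set when page 4 arrives, so the hand clears them all and takes the frame it started at. That is the FIFO-like behavior under stress mentioned above.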
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom7.netcom.com (Lynn Wheeler)
Subject: cp disk story
Organization: NETCOM On-line services
Date: Mon, 28 Mar 1994 00:59:34 GMT
Lines: 36
About the time I was working on the eight 2-way cluster activity (late '70s) ..
I was also playing around over in the disk engineering & test labs.
The disk & test labs split everything into "test-cells" ... a typical
lab. room would have multiple test-cells and one or more mainframe
processors. Operation of the hardware in a test-cell was either
totally stand-alone ... or would be cabled (one at a time) to a
mainframe and custom, test software would be executed (typically
little more than BPS system with some custom I/O exercise software).
Typically a single test-cell configuration would generate quantity
and/or severity of errors that would crash &/or hang any of the
standard operating systems within 15-30 minutes.
As a hack, I redid the CP I/O subsystem to be bullet-proof so that
multiple test-cells could be attached and operated concurrently. Over
time, the engineering and test labs also migrated much of their
time-sharing and IS processing to these processor complexes.
There was one weekend where some of the test-lab people thought that they
had an almost ready new disk controller. They swapped the controller
for one that handled a string of 16 disk drives used for standard
time-sharing service. Late Monday morning I was getting calls asking
what had I done to their system that resulted in a several hundred
percent performance degradation. There had been no software changes
over that weekend ... the only change had been the controller swap.
It turns out that the almost ready controller had a peculiar bug that
prevented it from efficiently handling lots of concurrent I/O
activity. Normal mid-morning load would have resulted in concurrent
I/O on all 16 drives.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch.storage
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Dual-ported disks?
Date: Wed, 30 Mar 1994 05:20:23 GMT
a couple of minor points:
• bsd tahoe&reno (at least) have an interesting feature mapping
ip->MACaddress: after return from calling the arp cache routine, the ip
address and the MACaddress are saved. Next time in, if the ip address
matches the saved address ... the call to arp cache lookup is
bypassed. For some (possibly pathological) cases, say involving a
client communicating exclusively for extended periods to a single
server the IP-address won't change. arp cache time-out isn't
sufficient. one possible workaround is periodically pinging two
different ip-addresses (to reset arp-cache call). In effect, there is
a "hidden" single-entry arp-cache value that doesn't conform to the
arp-cache rules.
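The effect of that saved pair can be modeled in a few lines. This is a toy model, not the BSD code: the class names, the `ttl` parameter, and the addresses are all made up for illustration.

```python
class ArpCache:
    """Toy ARP cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, lookup):
        self.ttl = ttl
        self.lookup = lookup         # stands in for a real ARP exchange
        self.entries = {}            # ip -> (mac, time learned)

    def resolve(self, ip, now):
        hit = self.entries.get(ip)
        if hit is None or now - hit[1] >= self.ttl:
            hit = (self.lookup(ip), now)
            self.entries[ip] = hit
        return hit[0]

class Interface:
    """Output path that remembers the last (ip, mac) pair it resolved."""

    def __init__(self, cache):
        self.cache = cache
        self.last_ip = None
        self.last_mac = None

    def mac_for(self, ip, now):
        if ip == self.last_ip:
            return self.last_mac     # shortcut: cache and its time-out are bypassed
        self.last_ip = ip
        self.last_mac = self.cache.resolve(ip, now)
        return self.last_mac

network = {"10.0.0.1": "aa:aa", "10.0.0.2": "cc:cc"}
iface = Interface(ArpCache(ttl=60, lookup=lambda ip: network[ip]))

first = iface.mac_for("10.0.0.1", now=0)
network["10.0.0.1"] = "bb:bb"                    # server fails over to a new MAC
stale = iface.mac_for("10.0.0.1", now=1000)      # long past the cache time-out
iface.mac_for("10.0.0.2", now=1001)              # traffic to a second address
fresh = iface.mac_for("10.0.0.1", now=1002)
```

Talking to a second address is what resets the saved pair, which is why the workaround above involves two different ip-addresses.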
• no single point of failure disk situation ... to effectively handle
this two disks with mirrored data and (at least) two controllers (if
not 4) are required. MTBF is typically much less for rotating
mechanical media so redundant disks are more of an issue than
controller electronics for handling various failure modes and
affecting aggregate system MTBF. dual-disk controllers are not for the
same machine ... but so that the same disk can be attached to
different machines (handling software, processor complex failures).
Handling disk failures requires mirrored disks (or
no-single-point-of-failure RAID attachments) each with their own pair
of controllers.
• no single point of failure also needs to handle some pathological
conditions. one of the best-known is the "stalled-processor" scenario.
there is some protocol that all processors agree on that
is used to establish the "right to do a disk write" ... one of the
processors obtains the "disk write privilege" and stalls. The other
processors in the complex decide that the processor is dead and
reconfigure the complex (backing out and removing the "dead"
processor from the configuration). The "stalled" processor comes back
to life and attempts to finish the write operation. To handle this
scenario requires more than just a loosely-coupled distributed
protocol and a reconfiguration protocol ... but also requires a
"fencing" mechanism that is free of race conditions.
these are fairly easy ... lets try a Hippi or FCS switch
configuration. It may require two such switches for
no-single-point-of-failure. It also requires mirrored disk or
no-single-point-of-failure RAID controllers that are at least
dual-ported (one to each switch). Two processors each thinking the
other has died ... both of them attempt to reconfigure, and the very
first thing in reconfiguration is to fence out the (other) "failed"
processor. The fencing must be done at both switches and must be done
in such a way that it is race-condition and deadlock free.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Newsgroups: comp.arch.storage
Subject: Re: Dual-ported disks?
Date: Wed, 30 Mar 1994 15:02:31 GMT
for 2-way solution ... various disk "reserve" commands fence out
the other processor ... but the architecture doesn't scale ...
the typical device "reserve" semantics say lock-out everybody but
"me". The "fencing" semantics require that only the "presumed" failed
processor(s) be fenced. For 2-way the effects are the same. For
>2-way reserve and fencing don't have the same semantics.
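The difference between the two semantics can be shown with sets of processors that keep access to the disk. The function and processor names here are hypothetical; this models only the access outcome, not any real device command.

```python
def access_after_reserve(processors, holder):
    """Reserve: everybody except the reserving processor is locked out."""
    return {holder}

def access_after_fence(processors, presumed_failed):
    """Fence: only the presumed-failed processor(s) are locked out."""
    return set(processors) - set(presumed_failed)

two_way = {"A", "B"}
four_way = {"A", "B", "C", "D"}

# A decides that B has failed.
same_for_two = access_after_reserve(two_way, "A") == access_after_fence(two_way, {"B"})
reserve_four = access_after_reserve(four_way, "A")
fence_four = access_after_fence(four_way, {"B"})
```

With four processors, the reserve leaves only A with access while the fence leaves A, C and D running, which is the scaling problem described above.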
I attended some HiPPI meetings in the late '80s (before DEC complained
about its name and got it changed) advocating fencing in the switch
architecture. They were going to do it ... but I haven't followed that
in a long time. Pairs of such switches make it a little bit harder to
make sure configuration is "race" free. FCS is another matter. Anybody know
if fencing also got into FCS switch?
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: CP/67 & OS MFT14
Organization: NETCOM On-line services
Date: Sun, 3 Apr 1994 17:51:11 GMT
In response to various inquiries, attached is report that I
presented at the fall '68 SHARE meeting (Boston?). CSC had installed
CP/67 at our university in January '68. We were then part of the
CP/67 "announcement" that went on at the spring '68 SHARE meeting (in
Houston).
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
OS Performance Studies With CP/67
OS MFT 14, OS nucleus with 100 entry trace table, 105 record
in-core job queue, default IBM in-core modules, nucleus total
size 82k, job scheduler 100k.
HASP 118k Hasp with 1/3 2314 track buffering
Job Stream 25 FORTG compiles
Bare machine Time to run: 322 sec. (12.9 sec/job)
times Time to run just JCL for above: 292 sec. (11.7 sec/job)
Orig. CP/67 Time to run: 856 sec. (34.2 sec/job)
times Time to run just JCL for above: 787 sec. (31.5 sec/job)
Ratio CP/67 to bare machine
2.65 Run FORTG compiles
2.7 to run just JCL
2.2 Total time less JCL time
1 user, OS on with all of core available less CP/67 program.
Note: No jobs run with the original CP/67 had ratio times higher than
the job scheduler. For example, the same 25 jobs were run under WATFOR,
where they were compiled and executed. Bare machine time was 20 secs.,
CP/67 time was 44 sec. or a ratio of 2.2. Subtracting 11.7 sec. for
bare machine time and 31.5 for CP/67 time, a ratio for WATFOR less
job scheduler time was 1.5.
I hand built the OS MFT system with careful ordering of
cards in the stage-two sysgen to optimize placement of data sets,
and members in SYS1.LINKLIB and SYS1.SVCLIB.
MODIFIED CP/67
OS run with one other user. The other user was not active, was just
available to control amount of core used by OS. The following table
gives core available to OS, execution time and execution time ratio
for the 25 FORTG compiles.
CORE (pages)    OS with Hasp        OS w/o HASP
    104         1.35 (435 sec)
     94         1.37 (445 sec)
     74         1.38 (450 sec)      1.49 (480 sec)
     64         1.89 (610 sec)      1.49 (480 sec)
     54         2.32 (750 sec)      1.81 (585 sec)
     44         4.53 (1450 sec)     1.96 (630 sec)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
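Several of the per-job times and ratios in the report above can be re-derived from its raw numbers. The short check below uses only figures quoted in the report; the helper names are made up.

```python
JOBS = 25

def per_job(total_sec):
    """Seconds per job for the 25-job stream, rounded as in the report."""
    return round(total_sec / JOBS, 1)

def ratio(cp67_sec, bare_sec):
    """Elapsed time under CP/67 relative to the bare machine."""
    return cp67_sec / bare_sec

bare_total, bare_jcl = 322, 292      # bare machine: full run, JCL only
orig_total, orig_jcl = 856, 787      # original CP/67: full run, JCL only

full_ratio = ratio(orig_total, bare_total)     # about 2.66 (reported as 2.65)
jcl_ratio = ratio(orig_jcl, bare_jcl)          # about 2.70
# WATFOR: 20 sec bare, 44 sec under CP/67, less the job scheduler time
watfor_ratio = ratio(44 - 31.5, 20 - 11.7)     # about 1.5
# Modified CP/67 with 104 pages of core: 435 sec
modified_ratio = ratio(435, bare_total)        # about 1.35
```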
MISC. footnotes:
I had started doing hand-built "in-queue" SYSGENs starting with MFT11.
I would manually break all the stage2 SYSGEN steps into individual
components, provide "JOB" cards for each step and then effectively run
the "stand-alone" stage2 SYSGEN in the standard, production job-queue.
I would also carefully reorder the steps/jobs in stage2 (as well as
reordering MOVE/COPY statements for PDS member order/placement) so as
to appropriately place data on disk for optimal disk arm-seek
performance.
In the above report, the "bare-machine" time of 12.9 sec/job was
typically over 30 seconds/job for a MFT14 built using the standard
"stand-alone" SYSGEN process (effectively an increase in arm-seek elapsed
time). Also, the standard OS "fix/maintenance" process involved
replacing PDS-members which resulted in destroying careful member
placement. Even with an optimally built system, "six months" of OS
maintenance would result in performance degrading to over 20 secs/job.
A non-optimal built OS system actually would make CP/67 performance
look "better" (i.e. ratio of CP/67 times to "bare-machine" times).
CP/67 overhead (elapsed time increase) was proportional to simulation
activity for various "kernel" activities going on in the virtual
machine. I/O elapsed time was not affected by running under CP/67.
Keeping the simulation overhead fixed, but doubling (or tripling) the
elapsed time with longer I/O service time would improve the
CP/67/bare-machine ratios.
The modified CP/67 was based on numerous pathlength performance
changes that I had done between Jan of 1968 and Sept of 1968, i.e.
reducing CP/67 elapsed time from 856 sec. to 435 secs (a reduction in
CP/67 pathlength CPU time from 534 secs to 113 secs).
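The point about a fixed overhead can be made concrete with the figures above. This is a sketch assuming elapsed time under CP/67 is simply bare-machine time plus a CP/67 overhead that does not depend on I/O service time. The 750 sec figure is an illustrative 30 sec/job for 25 jobs, not a measurement from the report.

```python
def cp67_ratio(bare_sec, overhead_sec):
    """Ratio of CP/67 elapsed time to bare-machine elapsed time."""
    return (bare_sec + overhead_sec) / bare_sec

original = cp67_ratio(322, 534)     # 856 / 322, the original CP/67
modified = cp67_ratio(322, 113)     # 435 / 322, after the pathlength work
# Same 113 sec of overhead on a slower, non-optimally built OS
# (illustrative 30 sec/job * 25 jobs = 750 sec on the bare machine):
slow_build = cp67_ratio(750, 113)
```

The overhead is unchanged in the last case, but the ratio drops from about 1.35 to about 1.15 because the bare-machine time it is compared against has grown.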
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.arch.storage
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: Re: Dual-ported disks?
Date: Sun, 3 Apr 1994 18:06:30 GMT
a minor example regarding clusters & availability was a hypothetical
situation involving 1minute of down-time per year.
various hardware fault-tolerant solutions would provide the hardware
availability but various systems investigated had done nothing about
various mundane aspects of system operation. One was installing a
new version of the operating system ... requiring a minimum of a 1
hour outage. With a system upgrade on the order of one per year, each
year's upgrade used up the equivalent of 60 years of the 1-minute
downtime allowance.
clusters handled the opportunity by each processor complex having its
own private system disks. individual processors could be removed from
the complex (w/o taking down the service) and upgraded, tested and
then restored to production operation.
at least in that sort of distributed, server environment, cluster
operation could mask both hardware and software outages (scheduled and
unscheduled).
To the extent that the fault-tolerant vendors invest in
handling/masking the software downtime scenario ... they are also
providing the ability for software to also mask hardware failures.
Clustering isn't trivial SMOP ... but neither has been the redundant
array of inexpensive disk efforts ... but the benefits are similar.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers,comp.arch
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: Re: CP/67 & OS MFT14
Date: Sun, 3 Apr 1994 19:45:58 GMT
note that in later years in order to emulate/virtualize various
"relocate" operating systems (DOS/VS, VS1, SVS, MVS, etc), CP had to
(effectively) emulate the TLB (table look aside buffer). CP had two
basic components ... 1) virtual machine real hardware emulation 2)
shared resource management. For #1, CP had to implement a soft-analogy
of many hardware states/functions/capabilities.
Virtualizing a "relocate" operating system provided an interesting
challenge. All the state things were relatively step-by-step
and straightforward.
However, LRU replacement algorithms presented an interesting
challenge. Many operating systems implement various types of LRU
virtual page replacement algorithms based on some observed
generalities about program behavior ... i.e. pages that haven't been
used in a long while are least likely to be used in the near future.
A virtualized, relocate operating system however can easily violate
that premise. From a virtual relocate operating system standpoint,
what it believes to be "real memory" is actually what CP manages as
virtual memory. While the virtual relocate OS is running thru its
"real memory" looking for virtual pages to replace and re-use the
"real page" location ... it is actually running thru CP's virtual
memory specifically searching for the least used page to be the very next
page to use.
In effect a virtualized LRU page replacement algorithm actually
exhibits behavior the opposite of the underlying LRU assumptions ...
instead of the least-used page being the least-likely to be used next
... the least used page is just the opposite ... it is the one that is
most likely to be used next.
Therefore LRU page replacement algorithms don't recurse (&/or
virtualize) gracefully (a virtualized LRU page replacement algorithm
violates underlying basic program execution & page-reference pattern
assumptions).
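A small simulation shows the effect. Everything in it is an assumption made for illustration: a guest doing strict LRU over 8 "real" frames, a host doing strict LRU over those same 8 guest frames with only 6 frames of its own, and a synthetic workload of 5 hot pages plus a brand-new cold page on every tenth reference.

```python
from collections import OrderedDict

class LRU:
    """Strict LRU over a fixed number of frames."""

    def __init__(self, nframes):
        self.nframes = nframes
        self.slots = OrderedDict()       # key -> frame number, oldest first

    def access(self, key):
        """Touch key; return (frame number, whether the access faulted)."""
        if key in self.slots:
            self.slots.move_to_end(key)
            return self.slots[key], False
        if len(self.slots) < self.nframes:
            frame = len(self.slots)                      # still filling frames
        else:
            _, frame = self.slots.popitem(last=False)    # reuse the LRU frame
        self.slots[key] = frame
        return frame, True

def run(steps=1000, guest_frames=8, host_frames=6, hot=5):
    guest, host = LRU(guest_frames), LRU(host_frames)
    stats = {"guest_faults": 0,
             "host_faults_on_guest_fault": 0,
             "host_faults_on_guest_hit": 0}
    hot_next = 0
    for t in range(steps):
        if t % 10 == 9:
            vpage = ("cold", t)                  # never referenced before
        else:
            vpage = ("hot", hot_next % hot)
            hot_next += 1
        frame, guest_fault = guest.access(vpage)
        _, host_fault = host.access(frame)       # host pages the guest's frames
        if guest_fault:
            stats["guest_faults"] += 1
            stats["host_faults_on_guest_fault"] += host_fault
        else:
            stats["host_faults_on_guest_hit"] += host_fault
    return stats

stats = run()
```

In this setup every frame the guest picks for reuse is one the host has already paged out, while references to the guest's resident pages never fault in the host. The guest's least-recently-used frame is exactly the one it touches next, as argued above.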
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom9.netcom.com (Lynn Wheeler)
Subject: 370 ECPS VM microcode assist
Date: Sun, 3 Apr 1994 22:55:56 GMT
In May of '75, some people from the Endicott programming lab came to
Cambridge looking for advice as to microcode acceleration for a new
machine they were building. They had some available microcode store on
the machine and they were looking at things to "sink-into" the
hardware to improve system performance. Likely candidate was kernel
pathlengths.
I got together with Bob Creasy and we built an instrumented kernel and
ran various tests ... accumulating the following profile as to the
(then) kernel pathlength behavior.
The instrumentation inserted events to create time-stamp records at
various points in the code. At the start of the benchmark, the
time-stamp process was looped 10,000 times to calibrate the
time-stamping overhead.
The resulting data was reduced pairing up various time-stamp records
to account for functional elapsed time between the time-stamps (minus
the fixed overhead of doing the time-stamp).
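The calibrate-then-subtract approach can be sketched as follows. This is only an analogy in Python using `time.perf_counter_ns`. The record layout, the function names, and the sample labels and timings are assumptions, not the original kernel instrumentation.

```python
import time

def calibrate(loops=10_000):
    """Estimate the cost of taking one time stamp, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(loops):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - start) / loops

def reduce_pairs(records, overhead):
    """Pair consecutive (label, timestamp) records and accumulate, per path,
    the call count and the elapsed time less the time-stamp overhead."""
    totals = {}
    for (a, t0), (b, t1) in zip(records, records[1:]):
        path = f"{a} to {b}"
        count, total = totals.get(path, (0, 0.0))
        totals[path] = (count + 1, total + max(0.0, (t1 - t0) - overhead))
    return totals

# Made-up trace: two passes through the same path with a 10-unit overhead.
trace = [("dsp+4", 0), ("dsp+214", 110), ("dsp+4", 300), ("dsp+214", 420)]
profile = reduce_pairs(trace, overhead=10)
```

Dividing each path's accumulated time by the total over all paths would then give a percentage column like the one in the profile.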
The following are the results from that first run which were provided to
Endicott for selecting kernel functions for migration to "hardware";
there were 6000 bytes of microcode space available for
sinking CP kernel function into the hardware. The "79.55" accumulated
percentage represents the approximate equivalent of 6000 bytes of 370
machine code.
"path" is the three-character module name (w/o the "DMK" prefix) and
the byte displacement within the module.
path                      count    time (mics)    percent cp
dsp+8d2 to dsp+c84 67488 374. 9.75
from 'unstio' end to enter problem state
prg+56 to prv+46 69848 232 6.27
from prog. interrupt to priv. simulation
ccw+33e to ccw+33e 64868 215 5.38
loop in ccw calling page lock
fre+5a8 73628 132 3.77
'FRET'
ccw + f4 to ccw +33e 45297 213 3.73
from initial 'FREE' call to page lock call
dsp+4 to dsp+214 84674 110 3.61
main entry to start of 'unstio'
ptr+a30 124502 75 3.59
unlock page
ccw + 33e to '3' 44839 207 3.58
from lock page to ticscan return
ios+20 19399 474 3.55
dmkiosqv (before alternate path finding)
fre+8 73699 122 3.47
FREE
IOS+1c2 to DSP+4 27806 208 2.23
call SCN(real) until DSP entry (after I/O int)
dsp+4 to dsp+c84 15105 374 2.18
asysvm entry until enter prob state
sch+4 23445 221 2.00
ios+108 to ios+1c2 27952 165 1.78
I/O interrupt to call scn(real)
scn+84 84359 54 1.76
dsp+93a to dsp+c84 11170 374 1.62
sch call to entry problem mode
prv+46 to dsp+b8 20976 199 1.61
non-I/O priv. instruction to new psw DSP entry
ccw+1252 to EXIT 26212 156 1.58
ticscan return to exit
vio+13a to ccw+0 19405 191 1.43
v.sio, ioblok free call until ccwtran call
vio+1d0 to ios+20 19399 181 1.36
ccwtran return to DMKIOSQV call
ios+0 8423 416 1.35
DMKIOSQR
vio+3e to VIO+13a 19405 169 1.27
vio entry(for sio) to 'FREE' call
dsp+214 to dsp+8d2 70058 45. 1.21
'unstio' with no calls
vio+992 to unt+5a 19410 157. 1.17
ccw+28a to fa (via FREE) 26140 107 1.08
ticscan return till loop back for next block
unt+9e to 116 (FRET) 44694 60. 1.03
unt+9e to 9e (PTR+A30) 65092 38 .97 (79.55 cum.)
unt+116 to exit 19407 118 .89
from FRET call to EXIT
vio+4 to 3e (SCN+84) 45240 49 .86
vio entry until scan call for vdevblok
vio+3e to dsp+4 25504 86. .685
from SCN call to DSP (non-SIO)
SCN+4 27979 69 .75
real I/O scan (most IOS+1c2)
dsp+214 to 4ce (SCN+84) 14637 126. .72
'unstio' until scn call
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: CP spooling & programming technology
Date: Tue, 5 Apr 1994 16:31:59 GMT
This is part of a CFP announcement that I broadcast in Dec, 1981 for
an advanced technology conference (that I ran in March 1982):
TOPICS
• High level system programming language
• Software development tools
• Distributed software development
• Migration of CP functions to virtual address spaces
• Migration to non-370 architectures
• 370 simulators
• Dedicated, end-user system
The objective of the conference was to address the rate at which the
existing product could adapt to hardware and other environmental
changes (i.e. the technology rate of change was increasing and the
software technology was not able to track that rate of change, nor the
increases in the rate of change ... right out of Boyd's OODA-loop).
In some sense this was an attempt to respond to the "UNIX"
opportunity. At the time (and to some extent still), the UNIX
operating system wasn't competitive other than its characteristic to
be adaptable to different & changing environments (hardware,
architectures, requirements, etc). The conference included a UNIX paper.
At the time, we had a dearth of advanced technology conferences. The
prior one had been held six years previously. At that conference our
16-way MP project was on the agenda as well as the "801" project.
The period between the CFP and the conference was also the period
during which I was conducting the REXX/DUMPSCAN demonstration (see
prior posting).
The "migration of CP functions to virtual address spaces" was
effectively to restructure CP into even more of a micro-kernel than it
already was. At the time, the CP kernel consisted of around 190 source
modules and 250k machine instructions ... all operating within a
single protection domain.
One of the "migration" demonstrations that I did was the CP spooling
function ... re-implementing it in PASCAL (I didn't have a 370 C
compiler at the time), migrating most of the function to a virtual
address space, and extending/improving a lot of the function.
During the early to mid-80s, I was also running a skunk-works project
that I called HSDT (high-speed data transport). HSDT had a deployed
pilot with a number of high-speed terrestrial and satellite links
(HSDT included designing/deploying a high-speed digital TDMA satellite
system and a double-hop digital broadcast system). For the project I
also implemented various drivers for both "bitnet" and "ip" protocols.
For "bitnet" throughput, the CP spooling system represented a serious
performance limitation when driving high-speed links (or even driving
lots of low-speed links). The "bitnet" file transfer protocol was
store-and-forward with nodes using the CP spooling system for
intermediate storage. The CP spooling system used a synchronous, per
process serialized 4kbyte-block transfer semantics. The "bitnet"
protocol operated as a single process (independent of the number of
links being driven). Under heavy load, the spooling interface might
limit the "bitnet" process to 15 4kbyte-block transfers/second (along
with holding the "bitnet" process the majority of the time in blocked
state). Also, because of the spooling system's transaction logging at
"file" boundaries, if lots of "small" files were being processed, the
throughput would drop even lower.
The "rewritten" spooling subsystem eliminated nearly all of the
throughput bottlenecks, providing asynchronous interface semantics along
with read-ahead, write-behind, and contiguous block allocation (along
with multi-block reads/writes) ... while still preserving
file-boundary transaction semantics. One of the "harder" parts of the
implementation was preserving the standard CP on-disk record format
("bitnet" protocol didn't actually transfer user files, it transferred
encapsulated CP spool disk records, and a large part of the HSDT pilot
traffic originated from and was targeted for standard systems).
There was also an attempt (in the "bitnet" protocol) to close a couple
small failure-mode windows that could result in the same file being
transferred more than once.
Another opportunity that presented itself for both the "bitnet" and
"ip" protocols was the bandwidth*delay products over the high-speed
satellite links. They only were on the order of 50-100 4kbyte blocks
(not quite today's NII opportunity with coast-to-coast terrestrial
1gbit fiber hitting on the order of 800 4kbyte blocks). However, it
was a formidable opportunity especially with lots of bursty traffic. We
had lots of the problems that have since been documented regarding
adaptive slow-start windows in bursty environments with large
bandwidth*delay products. Addressing the opportunity required
implementing "bitnet" and "ip" adaptive rate-based pacing (and once
rate-based pacing was established it then became possible to start
looking at other types of rate-based algorithms like fair-share).
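The bandwidth*delay figures can be reproduced with a little arithmetic. The round-trip times below are assumptions chosen for illustration (roughly 26 ms coast-to-coast on fiber, roughly 0.5 sec for a geosynchronous satellite hop), and a 4kbyte block is taken as 4096 bytes. None of these values comes from the posting. The pacing function shows only the basic idea of rate-based sending, not the adaptive algorithm that was implemented.

```python
BLOCK_BITS = 4096 * 8      # one 4kbyte block, assumed to be 4096 bytes

def blocks_in_flight(bits_per_sec, round_trip_sec):
    """Bandwidth*delay product expressed in 4kbyte blocks."""
    return bits_per_sec * round_trip_sec / BLOCK_BITS

def rate_for_window(blocks, round_trip_sec):
    """Link rate (bits/sec) at which this many blocks fill the pipe."""
    return blocks * BLOCK_BITS / round_trip_sec

def pacing_interval(bits_per_sec):
    """Rate-based pacing: seconds between block transmissions at a target rate."""
    return BLOCK_BITS / bits_per_sec

fiber = blocks_in_flight(1e9, 0.026)       # roughly 800 blocks at 1gbit
sat_low = rate_for_window(50, 0.5)         # about 3.3 mbit/sec
sat_high = rate_for_window(100, 0.5)       # about 6.6 mbit/sec
gap = pacing_interval(sat_low)             # 10 msec between blocks
```

With rate-based pacing the sender spaces blocks by an interval like `gap` instead of waiting for a window to open, so a burst does not depend on acknowledgements that take a full round trip to come back.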
The design of the packet-boundary encryption mechanism was also
interesting and may have caused some heart-burn.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom11.netcom.com (Lynn Wheeler)
Subject: Re: CP spooling & programming technology
Date: Thu, 7 Apr 1994 15:53:25 GMT
With respect to some questions regarding HSDT, I was using a number of
things ... but one set of hardware was NSC HYPERchannel adaptors ...
both for some of the long-haul/WAN interfaces but also for some local
intra-cluster transport between local processors (I wish we had them
available in the late '70s doing the cluster of eight MPs in the
shared disk, single-system-image complex).
Slightly prior to HSDT (and slightly after the 8 2-way cluster work),
I had gotten involved designing/implementing remote device support
over HYPERchannel. The initial project was to "remote" some 300
"overflow" people from the IMS group to another site ... while still
providing "local" access. For the most part this was local 3270
controllers ... but also some channel attached unit record gear.
At the time, the remote-device support from NSC mapped a single remote
device subchannel address to a local A220 subchannel address ...
downloading the local 370 channel command sequence to the A510 for
execution. This represented too much inefficiency for me. One or two
A220s easily had adequate performance to handle the job, I didn't need
5-6 of them just to provide a one-to-one subchannel mapping for the
remote devices. Besides, the remote site was connected by a T1
microwave link (with two HYPERchannel A710s driving the interface).
Another limitation of their implementation was that the 3272/3274
controllers only handled a single operation at a time, while the A220s
were high-performance burst controllers. With the one-for-one mapping
design, the A220 would actually be identified as a 327x controller and
the operating system would only schedule a single operation at a time
... while the "real" A220 could have simultaneous I/O scheduled on
all subchannels.
In any case, I designed a brand-new implementation where I dynamically
scheduled/allocated each A220 subchannel address on a per operation
basis.
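Per-operation allocation of a small set of local subchannel addresses can be sketched as a simple pool. The class, method names, subchannel addresses and device names are all hypothetical; this illustrates the idea only, not the actual driver.

```python
from collections import deque

class SubchannelPool:
    """Hand out local subchannel addresses one operation at a time."""

    def __init__(self, addresses):
        self.free = deque(addresses)
        self.busy = {}               # local subchannel -> remote device in flight
        self.waiting = deque()       # remote devices queued for a subchannel

    def start(self, remote_device):
        """Begin an operation; return the subchannel used, or None if queued."""
        if not self.free:
            self.waiting.append(remote_device)
            return None
        sub = self.free.popleft()
        self.busy[sub] = remote_device
        return sub

    def complete(self, sub):
        """Finish an operation; return the queued device that inherits the
        subchannel, or None if the subchannel went back to the free list."""
        del self.busy[sub]
        if self.waiting:
            nxt = self.waiting.popleft()
            self.busy[sub] = nxt
            return nxt
        self.free.append(sub)
        return None

pool = SubchannelPool([0x20, 0x21])      # two local addresses, many remote devices
a = pool.start("3270-A")
b = pool.start("3270-B")
c = pool.start("3270-C")                 # no subchannel free, so it is queued
handed_to = pool.complete(a)
```

With a one-to-one mapping the number of local addresses has to match the number of remote devices; with a pool it only has to match the number of operations in flight at once.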
This uncovered a problem with the A710s. In most of the typical
"remote" configurations with a single 3270 controller at a remote
location and the original implementation ... there would never be more
than one I/O operation in-flight at a time. It turns out that while
the T1 link is full-duplex ... there was some portion of the A710
adaptor that effectively operated half-duplex. It wasn't uncommon for
me to be scheduling 10-15 simultaneous operations resulting in
high-probability that there would be requests for simultaneous data
transfer in both directions. This caused me all manner of problems
until I put in some dynamic adaptive restrictive code to really
back-off on the number of simultaneous operations in-flight at any one
moment (NSC eventually had to replace the 710s with 715s, there was also
the full-duplex 720 satellite link adapters).
Eventually all the kinks were worked out and it turned out that the
remote users didn't see any degradation in response on their local
3270s. There was however an interesting side-benefit with overall
system performance increasing by 10-15%. Prior to remote'ing the
3270s, the controllers were evenly spread across all the channels,
sharing them with disk controllers. By high-performance standards, the
3270 controllers transferred data slower than disks (more channel busy
time) and had relatively slower controller electronics (lots of slow
channel hand-shaking and dramatically more channel busy time). Having
the 3270 controllers on the same channels with the disks was really
impacting the channel availability time for disk activity.
Remote'ing the 3270 resulted in:
1) the 3270 hand-shaking busy time becoming the "problem" of the
remote channel emulation of the A510 adapter (and was masked from the
mainframe)
2) the actual data-transfer channel busy time happened at the A220
1.5mbyte burst rate
3) the A220 had very low hand-shaking busy overhead.
4) "compressing" all 3270 activity down into a single channel
Subsequently there were some performance advisories regarding not
configuring 3270 controllers and disk controllers on the same channel.
The installation was duplicated in a number of locations. One provided
remoting several hundred people in the Boulder field support group.
The computer complex building and the new programmer's building were
relatively close but on opposite sides of the interstate. Infrared T1
modems were used to connect the computer complex to the programmers.
Originally, it was thought that the modems would be prone to high BER
during rain and heavy fog. It turned out that the only time that BER
got to be a problem was during a snow storm that was so heavy nobody
could get into work. However, the infrared modems did have an
alignment problem on sunny days. The modems were on poles on the top
of the buildings. During the course of the day, the sun would unevenly
heat different sides of the building ... which resulted in appreciable
movement in the modems (at the top of the poles). The poles had to be
relocated and the modems slightly defocused.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: alt.folklore.computers
From: lynn@netcom3.netcom.com (Lynn Wheeler)
Subject: Re: CP spooling & programming technology
Date: Thu, 7 Apr 1994 17:17:25 GMT
There was a "funny" glitch in my 80/81 re-implementation of the remote
device support. For various types of A510 errors (basically A510
emulates a mainframe channel and allows direct attachment of mainframe
channel controllers), I would map the error back into a simulated
channel check for the device (logically the A510 error was the
equivalent of a logical channel error).
Some 8-10 years later I got a call from somebody who monitors the
industry quality reporting information (i.e. there is a company that
gathers from lots of installations the mainframe error reporting
information and produces reports by manufacturer and model). It turned
out that the machine/model this person was associated with was showing
up with an unexpected (alarming?) number of channel errors in the
reports.
It turned out that the installations involved had NSC HYPERChannel
remote device support and the channel check errors were really coming
from the A51x remote device adapters and then the software driver was
reflecting them back to the operating system as simulated channel
check interrupts. Most of the errors weren't even really hardware.
The mainframe "channel" protocol is basically synchronous. With the
A51x operation there is an attempt to simulate this synchronous-bus
behavior over a network using a basically asynchronous message protocol.
Sporadically (even w/o A710s) a race condition would result and there
is no choice but to abort and get the system to redrive the operation.
Turns out that switching to reporting/simulating IFCC (interface
control check, instead of channel check) would logically kick-off the
same operating system redrive/recovery actions ... and not have the
(mostly communication/race-condition) errors show up as a black mark
in the industry reliability summary reports.
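A minimal sketch of that switch (Python, hypothetical names; the
adapter error labels are made up, and the assumption that both
conditions drive the same redrive comes from the description above):
every remote-adapter error is reflected as one simulated condition, and
only the choice of condition changes what an error-report tally sees.

```python
from collections import Counter
from enum import Enum


class Simulated(Enum):
    CHANNEL_CHECK = "channel check"
    IFCC = "interface control check"


def reflect(adapter_errors, use_ifcc):
    """Map each remote-adapter error to the condition shown to the system.

    Returns (reflected, tally). Either condition is assumed to start the
    same redrive/recovery; only the reported category differs.
    """
    condition = Simulated.IFCC if use_ifcc else Simulated.CHANNEL_CHECK
    reflected = [(err, condition) for err in adapter_errors]
    tally = Counter(cond for _, cond in reflected)
    return reflected, tally


errors = ["race-abort", "link-timeout", "race-abort"]  # hypothetical labels
_, before = reflect(errors, use_ifcc=False)
_, after = reflect(errors, use_ifcc=True)
print(before[Simulated.CHANNEL_CHECK], after[Simulated.CHANNEL_CHECK])
```

The same three errors are still redriven in both cases; with IFCC
selected, none of them is counted against the channel-check column.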
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
From: lynn@netcom7.netcom.com (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: CP spooling & programming technology
Date: Thu, 7 Apr 1994 20:55:37 GMT
One of the biggest problems that the HSDT project had with the satellite
links was getting permits from various local boards. In one case where
a 4.5meter dish was going up about a half-mile from some residential
housing, a number of residents showed up at a hearing to
complain about the "radiation" dangers that it represented to them,
their children and their pets.
Now the TDMA gear had a 25watt transmitter that nominally ran around
7watts (each station monitored its own rebroadcast signal strength and
could automatically up the power-budget in situations like rain-fade
... this was Ku-band). Furthermore, it was a very focused, relatively
tight-beamed transmission ... going straight up ... with nearly zero
side radiation.
However, all that had no effect. Finally had to do the calculations
that if somebody was suspended directly above the dish in the focused
transmission beam they would receive less radiation than they were
currently getting at home from the local 50,000 watt FM station.
--
Lynn Wheeler | lynn@netcom.com, lhw@well.sf.ca.us
Newsgroups: comp.infosystems.interpedia
From: lynn@netcom7.netcom.com (Lynn Wheeler)
Subject: Misc. more on bidirectional links
Date: Thu, 7 Apr 1994 20:46:50 GMT
One of the areas for bi-directional links is in some stuff with things
like domain-specific ontologies.
One example is the NLM's UMLS (unified medical language system)
meta-thesaurus. In attempts to address the opportunity of queries
against really massive information bases, they've developed a
constrained language classification system for much of the medlars
stuff (in addition to the "online catalogue/abstracts" having all the
words in the various fields indexed, the entries are also classified
using the constrained knowledge/concepts).
The UMLS is available on CDROM and consists of (just the term
statistics, not including the inter-term relationships, definitions,
etc):
30,123 MeSH (16,760 preferred terms; 130,482 supplemental chemical
terms)
23,495 INSERM French translation of MeSH (Main headings and
French Synonyms)
12,495 SNOMED II (6,971 preferred terms)
21,293 ICD-9-CM terms (13,119 preferred terms)
5,595 CRISP (4,285 preferred terms)
5,094 LCSH (5,094 preferred terms)
2,619 COSTART (1,179 preferred terms)
1,511 COSTAR (1,511 preferred terms)
905 NIC (336 preferred terms)
776 AI Rheum (687 preferred terms)
604 Neuronames (604 preferred terms)
603 DXPlain (603 preferred terms)
450 DSM 3R (263 preferred terms)
557 CPT (210 preferred terms)
100 NANDA (99 preferred terms)
122 ACR (122 preferred terms)
112 UMDNS (112 preferred terms)
This is about 500mbytes in "relational ascii" form.
In this usage, there is a "preferred" classification term that is
used in classification/indexing entries. The "preferred" term also
points at all its synonyms (as well as all its synonyms pointing at
it). It is possible to go both directions ... users can enter queries
using non-preferred words/terms and they are automatically translated
into the correct classification term for lookup. In addition, the