List of Archived Posts
1995 Newsgroup Postings
- old mainframes & text processing
- pathlengths
- pathlengths
- Why is there only VM/370?
- What is an IBM 137/148 ???
- SMP, Sequent Computer Systems, and software
- 1401 overlap instructions
- Who started RISC? (was: 64 bit Linux?)
- 801
- Who built the Internet? (was: Linux/AXP.. Reliable?)
- 3330 Disk Drives
- 2nd wave?
- atomic load/store, esp. multi-CPU
- Cache and Memory Bandwidth (was Re: A Series Compilers)
- Virtual Memory (A return to the past?)
- 801 & power/pc
- Crashproof Architecture
- slot chaining
- SSA
- characters
- multilevel store
old mainframes & text processing
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Newsgroups: alt.os.multics,alt.folklore.computers
Date: 15 Jan 1995 23:48:11 GMT
Subject: Re: old mainframes & text processing
The original method for CMS handling commands was/is sorta funky.
Everything was tokenized into 8 character units concatenated together
and then a system call was made. It turns out that this was the
procedure for kernel system calls or commands ... or just about
anything. The kernel would then attempt to resolve the first
8-character unit as an exec (aka shell script) in search path, binary
executable in search path, or kernel function.
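A minimal sketch of the tokenize-and-resolve flow described above, in Python. It is not CMS code: truncating or blank-padding each token to 8 characters, the function names, and the example command names are all assumptions made for illustration. Only the 8-character units and the resolution order (exec, then binary, then kernel function) come from the description.

```python
UNIT = 8  # token width in characters, per the description above


def tokenize(command_line, width=UNIT):
    """Split on blanks, then truncate or blank-pad each token to `width`.

    Truncation and blank padding are assumptions made for this sketch.
    """
    return [token[:width].ljust(width) for token in command_line.split()]


def build_plist(command_line, width=UNIT):
    """Concatenate the fixed-width units into the string handed to the call."""
    return "".join(tokenize(command_line, width))


def resolve(plist, execs, binaries, kernel_functions, width=UNIT):
    """Resolve the first unit in the order given in the post:
    exec in search path, binary in search path, then kernel function."""
    name = plist[:width].rstrip()
    for kind, table in (("exec", execs),
                        ("binary", binaries),
                        ("kernel", kernel_functions)):
        if name in table:
            return kind, name
    return "unknown", name


# Hypothetical command names, purely for illustration.
plist = build_plist("listfile profile exec")
print(repr(plist))
print(resolve(plist, execs={"profile"}, binaries={"listfile"},
              kernel_functions={"erase"}))
```

The point of the sketch is that the same fixed-width string serves commands, execs and kernel calls alike, so every invocation pays for the general search.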
standard system has a standard command abbreviation table ... but
users could augment it with their own specification.
there were some tricks regarding invoking kernel calls directly from
command line (or exec/shell-scripts) by typing appropriate binary
data.
early on cms ran into some scaleup issues and started doing sorts on
file directories and leaving around a status bit indicating whether
the directory was sorted or "dirty". If sorted ... simple filename
search was performed with binary search (rather than linear). Also in
the early '70s, (for performance) cms added a 2nd type of kernel call
api ... which would only resolve to kernel functions (actually a
kernel function branch table) ... instead of always doing the
generalized search mechanism. lots of applications then got rebuilt
using macros mapped to the new performance implementation (this
somewhat reduced the requirement for the file directory sort).
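The sorted/"dirty" status bit can be sketched roughly as follows, in Python. The class and method names are hypothetical, and when the real system chose to re-sort is not stated above, so the explicit sort() call here is an assumption.

```python
import bisect


class Directory:
    """Toy file directory with a status bit saying whether it is sorted."""

    def __init__(self):
        self.names = []
        self.is_sorted = True  # an empty directory is trivially sorted

    def add(self, name):
        # Appending may break the ordering, so mark the directory "dirty".
        self.names.append(name)
        self.is_sorted = False

    def sort(self):
        self.names.sort()
        self.is_sorted = True

    def lookup(self, name):
        """Return the index of `name`, or -1 if it is not present."""
        if self.is_sorted:
            # Sorted: binary search.
            i = bisect.bisect_left(self.names, name)
            if i < len(self.names) and self.names[i] == name:
                return i
            return -1
        # Dirty: fall back to a linear scan.
        for i, candidate in enumerate(self.names):
            if candidate == name:
                return i
        return -1


d = Directory()
for n in ("profile", "listfile", "erase"):
    d.add(n)
print(d.is_sorted, d.lookup("erase"))  # dirty, found by linear scan
d.sort()
print(d.is_sorted, d.lookup("erase"))  # sorted, found by binary search
```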
The original exec(-1) (shell script) processor was somewhat more
sophisticated than dos command processing ... but was augmented in the
early '70s with "exec-2" and in the late '70s with rex (which turned
into rexx in the early '80s).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
pathlengths
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Subject: pathlengths
Newsgroups: comp.arch
Date: 10 Mar 1995 05:54:57 GMT
in my youth i was fascinated with taking 1000 instruction pathlength
and turning it into zero ... or some such thing ... i.e. reorder
several thousand lines of code so that functions that were high-use
critical path became a side-effect of executing other things in a
particular sequence. as a result i could typically page fault, page
schedule, schedule, task switch, page i/o complete, task switch back,
etc in 1/3rd to 1/20th path length of any comparable system. Downside
was that it could be maintenance nightmare ... 5 or 10 years later,
scheduler might start operating at less than optimal fair share because
of perturbation in various side-effects due to code changes to random
other places in the system.
370 virtual machine ... created a very clear-cut api for the
microkernel. also extensive instrumentation was constantly exposing
the cpu utilization of various parts of the system. between the
clear-cut, non-ambiguous api ... and system function definition
... along with a (somewhat subliminal) perception that any cpu
utilization by the "kernel" was bad ... there was constant striving to
force the kernel path length to zero ... a subculture that I don't
believe exists/existed in any other operating system (except possibly
some real time controllers). In most other systems, kernel cpu
utilization seems to be assumed to be part of the cost of running an
operating system; it could somewhat be dismissed and attention
refocused on other "more important" issues like gui interfaces.
another aspect of instrumentation was that I designed & executed a suite
of over 2000 benchmarks to validate the dynamic adaptive feedback
scheduler (for an extremely wide range of load & operational
environments, supporting fair share, non-fair share, dynamic adaptive
to the resource bottleneck, etc); i.e. change a process priority by a
single digit (i.e. nice'ing) would (always?) result in very predictable
change in resource consumption across a wide-range of configurations
and loads. It took me 3 months elapsed time to run the benchmark
validation suite ... in preparation for releasing what essentially was
just a 6k instruction product feature.
pathlengths
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Subject: pathlengths
Newsgroups: comp.arch
Date: 10 Mar 1995 17:18:30 GMT
actually i should qualify the 6000 instructions for the dynamic
adaptive resource manager. At the time it was decided that they
wanted me to release the resource manager I had been doing a
5-way smp project ... so when I bundled the 6000 instructions, it
actually consisted of:
1) lot of kernel restructure for smp
2) restructuring of kernel serialization to eliminate all cases
of zombie processes and elimination of all known cases of
kernel failures due to sequencing problems
3) bunch of fast path stuff
4) one of my page replacement algorithms from the '60s which
was also smp'ed
5) dynamic adaptive resource management
... when they got around to releasing smp in the regular product 85%
of the resource manager instructions were absorbed into the base
product ... leaving less than 1000 instructions in resource manager
feature.
... as another aside with regard to tss/360 ... I did a side project
in the early '70s that analyzed 360/370 program code and was run off
the assembler output. Around '84 (for nearly 10 years, the tss/370
project had been operating with cast of 10s rather than thousands and
the vm/370 group had done the opposite), I did structural comparison
of the tss/370 kernel against the vm/370 kernel (using my
application). By that time, tss/370 kernel was achieving a
succinctness and compactness that was more characteristic of the cp/67
kernel (and the vm/370 kernel had evolved into a much more complex
organization); i.e. even with the 370 virtual machine API firmly
implanted in everyone's minds ... strength of implementation focus
became diffused as organization grew.
... minor trivia question ... from a program analysis standpoint there
was a major difference between the output of the standard 360/370 "H"
assembler and the tss/370 assembler. In the "H" assembler, data-space
addresses weren't tagged so some structural analysis became somewhat
ambiguous. The tss/370 assembler provided a tagged identifier for each
data-space identifier ... removing a lot of ambiguity from structural analysis.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
Why is there only VM/370?
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Subject: Re: Why is there only VM/370?
Date: 1995/04/05
Newsgroups: comp.arch
supporting 2nd order paging ... there are actually two forces at work
here contributing to bad behavior ... the most obvious is that two
level paging is redundant ... the less obvious is that running a LRU
under a LRU violates the LRU assumption ... the 2nd level system
starts to exhibit MRU behavior rather than LRU behavior (i.e. LRU
assumes that the pages used the most recently are the ones most likely
to be used in the future, a 2nd level LRU algorithm can actually start
to exhibit inverse behavior, the least recently used page is the one
most likely to be selected next).
for interactive, the group somewhat controlled both above the line
(CMS) and below the line (CP) implementation and went a long way
towards eliminating/optimizing their modular behavior.
Some effort was also made on the VS1 operating system to eliminate
duplicate operations when running 2nd level.
However the MVT->SVS->MVS genre had little such work done. A big hit
came in the MVT->SVS transition. Effectively the VM kernel had to
implement/shadow all the status bits that existed in the real hardware
definition ... as well as simulate each priv. instruction executed by
code running in the virtual machine. In the MVT->SVS transition, the
virtual machine went from non-relocate to relocate ... which exploded
the number of bits in the hardware definition by a couple orders of
magnitude (before it was little more than the regs & the psw ... but
now it also included all the virtual-virtual relocate tables).
Another effect was that the number of priv. instructions executed by
the (virtual) operating system exploded in the MVT->SVS transition. This
could be seen in a non-VM environment on how long it took the same job
to run under MVT vis-a-vis SVS (drastic increase in kernel
pathlength). From a VM/370 standpoint that drastic increase in kernel
pathlength also included a significant increase in ratio of privilege
instructions (requiring simulation) to total instructions.
In the SVS->MVS transition period ... some machines appeared that
provided virtual machine hardware "assists" that would handle some
percentage of privilege instruction "simulation" directly in the
hardware (i.e. implement decode of the instruction according to the
virtual machine rules, not real machine rules). This effectively
reduced the simulation overhead to zero for those instructions.
However, the performance gain from hardware assist was more than
offset by further bloat in the MVS kernel and another significant
increase in ratio of (non-assisted) privilege instructions to total
instructions executed.
In '68, a MFT job-stream running about 40% cpu busy ran 2.5 times
as long executing under VM as it did standalone. By the summer of '68,
I had reduced the cp pathlength so that the same job-stream only ran
1.15 times as long. Running the same job stream under MVT bumped the
time elongation up to 1.5 times as long. Along came SVS and the
elongation ballooned back up to over 2 times. When hardware
assist appeared, backlevel MVT ran almost at 1.0 (almost no increase),
while SVS tended to be in 1.5 range. MVS blew it out of the water
again ... although VS1 on some machines actually ran faster under VM
than they would standalone.
In areas of generalized operating system function (scheduling, paging,
dispatching, file i/o, etc), somewhat orthogonal to virtual machine
simulation, CP tended to have 1/10th the pathlength of the generalized
operating systems that might run under it (in addition to having better
algorithms). As a result, the cp/cms timesharing combo (along with
careful attention to minimizing duplication of effort above & below
the simulation line) tended to blow away any of the other operating
system offerings (i.e. TSO running on MVS running w/o VM on native
machine).
In many respects the attention to pathlength was a direct result
of people being able to show performance running with & w/o VM.
For some reason, the MVT->SVS->MVS bloat was significantly greater
than any VM vis-a-vis non-VM difference ... but it never came
under the same level of scrutiny.
What is an IBM 137/148 ???
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: What is an IBM 137/148 ???
Date: 22 May 1995 21:02:09 GMT
155&165 had 80ns cache, relatively slow main memory (i think 2mic),
and no virtual memory. there was also 145 non-cache machine that
shipped with virtual memory hardware. 145 was vertical m'code, 155 was
horizontal m'code and 165 was hard wired.
when 370 virtual memory was announced along with 158 & 168 which had
main memory that was around 4-5* faster, if i remember right the
165->168 also decreased the 370 instruction implementation from
something like avg. of 2.1 to about 1.6 machine cycles per
instruction. 155&158 were both about 0.9-1mip machines running
"in-cache", but cache miss was much more expensive on 155 (because of
slower memory). 145 was about .3-.4mip. 168 was about 2.5mip machine.
the 155&165s had to have hardware upgrades to support virtual memory
(in fact one or two features were left out of virtual memory
announcements because of difficulty in retro-fitting the additional
signal line to the 165 machines already in customer shops).
it was also possible to swing out the 155 front panel and flip a
switch turning off the cache. for heavy i/o workload, non-cache 155,
145 & 360/67 all had about same thruput running cp (i.e. modified
version of cp/67 running 370 virtual memory mode).
the 138/148 were about 3 years later ('76?), had some amount of
faster memory; but also lots more m'code space ... into which were
coded "operating system performance assists". The 148 also had a lot
of work done on floating point ... significantly reducing the
difference between floating point & fixed point performance that had
been characteristic of ibm machines up until then. the 148 had around
128k for m'code ... and had around 6k left and were looking for some
way to utilize it. I posted the test run results to this group about a
year ago which were used to select pieces of the kernel that got
migrated to m'code (someplace in files at
ftp.netcom.com/pub/ly/lynn).
vm/370 started out with a model number table that was used at boot
time to adjust a number of parameters, theoretically based on
processor performance. For the resource manager performance
enhancement, I replaced it with some boot benchmarking code that
attempted to figure all that stuff out dynamically.
there was 360/95 & 360/195 which were the top-end floating point
machines. main work done on the 360/195 to create a 370/195 was the
370 non-virtual memory instructions (same as in the original 155 &
165) and instruction retry.
A lot of work was put into all the machines in the 370 line to recover
from non-reproducible, transient hardware errors (I vaguely remember
that given the number of circuits in the 195, their MTBF, and the
speed at which the 370/195 ran; that a hardware error was expected to
occur something like once a month, the addition of some of the 370
instruction retry for the 370/195 was supposed to mask that).
it wasn't really easy since it had 64 instruction pipeline and imprecise
interrupts. state configuration for what could be retried (& how) was
difficult. imprecise interrupts contributed to the lack of virtual
memory hardware on the 370/195.
there was also a prototype dual-istream 195 machine that was never
produced. the pipeline didn't have speculative execution so a branch
would drain the pipeline (except for the special case of a branch
target that would loop within the pipeline); some investigation was put into
building a simulated SMP 195 that appeared to be two processors
... but with only a small additional hardware bump for 2nd instruction
address and 2nd set of regs. For many operating environments it would
provide nearly twice the thruput for <5% increase in hardware.
late '70s saw the introduction of the 303x line. Main feature of the
303x line was the channel controller box. The 3031 was effectively a
158 with new covers and a channel controller, the 3032 was a 168 in
new covers & a channel controller. The 3033 was a new machine. The 168
effectively used technology that was about 4 circuits/chip. The 3033
was built with newer 20 circuits/chip ... although it was originally
laid out just using the 168 logic (i.e. only 4 circuits/chip being
used) ... which resulted in about 20% performance improvement (because
of somewhat faster chip). Some last minute redesign of the logic in
critical places up'ed the improvement to closer to 50% (utilizing more
than 4 circuits/chip and getting more intra-chip processing).
The 4331/4341 were introduced about the same time to replace the
138/148.
Some trivia information ... the 303x channel controller box was
basically a 158 with a different horizontal m'code. It also had some
additional I/O feature support. About that time, I had done a custom
modified operating system for the disk engineering lab. Their typical
environment was 6-10 "test cells" connected to a machine. Problem
was that operating a single test-cell engineering box had a habit of
"crashing" all of the mainframe operating system; typically within
10-15 minutes. As a result, testing had to be done on a stand-alone,
single test-cell at a time basis, using pretty rudimentary
software. Trick was to implement an absolute bullet-proof replacement
I/O subsystem that would support all test-cells operating
concurrently. Turns out there were a couple tricks you could play
with the 303x channel controller sending standard ops in special
sequence that would help fence off a test-cell that had gone berserk.
Also if you hit all channels on a channel controller with a clear
channel in quick succession, you could cause it to reboot. Didn't
help with 4341 operation/testing ... just had to grin & bear it as
best as possible.
The 155/158 & 165/168 were done by different teams of engineers at
different locations. The next machine after the 303x was the 3081 done
by the 155/158 engineering group ... followed by the 3090 done by the
165/168 engineering group.
The 3081 was introduced as an SMP-only machine (2-way and the 3084
4-way). However, it wasn't a "multi-processor" in traditional IBM
terms. Up until then, the 360, 155/165, 158/168, and 303x SMPs had all
been independent machines with independent power-supplies, channels,
etc ... but with hardware interconnect to share memory bus and
synchronize caches ... and they could be partitioned and operated as
independent machines. The 3081 had two processors packaged in the same
box sharing a lot of the same components.
At the low-end of the original 370 line were the 115/125 which had
novel architecture (for 370). The machines were laid out with a
common bus which could be shared by up to 9 microprocessors. In the
115 all the microprocessors were the same. Different microprocessors
typically were dedicated to different functions and had different
m'code loads. The primary difference between the 115 & the 125 was
that for the 125 there was a unique faster microprocessor running the 370
m'code load. As far as I know all configurations shipped to customers
were limited to one processor running with a 370 m'code load. I did a
cp design that would support a 125 being configured with up to five
370 processors ... also utilized programmability of the other
processors to offload a lot of kernel pathlength. When that project
got killed, rolled the design over into standard 2-way smp leaving out
a number of special features (since i wasn't going to be allowed to
modify the m'code). I did a custom release 3 version for HONE (started
out with eight 2-way SMP operating in single-system image mode with
common disk farm, i.e. this config had stuff that if any complex was
taken offline &/or failed, online users were reconfig'ed onto
remaining available machines). Same SMP design shipped in release 4
product.
The 5-way 125 and 148 activities were going on about the same time and
for some reason the 148 group viewed the 125 activity as competitive.
As a result in the shoot-out meetings I got to be both the shooter
and the target on both sides of the table.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
SMP, Sequent Computer Systems, and software
From: Lynn Wheeler <lynn@garlic.com>
Subject: Re: SMP, Sequent Computer Systems, and software
Date: 05/25/1995
Newsgroups: comp.arch
... for cache based machines ... various forms of affinity could also
boost thruput ... interrupts occurring on processors in the middle of
doing random other things ... vis-a-vis interrupts going to processor
which already had been processing interrupts ... and handed off higher
level (transaction) processing to machine already doing higher level
processing has been able to show a 50% increase in MIP rate (i.e.
better control on when asynchronous interrupts might occur as well as
what processor was already doing what). Structure of fine-granularity
locks, inter-CPU processing hand-off, and dynamic adaption has impact
on being able to achieve such goal.
1401 overlap instructions
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Newsgroups: comp.lang.asm370,alt.folklore.computers
Subject: Re: 1401 overlap instructions
Date: 03 Jun 1995 20:36:48 GMT
my first student programming job was to implement the 1401 mpio
function (front-end to 709) on 360/30 ... this was summer of '66. 1401
had 3 (7-track) tape drives, 2540 reader/punch & 1403N1 printer.
incoming cards were either bcd (subset ebcdic) or binary (used all 80
columns and 12 rows; read column binary with top & bottom 6 rows going
into different bytes, 160 bytes total). had to do reads &
feed/select-stacker separately. i couldn't keep an output tape, input
tape, card reader, punch/printer all running with card reader at full
speed while using os/360 (release 6?). wrote my own multi-tasker, &
interrupt handlers; took over the interrupts from the operating system
... and then could run card reader at full speed ... while also
processing output stream. Used as much of memory as possible for
elastic buffers.
my assembly program ran to about 2000 cards (about a box). w/o any
macros ... took about 30 minutes to assemble program and generate
executable. did version with macros and switch that ran stand-alone or
ran under os. problem was that macros really slowed down the
assembler; a DCB macro took six minutes elapsed time; five DCBs for
the two tapes and three unit record added another 30 minutes to the
assembler elapsed time.
After a while, I found it was frequently faster to repunch/multipunch
(026) a 12-2-9 TXT card with patches than it was to re-assemble
... somewhat arcane skill being able to read punch holes in 360 binary
decks, fan a binary deck ... pick out the card with the address of the
instruction(s) needing patching and dup/repunch new card with fixes.
felt somewhat silly when i finally discovered .REP cards.
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
Who started RISC? (was: 64 bit Linux?)
From: Lynn Wheeler <lynn@garlic.com>
Newsgroups: comp.os.linux.development.system,comp.arch,alt.folklore.computers
Subject: Re: Who started RISC? (was: 64 bit Linux?)
Date: 13 Jun 1995 14:35:55 GMT
had to have been prior to '78, i was at conference '76
presenting 16-way smp design and the 801 group
was presenting 801 and cpr operating system.
i remember because somebody from the 801 group flamed
our presentation about being able to modify existing
kernel to support 16-way smp because "they had looked
at the existing source and the control blocks didn't
contain any fields that would support smp". I guess they
never heard of modifying control blocks.
in any case, i returned the favor by flaming their
hardware deficiencies. their response was that they
were writing a brand new closed operating system and
that the hardware deficiencies were specifically selected
as being performance/software trade-offs. The closed
system implementation would do authentication and
authorization checking at compile/load time ... and as a
result runtime application code would have inline
"supervisor" code (w/o kernel calls) that would compensate
for the hardware deficiencies. 801 processors would
never be supported by any but the closed operating system
with application inline kernel code (i.e. the trade-off
significantly changes if authentication/authorization has
to be done at runtime with protected kernel calls).
i was at a pitch yesterday given by one of the sun java
people ... and load time authentication & authorization
design-point description was similar (although the setting
is totally different).
801
From: Lynn Wheeler <lynn@garlic.com>
Subject: re: 801
Date: 1995/06/13
Newsgroups: comp.os.linux.development.system,comp.arch,alt.folklore.computers
had to have been prior to '78, i was at conference '76
presenting 16-way smp design and the 801 group
was presenting 801 and cpr operating system.
i remember because somebody from the 801 group flamed
our presentation about being able to modify existing
kernel to support 16-way smp because 'they had looked
at the existing source and the control blocks didn't
contain any fields that would support smp'. I guess they
never heard of modifying control blocks.
in any case, i returned the favor by flaming their
hardware deficiencies. their response was that they
were writing a brand new closed operating system and
that the hardware deficiencies were specifically selected
as being performance/software trade-offs. The closed
system implementation would do authentication and
authorization checking at compile/load time ... and as a
result runtime application code would have inline
'supervisor' code (w/o kernel calls) that would compensate
for the hardware deficiencies. 801 processors would
never be supported by any but the closed operating system
with application inline kernel code (i.e. the trade-off
significantly changes if authentication/authorization has
to be done at runtime with protected kernel calls).
i was at a pitch yesterday given by one of the sun java
people ... and load time authentication & authorization
design-point description was similar (although the setting
is totally different).
Who built the Internet? (was: Linux/AXP.. Reliable?)
Newsgroups: alt.folklore.computers,alt.os.multics
From: lynn@netcom3.netcom.com (Lynn Wheeler)
Subject: Re: Who built the Internet? (was: Linux/AXP.. Reliable?)
Date: Thu, 15 Jun 1995 13:55:07 GMT
... then there was the internal network which originated out of
another part of 545 tech sq., circa 1970. the one byte field reminded
me of a similar story from 1976. about that time jes2 was thinking
about inventing nje ... but sticking with the original hasp 1byte
addressing scheme (originally used to identify "virtual" unit record
devices ... but the left over numbers were going to be used to
identify network nodes; typical jes2 system might have 60-80 virtual
unit record definitions ... leaving something less than 200 for
network identifiers). problem was that at that time the internal
network already was between 500 & 700 mainframe nodes. The jes2 group
basically replied ... oh well, no real customer would ever have such
a large network. From then on the jes2 group was playing constant
catchup, a couple years later when jes2 did a hack to expand support
to 1999 nodes, the internal network was well over 2000 mainframes.
As a result jes2/nje systems could only play problematic end-point
attachment roles on the internal net but could never be a reliable
intermediate node being unable to address the complete network
(i.e. the internal network was purely host based).
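The arithmetic behind the one-byte limit can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes all 256 values of the byte are usable, which the description above does not confirm.

```python
ADDRESS_VALUES = 2 ** 8  # a one-byte field distinguishes at most 256 values


def nodes_addressable(unit_record_definitions, total=ADDRESS_VALUES):
    """Identifiers left over for network nodes once the virtual unit
    record devices have taken theirs."""
    return total - unit_record_definitions


def can_address_whole_network(unit_record_definitions, network_nodes):
    """True only if every node in the network can be given an identifier."""
    return nodes_addressable(unit_record_definitions) >= network_nodes


# Typical range of unit record definitions quoted above: 60-80.
for defs in (60, 80):
    print(defs, nodes_addressable(defs),
          can_address_whole_network(defs, network_nodes=500))
```

Under these assumptions the leftover identifiers stay below 200 for either figure, well short of the 500 to 700 nodes the internal network already had.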
In fact the only reason jes2 could attach at all was a brilliant piece
of layering in the native/backbone of the original internal network
implementation. nje made the mistake of mixing the line driver
protocol, the network protocol, the transport protocol, and the
application protocol all in the same header record. two jes2 systems
at different release levels frequently couldn't even talk to each
other directly (to say nothing of attaching to the internal network).
In order for jes2 to attach to the internal network, the native
networking code wrote a series of nje line driver emulators that would
translate between native internal network and nje. The whole series
of nje line driver emulators typically corresponded to different
flavor/releases of nje &/or jes2. The nje emulation would encapsulate
and pass real nje headers originating from jes2 systems ... for
non-jes2 nodes, the drivers could fabricate and/or strip nje headers
as appropriate.
There are a whole series of stories about jes2/nje systems crashing
other jes2/nje systems on the internal network. Typical scenario goes:
a jes2/nje system at specific release level in hursley attempts to
transmit something to a jes2/nje in san jose via the internal net.
The intermediate backbone node in hursley has the appropriate jes2/nje
line driver started, accepts the incoming transmission, appropriately
encapsulates the nje header and starts forwarding it. It eventually
arrives at backbone intermediate node in san jose which recognizes the
node, passes the initial record to the appropriate nje driver emulator
that de-encapsulates the nje header before forwarding over the link to
the destination jes2/nje node. Since the intermediate node does the
appropriate nje handshaking, the header record is accepted and passed to
jes2 processing directly. Since (at least) four layers of protocol
were all jumbled together in the nje header ... minor field definition
changes could really confuse the jes2 subsystem processing. Specific
types of jes2 confusion would lead to panics ... which could then
cascade into bringing down the whole mainframe system (early form of
unintentional virus).
In typical fashion,
1) the initial fix was to require the backbone nje driver emulators to
do at least field verification (for the appropriate release level) and
where possible to do inter-release nje field conversion ... before
forwarding to a jes2 system
2) the eventual "fix" was to force the native internal network
software to abandon all drivers but officially sanctioned nje
emulation drivers.
.... and now back to your regularly scheduled program
--
Anne & Lynn Wheeler | lynn@netcom.com
3330 Disk Drives
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: Re: 3330 Disk Drives
Date: 18 Jun 1995 17:10:56 GMT
following from report i originally did in '81 ... part of the numbers
were excerpted in postings in this group early last year (archived at
ftp.netcom.com/pub/ly/lynn). 3330 numbers for 3330-ii. overall
system numbers showed that 3380 technology had a relative decline in
system performance by at least a factor of five compared to 2314
relative system performance. also 4k acc/sec/meg declined by more than
an order of magnitude (absolute).
Some of the numbers were done in late '70s where it was shown that
upgrading from 3330-ii to 3350 only improved performance if allocated
data was limited to approx. same as on 3330-ii. The 3380 numbers are for the
original (single density) 3380.
                            2305    2314    3310    3330    3350    3370    3380
data cap, mb                11.2      29      64     200     317     285     630
avg. arm acc, ms               0      60      27      30      25      20      16
avg. rot del, ms               5    12.5     9.6     8.4     8.4    10.1     8.3
data rate, mb/sec            1.5      .3       1      .8     1.2     1.8       3
4k blk acc, ms              7.67    85.8    40.6    43.4    36.7    32.3    25.6
4k acc per sec               130    11.6    24.6      23      27      31      39
40k acc per sec             31.6     4.9      13    11.3      15    19.1    26.6
4k acc per sec per meg      11.6      .4     .38     .11     .08     .11     .06
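The derived rows of the table appear to follow from the first four. A small Python sketch, assuming block access time = average arm access + average rotational delay + transfer time, with the data rate read as mbyte/sec and 1 mbyte taken as 1000 kbyte. The formula is inferred from the numbers, not stated in the report, and the function names are hypothetical.

```python
# name: (capacity mb, avg arm access ms, avg rotational delay ms, data rate mb/sec)
DRIVES = {
    "2305": (11.2, 0, 5, 1.5),
    "2314": (29, 60, 12.5, 0.3),
    "3310": (64, 27, 9.6, 1.0),
    "3330": (200, 30, 8.4, 0.8),
    "3350": (317, 25, 8.4, 1.2),
    "3370": (285, 20, 10.1, 1.8),
    "3380": (630, 16, 8.3, 3.0),
}


def access_ms(arm_ms, rot_ms, rate_mb_s, block_kb):
    """Assumed model: seek + rotational delay + transfer.
    At 1 mb/sec a drive moves 1 kbyte per ms, so transfer ms = kb / (mb/sec)."""
    return arm_ms + rot_ms + block_kb / rate_mb_s


def accesses_per_sec(name, block_kb=4):
    _cap, arm, rot, rate = DRIVES[name]
    return 1000.0 / access_ms(arm, rot, rate, block_kb)


def accesses_per_sec_per_mb(name, block_kb=4):
    """Accesses/sec spread over the full capacity of the drive."""
    return accesses_per_sec(name, block_kb) / DRIVES[name][0]


for name in DRIVES:
    print(name,
          round(accesses_per_sec(name, 4), 1),
          round(accesses_per_sec(name, 40), 1),
          round(accesses_per_sec_per_mb(name), 2))
```

Read this way, the table says capacity grew much faster than access rate from the 2314 to the 3380, which is why the per-mbyte figure falls.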
======================================================================
slightly different table ... assuming a uniform access distribution,
loading the indicated max. data on the drive (i.e. not filling the
whole thing) gives the resulting 4kbyte-block-access/sec/mbyte
(i.e. 3380 with only 40mbyte loaded gives approx. the same performance
as 2314).
                  2305    2314    3310    3330    3350    3370    3380
data cap, mb      11.2    29      64      200     317     285     630
4k acc per sec    130     11.6    24.6    23      27      31      39
20 meg            -       .041    .091    .082    .098    .122    .152
40 meg            -       -       .023    .021    .025    .031    .039
60 meg            -       -       -       .009    .011    .014    .017
80 meg            -       -       -       .005    .006    .008    .010
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
2nd wave?
From: Lynn Wheeler <lynn@garlic.com>
Date: Mon, 17 Jul 1995 22:34:32 -0700
Newsgroups: 2020world
Subject: 2nd wave?
Something somewhat spurious when I was trying to define the term business
science a couple years back ....
Wisdom &
Understanding
/\
//\\
/// \\
/// \\
/planning\
/discovery\\
/ // | \
/ Knowledge \ \
/ //workers ***\
/------------*****-\
/ // *Group*\
/ // **ware* \
/ // ******** \
/* || ******* \
Process /*** / / ****** \M
Systems---->/*s* / | \ \a
/*k* | | | \t
/*r** / / \ \e
/*o** / / Infor- \r
CAD /*w** | | Fact mation \i
VLSI----->/*e*** / | Workers \ \a
/*m** / | | \l
/*a* | | | \s
/*r* Dollars | \ \
/*F** / | | ****** \
/***** / People | ****** \
/ | | | **VLDB* \
/----***-----------------------------------*******-----\
/ *****/ | | ******** \
/ ******* | | ********** \
/ ***CIM**** | | \
/ *********** / \ \
/ ******* | Factory Workers | \
/ / | | \
/ / | | \
----------+--------------+----------------------+---------------------
Complexity Distributed
Real Dollars Information
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
atomic load/store, esp. multi-CPU
From: Lynn Wheeler <lynn@garlic.com>
Subject: Re: atomic load/store, esp. multi-CPU
Date: 08/19/1995
Newsgroups: comp.unix.aix
cs (compare&swap) is software simulation of ibm 370 instruction
originated in the early 70s (nearly 25 years ago at the ibm cambridge
scientific center) to be both enabled UP & SMP "safe". however, the
6000 software simulation is only "enabled uniprocessor" safe
(basically 6000 cs is a system call that simulates the c&s semantics
in disabled kernel code) and does NOT provide SMP "safe" semantics
(software simulation NOT providing hardware cache synchronization
semantics across multi-CPU environment).
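For readers unfamiliar with the semantics being simulated, here is a minimal Python sketch of compare&swap and the usual retry loop built on it. The lock stands in for whatever makes the compare and the store indivisible (the hardware instruction on 370, disabled kernel code in the 6000 simulation); the class and function names are hypothetical and this is not the AIX interface.

```python
import threading


class Word:
    """One storage word; the lock models the indivisible compare-and-store."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()


def compare_and_swap(word, expected, new):
    """Store `new` only if the word still holds `expected`.

    Returns (swapped, observed) so a caller that loses the race gets the
    current value back and can retry with it.
    """
    with word._lock:
        observed = word.value
        if observed == expected:
            word.value = new
            return True, observed
        return False, observed


def atomic_add(word, delta):
    """Typical usage: fetch, compute, then retry until the swap succeeds."""
    expected = word.value
    while True:
        swapped, observed = compare_and_swap(word, expected, expected + delta)
        if swapped:
            return expected + delta
        expected = observed


def demo(threads=4, per_thread=1000):
    counter = Word(0)

    def worker():
        for _ in range(per_thread):
            atomic_add(counter, 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

The point of the posting is that the guarantee is only as wide as the mechanism behind the lock: disabling interrupts protects against other code on the same processor, not against a second processor touching the same word.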
Cache and Memory Bandwidth (was Re: A Series Compilers)
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Subject: Re: Cache and Memory Bandwidth (was Re: A Series Compilers)
Date: 1995/07/08
Newsgroups: comp.arch,comp.sys.super,comp.unix.cray
there is also the custom rs/6000 4-way 801 machines that didn't do
hardware coherency (effectively since day #1, no cache coherency has
been fundamental 801 feature). memory segments can be labeled
write-shared or non-write-shared. a segment labeled write-shared
doesn't get cached ... non-write-shared segments require software
coherency ... analogous to support for i-cache & d-cache coherency or
for processor & I/O coherency (i.e. lots of flush to memory).
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
Virtual Memory (A return to the past?)
From: lynn@garlic.garlic.com (Anne & Lynn Wheeler)
Subject: Re: Virtual Memory (A return to the past?)
Date: 1995/09/27
Newsgroups: comp.arch
.. some of you are probably getting tired of seeing this ... but a
typical '68 hardware configuration and a typical configuration 15
years later
machine            360/67    3081K
mips               .3        14         47*
pageable pages     105       7000       66*
users              80        320        4*
channels           6         24         4*
drums              12meg     72meg      6*
page I/O           150       600        4*
user I/O           100       300        3*
disk arms          45        32         4*?perform
bytes/arm          29meg     630meg     23*
avg. arm access    60mill    16mill     3.7*
transfer rate      .3meg     3meg       10*
total data         1.2gig    20.1gig    18*
Comparison of 3.1L 67 and HPO 3081k
========================================
360/65 is nominal rated at something over .5mips (reg<->reg slightly
under 1mic, reg<->storage start around 1.5mic and go up). running
relocate increases 67 memory bus cycle 20% from 750ns to 900ns (with
similar decrease in mip rate). 67 was non-cached machine and high I/O
rate resulted in heavy memory bus (single-ported) contention with
processor.
drums are ibm'ese for fixed head disks.
disk access is avg. seek time plus avg. rotational delay.
the 3.1l software is actually circa late '70 or early '71 (late in the
hardware life but allowing more mature software). the 3081k software
is the vm/hpo direct descendant of the cp/67 system.
90th percentile trivial response for the 67 system was well under a
second, the 90th percentile trivial response for the 3081k was .11
seconds (well under instantaneous observable threshold for majority of
the people).
the page i/o numbers are sustained averages under heavy load. actual
paging activity at the micro-level shows very bursty behavior with
processes generating page-faults at device service intervals during
startup and then slowing down to contention rates during normal
running. the 3081k system had pre/block page-in support (i.e. more
akin to swap-in/swap-out of list of pages rather than having to
individually page fault).
big change between 68 and 83 ... which continues today is that
processor has gotten much faster than disk tech. has gotten
faster. real memory sizes and program sizes have gotten much bigger
than disk has gotten faster (programs have gotten 10-20 times larger,
disks have gotten twice as fast, so sequentially page faulting a
memmap'ed region 4k bytes at a time takes 5-10 times longer). Also
while current PCs are significantly more powerful than mainframes of
the late '60s and the individual disks are 5-10 times faster, the
aggregate I/O thruput of today's PCs tends to be less than the
aggregate I/O thruput of the mainframe systems.
In any case, when I started showing this trend in the late '70s that
disk relative system performance was declining (i.e. rate of getting
better was less than the getting better rate for the rest of the
system) nobody believed it. A simple measure was that if everything
kept pace, the 3081K system would have been supporting 2000-3000 users
instead of 320.
Somewhat bottom line is that even fixed head disks haven't kept up
with the relative system performance. Strategy today is whenever
possible do data transfers in much bigger chunks than 4k bytes,
attempt to come up with asynchronous programming models (analogous to
weak memory consistency & out-of-order execution for processor
models), and minimize as much as possible individual 4k byte at a
time, synchronous page fault activity.
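A rough illustration of why bigger chunks help, using 3380-class numbers (16ms avg. arm access, 3mb/sec transfer, plus an assumed 8.3ms avg. rotational delay). The model, fixed positioning cost plus transfer time per request, is a simplifying assumption and ignores queueing, channel contention and any sequential-access locality.

```python
# Assumed 3380-class figures; the rotational delay is not in the table above.
ARM_MS = 16.0
ROT_MS = 8.3
RATE_MB_PER_SEC = 3.0


def effective_kbytes_per_sec(block_kbytes):
    """Delivered data rate when every request pays the positioning cost."""
    # kbytes divided by mb/sec comes out directly in milliseconds
    access_ms = ARM_MS + ROT_MS + block_kbytes / RATE_MB_PER_SEC
    return block_kbytes * 1000.0 / access_ms


for kbytes in (4, 40, 400):
    print(kbytes, round(effective_kbytes_per_sec(kbytes)))
```

At 4k per request nearly all of the elapsed time is arm motion and rotation; the transfer itself is only a little over a millisecond, so a 10x larger request costs far less than 10x the time.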
--
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
801 & power/pc
From: Lynn Wheeler <lynn@garlic.com>
Subject: 801 & power/pc
Date: 1995/09/17
Newsgroups: comp.arch
there is fundamental design change between 801/power and
power/pc. I relate it back to conference i was at in
'76 time-frame (almost 20 years ago) ... where 801 group flamed
us for saying we were modifying existing operating system to
support 16-way SMP. I returned the courtesy during their
presentation.
base 801/power had various hardware optimization trade-offs
that were based on hardware being only used in a closed, custom
proprietary operating system (and would NEVER be used in
any sort of open/general purpose system). two were
small number of concurrent shared memory objects (reduced
hardware) and optimized non-coherent cpu/cache operation
(fundamental principle of 801 was it would NEVER be
used in cache coherent &/or SMP configuration). Another
fundamental principle was that security/authentication/etc would
be done at compile/load time ... and all applications could
directly execute all instructions (w/o kernel calls for runtime
security/authentication) ... and therefore compiler could
generate inline code to compensate for explicitly designed
hardware shortcomings (in some sense, various design
challenges were similar to current hot java).
translation to power/pc violates a number of fundamental, basic
801 principles established during the '70s/'80s. For instance,
801 never has to worry about serialization to any byte of data,
once CPU has data fetched from cache, by definition it never has
to worry about local cpu copy and cache/memory copy being in
sync. not having to worry about synchronization allows for
streamlined cpu pipelined operations that would be much more
difficult in cache-coherent environment.
Crashproof Architecture
From: Lynn Wheeler <lynn@garlic.com>
Subject: Re: Crashproof Architecture
Date: 09/17/1995
Newsgroups: comp.arch
formalizing existing structures for saving virtual memory across
system boots is relatively straight-forward. Much more complicated is
potential huge amount of system state, especially associated with
active i/o operations, pending signals, etc. there were a couple
places in the early '70s that took existing commercial virtual memory
operating systems to implement such function ... and the virtual
memory save hack was possibly only 5-10% of the work.
at least one such place (early '70s) had a couple geographically
distributed centers with large clusters. checkpoint allows any complex
in a cluster to be periodically shutdown for maintenance w/o impacting
non-stop applications. application could resume later on the same
complex after it came back up ... but more frequently resumed
immediately on another complex in the same cluster/center (and given
certain types of restrictions regarding i/o activity could even resume
on a complex located in a different cluster/center).
slot chaining
From: Lynn Wheeler <lynn@garlic.com>
Subject: slot chaining
Date: 1995/09/23
Newsgroups: comp.arch
i added slot chaining and ordered seek queueing to cp/67 around
1970 ... and typically could operate a 1/3mip to 1/2mip processor
with 70-80 mixed mode users (batch & interactive) running the
processor at 100% utilization and <1sec 90 percentile response
for trivial requests. this was with 2-3 2301s (fixed head,
4mbyte each) which would hit aggregate sustained thruput of 270
4kbyte page transfers per second on a shared i/o bus that
operated at 1.5mbytes/sec. This was with less than 1/3rd of the
total virtual memory pages resident on the fixed head devices ...
the rest were typically movable head disks. The 2301s operated at
3600rpm and had 9 4k pages formatted per two tracks (i.e.
two revolutions to transfer 9 4k pages)
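The 270 pages/sec figure lines up with the rotational ceiling implied by those numbers. A small Python check, assuming only one device transfers at a time on the shared bus, 4k = 4096 bytes, and 1.5mbytes/sec = 1.5 * 10^6 bytes/sec (function and constant names are hypothetical):

```python
RPM = 3600
PAGES_PER_TWO_REVS = 9
PAGE_BYTES = 4096               # assumed; the posting just says "4k"
CHANNEL_BYTES_PER_SEC = 1.5e6   # assumed decimal megabytes


def peak_pages_per_sec(rpm=RPM, pages=PAGES_PER_TWO_REVS, revs=2):
    """Pages delivered per second if every slot going by is transferred."""
    revs_per_sec = rpm / 60
    return revs_per_sec * pages / revs


def channel_utilization():
    """Fraction of the i/o bus consumed at the peak page rate."""
    return peak_pages_per_sec() * PAGE_BYTES / CHANNEL_BYTES_PER_SEC


print(peak_pages_per_sec(), round(channel_utilization(), 2))
```

Sustaining that rate means a request is ready for essentially every page slot as it rotates under the heads, which is what slot chaining is for.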
In the conversion to vm/370, the 2301s were upgraded to
2305s (same transfer but 12mbyte/device). I also introduced
smarter page replacement algorithm (compared to 1hand & 2hand
clock I had created in late '60s) as well as more sophisticated
page migration algorithm (i.e. relocation of virtual memory
pages between fixed head devices and movable head devices).
Configurations which split 2-6 2305s across at least two I/O
buses (1.5mbyte/sec each) could hit 600 page transfers/sec over
two buses (channels for those mainframe folks). Total
aggregate page transfers would exceed that because typical
2305 configurations only had capacity for 20% or less of
the total virtual memory pages (rest on movable head devices).
device latency = 0 is the wrong way to describe, the hardware
was capable of ordering queued requests so that there was
no device redrive/service delay between one request and the
next ... and the programming for the 2305 turned out to
be trivial. The 2305 supported multiple virtual hardware
addresses. Simplest software support mapped each rotational
position (or page slot) to a unique hardware address on the
device. Effectively a two-level address ... 1st part
selected the device (actually a range of hardware addresses)
and the rotational position of the page selected the hardware
sub-address ... i.e. from the sector number it was possible
to calculate a predictable rotational position ... and each
page start position in a revolution was assigned to a specific
hardware address on the device. Hardest part was selecting
the convention ... once that was done it effectively only added
4-5 instructions to mainline disk support (although all the
migration stuff was much more sophisticated).
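A sketch of the two-level addressing convention described above, in Python. Everything concrete here is an assumption for illustration: the number of page slots per revolution, the consecutive numbering of slots (so that slot number modulo slots-per-revolution gives the rotational position), and the example base address.

```python
from collections import defaultdict

SLOTS_PER_REV = 4  # hypothetical; depends on how pages are formatted per track


def sub_address(slot_number, slots_per_rev=SLOTS_PER_REV):
    """Rotational position of a page slot, used as the hardware sub-address."""
    return slot_number % slots_per_rev


def hardware_address(device_base, slot_number, slots_per_rev=SLOTS_PER_REV):
    """1st part selects the device, 2nd part the rotational position."""
    return device_base + sub_address(slot_number, slots_per_rev)


def queue_requests(device_base, slot_numbers):
    """Group pending page requests by the hardware address they start on."""
    queues = defaultdict(list)
    for slot in slot_numbers:
        queues[hardware_address(device_base, slot)].append(slot)
    return dict(queues)


print(queue_requests(0x1C0, [0, 4, 1, 6, 3]))
```

With requests spread across the sub-addresses by rotational position, software no longer has to sort the queue itself; the ordering falls out of which address each request was started on.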
SSA
From: lynn@garlic.com (Anne & Lynn Wheeler)
Date: 1995/09/25
Subject: SSA
Newsgroups: comp.arch.storage
ssa, grump ....
a large number of the 9333 systems were for ha/cmp and we heavily
backed the project. however we were also doing cluster scaleup using
fcs. during san fran usenix, jan. 1992, Hester, my wife, and I had a
meeting with Ellison, Baker, Shaw, Puri (and a couple others) in
Ellison's conference room. We proposed having 16-way fcs pilot
clusters in customer shops with parallel oracle by summer of 1992
... with 128-way available by ye92.
unfortunately the kingston group were out trolling for technology and
found cluster scaleup the very next week. in something like 8-10
weeks, the project was transferred to kingston, announced as a
supercomputer, and we were instructed to not work on anything
involving more than 4 processors.
in the elephant's dance to do the supercomputer subset of cluster
scaleup ... the device interconnect strategy got obliterated. so
instead of 9333->interoperable family (1/4 & full speed fcs, potentially
1/8 & 1/4 speed fcs on serial copper, etc); in the resulting confusion,
9334->ssa.
while ssa is quite good technology (especially compared to scsi), an
interoperable fcs family strategy would be better. also for ha/cmp, it
was important to be able to do fencing; at least for (some) hippi
switches we were able to get a fencing feature included. i haven't
followed fcs much recently and don't know if any of the current fcs
switches are able to fence.
trivia question: in the late '88 timeframe, what was the projected '92
per drop price for fcs (including prorated price of switch)?
+-+-+-+
Anne & Lynn Wheeler | lynn@garlic.com, lynn@netcom.com
| finger lynn@garlic.com for public key
characters
From: Lynn Wheeler <lynn@garlic.com>
Subject: characters
Date: 1995/09/25
Newsgroups: comp.lang.misc,alt.folklore.computers,comp.sys.misc
.. 'green' cards list the character mapping for ebcdic
code (8bit, 256). even for printable characters there
are some rough edges for ebcdic<->ascii translators
.. ebcdic doesn't have brackets, braces, etc.
i added tty/ascii support to one of the mainframe operating
systems while an undergraduate back in the 60s ... somewhat
chose arbitrary mapping for ascii->ebcdic for incoming.
not sure about telco uses. i did a custom modified operating
system that was installed at at&t longlines about '74
(effectively the resource manager that shipped in '76 but with
lots of additional features, like a page mapped file system, etc
.. that didn't ship). I remember getting a call around '83 or so
from some IBM'er asking if i could do something since at&t still
had a lot of machines running it (at&t had been moving it to next
generation machines as they came out ... the opsys-base predated
SMP support ... and the next generation hardware was smp only).
did do another project in the 60s as an undergraduate which
was a terminal controller (someplace there is an article
blaming us for originating the ibm oem control unit business).
In any case, discovered an interesting feature of the standard
ibm control unit; the line-controller (uart) reads the leading
bit into the low-order bit position, i.e. if you ever looked
at 'raw' ascii in the memory of an ibm mainframe (after it
came in off an ascii device) ... you would find it bit-reversed.
I haven't paid any attention recently to see whether new
generation of tcp controllers still support bit-reversal ... the
terminal controllers i believe still have the convention.
It would lead to some confusion on the mainframe side if
the terminal side still did ascii bit reversal and the ip
controllers didn't (i.e. would need two completely different
set of translate tables when translation was done on the
mainframe side).
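To make the bit-reversal concrete, a minimal Python sketch that reverses the bit order within each byte and builds the 256-entry table a translate step would use. It assumes a plain reversal of all 8 bits; parity handling and the actual ascii->ebcdic mapping are left out, and the names are hypothetical.

```python
def reverse_bits(byte):
    """Reverse the order of the 8 bits in one byte value (0-255)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (byte & 1)
        byte >>= 1
    return result


# One 256-byte table, usable directly with bytes.translate
BIT_REVERSE_TABLE = bytes(reverse_bits(b) for b in range(256))


def unreverse(raw):
    """Undo (or apply) the bit reversal on a buffer of raw line data."""
    return raw.translate(BIT_REVERSE_TABLE)


# ascii 'A' is 0x41 (01000001); bit-reversed it sits in memory as 0x82
print(hex(reverse_bits(0x41)), unreverse(bytes([0x82])))
```

In practice the reversal can be folded into the ascii->ebcdic table itself, which is why a path that reverses and a path that does not would need two different sets of tables.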
multilevel store
From: Lynn Wheeler <lynn@garlic.com>
Subject: multilevel store
Date: 1995/10/04
Newsgroups: comp.arch
ibm mainframe (and others) extended store is somewhat along those
lines ... except it pushed the architecture slightly further.
borrowed some from electronic disk and some from multi-level software
controlled cache. implementations are at least 10 years old.
there is a very wide fast custom bus. at issue is that it is big
and somewhat remote (in terms of nanoseconds), so latency for
access is long by memory bus standards. however, the wide/fast
bus ... once the transfer starts, completes on the order of
<100 standard instructions. the interface paradigm is very long
synchronous (cache-bypass, storage-to-storage) instruction. the
trade-off is that normal async. device driver pathlength runs to
several thousand instructions. the synchronous paradigm busies
the cpu for a very long period of time (<100 instructions,
actually i haven't seen the current timings, 10 years ago, it
was about 20 instruction timings) ... but that is only a small
fraction of what a device driver paradigm would cost.