List of Archived Posts
From: lynn@netcom2.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: 9 Apr 93 16:58:47 GMT
The machine started out as a 360/65 with .75 microsecond memory (NO processor cache) ... to which was added relocation hardware (8 entry fully-associative look-up hardware) which added .15 microseconds to memory/address operations (.9 microsecond total). Typical instructions operated in 2-3 memory cycles.
CP/67 ran on the 360/67 and was the predecessor to the VM/370 product ... done by the IBM Cambridge Scientific Center. The original version was a project implemented on a (one of a kind) custom modified 360/40 that had hardware address relocation added to it. Some of the people had worked on CTSS and at the start there were a number of similarities. In some sense the CP/67 implementation can be considered a slightly-older "brother" to Unix (same heritage, slightly older). One of the more interesting aspects of CP/40 was that it implemented virtual machines, with the CTSS-like user interface implemented as a single-user operating system (called CMS ... for Cambridge Monitor System) running in a virtual machine under CP/40.
In '68 & '69, I added to CP/67 a clock-like global LRU replacement algorithm, dynamic adaptive feed-back fair-share scheduling, ordered seek queueing, something akin to working-set controls to prevent page thrashing, fast-path for dispatch, scheduling, interrupt handling and numerous other functions. Part of the pseudo-working-set page thrashing controls was dynamic adaptation of the MPL (multiprogramming level) to the efficiency of the page I/O subsystem, i.e. higher MPL and more memory contention for faster page I/O subsystems, lower MPL and less memory contention for slower page I/O subsystems ... in effect, MPL was limited to the sum of the pseudo-working-sets that could fit in the available pageable pages ... but part of the pseudo-working-set calculation took into account the effectiveness of the page I/O subsystem.
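The MPL limit described above can be pictured with a small admission loop. This is only a sketch under stated assumptions: the `io_factor` multiplier and the per-task page estimates are hypothetical stand-ins, not the actual CP/67 calculation, which the post does not spell out.

```python
def admit(tasks, pageable_pages, io_factor):
    """Return the names of tasks admitted to the dispatch set.

    tasks: list of (name, resident_page_estimate) in priority order.
    io_factor: hypothetical multiplier; below 1.0 for a fast page I/O
               subsystem (smaller pseudo-working-sets, higher MPL),
               above 1.0 for a slow one (lower MPL).
    """
    admitted, used = [], 0
    for name, pages in tasks:
        pseudo_ws = max(1, round(pages * io_factor))
        if used + pseudo_ws > pageable_pages:
            break  # sum of pseudo-working-sets would exceed pageable pages
        admitted.append(name)
        used += pseudo_ws
    return admitted


tasks = [("a", 40), ("b", 40), ("c", 40), ("d", 40)]
print(admit(tasks, 104, io_factor=1.0))
print(admit(tasks, 104, io_factor=0.6))
```

With the same 104 pageable pages, a smaller factor (faster paging devices) admits more tasks, which is the direction of adaptation the post describes.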
In early '70s (20+ years ago) this system running on a single cpu 768kbyte '67 (104 4k pageable pages) would provide subsecond interactive response with 80+ users in a dynamic mixed-mode (interactive/batch) ... with the processor operating at 100% utilization. Under this load the page I/O rate would typically average around 150 4k-pages/sec.
Part of the secret was doing near optimal, dynamically-adaptive page-replacement algorithm as well as near optimal dynamically-adaptive page thrashing controls in almost no instructions. To take the page-fault interrupt, (near-optimally) select the page-replacement, schedule the page I/O, execute the page I/O, do the process-switch, later take the page i/o interrupt, and task-switch back to the original process would occur on the average in under 500 instructions (this average included a pro-rated percentage of the instructions needed to do page-writes for replacing changed/modified pages).
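For readers who have not seen a clock-style replacement, the following is a minimal sketch of the textbook version (one reference bit per frame, a rotating hand). It is not the CP/67 code; the class and method names are made up for illustration, and the software `referenced` list stands in for hardware reference bits.

```python
class ClockReplacer:
    """Textbook global clock (second-chance) replacement over a fixed set of frames."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames        # page held by each real frame
        self.referenced = [False] * num_frames   # stand-in for hardware reference bits
        self.where = {}                          # page -> frame index
        self.hand = 0                            # clock hand position
        self.faults = 0

    def touch(self, page):
        """Reference a page; return the page evicted to make room, or None."""
        if page in self.where:
            self.referenced[self.where[page]] = True
            return None
        self.faults += 1
        # Sweep: reset reference bits until an empty or unreferenced frame is found.
        while self.frames[self.hand] is not None and self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.frames)
        victim = self.frames[self.hand]
        if victim is not None:
            del self.where[victim]
        self.frames[self.hand] = page
        self.referenced[self.hand] = True
        self.where[page] = self.hand
        self.hand = (self.hand + 1) % len(self.frames)
        return victim
```

The attraction is that the selection step is a short loop over bits rather than maintenance of an exact LRU ordering, which is consistent with the low instruction counts quoted above.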
The Grenoble Scientific Center used a similar base to implement a "real" working-set scheduling algorithm that ran on a 1mbyte single-cpu '67 (154 pageable pages, i.e. amount of storage left after fixed/pinned kernel storage requirements). They provided approximately the same performance for 35 users on their machine ... that we were providing on our machine for 80+ users (similar workload and 1/3rd less available memory).
We also did an SMP 2-cpu version of the implementation ... and out of that work a co-worker originated an MP synchronization instruction. A couple months were spent coming up with a mnemonic that matched his initials ... finally hit on Compare&Swap ... although the mnemonic today frequently is shortened to CS. The SMP/'67 was somewhat unique in having a totally independent I/O subsystem with an independent path to memory. A "half-duplex" (i.e. an SMP with SMP I/O controller and only one processor) '67 processor would run a simple numerical intensive workload slightly slower than a "simplex" '67. However, in most real-world environments (with heavy I/O load), a half-duplex '67 would sustain a higher instruction/second rate than a simplex '67.
As an aside, the system was heavily instrumented and a great many performance tests were done ... in some of my crazed moments I would run series of automated tests that were all at the edge of the envelope (150 active users, mixed-mode workload running in a 104 pageable-page configuration) ... which is actually harder than it might look at first, since a large percentage of the things that crash a system had to be fixed (including eliminating ALL causes of things like zombie processes).
Note that 12 years later (over 10 years ago), similar software on a processor with 50 times the MIP rate and 50 times the number of pageable pages was typically only supporting 3-4 times the number of users (in theory in a similar mixed-mode workload environment). The graphs that I was drawing at the time indicated a growth in workload capacity (at least in terms of online, active users) proportional to the change in the disk I/O subsystem thruput over the 12 year period.
From: lynn@netcom4.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: 9 Apr 93 21:20:51 GMT
the cp/67 fastpath made a big difference on being able to run production operating system (OS/360 MFT) in one virtual machine while supporting other virtual machines running the CMS single-user interactive operating system (on the same hardware).
On the day I got CP/67 ... the elapsed time it took our MFT to execute our benchmark jobstream in a CP/67 virtual machine was 2.5 times the elapsed time it took to run it stand-alone (running the '67 in '65 mode). This was a hand-crafted MFT. The standard MFT "out-of-the-box" took twice the elapsed time to execute the benchmark jobstream.
When I was done with the CP/67 fastpath code, the elapsed time for the jobstream benchmark was 1.15 times the "stand alone" time (part of this was the difference between the standalone '65 mode running real with 750nsec memory cycle ... and relocate virtual machine mode adding 150nsec to the memory cycle time).
From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: 10 Apr 93 17:12:41 GMT
when we got cp/67 in early '68 ... it only had 2741 & 1050 support and our account had a number of tty ... so I had to add tty support to cp/67. At the same time, I rewrote the terminal interface driver so that it would dynamically recognize whether the attached device was 2741, 1050, or tty (i.e. a single rotary dial-up could be used regardless of the device type).
unfortunately the ibm hardware people told me later that I shouldn't be able to do that. While the hardware controller actually supported being able to dynamically change the line-scanner associated with any line ... somebody had taken a short-cut and hardwired the oscillator frequency to in-coming lines. It was an "accident" that tty bit frequency and 2741 bit frequency were close enuf that something that wasn't supposed to work ... actually did.
In response to that ... we formed a four-man team that built the first ibm "oem" controller. it included support for strobing the initial incoming bits on a connection in order to dynamically determine the terminal speed (i.e. dynamically determine both terminal speed and terminal type). Unfortunately while most "other" ibm mainframe software would "tolerate" dynamic speed determination ... they wouldn't tolerate dynamically changing the terminal type on a line.
In order that the mainline IBM software didn't feel completely left out ... we later started a project for terminal support in MVT/HASP (this was MVT OSr18 and I believe HASP-III). I ripped out the "2780" remote device support and replaced it with my terminal driver from CP/67. I then re-implemented the CMS syntax-directed editor's command set (had to be all re-entrant code, original CMS code wasn't). It gave MVT/HASP a powerful conversational remote job entry capability.
From: lynn@netcom4.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: Self-virtualization and CPUs
Date: 11 Apr 93 19:54:45 GMT
note that self-virtualization includes the ability to operate the hypervisor/vm-monitor under itself ... where the "2nd-level" hypervisor might also be running copies of itself and/or other applications in virtual machines (I've seen a 3-level stack for specific application environment ... i.e. 3-level nested hypervisors under which something else was running doing productive work ... this is aside from various experiments that just toyed to see how deep the stack could be made).
there are cases of virtual machine hypervisors running in hardware environments that are not self-virtualizable ... i.e. the hypervisor might be able to provide virtual machines for other purposes ... but is unable to "hypervise" itself.
From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: 12 Apr 93 01:49:50 GMT
Work on the clock-like algorithm (design, implementation, deployment) was done in '68 while sysprog at university. The enhancement for more dynamic switching between random & LRU was done >20 years ago while I was at IBM. No publications were made from that period ... just deployed the code. I did include a description in a talk I gave at the 10/86 SEAS meeting on the Isle of Jersey (it was taken from an early '83 white paper I did on "Performance History" ... spanning the 15 year period from '68 to '83 ... which also highlighted the fact that relative performance of disk I/O subsystems had declined by up to an order of magnitude in the period).
at various times over the past 20+ years I've suggested it as technology for things like software caching & hardware disk caching controllers and other things that have a tendency to use LRU algorithms.
some related page replacement:
L. Belady, A Study of Replacement Algorithms for a Virtual Storage Computer, IBM Systems Journal, v5n2, 1966
L. Belady, The IBM History of Memory Management Technology, IBM Journal of R&D, v25n5
R. Carr, Virtual Memory Management, Stanford University, STAN-CS-81-873 (1981)
R. Carr and J. Hennessy, WSClock, A Simple and Effective Algorithm for Virtual Memory Management, ACM SIGOPS, v15n5, 1981
P. Denning, Working sets past and present, IEEE Trans Softw Eng, SE6, jan80
J. Rodriguez-Rosell, The design, implementation, and evaluation of a working set dispatcher, cacm16, apr73
related paper using cluster analysis to restructure programs for vmm
D. Hatfield & J. Gerald, Program Restructuring for Virtual Memory, IBM Systems Journal, v10n3, 1971
also try Pat O'Neil (firstname.lastname@example.org) ... I was talking to him about a year ago ... he was doing a section on replacement algorithms for a book.
From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: 360/67, was Re: IBM's Project F/S ?
Date: Wed, 14 Apr 1993 05:09:56 GMT
note that by the mid-70s ... a significant number of environments (running vm/370) were severely I/O constrained ... all types of I/O .... page i/o, file i/o, etc.
In terms of working set, by 1978 ... a significant number of configurations had three times as much real memory as would nominally be needed to maintain MPL level with CPU saturation. However, many of these same systems had paging bottlenecks. It wasn't the page thrashing bottlenecks from the '60s ... that led to solutions like working-set scheduling to try and address real storage contention. It was that there was insufficient page I/O capacity. Even if you took as the value for "workingset" ... the total number of virtual pages touched during a reasonable execution period (potentially several cpu seconds) ... all the pending processes would still be allowed in the dispatch queue. However, at the same time there was insufficient page I/O capacity just to get the pages into memory ... there was a paging bottleneck but not because of real memory contention (like the '60s) ... it was a page I/O bottleneck. This is part of the basis for my statement that in the 15 years between '68 and '83, the relative performance of disk subsystems had declined by up to a factor of ten (processor complex got 50 times faster while i/o subsystems only got five times faster).
Part of my solution in '78 was block paging ... which looked an awful lot like the old time "swapping" from the '60s. My original scheduler attempted to implement "scheduling to the bottleneck" ... i.e. if the CPU was the constrained resource ... "fair share" was based on cpu resource consumption. If I/O &/or memory became more of the bottleneck ... then an attempt was made to shift the fair share calculations so that they were based more on working set size or I/O utilization (i.e. default cpu scheduler in a page thrashing environment would nominally attempt to give the best preferential treatment to the process generating the most paging activity ... since it would be the one most likely using the least amount of CPU). By '78, the set of constrained/saturated resources for typical environments was frequently NOT THE CPU. However, many of the environments had progressed to the point where just scheduling based on I/O usage (rather than CPU usage) was not adequate. Policies analogous to workingset sizes (real storage utilization) were needed for I/O resources.
A partial solution to the opportunity was the program restructuring work (referenced in prior append). I produced a sparse bit map of virtual address space locations (in 32-byte increments) that were referenced during a 1000 instruction interval. Hatfield ran "cluster analysis" (see some recent postings on the subject of cluster analysis in comp.programming) against the storage traces. He produced a storage reordering that would attempt to provide an improved ordering/packing of objects in virtual memory (strong/minimal working set).
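The factor-of-ten figure follows directly from the two speedups quoted in the paragraph above; the short calculation below just makes the arithmetic explicit (the function name is made up for the example).

```python
def relative_io_performance(cpu_speedup, io_speedup):
    """I/O throughput relative to processor speed, normalized to 1.0 at the start of the period."""
    return io_speedup / cpu_speedup


rel = relative_io_performance(cpu_speedup=50, io_speedup=5)
decline_factor = 1 / rel
print(rel, decline_factor)
```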
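A minimal sketch of how such a reference map might be built from an address trace. The 32-byte granule and 1000-instruction interval come from the post; the trace format (pairs of instruction count and address) and the use of a Python set as the sparse bit map are assumptions made for the example.

```python
from collections import defaultdict

GRANULE = 32      # bytes represented by one bit, as in the post
INTERVAL = 1000   # instructions covered by one map, as in the post


def reference_maps(trace):
    """trace: iterable of (instruction_count, address) pairs (hypothetical format).

    Returns {interval_index: set of 32-byte granule numbers touched}.
    Each set plays the role of one sparse bit map.
    """
    maps = defaultdict(set)
    for icount, addr in trace:
        maps[icount // INTERVAL].add(addr // GRANULE)
    return dict(maps)


trace = [(0, 0), (5, 31), (10, 32), (999, 4096), (1000, 64), (1500, 64)]
print(reference_maps(trace))
```

Maps of this shape are the kind of input a clustering step could use to decide which objects tend to be referenced together and should therefore be packed into the same pages.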
Storage utilization maps were also printed out. With respect to earlier comments about apl/360 ... in the straight forward apl/360 port to CMS ... the code was effectively ported over intact. The original OS code essentially swapped a 64kbyte "real" region between disk and memory. The code running against the 64kbyte workspace treated it as "real memory". The storage management was very primitive, sequentially using locations of memory until all memory was exhausted ... and then doing garbage collection back to minimal storage. This produced a saw-tooth storage utilization map. It wasn't really too bad when dealing with a 64kbyte region ... but it became really unfriendly when talking about a 1meg workspace ... even for a 1kbyte application. The apl/360 storage management would proceed to (effectively) touch every location in the 1meg region and then do garbage collection collapsing it back down to 1kbyte. A q&d fix was to activate garbage collection at more frequent intervals (not just at bumping the top of the workspace area).
There was a recent comment that some of the larger cache sized machines (1-4mbyte cpu caches) were providing opportunities for the re-emergence of old '60s optimization technology associated with storage packing/optimization.
An associated issue that has been lurking in the back of my mind has to do with various OOPs technologies. Given a (non-OOPs) optimized program with "packed" instructions as well as "packed" data (variables that are packed together in small total storage space and can be accessed in a single reference) ... what happens when it is translated into an OOPs environment? Does the OOPs storage allocation (for data) spread out into lots of diffuse locations (multiple storage references across multiple different cache lines)? Do methods result in a similar diffusion of code?
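The saw-tooth effect is easy to reproduce with a toy allocator. The sketch below is not the apl/360 storage manager; it assumes a simple bump allocator, 4k pages, and a hypothetical `gc_at` parameter that models the q&d fix of collecting before the top of the workspace is reached.

```python
PAGE = 4096  # 4k pages, assumed for the example


def pages_touched(workspace_bytes, live_bytes, alloc_bytes, steps, gc_at=None):
    """Count distinct pages touched by a bump allocator.

    Storage is handed out sequentially; when the next allocation would pass
    the limit, garbage collection collapses usage back to live_bytes.
    gc_at (hypothetical) lowers the limit below the top of the workspace.
    """
    limit = workspace_bytes if gc_at is None else min(gc_at, workspace_bytes)
    top = live_bytes
    touched = set()
    for _ in range(steps):
        if top + alloc_bytes > limit:
            top = live_bytes  # garbage collect back to minimal storage
        touched.update(range(top // PAGE, (top + alloc_bytes - 1) // PAGE + 1))
        top += alloc_bytes
    return len(touched)


print(pages_touched(64 * 1024, 1024, 256, 10000))
print(pages_touched(1024 * 1024, 1024, 256, 10000))
print(pages_touched(1024 * 1024, 1024, 256, 10000, gc_at=16 * 1024))
```

The same 1kbyte application touches 16 pages in a 64kbyte workspace and 256 pages in a 1meg workspace, while collecting at a 16kbyte mark keeps it to 4 pages.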
Words of wisdom from Zippy:
It don't mean a THING if you ain't got that SWING!!
From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: Self-virtualization and CPUs
Date: 14 Apr 93 19:13:06 GMT
there is an annoying problem running VMM opsys (hypervisor or otherwise) under a VMM hypervisor .. when the opsys is attempting production work and its storage isn't pinned. Typically the hypervisor is executing a LRU-like replacement algorithm ... as is the opsys.
the annoyance is that the first level system has a tendency to page out the page that the 2nd level system would like to use next for paging. the situation can become quite pathological.
this applies to other types of scenarios also ... like DBMS systems managing caches in virtual memory. If the DBMS doesn't have a hook into the opsys' VMM ... the DBMS can repeatedly be picking a cache page that the opsys has removed from real storage. The scenario then is that the opsys has to page the virtual page back in before the DBMS can schedule the contents for replacement.
From: lynn@netcom4.UUCP (Lynn Wheeler)
Newsgroups: comp.arch.storage
Subject: Re: HELP: Algorithm for Working Sets (Virtual Memory)
Date: Sat, 24 Apr 1993 15:38:19 GMT
see my posting on 4/12 to comp.arch (360/67, was Re: IBM's Project F/S?) ... also see my post on 4/9 to comp.arch ... same subject. The Rodriguez-Rosell article mentioned in the 4/12 posting is a description of the Grenoble Scientific Center project mentioned in the 4/9 posting.
From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: PowerPC Architecture (was: Re: PowerPC priced very low!)
Date: Thu, 6 May 1993 15:37:27 GMT
32bit, 48bit, 52bit, 64bit, etc.
Typically 32bit applies to the number of bits in the virtual address.
The hardware takes that virtual 32bit address and typically looks it up in some sort of table-look-aside buffer to find the real address (although in prior discussion on 360/67, it was a fully-associative look-up, instead of a set-associative index). The TLB lookup is typically composed of three parts:
a) which virtual address space the virtual address is associated with
b) the virtual address
c) the real address
Using the two components of the lookup (which virtual address space identifier and which virtual address), a "real-address" is produced. The number of bits in the real-address is typically associated with the total amount of real hardware it is possible to configure on the machine (which can either be greater or less than the associated virtual address size).
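The three-part lookup can be modeled as a table keyed by the first two parts and yielding the third. This is a software sketch only, assuming 4k pages; real TLBs are associative hardware, and the class and method names here are invented for illustration.

```python
PAGE_BITS = 12  # 4k pages, an assumption for the example


class TLB:
    """Maps (address space identifier, virtual page) -> real page."""

    def __init__(self):
        self.entries = {}

    def load(self, space_id, vaddr, real_page):
        self.entries[(space_id, vaddr >> PAGE_BITS)] = real_page

    def translate(self, space_id, vaddr):
        """Return the real address, or None on a TLB miss."""
        real_page = self.entries.get((space_id, vaddr >> PAGE_BITS))
        if real_page is None:
            return None
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        return (real_page << PAGE_BITS) | offset


tlb = TLB()
tlb.load(space_id=1, vaddr=0x5000, real_page=0x2A)
tlb.load(space_id=2, vaddr=0x5000, real_page=0x77)
print(hex(tlb.translate(1, 0x5123)), hex(tlb.translate(2, 0x5123)))
```

The same virtual address in two address spaces resolves to two different real addresses, which is why the space identifier has to be part of the lookup key.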
On machines with page-tables, the virtual address space identifier is typically somehow associated with the page-table real address (either the actual address of the page table ... or possibly a page table origin stack index). On machines with a limited page table origin stack index (say 2-4), process switches that also involve address space switch ... can incur a performance penalty if a typical scenario requires more address space switches than the depth of the stack (say a micro-kernel implementation with different address spaces for each component in the kernel, and there are typically more kernel components than there is space in the page-table stack index). When the pagetable stack is exceeded, some sort of LRU algorithm is used to invalidate a pagetable stack entry ... which then requires invalidating all the associated entries in the TLB.
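A small model of the penalty described above, assuming a strict LRU policy for the origin stack and a TLB whose entries are tagged with the stack slot. The names (`OriginStack`, `switch_to`, `purges`) are hypothetical and no particular machine is being described.

```python
from collections import OrderedDict


class OriginStack:
    """LRU-managed stack of page-table origins; TLB entries carry the slot index."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = OrderedDict()  # page_table_origin -> slot index, in LRU order
        self.tlb = {}               # (slot, virtual_page) -> real_page
        self.purges = 0             # number of times a slot's TLB entries were invalidated

    def switch_to(self, origin):
        """Make `origin` the current address space and return its slot."""
        if origin in self.slots:
            self.slots.move_to_end(origin)
            return self.slots[origin]
        if len(self.slots) < self.depth:
            slot = len(self.slots)
        else:
            # Stack exceeded: evict the least recently used origin and
            # invalidate every TLB entry tagged with its slot.
            _, slot = self.slots.popitem(last=False)
            self.tlb = {k: v for k, v in self.tlb.items() if k[0] != slot}
            self.purges += 1
        self.slots[origin] = slot
        return slot


for depth in (2, 3):
    stack = OriginStack(depth)
    for origin in "ABCABC":
        stack.switch_to(origin)
    print(depth, stack.purges)
```

Cycling through three address spaces with a two-entry stack forces a TLB purge on most switches, while a three-entry stack forces none.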
Some machines that have segmented virtual address space, may implement either a segment table stack associated TLB or a page table stack associated TLB (i.e. entries are associated with an address space identifier, i.e. the segment table real address origin OR a virtual segment associated TLB, the page table real address origin for a specific virtual segment within the virtual address space).
In any case, the TLB entry typically will have some sort of virtual address space identifier PLUS the virtual address identifier ... which then maps to a real address. The number of bits used in the TLB for a virtual address space identifier is dependent on whether the real page (or segment) table address is stored ... or whether there is a separate address space stack, and the TLB entry only contains an index into the address space identifier stack (where the address space identifier may be the unique address space, or possibly unique segment within the address space). In a segmented virtual address space architecture, with a system design point that would include large numbers of shared segments, there is a slight performance advantage to using a segment-associated identifier rather than a virtual address space identifier (i.e. with virtual address space identification, information regarding a shared segment's pages would potentially occur multiple times in the TLB).
The Power architecture implements inverted page tables; there are NO page and segment tables of the kind found in some other architectures. The Power architecture does not have a virtual address space register that points to the virtual->real page table mapping ... since no such table exists. To provide the TLB hardware with a mechanism for uniquely identifying which virtual address space a virtual page exists in, the Power architecture defines a logical segment identifier. Effectively this logical segment identifier takes the place of a pointer to a real storage location of page tables (found in other architectures). The logical segment identifier is used by the TLB hardware to uniquely distinguish which virtual address space a specific virtual page belongs to (in a page-table architecture, the virtual page number is used to index into a real page table that exists in storage someplace, the address of that specific real page table is what is used to distinguish one program's virtual address space from some other program's virtual address space).
In the Power architecture with inverted pagetables, there are no pagetables ... and therefore there is no real page table address that can be used to distinguish one program's virtual address space from any other program's virtual address space. The Power architecture, in place of a real page table address, uses a logical segment identifier to distinguish between one virtual address space and another virtual address space.
The Power architecture implements a virtual segment architecture using 16 segment table registers (many machines use a single virtual register, which points to a real storage location that contains the page-table ... for flat virtual address space, or a segment table, for segmented virtual address space). When decoding a virtual address in the power architecture, the high 4 bits of the address are used to index one of the segment registers (in other segmented architectures, the segment part of the virtual address is used to index an entry in some real segment table, and picks out a specific page-table). The specific segment table register selected, yields the logical segment identifier. This is roughly equivalent to a PTO-associative (i.e. Page table origin) TLB, in a segmented virtual address space architecture.
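The decode step can be sketched in a few lines. The 4-bit segment select and the 16 segment registers come from the text; the register contents and the 24-bit identifier width used in the last line are assumptions chosen only to show the arithmetic.

```python
SEG_SHIFT = 28                      # 32-bit address minus the 4 segment-select bits
OFFSET_MASK = (1 << SEG_SHIFT) - 1  # low 28 bits: offset within the segment


def decode(segment_registers, vaddr):
    """Split a 32-bit address into (logical_segment_id, offset_in_segment)."""
    if len(segment_registers) != 16:
        raise ValueError("expected 16 segment registers")
    index = (vaddr >> SEG_SHIFT) & 0xF        # high 4 bits select a register
    return segment_registers[index], vaddr & OFFSET_MASK


def combined_bits(segment_id_bits):
    """Bits in the combined (logical segment id, offset) address."""
    return segment_id_bits + SEG_SHIFT


regs = list(range(100, 116))  # made-up logical segment identifiers
print(decode(regs, 0x30001234))
print(1 << SEG_SHIFT)         # maximum segment size in bytes
print(combined_bits(24))      # assuming a 24-bit identifier
```

The pair returned by `decode` is what a TLB would match on, in place of the (page table origin, virtual page) pair used by page-table architectures.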
The number of bits in this logical segment identifier roughly determines the total number of different virtual segments that might have virtual page address entries in a TLB at any one time. It is roughly equivalent to the maximum number of different page tables that can exist in real memory (in other architectures that have segment/page tables).
In a Power implementation to talk about the total number of bits in the combined virtual address and the logical segment id, is roughly equivalent to talking about the number of simultaneous different virtual address spaces that can exist in other systems (i.e. maximum virtual address space size times the total number of different concurrent address spaces).
From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: alt.internet.services,alt.unix.wizards,comp.ai,comp.graphics
Subject: Re: Where did the hacker ethic go?
Date: Sat, 8 May 1993 02:27:35 GMT
it seems more likely that there is a relatively small supply of hackers, regardless of the number of programmers. There are now a significantly larger number of people doing programming ... w/o a corresponding increase in the number of the old style hackers. They may still be there, just not as proportionally significant.
as an aside, one of my favorite questions along this line has to do with language proficiency ... using the guideline that a measure of proficiency in learning a foreign language is when the person stops "translating" and starts to "think" in the language ... how many people find themselves proficient in a programming language?
From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: PowerPC Architecture (was: Re: PowerPC priced very low!)
Date: Sat, 8 May 1993 02:33:27 GMT
a segmented virtual memory architecture that uses (only) the high four bits of a 32-bit virtual address to select a segment number ... implies that there are at most 16 "segments" in the virtual memory segmented architecture ... and that the segment size is bounded by 0 and 2**(32-4).
From: lynn@netcom.UUCP (Lynn Wheeler)
Newsgroups: alt.internet.services,comp.ai,comp.graphics
Subject: Re: Where did the hacker ethic go?
Date: Thu, 13 May 1993 16:38:56 GMT
ok, given that computer language proficiency works for programming in the small ... what works for programming in the large? programming is a relatively young human endeavor ... there is little or no natural language vocabulary/lexicon.
The "book" paradigm implies the use of the craft/artistic side of the brain, rather than the analytical side. Is the switch because of the lack of words/vocabulary? Would it be possible to do programming in the large using the analytical side if a person could generate symbol abstractions on the fly? Given that neither the artistic approach nor "abstraction on the fly" translates to natural language well (i.e. can one describe the brain processes analytically associated with programming in the large?) is it possible to tell the difference?
From: email@example.com (Lynn Wheeler)
Newsgroups: comp.arch
Subject: managing large amounts of vm
Date: Tue, 29 Jun 1993 15:34:02 GMT
there are a number of optimizations for managing large amounts of virtual memory.
an obvious one that dates from the late '60s is to not allocate backing store for virtual memory until process/application actually touches&changes a page.
in the early '70s, i implemented paging page tables (similar in concept to paging virtual memory ... but applied to "inactive" virtual memory control tables).
with the advent of (relatively) large real memory ... there is another optimization technique. traditionally when a page is read into memory, the backing store location is left allocated. when the page is selected for replacement ... if it has not been changed ... there is no need to "write" the page to disk (the copy on disk is left "intact"). in 1980, i implemented a scheme which monitors depletion of backing store space (disk paging space) and switches the management of backing store space from a "dup" algorithm to a "no-dup" algorithm.
A "dup" algorithm leaves the backing store slot allocated when a virtual page is read in. A "no-dup" algorithm deallocates the backing store slot when a virtual page is read in (i.e. there is no "duplicate" virtual memory page left on the paging disk). In a "dup" situation, the max/total number of allocated virtual pages is effectively the size of the space allocated on disk for paging. In a "no-dup" situation, the max/total number of allocated virtual pages becomes the combination of real memory and paging disk space.
The "dup" algorithm achieves some performance benefit in not having to "write" replaced pages that haven't been changed (disk i/o load). The "no-dup" algorithm trades off additional disk i/o load for a potential increase in the total number of extant virtual memory pages.
A simple example is a high-end workstation with 512mbytes of real memory and 512mbytes of disk paging space. In a "dup" scenario, the total amount of extant virtual memory effectively becomes 512mbytes. In the "no-dup" case, the total amount of virtual memory can grow to 512mbytes+512mbytes.
The switch back&forth between "dup" and "no-dup" operational modes is based on total number of allocated virtual memory pages compared to the total number of disk paging page slots.
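A toy model of the two modes and the switch between them. This is a sketch, not the 1980 implementation: the `switch_fraction` threshold is a hypothetical tuning knob, page contents are not modeled, and `page_in` is used both for first touch and for reads from disk.

```python
def max_virtual_pages(disk_slots, real_frames, no_dup):
    """Upper bound on allocated virtual pages under each strategy."""
    return disk_slots + real_frames if no_dup else disk_slots


class PagingSpace:
    def __init__(self, disk_slots, switch_fraction=0.9):
        self.disk_slots = disk_slots
        self.switch_fraction = switch_fraction  # hypothetical switch point
        self.slot_held = set()  # virtual pages that currently own a disk slot
        self.resident = set()   # virtual pages currently in real memory
        self.writes = 0         # page-outs that needed a disk write

    def no_dup(self):
        """Switch on allocated virtual pages compared with disk page slots."""
        allocated = len(self.slot_held | self.resident)
        return allocated > self.switch_fraction * self.disk_slots

    def page_in(self, page):
        self.resident.add(page)
        if self.no_dup():
            # No-dup: resident pages give up their disk slots (this also
            # covers the sweep needed on the dup -> no-dup transition).
            self.slot_held -= self.resident

    def page_out(self, page, changed):
        self.resident.discard(page)
        if changed or page not in self.slot_held:
            if page not in self.slot_held and len(self.slot_held) >= self.disk_slots:
                raise MemoryError("no free disk page slots")
            self.slot_held.add(page)
            self.writes += 1  # unchanged pages with an intact copy skip this


pages = 512 * 1024 * 1024 // 4096  # the 512mbyte example, in 4k pages
print(max_virtual_pages(pages, pages, no_dup=False))
print(max_virtual_pages(pages, pages, no_dup=True))
```

In dup mode an unchanged page is replaced without a write because its disk copy is intact; in no-dup mode the same page-out costs a write, which is the trade-off described above.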
From: firstname.lastname@example.org (Lynn Wheeler)
Newsgroups: comp.arch
Subject: Re: managing large amounts of vm
Date: Tue, 29 Jun 1993 18:21:33 GMT
when all slots are allocated (whether using a dup or no-dup strategy) then lock-up can occur if the process requesting additional page slots is not terminated. Systems that I worked on since the late '60s always implemented the termination strategy (as opposed to system lock-up, ... oh and with strategies for biasing termination towards non-system processes).
however, there is a difference in the operating region between "dup" and "no-dup" strategy when the termination strategy takes effect. In the "dup" case, it occurs when all page slots on disk are allocated, in the no-dup case, it occurs when all slots in the combined paging disk area plus real memory are allocated.
The issue regarding implementing the dynamic dup/no-dup strategy presents itself if there are a significant number of operating regions where the number of allocated virtual pages is > (disk page slots) and < (disk page slots)+(real memory slots).
A sample scenario: a user tunes an application so that it takes all real memory slots ... i.e. little or no paging activity during the execution of the application. Assuming this is a workstation, single user environment ... while such an application is running, the majority of extraneous demon & other system process virtual memory pages are rolled out to disk. In the "dup" case, there has to be enuf page-slots available on disk for all the demons plus all of the application. In the "no-dup" case, there only has to be enuf page-slots available on disk to handle all the demon/system process virtual memory pages. A non-trivial number of workstations have real-memory configurations that can contain all demon/system process virtual memory pages.
As a result, some number of large memory workstation environments could actually be configured with fewer disk page slots than there is real memory (assuming no-dup strategy) .... i.e. none of the pages in real-memory have allocated slots on disk. If a demon "goes off" during the execution of the application ... its pages can be "paged-into real memory", its disk page-slots released, and an equivalent number of the application pages "paged-out" to the newly released disk page-slots (that had belonged to the demon before it was brought in).
... oh, btw, in the transition between "dup" to "no-dup" strategy, the code also had to run thru pages in real-memory and release any associated disk page slots.
From: email@example.com (Lynn Wheeler)
Newsgroups: comp.sys.next.advocacy,comp.arch
Subject: Re: S/360 addressing
Date: Thu, 16 Sep 1993 17:22:58 GMT
... except for 360/67 (mid to late 60s) which had two virtual address modes, 24bit and 32bit (not 31bit). 360/67mp also had full-blown channel controller, something that wasn't seen again (in IBM, except possibly for the 125 IOPs) until 3033 days (late 70s). late models of the 3033 also had a kind-of "real-mode" 26-bit addressing (i.e. 64mbyte storage) ... which was not specifiable in instructions but the real page number could be specified in the virtual address page-table-entry.
trivia fact: one of the results of mp operating system work on the 360/67 at IBM/CSC in the late '60s & early '70s was the architecting of the compare and swap instruction. the choice of the mnemonic "CAS" was because it is the initials of the person that did the primary work (i.e. the designation compare & swap was based on the requirement to find a word combination to match the initials) although the mnemonic is frequently corrupted to CS. A requirement (placed on the group) for getting CAS implemented in machine hardware involved architecting a use that wasn't MP-specific (i.e. atomic storage update by application non-disabled, interruptable regions of code).
From: firstname.lastname@example.org (Lynn Wheeler)
Newsgroups: alt.folklore.computers
Subject: unit record & other controllers
Date: Thu, 23 Sep 1993 06:22:59 GMT
my first programming job was an undergraduate summer job. the university had a 709 with a 1401 front-end for handling unit record (all card input went to 7track tape on the 1401, carried over to 709 and jobs run ... output went to tape ... which was carried back to the 1401 tape drive and would produce printer/punch output).
The university was going to replace the 709 with a 360 .. as an interim step the 1401 was replaced with a 360/30 ... and I was given the job of implementing the 1401 MPIO utility on the 360. I was supplied with 360 assembler & machine code manuals as well as 1401, 2540 printer/punch and tape drive manuals. The program needed to be able to handle both input & output tapes simultaneously (i.e. card->tape and tape->printer/punch running at the same time). It needed to be able to run both under os/360 and stand-alone, managing its own interrupts & I/O as well as error recovery.
the 2540 reader/punch had five output pockets ... two pockets that could be fed from the reader, two pockets that could be fed from the punch, and a center pocket that could be fed from either. I only worked on one application that used the center pocket ... does anybody else have applications that used the center pocket?
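The non-MP-specific use mentioned above (updating storage atomically from interruptable code) is usually written as a fetch, compute, compare-and-swap, retry loop. The sketch below simulates the instruction in software: a lock stands in for the hardware atomicity, and the `Word` and `atomic_add` names are invented for the example.

```python
import threading


class Word:
    """A storage word with a simulated compare-and-swap."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still holds `expected`; report success."""
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False


def atomic_add(word, delta):
    """Retry loop: fetch, compute, attempt the swap, repeat if another update got in first."""
    while True:
        old = word.value
        if word.compare_and_swap(old, old + delta):
            return old + delta


counter = Word(0)
workers = [
    threading.Thread(target=lambda: [atomic_add(counter, 1) for _ in range(1000)])
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(counter.value)
```

The update code never disables interrupts or holds a lock of its own; if it is interrupted between the fetch and the swap and the word changes meanwhile, the swap fails and the loop simply retries.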
The 2540 card reader could be operated in two ways,
1) a single I/O operation which would read/feed in a single operation
2) separate commands for "feeding" a card and for "reading" a card
trivia question: what was one of the primary reasons for the separate feed/read mode of operation?
2540 reader answer:
The ibm 80-col card has 12 rows ... for BCD (& EBCDIC) a col. encodes a single character (6bit for BCD, 8bit for ebcdic) mapping into a single byte. However, for "binary", two six-bit bytes were encoded/punched using all 12 rows in each column. Typically a card was fed & then read using a "BCD" read operation (which would read a single column into a single byte). If the card had invalid "BCD" punch codes (i.e. col. binary), the I/O operation would result in an error. It was then necessary to reread the card using a col. binary I/O operation (which read the 80 columns into 160 byte positions).
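A minimal sketch of the column-binary read described above, in Python rather than anything that ran on the 360. It only illustrates the 80-column to 160-byte expansion; the choice of which six rows land in the first byte, and the names `column_binary_read` and `card`, are assumptions for illustration.

```python
ROWS = 12       # punch rows per column
COLUMNS = 80    # columns per card

def column_binary_read(columns):
    """Split each 12-bit column into two 6-bit values, one per output byte.

    Assumption: the upper six rows go to the first byte and the lower six
    rows to the second; the real hardware's bit placement may differ.
    """
    if len(columns) != COLUMNS:
        raise ValueError("a card has exactly 80 columns")
    out = bytearray()
    for col in columns:
        if not 0 <= col < (1 << ROWS):
            raise ValueError("a column holds at most 12 punches")
        out.append((col >> 6) & 0x3F)   # upper six rows
        out.append(col & 0x3F)          # lower six rows
    return bytes(out)

# Hypothetical card: only column 1 is punched.
card = [0] * COLUMNS
card[0] = 0b101000_000011
image = column_binary_read(card)
```

Here `len(image)` is 160, matching the 160 byte positions mentioned in the text.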
several years later I got a copy of the stand-alone LLMPS which had a whole set of various types of unit record routines. I was also informed that LLMPS formed the original core/nucleus for the MTS system. Can anybody verify that?
my 2540 center pocket application:
doing class scheduling application using cards, the class schedule cards were read into the center pocket. As each card was read, it was analyzed. If some error was found in the card, a (blank) card was punched behind it (the cards from the punch side had a different colored top strip). Postprocessing cards in error just involved locating all the cards in the tray that had a colored strip card following them.
... hasp misc.
somewhere in the dim past I had a project where I replaced the HASP 2780 driver code with TTY & 2741 drivers along with a CMS-like context editor ... to provide a HASP-based CRJE support.
I had programmed the 2702 telecommunications software so that the software could automatically recognize whether there was a 2741 or a TTY coming in on the dial-in line (and update the appropriate fields). Supposedly the 2702 had all the command "control" functions that allowed it to work. After it was all up and working (demo'ing that both TTY and 2741 could dial into the same base rotary) ... the IBM hardware people told me it wouldn't work. The 2702 design & implementation had all the necessary logic to "switch" line-scanners on a per-line basis. The hitch was that somewhere along the way, they took a hardware shortcut and the frequency oscillator was hardwired to each line. While the line-scanner could be changed ... the line speed couldn't. For whatever reason, there was enuf slop between the 2741 rate and the TTY rate that it worked anyway.
I think it was not too long after that we started a project to build our own telecommunications controller (I believe the machine eventually became the first OEM IBM control unit).
Lynn Wheeler | internet: email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: unit record & other controllers Date: Thu, 23 Sep 1993 07:55:37 GMT
oh yes, this first oem telecommunications controller that we did ... it not only handled terminal type recognition but also did automatic speed recognition ... on initial connection ... it effectively did something like 10* oversampling on the in-coming bits to get an estimate of bit duration.
when we first attached it to the ibm byte-mux channel, it red-lighted the CPU (i.e. hardware failure). It turns out that we were holding op-in for >13mics. The cpu timer on the 360 was located in main memory and "ticked" every 13mics. If the timer was locked out of updating the memory location for two ticks ... it generated a hardware failure.
After that was overcome ... we started transmitting lots of data to memory of the 360 ... however on closer examination it looked as if it was all garbage. the test case we were using was tty ... and the "controller" had a small task monitor that a tty could interact with (i.e. the core machine was an ascii minicomputer). It took me several hours to realize the problem. As it turns out we were just transmitting straight ascii into the ibm mainframe ... which it found to be garbage. The "problem" was that (at least) the 2702 line-scanner took the in-coming leading bit and placed it into the low-order bit position in a byte. Effectively from "straight" ascii byte standpoint the bits in a byte had been reversed. In order for the ibm mainframe to think we were transmitting it valid ascii ... we had to invert the bit order in the byte before sending it to ibm mainframe memory.
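The bit-order inversion can be sketched as follows. This is a Python illustration of reversing the eight bits of each byte, not the controller's actual code; the names `reverse_bits`, `REVERSE_TABLE` and `to_line_scanner_order` are made up for the example.

```python
def reverse_bits(byte):
    """Return the 8-bit value with its bit order reversed (bit 0 <-> bit 7)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (byte & 1)   # move the low bit of the input to the output
        byte >>= 1
    return result

# Precomputed table so a whole buffer can be converted in one call.
REVERSE_TABLE = bytes(reverse_bits(b) for b in range(256))

def to_line_scanner_order(data):
    """Reverse the bit order of every byte in a bytes object."""
    return data.translate(REVERSE_TABLE)

flipped = to_line_scanner_order(b"A")   # ascii 'A' is 0x41
```

Applying the conversion twice gives back the original bytes, so the same table works in both directions.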
trivia question: what does the following 360 mainframe instruction do:
the instruction is a memory to memory move of 8 bytes from hex location 50 to hex location 4c (overlapping move). Since the "timer" was 4 bytes at location hex 50 in low memory, in a single atomic instruction, it saved the current value of the time and reloaded it with a new value.
A frequent convention was to keep a (large) standard value at hex 54 (the location to be moved into 50). The difference between the value at 4c and 54 (after the move) was effectively the elapsed time since the previous reset of the timer. This delta value was used to accurately update time used by different processes (at least to 13microsecond resolution).
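A rough simulation of that save-and-reload trick, in Python. It models low storage as a byte array and the overlapping move as a left-to-right byte copy. The assumption that the timer counts down, the reload value, and the helper names are illustrative only.

```python
import struct

SAVE = 0x4C     # receives the old timer value
TIMER = 0x50    # 4-byte timer location
RELOAD = 0x54   # standard (large) value kept here by convention

def overlapping_move(mem, dst, src, length):
    """Byte-by-byte, left-to-right move, so overlapping fields behave as described."""
    for i in range(length):
        mem[dst + i] = mem[src + i]

def read_word(mem, addr):
    return struct.unpack_from(">I", mem, addr)[0]

def write_word(mem, addr, value):
    struct.pack_into(">I", mem, addr, value)

low = bytearray(128)                          # first 128 bytes of storage
write_word(low, RELOAD, 0x7FFFFF00)           # hypothetical standard value
write_word(low, TIMER, 0x7FFFFF00 - 1234)     # assume the timer has counted down 1234 units

# One 8-byte move from hex 50 to hex 4c: saves the timer and reloads it.
overlapping_move(low, SAVE, TIMER, 8)

# Assuming a count-down timer, elapsed time is the reload value minus the saved value.
elapsed = read_word(low, RELOAD) - read_word(low, SAVE)
```

After the move, location hex 50 holds the standard value again and `elapsed` is the 1234 units that had been counted off.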
Lynn Wheeler | email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: unit record & other controllers Date: Thu, 23 Sep 1993 16:05:12 GMT
a new trivia question: what is the 12-2-9 punch in col. 1 of punch card.
my 1st programming job, a 1401-MPIO replacement for the program that did front-end unit record I/O for the 709 ... quickly grew to an unwieldy size of over 2,000 cards. it took 30 minutes to assemble (translate from symbolic machine language to a "hex" card deck). This was the "stand-alone" version ... The version with OS data management macros (to leave OS executing while it ran, rather than take over the whole machine) took significantly longer. A single "DCB" macro took 6 minutes (elapsed time) to assemble. 6-7 DCB macros & other misc. stuff pushed the assemble time for my program to well over an hour.
it took a couple months ... but it soon became faster for me to repunch "hex" cards to patch a program ... than it was to re-assemble the whole program ... in doing so, I had to learn to read hex punch cards (i.e. translate the punch holes into hexadecimal equivalent).
The card key-punch had two functions necessary for this,
"duplicate" - i.e. old card in copy position, new card in punch position ... and hold down the duplicate key
"multi-punch" - hold down a key that stopped the automatic card col. advancement and allowed the rows (in a col) to be individually selected
... that was before I learned about REP cards.
Note that keypunches had the equivalent of carriage-control tapes ... which were standard punch cards, appropriately encoded (with commands) and wrapped around a small drum located in the top middle of the keypunch (026s & 029s). A relatively simple function was to "interpret" (read the punch-holes and print the corresponding character on the top of the card) a card deck (at least non-hex card deck that had been punched on say a 2540 punch). Load the hopper with your (punched) card deck, set up the control card to automatically interpret, feed the first card ... and it could automatically process the rest of the deck.
Lynn Wheeler | internet: email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: re: location 50 Date: Thu, 23 Sep 1993 21:39:34 GMT
code would "mess" up various stuff in MVT if you were running MVT (like the CVT update) ... but not if you were running a different operating system like cp/67 ... and this was the code it used to do timer maintenance as well as cpu accounting for all activity (accounting by virtual machine for both time spent in virtual machine mode ... as well as accounting for all time spent in supervisor/kernel mode & for which process the time in kernel mode was on behalf of).
mft/mvt (& vs1/svs) had a harder time of storage integrity even with standard 360 storage protect because everything resided in the same address space ... and all kernel code expected to execute in key0 (rather large-grained storage protect). With lots of bugs in the kernel code from which the kernel had little or no protection ... the most serious was various types of low-core overlay of the first 128 bytes of real storage. The hardware interrupt vector for "program" interrupts (various instruction failures) was in the first 128 bytes of storage. Other "undefined" locations in low storage were also used by low-level kernel routines (depending on operating system). Given the gross granularity of protection and the frequency that programming bugs resulted in loading a zero/null value into an address/base register ... there was a significant percentage of software errors that resulted in critical storage corruptions.
note there is a current thread running over in comp.arch on dealing with NULL storage pointers. Part of the problem was that a 360 instruction could address the 1st 4k of real memory in two ways:
1) not specifying an address register (i.e. 0) and using address encoded in the instruction
2) specifying an address register ... and having a null value in the register
in general, instructions using the 1st mode were "valid" and instructions using the 2nd mode tended to be software errors.
From: email@example.com (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: 1st non-ibm 360 controller Date: Sun, 26 Sep 1993 19:31:17 GMT Lines: 12
i believe that the manufacturer of the minicomputer that we used for 360 telecommunications controller project (dynamic terminal identification, dynamic speed recognition, etc) ... also built the minicomputer that was used for the non-dec/pdp unix port (some 10+ years later).
Lynn Wheeler | firstname.lastname@example.org
From: email@example.com (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: Most Embarrasing Misposting Date: Thu, 7 Oct 1993 21:37:14 GMT
it wasn't a posting ... but I once pointed at the wrong file as a cc: list and sent out email to 15,000+ people
Lynn Wheeler | firstname.lastname@example.org, email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: comp.arch.storage Subject: Re: Too much data on an actuator (was: 3.5 inch *9GB* ) Date: Thu, 7 Oct 1993 21:49:38 GMT
there is not only non-uniform arrival rates but also non-uniform access characteristics and/or even bursty access patterns.
assuming some uniformity of access patterns, it is possible to do access/profiles for various data clusters in terms of accesses/mbyte/sec.
Not only can data-clusters with certain (low) access rate profiles be migrated to slower speed devices ... but it is also logically possible to partition a large (9gb?) drive ... where possibly some small amount of data with very high access/mbyte/sec activity was positioned in one partition and other partitions had much lower activity data (in theory the allocation could be done such that the sum of the accesses/mbyte/sec for all the various data clusters was within the performance envelope of the drive).
this is harder to do for really bursty profiles ... unless the aggregation is large enuf such that some reasonable predictable statistical average is meaningful.
Lynn Wheeler | email@example.com, firstname.lastname@example.org
From: email@example.com (Lynn Wheeler) Newsgroups: comp.unix.aix Subject: Re: Assembly language program for RS600 for mutual exclusion Date: Mon, 11 Oct 1993 15:36:58 GMT
rs/6000 has no equivalent atomic instruction. aix has a system call that emulates compare&swap semantics ... and there is a c library definition for the "CS" mnemonic. it is special cased in the svc interrupt handler so that emulation is done within 8 instructions. since the interrupt handler is running disabled, there is no possibility of interruption and to all intents and purposes it is atomic from the standpoint of your program.
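For readers unfamiliar with compare&swap semantics, here is a hedged Python sketch. It is not the aix system call or its library wrapper; a lock merely stands in for "running disabled", and `Word`, `compare_and_swap` and `atomic_add` are hypothetical names.

```python
import threading

class Word:
    """A storage word with an emulated compare&swap."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()   # stand-in for the handler running disabled

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still holds `expected`; report success."""
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False

def atomic_add(word, delta):
    """Typical usage: fetch, compute, and retry if another thread got in first."""
    while True:
        old = word.value
        if word.compare_and_swap(old, old + delta):
            return old + delta

def worker(word, count):
    for _ in range(count):
        atomic_add(word, 1)

counter = Word(0)
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The retry loop in `atomic_add` is the usual pattern for mutual exclusion-free updates: a failed swap just means the value changed underneath, so the update is recomputed and tried again.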
Lynn Wheeler | firstname.lastname@example.org, email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: alt.fan.mts,alt.folklore.computers Subject: Re: MTS & LLMPS? Date: Mon, 11 Oct 1993 20:18:44 GMT
cp/67 was a rewritten version of "cp/40". csc did a custom design of relocation hardware that was prpq implemented on a 360/40. That was the hardware that was used for the implementation of cp/40 ... most of the stuff translated from ctss. cp/40 was then ported to the 360/67 when that hardware became available.
the cp/40 and initial version of cp/67 had a scheduler that looked like it might have been right out of ctss. One of the people at LL (around summer of '68) replaced that with a significantly simpler mechanism that included a mechanism for controlling page thrashing (number of tasks allowed to run simultaneously was a function of real storage independent of the execution characteristics of those tasks).
I don't believe that cp/40 had any llmps code.
llmps was similar in implementation to my first programming job which was to do a version of the 1401 MPIO (card-tape/tape-printer spooler) running on a 360/30 (so the 360/30 could be used as "front-end" for 709).
I didn't see any cp/67 code until january of '68 ... which was a couple months after a version was made available to LL.
Lynn Wheeler | email@example.com, firstname.lastname@example.org
From: email@example.com (Lynn Wheeler) Newsgroups: comp.arch Subject: Re: Multicpu's ready for the masses yet? Date: Mon, 11 Oct 1993 22:48:35 GMT
in the past when there has been significant underutilization of fixed and floating point units .... usually because of instruction decoder stalls (branches or cache misses) ... an approach to boost the instruction feed into the fixed & floating units was to add one (or more) independent instruction streams. From the programming point-of-view the architectures were smp ... but the hardware underneath may or may not be fully replicated.
superscalar and multiple fixed & floating point units are further starting to blur the distinction. On a single chip with multiple fixed & floating point units ... a design trade-off issue would be the difficulty in having a single pool of (multiple) fixed & floating point units being fed by two (or more) independent instruction streams.
A side question is (for any specific application) what is the off-chip cache and memory bus utilization for a single i-stream ... and whether or not there is sufficient excess capacity for the intended application.
Lynn Wheeler | firstname.lastname@example.org, email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: MTS & LLMPS? Date: Tue, 12 Oct 1993 18:05:09 GMT
i'm somewhat vague on the details but the virtual address space I believe was 256k bytes (64 4k pages) and the real machine was 256k bytes (64 4k pages). Somewhere around the storage-protect key array (standard on 360) they implemented the translation address stuff (i.e. each real page had a hardware entry giving the virtual address for that page). Task switch required reloading the translation address array. This was custom implemented on a 360/40 (only one that existed). Some of the people that had been involved in ctss worked on cp/40 and cms.
The 360/44 was a somewhat standard IBM product (360/40) that had extra hardware for numerical intensive workloads.
Current day virtual address translation involves a translation look-aside buffer (effectively a cache of virtual->real address translations) ... also, typically because of the high level of multiprogramming, TLB entries tend to have some sort of "ownership" information (i.e. translations for multiple, different address spaces can co-exist in the TLB simultaneously).
The 360/67 had an 8-way fully associative "look-aside" buffer (i.e. lookup was simultaneously executed against all 8 entries), w/o any ownership identification ... i.e. virtual address space switch required that all eight entries were always flushed. '67 also had a virtual address mode bit that selected between 24-bit virtual address and 32-bit (not 31-bit) virtual address modes.
when the '67 came along, the csc group ported cp/40 to the '67. btw, csc was located on the 4th floor of 545 tech. sq (project mac & i believe the ge645 were a couple floors up)
Lynn Wheeler | email@example.com, firstname.lastname@example.org
From: email@example.com (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: MTS & LLMPS? Date: Wed, 13 Oct 1993 18:20:58 GMT
32-bit was standard on all '67s ... in fact i believe that some number of customers with memory-mapped '67 applications migrated to multics when the 360/67 was discontinued (and the 370 line only had 24-bit addressing). I ran across a copy of one of the early 360 documents which described the model "60" and the model "62" ... which became the model 65 and the model 67. I've never seen any reference to model 66 tho. The model 62 was described as coming in 1 cpu, 2-way smp, and 4-way smp models ... '67 only had 1 & 2 cpu models (other than the custom triplex for the manned orbital lab). Charlie Salisbury worked on the triplex at lockheed and then later on cp/67 duplex support. Charlie is also responsible for the C&S instruction (actually the C&S mnemonic was chosen because they are his initials and then words were generated that went with the letters).
LL had defined the search list instruction as a RPQ to the '67 ... and modified CP/67 to use it for various types of queue searches (especially free storage allocation). cp/67 had a SLT simulator for '67s w/o the RPQ. It eventually fell into disuse when other types of paradigms were developed for managing the task (i.e. push/pop free-storage subpool allocation operated in 12-14 instructions for 90+ percent of free-storage requests, beating the SLT instruction which still had to do the memory accesses).
you mean in the hey day of protests ... and somebody called the boston office of the FBI to say that there was a bomb planted in the offices of a certain government intelligence agency ... and then went roaming to see what building got evacuated? also ... while it wasn't on the office doors ... in the telephone room for that floor, the telephone company had written the agency initials on the board next to their punch-down blocks.
there was also a "bug" in the translation hardware in the '67 ... that as far as I know never got fixed. Whenever the address space control register was loaded, the process invalidated all the entries in the translation look-aside buffer. However, there was an interesting bug in the page-fault interrupt hardware ... that zero'ed all entries in the associative array ... but forgot to invalidate the entries. The problem only showed up if attempting to execute in relocation mode ... after a page fault ... without having done a control register reload in the interim.
Lynn Wheeler | firstname.lastname@example.org, email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: alt.folklore.computers Subject: Re: crabby, stu, initials, etc Date: Thu, 14 Oct 1993 19:12:26 GMT
stu also had a lot of "MAD" macros ... which were a lot harder to "fix-up" than just changing comments. Some of them were not only used by system code ... but also (user) application code.
they cleaned out all the initials ... i had identified a lot of stuff (like all the tty/ascii support) with the initials of the university i was working at.
last time i ran across crabby he was in atlanta & had something to do with as/400 application code.
From: email@example.com (Lynn Wheeler) Newsgroups: comp.arch.storage Subject: Re: Log Structured filesystems -- think twice Date: Fri, 15 Oct 1993 18:57:45 GMT
I built a filesystem in '86 that was similar in function to much of the LFS stuff w/o the cleaner function. This was done originally to try and keep high-speed links fed.
I had a project in the early to mid 80s that implemented an internet-backbone-like pilot running satellite tdma Ku-band T2 links. bandwidth was re-allocatable on superblock boundary (round-trip about 800mills) and the stations supported frequency hopping (designed for reed-solomon & viterbi fec & a variation on double-key DES ... which gave us some problems in one or two quarters). One of the early problems I had to solve was enhancing the protocol to do rate-based pacing ... since windowing was pretty ineffective ... especially taking into account bursty traffic and the round-trip propagation delay (this was single-hop ... some of the double-hop systems are twice as bad).
Anyway, after that ... one of the main problems was filesystems being able to sustain T2 thruput to keep the links fed (sustained, bidirectional T2 can be harder than it looks).
Around 86, I finally redid a filesystem for some of the attached nodes that had a lot of performance enhancements (i.e. bit-maps, late binding, contiguous writes, contiguous allocation, locality of data, indirects and metadata). Besides the thruput functional issues there are some integrity issues.
For recoverable/consistent, the metadata has to be carefully written and/or logged. For "transaction" operations there is a whole class of failure modes associated with atomic transactions involving multi-record writes. In the implementation I did in '86, I used careful update with some shadowing ... however the logical (metadata) records were >1 physical record. I used an 8-byte log sequence number at the start of the logical record and at the end of the logical record. If (when reading) the two values weren't the same ... there was an incomplete write (and therefore the record was inconsistent). The controllers in this case didn't support out of order writes ... or in-order writes would have to have been forced for meta-data writes.
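The leading/trailing sequence-number check can be illustrated with a small Python sketch. It is not the '86 implementation; the 512-byte physical record size, the record layout and the function names are assumptions, and like the original scheme it relies on physical records being written in order.

```python
import struct

PHYS = 512   # assumed physical record size in bytes
LSN = 8      # 8-byte log sequence number at each end

def pack_record(lsn, payload, nblocks):
    """Build a logical record spanning `nblocks` physical records."""
    body_len = nblocks * PHYS - 2 * LSN
    if len(payload) > body_len:
        raise ValueError("payload too large for logical record")
    stamp = struct.pack(">Q", lsn)
    return stamp + payload.ljust(body_len, b"\0") + stamp

def is_complete(record):
    """A record is consistent only if its first and last 8 bytes agree."""
    head = struct.unpack_from(">Q", record, 0)[0]
    tail = struct.unpack_from(">Q", record, len(record) - LSN)[0]
    return head == tail

old = pack_record(41, b"old metadata", 2)
new = pack_record(42, b"new metadata", 2)

# Simulate a failure after the first physical record of the update was
# written but before the second one reached the disk.
torn = new[:PHYS] + old[PHYS:]
```

`is_complete(new)` holds, while `is_complete(torn)` does not, because the leading number is 42 and the trailing one is still 41.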
RAID-5 typically has a similar inconsistency where both a single record write and the parity record update have to be performed as a single atomic operation. One solution is for the controller to have battery backed ram (possibly duplexed) that logs pending parity updates and forces completion on recovery after things like power failures. This gets a little trickier in a no-single-point-of-failure design ... if power failure occurs and controller with the log doesn't come back up ... the other controller needs access to the ram log(s) in order to force consistency.
However, the only piece of all of this that made it into any product was the rfc 1044 support.
Lynn Wheeler | firstname.lastname@example.org, email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: comp.arch.storage Subject: Re: Log Structured filesystems -- think twice Date: Sun, 17 Oct 1993 21:42:02 GMT
comparisons ... different case than earlier one I posted, it was '74 (instead of '86) ... base system ran in virtual memory ... filesystem ran thru "real I/O" but "real" actually involved v->r translation and indirection. Base Filesystem had bit-map but didn't take advantage of searching for contiguous (although sometimes it worked out). Logical I/O was (effectively) synchronous ... and could be done direct to/from target application address (unless logical request was
One "simple" advantage was reduction of pathlength ... "real I/O" simulation was eliminated ... as was "double I/O" ... i.e. virtual page that was target of read I/O operation had to be "paged" into memory before it was "overlayed" with the read op. The application virtual memory page was just remapped to filesystem disk location.
Since the base filesystem already had direct I/O ... just straight memory-mapping was not a win by itself ... but the pathlength reduction and possible double I/O on reads provided a lot of benefit.
However, there were lots of dynamic adaptive benefits ... particularly with "large" windows &/or full-file mapping. The underlying system code dynamically decided (based on real storage availability/contention, I/O availability/contention, etc) how to implement the request. Large multi-track I/O requests could be scheduled when there was lots of "free" real storage ... or in cases when the request was "larger" than real memory (or lots of contention) it would allow faulting & effectively segment the request.
In any case, applications ran up to three times faster (depending on degree of I/O boundness ... or I/O bottlenecking) between the unmodified filesystem ... and the same filesystem semantics implemented with the dynamic adaptive memory-mapping.
Applications could pseudo "tune" for this environment by specifying multiple "large" buffers (large is relative, let's say 16k up to several hundred k or larger). Underlying system would translate virtual buffer addresses into multiple virtual file "windows" (all over the same file). Application buffering calls (effectively) provided the "hints" to the underlying system when parts of a file were no longer needed and could be discarded (real pages occupied by the discarded virtual file pages could be remapped to another part of the same file). Application was paced/blocked automatically by a page fault if it got ahead of the disk I/O transfers (w/o requiring explicit serialization software in the application).
I had some applications tuned up in this manner (multiple large buffers/windows) that would run 5-10 times faster (on mapped-modified filesystem than same code/application/logic running on the unmodified filesystem).
As a total aside, the filesystem window/buffer "pacing" exhibited some of the same characteristics as the communication window pacing that I had when I started the T2 satellite project. When the application initially started ... all the filesystem windows/buffers were empty ... and so it immediately generated requests to fill all of them (in a single burst). In the filesystem case, it would actually schedule/execute the first request (if the device was available) and return ... and then the remaining requests for the other buffers/windows would all queue up ... and potentially get scheduled as a single physical operation. In a dedicated environment it was good, but it exhibited anti-social characteristics in a multi-user, shared environment. As a result, there were different adaptive optimization criteria based on dedicated/shared resource. Pre-emptive access to resources was typically needed in the 50-100mills interval windows.
Severe example of anti-social behavior (in shared environment)
As a semi-aside ... in the late '70s I shot an extreme disk performance problem at a large IBM mainframe multi-cpu, multi-system shop. They had tons of performance statistics ... which showed that when performance was "bad", all sorts of utilization dropped off; but in general there was little or no correlation in any of the statistics between good performance and bad performance.
Eventually I eye-balled in all the statistics that certain disk I/O rates dropped from an avg. of 30+/sec to 6-7/sec. It turns out that they had a very large program library and that the index for this library spanned 45 tracks. The mainframe methodology for finding something in a library index was left over from the early '60s when real storage was very scarce and there was large excess i/o capacity. An I/O operation could be scheduled that would do a sequential search of the library index (on disk) for a pointer to the desired item (such a search not only busied the particular disk drive ... but would also busy-out a large percentage of all I/O resources).
In this particular hardware configuration, a single search could span no more than 15 tracks in a single operation. The majority of the customer's workload (on all the processor complexes) involved loading various applications out of this particular program library. On the average, the index search involved 25-30 tracks. Spinning at 3600rpm, or 60rps ... a 15 track index lookup took 250mills (during which time the particular device and a significant portion of the rest of the I/O resources were unavailable/busy).
When things got loaded, all systems in the complex began to serialize on program loading from this particular library ... with a program lookup taking about 350-400mills and program load taking 30-50 mills. (400+mills and 3-4 physical I/O operations total for program load).
A "problem" in the performance statistics was that they only showed raw I/O event counts/sec ... and there was >order of magnitude difference between an index search I/O operation that lasted 250millis and a simple block read/write I/O that took 16mills. Also, the performance statistics were processor complex specific and the serialization was occurring across several processor complexes all sharing the same disk (and the same program library). I (effectively) had to manually aggregate the individual processor complex stats into a single "system" summary.
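The arithmetic behind those figures can be checked with a few lines of Python. The sketch assumes, as the text implies, that a search reads one track per revolution; the function name is made up for the example.

```python
RPM = 3600
REVS_PER_SEC = RPM / 60          # 60 revolutions per second

def search_ms(tracks):
    """Elapsed time for a multi-track search, assuming one revolution per track."""
    return tracks / REVS_PER_SEC * 1000

full_search = search_ms(15)      # the 15-track limit of a single operation
simple_io = 16                   # simple block read/write time from the text, in mills
ratio = full_search / simple_io
```

With these numbers `full_search` is 250 mills and `ratio` is a bit over 15, which is why counting both kinds of operation as one "I/O event" hid the problem.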
Lynn Wheeler | email@example.com, firstname.lastname@example.org
From: email@example.com (Lynn Wheeler) Newsgroups: comp.unix.aix Subject: Re: /etc/sendmail.cf UUCP routing Date: Fri, 24 Dec 1993 23:49:18 GMT
i've had the same/similar problem for going on a year. I've reworked sendmail.cf to send out all mail over uucp ... even tho some may be tcp/ip. problem i'm left with is that i code $f as valid "from" tcp/ip domain name ... even tho the immediate outgoing link is uucp. Everything works fine ... goes out correctly, arrives correctly, etc ... EXCEPT I get a gratuitous error mail saying that the mail wasn't sent (even tho it was) because of "no bang" (i.e. "!") in the from address.
I got the latest o'reilly sendmail book and it claims that sendmail.8.4.1 uses different rules for the wrapper and the header (i.e. in theory, in my case, sendmail is complaining about the from address in the wrapper being w/o a bang). However, the list of standard macros for sendmail.8.4.1 only shows $f ... so I don't see how to encode one kind of from address in the wrapper and a different address in the header (which i believe would get rid of the error, i.e. if I had a bang-from address in the uucp wrapper ... while maintaining a DNS-from address in the header).
... i.e. the point of this is that not only do i want outgoing tcp/ip mail to go out with the standard dns domain addressing ... but to also have my from address appear in standard domain format.
Lynn Wheeler | firstname.lastname@example.org, email@example.com
From: firstname.lastname@example.org (Lynn Wheeler) Newsgroups: comp.unix.large,comp.arch.storage Subject: Re: Big I/O or Kicking the Mainframe out the Door Date: Wed, 29 Dec 1993 18:31:42 GMT
The following is from a presentation made at the fall '86 SEAS meeting based on a report done in '83 (based on some work from the late '70s but with tables updated to reflect 1983 mainframe hardware configuration ... i.e. the presentation was made 7-8 years after the work). It compares two "mainframes" (used for same/similar multi-user timesharing environment), one circa 1968-1970 and one circa 1983. The "Current Performance" heading refers to '1983'.
The term "drums" refers to mainframe fixed-head disks (i.e. one disk r/w head per track). The value for drums is the total number of mbytes available in a typical configuration.
"Pageable pages" refers to the number of 4kbyte pages available for paging (total real memory minus fixed kernel requirements).
The "user" level corresponds to a reasonably performing time-sharing service providing 90th percentile sub-second response for interactive requests.
Page I/O refers to the number of 4kbyte page transfers per second.
User I/O refers to the number of distinct user file I/O operations per second (in the '67 system a single user file I/O operation ranged from 800bytes to 64kbytes transferred, avg. around 4k-8k; in the 3081K system a single user file I/O operation would range from 4kbytes to 64kbytes transferred).
The transfer rate is the standard file I/O disk transfer. On the '67, "drums" had a transfer rate of 1.5mbyte/second and disks had a transfer rate of 330kbyte/second. On the 3081K, both drums and disks had a transfer rate of 3mbyte/second.
The 360/67 is a single processor system. The 3081K is a 2-way symmetrical multiprocessor (each processor rated at 7mips).
The reference to "PAM" is work that I had done in the early '70s on page-mapped filesystems.
A current day "mainframe" (i.e. '93 instead of '83) would typically
have values 4-10* larger (except for things like "drum" space, avg.
arm access, and transfer rate).
For a real look at current performance and where the problems may be, it is helpful to place a current environment side by side with the 3.1L system. The following table shows a CP/67 3.1L system on a 360/67 with 768K of storage, three 2301 drums and 45 2314 drives (numbers given are for a 3.1L system w/o any PAM minidisks). It is compared with a typical 3081K HPO system with 32megs of storage, six 2305 drums, and 32 3380 actuators assumed to be running a workload with similar execution characteristics.
system              3.1L        HPO         change
machine             360/67      3081K
mips                .3          14          47*
pageable pages      105         7000        66*
users               80          320         4*
channels            6           24          4*
drums               12meg       72meg       6*
page I/O            150         600         4*
user I/O            100         300         3*
disk arms           45          32          4*?perform.
bytes/arm           29meg       630meg      23*
avg. arm access     60mill      16mill      3.7*
transfer rate       .3meg       3meg        10*
total data          1.2gig      20.1gig     18*

Comparison of 3.1L 67 and HPO 3081k

If we compare the resources that are traditionally considered critical - CPU and memory - we see an increase from 45 to 65 times between the 67 and the 3081. However we only see an increase by roughly a factor of four in the number of users supported. Even at that we see performance problems in supporting that many users. There appears to be ten times as much resource per user on the 3081 compared to the 67, however there are still performance problems. Why? An even more interesting comparison is to show the same information as a function of raw MIPS as given in the following table.
system              3.1L        HPO         change
machine             360/67      3081K
mips                .3          14
pageable pages      350         500         1.4*
users               266         22.8        .086*
channels            20          1.5         .075*
drums               40meg       5meg        .125*
page I/O            500         43          .086*
user I/O            333         21          .063*
disk arms           150         1.1         .0068*

67/3081 Comparison by resource per MIPS

system              3.1L        HPO         change
machine             360/67      3081K
users               80          320
mips                .0038       .044        11.58*
pageable pages      1.3         21.88       16.8*
channels            .075        .075        1*
drums               .15meg      .225meg     1.5*
page I/O            1.88        1.87        1*
user I/O            1.25        0.94        .75*
disk arms           .56         0.1         .18*

67/3081 Comparison by resource per User

Major problems can easily be seen in the data concerning I/O rates: system hardware has changed from being CPU and real storage constrained to being I/O constrained. The page I/O capacity (distinct from the page capacity, i.e., real storage) has increased by a factor of four. In addition, the user I/O capacity in terms of accesses per second has increased by a factor of four to eight. However, the rest of the hardware in the system (CPU, real storage) has increased by factors of 40 to 60.
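The per-MIPS and per-user figures are just the raw configuration numbers divided by the machine's MIPS rating or user count. A minimal sketch of that normalization, using a subset of rows copied from the first comparison table (the `normalize` and `show` helpers are names made up for this illustration, and results can differ in the last digit from the printed tables, which appear to use rounded intermediate values):

```python
# Raw figures copied from the first comparison table: (360/67 3.1L, 3081K HPO).
RAW = {
    "mips":           (0.3, 14),
    "pageable pages": (105, 7000),
    "users":          (80, 320),
    "channels":       (6, 24),
    "page I/O":       (150, 600),
    "user I/O":       (100, 300),
}


def normalize(raw, by):
    """Divide every resource by the `by` resource, separately per machine.

    Returns {resource: (value_67, value_3081, change)} where change is
    the 3081K value divided by the 360/67 value.
    """
    d67, d3081 = raw[by]
    out = {}
    for name, (v67, v3081) in raw.items():
        if name == by:
            continue
        a, b = v67 / d67, v3081 / d3081
        out[name] = (a, b, b / a)
    return out


def show(title, table):
    """Print one normalized table in roughly the layout used in the post."""
    print(title)
    for name, (a, b, change) in table.items():
        print(f"  {name:<15} {a:>9.3f} {b:>9.3f} {change:>8.3f}*")


per_mips = normalize(RAW, "mips")
per_user = normalize(RAW, "users")
show("resource per MIPS", per_mips)
show("resource per user", per_user)
```

Page I/O and channels per user come out essentially unchanged between the two machines, while MIPS and pageable pages per user grow by more than a factor of ten, which is the imbalance the tables are meant to expose.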
• system I/O capacity
• average I/O requirements per user on that system
• ratio of disk arms to MPL and probability of disk arm thrashing

During the late '60s & early '70s I had done work on dynamic adaptive feedback scheduling (doing "fair-share" and something that I characterized as "scheduling to the bottleneck" resource scheduling), page replacement algorithms, pathlength I/O work and other forms of optimization. The 3.1L system could:
* take a page fault
* select near optimal page replacement
* perform page I/O (both read and any writes)
* near optimal scheduling
* task switch
* take page I/O interrupt (completion signal)
* update paging tables
* task switch back to original process

in an average kernel pathlength of 500 instructions (or less, i.e. <1.5mills of '67 processing time). The '67 paging activity averaged 2/3rds reads/faults and 1/3rd writes (i.e. 150 page I/Os per second represented approximately 100 reads/faults per second, 50 writes/second, 200 task-switches/second and the total "overhead" involved was on the order of 150mills processor time).
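The paging overhead numbers above can be reproduced with a few lines of arithmetic. This is only a sketch: the variable names are invented here, the 1.5 ms figure is taken as the per-fault upper bound quoted in the text, and the "two task switches per fault" reading (switch away, then switch back) is an assumption based on the step list.

```python
MIPS_67 = 0.3            # 360/67 rating from the comparison table
PATHLENGTH_MS = 1.5      # per-fault kernel time bound quoted in the text
PAGE_IO_PER_SEC = 150    # total page I/O rate on the '67

# 2/3 of page I/Os are reads (faults), the remaining 1/3 are writes.
reads_per_sec = PAGE_IO_PER_SEC * 2 / 3
writes_per_sec = PAGE_IO_PER_SEC - reads_per_sec

# Assumption: each fault costs one switch away and one switch back.
task_switches_per_sec = 2 * reads_per_sec

# Kernel time spent on paging per wall-clock second, and as a CPU fraction.
overhead_ms_per_sec = reads_per_sec * PATHLENGTH_MS
cpu_fraction = overhead_ms_per_sec / 1000

# How many instructions fit in the 1.5 ms budget at the table's MIPS rating.
instructions_in_budget = MIPS_67 * 1e6 * PATHLENGTH_MS / 1000

print(f"reads/sec         {reads_per_sec:.0f}")
print(f"writes/sec        {writes_per_sec:.0f}")
print(f"task switches/sec {task_switches_per_sec:.0f}")
print(f"paging overhead   {overhead_ms_per_sec:.0f} ms per second "
      f"({cpu_fraction:.0%} of the processor)")
print(f"instructions in {PATHLENGTH_MS} ms at {MIPS_67} MIPS: "
      f"{instructions_in_budget:.0f}")
```

On these assumptions the whole fault path consumes roughly a seventh of the processor at 150 page I/Os per second, leaving the rest for user work.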
The investigation done in the late '70s was to show that the "scheduling to the bottleneck" algorithm from the late '60s that dynamically adapted the scheduling algorithm based on cpu, page I/O, and real storage resource consumption was inadequate (and frequently ineffective) since the primary constrained resource had become user file I/O.
One big difference between the mainframe system and traditional Unix systems is that the mainframe default/standard I/O paradigm operates directly between user space and the physical I/O interface ... while the Unix paradigm frequently involves kernel calls for doing buffer moves. The "buffer move" constraint/bottleneck was highlighted in the gigabyte router talk by Partridge at an '89 IETF meeting (as well as RAID/striping papers).
Compared to a typical Unix system, most mainframe systems tend to be configured with a significantly larger number of disk arms/drives (especially when calculated in terms of arms/MIP). The latency/performance advantage that mainframes had with expensive fixed-head disks is now pretty much a level playing field with the use of large electronic stores(/caches).
Other work done during the '80s concentrated on optimizing the file I/O opportunity:
* being able to profile various data clusters in terms of accesses/second/mbyte and attempting to load-balance the clusters across the available disk arms (assumes some sort of uniform access patterns, frequently tho access patterns are bursty rather than uniform)
* file organization for large block transfer (taking into account that transfer rate has increased much more significantly than access rate ... also more applicable to bursty access patterns).
* more caching of high-use data (tends to minimize the advantage of data cluster load balancing).
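As an illustration of the first item, here is a minimal sketch of spreading data clusters across disk arms by measured access rate. It uses a simple largest-first greedy heuristic, which is an assumption made for this example and not necessarily what the '80s work used; `balance_clusters` and the cluster names are hypothetical, and the rates are plain accesses/second rather than accesses/second/mbyte.

```python
import heapq


def balance_clusters(cluster_rates, n_arms):
    """Greedy largest-first assignment of data clusters to disk arms.

    cluster_rates: dict mapping cluster name -> measured accesses/second.
    Returns a list, indexed by arm, of (total_rate, [cluster names]).
    """
    # Min-heap keyed on the load already placed on each arm; the arm
    # index breaks ties so the name lists are never compared.
    heap = [(0.0, arm, []) for arm in range(n_arms)]
    heapq.heapify(heap)
    # Place the busiest clusters first, each on the least-loaded arm.
    for name, rate in sorted(cluster_rates.items(), key=lambda kv: -kv[1]):
        load, arm, names = heapq.heappop(heap)
        names.append(name)
        heapq.heappush(heap, (load + rate, arm, names))
    by_arm = sorted(heap, key=lambda entry: entry[1])
    return [(load, names) for load, _arm, names in by_arm]


if __name__ == "__main__":
    # Made-up profile: accesses/second for six clusters, two arms.
    profile = {"a": 30, "b": 20, "c": 20, "d": 10, "e": 10, "f": 10}
    for arm, (load, names) in enumerate(balance_clusters(profile, 2)):
        print(f"arm {arm}: {load:.0f} accesses/sec from {names}")
```

A static assignment like this only helps when the profiled rates are reasonably steady, which is the uniform-access caveat in the item itself; with bursty access, large block transfers and caching tend to matter more.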
Lynn Wheeler | email@example.com, firstname.lastname@example.org
From: email@example.com (Lynn Wheeler) Newsgroups: comp.unix.large,comp.arch.storage Subject: Re: Big I/O or Kicking the Mainframe out the Door Date: Wed, 29 Dec 1993 20:11:30 GMT
... oops finger check ... '89 "gigabyte router" reference should have been "gigabit router" ... if i remember correctly, it assumed 50mip processor, smart outboard I/O controllers, at least three one gigabit links, no buffer copies, 128(?) total instruction pathlength per packet (in & out), bimodal distribution of packet sizes (64 & 1500) with an avg. packet size around 512 (250k packets/sec @128 .. 32m instructions/sec).
note that instruction based buffer copies tend to have a very adverse effect on processor thruput ... since both the "input" buffer and "output" buffer locations tend to all be cache misses ... as well as "flushing" useful entries out of the cache.
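The router budget can be checked with quick arithmetic. A sketch using only the figures recalled in the post (the variable names are invented here, and the two-size packet mix is derived on the assumption that only 64- and 1500-byte packets occur):

```python
PACKETS_PER_SEC = 250_000
INSTR_PER_PACKET = 128      # total pathlength per packet, in and out
CPU_MIPS = 50
AVG_PACKET_BYTES = 512
SMALL_BYTES, LARGE_BYTES = 64, 1500

# Instruction budget consumed by packet handling.
instr_per_sec = PACKETS_PER_SEC * INSTR_PER_PACKET
cpu_fraction = instr_per_sec / (CPU_MIPS * 1_000_000)

# Aggregate data rate implied by the average packet size.
bits_per_sec = PACKETS_PER_SEC * AVG_PACKET_BYTES * 8

# Share of 1500-byte packets that yields a 512-byte average when only
# the two sizes occur (assumption for this sketch).
large_fraction = (AVG_PACKET_BYTES - SMALL_BYTES) / (LARGE_BYTES - SMALL_BYTES)

print(f"{instr_per_sec / 1e6:.0f}m instructions/sec "
      f"({cpu_fraction:.0%} of a {CPU_MIPS}mip processor)")
print(f"{bits_per_sec / 1e9:.3f} gigabit/sec aggregate")
print(f"{large_fraction:.1%} of packets at {LARGE_BYTES} bytes")
```

At 128 instructions per packet there is little headroom on a 50mip processor, which is why the design depends on having no instruction-based buffer copies.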
Lynn Wheeler | firstname.lastname@example.org, email@example.com