The initial analysis subroutine of DMKMCH receives control by a machine check interruption. To minimize the possibility of losing logout information by recursive machine check interruptions, the machine check new PSW gives control to DMKMCH with the system disabled for further interruptions. There is always a danger that a machine malfunction may occur immediately after DMKMCH is entered and the system is disabled for interruption. Disabling all interruptions is only a temporary measure to give the initial analysis subroutine time to make the following emergency provisions:

  It disables for soft machine check interruptions. Soft recording is not enabled until the error is recorded.
  It saves the contents of the fixed and extended logout areas in the machine check record.
  It alters the machine check new PSW to point to the term subroutine. The term subroutine handles second machine check errors.
  It enables the machine for hard machine check interruption.

If a virtual user was running when the interruption occurred, the running status (GPRs, FPRs, PSW, M.C. old PSW, CRs, etc.) is saved in the user's VMBLOK.

The subroutine initially examines the machine check data for the following error types:

  MCIC=ZERO
  PSW invalid
  System damage
  Timing facilities damage
  Channel inoperative on 3031/3032/3033 processor

The occurrence of any of these errors is considered uncorrectable by DMKMCH; the primary system operator is informed, the error is formatted and recorded, and the system enters a wait state, code 001 or

If the instruction processing damage bit is on, the subroutine tests for the following types of malfunctions:

  Multiple-Bit Error in Main Storage --Control is given to the main storage analysis subroutine.
  SPF Key Error --Control is given to the SPF analysis subroutine.
  Retry failed --If the processor was in supervisor state, the error is considered uncorrectable and the VM/370 system is terminated. If the processor was in problem state, the virtual machine is reset or terminated and the system continues operation.

If processor retry or ECC was successful on a soft error, control is given to the soft recording subroutine to format the record, write it out on the error recording cylinder, and update the count of soft error occurrences.

If external damage was reported, control is given to the soft recording subroutine to format the record and write it out on the error recording cylinder.
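The triage performed by the initial analysis subroutine can be pictured as a small decision function. The following Python sketch is illustrative only: DMKMCH is not written in Python, and the names `MachineCheck`, `Action`, and `triage` are hypothetical, as is the exact order in which the conditions are tested. The flags simply mirror the conditions named in the text above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    """Hypothetical labels for the outcomes described in the text."""
    WAIT_STATE = auto()             # uncorrectable: record, inform operator, wait
    MAIN_STORAGE_ANALYSIS = auto()  # multiple-bit error in main storage
    SPF_ANALYSIS = auto()           # SPF key error
    TERMINATE_SYSTEM = auto()       # retry failed in supervisor state
    RESET_VIRTUAL_MACHINE = auto()  # retry failed in problem state
    SOFT_RECORDING = auto()         # recovered soft error or external damage
    UNCLASSIFIED = auto()           # not covered by the text (assumption)


@dataclass
class MachineCheck:
    """Illustrative flags; the real machine check data is a bit-encoded code."""
    mcic_zero: bool = False
    psw_invalid: bool = False
    system_damage: bool = False
    timing_damage: bool = False
    channel_inoperative: bool = False
    instruction_processing_damage: bool = False
    multiple_bit_storage_error: bool = False
    spf_key_error: bool = False
    retry_failed: bool = False
    supervisor_state: bool = False
    soft_error_recovered: bool = False
    external_damage: bool = False


def triage(mc: MachineCheck) -> Action:
    # Error types that are treated as uncorrectable.
    if any((mc.mcic_zero, mc.psw_invalid, mc.system_damage,
            mc.timing_damage, mc.channel_inoperative)):
        return Action.WAIT_STATE

    # Instruction processing damage is broken down further.
    if mc.instruction_processing_damage:
        if mc.multiple_bit_storage_error:
            return Action.MAIN_STORAGE_ANALYSIS
        if mc.spf_key_error:
            return Action.SPF_ANALYSIS
        if mc.retry_failed:
            if mc.supervisor_state:
                return Action.TERMINATE_SYSTEM
            return Action.RESET_VIRTUAL_MACHINE

    # Recovered soft errors and external damage are only recorded.
    if mc.soft_error_recovered or mc.external_damage:
        return Action.SOFT_RECORDING

    return Action.UNCLASSIFIED


if __name__ == "__main__":
    print(triage(MachineCheck(instruction_processing_damage=True,
                              retry_failed=True)))
```

In this sketch a failed retry in problem state affects only the virtual machine, which matches the text's statement that the system continues operation in that case.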
The main storage analysis subroutine is given control when the machine check interruption was caused by a multiple-bit storage error. An
initial function points the machine check new PSW to an internal
subroutine to indicate a solid machine check, in case a machine check
interruption occurs while exercising main storage.
Damaged storage areas associated with any portion of the CP nucleus itself cannot be refreshed; multiple-bit storage errors in CP cause the VM/370 system to be terminated. An automatic restart reinitializes VM/370.

If the damage is not in the CP nucleus, main storage is exercised to determine if the failure is solid or intermittent. Multiple-bit ECC storage errors on a 3031, 3032, or 3033 processor are always treated as solid errors. If the failure is solid, the 4K page frame is marked unavailable for use by the system. If the failure is intermittent, the page frame is marked invalid.

The change bits associated with the damaged page frame are checked to determine if the page had been altered by the virtual machine. If no alteration had occurred, VM/370 assigns a new page frame to the virtual machine and a backup copy of the page is brought into storage the next time the page is referenced. If the page had been altered, VM/370 resets or terminates the virtual machine, clears its virtual storage, and sends an appropriate message to the user. Normal system operation continues for all other users.
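The decisions made by the main storage analysis subroutine can be summarized in a short sketch. This is an illustration in Python under stated assumptions, not the actual routine: the function name `analyze_storage_error`, its parameters, and the action strings are all hypothetical, and the result of exercising main storage is passed in as a flag rather than computed.

```python
def analyze_storage_error(in_cp_nucleus: bool,
                          exercise_found_solid: bool,
                          page_changed: bool,
                          is_303x_processor: bool = False) -> list[str]:
    """Return the actions the text describes for a multiple-bit storage error.

    All parameter names are hypothetical stand-ins for state that the
    real subroutine obtains from the hardware and from CP control blocks.
    """
    # Damage in the CP nucleus cannot be refreshed.
    if in_cp_nucleus:
        return ["terminate VM/370", "automatic restart"]

    actions = []

    # On a 3031, 3032, or 3033 the error is always treated as solid.
    solid = is_303x_processor or exercise_found_solid
    if solid:
        actions.append("mark 4K page frame unavailable")
    else:
        actions.append("mark page frame invalid")

    # The change bits decide whether a backup copy is still good.
    if page_changed:
        actions.append("reset or terminate virtual machine")
        actions.append("clear virtual storage")
        actions.append("send message to user")
    else:
        actions.append("assign new page frame")
        actions.append("page in backup copy on next reference")

    return actions


if __name__ == "__main__":
    for step in analyze_storage_error(False, False, False):
        print(step)
```

The sketch makes one point of the text explicit: whether the failing frame is solid or intermittent affects only how the frame is marked, while the fate of the virtual machine depends on the change bits.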
The SPF analysis subroutine is given control when the machine check interruption was caused by an SPF error. An initial function points the machine check new PSW to an internal subroutine, in case a machine check interruption occurs during testing and validation. The SPF analysis routine then determines if the error was associated with a failure in virtual machine storage or in the storage associated with the control program.
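The solid-versus-intermittent test that the SPF analysis subroutine applies, detailed in the paragraphs that follow, exercises all 16 keys in the failing 2K page frame 5 times each. A minimal Python sketch of that loop is given here. It is an illustration only: `exercise_keys` and the `set_and_check_key` callback are hypothetical names, the callback stands in for the hardware operation of setting a key and observing whether an SPF machine check results, and the loop order (passes outside, keys inside) is an assumption.

```python
from typing import Callable

KEY_COUNT = 16  # storage protect keys 0 through 15
PASSES = 5      # each key is exercised five times


def exercise_keys(set_and_check_key: Callable[[int], bool],
                  passes: int = PASSES) -> str:
    """Classify an SPF failure as 'solid' or 'intermittent'.

    set_and_check_key(key) is a hypothetical callback that sets the
    given key in the failing 2K frame and returns True if no SPF
    machine check occurred.
    """
    for _ in range(passes):
        for key in range(KEY_COUNT):
            if not set_and_check_key(key):
                # A machine check during the exercise means a solid error.
                return "solid"
    # Every trial completed without a machine check.
    return "intermittent"


if __name__ == "__main__":
    # Simulated frame that never fails during the exercise.
    print(exercise_keys(lambda key: True))
    # Simulated frame that fails whenever key 7 is set.
    print(exercise_keys(lambda key: key != 7))
```

What happens after classification differs for control program storage and virtual machine storage, as the text goes on to describe; the sketch covers only the classification step.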
An SPF error associated with VM/370 is a potentially catastrophic failure. Because VM/370 always runs with a PSW key of zero, the SPF key in main storage is not checked for an out-of-parity condition. The SPF analysis subroutine exercises all 16 keys in the failing storage 2K page frame. If an SPF machine check occurs in exercising the 16 keys 5 times each, the error is considered solid and the operating system is terminated with a system shutdown. If an SPF machine check does not occur, the machine check is considered intermittent. The zero key is restored to the failing 2K page frame and this is transparent to the virtual machine.

If an SPF machine check occurs that is associated with a virtual machine, the SPF analysis subroutine exercises all 16 keys in the failing storage 2K page frame. If an SPF machine check does not occur, the machine check is intermittent and the SWPTABLE for the page associated with the failing storage address is located. The storage key for the failing 2K storage page frame is retrieved from the SWPTABLE and the change and reference bits are set on in the storage key. The storage key is then stored into the affected failing storage 2K page frame.

If an SPF machine check occurs in exercising the 16 keys 5 times each, then the machine check is considered solid and the following actions are taken. (1) The virtual machine is selectively reset or terminated by the virtual machine termination subroutine; (2) The 4K page frame associated with the failing address is removed as an
1-154 IBM VM/370 System Logic and Problem Determination--Volume 1