ECC checks the validity of data from real and control storage,
automatically correcting single-bit errors. It also detects multiple-bit errors but does not correct them. Data enters and leaves storage through
a storage adapter unit. This unit checks each doubleword for correct
parity in each byte. If a single-bit error is detected, it is corrected.
The corrected doubleword is then sent back into real or control storage
and on to the processor. When a multiple-bit error is detected, a
machine check interruption occurs, and the error location is placed in
the fixed logout area. MCH gains control and attempts to recover from the error.
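
The correct-one, detect-many behavior described above can be illustrated with a small single-error-correcting, double-error-detecting (SEC-DED) code. The sketch below is not the code used by the storage adapter unit, which this section does not specify; it is an extended Hamming code over 4 data bits, chosen only because it is small enough to follow. The names encode and decode are hypothetical.

```python
# Illustrative SEC-DED sketch (extended Hamming code over 4 data bits).
# Assumption: this is NOT the hardware's actual ECC code; it only shows
# how one flipped bit can be corrected and two flipped bits detected.

def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword, returned as a list of bits."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # check bit for positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # check bit for positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # check bit for positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

In this sketch the "corrected" result corresponds to the single-bit case, where the repaired data is passed on, and the "uncorrectable" result corresponds to the multiple-bit case that leads to a machine check interruption.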
Two control registers are used by MCH for loading and storing control
information (see Figure 21). Control register 14 contains mask bits
that specify whether certain conditions can cause machine check
interruptions, and mask bits that control the conditions under which an
extended logout can occur. Control register 15 contains the address of
the extended logout area.

+------+------+--------------------------------+-------------------+
| Word | Bits | Name of Field                  | Associated with   |
+------+------+--------------------------------+-------------------+
| 14   | 0    | Check-stop control             | Mch-Chk handling  |
| 14   | 1    | Synchronous MCEL control       | Mch-Chk handling  |
| 14   | 2    | I/O extended logout control    | Chan-Chk handling |
| 14   | 4    | Recovery report mask           | Mch-Chk handling  |
| 14   | 5    | Degradation report mask        | Mch-Chk handling  |
| 14   | 6    | External damage report mask    | Mch-Chk handling  |
| 14   | 7    | Warning mask                   | Mch-Chk handling  |
| 14   | 8    | Asynchronous MCEL control      | Mch-Chk handling  |
| 14   | 9    | Asynchronous fixed log control | Mch-Chk handling  |
| 15   | 8-28 | MCEL address                   | Mch-Chk handling  |
+------+------+--------------------------------+-------------------+

Figure 21. RMS Control Register Assignments

The VM/370 Machine Check Handler module (DMKMCH) consists of the following
functions:

• Initial analysis subroutine
• Main storage analysis subroutine
• SPF analysis subroutine
• Recovery facility mode switching
• Operator communication subroutine
• Virtual user termination subroutine
• Soft recording subroutine
• Buffer error subroutine
• Termination subroutine
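
The bit assignments in Figure 21 can be read off a register value with a few shifts and masks. The following is a minimal sketch, assuming the usual System/370 convention that bit 0 is the leftmost (most significant) bit of a 32-bit word, and assuming the MCEL address field is used in place as a doubleword-aligned address. The function names are hypothetical and are not DMKMCH entry points.

```python
# Sketch: decoding control registers 14 and 15 per Figure 21.
# Assumptions: 32-bit registers, bit 0 = most significant bit;
# function and table names are illustrative only.

CR14_FIELDS = {
    0: "Check-stop control",
    1: "Synchronous MCEL control",
    2: "I/O extended logout control",
    4: "Recovery report mask",
    5: "Degradation report mask",
    6: "External damage report mask",
    7: "Warning mask",
    8: "Asynchronous MCEL control",
    9: "Asynchronous fixed log control",
}

def bit_is_set(word, bit):
    """Test one bit of a 32-bit word, counting bit 0 from the left."""
    return (word >> (31 - bit)) & 1 == 1

def decode_cr14(value):
    """Return the names of the Figure 21 fields whose bits are on."""
    return [name for bit, name in sorted(CR14_FIELDS.items())
            if bit_is_set(value, bit)]

def mcel_address(cr15):
    """Extract bits 8-28 of control register 15 (mask 0x00FFFFF8)."""
    return cr15 & 0x00FFFFF8
```

For example, a control register 14 value of 0x0F000000 has bits 4 through 7 on, so decode_cr14 reports the four report and warning masks.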
1-152 IBM VM/370 System Logic and Problem Determination--Volume 1
The initial analysis subroutine of DMKMCH receives control by a machine
check interruption. To minimize the possibility of losing logout
information by recursive machine check interruptions, the machine check
new PSW gives control to DMKMCH with the system disabled for further
interruptions. There is always a danger that a machine malfunction may
occur immediately after DMKMCH is entered and the system is disabled for
interruption. Disabling all interruptions is only a temporary measure to
give the initial analysis subroutine time to make the following
emergency provisions:

• It disables for soft machine check interruptions. Soft recording is
  not enabled until the error is recorded.
• It saves the contents of the fixed and extended logout areas in the
  machine check record.
• It alters the machine check new PSW to point to the term subroutine.
  The term subroutine handles second machine check errors.
• It enables the machine for hard machine check interruption.
• If a virtual user was running when the interruption occurred, the
  running status (GPRs, FPRs, PSW, M.C. old PSW, CRs, etc.) is saved in
  the user's VMBLOK.
• It initially examines the machine check data for the following error
  types:
    - MCIC=ZERO
    - PSW invalid
    - System damage
    - Timing facilities damage
    - Channel inoperative on 3031/3032/3033 processor
  The occurrence of any of these errors is considered uncorrectable by
  DMKMCH; the primary system operator is informed, the error is
  formatted and recorded, and the system enters a wait state, code 001 or
• If the instruction processing damage bit is on, it tests for the
  following types of malfunctions:
    - Multiple-Bit Error in Main Storage -- Control is given to the main
      storage analysis subroutine.
    - SPF Key Error -- Control is given to the SPF analysis subroutine.
    - Retry failed -- If the processor was in supervisor state, the error
      is considered uncorrectable and the VM/370 system is terminated.
      If the processor was in problem state, the virtual machine is
      reset or terminated and the system continues operation.
• If processor retry or ECC was successful on a soft error, control is
  given to the soft recording subroutine to format the record, write it
  out on the error recording cylinder, and update the count of soft
  error occurrences.
• If external damage was reported, control is given to the soft
  recording subroutine to format the record and write it out on the
  error recording cylinder.
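
The order of tests made by the initial analysis subroutine can be summarized as a small decision function. This is a sketch only: the MachineCheck fields, the initial_analysis name, and the returned strings are hypothetical labels for the conditions and outcomes named in the text, not DMKMCH data areas or entry points. The final fall-through case is an assumption, since the text does not say what happens when none of the listed conditions applies.

```python
# Sketch of the initial analysis decision order described above.
# All field names and result strings are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class MachineCheck:
    mcic_zero: bool = False
    psw_invalid: bool = False
    system_damage: bool = False
    timing_facilities_damage: bool = False
    channel_inoperative: bool = False
    instruction_processing_damage: bool = False
    multiple_bit_storage_error: bool = False
    spf_key_error: bool = False
    retry_failed: bool = False
    supervisor_state: bool = False
    soft_error_recovered: bool = False   # processor retry or ECC succeeded
    external_damage: bool = False

def initial_analysis(mc):
    """Return a label for the action the text associates with this check."""
    # Errors DMKMCH treats as uncorrectable.
    if (mc.mcic_zero or mc.psw_invalid or mc.system_damage
            or mc.timing_facilities_damage or mc.channel_inoperative):
        return "inform operator, record error, enter wait state"
    if mc.instruction_processing_damage:
        if mc.multiple_bit_storage_error:
            return "main storage analysis subroutine"
        if mc.spf_key_error:
            return "SPF analysis subroutine"
        if mc.retry_failed:
            if mc.supervisor_state:
                return "terminate VM/370 system"
            return "reset or terminate virtual machine"
    if mc.soft_error_recovered or mc.external_damage:
        return "soft recording subroutine"
    return "not covered by this summary"  # assumption: text gives no default
```

Keeping the uncorrectable checks first mirrors the text, which lists them ahead of the instruction processing damage tests.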