and the model-dependent data is stored in the extended logout area. The
machine check handler uses these fields to analyze the error, format an
error record, and write the record out on the error recording cylinder
of SYSRES.

If the machine fails to recover from the malfunction through its own
recovery facilities, the machine check handler is notified by a machine
check interruption. An interruption code, noting that the recovery
attempt was unsuccessful, is inserted in the fixed logout area. The
machine check handler then analyzes the data and attempts to keep the
system as fully operational as possible.

Recovery from machine malfunctions can be divided into the following
categories: functional recovery, system recovery, operator-initiated
restart, and system repair. These levels of error recovery are discussed
in their order of acceptability, functional recovery being most
acceptable and system repair being least acceptable:

FUNCTIONAL RECOVERY: Functional recovery is recovery from a machine
check without adverse effect on the system or the interrupted user.
This type of recovery can be made by processor retry, the ECC facility,
or the machine check handler. Processor retry and ECC error correcting
facilities are discussed separately in this section because they are
significant in the total error recovery scheme. Functional recovery by
MCH is made by correcting storage protect feature (SPF) keys and
intermittent errors in real storage.

SYSTEM RECOVERY: System recovery is attempted when functional recovery
is impossible. System recovery is the continuation of system operations
at the expense of the interrupted user, whose virtual machine operation
is terminated. System recovery can only take place if the user in
question is not critical to continued system operation. An error in a
system routine that is considered to be critical to system operation
precludes functional recovery and would require logout and a system dump
followed by reloading the system.

OPERATOR-INITIATED RESTART: When the errors may have caused a loss of
supervisor or system integrity, the system is put into a disabled wait
state. The operator is instructed to run the standalone error recovery
(SEREP) program and then manually restart the system.

SYSTEM REPAIR: System repair is recovery that requires the services of
maintenance personnel and takes place at the discretion of the operator.
Usually, the operator has tried to recover by system-supported restart
one or more times with no success.

SYSTEM/370 RECOVERY FEATURES

The operation of the Machine Check Handler depends on certain automatic
recovery actions taken by the hardware and on logout information given
to it by the hardware.
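
The escalation through the four recovery categories described earlier in
this section can be pictured as a small decision routine. The Python
sketch below is illustrative only. The function name, its parameters,
and the rule that a single failed restart leads to system repair are
assumptions made for the example. They are not taken from DMKMCH.

```python
from enum import IntEnum


class RecoveryLevel(IntEnum):
    """Recovery categories, ordered from most to least acceptable."""
    FUNCTIONAL = 1         # no adverse effect on the system or the user
    SYSTEM = 2             # the interrupted user's virtual machine is terminated
    OPERATOR_RESTART = 3   # disabled wait state, operator restarts the system
    SYSTEM_REPAIR = 4      # maintenance personnel are required


def choose_recovery_level(corrected, critical_to_system,
                          integrity_in_doubt, failed_restarts):
    """Pick the most acceptable recovery level that still applies.

    All four parameters are hypothetical inputs for this sketch.
    """
    if corrected:
        # Processor retry, ECC, or MCH repaired the error.
        return RecoveryLevel.FUNCTIONAL
    if not critical_to_system and not integrity_in_doubt:
        # Continue at the expense of the interrupted user.
        return RecoveryLevel.SYSTEM
    if failed_restarts == 0:
        return RecoveryLevel.OPERATOR_RESTART
    # Assumption: repair follows one or more unsuccessful restarts.
    return RecoveryLevel.SYSTEM_REPAIR
```

Because the enumeration is ordered, comparing two levels with `<` tells
which outcome is the more acceptable one.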

Processor errors are automatically retried by microprogram routines.
These routines save source data before it is altered by the operation.
When the error is detected, a microprogram returns the processor to the
beginning of the operation, or to a point where the operation was
executing correctly, and the operation is repeated. After several
unsuccessful retries, the error is considered permanent.
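
The retry sequence just described can be sketched in software terms.
This is only a loose analogy, since the real retry is performed by
microprogram routines. The retry limit, the exception classes, and the
function name below are hypothetical. The text says only that the error
is considered permanent after "several" unsuccessful retries.

```python
import copy

MAX_RETRIES = 8  # assumed value; the text says only "several"


class TransientFault(Exception):
    """Stands in for a detected processor error (hypothetical)."""


class PermanentError(Exception):
    """Raised when every retry has failed (hypothetical)."""


def run_with_retry(operation, operands, max_retries=MAX_RETRIES):
    """Run operation(operands), retrying from saved source data on a fault."""
    # Save the source data before the operation can alter it.
    saved = copy.deepcopy(operands)
    for _ in range(1 + max_retries):  # first attempt plus the retries
        # Return to the beginning of the operation with unaltered data.
        working = copy.deepcopy(saved)
        try:
            return operation(working)
        except TransientFault:
            continue
    raise PermanentError(f"operation failed after {max_retries} retries")
```

Each attempt works on a fresh copy of the saved operands, so a fault
that strikes after the data has been partly modified does not
contaminate the next attempt.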
ECC checks the validity of data from real and control storage,
automatically correcting single-bit errors. It also detects multiple-bit
errors but does not correct them. Data enters and leaves storage through
a storage adapter unit. This unit checks each doubleword for correct
parity in each byte. If a single-bit error is detected, it is corrected.
The corrected doubleword is then sent back into real or control storage
and on to the processor. When a multiple-bit error is detected, a
machine check interruption occurs, and the error location is placed in
the fixed logout area. MCH gains control and attempts to recover from the error.
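
The behavior described above, correcting single-bit errors while only
detecting multiple-bit errors, is characteristic of a single-error-
correcting, double-error-detecting (SEC-DED) code. The Python sketch
below shows the idea with a toy extended Hamming code over one byte. The
code actually used by the storage adapter unit is not described here, so
the word width, the bit layout, and all names are assumptions chosen for
illustration.

```python
PARITY_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = tuple(p for p in range(1, 13) if p not in PARITY_POSITIONS)


class MultipleBitError(Exception):
    """Detected but uncorrectable error (hypothetical name)."""


def encode(byte):
    """Return a 13-bit codeword as a list; index 0 is the overall parity bit."""
    word = [0] * 13
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (byte >> i) & 1
    for p in PARITY_POSITIONS:
        # Each check bit gives even parity over the positions it covers.
        word[p] = sum(word[i] for i in range(1, 13) if i & p) % 2
    word[0] = sum(word) % 2  # overall even parity across all 13 bits
    return word


def decode(word):
    """Return (byte, status); raise MultipleBitError if uncorrectable."""
    word = list(word)
    syndrome = 0
    for p in PARITY_POSITIONS:
        if sum(word[i] for i in range(1, 13) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(word) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok and syndrome < len(word):
        # Odd number of flipped bits: treat as a single-bit error.
        # A syndrome of 0 means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flipped bits, or an impossible syndrome.
        raise MultipleBitError("multiple-bit error detected")
    byte = sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return byte, status
```

In this analogy, the "corrected" result corresponds to the corrected
doubleword being sent on to the processor, and `MultipleBitError`
corresponds to the machine check interruption that gives MCH control.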

Two control registers are used by MCH for loading and storing control
information (see Figure 21). Control register 14 contains mask bits
which specify whether certain conditions can cause machine check
interruptions and mask bits which control conditions under which an
extended logout can occur. Control register 15 contains the address of
the extended logout area.

Word  Bits   Name of Field                     Associated with
----  -----  --------------------------------  -----------------
 14    0     Check-stop control                Mch-Chk handling
 14    1     Synchronous MCEL control          Mch-Chk handling
 14    2     I/O extended logout control       Chan-Chk handling
 14    4     Recovery report mask              Mch-Chk handling
 14    5     Degradation report mask           Mch-Chk handling
 14    6     External damage report mask       Mch-Chk handling
 14    7     Warning mask                      Mch-Chk handling
 14    8     Asynchronous MCEL control         Mch-Chk handling
 14    9     Asynchronous fixed log control    Mch-Chk handling
 15    8-28  MCEL address                      Mch-Chk handling

Figure 21. RMS Control Register Assignments

The VM/370 Machine Check Handler module (DMKMCH) consists of the following
functions:

• Initial analysis subroutine
• Main storage analysis subroutine
• SPF analysis subroutine
• Recovery facility mode switching
• Operator communication subroutine
• Virtual user termination subroutine
• Soft recording subroutine
• Buffer error subroutine
• Termination subroutine
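
The control register assignments in Figure 21 can be decoded with a few
lines of Python. This is a minimal sketch. It assumes the System/370
convention that bit 0 is the leftmost (most significant) bit of a 32-bit
register, and that the MCEL address is simply bits 8-28 of control
register 15 left in place. The helper names are hypothetical.

```python
# Bit positions in control register 14, taken from Figure 21.
CR14_FIELDS = {
    0: "Check-stop control",
    1: "Synchronous MCEL control",
    2: "I/O extended logout control",
    4: "Recovery report mask",
    5: "Degradation report mask",
    6: "External damage report mask",
    7: "Warning mask",
    8: "Asynchronous MCEL control",
    9: "Asynchronous fixed log control",
}

# Bits 8-28 inclusive are 21 bits; bits 29-31 are the three low-order bits.
MCEL_ADDRESS_MASK = ((1 << 21) - 1) << 3  # 0x00FFFFF8


def bit(word, n):
    """Return bit n of a 32-bit word, counting bit 0 as the leftmost bit."""
    return (word >> (31 - n)) & 1


def decode_cr14(value):
    """Map each field name from Figure 21 to True (bit on) or False."""
    return {name: bool(bit(value, n)) for n, name in CR14_FIELDS.items()}


def mcel_address(cr15):
    """Extract the extended logout area address from control register 15."""
    return cr15 & MCEL_ADDRESS_MASK
```

A caller could, for example, test `decode_cr14(cr14)["Warning mask"]`
to see whether warning conditions are enabled to cause a machine check
interruption.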
1-152 IBM VM/370 System Logic and Problem Determination--Volume 1