GA22-7000-4 IBM System/370 Principles of Operation Sept 1975 page 171

Machine-Check Handling
Contents
Machine-Check Detection
Recovery Mechanisms
Redundancy Correction
CPU Retry
Unit Deletion
Handling of Machine Checks
Handling of Invalid CBC in Storage
Programmed Validation of Storage
Handling of Invalid CBC in Keys in Storage
Handling of Invalid CBC in Registers
Validation of Registers
Check-Stop State
Machine-Check Interruption Conditions
Repressible Conditions
Exigent Conditions
Machine-Check Interruption
Interruption Action
Point of Interruption
Machine-Check Logout
Machine-Check Extended Interruption Information
Machine-Check Interruption Code
Subclass
Time of Interruption Occurrence
Storage Error Type
Machine-Check Interruption Code Validity Bits
Machine-Check Extended Logout Length
Machine-Check Control Registers
Control Register 14
Check-Stop Control
Logout Controls
Machine-Check Subclass Masks
Control Register 15
Summary of Machine-Check Masking
The System/370 machine-check handling mechanism
provides extensive machine-malfunction detection to
ensure the integrity of system operation, automatic
recovery from some malfunctions, and reporting by
means of a machine-check interruption to assist in
maintenance and repair and in program
damage-assessment and recovery.
Machine-Check Detection
Machine-check detection mechanisms may take
many forms, especially in control functions for
arithmetic and logical processing, addressing,
sequencing, and execution. For program-addressable
information, detection is normally accomplished by
encoding redundancy into the information in such a
manner that most failures in the retention or
transmission of the information will result in an
invalid code. The encoding normally takes the form of
one or more redundancy bits appended to a group of
information bits. These redundancy bits are referred
to as "check bits." The group of data bits and the
associated check bits are called the "checking block."
The inclusion of a single check bit in the checking
block allows the detection of any single-bit failure
within the checking block. In this arrangement, the
check bit is sometimes referred to as a "parity bit."
In other arrangements, a group of check bits is
included, increasing the checking power and, in some
cases, providing sufficient redundancy to permit
both detection and correction.
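
The single-check-bit arrangement described above can be sketched in a few lines. This is an illustration only: the 8-bit data width, the odd-parity convention, and the function names are assumptions made for the example, not details taken from the text.

```python
def parity_bit(data):
    """Check bit that makes the total number of 1 bits odd (assumed convention)."""
    return 1 - (bin(data & 0xFF).count("1") % 2)

def make_block(data):
    """Build a checking block: 8 data bits plus one check bit."""
    return (data & 0xFF, parity_bit(data))

def is_valid(block):
    """A block is valid when the stored check bit matches the data."""
    data, check = block
    return parity_bit(data) == check

block = make_block(0b10110010)
one_bit_failure = (block[0] ^ 0b00001000, block[1])   # one data bit flipped
two_bit_failure = (block[0] ^ 0b00011000, block[1])   # two data bits flipped
```

Here `is_valid(one_bit_failure)` is false, while `is_valid(two_bit_failure)` is true: a single check bit detects any single-bit failure but can miss a double-bit failure, which is why some arrangements use a group of check bits.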
For checking purposes, the entire content of a
checking block, including the redundancy, is called a
"checking block code" (CBC). When a CBC completely
meets the checking requirements (that is, no
failure is detected), it is said to be valid. When both
detection and correction are provided and a CBC is
not valid but satisfies the checking requirements for
correction (the failure is correctable), it is said to be
near-valid. When a CBC does not satisfy the checking
requirements (the failure is uncorrectable), it is
said to be invalid.
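
The three CBC states can be illustrated with a small single-error-correcting, double-error-detecting code. The sketch below uses an extended Hamming code over 4 data bits; the code choice, block size, and function names are assumptions for illustration and do not describe the checking scheme of any particular model.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit checking block (index 0 = overall parity)."""
    bits = [0] * 8
    for pos, shift in zip((3, 5, 6, 7), range(4)):
        bits[pos] = (nibble >> shift) & 1
    for p in (1, 2, 4):
        # Each check bit gives even parity over the positions whose index contains p.
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p and i != p) % 2
    bits[0] = sum(bits[1:]) % 2
    return bits

def syndrome(bits):
    """Nonzero when at least one Hamming parity group fails."""
    s = 0
    for p in (1, 2, 4):
        if sum(bits[i] for i in range(1, 8) if i & p) % 2:
            s |= p
    return s

def classify(bits):
    """Classify a checking block as 'valid', 'near-valid', or 'invalid'."""
    overall_ok = sum(bits) % 2 == 0
    if syndrome(bits) == 0 and overall_ok:
        return "valid"
    if not overall_ok:
        return "near-valid"   # single-bit failure: correctable
    return "invalid"          # double-bit failure: detected, not correctable
```

With this code, flipping any one bit of `encode(0b1010)` yields "near-valid" and flipping any two bits yields "invalid". Failures of three or more bits are outside what the code guarantees and may be misclassified.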

Recovery Mechanisms
Three mechanisms may be used to provide recovery
from machine-detected malfunctions: redundancy
correction, retry, and unit deletion.
Redundancy Correction
When sufficient redundancy is included in circuitry,
or in a checking block, failures can be corrected. For
example, circuitry can be triplicated, with a voting
circuit to take two out of three, thus correcting a
single failure. An arrangement for redundancy
correction of failures of one order and for detection of
failures of a higher order is called error checking and
correction (ECC). Normally, ECC allows detection
of double-bit failures and correction of single-bit
failures.
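
The triplicated-circuit example can be modeled as a bitwise two-out-of-three vote. This is a software analogy only; the function name and the 4-bit values are hypothetical.

```python
def vote(a, b, c):
    """Bitwise two-out-of-three majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)

good = 0b1100
failed_copy = 0b0100                     # one copy has a failed bit
corrected = vote(good, good, failed_copy)
```

Because each result bit follows the majority, a failure confined to one of the three copies does not reach the output, so `corrected` equals `good`.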
CPU Retry
In models with CPU-retry capability, information
about the state of the machine is saved periodically.
The point in the processing to which this saving of
information pertains is referred to as a "hardware
checkpoint." When a malfunction is detected,
recovery is attempted by returning the machine state
to that existing at the latest hardware checkpoint and
proceeding from that point. The interval between
checkpoints is model-dependent. In some cases,
several checkpoints are established within a single
instruction; in others, checkpoints are established
only at the beginning of instructions, or even less
frequently.
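
The checkpoint-and-retry idea can be sketched in software. Everything here is a hypothetical model: the step functions, the `checkpoint_every` interval, and the `max_retries` limit are assumptions for illustration, since the text says only that the checkpoint interval is model-dependent.

```python
import copy

class TransientFault(Exception):
    """Stands in for a machine-detected malfunction."""

def run_with_retry(steps, state, checkpoint_every=2, max_retries=3):
    """Run steps in order, returning to the latest checkpoint after a fault."""
    checkpoint = (0, copy.deepcopy(state))        # (next step index, saved state)
    i = 0
    retries = 0
    while i < len(steps):
        if i % checkpoint_every == 0 and i != checkpoint[0]:
            checkpoint = (i, copy.deepcopy(state))  # establish a new checkpoint
            retries = 0
        try:
            steps[i](state)
            i += 1
        except TransientFault:
            retries += 1
            if retries > max_retries:
                raise                              # recovery was unsuccessful
            i, saved = checkpoint
            state.clear()
            state.update(copy.deepcopy(saved))     # restore the saved state
    return state

def increment(state):
    state["acc"] += 1

def double(state):
    state["acc"] *= 2

def make_flaky(failures):
    """Build a step that corrupts the state and faults `failures` times."""
    remaining = [failures]
    def step(state):
        state["acc"] += 10
        if remaining[0] > 0:
            remaining[0] -= 1
            state["acc"] = -1                      # garbage left by the malfunction
            raise TransientFault()
    return step

result = run_with_retry([increment, make_flaky(1), double], {"acc": 0})
```

After the one transient fault, the run resumes from the checkpoint and `result["acc"]` is 22, the same value a fault-free run produces. A step that keeps failing exhausts `max_retries` and the fault propagates to the caller.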
Unit Deletion
In some models, malfunctions in certain transparent
units of the system can be circumvented by
discontinuing the use of the unit. Examples of cases where
transparent-unit deletion may be used include the
disabling of all or a portion of a cache or of a
translation-lookaside buffer (TLB).
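
A rough software analogy for transparent-unit deletion is a cache that is switched off once it reports a failure, with reads then served by the backing store. The class, its methods, and the per-line "ok" flag are hypothetical and only illustrate that results stay the same after deletion.

```python
class DeletableCache:
    """Read-through cache that is disabled as a whole when a line fails."""

    def __init__(self, backing):
        self.backing = backing      # maps address to value (the real storage)
        self.lines = {}             # maps address to (value, ok flag)
        self.enabled = True

    def corrupt(self, addr):
        """Simulate a malfunction in one cached line."""
        if addr in self.lines:
            self.lines[addr] = (None, False)

    def read(self, addr):
        if self.enabled and addr in self.lines:
            value, ok = self.lines[addr]
            if ok:
                return value
            self.enabled = False    # delete the unit: stop using the cache
            self.lines.clear()
        value = self.backing[addr]
        if self.enabled:
            self.lines[addr] = (value, True)
        return value

cache = DeletableCache({0x10: 7})
first = cache.read(0x10)            # fills the cache
cache.corrupt(0x10)
second = cache.read(0x10)           # failure detected, cache deleted
```

Both reads return 7. The only visible effect of the deletion is that later reads go to the backing store, which is what makes the unit "transparent" in this sketch.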
Handling of Machine Checks
A machine check can be caused only by a machine
malfunction and never by data or instructions. This
is ensured during the power-on sequence by
initializing the machine controls to a valid state and by
placing valid CBC in the programmable registers, in the
keys in storage, and, if it is volatile, also in main
storage.
Specification of an unavailable system component,
such as a storage unit, channel, or I/O device,
does not cause a machine-check indication. Instead,
such a condition is indicated by the appropriate
program or I/O interruption or code setting.
A machine check is indicated whenever the result
of an operation could be affected by information
with invalid CBC, or when any other malfunction
makes it impossible to establish reliably that an
operation can be, or has been, performed correctly.
When information with invalid CBC is fetched
but is not used, the condition may or may not be
indicated. In order to guarantee system integrity,
however, CBC is preserved as invalid unless the
contents of the entire checking block are replaced in
the operation.
Depending on the model, and on the type of
malfunction, a malfunction detected during an I/O
operation may cause a machine-check interruption
condition, may result in an I/O-error condition, or
both. I/O-error conditions are indicated by an I/O
interruption or by the appropriate condition code
setting during the execution of an I/O instruction.
When a CCW or data with invalid CBC is fetched
from storage but is not used in an I/O operation, it
depends on the model whether the condition is
reported.
When a machine malfunction is detected, the
action taken depends upon the nature of the
malfunction and the situation in which it occurs. In some
cases an automatic hardware recovery mechanism
may be invoked. When the recovery attempt is
unsuccessful, or if a recovery mechanism does not
exist, a damage condition is said to exist.
Machine-check conditions may be reported as machine-check
interruptions or I/O interruptions, or they may cause
the CPU to enter the check-stop state.
Handling of Invalid CBC in Storage
When a checking block contains an invalid CBC and
an attempt is made to store into the block without
replacing the entire block, the data in the block
(including the check bits) is regenerated by the
storage unit, and no new data is entered into the block.
Normally the contents of the block can only be
changed by presenting an entire block of data to be
entered on one storage cycle.
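
The storing rule above can be modeled as follows. This is a sketch under stated assumptions: the 8-byte checking block, the `valid` flag standing in for the check bits, and the `Storage` class with its methods are all hypothetical.

```python
BLOCK_SIZE = 8   # assumed checking-block size in bytes; the real size is model-dependent

class Storage:
    """Toy main storage that tracks CBC validity for each checking block."""

    def __init__(self, n_blocks):
        self.data = [bytes(BLOCK_SIZE) for _ in range(n_blocks)]
        self.valid = [True] * n_blocks   # stands in for the check bits

    def damage(self, block):
        """Simulate a failure that leaves invalid CBC in a block."""
        self.valid[block] = False

    def store(self, block, offset, new):
        """Store bytes into a block; return True if the data was entered."""
        if offset < 0 or offset + len(new) > BLOCK_SIZE:
            raise ValueError("store crosses the checking-block boundary")
        if offset == 0 and len(new) == BLOCK_SIZE:
            self.data[block] = bytes(new)    # whole block replaced
            self.valid[block] = True         # CBC becomes valid again
            return True
        if not self.valid[block]:
            return False                     # block regenerated unchanged, still invalid
        old = self.data[block]
        self.data[block] = old[:offset] + bytes(new) + old[offset + len(new):]
        return True

s = Storage(1)
s.damage(0)
partial_entered = s.store(0, 2, b"ab")       # partial store into an invalid block
full_entered = s.store(0, 0, b"12345678")    # whole block presented at once
```

In this model the partial store is refused and the block stays invalid, while the full-block store both enters the data and makes the block valid, after which partial stores work normally again.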
The size of the main-storage checking block
depends on the model. When the main-storage checking
block consists of multiple bytes and contains an
invalid CBC, special procedures are necessary to
restore or place new information into the block. The
restoring of a valid CBC in a storage location is
called storage validation. Validation of storage is
provided as a program function and is also provided
with the system clear-manual operation.
A checking block with invalid CBC is never validated
under programming control unless the entire
contents of the checking block are replaced. Even
when an instruction, or an I/O input operation,
