Recovery Mechanisms
Three mechanisms may be used to provide recovery
from machine-detected malfunctions: redundancy
correction, retry, and unit deletion.
Redundancy Correction
When sufficient redundancy is included in circuitry,
or in a checking block, failures can be corrected. For
example, circuitry can be triplicated, with a voting
circuit to take two out of three, thus correcting a
single failure. An arrangement for redundancy
correction of failures of one order and for detection of
failures of a higher order is called error checking and
correction (ECC). Normally, ECC allows detection
of double-bit failures and correction of single-bit
failures.
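The two forms of redundancy correction described above can be sketched in a few lines. This is only an illustration: the text does not say which code a given model uses, so the extended Hamming (8,4) code below is an assumed stand-in for ECC, and the function names `vote`, `encode`, and `decode` are hypothetical.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority vote over three copies of a value."""
    return (a & b) | (a & c) | (b & c)


def encode(nibble: int) -> list:
    """Encode 4 data bits as an 8-bit word; index 0 holds the overall parity."""
    word = [0] * 8
    # Data bits occupy positions 3, 5, 6, 7; check bits occupy 1, 2, 4.
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word[1:]) % 2
    return word


def decode(word: list):
    """Return (nibble, status); nibble is None for an uncorrectable failure."""
    syndrome = 0
    for p in (1, 2, 4):
        if sum(word[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(word) % 2 == 0
    fixed = list(word)
    if syndrome and overall_ok:
        # Two bits differ: detectable, but not correctable.
        return None, "double-bit failure detected"
    if syndrome:
        fixed[syndrome] ^= 1  # the syndrome is the position of the bad bit
        status = "single-bit failure corrected"
    elif not overall_ok:
        fixed[0] ^= 1  # only the overall parity bit itself was wrong
        status = "single-bit failure corrected"
    else:
        status = "ok"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return nibble, status
```

With this sketch, flipping any one bit of an encoded word still decodes to the original data, while flipping two bits yields `None` together with a detection status.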
CPU Retry
In models with CPU-retry capability, information
about the state of the machine is saved periodically.
The point in the processing to which this saving of
information pertains is referred to as a "hardware
checkpoint." When a malfunction is detected,
recovery is attempted by returning the machine state
to that existing at the latest hardware checkpoint and
proceeding from that point. The interval between
checkpoints is model-dependent. In some cases,
several checkpoints are established within a single
instruction; in others, checkpoints are established
only at the beginning of instructions, or even less
frequently.
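The checkpoint-and-retry idea can be modeled in software as follows. This is a minimal sketch, not a description of any model's hardware: the class and exception names are hypothetical, the machine state is reduced to a dictionary, and the retry limit is an assumption (the text gives no limit).

```python
import copy


class MalfunctionDetected(Exception):
    """Stands in for a machine-detected malfunction (hypothetical name)."""


class DamageCondition(Exception):
    """Raised when recovery by retry does not succeed (hypothetical name)."""


class RetryMachine:
    def __init__(self, state, max_retries=3):
        self.state = state
        self.max_retries = max_retries  # assumed limit; not given in the text
        self.checkpoint = copy.deepcopy(state)

    def take_checkpoint(self):
        """Save the machine state; this point becomes the latest checkpoint."""
        self.checkpoint = copy.deepcopy(self.state)

    def run_step(self, step):
        """Run one unit of work, retrying from the checkpoint on malfunction.

        Returns the number of retries that were needed.
        """
        for attempt in range(self.max_retries + 1):
            try:
                step(self.state)
            except MalfunctionDetected:
                # Discard partial results and return to the saved state.
                self.state = copy.deepcopy(self.checkpoint)
                continue
            self.take_checkpoint()
            return attempt
        raise DamageCondition("retry did not succeed")
```

In this sketch a checkpoint is taken after every successful step, which corresponds to the case where checkpoints fall at instruction boundaries; a finer or coarser interval would change only where `take_checkpoint` is called.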
Unit Deletion
In some models, malfunctions in certain transparent
units of the system can be circumvented by
discontinuing the use of the unit. Examples of cases where
transparent-unit deletion may be used include the
disabling of all or a portion of a cache or of a
translation-lookaside buffer (TLB).
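A software analogy may help show why deleting a transparent unit does not change results. The sketch below assumes a simple direct-mapped cache in front of a backing store; the class `DeletableCache`, its slot layout, and the `hits` counter are all hypothetical and are not taken from any model.

```python
class DeletableCache:
    """Direct-mapped cache whose slots can be individually taken out of use."""

    def __init__(self, backing, slots=4):
        self.backing = backing            # mapping of address to value
        self.slots = [None] * slots       # each entry is (address, value)
        self.deleted = [False] * slots    # True once a slot is deleted
        self.hits = 0

    def delete_slot(self, index):
        """Discontinue use of one portion of the cache."""
        self.deleted[index] = True
        self.slots[index] = None

    def read(self, address):
        index = address % len(self.slots)
        if self.deleted[index]:
            # Deleted portion: go straight to the backing store.
            return self.backing[address]
        entry = self.slots[index]
        if entry is not None and entry[0] == address:
            self.hits += 1
            return entry[1]
        value = self.backing[address]
        self.slots[index] = (address, value)
        return value
```

Reads return the same values before and after `delete_slot`; only the hit count, and therefore performance, is affected.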
Handling of Machine Checks
A machine check can be caused only by a machine
malfunction and never by data or instructions. This
is ensured during the power-on sequence by
initializing the machine controls to a valid state and by
placing valid CBC in the programmable registers, in the
keys in storage, and, if it is volatile, also in main
storage.
Specification of an unavailable system component,
such as a storage unit, channel, or I/O device,
does not cause a machine-check indication. Instead,
such a condition is indicated by the appropriate
program or I/O interruption or code setting.
172 System/370 Principles of Operation
A machine-check is indicated whenever the result
of an operation could be affected by information
with invalid CBC, or when any other malfunction
makes it impossible to establish reliably that an
operation can be, or has been, performed correctly.
When information with invalid CBC is fetched
but is not used, the condition may or may not be
indicated. In order to guarantee system integrity,
however, CBC is preserved as invalid unless the
contents of the entire checking block are replaced in
the operation.
Depending on the model and on the type of
malfunction, a malfunction detected during an I/O
operation may cause a machine-check interruption
condition, may result in an I/O-error condition, or
both. I/O-error conditions are indicated by an I/O
interruption or by the appropriate condition code
setting during the execution of an I/O instruction.
When a CCW or data with invalid CBC is fetched
from storage but is not used in an I/O operation, it
depends on the model whether the condition is
reported.
When a machine malfunction is detected, the
action taken depends upon the nature of the
malfunction and the situation in which it occurs. In some
cases an automatic hardware recovery mechanism
may be invoked. When the recovery attempt is
unsuccessful, or if a recovery mechanism does not
exist, a damage condition is said to exist.
Machine-check conditions may be reported as machine-check
interruptions or I/O interruptions, or they may cause
the CPU to enter the check-stop state.
Handling of Invalid CBC in Storage
When a checking block contains an invalid CBC and
an attempt is made to store into the block without
replacing the entire block, the data in the block
(including the check bits) is regenerated by the
storage unit, and no new data is entered into the block.
Normally the contents of the block can only be
changed by presenting an entire block of data to be
entered on one storage cycle.
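The store behavior described above can be modeled as follows. This is a sketch under stated assumptions: the block size of eight bytes is arbitrary (the actual size is model-dependent), the class `CheckingBlock` and its `cbc_valid` flag are hypothetical, and the model assumes that a full-block store always produces fresh check bits.

```python
class CheckingBlock:
    """One main-storage checking block with a flag for the state of its CBC."""

    def __init__(self, size=8):  # size is an assumption; it is model-dependent
        self.data = bytearray(size)
        self.cbc_valid = True

    def store(self, offset, new_bytes):
        """Attempt a store; return True if new data entered the block."""
        if offset < 0 or offset + len(new_bytes) > len(self.data):
            raise ValueError("store extends outside the checking block")
        full_block = offset == 0 and len(new_bytes) == len(self.data)
        if not self.cbc_valid and not full_block:
            # Partial store into a block with invalid CBC: the old contents
            # are regenerated unchanged and the CBC stays invalid.
            return False
        self.data[offset:offset + len(new_bytes)] = new_bytes
        if full_block:
            # Assumed: replacing the whole block writes new, valid check bits.
            self.cbc_valid = True
        return True
```

A partial store into a block marked invalid therefore leaves both the data and the invalid indication untouched, while a store of the entire block replaces the contents.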
The size of the main-storage checking block
depends on the model. When the main-storage
checking block consists of multiple bytes and contains an
invalid CBC, special procedures are necessary to
restore or place new information into the block. The
restoring of a valid CBC in a storage location is
called storage validation. Validation of storage is
provided as a program function and is also provided
with the system clear-manual operation.
A checking block with invalid CBC is never validated
under programming control unless the entire
contents of the checking block are replaced. Even
when an instruction, or an I/O input operation,
specifies that the entire contents of a checking block
are to be replaced, validation may or may not occur,
depending on the operation and the model. Storage
validation during the IPL input operations follows
the same rules as for normal input operations.
Programmed Validation of Storage
Execution of the instruction MOVE (MVC) or MOVE
LONG (MVCL) validates the main-storage
area containing the first operand when the following
conditions are satisfied:
• The first-operand field and second-operand
  field participating in the operation do not
  overlap.
• The first-operand field starts on a boundary of a
  checking block and is an integral number of
  checking blocks in length.
• For MVCL, the second-operand field, if nonzero
  in length, starts on a boundary of a checking
  block and, if it is shorter than the first-operand
  field, is an integral number of checking blocks
  in length.
An interruption or stopping of the CPU during
execution of MVCL does not affect the validation
function performed.
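The conditions listed above can be expressed as a single predicate. The sketch below is an interpretation, not a definition: the function name and parameters are hypothetical, the checking-block size must be supplied by the caller because it is model-dependent, address wraparound is ignored, and the "participating" part of the second operand is assumed to be the portion no longer than the first operand.

```python
def move_validates(dst, dst_len, src, src_len, block_size, is_mvcl):
    """Return True if the move meets the stated conditions for validation.

    dst, src      starting addresses of the first and second operands
    dst_len       length of the first-operand field in bytes
    src_len       length of the second-operand field in bytes
    block_size    size of a main-storage checking block (model-dependent)
    is_mvcl       True for MOVE LONG, False for MOVE
    """
    if not is_mvcl and src_len != dst_len:
        raise ValueError("MVC operands have the same length")

    # Only the part of the second operand that is actually moved takes part.
    used_src = min(src_len, dst_len)
    overlaps = dst < src + used_src and src < dst + dst_len
    if used_src > 0 and overlaps:
        return False

    # First operand: starts on a block boundary, whole number of blocks.
    if dst % block_size or dst_len % block_size:
        return False

    if is_mvcl and src_len > 0:
        # Second operand: starts on a block boundary.
        if src % block_size:
            return False
        # If shorter than the first operand, a whole number of blocks.
        if src_len < dst_len and src_len % block_size:
            return False
    return True
```

For example, with an assumed block size of 8, an MVCL that clears a 16-byte aligned area using a zero-length second operand satisfies the predicate, whereas the same move with a 4-byte second operand does not.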
Handling of Invalid CBC in Keys in Storage
Depending on the model, each key in storage may
consist of a single checking block, or the protection
bits and the change and reference bits may be in
separate checking blocks. Invalid CBC on the key in
storage is ignored in storing or fetching with a zero
protection key. References to main storage to which
protection does not apply are treated as if a
protection key of zero is used for the reference. This
includes such references as channel references during
the IPL procedure, implicit references such as in
timer updating and interruption action, and DAT table
accesses. The key in storage is validated by
SET STORAGE KEY.
The table "Handling of Invalid CBC in Keys in
Storage" describes the action taken when the key in
storage has invalid CBC.
Handling of Invalid CBC in Registers
During a machine-check interruption, the contents
of the general, floating-point, and control registers,
and of the CPU timer and clock comparator if they
are installed, are stored at fixed locations in main
storage. Invalid CBC detected during this operation
does not result in additional machine-check-interruption
conditions; instead, the accuracy of the
information stored is indicated by the appropriate
setting of the validity bits in the machine-check-interruption
code. On some models, registers with
invalid CBC will be automatically validated during
the interruption. On other models, programmed
validation is required. The TOD clock and the prefix
register are not stored during the machine-check
interruption and are not validated.
On those models in which registers are not
automatically validated as part of the machine-check
interruption, a register with invalid CBC will not
cause a machine-check interruption condition unless
the contents of the register are actually used. In
these models, each register may consist of one or
more checking blocks, but multiple registers are not
included in a single checking block. When only a
portion of a register is accessed, invalid CBC in the
unused portion of the same register may cause a
machine-check interruption condition. For example,
invalid CBC in the right half of a long operand of a
floating-point register may cause a machine-check
interruption condition if a LOAD (LE) operation
attempts to replace the left half, or short form, of
the register.
Invalid CBC associated with the check-stop
control bit (control register 14, bit 0) and with the
asynchronous fixed-logout control bit (control register
14, bit 9) will cause the CPU either to immediately
enter the check-stop state or to assume that bits 0 and 9
have their initialized values of one and zero,
respectively.
Invalid CBC associated with the prefix register
cannot be safely reported by the machine-check
interruption, since the interruption itself requires
that the prefix value be applied to convert real
addresses to the corresponding absolute addresses.
When the check-stop control bit (control register 14,
bit 0) is one, invalid CBC in the prefix register
causes the CPU to immediately enter the check-stop
state. When the check-stop control bit is zero,
invalid CBC in the prefix register either may cause the
CPU to enter the check-stop state or may generate a
system damage condition, depending on the model.
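The prefix-register rule above reduces to a small decision function. In this sketch the names `Outcome` and `prefix_register_invalid_cbc` are hypothetical, and the model-dependent choice is represented by an assumed boolean parameter.

```python
from enum import Enum


class Outcome(Enum):
    CHECK_STOP = "check-stop state"
    SYSTEM_DAMAGE = "system damage condition"


def prefix_register_invalid_cbc(check_stop_control_bit, model_check_stops):
    """Outcome of invalid CBC in the prefix register.

    check_stop_control_bit  value of control register 14, bit 0 (0 or 1)
    model_check_stops       assumed flag for the model-dependent choice
                            made when the control bit is zero
    """
    if check_stop_control_bit == 1:
        # With the control bit one, the CPU always enters check-stop.
        return Outcome.CHECK_STOP
    # With the control bit zero, the result depends on the model.
    if model_check_stops:
        return Outcome.CHECK_STOP
    return Outcome.SYSTEM_DAMAGE
```

A system damage condition is therefore possible only when the check-stop control bit is zero.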
Validation of Registers
On those models which do not validate registers
during a machine-check interruption, the following
instructions will cause validation of a register,
provided the information in the register is not used
before the register is validated. Other instructions,
although they replace the entire contents of a
register, do not necessarily cause validation.
General registers are validated by BRANCH
AND LINK (BAL, BALR), LOAD (LR), and LOAD ADDRESS (LA). LOAD (L) and LOAD MULTIPLE (LM) validate if the operand is on a
Machine-Check Handling 173