Three mechanisms may be used to provide recovery
from machine-detected malfunctions: redundancy
correction, CPU retry, and unit deletion.
Redundancy Correction
When sufficient redundancy is included in circuitry,
or in a checking block, failures can be corrected. For
example, circuitry can be triplicated, with a voting
circuit to take two out of three, thus correcting a
single failure. An arrangement for redundancy
correction of failures of one order and for detection of
failures of a higher order is called error checking and
correction (ECC). Normally, ECC allows detection
of double-bit failures and correction of single-bit
failures.
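The two forms of redundancy correction described above can be sketched in a few lines. This is an illustrative model only, not the circuitry of any particular machine: the function names (vote, secded_encode, secded_decode) and the 4-data-bit extended Hamming layout are assumptions chosen to keep the example small.

```python
def vote(a, b, c):
    """Bitwise two-out-of-three majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds an overall parity bit, indexes 1, 2 and 4 hold the
    Hamming check bits, and indexes 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    for bit in code[1:]:
        code[0] ^= bit  # overall parity over positions 1..7
    return code


def secded_decode(code):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0  # XOR of the positions of all one bits
    overall = 0   # parity over the whole codeword
    for position, bit in enumerate(code):
        overall ^= bit
        if bit:
            syndrome ^= position
    if overall:
        # Odd number of flipped bits: assume one, located by the syndrome
        # (a syndrome of 0 means the overall parity bit itself failed).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even number of flipped bits with a nonzero syndrome: a double-bit
        # failure, which is detected but cannot be corrected.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

In this sketch, vote masks a failure in any one of the three copies, while secded_decode corrects any single-bit failure and reports any double-bit failure in the eight-bit codeword.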
CPU Retry
In models with CPU retry, information
about the state of the machine is saved periodically.
The point in the processing to which this saving of
information pertains is referred to as a "hardware
checkpoint." When a malfunction is detected,
recovery is attempted by returning the machine state
to that existing at the latest hardware checkpoint and
proceeding from that point. The interval between
checkpoints is model-dependent. In some cases,
several checkpoints are established within a single
instruction; in others, checkpoints are established
only at the beginning of instructions, or even less
frequently.
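The checkpoint-and-retry idea can be illustrated with a small software model. It is a sketch under stated assumptions, not a description of any model's hardware: the machine state is a plain dictionary, a checkpoint is taken before each step, and the names Malfunction, run_with_retry, make_flaky_add and max_retries are hypothetical.

```python
import copy


class Malfunction(Exception):
    """Stands in for a machine-detected malfunction during a step."""


def run_with_retry(state, steps, max_retries=3):
    """Run each step, restoring the latest checkpoint when a step fails."""
    checkpoint = copy.deepcopy(state)
    for step in steps:
        retries = 0
        while True:
            try:
                step(state)
                break
            except Malfunction:
                # Return the state to that of the latest checkpoint.
                state.clear()
                state.update(copy.deepcopy(checkpoint))
                retries += 1
                if retries > max_retries:
                    raise  # recovery unsuccessful
        checkpoint = copy.deepcopy(state)  # establish a new checkpoint
    return state


def make_flaky_add(register, amount, failures):
    """Build a step that corrupts the register and fails `failures` times."""
    remaining = [failures]

    def step(state):
        state[register] += amount
        if remaining[0] > 0:
            remaining[0] -= 1
            state[register] = -999  # simulated damage before detection
            raise Malfunction(register)

    return step


state = run_with_retry({"r1": 0}, [make_flaky_add("r1", 5, 1),
                                   make_flaky_add("r1", 2, 0)])
```

Here the first step fails once, is undone by restoring the checkpoint, and then succeeds on retry, so the final value of r1 is 7. When the retry limit is exceeded, the exception propagates, which corresponds loosely to an unsuccessful recovery attempt.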
Unit Deletion
In some models, malfunctions in certain transparent
units of the system can be circumvented by
discontinuing the use of the unit. Examples of cases where
transparent-unit deletion may be used include the
disabling of all or a portion of a cache or of a
translation-lookaside buffer (TLB).
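A unit is "transparent" in this sense when the program obtains the same results with or without it, only more slowly. The following minimal sketch models deletion of a whole cache; the class and attribute names (CachedStorage, CacheMalfunction, inject_cache_fault) are hypothetical, and the fault-injection flag exists only to drive the example.

```python
class CacheMalfunction(Exception):
    """Raised when the modeled cache detects an internal failure."""


class CachedStorage:
    def __init__(self, backing):
        self.backing = backing           # dict: address -> value (main storage)
        self.cache = {}
        self.cache_enabled = True
        self.inject_cache_fault = False  # test hook, not a real machine control

    def _cache_read(self, address):
        if self.inject_cache_fault:
            raise CacheMalfunction(address)
        if address not in self.cache:
            self.cache[address] = self.backing[address]
        return self.cache[address]

    def read(self, address):
        if self.cache_enabled:
            try:
                return self._cache_read(address)
            except CacheMalfunction:
                # Delete the unit: stop using the cache from now on.
                self.cache_enabled = False
                self.cache.clear()
        # With the cache deleted, every access goes to main storage.
        return self.backing[address]
```

After the malfunction, read continues to return correct data from the backing storage, and the cache is never consulted again.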
Handling of Machine Checks
A machine check can be caused only by a machine
malfunction and never by data or instructions. This
is ensured during the power-on sequence by
initializing the machine controls to a valid state and
placing valid CBC in the programmable registers, in the
keys in storage, and, if it is volatile, also in main
storage.
Specification of an unavailable system
component, such as a storage unit, channel, or I/O device,
does not cause a machine-check indication. Instead,
such a condition is indicated by the appropriate
program or
172 System/370
of an operation could be affected by information
with invalid CBC, or when any other malfunction
makes it impossible to establish reliably that an oper
ation can be, or has been, performed correctly.
When information with invalid CBC is fetched
but is not used, the condition may or may not be
indicated. In order to guarantee system integrity,
however, CBC is preserved as invalid unless the
contents of the entire checking block are replaced in
the operation.
Depending on the model, and on the type of
malfunction, a malfunction detected during an I/O
operation may cause a machine-check
condition, may result in an I/O error condition, or
both.
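I/O error conditions are indicated by an I/O
interruption or by the appropriate condition-code
setting during the execution of an I/O instruction.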
When a CCW or data with invalid CBC is fetched
from storage but is not used in an I/O operation, it
depends on the model whether the condition is
reported.
When a machine malfunction is detected, the
action taken depends upon the nature of the
malfunction and the situation in which it occurs. In some
cases an automatic hardware recovery mechanism
may be invoked. When the recovery attempt is
unsuccessful, or if a recovery mechanism does not
exist, a damage condition is said to exist.
Machine-check conditions may be reported as machine-check
interruptions or
the
Handling of Invalid CBC in Storage
When a checking block contains an invalid CBC and
an attempt is made to store into the block without
replacing the entire block, the data in the block
(including the check bits) is regenerated by the
storage unit, and no new data is entered into the block.
Normally the contents of the block can only be
changed by presenting an entire block of data to be
entered on one storage cycle.
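The storing rule above can be modeled in software. The sketch below is a simplification and not the behavior of any specific storage unit: it assumes an 8-byte checking block protected by a single parity bit, whereas real checking-block sizes and codes are model-dependent, and the names CheckingBlock, BLOCK_SIZE and parity are hypothetical.

```python
BLOCK_SIZE = 8  # bytes per checking block; assumed here, model-dependent in reality


def parity(data):
    """Return the even-parity bit over all bits of the block."""
    folded = 0
    for byte in data:
        folded ^= byte
    return bin(folded).count("1") & 1


class CheckingBlock:
    def __init__(self):
        self.data = bytearray(BLOCK_SIZE)
        self.check = parity(self.data)

    def cbc_valid(self):
        return self.check == parity(self.data)

    def store(self, offset, payload):
        """Store payload at offset; return False when the store is suppressed."""
        if offset < 0 or offset + len(payload) > BLOCK_SIZE:
            raise ValueError("store crosses the checking-block boundary")
        whole_block = offset == 0 and len(payload) == BLOCK_SIZE
        if not self.cbc_valid() and not whole_block:
            # Partial store into a block with invalid CBC: the block,
            # including its check bit, is left as it was.
            return False
        self.data[offset:offset + len(payload)] = payload
        self.check = parity(self.data)  # a whole-block store validates
        return True
```

In this model a partial store into a damaged block changes nothing and leaves the CBC invalid, while a store that replaces the entire block installs new data together with a valid check bit.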
The size of the main-storage checking block
depends on the model. When the main-storage
checking block consists of multiple bytes and contains an
invalid CBC, special procedures are necessary to
restore or place new information into the block. The
restoring of a valid CBC in a storage location is
called storage validation. Validation of storage is
provided as a program function and is also provided
with the system clear-manual operation.
A checking block with invalid CBC is never
validated unless the entire
contents of the checking block are replaced. Even
when an instruction, or an