GA22-7000-10 IBM System/370 Principles of Operation Sept 1987 page 11-3

out of three, thus correcting a single
failure. An arrangement for correction
of failures of one order and for
detection of failures of a higher order
is called error checking and correction (ECC). Commonly, ECC allows correction
of single-bit failures and detection of
double-bit failures.
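
The single-correct, double-detect behavior described above can be illustrated with a textbook extended Hamming code over four data bits. This is only a sketch: the source does not say which code any System/370 model uses, and the names `encode` and `decode` and the 8-bit layout are assumptions made for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]   # d[0] is the least significant bit
    code = [0] * 8                              # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d      # data bits sit at positions 3, 5, 6, 7
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                 # makes the total number of ones even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions holding a one
    odd_parity = sum(code) % 2                  # 1 when an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Single-bit failure: the syndrome names the failing position
        # (0 means the overall parity bit itself failed).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected, not correctable.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

In this toy code, flipping any one bit of a codeword is repaired, while flipping any two bits is reported as uncorrectable, which corresponds to the case where the machine would report an uncorrected error.
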
Depending on the model and the portion
of the machine in which ECC is applied,
correction may be reported as system
recovery, or no report may be given. Uncorrected errors in storage and in the
storage key may be reported, along with
a failing-storage address, to indicate
where the error occurred. Depending on
the situation, these errors may be
reported along with system recovery,
with external secondary report, or with
the damage or backup condition resulting
from the error.
CPU RETRY
In some models, information about some
portion of the state of the machine is
saved periodically. The point in the
processing at which this information is
saved is called a checkpoint. The
information saved is referred to as the
checkpoint information. The action of
saving the information is referred to as
establishing a checkpoint. The action
of discarding previously saved information
is called invalidation of the
checkpoint information. The length of
the interval between establishing checkpoints
is model-dependent. Checkpoints may be established at the beginning of
each instruction or several times within
a single instruction, or checkpoints may
be established less frequently.
Subsequently, this saved information may
be used to restore the machine to the
state that existed at the time when the
checkpoint was established. After
restoring the appropriate portion of the
machine state, processing continues from
the checkpoint. The process of restoring
to a checkpoint and then continuing
is called CPU retry.
CPU retry may be used for machine-check
recovery, to effect nullification and
suppression of instruction execution
when certain program interruptions
occur, and in other model-dependent
situations.
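
The checkpoint vocabulary above (establishing, invalidating, and restoring) can be pictured with a small model. This is a sketch under stated assumptions, not a description of any real model's mechanism: the class `RetryModel`, its method names, and the choice of registers plus an instruction address as the saved portion of machine state are all hypothetical.

```python
import copy


class RetryModel:
    """Toy model of checkpoint handling; all names here are illustrative only."""

    def __init__(self):
        # Hypothetical portion of the machine state that gets checkpointed.
        self.state = {"regs": [0, 0, 0, 0], "instruction_address": 0}
        self.checkpoint = None                  # no checkpoint information saved yet

    def establish_checkpoint(self):
        # Save a private copy so later changes do not alter the checkpoint.
        self.checkpoint = copy.deepcopy(self.state)

    def invalidate_checkpoint(self):
        # Discard previously saved checkpoint information.
        self.checkpoint = None

    def cpu_retry(self):
        # Restore to the checkpoint; processing would then continue from it.
        if self.checkpoint is None:
            return False                        # nothing to restore to
        self.state = copy.deepcopy(self.checkpoint)
        return True


cpu = RetryModel()
cpu.state["regs"][1] = 7
cpu.establish_checkpoint()
cpu.state["regs"][1] = 99                       # work done after the checkpoint
cpu.state["instruction_address"] = 4
recovered = cpu.cpu_retry()                     # e.g. after a malfunction is detected
```

After `cpu_retry()` the modeled state is the one that existed when the checkpoint was established; once `invalidate_checkpoint()` has been called, a retry is no longer possible in this model.
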
Effects of CPU Retry
CPU retry is, in general, performed so
that there is no effect on the program.
However, change bits which have been
changed from zeros to ones are not
necessarily set back to zeros. As a
result, change bits appear to be set
to ones for blocks which would have been
accessed if restoring to the checkpoint
had not occurred. If the path taken by
the program is dependent on information
that may be changed by another CPU or by
a channel or if an interruption occurs,
then the final path taken by the program
may be different from the earlier path;
therefore, change bits may be ones
because of stores along a path apparently
never taken.
Checkpoint synchronization consists in
the following steps.
1. The CPU operation is delayed until
all conceptually previous accesses
by this CPU to storage have been
completed, both for purposes of
machine-check detection and as
observed by other CPUs and by channels.
2. All previous checkpoints, if any,
are canceled.
3. Optionally, a new checkpoint is
established. The CPU operation is
delayed until all of these actions
appear to be completed, as observed
by other CPUs and by channels.
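
The three steps above can be sketched as a single routine. This is a minimal illustration that assumes stores are buffered in a queue before they become visible; the class `SyncModel`, its fields, and the `establish_new` parameter are hypothetical and do not describe how any model implements the action.

```python
class SyncModel:
    """Toy model of the checkpoint-synchronization steps; names are illustrative."""

    def __init__(self):
        self.storage = {}           # storage as observed by other CPUs and channels
        self.pending_stores = []    # stores issued by this CPU but not yet completed
        self.checkpoints = []       # saved checkpoint information
        self.regs = {"r0": 0}       # hypothetical state captured by a checkpoint

    def store(self, address, value):
        self.pending_stores.append((address, value))

    def checkpoint_synchronize(self, establish_new=False):
        # Step 1: complete all conceptually previous storage accesses, in order.
        while self.pending_stores:
            address, value = self.pending_stores.pop(0)
            self.storage[address] = value
        # Step 2: cancel all previous checkpoints, if any.
        self.checkpoints.clear()
        # Step 3: optionally establish a new checkpoint.
        if establish_new:
            self.checkpoints.append(dict(self.regs))


model = SyncModel()
model.checkpoints.append({"r0": 0})             # an earlier checkpoint
model.store(0x1000, 0xAB)
model.checkpoint_synchronize(establish_new=True)
```

In this sketch the routine returns only after all three steps are done, which stands in for the requirement that CPU operation is delayed until the actions appear complete to other CPUs and channels.
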
Handling of Machine Checks during Checkpoint Synchronization
When, in the process of completing all
previous stores as part of the
checkpoint-synchronization action, the
machine is unable to complete all stores
successfully but can successfully
restore the machine to a previous checkpoint,
processing backup is reported.
When, in the process of completing all
stores as part of the checkpoint-synchronization
action, the machine is
unable to complete all stores successfully
and cannot successfully restore
the machine to a previous checkpoint,
the type of machine-check-interruption
condition reported depends on the origin
of the store. Failure to successfully
complete stores associated with instruction
execution may be reported as
instruction-processing damage, or some
less critical machine-check-interruption
condition may be reported with the
storage-logical-validity bit set to
zero. A failure to successfully
complete stores associated with the
execution of an interruption, other than
program or supervisor call, is reported
as system damage.
Chapter 11. Machine-Check Handling 11-3

When the machine check occurs as part of
a checkpoint-synchronization action
before the execution of an instruction,
the execution of the instruction is
nullified. When it occurs before the
execution of an interruption, the interruption
condition, if the interruption
is external, I/O, or restart, is held
pending. If the checkpoint-synchronization
operation was a
machine-check interruption, then along
with the originating condition, either
the storage-logical-validity bit is set
to zero or instruction-processing damage
is also reported. Program interruptions,
if any, are lost.
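
The reporting rules in this section amount to a small decision procedure, sketched below. The function name `classify_store_failure`, its parameters, and the origin labels `"instruction"` and `"other-interruption"` are hypothetical; the sketch covers only the cases the text states and deliberately rejects any origin the text does not address.

```python
def classify_store_failure(stores_completed, restored_to_checkpoint, origin):
    """Return the set of conditions that may be reported, per the rules above.

    origin is a hypothetical label: "instruction" for stores associated with
    instruction execution, "other-interruption" for stores associated with an
    interruption other than program or supervisor call.
    """
    if stores_completed:
        return set()                            # nothing to report from this cause
    if restored_to_checkpoint:
        return {"processing backup"}
    if origin == "instruction":
        # Either of these may be reported; the choice is left open by the text.
        return {
            "instruction-processing damage",
            "less critical condition with storage logical validity set to zero",
        }
    if origin == "other-interruption":
        return {"system damage"}
    raise ValueError("origin not covered by the rules sketched here: %r" % (origin,))


backup = classify_store_failure(False, True, "instruction")
damage = classify_store_failure(False, False, "other-interruption")
```

Returning a set rather than a single value reflects that, for instruction-related stores, the text allows more than one possible report.
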
Checkpoint-Synchronization Operations
All interruptions and the execution of
certain instructions cause a
checkpoint-synchronization action to be
performed. The operations which cause a
checkpoint-synchronization action are
called checkpoint-synchronization operations
and include:
• CPU reset
• All interruptions: external, I/O, machine check, program, restart, and supervisor call
• The BRANCH ON CONDITION (BCR) instruction with the M1 and R2 fields containing all ones and all zeros, respectively
• The instructions LOAD PSW, SET STORAGE KEY, SET STORAGE KEY EXTENDED, and SUPERVISOR CALL
• All I/O instructions
• The instructions MOVE TO PRIMARY, MOVE TO SECONDARY, PROGRAM CALL, PROGRAM TRANSFER, SET ADDRESS SPACE CONTROL, and SET SECONDARY ASN
• The DAS-tracing function
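
As an illustration of the BCR entry in the list above, the following sketch tests whether a two-byte instruction is the checkpoint-synchronizing form. It relies on BCR being an RR-format instruction with operation code 07, so that a mask field of all ones and an R2 field of all zeros gives the halfword 07F0 (hex). The function name is hypothetical.

```python
def is_checkpoint_sync_bcr(halfword):
    """True when a 16-bit instruction is BCR with mask = 1111 and R2 = 0000 (binary)."""
    opcode = (halfword >> 8) & 0xFF     # bits 0-7: operation code
    mask = (halfword >> 4) & 0xF        # bits 8-11: mask field
    r2 = halfword & 0xF                 # bits 12-15: R2 field
    return opcode == 0x07 and mask == 0xF and r2 == 0x0


sync_form = is_checkpoint_sync_bcr(0x07F0)      # BCR 15,0
plain_branch = is_checkpoint_sync_bcr(0x07FE)   # BCR 15,14: an ordinary branch
```

Here `sync_form` is `True` and `plain_branch` is `False`: only the all-ones mask combined with a zero R2 field qualifies.
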
Programming Note
The instructions which are defined to
cause the checkpoint-synchronization
action invalidate checkpoint information
but do not necessarily establish a new
checkpoint. Additionally, the CPU may
establish a checkpoint between any two
instructions or units of operation, or
within a single unit of operation.
Thus, the point of interruption for the
machine check is not necessarily at an
instruction defined to cause a
checkpoint-synchronization action.
Checkpoint-Synchronization Action
For all interruptions except I/O interruptions,
a checkpoint-synchronization
action is performed at the completion of
the interruption. For I/O interruptions,
a checkpoint-synchronization
action may or may not be performed at
the completion of the interruption. For
all interruptions except program,
supervisor-call, and exigent machine-check
interruptions, a checkpoint-synchronization
action is also performed
before the interruption. The fetch
access to the new PSW may be performed
either before or after the first
checkpoint-synchronization action. The
store accesses and the changing of the
current PSW associated with the interruption
are performed after the first
checkpoint-synchronization action and
before the second.
For all checkpoint-synchronization instructions
except BRANCH ON CONDITION (BCR), I/O instructions, and SUPERVISOR CALL,
checkpoint-synchronization actions
are performed before and after the
execution of the instruction. For BCR,
only one checkpoint-synchronization
action is necessarily performed, and it
may be performed either before or after
the instruction address is updated. For
SUPERVISOR CALL, a checkpoint-synchronization
action is performed
before the instruction is executed,
including the updating of the instruction
address in the PSW. The
checkpoint-synchronization action taken
after the supervisor-call interruption
is considered to be part of the interruption
action and not part of the
instruction execution. For I/O instructions,
a checkpoint-synchronization
action is always performed before
the instruction is executed and may or
may not be performed after the instruction
is executed.
The DAS-tracing function causes
checkpoint-synchronization actions to be
performed before the trace action and
after completion of the trace action.
UNIT DELETION
In some models, malfunctions in certain
units of the system can be circumvented
by discontinuing the use of the unit.
Examples of cases where unit deletion
may occur include the disabling of all
or a portion of a cache or of a
translation-lookaside buffer (TLB).
Unit deletion may be reported as a
degradation machine-check-interruption
condition.
