.DF

  To: Nancy Schoendorf DSD                                 Date: June 1, 1984

From: Rob Horning                                       Subject: Firefox RAM

  cc: Craig Robinson
      Joe Mixsell DSD
.DE

This is a more detailed explanation of the error correction and error logging
strategy for Firefox than was given in the ERS.  The definition is not complete
and so it may change somewhat in the next few months.  If you have any questions
please call me at 1-226-2501.

The error correction strategy is based on the assumption that we must be able
to have a failed RAM chip without requiring a service call.  This requires
that we be able to correct single bit errors without noticeable performance
degradation and that we have the ability to correct double bit errors when
one of the errors is caused by a known bad chip (data comes back random from
the known bad chip).

As was mentioned in the ERS, error logging is required to be done by software.
Software will also have to determine when there is a hard error.  The memory
controller will provide information about errors as they occur.  It will give
the address of the error and the syndrome code for the error (which points at 
the bad bit).  The memory board will not keep any record of errors.

Firefox will require the operating system to periodically run a memory check
and scrubbing (to be explained later) process.  This will have to run every few
minutes.  The memory check portion of the process will do the following things.
.AL
.LI
Check the error flags on the memory boards.
.LI
If there is an error, check whether it is a hard error by writing the data
back to the location, reading it again, and checking whether the error
flag is set.
.LI
If it is not a hard error then log the information and check the log to see
if the soft error rate for the chip in error is high enough to be considered
a hard error.  The acceptable soft error rate needs to be determined.
.LI
If it is a hard error, check to see if a hard error has already been logged
for the memory bank.
.LI
If there has not been a hard error logged for the memory bank then log the error
and write the syndrome code to the location corresponding to the address.
.LI
If a hard error has already been logged, then report a hardware failure to
the console.  The memory board needs to be repaired.  It is still functional
but it does not have error correction for the failed bank and it may run at
about 70% speed.
.LE
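
The memory check steps above can be sketched roughly as follows.  This is
only an illustration in Python, not the Firefox software.  The FakeBoard
class, its method names, the log layout, and the SOFT_ERROR_LIMIT value are
all assumptions made for the example, and a simple per-chip count stands in
for the soft error rate, which has not been determined.

```python
from collections import defaultdict
from dataclasses import dataclass, field

SOFT_ERROR_LIMIT = 3  # placeholder only; the acceptable rate is undetermined


@dataclass
class ErrorReport:
    address: int
    syndrome: int  # points at the bad bit, and so at the chip
    bank: int


class FakeBoard:
    """Stand-in for a memory board (hypothetical interface, not real hardware)."""

    def __init__(self, pending=None, stuck=()):
        self.pending = pending    # ErrorReport waiting to be read, or None
        self.stuck = set(stuck)   # addresses that fail again after a rewrite
        self.known_bad = {}       # bank -> syndrome handed to the controller

    def read_error_flag(self):
        report, self.pending = self.pending, None
        return report

    def rewrite_and_reread(self, address):
        """Write the data back and read it again; True if the flag is set again."""
        return address in self.stuck

    def set_known_bad(self, bank, syndrome):
        self.known_bad[bank] = syndrome


@dataclass
class ErrorLog:
    soft_counts: dict = field(default_factory=lambda: defaultdict(int))
    hard_banks: dict = field(default_factory=dict)


def memory_check(board, log, console):
    report = board.read_error_flag()                    # step 1
    if report is None:
        return "no error"
    hard = board.rewrite_and_reread(report.address)     # step 2
    if not hard:                                        # step 3
        chip = (report.bank, report.syndrome)
        log.soft_counts[chip] += 1
        if log.soft_counts[chip] < SOFT_ERROR_LIMIT:
            return "soft error logged"
        # Too many soft errors on this chip: treat it as a hard error.
    if report.bank not in log.hard_banks:               # steps 4 and 5
        log.hard_banks[report.bank] = report.syndrome
        board.set_known_bad(report.bank, report.syndrome)
        return "hard error logged"
    console.append(                                     # step 6
        f"memory bank {report.bank}: hardware failure, board needs repair")
    return "hardware failure reported"
```

Note that the rewrite in step 2 also serves as the first type of scrubbing
described later, since it clears a soft error at the reported location.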

It has not been determined how or where errors should be logged.

If an uncorrectable error is detected by the memory controller, bus error will
be pulled.  The address of the error will be provided by the memory
controller to the software.

At power up it will be necessary for the PDC to determine where the hard errors
are.  If the hard errors are logged in the non-volatile memory on the Firefox
processor board, then the PDC can transfer the information to the memory
controller(s).  If not, then it will have to depend on the memory test to find
the bad chips.

The only time that it is important that the bad chips be found at power up
is when there are two bad chips in one bank (the memory board is operating in
a degraded mode).  The memory controller cannot correct double bit errors
unless it knows the location of one of the errors.
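
The following toy example shows why knowing the location of one error is
enough.  It is a sketch only and uses an 8-bit extended Hamming code over
4 data bits, which is an assumption chosen for brevity; the actual Firefox
code and word width are not described here.  All function names are
hypothetical.  The idea is to try both values for the known bad bit and
keep whichever trial the ordinary single bit correction accepts.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # data bits; positions 1, 2, 4 hold check bits


def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming word (list of bits)."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> i) & 1
    for check in (1, 2, 4):
        word[check] = sum(word[p] for p in range(1, 8) if p & check) & 1
    word[0] = sum(word[1:]) & 1  # overall parity bit
    return word


def correct_single(word):
    """Return the corrected word, or None if a double bit error is detected."""
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos          # syndrome points at the bad bit
    odd_parity = sum(word) & 1
    if syndrome == 0 and not odd_parity:
        return list(word)            # no error
    if odd_parity:
        fixed = list(word)
        fixed[syndrome] ^= 1         # single error (position 0 = parity bit)
        return fixed
    return None                      # even parity, nonzero syndrome: two errors


def correct_with_known_bad(word, bad_pos):
    """Correct up to one unknown error plus random data at a known bad bit."""
    for guess in (0, 1):
        trial = list(word)
        trial[bad_pos] = guess
        fixed = correct_single(trial)
        if fixed is not None:
            return fixed
    return None


def data_of(word):
    """Extract the 4 data bits from a word."""
    return sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))
```

For example, if encode(0b1011) is damaged at bit 6 (the known bad chip) and
at bit 3 (a soft error), correct_single reports a double bit error, while
correct_with_known_bad with bad_pos set to 6 recovers the original word.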

To aid the PDC in power up memory test the controller will most likely allow
the error correction to be turned off and the syndrome bits to be read and
written.  It is possible to determine where multiple bit errors occurred without
this feature but the PDC needs to understand the error correction algorithm.


Scrubbing -

Following is a brief description of scrubbing.

There are two types of scrubbing that I have heard of.  The first type
requires writing data back to the RAM when an error is detected.  Some
memory controllers do this automatically.  Firefox will require that the
software read the errored location and write it back to fix the soft
error.

The second type of scrubbing requires that every location is read periodically
to ensure that soft errors are not present in dormant areas of memory for
long periods of time.  With a 16 Mbyte system, reading and writing 60 cache
lines (32 bytes) every 5 minutes would be more than sufficient.
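
As a rough check on those figures, the time for one full pass over memory
can be worked out as below.  The sketch assumes that 16 Mbyte means 2**24
bytes, that 32 bytes is the size of one cache line, and that exactly 60
lines are scrubbed on each 5 minute run; the constant names are mine.

```python
MEMORY_BYTES = 16 * 2**20   # assumes "16 Mbyte" means 2**24 bytes
LINE_BYTES = 32             # one cache line
LINES_PER_RUN = 60          # lines read and rewritten on each run
RUN_INTERVAL_MIN = 5        # minutes between runs

total_lines = MEMORY_BYTES // LINE_BYTES
runs_per_sweep = -(-total_lines // LINES_PER_RUN)   # ceiling division
sweep_minutes = runs_per_sweep * RUN_INTERVAL_MIN
sweep_days = sweep_minutes / (60 * 24)

print(total_lines, runs_per_sweep, round(sweep_days, 1))
# prints: 524288 8739 30.3
```

Read literally, then, each location would be visited about once every
30 days.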

Both types of scrubbing may need to be implemented to decrease the probability that a
double bit error will occur.  The Firefox memory controller will not provide
scrubbing and so any scrubbing that is needed will have to be done with
software.  The second type of scrubbing may not be needed if the soft error
rate is low enough.  The first type will be needed to ensure that a
single error is not reported numerous times.

