From maf@hpesmaf Tue Nov  1 12:43 MST 1988
Received: from hpesmaf.HP.COM by hpfclr.HP.COM; Tue, 1 Nov 88 12:43:04 mst
Received: by hpesmaf.HP.COM; Tue, 1 Nov 88 12:49:21 mst
Date: Tue, 1 Nov 88 12:49:21 mst
From: Mark Forsyth <maf@hpesmaf>
Full-Name: Mark Forsyth
Message-Id: <8811011949.AA21827@hpesmaf.HP.COM>
To: bja@hpesmaf, cag@hpesmaf, jvg@hpesmaf, rob@hpesmaf, tpp@hpesmaf
Status: R

>From stm@hpesstm Thu Oct 20 19:27 MDT 1988
Received: from hpesstm.HP.COM by hpesmaf.HP.COM; Thu, 20 Oct 88 19:26:46 mdt
Received: by hpesstm.HP.COM; Thu, 20 Oct 88 19:21:46 mdt
From: Steve Mangelsdorf <stm@hpesstm>
Full-Name: Steve Mangelsdorf
Message-Id: <8810210121.AA19666@hpesstm.HP.COM>
Subject: MIPS visit summary
To: d_georg@hpesstm, jkw@hpesstm, cgk@hpesstm, maf@hpesstm, crs@hpesstm,
        jdy@hpesstm, dft@hpesstm, twg@hpesstm
Date: Thu, 20 Oct 88 19:21:40 MDT
X-Mailer: Elm [version 2.0 gamma]
Status: R

                                                          stm 10/20/88



                    Summary of MIPS Presentation to TCG


There is a big book of slides which you can borrow from Darius, Jeff, or
me.  I have omitted material that is covered in the slides.


Architecture
------------

They liked HP-PA better than all other RISC architectures (except their
own).  88K was the runner-up.  There was contempt for Sparc and Apollo.

They may add some features to their architecture:  a nullify bit for
branches, indexed loads/stores, and 64 bit coprocessor loads/stores.
They have committed to forward compatibility for their customers, not
backward compatibility.  This was about the only good news for HP in the
presentation.

TLB coherency is done in software.  

They claim that TLB miss rates are reduced 70% by dedicating two "super"
entries for kernel instruction and data spaces.

There are no decimal instructions, and they claimed that benchmarking
of COBOL programs showed little advantage to adding them. 

They said they will probably never build a vector coprocessor.


Implementations
---------------

The BIT (ECL) implementation will be completed in '89.  It is 60 VUPS
(they stressed that this assumed their usual conservative rating values)
and 12 Mflops on DP Linpack.  They expect that it will be used only in
servers and multi user machines, not workstations.  ECL gives a roughly
2:1 performance advantage.

A 33 MHz version of the R3000 using 15ns RAMs will be available soon.

The R4000 (the next CMOS implementation) will sample in 11/89, and
systems will be shipping in 6/90.  It has an on chip I-cache.  This
information was given to Jim Brokish under NDA.  Otherwise, they
wouldn't say much about the implementation, but in more general terms
they indicated the trend was to on chip caches and issuing multiple
instructions per cycle.  They have extensively simulated "CPI < 1"
schemes.

The TLB miss penalty is 10-12 instructions, or 15-20 cycles.  They claim
that TLB misses cause only a 1% performance loss.  The fully associative
TLBs have only 64 entries, and there is no backup TLB.
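
As a rough cross-check of the 1% claim, the sketch below works out how
rare TLB misses would have to be for a 15-20 cycle penalty to cost only
1% of run time.  The model (time per instruction = base CPI plus miss
rate times penalty) and the base CPI of 1.0 are assumptions added for
illustration; they are not figures from the presentation.

```python
def implied_tlb_miss_rate(loss_fraction, penalty_cycles, base_cpi=1.0):
    """Misses per instruction that would produce the given slowdown.

    Assumed model: cycles per instruction = base_cpi + m * penalty_cycles,
    where loss_fraction is the share of total cycles spent handling misses.
    base_cpi=1.0 is a placeholder, not a measured MIPS number.
    """
    # loss = m*p / (base + m*p)  =>  m = loss*base / (p * (1 - loss))
    return loss_fraction * base_cpi / (penalty_cycles * (1.0 - loss_fraction))


# Penalty range quoted in the presentation: 15 to 20 cycles.
for penalty in (15, 20):
    rate = implied_tlb_miss_rate(0.01, penalty)
    print(f"penalty {penalty} cycles: one miss per {1 / rate:,.0f} instructions")
```

Under these assumptions the claim corresponds to roughly one TLB miss
every 1,500 to 2,000 instructions.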

Current implementations have a direct mapped write through D-cache.  The
advantages are that there is no need for error correction, MP protocols
are simpler, and stores can be done in one state (no need to read tags).
They take a penalty on byte/halfword writes.  They expect MP systems to
add a 2nd level write back cache to reduce bus traffic.
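
The reason a full-word store needs no tag read can be seen in a toy
model.  Because every store also goes to memory, whatever the line held
before is never the only copy, so it can simply be overwritten.  The
sketch below is a simplified illustration, not the MIPS design: the
one-word line size, the class and method names, and the handling of
byte stores (check the tag, update the line only on a hit) are all
assumptions.

```python
class DirectMappedWriteThroughCache:
    """Toy direct mapped, write through cache with one-word lines."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory              # dict: word address -> 32-bit value
        self.tags = [None] * num_lines
        self.data = [0] * num_lines
        self.tag_reads = 0                # counts accesses that need a tag compare

    def _split(self, word_addr):
        return word_addr % self.num_lines, word_addr // self.num_lines

    def store_word(self, word_addr, value):
        index, tag = self._split(word_addr)
        # No tag read: the old line is clean, so overwrite tag and data.
        self.tags[index] = tag
        self.data[index] = value
        self.memory[word_addr] = value    # write through

    def store_byte(self, word_addr, byte_index, byte_value):
        index, tag = self._split(word_addr)
        self.tag_reads += 1               # partial write must know if the line matches
        shift = 8 * byte_index
        old = self.memory.get(word_addr, 0)
        new = (old & ~(0xFF << shift) & 0xFFFFFFFF) | ((byte_value & 0xFF) << shift)
        self.memory[word_addr] = new      # write through
        if self.tags[index] == tag:       # hit: keep the cached word consistent
            self.data[index] = new

    def load_word(self, word_addr):
        index, tag = self._split(word_addr)
        self.tag_reads += 1
        if self.tags[index] != tag:       # miss: refill from memory
            self.tags[index] = tag
            self.data[index] = self.memory.get(word_addr, 0)
        return self.data[index]


cache = DirectMappedWriteThroughCache(num_lines=4, memory={})
cache.store_word(1, 0x11111111)
cache.store_word(5, 0x22222222)           # same index as address 1, replaces it
print(cache.tag_reads)                    # word stores never consulted the tags
```

In this model the extra tag compare on `store_byte` stands in for the
byte/halfword write penalty mentioned above.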

Performance on real math problems depends strongly on memory bus
bandwidth, not just the FPU design.

They handle denormalized numbers in software.

They say that synchronous RAMs give only a small frequency advantage.

CPU chip pricing is getting close to $10 per MIP over a certain range.
They expect today's products to sell for $50 per chip in '92 due to
competition among their semiconductor partners.

They passed around an R3000 board which made a strong impression.  It
was much smaller than a TopGun board, had many fewer layers, and used
single sided surface mount except for the two PGAs.

They do a lot of phase muxing of pins, including the cache address, tag
and data busses.

Right now they use real (physically addressed) caches, but they claim
they can easily go to virtual caches if and when it is necessary.


Design Methodology
------------------

Extensive behavioral simulation is used before tape release.  They boot
UNIX with their simulator!  The R3000 chip set had only one functional
bug (in the FPU).

Binning is used.

They have a goal of increasing clock rate 1.5 to 2X over the life of a
product with mask changes and compatible process enhancements.

They bragged about the thoroughness with which they perform VLSI
characterization and stress the functional design when samples are
available.  A lot of what they described sounded like our SEL and "phase
2" testing.  The time from getting R3000 samples to controlled shipments
of systems was around 3 months.

Their core tool set is Mentor Graphics running on Apollos.  The size of
their designs broke the tool set, but they seemed happy with Mentor's
responsiveness.  They also use HSPICE, ECAD's Dracula, and in house logic
simulators.


The Competition
---------------

The Apollo machine is a dead end due to poor performance and lack of
growth path.  It is 35% slower than the R3000 on a real math benchmark
(Doduc). 

A big shakeup is coming in the superworkstation market next year similar
to the shakeup in the minisuper market underway now.

The 33 MHz Cypress implementation, for which 20 VUPS is claimed, is more
like 15-16 VUPS.


Compilers
---------

The compilers are clearly industry leading.  They claim they are running
out of optimizations to implement.

Scheduling of the floating point units is done by the compiler.  The
FPU doesn't even bother queueing instructions!  There are hardware
interlocks for those cases where the compiler doesn't have enough
instructions to schedule.  They may add compiler switches to specify
the target machine if the various implementations' optimization rules
diverge too far.

The people from the compiler lab seemed very impressed.

They have a utility to reorganize an object file based on run time
profiling data in order to reduce the I-cache miss rate (remember the
I-cache is one way).
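
The idea behind such a utility can be sketched as follows.  In a direct
mapped I-cache, two procedures whose addresses differ by a multiple of
the cache size evict each other, so packing the most frequently
executed procedures next to each other keeps them on different cache
lines.  This is a generic greedy sketch, not MIPS' actual algorithm;
the function names, procedure names, sizes, counts and cache geometry
are all hypothetical.

```python
def layout_by_profile(sizes, counts, base=0):
    """Assign addresses so the most-executed procedures are contiguous."""
    order = sorted(sizes, key=lambda name: counts.get(name, 0), reverse=True)
    addresses, next_addr = {}, base
    for name in order:
        addresses[name] = next_addr
        next_addr += sizes[name]
    return addresses


def cache_lines(addr, size, cache_bytes, line_bytes):
    """Set of direct mapped cache line indexes a procedure occupies."""
    num_lines = cache_bytes // line_bytes
    first = addr // line_bytes
    last = (addr + size - 1) // line_bytes
    return {line % num_lines for line in range(first, last + 1)}


# Hypothetical program: two hot procedures separated by cold init code.
sizes = {"inner_loop": 512, "init": 3584, "helper": 512}
counts = {"inner_loop": 1_000_000, "init": 1, "helper": 900_000}
CACHE_BYTES, LINE_BYTES = 4096, 16

link_order = {"inner_loop": 0, "init": 512, "helper": 4096}
profiled = layout_by_profile(sizes, counts)

for label, layout in (("link order", link_order), ("profiled", profiled)):
    hot_a = cache_lines(layout["inner_loop"], sizes["inner_loop"], CACHE_BYTES, LINE_BYTES)
    hot_b = cache_lines(layout["helper"], sizes["helper"], CACHE_BYTES, LINE_BYTES)
    print(label, "conflicting lines:", len(hot_a & hot_b))
```

In the link-order layout the two hot procedures map onto the same cache
lines; after reordering by profile count they no longer overlap.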

Their compiler places about half as many instructions outside the
Linpack inner loop as our compiler.


UNIX
----

A binary licence for MIPS' OS and C compiler is only $300 per CPU, list
price.  A source licence for the OS is $50,000.


