The term embarrassingly parallel is widely used by computer scientists to describe applications that port most naturally to parallel architectures. Almost exclusively these computer scientists are doing floating point number crunching on gigantic computer models of complex problems like fluid turbulence. These applications first require domain decomposition of the model's data so that a portion of the model fits into each node of the parallel machine. Since most of the algorithms used operate on neighboring data, it is important to give each node as much memory as possible. The reason is the inefficiency introduced by the need to duplicate data on the boundaries of the portion stored in each node. The boundary has one less dimension than the data itself: if the data in a node represents a three dimensional model, then its two dimensional surface requires duplicates of some of its neighbor nodes' data before computations can be made on this border data. The ratio of duplicated data to node data is 2N*X^(N-1)/X^N, which simplifies to 2N/X, where N is the number of dimensions and X is the units of computation per dimension of the data partitioned for a single node. Another factor that has influenced parallel system design is the non-parallelizable portion of application programs. If a program is 90% parallelizable then the other 10% must always be computed sequentially. Adding a higher degree of parallelism does nothing for this non-parallelizable portion; the only way to speed it up is a faster node. These two factors have led designers away from massively parallel processors (MPP) built from many smaller, slower processors and toward fewer, faster processors with more memory per node.
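To make the boundary arithmetic concrete, here is a minimal Forth sketch (my illustration, not code from any particular system) that computes the duplicated data as a percentage of a node's data using the simplified ratio 2N/X:

   \ duplicated boundary data as a percent of node data: 2N/X
   : BOUNDARY%  ( n x -- pct )  SWAP 2* 100 * SWAP / ;
   3 100 BOUNDARY% .   \ 3D model, 100 units per side: prints 6
   3  10 BOUNDARY% .   \ same model cut ten times finer: prints 60

The smaller the block of data per node, the larger the fraction of duplicated boundary data, which is one more reason designers want as much memory per node as possible.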
In recent years workstation technology has seen a faster decrease in the ratio of price to performance than have supercomputers. In fact, while supercomputers have become faster in the past few years, they have seen no decrease in their price to performance ratio. (*1) Researchers have now demonstrated that clusters of workstations can provide the computing power of supercomputers at one tenth the cost and with no high maintenance fees. Typically these computer scientists are using workstations with 64 megabytes of memory and 50 MIPS, 50 MFLOPS performance, costing $25K each and connected on a sub-1-megabyte-per-second Ethernet network.
Artificial Intelligence software has traditionally been
written in LISP and run on very large sequential
computers. Today "C" and "C++" have become more popular for
producing work outside of the research lab. The computing
requirements and performance of academic AI software have
kept much of it out of the marketplace. FORTRAN still
dominates the parallel processing field, and although "C"
offers some improvement in the efficiency of AI platforms,
Forth has been used by some researchers who wanted to push
the price to performance ratio of their research. Neural
network simulations, natural language processing, and expert
systems are all well known in the Forth community. It is
expert systems that have had the most success in entering
the workplace despite the fact that many are very expensive.
Expert systems have demonstrated that they can provide
Ph.D. specialist level advice to a human through some form
of dialog about a limited domain of expertise. In fact many
experts are disturbed to find that most of the expertise
that took them years to learn can be encoded in a couple of
hundred simple logic statements. Philip J. Koopman Jr.'s
Stack Computers, the New Wave lists expert systems as one
of the best matches to the stack machine architecture. (*2)
In an article in AI Expert, Robert Trelease compared the
performance of a Tower of Hanoi benchmark on a number of
expert systems and machines, and the results were striking.
(*3) The slowest entry was the Digital VAX 11/780 running
the industry standard OPS5 expert environment at 60 seconds.
The fastest was FORPS on a NOVIX, where the expert system
converts the English language rules into native Forth code
for the NOVIX. The NOVIX time of .006 seconds is four
orders of magnitude faster than the VAX. Combine this with
roughly two more orders of magnitude difference in hardware
cost and you get more than six orders of magnitude
difference in the price to performance ratio.
AI research has remained very fragmented after decades of
progress. Game playing, natural language, expert systems,
neural networks, machine vision, and speech processing have
all made significant advances, but things like common sense,
consciousness, or self awareness have remained very elusive.
Much of the problem has been the attempts to do these things
on sequential computers.
If you study organic architecture you find incredible
parallelism! The granule cells are the most numerous cells
in the human brain, and it is estimated that there are 3 x
10^10 granule cells in the cerebellum alone. Granule cells
possess from one to seven dendrites, the average being four.
(*4) Although we understand the organic architecture to
some extent, it is very difficult to produce a non-organic
device with this many computing elements.
It is well understood that different parts of the human
brain are specialized to perform different functions and
that processing speeds are very slow, but the degree of
parallelism is very, very high. Given the high degree of
functional behavior possible with very few interconnected
neurons in more primitive organisms, the degree of
parallelism in the human brain gives a new meaning to MPP.
The latest Forth engine microprocessor developed by Charles
Moore is the MuP21. The MuP21 is designed to run OKAD, the
VLSI software that developed the MuP21 (a sort of hardware
metacompiling).
The MuP21 is essentially a complete workstation for VLSI CAD
on a single chip. Add a keyboard, some memory, a monitor,
some mass storage and software and you have a VLSI CAD
workstation.
Add a network interface and a few other things and you get
the F21, the next chip to be produced by Charles Moore. The
F21 very much resembles a conventional networked workstation
in functionality, but with orders of magnitude less
complexity and cost.
Parallel computing systems today, as noted before, are often
clusters of workstations. Unused computing cycles can be
gathered from idle workstations already in use, or dedicated
workstations can be clustered to form less expensive
supercomputers. Today people are typically using
workstations with 64 megabytes of memory and 50 MIPS, 50
MFLOPS performance, costing $25K each and connected on a
sub-1-megabyte-per-second Ethernet network. For a system of
dedicated workstations, a similar system configured with
F21s would contain 2.5 megabytes of memory and 200 MIPS per
node, be connected on a 10 megabyte per second network, and
cost under $200 per node.
Figure 1.

                                Conventional
                                Workstations    F21 System
   Megabytes per node                64             2.5
   MIPS per node                     50             200
   Cost per node ($)               25,000        under 200
   Network megabytes per second    under 1           10
Just as it is possible to use a cluster of conventional
workstations as a supercomputer, it is possible to do the
same thing with the F21. The F21 was designed as a
tradeoff between minimal cost per node and maximum
performance per node. The price performance advantage of
the F21 over conventional workstations on some applications
is five hundred to one. Where required, an inexpensive DSP
chip could be attached to an F21 node to increase its
floating point performance.
As shown before, one advantage of doing AI in Forth on a
Forth engine compared to doing AI in LISP on a conventional
architecture is several orders of magnitude improvement in
performance. One advantage of using a networked minimalist
Forth engine like the F21 rather than conventional
workstations on parallel problems is several orders of
magnitude reduction in cost. When these advantages are
combined, the result should be a price performance
breakthrough for parallel AI.
Marvin Minsky, the cofounder of the Artificial Intelligence
Laboratory at MIT, refers to agents of the mind. (*5)
Minsky's The Society of Mind describes how the human mind is
a collection of agents that act very much like expert
systems with limited domains of knowledge. Agents know
nothing about other agents to which they are not directly
connected. And all they know about the agents to which they
are connected are the messages sent by them. In one way
these agents function something like expert systems; in
another they function like neural networks, and for the most
part they have no model of the real world as classical AI
programs do. Classical AI programs have relied on various
schemes for knowledge representation and modeling of the
real world, but much attention has been focused in recent
years on the work of another MIT scientist, Rodney Brooks.
(*6) Brooks' subsumption architecture approach demonstrates
that robots can deal with real world problems with little
more than a few fairly primitive reflexes. Because the
subsumption architecture does not require vast intellectual
knowledge representation or vast computing resources it is
considered by many a divergence from classical AI. I see
Minsky's view of mind as the interaction of many
disconnected agents and Brooks' idea that reflexes are more
important than planning and modeling as very similar. It
should be noted that the small insect-like robots that
Brooks created are in fact controlled by parallel
processing. Brooks' somewhat famous robot Attila has six
legs, 60 sensors, 23 motors, and 11 computers. This robot,
though only the size of a shoe box and controlled by very
primitive microcontrollers, has force sensing whiskers, a
gyroscope, a pitch-and-roll sensor, a near-infrared range
finder, and a small camera for machine vision. People who
see Attila are often disturbed by what appears to be a large
living mechanical insect.
AI has been kept in the lab by the excessive costs of LISP
and mainframe computers. And although a market does exist
for commercial expert systems built on this type of platform
it is limited by expense. The subsumption architecture
approach to intelligent machines may make them far more
widespread.
I have written interactive speech conversant programs on
early microcomputers. These programs, though simple, could
carry on a spoken conversation with a human on a very
limited subject matter. I have also written programs on
early microcomputers that combined simple image analysis
routines that functioned like neural networks with a formal
expert system. This program could recognize humans and pets
and trigger various behaviors like dialing a phone and
transmitting an interesting image to me in another town.
Modest programs like this can be easily connected through
expert agents to form a program that exhibits more advanced
behavior.
Minsky argues that one agent people don't seem to have is an
agent to watch the inner workings of the other agents. The
idea of agents evolved only after people understood how
neurons and neural networks work, and began to examine how
consciousness might reside on this sort of architecture. He
likes to say that he thinks humans have evolved about 400
different agents, maybe one more agent every million years.
He loves to explain things like why pain hurts, or why mice
don't like music in terms of the interaction of these
agents.
Computer visionary Alan Kay often states that the age of the desktop metaphor and graphic user interface is passing and the age of intelligent agent interfaces is dawning. He uses the term agent to refer to an entity with some degree of intelligence that will perform some work for a user. Typically these agents appear as talking heads on a monitor and interact with the user like a human servant. Kay argues that with the development of widespread connectivity and distributed data, the desktop metaphor that is so popular today will become useless. You can make sense of the visual image of the icons that represent the files or folders on one disk drive, but when you are connected to the internet a visual representation of all the files available would be of little use to a human. Instead you would like something that could go out and "look" at the millions of files and only show you things of interest to you. To do this these agents need a number of things. Searching millions of files is the easy part; the hard part is to know what is of interest, and to communicate this to the human. I say communicate because what is needed is more than just a report of information. What is desired is something resembling interaction with a human.
For a number of years the production of videos depicting a vision of life with intelligent agent machines has been a billion dollar industry. The first was Apple's presentation of the Knowledge Navigator. Apple, HP, AT&T and other companies clearly plan to produce these types of machines by the turn of the century. PDAs like Apple's Newton are an example of the first step in that direction. Add color, talking heads, a video camera with image recognition, microphone and speaker with interactive conversant software, wireless communications and networking, video phone and video conferencing, and lots of memory and processing power to a Newton and you're there!
The F21 was designed as an agent chip. The reason F21 can digitize a video image, digitize an audio signal, generate video, generate audio, and overlay computer generated video on top of an external video source is so that F21 can generate conversant talking heads on a video monitor. The reason that F21 has a network interface is so that F21 can provide scalable memory and processing power to handle whatever degree of AI is desired. F21 is an ideal processor for expert systems and parallel subsumption architecture, and F21 will have almost all the hardware you need to create the intelligent agents that people expect to see on the computers forecast now for the year 2000.
Multimedia is not a precise term. It means different things to different people. Today it is widely used to refer to a computer system that includes CD-ROM based audio and video in addition to the computer generated video. Computer based training on interactive video is commonly offered on this type of media. F21 should probably be called a multimedia chip because it has video in, video out, mixed video, audio in, and audio out capabilities directly from the chip. What do you call a system with hundreds of video in, video out, audio in, and audio out connections? Parallel multimedia is what I call it.
F21 is leaner than the design of F20 two years ago. Every
part of the chip has been streamlined and simplified. As
MuP21 evolved many things changed in the CPU core. It
gained an extra bit and changed from MuP20 to MuP21. The
extra bit gives you that needed carry bit for 20 bit math,
and a way to have more addressing space. Conditional
branching instructions changed from skips to jumps. The
multiply step instruction changed. SWAP went away. NOP
went away, then it came back. Instruction bits moved
around. Data and address bits changed polarity. Memory
maps changed. Memory busses changed. And the number of
instructions dropped from 33 to 25. All this made the
development environment a difficult moving target to hit.
The P20 software simulator became S21 and simulated both
MuP21 and F21 moving targets. It was always easier to
modify the software simulator than it was for Chuck to make
the changes to the hardware simulation in OKAD. The first
MuP20 eForth had been created with MASM and loaded into the
P20 simulator in 1991. Later that year eForth development
shifted to metacompilation. Metacompilation proved to be a
much better approach for hitting a moving target like MuP21. The
MuP21 design eventually stabilized as the chip became
functional. Larger stacks were added to the MuP21 design,
and the analog processors, network processor and parallel
port that distinguish the F21 were added to the MuP21 core.
After the instruction set stabilized F21 code development
began. The first version of F21 eForth was simple to
implement with the larger on chip stacks. After the first
DTC model was improved and streamlined, an STC model was
coded. The first STC eForth proved to be slower than the
previous DTC version. This was because off page calls
required two cells of memory while threaded addresses would
always fit into one cell. Since off page memory references
are the slowest operations, replacing threaded references
with longer code sequences caused an increase in the size
of the program, and therefore an increase in the number of
off page references. The next version was STC with native
instructions like DUP compiled inline with several NOP
instructions to fill a cell rather than a subroutine call to
the high level DUP word used by the interpreter. The next
version packed these primitives tightly with NOPs only being
used where required for address alignment. Tail recursion
optimization was added to convert a call to a jump rather
than follow it with a subroutine return. Macros were added
to implement inline code for @, !, and SWAP and other words,
and these macros were modified to insert different code
depending on context. F21 eForth source code would compile
down to almost 1K words of code, but would not quite fit on
one 1K page. More code optimizations were added: loop
unrolling to linearize inlined code, and use of the A
register as a fast constant where possible.
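To illustrate the tail recursion optimization described above, here is a sketch in ordinary Forth terms (my example with hypothetical words, not actual X21 output):

   \ tail call optimization illustrated
   : SQUARED  ( n -- n^2 )  DUP * ;
   : FOURTH   ( n -- n^4 )  SQUARED SQUARED ;
   \ naive compilation of FOURTH:  call SQUARED  call SQUARED  return
   \ with the optimization:        call SQUARED  jump SQUARED
   \ the jump lets the return inside SQUARED use FOURTH's own
   \ return address, saving a return and a return stack cell

The same transformation turns a self-recursive tail call into a loop.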
The optimizing cross compiler for F21 is called X21. X21
can produce code for MuP21 as well as F21, but the smaller
stacks on MuP21 require the programmer to manage the
resources very carefully. The very simple benchmark in
figure 2 compiled with the X21 optimizing compiler uses up
most of the stack registers available on MuP21.
The Mentink benchmark was posted to comp.lang.forth on the
internet and people reported results from various systems
and compilers. I tested the simulated MuP21 and F21 and
show the results in figure 4. At the time I posted several
notices about how the benchmark could be very deceptive:
it was so trivial that many optimizing compilers would
actually remove the guts of the program, since it did
nothing. I stated that anyone posting results should also
state any such details where possible. With this in mind,
notice that some of the posted results indicated that
indeed some compilers had removed some of the code.
Bernie Mentink posted the following simple benchmark to the internet:
   DECIMAL
   : INNER  1000 0 DO  34 DROP  LOOP ;
   : BENCH  1000 0 DO  INNER  LOOP  7 EMIT ;
Some people removed the 7 EMIT, simply timed the loops with some timer function built into the Forth they were using, and reported the result. This can make a big difference on a fast processor, since the BELL may take longer than the loops themselves!
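One way to keep an optimizer from deleting the loop body (my suggestion, not part of the original posting) is to make the body observable, for example:

   \ a variant whose loop body cannot legally be removed
   VARIABLE SUM
   : INNER2  1000 0 DO  34 SUM +!  LOOP ;
   : BENCH2  0 SUM !  1000 0 DO  INNER2  LOOP  SUM @ . ;
   \ prints 34000000 (the sum wraps on a 20 bit machine,
   \ but the work is still performed)

A clever compiler can still strength-reduce the addition, but it can no longer discard the result entirely.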
Both the MuP21 and F21 have DRAM and fast SRAM address spaces, so performance on the benchmark is shown for both types of memory. Results are given for various eForth models and, where appropriate, for fforth, a native code optimizing compiler.
The code produced by X21 or fforth is shown in figure 2.
Figure 2.

   address  opcode  label  instruction          comment
   001000   5715C   INNER  n push n push        1000 0 DO
   001001   003E8          1000
   001002   00000          0
   001003   57D58   IN1    n drop n pop         34 DROP
   001004   00022          34
   001005   00001          1
   001006   BE37B          + pop over over      INC loop counter
   001007   A080A          xor T0 IN2           COMPARE
   001008   FF39E          drop push push nop
   00100A   FFFFE   IN2    drop drop drop ;     FINISH
   00100B   5715C   BENCH  n push n push        1000 0 DO
   00100C   003E8          1000
   00100D   00000          0
   00100E   20000   BE1    call INNER
   00100F   563DE          n pop nop nop
   001010   00001          1
   001011   BE37B          + pop over over      INC counter
   001012   A0815          xor T0 BE2           COMPARE
   001013   FF39E          drop push push nop
   001014   00003          jmp BE1              LOOP
   001015   FFFBA   BE2    drop drop drop n     FINISH
   001016   00007          7
   001017   57020          n push ;             long call and return
   001018   03987          lit' EMIT
Figure 3 shows the code with the inner loop unrolled and inlined (**1).
Figure 3.

   address  opcode  label  instruction          comment
   001000   57BDE   INNER  n nop nop nop
   001001   00001          1
   001002   57D5F          n drop n drop        1st and 2nd
   001003   00022
   001004   00022
   001005   57D5F          n drop n drop        3rd and 4th
   0010//   /////                               unrolled inner loop 5th-48th
   001096   57D5F          n drop n drop        49th and 50th
   001097   00022
   001098   00022
   001099   88C02          2* C0 001002         loop 20 times = 1000
   00109A   F8400          drop ;
This INNER executes in 140 microseconds on the MuP21 in SRAM. This gives 140 milliseconds for BENCH. The fastest code is produced by using the A register as a fast constant (**2).
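The idea behind the register optimization, sketched here in comments in the style of figure 3 (a hypothetical layout; the exact opcodes and instruction packing are simplified):

   \ keep the constant 34 in the A register instead of in
   \ literal cells:
   \    34 a!             \ load A once, before the loop
   \    a drop a drop     \ one cell packs two of the 34 DROP pairs
   \ no literal cells are fetched inside the loop, so the
   \ unrolled loop is smaller and runs at nearly full
   \ instruction pack speed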
Figure 4.

   Forth           Version  Memory  Model/Options          MuP21   F21
   MuP21 eForth    2.02     DRAM    DTC                     3.3    2.6
   F21 eForth STC  2.02     DRAM    STC calls                      1.5
   F21 eForth DTC  2.02     DRAM    DTC                            1.2
   F21 eForth STC  2.03     DRAM    STC NATIVE PRIMITIVES          0.7
   F21 eForth STC  2.04     DRAM    STC packed primitives          0.65
   fforth                   DRAM    STC optimized           0.45   0.25
   fforth                   SRAM                            0.3    0.17
   fforth **1               DRAM    STC +INLINED            0.19   0.137
   fforth **1               SRAM    STC +INLINED            0.14   0.077
   fforth **2               DRAM    STC INLINE W/REG        0.027  0.020
   fforth **2               SRAM    STC INLINE W/REG        0.020  0.010

   **1 inlined inner loop optimization.
   **2 inlined inner loop optimization with register optimization.

When the results of the Mentink benchmark that were posted to the internet are combined with the results from the S21 simulations of MuP21 and F21 in DRAM and SRAM and sorted by execution speed, the result is shown in figure 5.
You will note that the slowest entries for the MuP21 and F21 are running MuP21 eForth in DRAM. The speed is similar to a 386 running FPC and about half as fast as a 386 running TCOM compiled code. The code compiled by fforth or X21 is about as fast as the 50 MHz R4000 in the SGI Crimson when running in DRAM, and about as fast as the 100 MHz R4000 when running in SRAM. Inlined, unrolled and register optimized code will run an order of magnitude faster than the SGI Crimson. It is also 300 times faster than the 486 running a Forth in "C" and 260 times faster than the same F21 running the code under MuP21 eForth.
Figure 5.  MENTINK BENCHMARK RESULTS FROM INTERNET SORTED BY SPEED

   PROCESSOR  CLOCK  FORTH-COMPILER           OPTION       SECONDS
   VAX 6620          FIG 8080 (EMULATION)                   15.32
   H8           10   eForth                                 15.0
   68000         7                                           7.6
   80386        33   HS/FORTH                                6.1
   MuP21       100   MuP21 eForth 2.02        DRAM           3.3
   80486        33   F83 in "C"                              3.0
   80386        33   FPC                                     2.8
   68040        25   Yerk                                    2.7
   F21         200   MuP21 eForth 2.02        DRAM           2.6
   80386        33   TCOM (FPC)                              1.7
   80386        33   HS/FORTH w/optimization  no 34 DROP     1.6
   80386        33   TCOM (FPC)               no 7 EMIT      0.99
   HP-PA             Forth in "C"                            0.75
   68030        25                                           0.75
   F21         200   F21 eForth STC 2.03      DRAM           0.7
   R3000        33   RISC pFORTH INDIGO                      0.66
   F21         200   F21 eForth STC 2.04      DRAM           0.65
   MuP21       100   fforth                   DRAM           0.45
   68040        25                                           0.35
   R3000        66   RISC pFORTH INDIGO                      0.33
   MuP21       100   fforth                   SRAM           0.3
   F21         200   fforth                   DRAM           0.25
   R4000        50   RISC pFORTH CRIMSON                     0.24
   80486        33   ForthCMP                 no 34 DROP     0.21
   MuP21       100   fforth **1               DRAM           0.19
   F21         200   fforth                   SRAM           0.17
   MuP21       100   fforth **1               SRAM           0.14
   F21         200   fforth **1               DRAM           0.137
   R4000       100   RISC pFORTH CRIMSON                     0.12
   F21         200   fforth **1               SRAM           0.077
   MuP21       100   fforth **2               DRAM           0.027
   MuP21       100   fforth **2               SRAM           0.02
   F21         200   fforth **2               DRAM           0.02
   F21         200   fforth **2               SRAM           0.01
Forth Linda was developed to provide an extension to Forth
for parallel programming. The advantage of the Linda
approach is that the number of processors need not be known
at compile time, and the load balancing at runtime is
automatic. Linda also provides mechanisms for compiling on
a network of heterogeneous machines. Linda implements a
tuple space where remote execution is done with active
tuples and data is passed in passive tuples. (*7) The F*F
parallelizing optimizing compiler for F21 will use the
active tuple mechanism of Linda for remote execution, but
a simpler mechanism for shared data.
The network interface on the F21 has been reduced to just
two functions: broadcasting an interrupt or a data
transfer to one or more nodes on the network. With the
addition of some network support code, these two functions
map directly to the parallelizing words in F*F.
Since the network interface on the F21 has a memory
broadcast capability, a single broadcast on the F21 network
interface can write to the memory of a single node, a group
of nodes, or all the nodes on the network. So when global
variables or data structures are updated it is done in an
atomic manner on the network. Fetches from global variables
and data structures are just local memory fetches so they
are very fast. F*F uses the words G! and G!! to perform
global stores to global data structures. The F*F word RX(
queues up the words enclosed in parentheses for remote
execution on a different processor.
Some primitive network boot and control words will be
provided in the system, but the three words RX(, G!, and G!!
are all that are needed by an F*F programmer to convert
Forth programs to run in parallel. F*F does not make the
global data structures or parallelizing transparent.
Instead it tries to make the words that specify parallel
execution as simple and direct as possible.
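A hypothetical F*F fragment may make this concrete. The words WORK and GO and the variable RESULT are my own inventions, and I assume that a store takes ( value addr ) like ! and that G!! broadcasts to all nodes while G! stores to one node or a group:

   \ run WORK on some remote node, publish its result everywhere
   VARIABLE RESULT                  \ present in every node's memory map
   : WORK  ( -- )  42 RESULT G!! ;  \ compute, then broadcast the store
   : GO    ( -- )  RX( WORK ) ;     \ queue WORK for remote execution
   \ afterwards RESULT @ on any node is just a fast local fetch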
Copyright 1993 Jeff Fox
Call, write, or email for information on MuP21, F21, F*F and
other products.
1. Bell, Gordon. "Tracking the Teraflops" Distributed
Computing for Aerosciences Applications Workshop at NASA
Ames Research Center October 1993
2. Koopman, P. "Stack Computers, the New Wave" Ellis
Horwood 1989
3. Trelease, R. "The Forth Wave in AI." AI Expert (October
1987)
4. Albus, J. "The Cerebellum: A Substrate for List-
Processing in the Brain" in "Cybernetics, Artificial
Intelligence, and Ecology" Spartan Books 1972
5. Minsky, M. "The Society of Mind" Simon & Schuster, Inc.
1985
6. Jones, J. and Flynn, A. "Mobile Robots" A K Peters,
Wellesley, Massachusetts 1993
7. Fox, J. "Debugging eForth-Linda on a Simulated Network
of Forth Computers" "1991 FORML Conference Proceedings",
Forth Interest Group 1992
OTHER SUGGESTED REFERENCES
Minsky, M. and Papert, S. "Perceptrons" MIT Press 1969
Bjornson, R., Carriero, N. and Gelernter, D. "The
Implementation and Performance of Hypercube Linda" 1988
Carriero, N. and Gelernter, D. "How to Write Parallel
Programs: A Guide to the Perplexed" Yale University
Department of Computer Science RR-628, May 1988
Gelernter, D. "Getting the Job Done" Byte November 1988
Mattson, T. "An Introduction to Linda" Scientific
Computing Associates, Inc. 1989
Hendtlass, T. "Embedded Node Collectives" "1991 FORML
Conference Proceedings", Forth Interest Group 1992