Gaduate Institute of Electonics Engineeing, NTU Systolic and Pipelined Why Systolic Pocessos Achitectue? Souces: VLSI Signal Pocessing ÜÜ Ande Hon, Jason Handube, Michelle Gunning, Beman, J. Kim, Heiko Schoede, Hai Tao, Shaaban ACCESS IC LAB
Systolic Pocessos Discussed on Monday, May 26 1. Achitectue fo Matix Matix multiplication 2. Achitectue fo Matix Vecto Multiplication 3. Convolution, one-dimensional and polynomial multiplication 4. Pipelined Systolic matche fo many gene pattens in a long chomosome. Count 100%, 75% and 50 % matches. 5. Systolic pocesso that solves Petick Function, SAT, and SAT with minimal liteal cost poblems. (not shown hee, if you ae inteested, ask Newton fo the solution).
Methodology of designing systolic pocessos. 1. Wite full equation fo esults. 2. Skew the input flows o put inputs to pocessos. 3. The inputs may come fom two diffeent diections, these diections may be opposite 4. Sometimes the pocessos must be sepaated by othe types of pocessos o just delays (do-nothing egistes clocked by geneal clock). Howeve, sometimes skewing input data is sufficient fo this. 5. Daw the stuctue of pocessos. 6. Design data path of each pocesso. 7.Veify the coect opeation. Daw snapshots, use colo pencils. 8. Design the contolle of each pocesso o a SIMD global contolle. Remembe about synchonization. Each pocesso must complete its wok in the exactly same moment (clock pulse). It is somehow simila to Sieve of Eatostenes but has inteesting ealization of logic opeatos and use of andom numbe geneato..
Pipelined Computations Pipelined pogam divided into a seies of tasks that have to be completed one afte the othe. Each task executed by a sepaate pipeline stage Data steamed fom stage to stage to fom computation f, e, d, c, b, a P1 P2 P3 P4 P5
Pipelined Computations Computation consists of data steaming though pipeline stages Execution Time = Time to fill pipeline (P-1) P = # of pocessos N = # of data items (assume P < N) + Time to un in steady state (N-P+1) + Time to empty pipeline (P-1) f, e, d, c, b, a P1 P2 P3 P4 P5 P5 P4 P3 P2 P1 a b a b c a b c d a b c d e a b c d e f c d e f d e f e f f time
Pipelined Example: Sieve of Eatosthenes Goal is to take a list of integes geate than 1 and poduce a list of pimes E.g. Fo input 2 3 4 5 6 7 8 9 10, output is 2 3 5 7 Fan s pipelined appoach (a little diffeent than the book): Pocesso P_i divides each input by the ith pime If the input is divisible (and not equal to the diviso), it is maked (with a negative sign) and fowaded If the input is not divisible, it is fowaded Last pocesso only fowads unmaked (positive) data [pimes]
Sieve of Eatosthenes Pseudo-Code Code fo pocesso Pi (and pime p_i): x=ecv(data,p_(i-1)) If (x>0) then If (p_i divides x and p_i = x ) then send(-x,p_(i+1) If (p_i does not divide x o p_i = x) then send(x, P_(i+1)) Else Send(x,P_(i+1)) / Code fo last pocesso x=ecv(data,p_(i-1)) If x>0 then send(x,output) P2 P3 P5 P7 out
Simple Sting Matching Steam text past set of matches =c1? =c2? =c3? & & &
Matix Vecto Multiplication Conside multiplying a 3x2 X 2x1 matix:
Systolic Aays
Y values goes left, X values go ight, A values fan in T0 T1 T2 T3 T4 T5 T6 T7
The pocesso 1. The pocesso was designed on Monday by students. 2. Ask fo thei notes o design you own. 3. You need to emembe data path design and egiste tansfe design, as well as FSM synthesis, in geneal.
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Evey cell is like this: ai bi egiste * clea + egiste Clock to all thee egistes egiste 1. Obseve multiplie/adde combination, so typical 2. Obseve accumulation of esults in-place. Evey cell of final matix is a pocesso. Results emain in these pocessos, next they should be shifted-out, if necessay.
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Pocessos aanged in a 2-D gid Each pocesso accumulates one element of the poduct Alignments in time b2,2 b2,1 b1,2 b2,0 b1,1 b0,2 b1,0 b0,1 b0,0 Columns of B Rows of A a0,2 a0,1 a0,0 a1,2 a1,1 a1,0 a2,2 a2,1 a2,0 T = 0 Example souce: http://www.cs.hmc.edu/couses/2001/sping/cs156/
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Pocessos aanged in a 2-D gid Each pocesso accumulates one element of the poduct Alignments in time b2,2 b2,1 b1,2 b2,0 b1,1 b0,2 b1,0 b0,1 b0,0 a0,2 a0,1 a0,0 a0,0*b0,0 a1,2 a1,1 a1,0 a2,2 a2,1 a2,0 T = 1 Example souce: http://www.cs.hmc.edu/couses/2001/sping/cs156/
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Pocessos aanged in a 2-D gid Each pocesso accumulates one element of the poduct Alignments in time b2,2 b2,1 b1,2 b2,0 b1,1 b0,2 b1,0 b0,1 a0,2 a0,1 a0,0*b0,0 + a0,1*b1,0 a0,0 a0,0*b0,1 b0,0 a1,2 a1,1 a1,0 a1,0*b0,0 a2,2 a2,1 a2,0 T = 2 Example souce: http://www.cs.hmc.edu/couses/2001/sping/cs156/
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Pocessos aanged in a 2-D gid Each pocesso accumulates one element of the poduct Alignments in time b2,2 b2,1 b1,2 b2,0 b1,1 b0,2 a0,2 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,1 a0,0*b0,1 + a0,1*b1,1 a0,0 a0,0*b0,2 b1,0 b0,1 a1,2 a1,1 a1,0*b0,0 + a1,1*b1,0 a1,0 a1,0*b0,1 b0,0 a2,2 a2,1 a2,0 a2,0*b0,0 T = 3 Example souce: http://www.cs.hmc.edu/couses/2001/sping/cs156/
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Pocessos aanged in a 2-D gid Each pocesso accumulates one element of the poduct Alignments in time b2,2 b2,1 b1,2 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,2 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a0,1 a0,0*b0,2 + a0,1*b1,2 b2,0 b1,1 b0,2 a1,2 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,1 a1,0*b0,1 +a1,1*b1,1 a1,0 a1,0*b0,2 a2,2 a2,1 b1,0 a2,0*b0,0 + a2,1*b1,0 b0,1 a2,0 a2,0*b0,1 T = 4 Example souce: http://www.cs.hmc.edu/couses/2001/sping/cs156/
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Pocessos aanged in a 2-D gid Each pocesso accumulates one element of the poduct Alignments in time b2,2 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a0,2 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 b2,1 b1,2 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,2 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 a1,1 a1,0*b0,2 + a1,1*b1,2 a2,2 b2,0 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 b1,1 a2,1 a2,0*b0,1 + a2,1*b1,1 b0,2 a2,0 a2,0*b0,2 T = 5 Example souce: http://www.cs.hmc.edu/couses/2001/sping/cs156/
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Pocessos aanged in a 2-D gid Each pocesso accumulates one element of the poduct Alignments in time a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 b2,2 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 a1,2 a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 b2,1 a2,2 a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1 b1,2 a2,1 a2,0*b0,2 + a2,1*b1,2 T = 6 Example souce: http://www.cs.hmc.edu/couses/2001/sping/cs156/
Systolic Aay Example: 3x3 Systolic Aay Matix Multiplication Pocessos aanged in a 2-D gid Each pocesso accumulates one element of the poduct Alignments in time a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 Done a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2 b2,2 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1 a2,2 a2,0*b0,2 + a2,1*b1,2 + a2,2*b2,2 T = 7 Example souce: http://www.cs.hmc.edu/couses/2001/sping/cs156/
Hexagonal Systolic Aay fo matix-matix multiplication This is an inteesting modification of pevious design in which movement is in thee diections and the esults ae not stoed but flow out It is you task to design each pocesso in detail
Example Systolic Algoithm: Matix Multiplication Poblem: multiply two nxn matices A ={a_ij} and B={b_ij}. Poduct matix will be R={_ij}. Systolic solution uses 2D aay with NxN cells, 2 input steams and 2 output steams
Systolic Matix Multiplication a44 a34 a24 a14 == == == b41 b42 b43 b44 a43 a33 a23 a13 == == b31 b32 b33 b34 a42 a32 a22 a12 === b21 b22 b23 b24 a41 a31 a21 a11 -- -- -- P11 b11 b12 b13 b14 -- -- -- -- -- P21 P12 -- -- P31 P22 P13 -- -- P41 P32 P23 P14 P42 P33 P24 P43 P34 P44
k, Opeation at each cell Each cell updates at each time step as shown i j below 0 i, j initialized to 0 a i, k b k, j k 1 i, j k i, j + a = i, k b k, j b k, j a i, k
Data Flow fo Systolic MM Beat 1 Beat 2 a11 b11 1 1,1 a 21 a 12 1 2,1 2 1,1 b 21 1 1,2 b 12
Data Flow fo Systolic MM Beat 3 Beat 4 a 31 a 22 1 3,1 a 13 2 2,1 3 1,1 1 2,2 b 31 2 1,2 b 22 1 1,3 b 13 a 41 a 32 1 4,1 a 23 2 3,1 a 14 3 2,1 1 3, 2 4 1,1 2 2,2 b 41 3 1,2 1 2,3 b 32 2 1,3 b 23 1 1,4 b 14
Data Flow fo Systolic MM Beat 5 Beat 6 a 42 a 33 2 4,1 a 24 3 3,1 4 2,1 2 3,2 1,1 3 2,2 4 1,2 2 2,3 b 42 3 1,3 b 33 2 1,4 b 24 a 43 a 34 3 4,1 4 3,1 2,1 1,1 1, 2 3 3,2 4 2,2 3 2,3 4 1,3 b 43 3 1,4 b 34 1 4,2 1 3,3 1 2,4 2 4,2 1 4,3 2 3,3 1 3,4 2 2,4
Data Flow fo Systolic MM Beat 7 Beat 8 1,1 1,1 1, 2 2,1 1, 2 2,1 a 44 4 4,1 3,1 2, 2 1, 3 4 4 3, 2 2,3 3 4,2 2 4,3 3 3,3 2 3,4 3 2,4 4 1,4 b 44 3,1 2, 2 1, 3 4,1 3, 2 4 4,2 2, 3 3 4,3 4 3,3 3 3,4 4 2,4 1,4 1 4,4 2 4,4
Data Flow fo Systolic MM Beat 9 Beats 10 and 11 1,1 1, 2 2,1 2,1 1,1 1, 2 3,1 2, 2 1, 3 4,1 3, 2 2, 3 1,4 3,1 2, 2 1, 3 4,2 3, 3 2, 4 4,1 3, 2 2, 3 4,2 3, 3 2, 4 4 4 4,3 3,4 3 4,4 1,4 2,1 1,1 1, 2 3,1 2, 2 1, 3 2 4,3 3, 4 4 4,4 4,1 2, 3 3, 1,4 4,2 3, 3 2, 4 4,3 3, 4 4,4
Gaduate Institute of Electonics Engineeing, NTU 1. Use this method fo 1D polynomial multiplication 2. Genealize to two dimensions. ÜÜ ACCESS IC LAB
Design B1 Peviously poposed fo cicuits to implement a patten matching pocesso and = fo a cicuit to implement polynomial multiplication. - Boadcast input, move esults, weights stay - (Semi-systolic convolution aays with global data communication
Design B2 = = The path fo moving y i s is wide then w i s because of y i s cay moe bits then w i s in numeical accuacy. The use of multiplie-accumulatos may also help incease pecision of the esult, since exta bit can be kept in these accumulatos with modest cost. Boadcast input, move weights, esults stay [(Semi-) systolic convolution aays with global data communication]
Design F When numbe of cell is lage, the adde can be implemented as a pipelined adde tee to avoid lage delay. Design of this type using unbounded fan-in. - Fan-in esults, move inputs, weights stay - Semi-systolic convolution aays with global data communication
Design R1 Design R1 has the advantage that it dose not equie a bus, o any othe global netwok, fo collecting output fom cells. The basic ideal of this de-sign has been used to implement a patten matching chip. - Results stay, inputs and weights move in opposite diections - Pue-systolic convolution aays with global data communication
Design R2 Multiplie-accumulato can be used effectively and so can tag bit method to signal the output of each cell. Compaed with R1, all cells wok all the time when additional egiste in each cell to hold a w value. - Results stay, inputs and weights move in the same diection but at diffeent speeds - Pue-systolic convolution aays with global data communication
Design W1 This design is fundamental in the sense that it can be natually extend to pefom ecusive filteing. This design suffes the same dawback as R1, only appoximately 1/2 cells wok at any given time unless two independent computation ae inteleaved in the same aay. -Weights stay, inputs and esults move in opposite diection - Pue-systolic convolution aays with global data communication
Design W2 This design lose one advan-tage of W1, the constant esponse time. This design has been extended to implement 2-D convolution, whee high thoughputs athe than fast esponse ae of concen. -Weights stay, inputs and esults move in the same diection but at diffeent speeds - Pue-systolic convolution aays with global data communication
Remaks Above designs ae all possible systolic designs fo the convolution poblem. Using a systolic contol path, weight can be selected onthe-fly to implement intepolation o adaptive filteing. We need to undestand pecisely the stengths and dawbacks of each design so that an appopiate design can be selected fo a given envionment. Fo impoving thoughput, it may be wothwhile to implement multiplie and adde sepaately to allow ovelapping of thei execution. (Such as next page show) When chip pin is consideed, pue-systolic equies fou; semi-systolic equies thee I/O pots.
Ovelapping the executions of multiply-and-add in design W1
Citeia and advantages The design makes multiple use of each input data item Because of this popety, systolic systems can achieve high thoughputs with modest I/O bandwidths fo outside communication. The design uses extensive concuency Concuency can be obtained by pipelining the stages involved in the computation of each single esult, by multipocessing many esults in paallel, o by both.
Citeia and advantages Thee ae only a few types of simple cells To achieve pefomance goals, a systolic system is likely to use a lage numbe of cells which must be simple and of only a few types to cutail design and implementation cost. Data and contol flow ae simple and egula Pue systolic system totally avoid long-distance o iegula wies fo data communication.
Othogonal Tiangulaization.. On-the-fly least-squaes solutions using one and two dimensional systolic aay, with p=4.
paallel mege initial situation: x1 x2 x3 x4 x5 x6 x7...... x17 x18 y1 y2 y3 y4 y5 y6 y7...... y17 y18 1.) sot columns (odd-even-tansposition sot) 2.) sot ows (odd-even-tansposition sot) soted!!!!
0-1 pinciple The 0-1 pinciple states that if all sequences of 0 and 1 ae soted popely than this is a coect sote. The sote must be based on moving data. 0s 1s 0s 0s initially afte vetical sot 1s 0s afte hoizontal sot 1s
MIMD-mesh (clocked) min max Time: 2n
systolic mege 1 3 3 4 5 5 6 7 9 8 8 7 4 4 3 2
systolic mege 1 3 3 4 5 5 6 7 9 8 8 7 4 4 3 2
systolic mege 1 3 3 4 5 5 6 7 9 8 8 7 4 4 3 2
systolic mege 1 3 3 4 5 5 6 7 9 8 8 7 4 4 3 2
systolic mege 1 3 3 4 5 5 6 7 94 84 83 72 49 48 38 27
systolic mege 1 3 3 4 5 5 6 7 4 4 3 2 9 8 8 7
systolic mege 1 3 3 4 4 4 3 2 5 5 6 7 9 8 8 7
systolic mege 1 3 3 4 4 4 3 2 5 5 6 7 9 8 8 7
systolic mege 1 3 3 2 4 4 3 4 5 5 6 7 9 8 8 7
systolic mege 1 3 3 2 4 4 3 4 5 5 6 7 9 8 8 7
systolic mege 1 3 3 2 4 4 3 4 5 5 6 7 9 8 8 7
systolic mege 1 3 2 3 4 3 4 4 5 5 6 7 9 8 8 7
systolic mege 1 3 2 3 4 3 4 4 5 5 6 7 9 8 8 7
systolic mege 1 2 3 3 3 4 4 4 5 5 6 7 9 8 8 7
systolic mege 1 2 3 3 3 4 4 4 5 5 6 7 9 8 8 7
systolic mege 1 2 3 3 3 4 4 4 5 5 6 7 9 8 8 7
systolic mege 1 2 3 3 3 4 4 4 5 5 6 7 8 9 7 8
systolic mege 1 2 3 3 3 4 4 4 5 5 6 7 8 9 7 8
systolic mege 1 2 3 3 3 4 4 4 5 5 6 7 8 7 9 8
systolic mege 1 2 3 3 3 4 4 4 5 5 6 7 8 7 9 8
systolic mege systolic mege 1 2 3 3 3 4 4 4 5 5 6 7 7 8 8 9 soted!!!
Poblems to think about 1. Compae these designs with sotes shown ealie in class 2. Use these ideas to edesign sotes/absobes and sotes that emove epeated elements. 3. Think about othe applications of the ideas shown hee fo sot and mege.
Applications of Systolic Aay - Signal and image pocessing: FTR, IIR filteing, and 1-D convolution. 2-D convolution and coelation. Discete Fuie tansfom Intepolation 1-D and 2-D median filteing Geometic waping
Applications of Systolic Aay - Matix aithmetic: Matix-vecto multiplication Matix-matix multiplication Matix tiangulaization (solution of linea systems, matix invesion) QR decomposition (eigenvalue, least-squae computation) Solution of tiangula linea systems
Applications of Systolic Aay - Non-numeic applications: Data stuctue Gaph algoithm Language ecognition Dynamic pogamming Encode (polynomial division) Relational data-base opeations