
1434 IEICE TRANS. ELECTRON., VOL.E95–C, NO.8 AUGUST 2012

PAPER

Design of High-Performance Asynchronous Pipeline Using Synchronizing Logic Gates

Zhengfan XIA†a), Nonmember, Shota ISHIHARA†, Student Member, Masanori HARIYAMA†, Member, and Michitaka KAMEYAMA†, Fellow

SUMMARY This paper introduces a novel design method for an asynchronous pipeline based on dual-rail dynamic logic. The overhead of the handshake control logic is greatly reduced by constructing a reliable critical datapath, which offers the pipeline high throughput as well as low power consumption. Synchronizing Logic Gates (SLGs), which have no data dependency problem, are used to construct the reliable critical datapath. The design targets latch-free and extremely fine-grain, or gate-level, pipelines, where the depth of every pipeline stage is only one dual-rail dynamic logic gate. HSPICE simulation results in a 65 nm design technology indicate that the proposed design increases the throughput by 120% and decreases the power consumption by 54% compared with PS0, a classic dual-rail asynchronous pipeline implementation style, in 4-bit wide FIFOs. Moreover, the method is applied to the design of an array-style multiplier. The results show that the proposed design reduces power by 37.9% compared to a classic synchronous design when the workload is 55%. A chip has been fabricated with a 4 × 4 multiplier function, which works well at 2.16 G data-set/s (post-layout simulation).
key words: asynchronous pipeline, dual-rail, critical datapath

1. Introduction

Nowadays, clock design has become an obstacle to high-performance VLSI systems. With technology scaling, clock distribution remains an important challenge given the requirements of high speed and low power in VLSI system design [1]. Asynchronous design, which replaces the externally supplied global clock with local handshakes, has the potential to make high-performance design more feasible. Owing to the local handshake, asynchronous systems avoid issues related to clock distribution, such as large clock power and the difficult management of clock skew. Moreover, the operating speed of an asynchronous system is determined by actual local latencies rather than the global worst-case latency, which yields a higher operating speed and better elasticity.

Asynchronous design has recently attracted industrial interest due to its potential for high speed, low power, low electromagnetic interference, and a natural match with heterogeneous system timing. Papers such as [2] have explored the applications of asynchronous technology. Because asynchronous designs can interface with varied environments operating at different rates, they are useful for the design of Systems on Chip (SoC) [3], and are also suitable for low-power design methods such as adaptive voltage scaling [4], [5]. Many asynchronous products and experimental chips have also been announced. In particular, Philips has developed a fully asynchronous 80C51 microcontroller which shows much better speed, power consumption and electromagnetic noise emission than the synchronous design [6]. Intel has developed an asynchronous Pentium instruction length decoder which exhibits three times higher throughput and only half of the power consumption of a corresponding clocked implementation [7].

Manuscript received December 22, 2011.
Manuscript revised March 14, 2012.
†The authors are with the Graduate School of Information Sciences, Tohoku University, Sendai-shi, 980-8579 Japan.
a) E-mail: [email protected]
DOI: 10.1587/transele.E95.C.1434

Although asynchronous design has many attractive properties for VLSI systems, it also has some drawbacks. One of the problems is that the handshake control logic that implements the handshaking normally has a large overhead [8]. It not only consumes silicon area but also affects the circuit speed and power consumption. The design method for asynchronous pipelines introduced in this paper is intended to solve problems related to the handshake overhead. Other problems, such as the lack of EDA tools (including synthesis and test pattern generation) for asynchronous design, are not addressed in this paper.

This paper introduces a novel design method for asynchronous pipelines based on dual-rail dynamic logic (hereafter referred to as a “dual-rail dynamic asynchronous pipeline”). The design focuses on making asynchronous pipelines more practical for a wide range of applications. To obtain high throughput, the design targets extremely fine-grain, or gate-level, pipelining, where the depth of every pipeline stage is only one dual-rail dynamic logic gate. Moreover, the handshake control logic is greatly simplified by a reliable critical datapath, which is constructed using Synchronizing Logic Gates (SLGs) [14]. The reduced handshake overhead not only increases the throughput but also decreases the power consumption. A further feature of the proposed design is that explicit latches or registers are not used. The dynamic logic gates themselves provide an implicit latch function through careful sequencing of the handshake control. The removal of explicit latches or registers provides the benefits of smaller forward latency, smaller silicon area, and lower power consumption. Based on these features, we name the proposed asynchronous pipeline the Asynchronous Pipeline based on a Constructed Critical Datapath (abbreviated APCCD).

The paper is organized as follows. Section 2 introduces previous works. A detailed background on PS0, a classic dual-rail dynamic asynchronous pipeline implementation style, is given, followed by a recent improved approach, LP2/2. Section 3 focuses on the design of APCCD. SLGs and their extension, Synchronizing Logic Gates with a Latch function (SLGLs), are introduced to construct the reliable critical datapath. A simple analysis of the robustness of the structure is also provided. Then, the extension to more complex structures is discussed. Section 4 presents the evaluation results, which show the benefits of the proposed design compared with conventional asynchronous design and with classic synchronous design. Moreover, a chip has been fabricated in 65 nm technology with a 4 × 4 multiplier function. Section 5 presents conclusions.

Copyright © 2012 The Institute of Electronics, Information and Communication Engineers

2. Previous Works

The classic work on dual-rail dynamic asynchronous pipeline design was done by Williams [9], who introduced several implementation styles. The PS0 pipeline is a classic implementation style which has optimized handshake control logic and no explicit latches or registers between pipeline stages. It is commonly introduced as the basis of dual-rail dynamic asynchronous pipeline design [10], [11]. In this paper, the PS0 pipeline is also used as a basis to explain the principle and problems of dual-rail dynamic asynchronous pipeline design.

Lookahead pipelines [12] are more recently proposed dual-rail dynamic asynchronous pipelines, which are also chosen for comparison with our design. The lookahead style is an alternative implementation style based on Williams’ work. Using new structures and protocols, it greatly improves on the throughput of the PS0 pipeline.

2.1 Williams’ PS0 Pipeline

2.1.1 4-Phase Dual-rail Encoding

4-phase dual-rail encoding is used in dual-rail dynamic asynchronous pipeline design. Table 1 shows the code table of the 4-phase dual-rail encoding. The encoding maps one bit onto two wires, (w_t, w_f). The data value 0 is encoded as (0, 1) and 1 is encoded as (1, 0); the spacer is encoded as (0, 0); (1, 1) is not used. Figure 1 shows an example of the 4-phase dual-rail encoding. During data transfer, consecutive data values are always separated by a spacer.

Table 1 Code table of 4-phase dual-rail encoding.

Fig. 1 An example of 4-phase dual-rail encoding.
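As a minimal sketch of the code table described above, the following Python fragment encodes a bit sequence on one dual-rail lane and decodes it again. The function names `encode_stream` and `decode_stream` are hypothetical and chosen only for illustration; the sketch models the code words, not the electrical handshake.

```python
# 4-phase dual-rail code from the text: each bit uses two wires (w_t, w_f).
SPACER = (0, 0)
ENCODE = {0: (0, 1), 1: (1, 0)}
DECODE = {(0, 1): 0, (1, 0): 1}


def encode_stream(bits):
    """Return the wire-pair sequence for one bit lane, with a spacer after each data value."""
    stream = []
    for bit in bits:
        stream.append(ENCODE[bit])  # valid data phase
        stream.append(SPACER)       # return-to-zero (spacer) phase
    return stream


def decode_stream(stream):
    """Recover the bits; spacers are skipped and (1, 1) is rejected as unused."""
    bits = []
    for pair in stream:
        if pair == SPACER:
            continue
        if pair not in DECODE:
            raise ValueError(f"illegal dual-rail code word: {pair}")
        bits.append(DECODE[pair])
    return bits


if __name__ == "__main__":
    wires = encode_stream([1, 0, 1])
    print(wires)
    print(decode_stream(wires))
```

Rejecting (1, 1) explicitly mirrors the statement that this code word is not used.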

2.1.2 Structure of PS0 Pipeline

Figure 3 shows a block diagram of the PS0 pipeline. In the PS0 pipeline, each pipeline stage is composed of a function block and a completion detector. Each function block is implemented using dual-rail dynamic logic. Each completion detector generates a separate local handshake signal to control the flow of data through its stage. The precharge/evaluation control input, pc, of each stage is connected to the output of the subsequent stage’s completion detector.

Figure 2 shows an example of dual-rail logic and a completion detector. Figure 2(a) is a dual-rail dynamic logic gate, namely a dual-rail AND gate. Figure 2(b) is a completion detector. In dual-rail encoding, a bit-done signal can be generated by a 2-input static NOR gate. If the completion detector monitors a 2-bit dual-rail datapath, a C-element is needed to generate the total done signal. By extension, to detect the entire n-bit datapath width of a pipeline, a tree of C-elements with n inputs is needed to form a full completion detector, as shown in Fig. 3.

The problem is that the delay of the completion detector tree grows logarithmically with the width of the datapath [9]. As the datapath becomes wider, the completion detection time becomes larger and larger, which greatly degrades the performance of the pipeline in terms of both throughput and power consumption.
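The completion detector described above can be sketched behaviorally as follows. This is an illustrative model only: the signal polarity (NOR output 1 for a spacer), the use of 2-input C-elements, and the names `bit_done_nor`, `CElement`, `detect` and `c_tree_levels` are assumptions of this sketch, not details taken from the circuit in Fig. 2.

```python
import math


def bit_done_nor(w_t, w_f):
    """2-input static NOR on one dual-rail bit: 1 for a spacer, 0 for valid data."""
    return int(not (w_t or w_f))


class CElement:
    """Muller C-element: output switches only when all inputs agree, else it holds."""

    def __init__(self, state=1):
        self.state = state  # 1 = every bit is a spacer (stage precharged)

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.state = 1
        elif all(v == 0 for v in inputs):
            self.state = 0
        return self.state


def detect(c_element, word):
    """Behavioral full completion detector for a word of (w_t, w_f) pairs."""
    return c_element.update([bit_done_nor(t, f) for t, f in word])


def c_tree_levels(n_bits, fan_in=2):
    """Levels of C-elements needed to merge n_bits bit-done signals (assumed fan-in)."""
    levels = 0
    signals = n_bits
    while signals > 1:
        signals = math.ceil(signals / fan_in)
        levels += 1
    return levels


if __name__ == "__main__":
    for width in (4, 16, 62):
        print(width, "bits ->", c_tree_levels(width), "C-element levels")
```

The level count grows with the logarithm of the datapath width, which is the source of the detection delay discussed in the text.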

Fig. 2 Dual-rail logic and completion detector. (a) A dual-rail AND gate. (b) A completion detector.

Fig. 3 Block diagram of PS0 pipeline.


2.1.3 Protocol of PS0 Pipeline

The protocol of the PS0 pipeline is quite simple. F(N) is precharged when F(N + 1) finishes evaluation. F(N) evaluates when F(N + 1) finishes its reset, or precharge. In Fig. 3, if we observe a single data item flowing through an initially empty pipeline in which every pipeline stage is in the evaluation phase, the complete cycle of events is as follows:

• F1 evaluates and data flows to F2.
• F2 evaluates and data flows to F3. F2’s completion detector detects completion of evaluation and sends a precharge signal to F1.
• F1 precharges and F3 evaluates. F3’s completion detector detects completion of evaluation and sends a precharge signal to F2.
• F2 precharges. F2’s completion detector detects the completion of precharge and sends an evaluation signal (enable signal) to F1. The evaluation signal enables F1 to evaluate new data once again.

There are 3 evaluations, 2 completion detections and 1 precharge in the complete cycle for a pipeline stage. The pipeline cycle time $T_{cycle}$ is:

$T_{cycle} = 3t_{Eval} + 2t_{CD} + t_{Prech}$ (1)

where $t_{Eval}$ and $t_{Prech}$ are the evaluation and precharge times for each stage, and $t_{CD}$ is the delay through each completion detector.

2.2 Singh’s LP2/2 Pipeline

The LP2/2 pipeline is one implementation style of lookahead pipelines, proposed by Singh. LP2/2 has the same function blocks as PS0 but a different structure and protocol, which are shown in Fig. 4. The main idea of lookahead pipelines is that an evaluation signal is generated early and sent to the previous pipeline stage.

Figure 4 shows a block diagram of an LP2/2 pipeline. Unlike in PS0, the completion detectors are placed ahead of the function blocks. In this way, the completion detection and the evaluation of the function block become concurrent events, which saves handshake time. Moreover, an asymmetric completion detector is employed, in which the set of the C-element requires full detection of all dual-rail signals, but the reset of the C-element is directly controlled by the pc signal. The completion detector is reset when the pc signal becomes low, which quickly generates an early evaluation signal for the previous stage without waiting for the completion of the precharge process in the current pipeline stage.

Fig. 4 Block diagram of a LP2/2 pipeline.

A complete cycle of events of the LP2/2 pipeline has 2 evaluations, 1 completion detection and 1 completion detector reset for a pipeline stage [12]. The pipeline cycle time $T_{cycle}$ is given by:

$T_{cycle} = 2t_{Eval} + t_{CD} + t_{CDreset}$ (2)

where $t_{CDreset}$ is the reset time of a completion detector.
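To make Eqs. (1) and (2) concrete, the following sketch evaluates both cycle times for one set of delays. The numeric delays are hypothetical values in arbitrary units, chosen only for illustration, and are not measurements from this paper.

```python
def ps0_cycle_time(t_eval, t_cd, t_prech):
    """Eq. (1): 3 evaluations, 2 completion detections and 1 precharge."""
    return 3 * t_eval + 2 * t_cd + t_prech


def lp22_cycle_time(t_eval, t_cd, t_cd_reset):
    """Eq. (2): 2 evaluations, 1 completion detection and 1 detector reset."""
    return 2 * t_eval + t_cd + t_cd_reset


if __name__ == "__main__":
    # Hypothetical delays (arbitrary units), not taken from the evaluation tables.
    t_eval, t_cd, t_prech, t_cd_reset = 1, 3, 1, 1
    ps0 = ps0_cycle_time(t_eval, t_cd, t_prech)
    lp22 = lp22_cycle_time(t_eval, t_cd, t_cd_reset)
    print("PS0 cycle:", ps0, "throughput:", 1 / ps0)
    print("LP2/2 cycle:", lp22, "throughput:", 1 / lp22)
```

Because the completion detection delay appears twice in Eq. (1), any reduction of that delay has a larger effect on PS0 than on LP2/2.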

3. Asynchronous Pipeline Based on a Constructed Critical Datapath

3.1 Overview of New Design

Conventional dual-rail dynamic asynchronous pipeline designs, such as PS0 and LP2/2, need a full completion detector that observes the entire datapath in order to generate a total done signal. The full completion detector makes these designs delay-insensitive, which is useful for field-programmable VLSI design because of the considerable delay on the long routing wires between LUTs [13]. For custom circuit design, which has much shorter routing wires, it is unnecessary and causes a large overhead. The full completion detector can be simplified by using just one of the bit-done signals as the total done signal. This requires the assumption that none of the other bits, including the wire delay across the entire datapath, is slower than the observed bit by more than the delay through a static NOR gate and the drive buffer chain following it in the handshake control logic. In this paper, we propose a method to guarantee this assumption so as to reduce the overhead of the handshake control logic.

The above-mentioned assumption seems very reasonable for bit-parallel datapaths because the handshake control logic takes a long time to drive any signal in response to the completion detector. However, once the data dependency problem of conventional dual-rail dynamic logic is considered, namely that gate delay depends on the input data pattern, incorrect operation could occur. The critical signal transition across the entire datapath of a pipeline stage varies from one gate to another according to the data pattern. If the observed bit has the fastest signal transition, the assumption might not be satisfied. Moreover, the difference between the fastest and the slowest transitions grows as data flows through the circuit, because there are no explicit latches or registers between pipeline stages to contain it. It becomes increasingly difficult to guarantee that the observed bit satisfies the assumption in subsequent pipeline stages.

Figure 5 shows a block diagram of our proposed design, the Asynchronous Pipeline based on a Constructed Critical Datapath (APCCD). The solid arrow represents a constructed critical datapath in the pipeline. A static NOR gate is connected to the critical datapath to observe the bit-done signal. Because the signal transition on the critical datapath is always the slowest one in each pipeline stage, it is easy to satisfy the assumption required for a correct handshake. Adding delay elements is an intuitive way to obtain this critical datapath. However, this requires some timing analysis, and the added delay elements are undesired overhead. To obtain a simple and low-overhead solution, SLGs and SLGLs, which have no data dependency problem, are used in our design to construct the critical datapath.

Fig. 5 Block diagram of Asynchronous Pipeline based on a Constructed Critical Datapath (APCCD).

In Fig. 5, the output of the NOR gate is connected to the precharge/evaluation control input, pc, of the previous stage. This structure and the protocol stem from the PS0 pipeline. (Unless stated otherwise, the term “APCCD” in this paper refers to the design based on the structure and protocol of PS0.) In fact, the design idea can also be applied to the structure and protocol of LP2/2. However, LP2/2 needs an additional control circuit to realize its lookahead concept. With a simplified completion detector, the idea brings less benefit in LP2/2 than in PS0. The comparison between them forms one part of the evaluation results.

3.2 SLGs and SLGLs

3.2.1 Synchronizing Logic Gates

In a conventional dual-rail dynamic gate, the gate delay depends on the input data pattern. Figure 2(a) shows the conventional dual-rail AND gate. The true side of the logic is implemented by out_t = a_t·b_t and the false side by out_f = a_f + b_f. Table 2 shows the states of the pull-down transistor paths for different data patterns. In the conventional dual-rail AND gate, there are three transistor paths: [a_t, b_t], [a_f], [b_f]. First of all, these paths have different numbers of transistors in series. When each turns on alone, [a_f] and [b_f] cause less delay than [a_t, b_t]. Moreover, when the data pattern (a_t, a_f, b_t, b_f) is (0, 1, 0, 1), [a_f] and [b_f] are both ON, which leads to a much quicker signal transition. As a result, the evaluation time varies widely with the data pattern.

SLGs are dual-rail dynamic gates which solve the data dependency problem. Figure 6 shows the synchronizing AND gate and the truth table of dual-rail AND gates. The principle is that, in the pull-down network, exactly one path is activated for each data pattern, and the stack height of every possible path (the number of transistors in series) is kept constant. Compared to the conventional dual-rail AND gate, the false-side logic expression is changed to out_f = a_t·b_f + a_f·(b_t + b_f). There are then four transistor paths: [a_t, b_t], [a_t, b_f], [a_f, b_t], [a_f, b_f]. Every path has two transistors in series, and Table 2 shows that only one path turns on for a given input data pattern. As a result, the evaluation time is constant regardless of the data pattern. In addition, for the synchronizing AND gate, a transistor path can only turn on when the input signals a and b are both valid. This means that SLGs cannot start evaluation until all inputs become valid.

Table 2 The states of pull-down transistor paths on different data patterns.

Fig. 6 Synchronizing AND gate and the truth table of dual-rail AND gates.

The characteristics of SLGs are listed as follows:

• An SLG has a fixed number of transistors in series, equal to the number of inputs, in every pull-down transistor path.
• An SLG does not have the data dependency problem. The evaluation time depends only on the number of inputs.
• The absence of any input postpones the evaluation of an SLG. This means that SLGs can synchronize their inputs.
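The difference between the two AND gates can be checked with a small switch-level sketch that lists which pull-down paths conduct for each input pattern. The path lists follow the logic expressions given in the text; the dictionaries `CONVENTIONAL_AND` and `SYNCHRONIZING_AND` and the helper `active_paths` are hypothetical names, and the model ignores all electrical effects.

```python
from itertools import product

# Dual-rail encoding of one bit as (x_t, x_f).
VALID = {0: (0, 1), 1: (1, 0)}
SPACER = (0, 0)

# Each pull-down path is a list of series transistors, named by their gate signal.
CONVENTIONAL_AND = {
    "out_t": [["a_t", "b_t"]],
    "out_f": [["a_f"], ["b_f"]],
}
SYNCHRONIZING_AND = {
    "out_t": [["a_t", "b_t"]],
    "out_f": [["a_t", "b_f"], ["a_f", "b_t"], ["a_f", "b_f"]],
}


def active_paths(gate, a, b):
    """Return the conducting pull-down paths as (output, path) pairs."""
    signals = {"a_t": a[0], "a_f": a[1], "b_t": b[0], "b_f": b[1]}
    return [
        (out, tuple(path))
        for out, paths in gate.items()
        for path in paths
        if all(signals[s] for s in path)
    ]


if __name__ == "__main__":
    for a_bit, b_bit in product((0, 1), repeat=2):
        conv = active_paths(CONVENTIONAL_AND, VALID[a_bit], VALID[b_bit])
        sync = active_paths(SYNCHRONIZING_AND, VALID[a_bit], VALID[b_bit])
        print(f"a={a_bit} b={b_bit} conventional={conv} synchronizing={sync}")
    # With b still a spacer, only the conventional gate can start evaluating.
    print(active_paths(CONVENTIONAL_AND, VALID[0], SPACER))
    print(active_paths(SYNCHRONIZING_AND, VALID[0], SPACER))
```

In this model the synchronizing gate always has exactly one conducting path of two series transistors, and no path conducts while any input is still a spacer.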

3.2.2 Synchronizing Logic Gates with a Latch Function

SLGLs are an extension based on the characteristics of SLGs. Figure 7 shows a synchronizing AND gate with a latch function and the table of latch states. An SLGL has an enable port, (en_t, en_f), which controls the opaque and transparent states of the SLGL. The principle is that an SLGL cannot start evaluation without the presence of the enable signal.

Any conventional dual-rail dynamic gate can be redesigned as an SLG or an SLGL, in the same way as the dual-rail AND gate. The critical datapath in a pipeline can be easily constructed using SLGs and SLGLs, as shown in the next section.


3.3 Structure of APCCD

Figure 8 shows the structure of the asynchronous pipeline based on a constructed critical datapath. The solid arrow represents the constructed critical datapath and the dashed arrows represent sub-critical datapaths. A static NOR gate is connected to the critical datapath to observe the bit-done signal for the entire datapath in each pipeline stage. The output of the NOR gate is connected to the precharge/evaluation control input, pc, of the previous stage through a drive buffer.

3.3.1 Construction of the Critical Datapath

In a conventional design, every pipeline stage is composed of normal dual-rail dynamic logic, which has the data dependency problem. Therefore, the critical, or slowest, signal transition on the outputs of these gates is unknown. However, it is not difficult to predict that the slowest signal transition is likely to appear on the output of the gate which has the largest number of inputs. To obtain a stable critical signal transition, we change the gate that has the largest number of inputs into an SLG in each pipeline stage. Examples are shown in stage1 and stage2 in Fig. 8. The signal transition on the output of the SLG will always be the slowest one on the outputs of the pipeline stage. The reasons are as follows:

• The SLG has the largest number of inputs compared with the other logic gates in each pipeline stage.
• Only one path is activated in the SLG, and that path always has a constant stack height in series.
• Other logic gates have a smaller, or the same, number of inputs and might have parallel paths activated together in the pull-down network.

Fig. 7 Synchronizing AND gate with a latch function and the table of latch states.

Fig. 8 Structure of asynchronous pipeline based on a constructed critical datapath.

The above analysis of the critical signal transition is based on the assumption that all logic gates in each pipeline stage evaluate at the same time. It is very difficult to satisfy this assumption in a practical design. If the logic gates in a pipeline stage evaluate at different times, the signal transition on the output of the SLG might no longer be the critical one. To avoid this problem, SLGLs are used to make sure that the SLG or SLGL in each pipeline stage is the last gate to start evaluation.

The solution is shown, as examples, in stage3 and stage4 in Fig. 8. If two SLGs in adjacent pipeline stages are not connected with each other, we cannot guarantee that the output of the SLG in the subsequent stage has the critical signal transition, because its inputs might arrive quickly and it might evaluate earlier than other gates. In this situation, an SLGL replaces the SLG in the subsequent stage. For example, the SLGs in stage2 and stage3 are not connected with each other. The SLG in stage3 is therefore changed to an SLGL, and the output of the SLG in stage2 is connected to the enable port of the SLGL. Because the SLGL synchronizes its inputs, it cannot start evaluation without the presence of the critical signal transition from the previous pipeline stage. As a result, the SLGL in stage3 will be the last gate to start evaluation, which guarantees that the signal transition on the output of the SLGL is the critical one. The same structure is also shown from stage3 to stage4.
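The construction rule described in this section can be sketched as a small planning routine. This is an illustrative reading of the text, not the authors' tool flow: the netlist representation (a dictionary from gate name to its input source names), the function name `construct_critical_datapath`, the tie-breaking rule, and the choice of a plain SLG in the first stage are all assumptions of this sketch.

```python
def construct_critical_datapath(stages):
    """Choose the critical gate of each stage and decide between SLG and SLGL.

    stages: list of dicts mapping a gate name to the list of its input sources.
    Returns one (gate_name, kind, enable_source) tuple per stage, where kind is
    "SLG" or "SLGL" and enable_source is None for an SLG.
    """
    plan = []
    prev_critical = None
    for gates in stages:
        # The gate with the most inputs becomes the critical gate
        # (assumption: the first such gate wins a tie).
        name = max(gates, key=lambda g: len(gates[g]))
        if prev_critical is None or prev_critical in gates[name]:
            # Already fed by the previous critical output (or first stage): SLG.
            plan.append((name, "SLG", None))
        else:
            # Not connected to the previous critical gate: use an SLGL whose
            # enable port is driven by the previous critical output.
            plan.append((name, "SLGL", prev_critical))
        prev_critical = name
    return plan


if __name__ == "__main__":
    # Hypothetical three-stage netlist; x0..x6 are primary inputs.
    stages = [
        {"g1a": ["x0", "x1", "x2"], "g1b": ["x3", "x4"]},
        {"g2a": ["g1a", "g1b", "x5"], "g2b": ["g1b"]},
        {"g3a": ["g2b", "x6"], "g3b": ["g2b"]},
    ]
    for entry in construct_critical_datapath(stages):
        print(entry)
```

In the hypothetical netlist, the third-stage gate is not driven by the second-stage critical gate, so it is planned as an SLGL enabled by that gate's output, which corresponds to the stage2 to stage3 example in the text.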

3.3.2 Robustness of the Critical Datapath

We first theoretically analyze the robustness of the constructed critical datapath using the method of logical effort [15]. Then, we discuss how to further increase the robustness of the constructed critical datapath in a practical design with delay variations.

First of all, we briefly introduce the method of logical effort. The method of logical effort is an easy way to estimate delay in a CMOS circuit. In this method, the delay model of a logic gate isolates the effects of a particular fabrication process by expressing all delays in terms of a basic delay unit particular to that process. The delay incurred by a logic gate is comprised of two components: a fixed part called the parasitic delay p, and a part that is proportional to the load on the gate’s output, called the effort delay f. The total delay is the sum of the effort and parasitic delays:

d = f + p (3)

The effort delay depends on the load and on properties of the logic gate driving the load. There are two related terms for these effects: the logical effort g captures the effect of the logic gate’s topology on its ability to produce output current, while the electrical effort h describes how the electrical environment of the logic gate affects performance and how the size of the transistors in the gate determines its load-driving capability. The effort delay of the logic gate is the product of these two factors:

f = gh (4)

The electrical effort is defined by:

$h = \frac{C_{out}}{C_{in}}$ (5)

where $C_{out}$ is the capacitance that loads the output of the logic gate and $C_{in}$ is the capacitance presented by the input terminal of the logic gate. Electrical effort is also called fanout by many CMOS designers. According to Eqs. (3), (4) and (5), we get the delay of a logic gate:

$d = g\frac{C_{out}}{C_{in}} + p$ (6)

From the viewpoint of the logic gate itself, an SLG has a more complicated topology in the pull-down network than a normal gate. This slightly increases the parasitic delay p and the logical effort g, which increases the delay of the gate according to Eq. (6). From the viewpoint of the structure, the output of an SLG is connected to a static NOR gate and to the SLG, or the enable port of the SLGL, in the next stage. Compared with the outputs of other logic gates, an SLG therefore has a larger load $C_{out}$ (fanout), which also slightly increases the delay of the gate. Consequently, SLGs and SLGLs normally have a larger delay than conventional dual-rail dynamic logic, even when they have the same number of inputs. These imposed delays increase the robustness of the constructed critical datapath.
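The effect described by Eq. (6) can be illustrated numerically. The parameter values below are hypothetical and chosen only to show the trend (a slightly larger g, p and output load for the SLG); they are not extracted from the 65 nm design.

```python
def gate_delay(g, c_out, c_in, p):
    """Eq. (6): d = g * (C_out / C_in) + p, in units of the basic process delay."""
    h = c_out / c_in  # electrical effort, Eq. (5)
    return g * h + p


if __name__ == "__main__":
    # Hypothetical parameters for illustration only.
    conventional = gate_delay(g=2.0, c_out=4.0, c_in=2.0, p=1.0)
    slg = gate_delay(g=2.5, c_out=6.0, c_in=2.0, p=1.5)
    print("conventional gate delay:", conventional)
    print("SLG delay:", slg)
    print("extra margin on the critical datapath:", slg - conventional)
```

Under these assumed values the SLG is slower than the conventional gate, which is the property the constructed critical datapath relies on.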

In a practical design, the robustness of the constructed critical datapath is affected by delay variations, which should be seriously considered. In fact, this is a fundamental problem in VLSI circuit design, just like the robustness of a clock signal in synchronous design and the robustness of a matched delay line in bundled-data asynchronous design (single-rail design) [8]. These designs all suffer from delay variations. To resist the influence of delay variations, synchronous design enlarges the cycle time of the clock signal to obtain some margin. Bundled-data asynchronous design, on the other hand, adds extra delay margin to the matched delay line to cover the worst-case delay in the combinational logic block. As in these solutions, the delay variation problem in APCCD can be solved by enlarging the delay margin on the constructed critical datapath. We provide four measures to enlarge the delay margin, which are listed as follows:

• Enlarge the delays of SLGs and SLGLs by changing their transistor sizes.
• Give the constructed critical datapath a low priority in circuit layout.
• Add delay elements on the critical datapath to get additional margin.
• Minimize the delays of non-critical datapaths.

Depending on the requirements of a practical design, one or more of these measures can be applied to protect the constructed critical datapath from delay variations.

3.3.3 Extension to More Complex Structures

The previous sections analyzed only the linear structure of APCCD. For more complex datapath designs, forks and joins are needed [8]. Figure 9 shows the fork structure and the join structure in APCCD. According to the features of these two structures, the delay variation problems on the datapaths and on the acknowledge signal networks are discussed in turn from the viewpoint of the handshake structure.

Figure 9(a) shows the fork structure. In the fork structure, the outputs of function block A are split to connect with function blocks B and C, which requires a C-element to collect the acknowledge signals from all successors of A. The construction of the critical datapath in the fork structure is similar to that described for the linear structure.

The problem in the fork structure is that the datapaths from function block A to function blocks B and C are more complex than in the linear structure. The delay variation problem would be serious, and it would affect the correctness of the critical datapaths at the inputs of B and C. Besides the measures introduced in the previous section to resist the influence of delay variations, we analyze the robustness from the viewpoint of the handshake structure. A malfunction of the pipeline happens only when B and C do not finish their evaluations before A finishes its precharge. The delivery time of the precharge signal from B and C to A is:

$T_{prc} = T_{NOR} + T_{C} + T_{buffer}$ (7)

where $T_{NOR}$, $T_{C}$ and $T_{buffer}$ are the delay times of a static NOR gate, a C-element and the buffer gate. Therefore, the margin time is:

$T_{margin} = T_{prc} = T_{NOR} + T_{C} + T_{buffer}$ (8)

Fig. 9 (a) Fork structure. (b) Join structure.

If the delay variations on the datapaths are smaller than $T_{margin}$, no malfunction would happen.

Figure 9(b) shows the join structure. In the join structure, the outputs of function blocks A and B merge at function block C, which requires sending an acknowledge signal from C to all its predecessors. In function block C, the critical datapaths from function blocks A and B need to be connected simultaneously to an SLG or an SLGL. This design process is also similar to that described for the linear structure. The output of the SLG or the SLGL is the critical datapath at the output of function block C.

The problem in the join structure is that the acknowledge signal networks at function block B and C are more complex than in the linear structure, which would cause large delay variations. In this situation, a malfunction of the pipeline would happen only when A and B do not completely finish their precharge process before C enters the next evaluation phase, which means that C would mistakenly absorb old data from A or B. The delivery time of the precharge signal from C to A and B is:

$T_{prc} = T_{NOR} + T_{buffer}$ (9)

According to the handshake protocol of APCCD, the time for C to enter the next evaluation phase is:

$T_{nextEval} = 2T_{Eval} + 2T_{NOR} + 2T_{buffer}$ (10)

where $T_{Eval}$ is the evaluation time for a function block. Therefore, the margin time is:

$T_{margin} = T_{nextEval} - T_{prc} = 2T_{Eval} + T_{NOR} + T_{buffer}$ (11)

If the delay variations on the acknowledge signal networks are smaller than $T_{margin}$, no malfunction would happen.
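The two margin conditions of this section, Eq. (8) for the fork and Eq. (11) for the join, can be written as a short check. The delay values used in the example are hypothetical and in arbitrary units; the function names are chosen for this sketch only.

```python
def fork_margin(t_nor, t_c, t_buffer):
    """Eq. (8): the margin equals the precharge delivery time of Eq. (7)."""
    return t_nor + t_c + t_buffer


def join_margin(t_eval, t_nor, t_buffer):
    """Eq. (11): time until the next evaluation, Eq. (10), minus Eq. (9)."""
    t_prc = t_nor + t_buffer
    t_next_eval = 2 * t_eval + 2 * t_nor + 2 * t_buffer
    return t_next_eval - t_prc


def is_safe(delay_variation, margin):
    """No malfunction is expected while the variation stays below the margin."""
    return delay_variation < margin


if __name__ == "__main__":
    # Hypothetical delays in arbitrary units, for illustration only.
    t_eval, t_nor, t_c, t_buffer = 4, 2, 3, 5
    print("fork margin:", fork_margin(t_nor, t_c, t_buffer))
    print("join margin:", join_margin(t_eval, t_nor, t_buffer))
    print("variation of 6 safe at fork:", is_safe(6, fork_margin(t_nor, t_c, t_buffer)))
```

The join margin simplifies to 2·T_Eval + T_NOR + T_buffer, matching the closed form of Eq. (11).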

4. Evaluation

Conventional dual-rail dynamic asynchronous pipelines use a full completion detector, which has a large overhead. They are impractical for the design of fine-grain pipelined large function blocks. As far as we know, bundled-data asynchronous design is popular in such situations. APCCD, however, greatly reduces the overhead of the completion detector, which makes it applicable to the design of fine-grain pipelined large function blocks. Therefore, the evaluation of APCCD is separated into two parts. One is the comparison with conventional dual-rail dynamic asynchronous pipelines. A 4-bit wide FIFO is chosen as the test case, which is used to show the advantages of APCCD compared to PS0 and LP2/2. The other is the comparison with the classic synchronous design and the bundled-data asynchronous design. An 8 × 8 array-style multiplier is chosen as the test case, which is used to show the advantages of APCCD in designing a fine-grain pipelined large function block. All evaluation results are simulated by HSPICE in a 65 nm design technology. A chip has been fabricated with a 4 × 4 multiplier function.

Table 3 The performance of different bits completion detectors.

4.1 Comparison with PS0 and LP2/2

Table 3 shows the performance of completion detectors with different numbers of input bits. The transistor count, delay and power consumption of the completion detector increase dramatically with the number of input bits, which makes conventional dual-rail dynamic asynchronous pipelines impractical for the design of a fine-grain pipelined function block with a wide datapath. For example, the width of the datapath in an 8 × 8 array-style multiplier is between 16 bits and 62 bits. Whether full completion detectors or many small split completion detectors are used to design a fine-grain pipelined multiplier, the overhead of the completion detector would be huge. Therefore, 4-bit wide FIFOs with 10 stages are chosen as a simple test case to show the advantages of APCCD compared to PS0 and LP2/2. The evaluation results are shown in Table 4. We use two new terms, APCCD PS0 and APCCD LP2/2, in this comparison. APCCD PS0 represents the APCCD based on the structure and protocol of the PS0 pipeline. APCCD LP2/2 is defined in the same way.

Table 4 Evaluation results of 4-bit wide FIFOs.

Table 5 Comparison of APCCD to synchronous pipeline and bundled-data asynchronous pipeline.

Equation (1) shows the cycle time of the PS0 pipeline. tCD is the delay of a static NOR gate plus a tree of C-elements, which is much larger than tEval, the evaluation time of a dual-rail dynamic gate. Therefore, the delay of the completion detector, tCD, accounts for a large proportion of the pipeline cycle time. In APCCD PS0, tCD is reduced to the delay of a single static NOR gate, which greatly shortens the pipeline cycle time. Although the evaluation of an SLG or SLGL in a pipeline stage is slightly slower than that of other gates, it does not increase tEval much. The evaluation results show that APCCD PS0 increases the throughput by 120% and decreases the power consumption by 54% compared to PS0.
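The effect of a shorter tCD on the cycle time can be sketched numerically. The sketch assumes that Eq. (1) has the form commonly quoted for PS0, namely three evaluations, two completion detections and one precharge per cycle; that form, the function names and all delay values below are assumptions chosen for illustration, not figures from Table 4.

```python
def ps0_cycle_time(t_eval: float, t_cd: float, t_prech: float) -> float:
    """Assumed PS0 cycle time: 3*tEval + 2*tCD + tPrech."""
    return 3 * t_eval + 2 * t_cd + t_prech

def throughput_gain(t_old: float, t_new: float) -> float:
    """Relative throughput increase when the cycle time drops from t_old to t_new."""
    return t_old / t_new - 1.0

if __name__ == "__main__":
    # Hypothetical delays in arbitrary units, not measured values.
    t_eval, t_prech = 1.0, 1.0
    t_cd_tree = 4.0   # NOR gate plus C-element tree
    t_cd_nor = 1.0    # single static NOR gate

    t_ps0 = ps0_cycle_time(t_eval, t_cd_tree, t_prech)
    t_apccd = ps0_cycle_time(t_eval, t_cd_nor, t_prech)
    print(t_ps0, t_apccd, throughput_gain(t_ps0, t_apccd))
```

Because tCD appears twice in the assumed expression, any reduction of the detector delay is counted twice in the cycle time, which is consistent with the large throughput gain reported for APCCD PS0.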

On the other hand, APCCD LP2/2 improves the throughput by 17.7% and reduces the power consumption by 33.2% compared to LP2/2. However, APCCD LP2/2 does not gain more than APCCD PS0, even though LP2/2 has much higher throughput than PS0. Conventional LP2/2 improves throughput over PS0 by adding extra control circuitry that avoids one completion detection and one evaluation. This is reasonable when the delay of a completion detector is large. However, the completion detector in APCCD is simplified to a single static NOR gate whose delay is very small. In this situation, the additional control circuit in LP2/2 becomes an obstacle to increasing the throughput. Moreover, it also consumes some power. As a result, the throughput of APCCD PS0 is 20.3% higher, and its power consumption 31.1% lower, than those of APCCD LP2/2.
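As a consistency check on the percentages quoted above, the throughput of conventional LP2/2 relative to PS0 can be derived from them. The following sketch uses only the three reported improvements (120%, 17.7% and 20.3%); the resulting ratio is a derived value, not one reported in the text, and the function name is hypothetical.

```python
def implied_lp22_over_ps0(gain_apccd_ps0: float,
                          gain_apccd_lp22: float,
                          gain_ps0_over_lp22_apccd: float) -> float:
    """Throughput of LP2/2 relative to PS0 implied by the reported gains.

    APCCD_PS0  = (1 + gain_apccd_ps0)  * PS0
    APCCD_LP22 = (1 + gain_apccd_lp22) * LP22
    APCCD_PS0  = (1 + gain_ps0_over_lp22_apccd) * APCCD_LP22
    """
    return (1 + gain_apccd_ps0) / (
        (1 + gain_ps0_over_lp22_apccd) * (1 + gain_apccd_lp22)
    )

if __name__ == "__main__":
    ratio = implied_lp22_over_ps0(1.20, 0.177, 0.203)
    print(f"LP2/2 throughput is about {ratio:.2f} times that of PS0")
```

The implied ratio of roughly 1.55 agrees with the statement that LP2/2 has much higher throughput than PS0, while the APCCD variants reverse the ordering.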

4.2 Comparison to Synchronous Pipeline and Bundled-Data Asynchronous Pipeline

In order to further show the benefits of APCCD, which was named APCCD PS0 in the previous section, we choose an array style multiplier as a test case for comparison with a synchronous pipeline and a bundled-data asynchronous pipeline. 8 × 8 multipliers are fine-grain pipelined using each of these three design methods. A synchronous pipeline with sequential clock gating [17]–[19] is also designed for comparison with APCCD. Table 5 shows the comparison results. Because dynamic logic in APCCD provides an implicit latch function, all storage elements are removed, which saves 278 flip-flops and 791 latches compared to the synchronous pipeline and the bundled-data asynchronous pipeline, respectively. At the same time, APCCD improves the forward latency by 66.2% and 68.6%, respectively. The synchronous pipeline with clock gating needs extra flip-flops to implement the clock-gating control circuit, which slightly degrades throughput and forward latency.

Fig. 10 The performance of power consumption (8 × 8 multiplier, 2.86 G data-set/s).

Fig. 11 Definition of workload.

Figure 10 shows the power consumption when all pipelined circuits work at 2.86 G data-set/s. The workload refers to the ratio of the number of active-state cycles to the total number of cycles. In our case, the workload is calculated over a period of consecutive data injection cycles (active-state cycles) following consecutive empty cycles, as shown in Fig. 11. The workload is calculated as N/(N + M), where N is the number of consecutive data injection cycles and M is the number of consecutive empty cycles. Previous works [2], [6], [7] all mention that bundled-data asynchronous designs show better power performance than their synchronous counterparts. The evaluation results show that APCCD has even better power performance than its bundled-data asynchronous counterpart. When the workload of the circuit is lower than 55%, the bundled-data asynchronous pipeline performs better than the synchronous pipeline. However, APCCD saves up to 37.9% of power in the same situation. When the circuit works at peak speed, APCCD still saves 15.1% of power compared to the synchronous pipeline. On the other hand, the synchronous pipeline with sequential clock gating greatly reduces the clock power when the workload is low, but APCCD still saves up to 16.5% of power compared to it.
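The workload definition N/(N + M) can be written as a short helper. The function name and the example cycle counts are hypothetical; only the formula comes from the text.

```python
def workload(n_active: int, m_empty: int) -> float:
    """Workload = N / (N + M).

    n_active: number of consecutive data injection (active-state) cycles, N.
    m_empty:  number of consecutive empty cycles, M.
    """
    if n_active < 0 or m_empty < 0:
        raise ValueError("cycle counts must be non-negative")
    total = n_active + m_empty
    if total == 0:
        raise ValueError("at least one cycle is required")
    return n_active / total

if __name__ == "__main__":
    # Example: 11 injection cycles and 9 empty cycles give a 55% workload.
    print(f"{workload(11, 9):.0%}")
    # Peak speed: no empty cycles at all.
    print(f"{workload(20, 0):.0%}")
```

For instance, a 55% workload corresponds to 11 injection cycles for every 9 empty cycles, and peak speed corresponds to M = 0.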

Figure 12 shows the fabricated chip with a 4 × 4 multiplier function. Table 6 shows the features of the fabricated chip.


Fig. 12 Photo of the fabricated chip (4 × 4 multiplier).

Table 6 Features of the fabricated chip.

Fig. 13 A waveform result of the fabricated 4 × 4 multiplier.

In order to reduce the influence of delay variations in practical design, we took three measures. First, we made the critical datapath have the longest routing wire by reducing its routing priority. Second, we slightly reduced the transistor sizes in SLGs and SLGLs to increase their delay times. Third, we added some delay elements at the dangerous corner to enhance the robustness of the critical datapath. As a result, the fabricated chip works correctly. Figure 13 shows a waveform result of the fabricated 4 × 4 multiplier. The inputs are defined as [a3, a2, a1, a0] and [b3, b2, b1, b0]. The outputs are defined as [(t0, f0), (t1, f1), (t2, f2), ..., (t7, f7)]. Figure 13 shows only part of the waveform, for the multiplication 0001 × 0011 = 00000011. Inputs a0, b0 and b1 are high and all other inputs are low. Outputs (t0, f0) and (t1, f1) are data1 signals, (1, 0), and all other outputs are data0 signals, (0, 1). Every output data-set is triggered by an ack signal from the receiver. When the supply voltage is changed from 1.2 V to 0.75 V, all computation results have been verified to be correct. To a certain degree, this demonstrates the robustness of APCCD. The simulation results show that the post-layout multiplier works well at 2.16 GHz.
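The dual-rail output pattern described for Fig. 13 can be reproduced with a small encoding sketch. It assumes the convention stated in the text, data1 = (1, 0) and data0 = (0, 1), with bit 0 as the least significant bit; the function names are hypothetical and the code models only the logic values, not the circuit timing.

```python
def dual_rail_encode(value: int, width: int):
    """Encode an unsigned integer as a list of (t, f) pairs, bit 0 first.

    data1 is (1, 0) and data0 is (0, 1), as defined for the outputs in the text.
    """
    pairs = []
    for i in range(width):
        bit = (value >> i) & 1
        pairs.append((bit, 1 - bit))
    return pairs

def dual_rail_decode(pairs) -> int:
    """Recover the integer from valid (t, f) pairs; reject spacers and invalid codes."""
    value = 0
    for i, (t, f) in enumerate(pairs):
        if t == f:
            raise ValueError(f"bit {i} is not a valid dual-rail data value")
        value |= t << i
    return value

if __name__ == "__main__":
    a, b = 0b0001, 0b0011
    outputs = dual_rail_encode(a * b, 8)  # [(t0, f0), ..., (t7, f7)]
    for i, (t, f) in enumerate(outputs):
        print(f"(t{i}, f{i}) = ({t}, {f})")
```

For the product 00000011, only (t0, f0) and (t1, f1) take the data1 value, which matches the waveform description above.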

5. Conclusions

We proposed a novel design method for dual-rail dynamic asynchronous pipelines that is based on a constructed critical datapath. The design method greatly reduces the overhead of handshake control logic, which not only increases the throughput but also decreases the power consumption. The evaluation results show that the proposed design has benefits compared with conventional designs and is even comparable to the classic synchronous pipeline.

In addition, sub-critical datapaths in APCCD do not have to use dual-rail logic because the handshake circuit does not detect them. If single-rail logic is used to replace these dual-rail logic gates, the overhead of logic gates would be further reduced, which would yield even lower power consumption. This is left for future work.

Acknowledgment

This work is supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with STARC, e-Shuttle Inc., Fujitsu Ltd., Cadence Design Systems Inc., Synopsys Inc. and Mentor Graphics Inc.

References

[1] B.H. Calhoun, Y. Cao, X. Li, K. Mai, L.T. Pileggi, and R.A. Rutenbar, “Digital circuit design challenges and opportunities in the era of nanoscale CMOS,” Proc. IEEE, vol.96, no.2, pp.343–365, Feb. 2008.

[2] C.H. Van Berkel, M.B. Josephs, and S.M. Nowick, “Applications of asynchronous circuits,” Proc. IEEE, vol.87, no.2, pp.223–233, 1999.

[3] M. Krstic, E. Grass, F.K. Gurkaynak, and P. Vivet, “Globally asynchronous, locally synchronous circuits: Overview and outlook,” IEEE Design and Test of Computers, vol.24, no.5, pp.430–441, 2007.

[4] Y.W. Li, G. Patounakis, A. Jose, K.L. Shepard, and S.M. Nowick, “Asynchronous datapath with software-controlled on-chip adaptive voltage scaling for multirate signal processing application,” Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp.216–225, 2003.

[5] S. Ishihara, Z. Xia, M. Hariyama, and M. Kameyama, “Evaluation of a self-adaptive voltage control scheme for low-power FPGAs,” J. Semiconductor Technology and Science (JSTS), vol.10, no.3, pp.165–175, 2010.

[6] H. van Gageldonk, K. van Berkel, A. Peeters, D. Baumann, D. Gloor, and G. Stegmann, “An asynchronous low-power 80c51 microcontroller,” Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp.96–107, 1998.

[7] S. Rotem, K. Stevens, R. Ginosar, P. Beerel, C. Myers, K. Yun, R. Kol, C. Dike, M. Roncken, and B. Agapiev, “RAPPID: An asynchronous instruction length decoder,” Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp.60–70, 1999.

[8] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer Academic Publishers, 2001.

[9] T.E. Williams, Self-timed rings and their application to division, PhD thesis, Stanford University, June 1991.

[10] M. Renaudin, B.E. Hassan, and A. Guyot, “A new asynchronous pipeline scheme: Application to the design of a self-timed ring divider,” IEEE J. Solid-State Circuits, vol.31, no.7, pp.1001–1013, 1996.

[11] G. Matsubara and N. Ide, “A low power zero-overhead self-timed division and square root unit combining a single-rail static circuit with a dual-rail dynamic circuit,” Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp.198–209, 1997.

[12] M. Singh and S.M. Nowick, “The design of high-performance dynamic asynchronous pipelines: Lookahead style,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.15, no.11, pp.1256–1269, Sept. 2007.

[13] M. Hariyama, S. Ishihara, and M. Kameyama, “Evaluation of a field-programmable VLSI based on an asynchronous bit-serial architecture,” IEICE Trans. Electron., vol.E91-C, no.9, pp.1419–1426, Sept. 2008.

[14] Z. Xia, S. Ishihara, M. Hariyama, and M. Kameyama, “Synchronising logic gates for wave-pipelining design,” Electron. Lett., vol.46, no.16, pp.1116–1117, Aug. 2010.

[15] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann, San Mateo, CA, 1999.

[16] A.M.G. Peeters, Single-rail handshake circuits, Ph.D. dissertation, Dep. Math. Comput. Sci., Eindhoven Univ. Technol., Eindhoven, The Netherlands, 1999.

[17] S. Ahuja and S. Shukla, “MCBCG: Model checking based sequential clock-gating,” High Level Design Validation and Test Workshop, pp.20–25, 2009.

[18] L. Li, W. Wang, K. Choi, S. Park, and M.-K. Chung, “SeSCG: Selective sequential clock gating for ultra-low-power multimedia processor design,” Electro/Information Technology (EIT), pp.1–6, 2010.

[19] S. Jairam, R. Madhusudan, S. Jithendra, V. Parimala, H. Udayakumar, and R. Jagdish, “Clock gating for power optimization in ASIC,” www.islped.org/X2008/Jairam.pdf

Zhengfan Xia received the B.E. degree in Electronic and Information Engineering from China University of Geosciences, Beijing, China, in 2008, and the M.S. degree in Information Science from Tohoku University, Sendai, Japan, in 2011. He is currently working toward the Ph.D. degree in Graduate School of Information Sciences, Tohoku University. His primary research interest is in the area of asynchronous architecture.

Shota Ishihara received the B.S. degree in Information Engineering, the M.S. degree in information sciences, and the Ph.D. degree in information sciences from Tohoku University, Sendai, Japan, in 2007, 2009 and 2012, respectively. He is currently working at Murata Manufacturing Co., Ltd. His research interests include reconfigurable VLSIs, asynchronous architectures, nonvolatile logic-in-memory circuit technologies, and RF modules. Dr. Ishihara received the Yasujiro Niwa Outstanding Paper Award in 2012.

Masanori Hariyama received the B.E. degree in Electronic Engineering, M.S. degree in Information Sciences, and Ph.D. in Information Sciences from Tohoku University, Sendai, Japan, in 1992, 1994, and 1997, respectively. He is currently an associate professor in Graduate School of Information Sciences, Tohoku University. His research interests include VLSI computing for real-world application such as robots, high-level design methodology for VLSIs and reconfigurable computing.

Michitaka Kameyama received the B.E., M.E. and D.E. degrees in Electronic Engineering from Tohoku University, Sendai, Japan, in 1973, 1975, and 1978, respectively. He is currently Dean and a Professor in the Graduate School of Information Sciences, Tohoku University. His general research interests are intelligent integrated systems for real-world applications and robotics, advanced VLSI architecture, and new-concept VLSI including multiple-valued VLSI computing. Dr. Kameyama received the Outstanding Paper Awards at the 1984, 1985, 1987 and 1989 IEEE International Symposiums on Multiple-Valued Logic, the Technically Excellent Award from the Society of Instrument and Control Engineers of Japan in 1986, the Outstanding Transactions Paper Award from the IEICE in 1989, the Technically Excellent Award from the Robotics Society of Japan in 1990, and the Special Award at the 9th LSI Design of the Year in 2002. Dr. Kameyama is an IEEE Fellow.
