Yours Fav Shopping Stop!!


Monday, 6 April 2015

Clean up Transition Violations between ICC and PTSI

ABSTRACT

SI discussions usually focus on delta delay and noise, but they often overlook another important issue: the effect of SI on transition time. In deep sub-micron designs, timing DRC issues, especially transition violations, are much more critical. This paper introduces two methods for fixing transition violations. The first prevents transition violations caused by long nets at the placement or CTS stage in ICC. The second fixes transition violations at the post-route stage based on the PTSI sign-off analysis report.

By default, ICC uses Arnoldi-based calculation for only part of the nets in the post-route phase to limit runtime. This degrades transition time correlation between ICC and PTSI, with ICC being more optimistic than PTSI. Both ICC and PT provide a Tcl-based command interface, so it is easy to write scripts that work in both tools. With these methods we sped up the cleanup of transition violations in our design and reduced the number of iterations.

WANG Liang   wangliang@vimicro.com
HOU Hua Min   houhuamin@vimicro.com
ZHU Kan   zhukan@vimicro.com
Vimicro Corporation

1.0  Introduction

This paper focuses on fixing transition issues in ultra deep sub-micron back-end design. Transition is an important point in the timing sign-off flow because cell delay calculation is based on input transition and output load, so a transition violation affects the calculated cell delay. A transition violation means that the transition time exceeds the range of the library lookup table (which is characterized by the library vendor). The cell delay then becomes inaccurate, which introduces risk for tapeout.
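The dependence of cell delay on input transition and output load can be illustrated with a small lookup-table sketch. The table values, index points, and function names below are invented for illustration and do not come from any real library; the point is only that a transition beyond the last characterized index forces extrapolation, which is why the resulting delay cannot be trusted.

```python
from bisect import bisect_right

# Hypothetical NLDM-style delay table (all values invented for illustration).
# Rows: input transition (ns); columns: output load (pF); entries: cell delay (ns).
TRANSITION_INDEX = [0.05, 0.20, 0.80]
LOAD_INDEX = [0.01, 0.05, 0.20]
DELAY_TABLE = [
    [0.10, 0.15, 0.30],
    [0.14, 0.20, 0.36],
    [0.25, 0.32, 0.50],
]


def _segment(index, value):
    """Return the lower index of the table segment used for `value`.

    Values outside the characterized range fall back to the nearest
    end segment, which means the result is extrapolated.
    """
    pos = bisect_right(index, value) - 1
    return min(max(pos, 0), len(index) - 2)


def lookup_delay(transition, load):
    """Bilinear interpolation on the table; returns (delay, in_range)."""
    i = _segment(TRANSITION_INDEX, transition)
    j = _segment(LOAD_INDEX, load)
    t0, t1 = TRANSITION_INDEX[i], TRANSITION_INDEX[i + 1]
    c0, c1 = LOAD_INDEX[j], LOAD_INDEX[j + 1]
    ft = (transition - t0) / (t1 - t0)
    fc = (load - c0) / (c1 - c0)
    delay = (DELAY_TABLE[i][j] * (1 - ft) * (1 - fc)
             + DELAY_TABLE[i + 1][j] * ft * (1 - fc)
             + DELAY_TABLE[i][j + 1] * (1 - ft) * fc
             + DELAY_TABLE[i + 1][j + 1] * ft * fc)
    in_range = (TRANSITION_INDEX[0] <= transition <= TRANSITION_INDEX[-1]
                and LOAD_INDEX[0] <= load <= LOAD_INDEX[-1])
    return delay, in_range


# A transition inside the table is interpolated; one beyond 0.80 ns is extrapolated.
ok_delay, ok_flag = lookup_delay(0.20, 0.05)
bad_delay, bad_flag = lookup_delay(1.20, 0.05)
```

In this sketch the second lookup returns `in_range == False`, which corresponds to the situation the paper calls a transition violation.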

Fortunately, ICC can deal with most transition issues, but sometimes extra techniques or other tools (such as PTSI) are needed to speed up transition fixing. In our design, fixing max transition violations is generally bottlenecked by two things. First, long net connections induce potential transition issues. Second, to limit runtime, ICC by default uses Arnoldi calculation for only part of the nets in the post-route phase. This degrades transition time correlation between ICC and PTSI, with ICC being more optimistic than PTSI.

This paper introduces the VIMICRO transition fixing methodology based on ICC and PTSI, which consists of two steps. At the pre-route stage, transition issues are addressed by fixing long net connections. At the post-route stage, the remaining transition violations are fixed according to the PTSI sign-off DMSA report.


2.0  Clean up transition violations methodology based on ICC and PTSI

To address these causes of transition violations, we introduce two optional steps, each implemented as a VIMICRO proprietary procedure: tproc_fix_long_nets and tproc_fix_transition.

The first fixes long nets at an early stage of physical implementation, to cope with special floorplan requirements such as long channel areas. It can be run at the placement or CTS stage.

The second targets SI-induced transition violations at the sign-off stage, which are hard to fix in ICC because of its timing engine (to limit runtime, ICC by default uses Arnoldi calculation for only part of the nets in the post-route phase). It reads the PTSI DMSA timing DRC report and generates ICC ECO scripts to fix the violations.

Figure 1 shows where these two optional steps sit in the normal flow.


2.1     Pre-route potential transition issue

In some special design cases, such as a floorplan with long channel areas, the pre-route estimate of signal net load may not match the real routing because of long net connections.

Our design includes two power domains. The always-on domain surrounds the shut-down domain, so long channel areas in the floorplan are unavoidable, as shown in Figure 2.

ICC sometimes estimates the load of a long net optimistically and has difficulty finding a good location to fix the long net in the pre-route phase. This induces large transition violations in the post-route phase and degrades correlation between the pre-route and post-route phases.


2.2     Auto fix long net connection at early stage for better transition QoR

It is better to solve the long net connection issue before routing, but how? The best method is to insert buffers along the real route. This can be done in four steps.

Step 1: Find the long nets in the design.
Step 2: Route the nets found in step 1.
Step 3: Insert buffers along the nets routed in step 2.
Step 4: Remove the net shapes routed in step 2.
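The selection and buffer-placement logic behind steps 1 and 3 can be sketched outside the tool as follows. This is only an illustration under stated assumptions: the route of each net is assumed to be available as a list of (x, y) corner points, and the function names, net names, threshold, and spacing are hypothetical, not part of the procedure described in this paper.

```python
def route_length(points):
    """Total Manhattan length of a polyline given as (x, y) tuples."""
    return sum(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def buffer_sites(points, spacing):
    """Return (x, y) locations every `spacing` units along the route.

    Assumes each segment is horizontal or vertical, as on a routed net.
    """
    sites = []
    to_next = spacing  # remaining distance to the next buffer site
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = abs(x1 - x0) + abs(y1 - y0)
        if seg == 0:
            continue
        dx = (x1 - x0) / seg   # unit direction; one of dx, dy is zero
        dy = (y1 - y0) / seg
        pos = 0.0              # distance already walked on this segment
        while seg - pos >= to_next:
            pos += to_next
            sites.append((x0 + dx * pos, y0 + dy * pos))
            to_next = spacing
        to_next -= seg - pos
    return sites


def plan_long_net_fix(nets, max_length, spacing):
    """Pick nets longer than `max_length` and plan buffer sites along them."""
    return {name: buffer_sites(pts, spacing)
            for name, pts in nets.items()
            if route_length(pts) > max_length}


# Hypothetical nets (coordinates in microns).
nets = {
    "n_long": [(0, 0), (1000, 0), (1000, 500)],
    "n_short": [(0, 0), (100, 0)],
}
plan = plan_long_net_fix(nets, max_length=800, spacing=400)
```

Here only `n_long` exceeds the threshold, and buffer sites are planned at 400-micron intervals measured along its route rather than along a straight line between driver and load.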

A real design may have many long net connections, and we want this tedious job to be done automatically. Synopsys Pilot makes it easy to add procedures to our design environment. The procedure tproc_fix_long_nets addresses long net connections in the ICC environment. Its options specify, among other things, the sub-modules in which long nets are fixed, the long net threshold, the buffer insertion distance, and the buffer types.

 Below are this command’s detailed options and usage examples:

tproc_fix_long_nets

    -insts       “full hier instance names”   required
    -net_length  “max net length”             optional
    -buf_length  “insert buf length”          optional
    -path        “output dir”                 optional
    -file        “output file name”           optional
    -clk_buffer  “clock  buffer”              optional
    -nom_buffer  “normal buffer”              optional
    -noroute     “the preroute stage”         optional
    -pre_cts     “the pre_cts  stage”         optional

Usage: tproc_fix_long_nets -insts $TEV_FIX_LONG_NET_MODULES -noroute -pre_cts

Limitation: This procedure can handle clock networks at the post-CTS stage, but it does so without regard to clock skew. If it is used on a clock network, an incremental clock tree optimization must be run afterwards to recover clock skew.

Figure 3 shows an example of a long net connection issue in our design.


Figure 4 shows the result after our procedure fixed the long net connection issue.


The result clearly shows the effect of this method. Long net connections have many bad effects, such as large transition times, SI, EM, and RC-009 warnings in STA. A floorplan with long channel areas is only one cause of long net connections, and there may be others. However, this method can fix long net connections regardless of their cause.

2.3     Post-route transition report difference between ICC and PTSI

SI discussions usually focus on delta delay and noise, but they often overlook another important issue: the effect of SI on transition time. In deep sub-micron processes the SI effect becomes increasingly severe. It can cause the transition time at the load pin to differ from that at the driving pin of the same net, and with SI the transition time is worse than without it, as Figure 5 shows.

In STA tools, the Arnoldi algorithm can evaluate SI-affected transition. PTSI is a sign-off STA tool that uses the Arnoldi algorithm on all nets, so it can calculate the SI effect accurately, including transition time. ICC, however, by default uses Arnoldi calculation for only part of the nets in the post-route phase to limit runtime.

The most significant difference between full Arnoldi in PTSI and partial Arnoldi in ICC is the SI-affected transition time. As a result, the transition time reported by ICC in the post-route phase is more optimistic than that reported by PTSI.



2.4     Basic method for transition fixing in post-route phase


First, we fix transition violations in ICC until timing is clean or almost clean (including setup/hold, max transition/capacitance, and so on). Then we write out the netlist and SPEF and use the PTSI DMSA feature to report timing for all scenarios. Many transition violations may still remain, but they can be fixed directly according to the PTSI report, in either the PT or the ICC environment.

Generally, the easiest way to fix a transition violation is to size up the driving cell. If no larger cell is available, we can insert a buffer whose driving strength is bigger than that of the original driver. Figure 6 and Figure 7 show the Size Up and Insert Buffer methods.




2.5     Auto fix transition violations according to PTSI DMSA report

As shown above, it is better to fix transition violations according to the PTSI result, because ICC is optimistic with its default settings and its report may not show all of the transition violations induced by SI. The fixing methods are Size Up and Insert Buffer.

The PTSI report may include many transition violations even when the design is almost clean in ICC, so a lot of work may remain. Fortunately, another procedure, tproc_fix_transition, automates it by performing the following steps in order.

1. Find all transition violations in the PTSI DMSA report and reduce them to a unique list of driving pins.
2. Try to size up the driving cell to a larger cell of the same type.
3. If step 2 fails, try to insert a buffer at the driving pin with a bigger driving strength than the driving cell.
4. Print a warning if neither step 2 nor step 3 succeeds.
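The four steps above amount to a simple decision flow, sketched here in standalone form. Everything concrete in the sketch is an assumption made for illustration: the `<family>X<strength>` cell naming scheme, the contents of `LIBRARY`, the pin names, and the choice of the next larger strength rather than the largest one. It is not the implementation of tproc_fix_transition.

```python
import re

# Hypothetical library: drive strengths available for each cell family.
LIBRARY = {"NAND2": [1, 2, 4], "INV": [1, 2, 4, 8], "BUF": [2, 4, 8, 16]}


def split_cell(ref_name):
    """Split e.g. 'NAND2X4' into ('NAND2', 4); the naming scheme is assumed."""
    m = re.fullmatch(r"(.+)X(\d+)", ref_name)
    if m is None:
        raise ValueError(f"unexpected cell name: {ref_name}")
    return m.group(1), int(m.group(2))


def plan_fixes(violations, drivers):
    """Plan one fix per driving pin.

    violations: iterable of (violating_pin, driver_pin) pairs.
    drivers: driver_pin -> library cell name of the driving instance.
    Returns a list of (action, driver_pin, detail) tuples.
    """
    plan = []
    seen = set()
    for _pin, drv in violations:
        if drv in seen:           # step 1: handle each driving pin once
            continue
        seen.add(drv)
        family, strength = split_cell(drivers[drv])
        bigger = [s for s in LIBRARY.get(family, []) if s > strength]
        if bigger:                # step 2: size up within the same family
            plan.append(("size_up", drv, f"{family}X{min(bigger)}"))
            continue
        bufs = [s for s in LIBRARY["BUF"] if s > strength]
        if bufs:                  # step 3: stronger buffer at the driving pin
            plan.append(("insert_buffer", drv, f"BUFX{min(bufs)}"))
        else:                     # step 4: nothing stronger, leave for manual work
            plan.append(("warning", drv, "no stronger cell available"))
    return plan


# Hypothetical violation list: two violating pins share the driver u1/Y.
violations = [("u3/A", "u1/Y"), ("u4/A", "u1/Y"), ("u5/A", "u2/Y"), ("u6/B", "u7/Y")]
drivers = {"u1/Y": "NAND2X2", "u2/Y": "NAND2X4", "u7/Y": "BUFX16"}
fix_plan = plan_fixes(violations, drivers)
```

Stepping up by one drive strength per pass is a deliberately conservative choice in this sketch; it relies on repeated iterations to converge.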

Below are this command’s detailed options and usage examples: 

tproc_fix_transition

    -pt_rpt_file       “point pt transition report”    required
    -path              “output dir”                    optional
    -file              “output file name”              optional
    -buffer_list       “buffer list”                   optional
    -iteration_loops   “search find loops”             optional

Usage: tproc_fix_transition -pt_rpt_file $TEV_PTSI_DMSA_TRAN_RPT

Limitation: This procedure only sizes up the driver or inserts a bigger buffer at the driving pin. It cannot choose a location along the net, so it cannot solve long net or large fanout issues. Those must be solved at an early stage, long nets with the tproc_fix_long_nets procedure described earlier and large fanout with a max fanout constraint.

Below is part of PTSI DMSA transition report before transition fixing with our procedure:

## Violation Types: max_transition

Worst Violator    : -0.3681
===================================================
Violation Range for Bar 1 : -0.000000 to  -0.073560
Violation Range for Bar 2 : -0.073560 to  -0.147120
Violation Range for Bar 3 : -0.147120 to  -0.220680
Violation Range for Bar 4 : -0.220680 to  -0.294240
Violation Range for Bar 5 : > -0.294240
===================================================


Histogram of violations on (bins = 5)
| (% of WNS)         (violations) –>
 0-20%: ************************************************** (118)
20-40% : ****************************************** (99)
40-60% : ***************** (40)
60-80% :  (1)
 > 80% : * (2)

After one iteration loop with the tproc_fix_transition procedure, the PTSI DMSA transition report becomes:


## Violation Types: max_transition
Worst Violator    : -0.1678
===================================================
Violation Range for Bar 1 : -0.000000 to  -0.033560
Violation Range for Bar 2 : -0.033560 to  -0.067120
Violation Range for Bar 3 : -0.067120 to  -0.100680
Violation Range for Bar 4 : -0.100680 to  -0.134240
Violation Range for Bar 5 : > -0.134240
===================================================


Histogram of violations on (bins = 5)
| (% of WNS)         (violations) –>
 0-20%: ************************ (11)
20-40% : ************************************************** (23)
40-60% : ********* (4)
60-80% :  (0)
 > 80% : ** (1)

After about three to four iteration loops and a small amount of manual work, we cleaned up all transition violations in our design.
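The improvement after the first iteration can be quantified by summing the per-bin counts of the two histograms. The short sketch below does this; the histogram bars are abbreviated, only the counts in parentheses are taken from the reports above, and the function name is hypothetical.

```python
import re

# Counts copied from the two reports above; bars shortened for readability.
BEFORE = """\
 0-20%: ***** (118)
20-40% : **** (99)
40-60% : ** (40)
60-80% :  (1)
 > 80% : * (2)
"""

AFTER = """\
 0-20%: ** (11)
20-40% : ***** (23)
40-60% : * (4)
60-80% :  (0)
 > 80% : * (1)
"""


def count_violations(report_text):
    """Sum the per-bin counts printed in parentheses at the end of each line."""
    return sum(int(m.group(1))
               for m in re.finditer(r"\((\d+)\)[ \t]*$", report_text, re.MULTILINE))


before_total = count_violations(BEFORE)
after_total = count_violations(AFTER)
reduction = 1 - after_total / before_total
print(f"{before_total} -> {after_total} violations ({reduction:.0%} fewer)")
```

By this count, one iteration removes 221 of the 260 reported violations, while the worst violator improves from -0.3681 to -0.1678.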

3.0  Conclusions and Recommendations

With ICC and PTSI working together, we successfully cleaned up the transition violations in our design. At the pre-route stage, we use a procedure that automatically finds and fixes long net connections to avoid potential transition violations. At the post-route stage, we use another procedure that reads the PTSI DMSA transition report and automatically fixes the remaining transition violations in the ICC or PT environment.

If long nets in the clock tree have to be fixed at the post-CTS stage, we recommend running an incremental clock tree optimization afterwards to recover clock skew. In the post-route phase, we recommend fixing transition violations in the PT environment using its ‘What-if’ feature, which evaluates the effect of a fix more quickly.

4.0  Acknowledgements

We would like to thank VIMICRO colleagues and Synopsys engineer, our SNUG technical reviewer, for their valuable inputs.


Friday, 3 April 2015

Minimizing Clock Latency with IC Compiler

ABSTRACT

This article focuses on minimizing clock latency during physical implementation with Synopsys IC Compiler. Obstacles to optimizing clock networks can come from many sources. This article covers three of them: unbalanced slew rate on I/O cells, clock pins with high capacitance, and clock networks from multiple clock domains. For each issue, practical solutions for optimizing clock latency are discussed. With the techniques in this article, clock tree latency can be effectively minimized.

Leo Zeng   Martin Ma  Kimi Jin      

leo.zeng@britesemi.com
martin.ma@britesemi.com
kimi.jin@britesemi.com

1.0  Introduction

As modern system-on-chip designs continue to grow in functional complexity, they require higher clock frequencies, more clock domains, and more sophisticated clock structures. As process geometries scale down to 65 and 40 nanometers, advanced technology nodes allow higher integration density and larger chips. As a result, the loading of clock networks becomes heavier than before and clock networks may have to traverse longer distances. Furthermore, on-chip variation becomes more significant in terms of clock uncertainty and latency, which makes timing closure difficult. For example, the timing sign-off criteria for a TSMC 65nm low-power design are as follows:

WC corner for setup check:  set_timing_derate -early 0.95; set_timing_derate -late 1.0

BC corner for hold check:     set_timing_derate -early 1.0;  set_timing_derate -late 1.1

Assume that the clock frequency is 800 MHz and that the launch and capture clock latencies are both 10ns. Because of timing derating, this long latency generates 500ps of clock uncertainty between the launch and capture clock paths (10ns x 0.05), which is 40 percent of the 1.25ns clock cycle. The setup margin therefore decreases by 40 percent of the cycle. Insufficient setup margin causes a large number of violations and degrades chip performance. Hold time analysis is affected in the same way.
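The arithmetic behind this example can be checked with a few lines. The sketch assumes the simplest possible model, in which the whole launch and capture latency is derated and no common-path pessimism is removed; the function name is hypothetical.

```python
def derate_uncertainty(latency_ns, early, late):
    """Skew created when two clock paths of equal nominal latency
    are scaled by different early/late derate factors."""
    return latency_ns * (late - early)


period_ns = 1000.0 / 800.0                                   # 800 MHz clock
setup_unc = derate_uncertainty(10.0, early=0.95, late=1.0)   # WC corner
hold_unc = derate_uncertainty(10.0, early=1.0, late=1.1)     # BC corner
setup_fraction = setup_unc / period_ns

print(f"period          : {period_ns:.2f} ns")
print(f"setup penalty   : {setup_unc:.2f} ns ({setup_fraction:.0%} of the cycle)")
print(f"hold penalty    : {hold_unc:.2f} ns")
```

Because the penalty scales linearly with latency, halving the clock latency halves the derate-induced uncertainty in this model.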
Additionally, minimizing clock latency can improve clock skew, decrease power dissipation, and improve clock tree performance.
In this article, the following topics are presented:
-      Unbalanced slew rate on I/O cells
-      Clock pins with high capacitance
-      Clock networks from multiple clock domains

2.0  Unbalanced slew rate on I/O cell

In most cases, the system clock enters the design through an I/O cell, for example from a crystal oscillator on the board. The timing characteristics of such I/O cells influence the clock tree synthesis result. I/O cells sometimes have an unbalanced slew rate, which introduces skew between the rising and falling edges at the clock source, and this skew propagates through the clock network. The clock tree synthesis engine, however, tries to build the clock tree so that rising and falling insertion delays are balanced. While balancing insertion delay between the two edges, it inserts redundant buffers and inverters. Consequently, the clock network latency increases.


To avoid this problem, an I/O cell with equal rise and fall slew rates should be selected for the clock I/O. However, it is difficult to find a perfectly balanced I/O cell. Figure 2.1 shows a crystal oscillator I/O cell characterized in Synopsys Liberty format. From the cell delay lookup table, there is about 300ps difference between the rising and falling propagation delays. This delay difference propagates through the clock network and finally reaches the sink pins. To prevent asymmetric propagation delay from the clock root, the clock definition point should be moved to the output pin of the I/O cell. For example, the original clock definition is as follows:

create_clock -name xclk -period 40 -waveform {0 20} [get_ports {XCLK}]

The improved clock definition is as follows:

create_clock -name xclk -period 40 -waveform {0 20} [get_pins {U_PAD_XCLK/XC}]

The clock source is then defined at pin U_PAD_XCLK/XC, so the clock no longer propagates through the I/O cell and the slew of the I/O cell no longer affects clock tree synthesis.


3.0  Clock pin with high capacitance

For some IP macros, the input capacitance of the clock pin is very high. The input capacitance has two components. One is the intrinsic input capacitance, which is characterized in the Synopsys Liberty library; the other is the extrinsic capacitance caused by parasitic coupling to neighboring signal wires. The extrinsic capacitance depends on the geometry of the pin and of the coupled signal wires. For pins with small geometry, the coupling is weak and the extrinsic capacitance is negligible. For pins with a high aspect ratio, however, the coupling becomes considerable. This situation occurs when the rest of the macro is covered by a route guide, as in the following example. Because the route guide represents invisible routes in a certain layer, the EDA tool treats it as ‘virtual signal wires’ during parasitic extraction, and the whole route guide is treated as one fat wire. The resulting parasitic model is therefore a pessimistic estimate of the extrinsic capacitance.


Figure 3 shows that the pin “Pclk” is surrounded by route guides on the same layer. Because the pin is long and thin, its coupling to the surrounding route guides is significant, which makes the extrinsic capacitance dominant over the intrinsic capacitance.

In most cases, however, this parasitic model is too pessimistic. The real metal density around the pin may be only a small fraction of the route guide, so when the route guide is used for parasitic extraction, the extrinsic capacitance is greatly overestimated. This has an adverse impact on clock tree synthesis.

The overestimated extrinsic capacitance leads to longer clock latency. In the first step of clock tree synthesis, IC Compiler fixes DRC violations (such as ‘max capacitance’, ‘max transition’ and ‘max fan-out’) by buffer insertion. It keeps inserting buffers at pins with DRC violations until the violation is fixed or the maximum buffer level is reached. If an individual sink pin has a very high input capacitance, buffers or inverters with higher driving strength have to be inserted close to this pin to satisfy the ‘max capacitance’ and ‘max transition’ constraints. This adds insertion delay to that clock tree branch. The second step of clock tree synthesis is logic level balancing, in which the insertion delays of the different clock tree branches within the same clock domain are balanced. The global clock tree insertion delay is determined by the longest branch, so the insertion delay of all other branches is lengthened to match it.

A recommended solution is to insert a buffer with suitable driving strength as an ‘anchor cell’ very close to the sink pin, and to set a floating value on this buffer according to the real propagation delay from the buffer to the sink pin. The sink pin is then isolated from the clock tree.
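The effect of one heavy branch on the whole clock tree, and of isolating it, can be illustrated with a toy balancing model. The branch names and delays are hypothetical, and the model simplifies the float-pin mechanism by treating the isolated branch as excluded from balancing altogether; it is not how IC Compiler computes latency.

```python
def balance(branches, isolated=()):
    """Toy model of logic level balancing.

    branches: branch name -> insertion delay (ns) before balancing.
    isolated: branches left out of balancing (e.g. behind an anchor cell).
    Returns (global insertion delay, padding added to each balanced branch).
    """
    balanced = {n: d for n, d in branches.items() if n not in isolated}
    target = max(balanced.values())      # the longest branch sets the latency
    return target, {n: target - d for n, d in balanced.items()}


# Hypothetical branches: the macro pin needs extra buffering for its high capacitance.
branches = {"macro_pclk": 1.8, "regs_a": 0.9, "regs_b": 1.0}

latency_all, padding_all = balance(branches)
latency_iso, padding_iso = balance(branches, isolated={"macro_pclk"})

print(f"balanced with macro pin : {latency_all:.1f} ns")
print(f"macro pin isolated      : {latency_iso:.1f} ns")
```

In this toy case the register branches are padded by 0.9ns and 0.8ns when the macro pin takes part in balancing, and by only 0.1ns and 0ns once it is isolated.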


4.0  Clock networks from multiple clock domains

In modern system-on-chip architectures, multiple clock domains are a prevalent design style. A multi-clock-domain design typically consists of multiple islands, each operating synchronously on its own independent clock, with dedicated interfaces that manage communication between clock domains. This architecture belongs to a class of designs called globally asynchronous, locally synchronous (GALS).


Figure 4 is a generic illustration of the GALS design style. The different functional units in an SoC design can be in separate clock domains. This scheme lets each domain operate at its optimal frequency and helps minimize the clock skew of the entire chip and the clock latency of each independent clock network. The chip may receive multiple copies of the system clock, and multiple PLLs are used to generate the clocks for each of the synchronous units. When multiple clocks are generated in a design, the relationships between the clock domains depend on how the clocks are generated and how they are used. The relationship between two clocks is either synchronous or asynchronous. Clocks CK1, CK2 … CK4 have different phases and frequencies, so they are asynchronous with respect to each other. For internally generated clocks in each synchronous unit, two clocks that share a common source and have a fixed phase relationship are classified as synchronous with respect to each other.

4.1    Clock network in synchronous clock domain

The clock tree synthesis engine of IC Compiler is based on a synchronous clock tree model. During clock tree synthesis, IC Compiler treats generated clocks and their master clock as one synchronous group, and the clock trees within the same synchronous group are synthesized and balanced jointly. Balancing generated and master clocks together can be a problem because the number of sink pins in the synchronous group may be very large. Logic level balancing is based on buffer/inverter insertion, so clock tree branches with shorter insertion delay are lengthened to meet the global balance requirement.

In most cases, however, there is no communication between the generated clock group and the master clock group, and balancing between these two clocks is unnecessary. Optimizing clock trees in a synchronous group requires careful analysis of the clock structure and design constraints, from which the relationships between clocks and the interactions between synchronous units can be extracted. The following description presents an implementation approach in detail.

This clock tree optimization method builds shorter clock trees by preventing balancing between synchronous units that do not communicate. The optimization sequence has three steps: clock structure analysis, design constraint analysis, and clock tree exception definition.

The first step is clock structure analysis. IC Compiler can print the structure of the clock trees. Figure 4.1 shows the clock structure diagram, from which the following information can be obtained:

-      The design is separated into four clock domains: the xclk, pll1_clk, pll2_clk, and pll3_clk clock domains.

-      In the xclk clock domain, the master clock is xclk, and the generated clocks are padc_mclk and pmu_mclk. In the pll1_clk clock domain, the master clock is pll1_clk and the others are generated clocks. Similarly, pll2_clk is the master clock of the pll2_clk clock domain and pll3_clk is the master clock of the pll3_clk clock domain; the other clocks in those domains are generated clocks.

-      In every independent clock domain, the generated clocks and the master clock have a fixed phase and frequency relationship, so they are synchronous within that domain. The clock tree synthesis engine will balance all sinks in the same clock domain.


The second step is to analyze the design constraints, which give information about the relationships between clocks, as shown in Figure 4.1.0:



-      In the xclk clock domain, the generated clocks pmu_clk and padc_mclk and the master clock xclk are defined as asynchronous groups, which means there is no communication between these clock clusters. It is therefore not necessary to balance the sinks among the three clocks.

-      In the pll1_clk clock domain, the generated clocks cpu_clk, cpu_aclk, atclk, and pclkdbg and the master clock pll1_clk are specified as synchronous. The clk_out clock, the pll_replica_mclk clock, and the others are specified as asynchronous to one another, which means that the clk_out clock clusters, the pll_replica_mclk clock clusters, and the others do not communicate with each other in the design. So clk_out and pll_replica_mclk do not need to be balanced with the others.

-      Similarly, in the pll2_clk clock domain, ddr_phy_clk and vdac_pixel_clk1 do not need to be balanced with the others. In the pll3_clk clock domain, peri_clk, effuse_mclk, i2c_mclk, pwm_mclk, and timer_mclk do not need to be balanced with the others.

The third step is to define the clock exceptions with the ‘set_clock_tree_exceptions’ command. It is important to set clock exceptions correctly because they affect the clock tree synthesis result. Take the ddr_phy_clk clock in the pll2_clk clock domain as an example. According to the previous analysis, the pin ‘u_top_design/u_cpu_sub/u_clkrst_ddrc_mclk/ddrphy_clk_div/CK’ is specified as a float pin, which defines the clock insertion delay seen from a cell’s clock pin. The floating value is determined by the clock tree latency of this clock tree branch, which can be obtained with the following command:

report_clock_timing -from u_top_design/u_cpu_sub/u_clkrst_ddrc_mclk/ddrphy_clk_div/CK \
     -type latency -verbose

Suppose that the float value is 1.0ns. The clock exceptions are then set with the following commands:

set float_pin [get_pins u_top_design/u_cpu_sub/u_clkrst_ddrc_mclk/ddrphy_clk_div/CK]

set float_value 1.0

set_clock_tree_exceptions -float_pin_max_delay_rise $float_value \
                          -float_pin_max_delay_fall $float_value \
                          -float_pin_min_delay_rise $float_value \
                          -float_pin_min_delay_fall $float_value \
                          -float_pins $float_pin

As a result, the adverse effect of the ddr_phy_clk clock clusters is avoided and the clock latency in the pll2_clk clock domain is minimized.

4.2    Clock domain overlapping

In some designs, different clock domains share common paths, a situation known as clock domain overlapping. The clock tree synthesis engine of IC Compiler takes clock domain overlapping into consideration: the overlapped parts of the clock trees are synthesized and balanced together.

Figure 4.2 shows that some branches of cpu_clk (the functional clock) and test_clk (the scan clock) overlap at the output of a mux. The ‘case_analysis’ value on the ‘selection’ pin of the mux controls which clock reaches the overlapping clock cluster. The general solution is to set the variable ‘timing_enable_multiple_clocks_per_reg’ to true and remove the case analysis, so that IC Compiler balances both clock domains in the common paths after the mux.

However, this solution may increase the clock latency on these paths. As shown in Figure 4.2, the original network latencies of the test_clk, cpu_clk, and overlapping clock clusters are 2.5ns, 1.1ns, and 1.2ns respectively. These three clock clusters belong to two clock domains, cpu_clk and test_clk. The overlapping clock cluster is shared by both domains, so its sink pins need to be balanced with the sink pins of both the test_clk and cpu_clk clock clusters. First, the network latency of the overlapping clock cluster is increased to 2.5ns when it is balanced with the test_clk clock cluster; the figure shows that 0.8ns of insertion delay is added to the common path. Then the second clock domain, ‘cpu_clk’, is balanced with respect to this 0.8ns insertion delay, and as a result the ‘cpu_clk’ clock cluster is extended by 0.9ns of insertion delay. This example shows that although clock tree synthesis is performed independently in each clock domain, the domains still influence each other, which may result in clock trees of poor quality.

A recommended solution is as follows:

First, set a case value on the select pin of the mux cell with the ‘set_case_analysis’ command:

                  set_case_analysis 1 [get_pins mux/s]

Second, set a floating value on the A0 pin of the mux cell with the ‘set_clock_tree_exceptions’ command:

set float_pin [get_pins mux/A0]

set float_value 1.2

set_clock_tree_exceptions -float_pin_max_delay_rise $float_value \
                          -float_pin_max_delay_fall $float_value \
                          -float_pin_min_delay_rise $float_value \
                          -float_pin_min_delay_fall $float_value \
                          -float_pins $float_pin

With these settings, insertion delay balancing stops at the floating pin rather than at the sink pins, so the extra buffer insertion is avoided and the resulting clock tree is shorter than before.

4.3    Pre-existing Cells in the Clock Network

In some cases, the clock network already contains buffers and inverters (and sometimes delay cells) before clock tree synthesis. These pre-existing cells may prevent clock tree synthesis from finding an optimal solution, which increases the clock latency. We can use the ‘report_clock_tree -summary -structure -level info’ command to analyze the clock structure and the ‘remove_clock_tree -clock tree’ command to remove these pre-existing cells before running clock tree synthesis.

4.4 Buffer-based Clock Network vs. Inverter-based Clock Network

Another issue in clock tree synthesis strategy is the choice of cell type for building the clock tree. There are two schemes: buffer-based clock networks and inverter-based clock networks. Choosing the right one can effectively minimize the clock latency.


Figure 4.4 compares the buffer-based clock network and the inverter-based clock network. As the figure shows, the clock buffer cell Buffer2 consists of two inverters: PMOS transistor P3 and NMOS transistor N3 make up the first inverter, and PMOS transistor P4 and NMOS transistor N4 make up the second. Generally, the P3-N3 pair is smaller than the P4-N4 pair, which gives a buffer lower input capacitance than an inverter of the same driving strength. The load that a buffer presents to the preceding driving cell is therefore smaller than that of an inverter. This allows buffer-based clock trees to have longer wires without ‘max capacitance’ violations. In addition, inverters have to be used in pairs, so a buffer-based clock network has fewer levels than an inverter-based one.

On the other hand, inside a clock buffer the first inverter sees only internal gate loading, whereas the second inverter sees the interconnect loading of the clock net, so the loading of the two inverters is asymmetrical. In contrast, every clock inverter in an inverter-based network sees symmetrical loading from the clock nets. This allows the inverter-based clock network to achieve better insertion delay and maintain a better duty cycle.

In summary, the choice of clock tree type is a trade-off between the number of clock tree levels and the duty cycle. It depends on the characteristics of the clock buffers and inverters and on the requirements of the design. We recommend building the clock tree with clock inverters only, if possible.


5.0  Conclusions and Recommendations

This article addressed the problem of how to minimize clock latency during clock tree synthesis with IC Compiler. First, the impact of I/O cells on clock tree synthesis was examined, and methods to reduce the adverse effect of I/O cells were discussed, such as modifying the clock definition point. Second, the impact of macro clock pins with high capacitance was taken into account, and measures to reduce this impact were specified. The third part emphasized the methodologies for analyzing clock structure and design constraints in multi-clock domains. Based on this analysis, a series of approaches to minimizing clock network latency are recommended.

6.0  Acknowledgements

We would like to thank our colleagues on the design service team at Britesemi for their kind support. We would also like to give our sincere thanks to Synopsys technical support.




ICC user guide for reading

The IC Compiler tool uses logic libraries to provide timing and functionality information for all standard cells. In addition, logic libraries can provide timing information for hard macros, such as RAMs.
The tool supports logic libraries that use nonlinear delay models (NLDMs) and Composite Current Source (CCS) models and automatically selects the timing models to use, based on the contents of the logic libraries. If the logic libraries contain a mixture of both NLDM and CCS models, by default, the tool uses the CCS capacitance and timing data; you can control this by setting the lib_pin_using_cap_from_ccs and lib_cell_using_delay_from_ccs variables before loading the libraries.
The link libraries are the logic libraries used to resolve cell references when the tool links the design. The first library specified in the link_library variable is the main library.

The target libraries are the logic libraries that the tool uses to perform physical optimization.
The Milkyway reference libraries contain timing constraints and physical information about the standard cells and macro cells in your logic library. In addition, these reference libraries define the placement unit tile. The technology file provides technology-specific information, such as the name and characteristics of each metal layer.

The physical library information is stored in the Milkyway design library. For each cell, the Milkyway design library contains several views of the cell, which are used for different physical design tasks. Commonly used views include
  • The layout (CEL) view
  • The place and route (FRAM) view
  • The metal fill (FILL) view
  • The power and ground connectivity (CONN) view
  • The error (ERR) view
If a Milkyway library does not already exist for your design, you need to create one and open it.

> Creating a Milkyway Design Library:

To create a Milkyway design library, use the create_mw_lib command.
Before working with the design, you must specify the Milkyway reference libraries used for the design. You can do this when you create the Milkyway design library with the create_mw_lib command or you can do it later by using the set_mw_lib_reference command.
Note:
You can use the set_mw_lib_reference command to define the Milkyway reference libraries associated with a Milkyway design library only when the Milkyway design library is closed. If the design library is open, close it with the close_mw_lib command before defining the reference libraries.
> You can change the technology file and Milkyway reference libraries associated with your design library.
Note: You can change the physical library information only when the Milkyway design library is closed.

> Before you process your design, you should use the check_library command to ensure that the logic libraries and physical libraries are correct and consistent.
By default, the check_library command performs consistency checking between the logic libraries specified in the link_library variable and the physical libraries referenced in the current Milkyway design library. If the check_library command reports any inconsistencies, you must fix them before you process your design.
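
As a rough illustration of the library setup steps above, the following icc_shell sketch creates a Milkyway design library, shows how the reference libraries could be redefined later, and then runs the consistency check. The library name, technology file, and reference library paths are hypothetical placeholders, and the sketch assumes that the link_library variable has already been set.

```tcl
# Sketch only: every name and path below is a hypothetical placeholder.
set my_mw_lib   design_lib                              ;# design library name
set my_tech     ./tech/process.tf                       ;# technology file
set my_ref_libs [list ./ref/std_cells ./ref/macros]     ;# reference libraries

# Create the design library and specify its reference libraries up front.
create_mw_lib -technology $my_tech \
              -mw_reference_library $my_ref_libs \
              $my_mw_lib
open_mw_lib $my_mw_lib

# To redefine the reference libraries later, the design library
# must be closed first, as the note above explains.
close_mw_lib
set_mw_lib_reference -mw_reference_library $my_ref_libs $my_mw_lib
open_mw_lib $my_mw_lib

# Check consistency between the logic libraries in link_library and the
# physical libraries referenced by the open Milkyway design library.
check_library
```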

> The IC Compiler tool can read designs in Milkyway, .ddc, or ASCII (Verilog) format.
1) Reading a Design in Milkyway Format:
  • The unit settings in the Milkyway design must be consistent with the unit settings in the main library (the first library in the link_library definition). To see the main library unit settings, use the report_lib command. To see the Milkyway design library unit settings, use the report_units command. The Milkyway design contains the floorplan information and timing constraints previously set on the design, so you do not need to annotate this information separately.
2) Reading a Design in .ddc Format:
  • Read the .ddc file for the design by using the import_designs command.
  • Annotate the floorplan information on the design.
    The .ddc design contains the scan chain information and timing constraints previously set on the design, so you do not need to annotate this information separately. If you are using a bottom-up flow, you must have FRAM views for all blocks in the top-level design.

> Note: Before reading the floorplan file, create the logical power and ground connections by using the derive_pg_connection command
> The tool maintains the SCANDEF data that describes scan chain characteristics and constraints, as well as the scan-stitched netlist. The netlist and SCANDEF data must be consistent with each other. To validate their consistency, use the check_scan_chain command.

> Validating the Floorplan Information:

  • Before running the place_opt command, run the check_physical_design command with the -stage pre_place_opt option.
    icc_shell> check_physical_design -stage pre_place_opt 
  • Before running the clock_opt command, run the check_physical_design command
    with the -stage pre_clock_opt option.
    icc_shell> check_physical_design -stage pre_clock_opt 
  • Before running the route_opt command, run the check_physical_design command with the -stage pre_route_opt option.
    icc_shell> check_physical_design -stage pre_route_opt 

> To validate the integrity of the design database, run the check_database command. By default, the check_database command verifies that
  • The power and ground network is consistent
  • There are no logical connections to physical-only cells
  • The UPF data is consistent, if it exists
  • The design has been uniquified and contains hierarchy preservation data
To perform only the logical checks, use the -netlist option. To perform only the physical checks, use the -physical option. You can increase the verbosity of the generated messages by setting the -verbosity option to medium or high (the default is low).
You should run the check_database command
  • After reading a design in ASCII format using the read_def command
  • Before running application commands, such as change_names, create_block_abstraction, place_opt, clock_opt, and route_opt. This can help save runtime in the case of a data error
  • Before using the write_verilog command.
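
A minimal sketch of where these checks could sit in a flow script, using only the options described above. The output file path is a hypothetical placeholder.

```tcl
# Full database check with more detailed messages before optimization,
# so that a data error is caught before runtime is spent in place_opt.
check_database -verbosity medium
place_opt

# Logical-only check before writing out the netlist.
check_database -netlist
write_verilog ./outputs/design.v   ;# hypothetical output path
```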

> Use the derive_pg_connection command to create the logical power and ground connections for leaf cells, hierarchical cells, and physical-only cells in both single-voltage and multivoltage designs.

  • To show the logical power and ground connections made by the derive_pg_connection command, use the report_cell_physical -connections command. You should rerun the derive_pg_connection command whenever the power and ground connections have changed, such as
    • After design planning
    • After using logic ECO to modify the design
    • After chip-finishing tasks

  • If the design does not yet have power and ground ports, use the -create_ports top option to create these ports.

  • If the design already has logical power and ground connections but you want to regenerate these connections, use the remove_pg_network -top command to remove the existing power and ground network before you run the derive_pg_connection command. By default, the derive_pg_connection command does not change existing power and ground pin connections. To reconnect the power and ground pins, use the -reconnect option when you run the derive_pg_connection command.

Note that the -reconnect option of the derive_pg_connection command does not reconnect tie-off nets.
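
The power and ground steps above might be scripted roughly as follows. This is a sketch under assumptions: the net and pin names VDD and VSS are hypothetical and must match the names used in your libraries and design.

```tcl
# Assumed names: power net/pin VDD, ground net/pin VSS (hypothetical).
# Create the logical PG connections and the top-level PG ports.
derive_pg_connection -power_net VDD -power_pin VDD \
                     -ground_net VSS -ground_pin VSS \
                     -create_ports top

# Review the connections that were made.
report_cell_physical -connections

# After a logic ECO or chip finishing, rerun the command. Existing PG pin
# connections are left untouched unless -reconnect is given, and even then
# tie-off nets are not reconnected.
derive_pg_connection -power_net VDD -power_pin VDD \
                     -ground_net VSS -ground_pin VSS \
                     -reconnect
```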

TLUPlus is a binary table format that stores the RC coefficients. The TLUPlus models enable accurate RC extraction results by including the effects of width, space, density, and temperature on the resistance coefficients.

TLUPlus-based extraction also uses a map file, which matches the layer and via names in the Milkyway technology file with the names in the ITF file.

For multicorner-multimode designs, use the set_tlu_plus_files command to specify the TLUPlus files for each scenario. The set_tlu_plus_files command only applies to the current scenario.
After specifying the TLUPlus files, you should validate them by running the check_tlu_plus_files command.
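
A short sketch of specifying and validating the TLUPlus files for the current scenario. The directory and file names are hypothetical placeholders.

```tcl
set tlu_dir ./tluplus   ;# hypothetical directory holding the TLUPlus data

# Specify the maximum and minimum TLUPlus models and the layer map file.
# In a multicorner-multimode design this applies to the current scenario
# only, so it has to be repeated for each scenario.
set_tlu_plus_files \
    -max_tluplus  $tlu_dir/max.tluplus \
    -min_tluplus  $tlu_dir/min.tluplus \
    -tech2itf_map $tlu_dir/tech2itf.map

# Validate the files against the technology file before extraction.
check_tlu_plus_files
```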
> The operating conditions of a design include the process, voltage, and temperature parameters under which the chip is intended to operate. The tool analyzes and optimizes the design under the conditions you specify.
For multicorner-multimode designs, specify an operating condition for each scenario in the design.

Note: The link path should contain only the maximum library.

To find out which libraries are defined as the maximum and minimum libraries, use the list_libs command. In the generated report, the uppercase letter “M” appears next to the maximum library, and the lowercase letter “m” appears next to the minimum library.

> Advanced on-chip variation (AOCV) is an optional method of accuracy improvement that determines varying derating factors for different clock paths based on the path lengths.

Setting Timing Constraints
At a minimum, the timing constraints must contain a clock definition for each clock signal, as well as input delay or output delay for each I/O port. This requirement ensures that all signal paths are constrained for timing.
To model the clock tree effects for placement before running clock tree synthesis, you should also define the uncertainty, latency, and transition constraints for each clock by using the set_clock_uncertainty, set_clock_latency, and set_clock_transition commands.

The tool does not optimize paths that are not constrained for timing. Before proceeding, use the check_timing command to verify that all paths are constrained. If the check_timing command reports unconstrained paths, run the report_timing_requirements command to verify that the unconstrained paths are false paths (the check_timing command considers false paths unconstrained).
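
The minimum constraint set described above could look like the following sketch. The clock name, port name, and all numeric values are hypothetical and chosen only for illustration.

```tcl
# Hypothetical clock: port "clk", 2.0 ns period.
create_clock -name clk -period 2.0 [get_ports clk]

# Input and output delays on every I/O port (the clock port is excluded
# from the input list).
set data_inputs [remove_from_collection [all_inputs] [get_ports clk]]
set_input_delay  0.4 -clock clk $data_inputs
set_output_delay 0.4 -clock clk [all_outputs]

# Model clock tree effects before clock tree synthesis.
set_clock_uncertainty 0.15 [get_clocks clk]
set_clock_latency     0.50 [get_clocks clk]
set_clock_transition  0.10 [get_clocks clk]

# Verify that all paths are constrained, then review timing exceptions
# to confirm that any unconstrained paths are intended false paths.
check_timing
report_timing_requirements
```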

The remove_ideal_network -all command removes ideal_network attributes, latencies, and transition times.

For multicorner-multimode designs, you must define timing constraints for each scenario.

Selecting the Delay Calculation Method
By default, the tool uses the Elmore delay model for preroute delay calculation and the Arnoldi delay model for routed clock and postroute delay calculation. To change the delay calculation model, use the set_delay_calculation_options command.
For preroute delay calculation, the tool computes delays based on estimated parasitic data for the nets. You can choose either Elmore or asymptotic waveform evaluation (AWE) as the delay model. The AWE delay model provides better accuracy and better correlation with postroute delay calculation.

For routed clock delay calculation and for postroute delay calculation, you can choose either Arnoldi or Elmore as the delay model. The Arnoldi delay model provides better accuracy and better correlation with PrimeTime delay calculation, especially for smaller geometries and for highly resistive nets.

You can use the compare_delay_calculation command to compare Elmore and Arnoldi results of sample delay calculations in your design. If the results are similar between the two methods, use the faster Elmore method. Otherwise, if you need the additional accuracy, use the Arnoldi method.

By default, the tool does not include crosstalk delta delays in the delay calculations. To extract coupling capacitances and include crosstalk delta delays in the postroute delay calculations, enter the following command:
icc_shell> set_si_options -delta_delay true 
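
One possible combination of the delay calculation settings discussed in this section is sketched below. The option names -preroute, -routed_clock, and -postroute are written as I understand them and should be confirmed against the set_delay_calculation_options man page for your tool version.

```tcl
# Preroute: AWE for better correlation with postroute results.
# Routed clock and postroute: Arnoldi for better correlation with PrimeTime.
# (Option names are an assumption to verify in the man page.)
set_delay_calculation_options -preroute awe \
                              -routed_clock arnoldi \
                              -postroute arnoldi

# Include crosstalk delta delays in postroute delay calculation.
set_si_options -delta_delay true
```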

Defining the Buffer Strategy for Optimization
During the optimization step, the place_opt command introduces buffers and inverters to fix timing and DRC violations. However, this buffering strategy is local to some critical paths. The buffers and inverters that are inserted might become excess later because critical paths change during the course of optimization. You can reduce the excess buffer and inverter counts after place_opt by using the set_buffer_opt_strategy command, as shown in the following example:
icc_shell> set_buffer_opt_strategy -effort low
This buffering strategy does not degrade the quality of results (QoR).

If you use the set_dont_use command to set the dont_use attribute on cells BUF1 and BUF2, the tool uses cells BUF1 and BUF2 to fix hold violations, but not setup and DRC violations.

Enabling Tie Cell Insertion
A tie cell is a special-purpose standard cell whose output is constant high or constant low and is used to hold the input of another cell at the given constant value.

High-Fanout Net Synthesis
During placement and optimization, the IC Compiler tool does not buffer clock nets as defined by the create_clock command, but it does, by default, buffer other high-fanout nets, such as resets or scan enables, using a built-in high-fanout synthesis engine.
The high-fanout synthesis engine does not buffer nets that are set as ideal nets or nets that are disabled with respect to design rule constraints.

Inserting Port Protection Diodes
The IC Compiler tool can automatically insert protection diodes on subdesign ports to prevent antenna violations at the top level. You insert the port protection diodes after floorplanning but before starting placement.

Performing Placement and Optimization 

The place_opt command performs coarse placement, high-fanout net synthesis, physical optimization, and legalization. In addition, it can perform clock tree synthesis, scan chain reordering, and power optimization.

Performing Power Optimization 
Leakage dissipation is low when the threshold voltage is high, and vice versa. However, switching delay increases when the threshold voltage is high, so using high-threshold-voltage cells to reduce leakage can violate the timing constraints of the design. If the logic library supports cells with multiple threshold voltages, using cells with a lower threshold voltage for timing-critical paths and cells with a higher threshold voltage for other paths can reduce the leakage power without violating the timing constraints.

During leakage-power optimization, the tool improves the leakage power only if it does not degrade timing.

Creating Multibit Register Banks
You can use the register banking flow to merge single-bit registers to form multibit register banks during placement and optimization.
Creating multibit register banks reduces
  • The design area due to the smaller area of one multibit register bank compared to that of multiple single-bit registers
  • The clock tree power consumption due to the smaller number of clock tree buffers

Analyzing the Placement Area Utilization
The default utilization is calculated as: (non-fixed_standard_cell_area + fixed_standard_cell_area) / (total_area – blocked_area)
whereas non-fixed-only utilization is calculated as: (non-fixed_standard_cell_area) / (total_area – fixed_standard_cell_area – blocked_area) 
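
To make the difference between the two formulas concrete, here is a small plain-Tcl sketch (it runs in tclsh, no tool commands needed). The procedure names and the area values are hypothetical.

```tcl
# Default utilization: all standard cells over the unblocked area.
proc default_utilization {nonfixed fixed total blocked} {
    expr {double($nonfixed + $fixed) / ($total - $blocked)}
}

# Non-fixed-only utilization: fixed cells are removed from both the
# numerator and the available area.
proc nonfixed_utilization {nonfixed fixed total blocked} {
    expr {double($nonfixed) / ($total - $fixed - $blocked)}
}

# Hypothetical areas, all in the same unit (for example, square microns).
set nonfixed 600.0
set fixed    100.0
set total    1000.0
set blocked  100.0

puts [format "default utilization:        %.3f" \
    [default_utilization $nonfixed $fixed $total $blocked]]   ;# 0.778
puts [format "non-fixed-only utilization: %.3f" \
    [nonfixed_utilization $nonfixed $fixed $total $blocked]]  ;# 0.750
```

With these numbers the default utilization is 700/900 and the non-fixed-only utilization is 600/800, so the two metrics can differ noticeably when a design contains many fixed cells.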

Reporting Quality-of-Results
You can generate a report on the quality of results (QoR) for the design in its current state by using the create_qor_snapshot command (or by choosing Timing > Create QoR Snapshot in the GUI). This command measures and reports the quality of the design in terms of timing, design rules, area, power, congestion, clock tree synthesis, routing, and so on.

Performing Magnet Placement
To improve congestion for a complex floorplan or to improve timing for the design, you can use magnet placement to specify fixed objects as magnets and have the tool move their connected standard cells close to them.
For best results, perform magnet placement before standard cells are placed.
Magnet placement allows cells to be overlapped by default. To prevent overlapping of cells, you can set the magnet_placement_disable_overlap variable to true, changing it from its default of false.
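
A minimal sketch of magnet placement with overlap prevention enabled. The macro instance name is hypothetical, and passing the magnet object as the argument of the magnet_placement command reflects my understanding of its usage, so check the man page before relying on it.

```tcl
# Prevent cells from overlapping during magnet placement (default is false).
set magnet_placement_disable_overlap true

# Pull the standard cells connected to a fixed macro close to it.
# "u_core/u_ram0" is a hypothetical instance name.
magnet_placement [get_cells u_core/u_ram0]
```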


Refining Placement
If your design shows large timing or design rule violations after you run the place_opt command, adjust the place_opt options and rerun place_opt.
If your design shows small timing or design rule violations after you run place_opt, run psynopt to fix these violations.
If your design has congestion violations after you run place_opt, rerun place_opt with high-effort congestion reduction (-congestion option). If your design still has congestion violations, you can refine the placement to fix these violations.
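
The refinement choices above can be summarized in a small sketch. The variable name and its values are hypothetical and simply record what you observed in the reports after place_opt.

```tcl
# Hypothetical selector set by the user after reviewing the reports:
# "small_violations" or "congestion".
set issue congestion

switch $issue {
    small_violations {
        # Small remaining violations: incremental optimization.
        psynopt
    }
    congestion {
        # Congestion violations: rerun with high-effort congestion reduction.
        place_opt -congestion
    }
}
```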