ABSTRACT
This article focuses on minimizing clock latency during physical implementation with Synopsys IC Compiler. Obstacles to optimizing clock networks can come from many sources. This article covers three of them: unbalanced slew rates on I/O cells, clock pins with high capacitance, and clock networks spanning multiple clock domains. For each issue, practical solutions for optimizing clock latency are discussed. By employing the techniques in this article, clock tree latency can be effectively minimized.
Leo Zeng Martin Ma Kimi Jin
leo.zeng@britesemi.com
martin.ma@britesemi.com
kimi.jin@britesemi.com
1.0 Introduction
As modern system-on-chip designs continue to grow in functional complexity, it becomes necessary to implement higher clock frequencies, more clock domains, and more sophisticated clock structures. As process geometries scale down to 65 and 40 nanometers, advanced technology nodes allow high-density integration and larger chip sizes. This means that the loading of clock networks becomes heavier than before, and clock networks may have to traverse longer distances. Furthermore, on-chip variation becomes more significant in terms of clock uncertainty and latency, which makes timing closure a tough job. For example, the timing sign-off criteria for a TSMC 65nm low-power design are as follows:
WC corner for setup check:
set_timing_derate -early 0.95
set_timing_derate -late 1.0
BC corner for hold check:
set_timing_derate -early 1.0
set_timing_derate -late 1.1
Assume that the clock frequency is 800 MHz (a 1.25ns clock cycle) and that the launch and capture clock latencies are both 10ns. With the 5 percent early derate applied to one of the two paths, the long latency generates 500ps of clock uncertainty between the launch and capture clock paths, which is 40 percent of the clock cycle. The setup margin therefore decreases by 40 percent of the cycle. The insufficient setup margin causes a great number of violations and degrades chip performance. Hold time analysis is affected in the same way.
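The arithmetic behind the 800 MHz example can be checked with a short sketch. It uses only the numbers quoted above and assumes, for simplicity, that the launch and capture paths share no common segment, so the full 10ns latency is derated; the function name is illustrative only.

```python
def derate_uncertainty_ns(latency_ns, early_derate, late_derate):
    """Spread between the late-derated and early-derated copies of one latency."""
    return latency_ns * late_derate - latency_ns * early_derate


# Values taken from the example in the text (WC corner, setup check).
clock_freq_mhz = 800.0
latency_ns = 10.0
period_ns = 1000.0 / clock_freq_mhz  # 800 MHz corresponds to a 1.25 ns cycle

# Assumption: no common path between launch and capture, so the whole
# latency is derated on both sides.
uncertainty_ns = derate_uncertainty_ns(latency_ns, early_derate=0.95, late_derate=1.0)
fraction_of_cycle = uncertainty_ns / period_ns

print(f"clock period      : {period_ns:.2f} ns")
print(f"derate uncertainty: {uncertainty_ns * 1000:.0f} ps")
print(f"share of cycle    : {fraction_of_cycle:.0%}")
```

Because the uncertainty scales linearly with latency, halving the clock latency in this model also halves the margin lost to derating.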
Additionally, minimizing clock latency can reduce clock skew and power dissipation, and improve clock tree performance.
In this article, the following topics are presented:
• Unbalanced slew rate on I/O cell
• Clock pins with high capacitance
• Clock networks from multiple clock domains
2.0 Unbalanced slew rate on I/O cell
In most cases, the system clock enters the design through an I/O cell, for example from a crystal oscillator on the board. The timing characteristics of such I/O cells influence the clock tree synthesis result. I/O cells sometimes have unbalanced slew rates, which introduce skew between the rising and falling edges at the clock source, and this skew is then propagated through the clock network. The clock tree synthesis engine, however, tries to build the clock tree while balancing the rising and falling insertion delays. While balancing the insertion delay between the two edges, it inserts redundant buffers/inverters. Consequently, the clock network latency increases.
To avoid this problem, an I/O cell with equal rise and fall slew rates should be selected for the clock input. However, it is difficult to find a perfectly balanced I/O cell. Figure 2.1 shows a crystal oscillator I/O cell characterized in Synopsys Liberty format. From the cell delay lookup table, it can be seen that there is about a 300ps difference between the rising and falling propagation delays. This delay difference is propagated through the clock network and finally reaches the sink pins. To prevent asymmetric propagation delay from the clock root, the clock definition point should be moved to the output pin of the I/O cell. For example, the original clock definition is as follows:
create_clock -name xclk -period 40 -waveform {0 20} [get_ports {XCLK}]
The improved clock definition is as follows:
create_clock -name xclk -period 40 -waveform {0 20} [get_pins {U_PAD_XCLK/XC}]
The clock source is now defined at U_PAD_XCLK/XC, so the clock no longer propagates through the I/O cell, and the I/O cell’s slew no longer affects clock tree synthesis.
3.0 Clock pin with high capacitance
For some IP macros, the input capacitance on the clock pin is very high. The input capacitance has two components: the intrinsic input capacitance, which is characterized in the Synopsys Liberty model, and the extrinsic capacitance, which is caused by parasitic coupling to neighboring signal wires. The extrinsic capacitance depends on the geometry of the pin and of the coupled signal wires. For pins of small geometry, the coupling effect is weak and the extrinsic capacitance is negligible. For pins with a high aspect ratio, however, the coupling effect becomes considerable. This situation occurs when the rest of the macro is covered by a route guide, as shown in the following example. Since the route guide stands in for invisible routes on a certain layer, the EDA tool treats it as ‘virtual signal wires’ during parasitic extraction, and the whole route guide is treated as one fat wire. Hence the parasitic model gives a pessimistic estimate of the extrinsic capacitance.
From Figure 3, it can be seen that the pin “Pclk” is surrounded by route guides on the same layer. Because the pin is thin and long, its coupling to the surrounding route guides is significant. This makes the extrinsic capacitance dominant compared to the intrinsic capacitance.
Nevertheless, in most cases this parasitic model is too pessimistic. The real metal density around the pin could be only a small portion of the route guide. Therefore, when the route guide is used for parasitic extraction, the extrinsic capacitance is greatly over-estimated, which has an adverse impact on clock tree synthesis.
This over-estimated extrinsic capacitance causes longer clock latency. In the first step of clock tree synthesis, IC Compiler fixes DRC violations (such as ‘max capacitance’, ‘max transition’ and ‘max fan-out’) by buffer insertion. IC Compiler continues to insert buffers at pins with DRC violations until the violation is fixed or the maximum buffer level is reached. If an individual sink pin has a very high input capacitance, buffers/inverters with higher driving strength have to be inserted close to this pin to satisfy the ‘max capacitance’ and ‘max transition’ constraints. This adds extra insertion delay to this clock tree branch. The second step of clock tree synthesis is logic level balancing, in which the insertion delays of the different clock tree branches within the same clock domain are balanced. The global clock tree insertion delay is determined by the longest branch, so the insertion delay of the other clock tree branches is lengthened to match.
A recommended solution is to insert a buffer with suitable driving strength as an ‘anchor cell’ very close to the sink pin. The buffer can then be assigned a float value that corresponds to the real propagation delay from the buffer to the sink pin. In this way, the sink pin is isolated from the clock tree.
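The effect of the anchor cell on balancing can be illustrated with a toy model. The sketch below is not IC Compiler's algorithm. It only captures the statement that every branch is padded up to the longest one. All branch names and latency values are hypothetical.

```python
def balance(branch_latency_ns):
    """Pad every branch up to the slowest one, as logic level balancing does."""
    target = max(branch_latency_ns.values())
    padding = {name: round(target - lat, 3) for name, lat in branch_latency_ns.items()}
    return target, padding


# Hypothetical latencies in ns; none of these numbers come from the article.
# "macro_pclk" is the branch ending at the macro pin with over-estimated capacitance.
before = {"macro_pclk": 2.4, "regs_a": 1.1, "regs_b": 1.3}

# With an anchor buffer, the tree is balanced to the buffer instead of the
# macro pin: latency up to the anchor (hypothetical 0.9) plus the float value
# (hypothetical 0.3, standing for the real anchor-to-pin delay).
anchor_latency_ns = 0.9
float_value_ns = 0.3
after = dict(before, macro_pclk=anchor_latency_ns + float_value_ns)

target_before, padding_before = balance(before)
target_after, padding_after = balance(after)

print(f"before: target {target_before:.1f} ns, padding {padding_before}")
print(f"after : target {target_after:.1f} ns, padding {padding_after}")
```

In this model the macro branch no longer sets the balancing target once it is isolated, so the register branches need far less padding.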
4.0 Clock networks from multiple clock domains
In modern system-on-chip architectures, multiple clock domains have become a prevalent design style. A multi-clock domain design typically includes multiple islands, each operating synchronously and served by an independent clock within its clock domain, with dedicated interfaces to manage inter-clock-domain communication. This multi-clock domain architecture for SOCs belongs to a class of designs called globally asynchronous, locally synchronous (GALS).
Figure 4 is a generic illustration of the GALS design style. The different functional units in an SOC design can be on separate clock domains. This scheme gives each domain the flexibility to operate at its optimal frequency, and it helps minimize the clock skew of the entire chip and the clock latency of each independent clock network. The chip may receive multiple copies of the system clock, and multiple PLLs are used to generate the clocks for each of the synchronous units. When multiple clocks are generated in a design, the relationships between the clock domains depend on how the clocks are generated and how they are used in the design. The relationship between two clocks can be classified as synchronous or asynchronous. Clocks CK1, CK2 … CK4 have different phases and frequencies, so they are asynchronous with respect to each other. For internally generated clocks in each synchronous unit, if two clocks share a common source and have a fixed phase relationship, they are classified as synchronous with respect to each other.
4.1 Clock network in synchronous clock domain
The clock tree synthesis engine of IC Compiler is based on a synchronous clock tree structure model. During clock tree synthesis, IC Compiler considers generated clocks and their master clock as one synchronous group, and the clock trees within the same synchronous group are synthesized and balanced jointly. Balancing generated and master clocks together may pose a problem, because the number of sink pins in the synchronous group can be very large. Logic level balancing is based on buffer/inverter insertion, so clock tree branches with shorter insertion delay are lengthened to meet the global balance requirement.
However, in most cases there is no communication between the generated clock group and the master clock group, so balancing these two clocks is unnecessary. Optimizing clock trees in a synchronous group requires careful analysis of the clock structure and design constraints, from which the relationships between different clocks and the interactions between different synchronous units can be extracted. The following description provides a detailed implementation approach.
This clock tree optimization method aims to build shorter clock trees by preventing balancing between synchronous units that do not communicate. The optimization sequence can be divided into three steps: clock structure analysis, design constraint analysis, and clock tree exception definition.
The first step is clock structure analysis. The structure of the clock trees can be printed by IC Compiler. Figure 4.1 shows the clock structure diagram, from which the following information can be obtained:
• The design is separated into four clock domains: the xclk clock domain, the pll1_clk clock domain, the pll2_clk clock domain, and the pll3_clk clock domain.
• In the xclk clock domain, the master clock is xclk, and the generated clocks are padc_mclk and pmu_mclk. In the pll1_clk clock domain, the master clock is pll1_clk and the others are generated clocks. Similarly, in the pll2_clk clock domain the master clock is pll2_clk and the others are generated clocks; in the pll3_clk clock domain the master clock is pll3_clk and the others are generated clocks.
• The generated clocks and the master clock have a fixed phase and frequency relationship in every independent clock domain, so they are synchronous within the same clock domain. The clock tree synthesis engine will balance all sinks in the same clock domain.
The second step is to analyze the design constraints, which indicate the relationships between clocks, as Figure 4.1.0 shows:
• In the xclk clock domain, the generated clocks pmu_clk and padc_mclk and the master clock xclk are defined as an asynchronous group, which means there is no communication between any of these clock clusters. It is therefore not necessary to balance the sinks among the three clocks.
• In the pll1_clk clock domain, the relationships among the generated clocks cpu_clk, cpu_aclk, atclk and pclkdbg and the master clock pll1_clk are specified as synchronous. The relationships among the clk_out clock, the pll_replica_mclk clock and the others are specified as asynchronous. This means that the clk_out clock clusters, the pll_replica_mclk clock clusters and the others do not communicate with each other in the design, so the clk_out and pll_replica_mclk clocks do not need to be balanced with the others.
• Similarly, in the pll2_clk clock domain, the ddr_phy_clk and vdac_pixel_clk1 clocks do not need to be balanced with the others. In the pll3_clk clock domain, the peri_clk, effuse_mclk, i2c_mclk, pwm_mclk and timer_mclk clocks do not need to be balanced with the others.
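The constraint analysis for one domain can be written down as data. The sketch below encodes the pll1_clk relationships listed above as sets of mutually synchronous clocks and counts how many clock pairs actually need balancing. The grouping comes from the text. The data structure and function name are illustrative assumptions, not part of any tool.

```python
from itertools import combinations

# Clock relationships in the pll1_clk domain, as described in the text.
# Each set is one group of mutually synchronous clocks; clocks in different
# sets are asynchronous to each other.
pll1_groups = [
    {"pll1_clk", "cpu_clk", "cpu_aclk", "atclk", "pclkdbg"},
    {"clk_out"},
    {"pll_replica_mclk"},
]


def needs_balancing(clock_a, clock_b, groups):
    """Two clocks need balancing only if they sit in the same synchronous group."""
    return any(clock_a in group and clock_b in group for group in groups)


all_clocks = sorted(set().union(*pll1_groups))
pairs = list(combinations(all_clocks, 2))
balanced = [pair for pair in pairs if needs_balancing(*pair, pll1_groups)]

print(f"{len(balanced)} of {len(pairs)} clock pairs need balancing")
for clock_a, clock_b in pairs:
    if not needs_balancing(clock_a, clock_b, pll1_groups):
        print(f"no balancing needed: {clock_a} <-> {clock_b}")
```

Every pair reported as not needing balancing is a candidate for a clock tree exception in the third step.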
The third step is to define the clock exceptions with the ‘set_clock_tree_exceptions’ command. It is important to set clock exceptions correctly, as they affect the clock tree synthesis result. Take the ddr_phy_clk clock in the pll2_clk clock domain as an example. According to the previous analysis, the pin ‘u_top_design/u_cpu_sub/u_clkrst_ddrc_mclk/ddrphy_clk_div/CK’ is specified as a float pin, which defines the clock insertion delay seen from a cell’s clock pin. The float value is determined by the clock tree latency of this clock tree branch, which can be obtained with the following command:
report_clock_timing -from u_top_design/u_cpu_sub/u_clkrst_ddrc_mclk/ddrphy_clk_div/CK \
-type latency -verbose
Suppose that the float value is 1.0ns. The clock exceptions are then set with the following commands:
set float_pin [get_pins u_top_design/u_cpu_sub/u_clkrst_ddrc_mclk/ddrphy_clk_div/CK]
set float_value 1.0
set_clock_tree_exceptions -float_pin_max_delay_rise $float_value \
-float_pin_max_delay_fall $float_value \
-float_pin_min_delay_rise $float_value \
-float_pin_min_delay_fall $float_value \
-float_pins $float_pin
As a result, the adverse effect of the ddr_phy_clk clock clusters is avoided, and the clock latency in the pll2_clk clock domain is minimized.
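When several branches need the same treatment, the commands above can be generated from a table of pins and float values. The sketch below is one possible helper. It emits exactly the command form shown above and nothing else. The second pin path and its float value are hypothetical placeholders, and in practice each float value would come from the latency report of its own branch.

```python
def float_pin_commands(pin, float_value_ns):
    """Return the Tcl lines that mark `pin` as a float pin with one float value."""
    return [
        f"set float_pin [get_pins {pin}]",
        f"set float_value {float_value_ns}",
        "set_clock_tree_exceptions -float_pin_max_delay_rise $float_value \\",
        "-float_pin_max_delay_fall $float_value \\",
        "-float_pin_min_delay_rise $float_value \\",
        "-float_pin_min_delay_fall $float_value \\",
        "-float_pins $float_pin",
    ]


float_pins = {
    # Pin and value from the example in the text.
    "u_top_design/u_cpu_sub/u_clkrst_ddrc_mclk/ddrphy_clk_div/CK": 1.0,
    # Hypothetical second pin and value, for illustration only.
    "u_example/u_clk_div/CK": 0.8,
}

script = "\n".join(
    line
    for pin, value in float_pins.items()
    for line in float_pin_commands(pin, value)
)
print(script)
```

The printed text can be saved to a file and sourced before clock tree synthesis, which keeps the list of exceptions in one reviewable place.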
4.2 Clock domain overlapping
In some designs, different clock domains share common paths, which is referred to as clock domain overlapping. The clock tree synthesis engine of IC Compiler takes clock domain overlapping into consideration: the overlapped parts of the clock trees are synthesized and balanced collectively.
Figure 4.2 shows that some branches of cpu_clk (the functional clock) and test_clk (the scan clock) overlap at the output of a mux. The ‘case_analysis’ value on the select pin of the mux controls which clock can traverse to the overlapping clock cluster. The general solution is to set the variable ‘timing_enable_multiple_clocks_per_reg’ to true and then remove the case analysis, allowing IC Compiler to balance both clock domains in the common paths after the mux.
However, this solution may increase the clock latency on these paths. As shown in Figure 4.2, the original network latencies of the test_clk clock cluster, the cpu_clk clock cluster and the overlapping clock cluster are 2.5ns, 1.1ns and 1.2ns respectively. These three clock clusters belong to two clock domains, cpu_clk and test_clk. The overlapping clock cluster is shared by both domains, so its sink pins need to be balanced with the sink pins of both the test_clk and cpu_clk clock clusters. First, the network latency of the overlapping clock cluster is increased to 2.5ns when it is balanced with the test_clk clock cluster. From the figure it can be seen that a 0.8ns insertion delay is added to the common path. Then the second clock domain, cpu_clk, is balanced with respect to this 0.8ns insertion delay. As a result, the cpu_clk clock cluster is extended with a 0.9ns insertion delay. This example shows that although clock tree synthesis is carried out independently in each clock domain, the domains still influence each other, which may result in clock trees of poor quality.
A recommended solution is as follows:
First, set a case value on the select pin of the mux cell with the ‘set_case_analysis’ command:
set_case_analysis 1 [get_pins mux/s]
Second, set a float value on the A0 pin of the mux cell with the ‘set_clock_tree_exceptions’ command:
set float_pin [get_pins mux/A0]
set float_value 1.2
set_clock_tree_exceptions -float_pin_max_delay_rise $float_value \
-float_pin_max_delay_fall $float_value \
-float_pin_min_delay_rise $float_value \
-float_pin_min_delay_fall $float_value \
-float_pins $float_pin
By doing this, insertion delay balancing stops at the float pin rather than at the sink pins, so that the extra buffer insertion is prevented, giving a shorter clock tree than the former result.
4.3 Pre-existing Cells in the Clock Network
In some cases, the clock network contains pre-existing buffers and inverters (and even some delay cells) before clock tree synthesis. These pre-existing cells might prevent clock tree synthesis from finding an optimal solution, which increases the clock latency. We can use the ‘report_clock_tree -summary -structure -level info’ command to analyze the clock structure and the ‘remove_clock_tree -clock tree’ command to remove these pre-existing cells before running clock tree synthesis.
4.4 Buffer-based Clock Network vs. Inverter-based Clock Network
Another issue in clock tree synthesis strategy is the selection of the cell type for clock tree building. There are two clock tree building schemes: buffer-based clock networks and inverter-based clock networks. Choosing the right type can effectively minimize the clock latency.
Figure 4.4 compares the buffer-based clock network and the inverter-based clock network. As the figure shows, the clock buffer cell Buffer2 consists of two clock inverters: the PMOS transistor P3 and NMOS transistor N3 make up the first inverter, whereas the PMOS transistor P4 and NMOS transistor N4 make up the second inverter. Generally, the P3-N3 pair is smaller than the P4-N4 pair. This gives a buffer lower input capacitance than an inverter of the same driving strength, so a buffer presents less load to the preceding driving cell than an inverter does. This feature allows buffer-based clock trees to have longer wires without ‘max capacitance’ violations. In addition, inverters have to be used in pairs, so a buffer-based clock network has fewer levels than an inverter-based one.
On the other hand, in a clock buffer, the first inverter sees only the internal gate loading, whereas the second inverter sees the interconnect loading of the clock net. The loading of the two inverters inside the buffer is therefore asymmetrical. In contrast, clock inverters see symmetrical loading, which comes from the clock nets. This allows an inverter-based clock network to achieve better insertion delay and maintain a better duty cycle.
From the above, it can be seen that selecting the clock tree type involves a trade-off between clock tree levels and duty cycle. The choice depends on the characteristics of the clock buffers/inverters and the requirements of the design. We recommend building the clock tree with clock inverters only, if possible.
5.0 Conclusions and Recommendations
This article addressed the problem of how to minimize clock latency during clock tree synthesis with IC Compiler. First, the impact of the I/O cell on clock tree synthesis was examined, and methods to reduce the adverse effect of I/O cells, such as moving the clock definition point, were discussed. Second, the impact of macro clock pins with high capacitance was taken into account, and measures to reduce this impact were specified. Third, the methodologies for analyzing clock structure and design constraints in multi-clock domain designs were presented. Based on this analysis, a series of approaches to minimizing clock network latency are recommended.
6.0 Acknowledgements
We would like to thank our colleagues in the design service team at Britesemi for their kind support. We would also like to give our sincere thanks to Synopsys technical support.