A 4-to-1 240 Gb/s PAM-4 MUX with a 7-tap Mixed-Signal FFE in 55nm BiCMOS

M. Verplaetse, H. Ramon, N. Singh, B. Moeneclaey, P. Ossieur, G. Torfs
2021 IEEE Custom Integrated Circuits Conference (CICC)
Next-generation high-speed wireline and optical communications will target single-lane data rates over 200Gb/s. For this, the generation and transmission of >100Gbaud PAM-4 is a key step. Recent transmitters in advanced CMOS and FinFET nodes [1,2] provide extensive transmit-side FFE capabilities at 64 and 56Gbaud, respectively. Speed limitations in these technologies will make the transition to >100Gbaud a challenge. Alternatively, InP-based multiplexers like [3] manage to reach >100Gbaud. They also offer the possibility to create high-swing output drivers, necessary to efficiently drive optical modulators. However, InP solutions lack the ability to introduce more complex equalization of the signal. BiCMOS-based transmitters, as in [4], enable the integration of more complex circuits than InP technologies, are capable of delivering the high signal swings required for optical drivers, and promise increased bandwidth compared to CMOS/FinFET. This paper presents a 120Gbaud PAM-4 TX incorporating a 7-tap FFE in a 55nm BiCMOS technology. The advantage of the presented FFE architecture is the efficient use of both digital and analog delay structures to obtain >100Gbaud operation with a large number of filter taps in a compact configuration.
The TX architecture incorporating a 4:1 PAM-4 MUX and 7-tap FFE is shown in Fig. 1. It accepts four input signals, which are processed by input decoders to obtain the MSB and LSB components. After retiming using two quarter-rate clocks that are 90° out of phase (CLK/2 and CLK/2 90°), the 4 MSB and 4 LSB signals are separately multiplexed and filtered using a 7-tap FFE (MUX-FFE), resulting in 2 independent 7-tap FFEs, one for the MSB and one for the LSB. The two quarter-rate clocks are derived from an externally supplied half-rate clock (CLK), which is divided on-chip. Using phase interpolators (PI), the phases of both quarter-rate clocks can be fine-tuned. The filtered MSB and LSB signals are combined in the output stage to obtain a full-rate PAM-4 signal. An output driver amplifies the PAM-4 signal to drive a 100Ω differential load. The four PAM-4 levels are created by scaling the overall gain in the separate FFEs.
Fig. 2 provides the block diagram of the input decoder. It can operate in a quarter-rate PAM-4 mode or a half-rate NRZ mode. An emitter follower (EF) input buffer is followed by a CTLE, providing tunable peaking around 16GHz in PAM-4 mode and 28GHz in NRZ mode.
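The TX data path described above (4:1 multiplexing, separate 7-tap filtering of MSB and LSB, summation into PAM-4) can be illustrated with a purely behavioural sketch. This is not the circuit implementation: the function names, the lane order, the -1/+1 bit mapping and the tap values are all assumptions chosen for illustration. The only property taken from the text is that the four levels arise from giving the LSB filter half the overall gain of the MSB filter.

```python
from typing import List, Sequence


def mux_4to1(lanes: Sequence[Sequence[int]]) -> List[int]:
    """Interleave four quarter-rate bit streams into one full-rate stream.

    Lane order 0, 1, 2, 3 is an assumption made for illustration.
    """
    if len(lanes) != 4 or len({len(lane) for lane in lanes}) != 1:
        raise ValueError("expected four lanes of equal length")
    return [lanes[k][n] for n in range(len(lanes[0])) for k in range(4)]


def ffe(bits: Sequence[int], taps: Sequence[float]) -> List[float]:
    """Symbol-spaced FIR filter on bits mapped to -1/+1 (zero initial state)."""
    x = [2 * b - 1 for b in bits]
    return [
        sum(c * x[n - k] for k, c in enumerate(taps) if n - k >= 0)
        for n in range(len(x))
    ]


def pam4_tx(msb_lanes, lsb_lanes, msb_taps, lsb_taps) -> List[float]:
    """Multiplex and filter MSB and LSB separately, then sum them."""
    msb = ffe(mux_4to1(msb_lanes), msb_taps)
    lsb = ffe(mux_4to1(lsb_lanes), lsb_taps)
    return [m + l for m, l in zip(msb, lsb)]


# Pass-through setting (hypothetical): only the first tap is non-zero, and the
# LSB filter has half the overall gain of the MSB filter.
MSB_TAPS = [1.0, 0, 0, 0, 0, 0, 0]
LSB_TAPS = [0.5, 0, 0, 0, 0, 0, 0]
levels = pam4_tx([[0], [0], [1], [1]], [[0], [1], [0], [1]], MSB_TAPS, LSB_TAPS)
print(levels)  # [-1.5, -0.5, 0.5, 1.5]
```

Setting non-zero pre- and post-cursor taps in both lists (keeping the 2:1 gain ratio) would add the same equalization to both components while preserving equal level spacing.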
Decoding the PAM-4 signal is done via 3 variable level shifters (LS) and slicers (S1, S2 and S3), which convert the PAM-4 signal into 3 bit streams Dup, Dmid and Dlow. The sliced signals are decoded further, with optional Gray decoding, to extract the MSB and LSB components. In NRZ mode, only the middle slicer (S2) is used and the half-rate signal is demultiplexed to obtain two quarter-rate bit streams Dmid and Dmid,180. All decoding is performed using quarter-rate logic. To monitor the performance of the input decoders, an additional subsampler is added, operating at 1/128th rate.
Fig. 3 provides the core 7-tap MUX-FFE architecture alongside the important analog and digital circuits. The filter architecture resembles the structure of a travelling-wave FIR filter [4], to which additional cross amplifiers are added (h1, h2, h4, h5). The architecture allows each delay value to equal (a multiple of) the tap spacing, unlike conventional analog travelling-wave FIR filters, where each delay element must have a physical value that is only a fraction of the tap spacing. This allows efficient implementation of the input delay line in the digital domain while maintaining a distributed analog summation at the output, which reduces loading so that a high bandwidth can be obtained. Digital delays of 2 bits at full rate (= 2Ts) are generated by delaying the retimed quarter-rate signals using high-speed latches. A two-stage 4:1 MUX topology is implemented based on a half-rate clock. Shunt inductive peaking is added in the second MUX stage to enhance the bandwidth. The filter coefficients are set by analog variable gain amplifiers (VGAs), implemented by 2 parallel differential pairs (Qv0-Qv1 and Qv2-Qv3) with variable tail currents, controlled by a 7-bit differential DAC (with outputs I0 + a and I0 − a, where I0 is the nominal bias current and "a" defines the gain setting). To save area, compact analog delay cells are used instead of passive delay lines.
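The three-slicer input decoding described above can be sketched in a few lines. The text does not state the slicer thresholds or what exactly the optional Gray decoding does on chip, so the following rests on assumptions: levels are normalized to -1, -1/3, +1/3, +1, thresholds sit midway between them, and the `gray` flag selects between a Gray labelling (00, 01, 11, 10 from lowest to highest level) and natural binary. The function names and the assignment of even/odd bits in NRZ mode are hypothetical.

```python
def slice_pam4(v, thresholds=(-2 / 3, 0.0, 2 / 3)):
    """Three slicers produce a thermometer code (d_low, d_mid, d_up)."""
    low, mid, up = thresholds
    return int(v > low), int(v > mid), int(v > up)


def decode(d_low, d_mid, d_up, gray=False):
    """Extract (MSB, LSB) from the thermometer code.

    The MSB is the middle slicer output in both labellings.
    """
    msb = d_mid
    if gray:
        lsb = d_low ^ d_up          # assumed Gray labels: 00, 01, 11, 10
    else:
        lsb = d_low ^ d_mid ^ d_up  # natural binary: 00, 01, 10, 11
    return msb, lsb


def demux_nrz(bits):
    """NRZ mode: split a half-rate stream into two quarter-rate streams.

    Which phase corresponds to Dmid and which to Dmid,180 is an assumption.
    """
    return bits[0::2], bits[1::2]


for level in (-1.0, -1 / 3, 1 / 3, 1.0):
    code = slice_pam4(level)
    print(code, decode(*code), decode(*code, gray=True))
```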
The active analog delay circuit is formed by the summation of 2 signal paths followed by an EF. The first path is formed by a normal differential pair (Qd0-Qd3). The second path is formed by a cascoded differential pair (Qd1-Qd2) with double gain that is connected to the output with the inverse sign. The emitters of the cascodes (Qd4-Qd5) are loaded with an additional capacitance. In this way, a right-half-plane zero is introduced, providing both bandwidth peaking and increased delay. The summation of both signal paths is combined with the summation of the different VGA signals, provided via separate cascode transistors (Qd6-Qd7), in a single resistive load Rsum with shunt inductive peaking to increase the bandwidth at the summation node. A fixed analog delay of around 8.3ps is obtained with a summation bandwidth >83GHz.
The architecture of the output stage is given in Fig. 4. The equalized MSB and LSB components are pre-amplified by separate emitter-degenerated differential pairs and summed together in a shared load Rsum,pre. The pre-drivers are followed by an EF and a 100Ω differential output driver with cross-coupled compensation capacitors and series/shunt inductors to compensate for the pad and ESD parasitic capacitance and the bump parasitic inductance. The output stage has a maximum differential output swing of 1.2Vpp when generating PAM-4 signals. A single filter tap (only MSB component) can provide up to 300mVpp swing. Hence, up to 8.5dB of overshoot can be introduced in the filter without having to lower the DC gain.
The chip is fabricated in a 55nm BiCMOS technology and flip-chipped on a high-speed PCB. The output is connected via a 2cm differential trace, a 1.8mm connector, a 10cm RF coaxial cable and DC blocking capacitors to the sampling heads of a 100GHz sampling oscilloscope. This leads to a total added loss of 8dB at 60GHz. A bench-top clock source is used to generate the half-rate clock (CLK) with a maximum frequency of 60GHz.
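Several of the figures quoted above can be cross-checked with simple arithmetic. The sketch below shows that one symbol period at 120Gbaud is about 8.3ps (matching the analog delay), that the Nyquist frequency is 60GHz (where the 8dB loss is quoted, and also the half-rate clock frequency), and one possible reading of the 8.5dB overshoot figure. The last calculation assumes that the MSB component carries 2/3 of the 1.2Vpp PAM-4 swing; the text does not spell out how the figure was derived, so this is an interpretation that happens to reproduce it. Function names are hypothetical.

```python
import math


def symbol_period_ps(baud_rate_hz):
    """One symbol period (tap spacing Ts) in picoseconds."""
    return 1e12 / baud_rate_hz


def nyquist_ghz(baud_rate_hz):
    """Nyquist frequency in GHz, equal to the half-rate clock frequency."""
    return baud_rate_hz / 2 / 1e9


def overshoot_headroom_db(v_max_pam4, v_single_tap, msb_fraction=2 / 3):
    """Ratio of the largest MSB swing to the swing of one tap, in dB.

    msb_fraction = 2/3 is an assumption: with equally spaced PAM-4 levels,
    the MSB contributes two thirds of the total peak-to-peak swing.
    """
    return 20 * math.log10(msb_fraction * v_max_pam4 / v_single_tap)


print(round(symbol_period_ps(120e9), 2))           # 8.33
print(nyquist_ghz(120e9))                          # 60.0
print(round(overshoot_headroom_db(1.2, 0.3), 1))   # 8.5
```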
A 92GS/s AWG generates the quarter-rate PAM-4 or half-rate NRZ signals. Via an SPI interface, on-chip settings such as FFE coefficients can be set and monitoring information can be read. Using a single 2.5V supply, the TX consumes 2W when the input decoders are configured for NRZ and 2.16W when configured to decode PAM-4. Of this power, approximately 680mW is consumed in the analog part (VGAs, delay circuits and output stage), 300mW in the MUX and retiming, and 560mW in clock distribution. The chip measures 3.03mm².
In Fig. 5, the measured PRBS15 NRZ (MSB only) and PAM-4 eye diagrams at 100, 112 and 120Gbaud are shown after optimization of the MSB and LSB FFE coefficients. The filters are optimized to reach a peak-to-peak voltage at the scope input of 500mVpp for the PAM-4 signals (measured at the center of the upper and lower noise bands) in all three cases. The measured level separation mismatch ratio (RLM) is >98.3%, close to the ideal 100%, in all cases. The maximum overshoots introduced by the 7-tap filter responses are 3, 5 and 6dB, respectively, with a DC gain of around 1.2dB. In Fig. 6, the TX performance is summarized and compared against several integrated state-of-the-art solutions in different technologies. This work shows a doubling of the operating rate compared to FinFET solutions and closely matches the speed and power obtained in InP solutions, which lack any FFE functionality.
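The power figures above can be put in perspective with a short calculation. The sketch sums the three itemized contributions, derives the remainder that the text does not itemize, and computes an energy per bit for the PAM-4 decode mode. The energy figure assumes that the quoted 2.16W applies at 120Gbaud PAM-4 (240Gb/s), which the text does not state explicitly. Names are hypothetical.

```python
# Itemized power in mW, as quoted in the text.
ITEMIZED_MW = {
    "analog (VGAs, delay circuits, output stage)": 680,
    "MUX and retiming": 300,
    "clock distribution": 560,
}
TOTAL_MW = {"NRZ input mode": 2000, "PAM-4 input mode": 2160}


def unitemized_mw(total_mw):
    """Power not covered by the three itemized contributions."""
    return total_mw - sum(ITEMIZED_MW.values())


def energy_per_bit_pj(power_mw, bit_rate_gbps):
    """mW divided by Gb/s gives pJ/bit."""
    return power_mw / bit_rate_gbps


for mode, total in TOTAL_MW.items():
    print(mode, unitemized_mw(total), "mW not itemized")

# Assumption: 2.16 W is drawn while transmitting 120 Gbaud PAM-4 = 240 Gb/s.
print(energy_per_bit_pj(TOTAL_MW["PAM-4 input mode"], 240), "pJ/bit")
```

The difference of 160mW between the two modes is consistent with the additional slicers being active in PAM-4 mode, although the text does not break this down further.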
doi:10.1109/cicc51472.2021.9431514 fatcat:rv4uomicgnedndzbv65bb3ptrq