When everything fails: 24h PCB design lesson

4A LDOs, 40x HDMIs, wire bondings & SLVS signals: what could possibly go wrong in one brave night?

3D-Section_Bonding-Analysis

It was the 13th of March 2018. The next day a plane would pick me up and take me far enough north that I wouldn’t touch a PC for days. We had to finish this board: it was a must to start testing the LHCb UT front end boards. So that day I swiped my card early in the morning, at 7 AM, and left at almost the same time one day later.

This is the story of a 24h rush design and a series of unfortunate events.

Theta-Carrier Front End board Hybrid adapter for LHCb UT by Nadim Conti
Theta-Carrier board: 2x 2A Low Noise LDO + 40x mini-HDMIs, up to 500V HV lines, bonding wires, and many SLVS signals to be routed everywhere

In research, you’re continuously bombarded with new things to keep in mind and under control. Maybe something didn’t go as planned (as always), a new idea pops into your mind in the middle of a design, or a colleague is late sending you some feedback. And if you still have time, your brain will decide that the previously missing info is an awesome thing to implement ASAP so that –“insert here whatever excuse”–.

The result was a PCB (Printed Circuit Board) that needed to be done in 24h. Of course, I managed to recycle some calculations and design parts, but 60 to 80% of everything was done on this last day.

Theta-carrier PCB Block Diagram - Nadim Conti
The block diagram: this board is basically a PSU and switchboard between 2 other testing boards and the front end PCB of LHCb UT (seen as “FULL THETA BONDING INTERFACE”).

Let’s start with the design, shall we? The main goal of this board is to power up another PCB placed on top of it and route its signals to another board. Simple, huh?

First, we need a way to power up our front end board, and since our ASICs happen to be really, REALLY vulnerable to noise, we must use a low noise voltage regulator, leading us to the choice of 2x TPS74401 by Texas Instruments.

PSRR - Power Supply Rejection Ratio on a DC-DC converter LDO - Nadim Conti
Power Supply Rejection Ratio vs Noise Frequency and Input power supply (TPS74401)

The PSRR is one of the first things you look at when selecting a low noise regulator, namely: “how much noise gets through it at a given frequency and input voltage”. These two graphs are literally showing us the attenuation a noise signal will “see” while traveling towards our front end board (that must be noise free).
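As a rough illustration of how to read those curves, a PSRR figure in dB translates into a voltage ratio of 10^(-PSRR/20). The minimal sketch below uses hypothetical numbers (a 10 mV input ripple, 60 dB and 30 dB of rejection); they are assumptions for the example, not values read from the TPS74401 datasheet.

```python
def output_ripple(input_ripple_v: float, psrr_db: float) -> float:
    """Ripple amplitude left at the LDO output for a given input ripple.

    PSRR is taken as a positive number of dB of attenuation.
    """
    return input_ripple_v * 10 ** (-psrr_db / 20)


# Hypothetical values, for illustration only
INPUT_RIPPLE_V = 10e-3

for psrr_db in (60.0, 30.0):
    ripple_uv = output_ripple(INPUT_RIPPLE_V, psrr_db) * 1e6
    print(f"PSRR {psrr_db:.0f} dB: 10 mV in -> {ripple_uv:.0f} uV out")
```

Since PSRR usually falls as the noise frequency rises, the same input ripple can end up far larger at the load at high frequency, which is why the curves matter more than a single headline number.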

Note: we’re using two of them not because we need to pull more current, but because our final front end board is divided into two power sectors.

Thermal simulation of a DC-DC voltage regulator LDO low noise - Nadim Conti
4-Layer reduced copper footprint thermal simulation to validate the LDO design

After your schematic is done, a nice thermal simulation can give you enough confidence in your design that you might start to think it will not transform itself into a little ball of fire at start-up. At this point, you can start the layout and send it into production.

theta carrier LDO filtering inductors
LDO and input filter stage: 2 massive 10A inductors are used to ensure no overheating and additional thermal mass for heat dissipation (heat sinks are too expensive…)

The first hint of a calm road ahead (Schematic)

One of the first things you have to do with an LDO is work out how to set the output voltage so that you get what you want. In this application we needed 1.2V, but at startup, we measured only 0.8V (try to guess why by looking at the following image and the TPS74401 datasheet). [SPOILER ALERT: solution after the image]

Theta carrier schematic issue 1
Find the issue

……… Done?

Yup, you have probably found it. I will admit it is a pretty stupid thing to get wrong, but at 2AM it is an easy mistake to make. This is why I always tell everyone I’m teaching or “mentoring” that they are not stupid for getting something wrong: most of the time it is just physical and mental tiredness.

The error, in case you’re tired, is on the feedback line. An LDO’s FB pin is supposed to receive a certain voltage, almost always lower than OUT, so that the IC can adjust OUT and keep it stable inside a certain window.

To do so, a voltage divider is used, and that’s clearly what the schematic is missing: FB IS WIRED DIRECTLY TO OUT (it draws almost no current). This way the TPS sees the full OUT voltage on FB, thinks OUT is too high and tries to lower it. FB falls together with OUT until it matches the internal 0.8 V reference, which is also the lowest output voltage the part can regulate, and the feedback loop settles right there, at 0.8 V.

Output-voltage-TPS-programming
OUT voltage selection with R1 and R2 as parameters. With R1 (2.49 kΩ) acting like a short and R2 (4.99 kΩ) acting as a load, 0.8 V is selected.
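The divider arithmetic is easy to double-check. The sketch below uses the adjustable-LDO relation VOUT = VREF × (1 + R1/R2) with the TPS74401’s 0.8 V reference and the resistor values from the caption; modelling the bug as R1 = 0 Ω is my own simplification of “FB wired directly to OUT”.

```python
V_REF = 0.8  # internal reference of the TPS74401, in volts


def ldo_vout(r1_ohm: float, r2_ohm: float, v_ref: float = V_REF) -> float:
    """Output voltage set by a feedback divider: R1 from OUT to FB, R2 from FB to GND."""
    return v_ref * (1 + r1_ohm / r2_ohm)


intended = ldo_vout(2490, 4990)  # divider as designed
as_built = ldo_vout(0, 4990)     # FB tied straight to OUT, i.e. R1 shorted

print(f"intended: {intended:.3f} V, as built: {as_built:.3f} V")
```

A thirty-second check like this, done against the schematic net by net, would have caught the missing divider before the board left the building.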

This issue was promptly fixed with a little PCB surgery and two solder points.

LESSON: DRC can’t check topology. Check every pin one by one when doing the schematic and follow the current (also, look at the Datasheet and at all the application diagrams you can find).

Everything powers up, but, hey, nope..

Even if your calculations are right and your LDO is putting out the right voltage, it doesn’t mean YOUR LOAD HAS THE RIGHT VOLTAGE. That’s why sense lines exist, and that’s why our ASICs didn’t power up even after the 0.8V to 1.2V fix: we were still missing 70mV.

final assembly with silicon sensor, front end board and theta carrier with bonding wires in place - nadim conti
Final assembly with silicon sensor, front end board and theta carrier with bonding wires in place. The path between the LDO (on the right) and ASICs (left) is long (5cm) and resistive, leading to 70mV of voltage drop.

What? Yeah, 70mV is easy to lose when we’re talking about 2A flowing through a PCB, and if you have a thin enough plane or very thin traces and wires, it’s almost certain that you’re going to see them “disappear”.

What you usually do to avoid these issues is to connect a sense line to your LDO, letting it know how much it should raise the OUT pin to compensate for resistive drops towards the load.

Otherwise, you can simulate how much voltage drop your components are going to see and compensate for it yourself (this is called Power Delivery Network Analysis).

Here it wasn’t done. Unfortunately, my PDN analyzer license had expired earlier that month, leaving me with just a tired brain to work with.

Lesson: ALWAYS consider what is going to happen to your voltage when long planes and lines are being used (voltage drops) and to your current density when thin traces are your design choice (higher current density means higher heat generated due to Joule effect and higher voltage drop).
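Even without a PDN tool, a back-of-the-envelope estimate catches this class of problem. The sketch below treats the supply path as a plain copper strip. The 2 A, 70 mV and 5 cm figures come from the story; the 1 mm effective width and 35 µm (1 oz) copper thickness are assumptions of mine, and vias, bonding wires, the return path and temperature are all ignored.

```python
RHO_CU = 1.68e-8  # resistivity of copper at about 20 degC, in ohm*m


def strip_resistance(length_m: float, width_m: float, thickness_m: float,
                     rho: float = RHO_CU) -> float:
    """DC resistance of a rectangular copper strip."""
    return rho * length_m / (width_m * thickness_m)


def ir_drop(current_a: float, resistance_ohm: float) -> float:
    """Voltage lost across a resistive path carrying a DC current."""
    return current_a * resistance_ohm


# Path resistance implied by the measurements in the story
implied_r = 0.070 / 2.0

# Hypothetical geometry: 5 cm long, 1 mm effective width, 35 um thick
r = strip_resistance(0.05, 1e-3, 35e-6)

print(f"implied path resistance: {implied_r * 1e3:.0f} mOhm")
print(f"estimated strip: {r * 1e3:.0f} mOhm, {ir_drop(2.0, r) * 1e3:.0f} mV at 2 A")
```

The point is the order of magnitude: a few tens of milliohms is all it takes, and a narrow neck in a plane gets there quickly.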

“Events” is plural (Mechanics)

Another unfortunate issue that affected this board was the absence of a dynamic 3D assembly analysis, meaning that in the available time no check was done to prove that the final assembly was feasible. I didn’t skip it lightly: I had checked all the components on other, already built boards, but one was “different”.

At some point later in the assembly phase, I received a call and one explicit picture: I had missed something obvious that was very well hidden in a previous board I had checked.

ERF8 edge mount PCB outline issue
Edge connector not fitting due to missing PCB edge slot

Yeah, again, it’s obvious. Had I had access to a spare connector, I would have noticed it on the spot, prior to submission, but I didn’t. One side of the ERF8 edge mount has a plastic piece extending into the PCB, requiring a little cutout in the latter.

Solved by hand cutting that part of the connector.

Lesson: ALWAYS buy small quantities of components before producing your PCB and sending it to assembly. Try them on a prototype board and/or simulate their insertion and the volume they occupy on your future PCB (you can use a piece of stiff paper to mock it up, or go ahead with a 3D CAD).

That nice thing called thermodynamics (Assembly)

Sometimes a good connection is a bad connection. I learned it the hard way.

thermals and direct connection
Thermal relief on the Right, direct plane connection on the Left

Thermals (connections between large Cu planes and component pads, made to prevent heat from flowing away from the part while soldering) are always used when you want to hand solder boards with a big thermal mass. Otherwise you end up warming the whole PCB instead of only the little pad you want to solder, to the point of making the operation almost impossible with a single soldering iron.

Another nice image that I received from the assembly house is the following. You can see that there are two things here they wanted to draw my attention to (unfortunately they wrote in Italian):

  1. Holes are too big and without solder paste on the top layer (actually they were sized just right by looking at the datasheet);
  2. The leftmost pad, electrically connected to GND, was connected DIRECTLY to a very large Cu plane (I had used a copper pour in Altium).
PCB footprint feedback -- issue during assembly
THT holes that are big enough let the component move around during reflow

What could possibly happen? Well, of those 40 mini-HDMI connectors per board, lots of them had pins shorted together.

Let’s break it down: components are soldered by placing them on the PCB and then passing the board through what is called a “Reflow Oven”. Inside it, the temperature rises following a certain cycle to reach the final assembled state. Each phase serves a specific purpose, but today we’re only going to focus on their influence on this matter.

REFLOW profile
Example of a reflow profile

After the board goes through PREHEAT/SOAK, the final ramp to REFLOW soldering starts. If the board has a high enough thermal mass, like Theta-Carrier, it will not be able to follow the temperature profile accordingly. Thus, when big power planes sit next to small Cu details, the latter will reach reflow temperature faster, meaning their solder will melt sooner.

What happens then is almost black magic. The small pins start to reflow, but the pad on the Cu plane doesn’t, and if Murphy is watching you closely enough, the plane will reach reflow temperature just as the small pads are approaching the cooling phase, leading to an imbalance between the forces acting on the two pin groups. Forces? Yes: due to surface tension, once solder becomes liquid it tries to pull whatever component sits on that pad towards the pad’s center.

The result? All those beautiful mini-HDMI connectors were pulled towards their pin 1 side (GND), leading to massive shorts on all pins as shown in the following picture.

PCB footprint soldering feedback

Lesson: Never merge power pins with their planes, especially if you have a very high thermal mass close enough to cause problems. Never enlarge solder mask opening or pin copper ONLY ON ONE SIDE OF YOUR COMPONENT. If you change something on the pads, it MUST be done on ALL pads, or at least in a weighted and smart way.

Finally, the board arrives: let’s test our front end

Once the board is here we can start using it to test our front end board (This front end board). To do so we have to carry out a procedure called “Wire Bonding”, a way of connecting two different substrates by using thin wires (25 µm in diameter) and thermo-compression, shown in the picture.

Wire bondings under the microscope at INFN - Nadim Conti - LHCb UT - SALT128
Wire bondings (99% Al, 1% Si) used to connect an ASIC to the front end PCB

The issue here is rather hidden in how our machine operates. To do wire bondings, or to do any kind of operation on a PCB or whatever you are working on, you need clearance. You literally need volume/space into which you can move yourself or tools needed to finish the required operation.

As you can see in the CAD drawing, we used to have a rather simple 2D model to highlight how much we have to stay away from the bonding point with our components.

bonding wire tip - wedge clearance.PNG
Unfortunately, that nice picture only shows clearance constraints NEAR the bonding point, NOT FAR AWAY (3-4 cm or more).

You have probably noticed the 3D sectioned image and the picture of those big inductors I selected, right at the start of this story. This is why I posted them.

On the left, you can see highlighted in blue all the components at risk of colliding with the bonding head structure, namely parts taller than 5mm (from the PCB surface) and far away from the bonding area. On the right, the two huge inductors. What’s the final result? Bonding cannot be performed because the head support collides with PCB components.

bonding machine head collision with filtering inductors - nadim conti
Bonding machine head collision with filtering inductors

As a result, we had to move our front-end test board away from the collision point. Luckily, the show-stopping collision was taken off the table by shifting it only about 3mm. Had it needed more than 5mm, wire bonding would have become almost impossible (in that case we would have removed the two inductors).

bonding head clearance after front end board repositioning
Bonding head clearance after front end board repositioning, on the left the bonding wedge and clamp performing wire-bonding.

Lesson: If you have to perform very delicate operations on your PCB, SIMULATE THEM WITH 3D models of your tools. Right now we have the full bonding head under 3D CAD design so that we can place it on the PCB Altium project and clearly see these issues before committing to PCB manufacturing and Assembly.
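Short of a full 3D model, even a crude height screen would have flagged the inductors. The sketch below applies only the 5 mm limit mentioned above; the designators and heights are hypothetical placeholders, not the real bill of materials, and a height-only check says nothing about where the bonding head actually travels.

```python
from dataclasses import dataclass


@dataclass
class Part:
    designator: str
    height_mm: float  # height above the PCB surface


MAX_HEIGHT_MM = 5.0  # limit quoted in the story


def collision_risks(parts: list[Part], max_height_mm: float = MAX_HEIGHT_MM) -> list[str]:
    """Designators of parts taller than the allowed height."""
    return [p.designator for p in parts if p.height_mm > max_height_mm]


# Hypothetical parts list, for illustration only
parts = [Part("L1", 7.0), Part("L2", 7.0), Part("U1", 1.0), Part("J1", 3.2)]
print(collision_risks(parts))
```

A list like this is only a first filter: it tells you which parts deserve a proper 3D check against the tool model.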

We have signals, but, hey, what the…. (Signal Recovery)

To propagate one signal is “easy”: you have to route it in an impedance controlled fashion and be aware of a few signal integrity concepts. But if you want to propagate 10 or even 100 SLVS lines, you had better know what to do and what to check.

Theta-carrier connected to the VLDB board from CERN and to our back end server setup with 48 fiber optic links
Theta-carrier connected to the VLDB board from CERN and to our back-end server setup with 48 fiber optic links and up to 40x HDMI cables (it means ca. 800 conductors going from one board to another).

One of the main issues of synchronous digital communication is that you have to sample your signal at the right time, usually when the CLK (clock) signal is HIGH, LOW or changing state.

In the following picture, you can see an example with two signals. We are sampling at CLK=LOW, thus both SIG1 and SIG2 MUST CHANGE LOGIC STATE AT CLK=HIGH.

You can see from the graph on top that both of them have critical transitions right at the borderline of our sampling window, but they’re still fine.

What happens then at the receiver end when SIG2 is routed slightly longer or shorter than SIG1 and CLK? Yup, it will arrive with a certain delay, and if that delay is high enough, the logic level transition will happen INSIDE OUR SAMPLING WINDOW!

data corruption due to line delay.jpg

With tens of cables and almost 14,000 tracks on Theta-carrier alone, this is exactly what happened. The result was corrupted data, which we fixed by slightly extending one signal path (thanks to all those jumpers I’d placed) and reversing the clock polarity.

Lesson: if you have to sample multiple signals at the same time, make sure that their track length is almost the same and that they are routed on very similar substrates, otherwise the propagation speed will change and you’ll end up having some of them delayed just too much for your system to recover them.
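To put a number on “almost the same”, a length mismatch can be converted into skew through the propagation delay per unit length, sqrt(ε_eff)/c. The sketch below assumes an effective dielectric constant of 4.0 and a 30 mm mismatch; both are illustrative guesses, not measurements from this board or its cables.

```python
C_MM_PER_PS = 0.299792458  # speed of light in vacuum, in mm per picosecond


def delay_ps_per_mm(eps_eff: float) -> float:
    """Propagation delay per millimetre for a given effective dielectric constant."""
    return eps_eff ** 0.5 / C_MM_PER_PS


def skew_ps(length_mismatch_mm: float, eps_eff: float) -> float:
    """Arrival-time difference between two lines that differ only in length."""
    return length_mismatch_mm * delay_ps_per_mm(eps_eff)


EPS_EFF = 4.0       # assumed effective dielectric constant
MISMATCH_MM = 30.0  # hypothetical length difference between SIG2 and CLK

print(f"{delay_ps_per_mm(EPS_EFF):.2f} ps/mm, "
      f"{skew_ps(MISMATCH_MM, EPS_EFF):.0f} ps of skew")
```

Comparing that skew with the width of the sampling window at your bit rate tells you how much mismatch you can afford; routing on a different substrate changes ε_eff, which is why matching length alone is not enough.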

A happy ending

In the end, everything failed, and everything worked. This is not a story about a disaster that could not be recovered. That matters, because in research we don’t usually get respins as in other industries: you have to get it right the first time.

This is a story about some of the many parameters you have to keep in mind and to check in your workflow to obtain a first version, first-time success in your next design.

Following this story, we’re developing computer-aided methods to check everything in our work packages: thermal simulation to avoid balls of fire, signal integrity tools to avoid corrupted data at some end of the PCB, and 3D models of our machines to animate our operations.

And on top of that, a little reminder that everyone can fail at something, especially when working long hours on the same thing.
Don’t cringe at your “failures”. As you can see, we are all human and frankly, I’ve learned more from this bad PCB (and from trying to fix everything) than from many of the well-done ones.
