4A LDOs, 40x HDMIs, wire bondings & SLVS signals: what could possibly go wrong in one brave night?
It was the 13th of March 2018. The next day a plane would pick me up and take me far enough north that I wouldn’t touch a PC for days. We had to finish this board: it was a must in order to start testing the LHCb UT front-end boards. So that day I swiped my card at 7 AM, and left at almost the same time one day later.
This is the story of a 24h rush design and a series of unfortunate events.
In research, you’re continuously bombarded by new things to keep in mind and under control. Maybe something didn’t go as planned (as always), or a new idea pops into your mind in the middle of a design, or a colleague is late sending you feedback; and if you still have time, your brain will decide that the previously missing info is an awesome thing to implement ASAP so that –“insert here whatever excuse”–.
The result was a PCB (Printed Circuit Board) that needed to be done in 24h. Of course, I managed to reuse some calculations and design blocks, but 60 to 80% of everything was done on this last day.
Let’s start with the design, shall we? The main goal of this board is to power up another PCB placed on top of it and route that PCB’s signals to a third board. Simple, huh?
First, we need a way to power up our front-end board, and since our ASICs happen to be really, REALLY vulnerable to noise, we must use a low-noise voltage regulator, leading us to the choice of 2x TPS74401 by Texas Instruments.
The PSRR (Power Supply Rejection Ratio) is one of the first things you look at when selecting a low-noise regulator. It answers the question: “how much of the noise on the input gets through to the output at a given frequency and input voltage?”. These two graphs literally show the attenuation a noise signal will “see” while traveling towards our front-end board (which must be noise free).
Note: we’re using two regulators not because we need to pull more current, but because our final front-end board is divided into two power sectors.
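To make the PSRR graphs a bit more tangible, here is a minimal sketch of how a rejection figure in dB translates into ripple left on the output. The function name and the example numbers (10 mV of input ripple, 40 dB of rejection) are hypothetical and are not read from the TPS74401 datasheet.

```python
def ripple_after_ldo(input_ripple_v: float, psrr_db: float) -> float:
    """Return the ripple amplitude (volts) left at the LDO output.

    PSRR is taken as a positive attenuation in dB, expressed as a
    voltage ratio, so the input ripple is divided by 10 ** (psrr_db / 20).
    """
    return input_ripple_v / (10 ** (psrr_db / 20))


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    input_ripple = 10e-3   # 10 mV of ripple on the LDO input
    psrr_at_freq = 40.0    # dB of rejection at the frequency of interest
    out = ripple_after_ldo(input_ripple, psrr_at_freq)
    print(f"Output ripple: {out * 1e6:.1f} uV")
```

Since PSRR drops as frequency goes up, the same calculation has to be repeated at every frequency where the upstream supply is noisy.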
After your schematic is done, a nice thermal simulation can give you enough confidence in your design that you might start to think it will not transform itself into a little ball of fire at start-up. At this point, you can start the layout and send it into production.
The first hint of a calm road ahead (Schematic)
One of the first things you have to do with an LDO is work out how to set the output voltage so that you get what you want. In this application we needed 1.2V, but at startup, we measured only 0.8V (try to guess why by looking at the following image and the TPS74401 datasheet). [SPOILER ALERT: solution after the image]
Yup, you have probably found it. I will admit it is a pretty stupid thing to get wrong, but at 2AM it is an easy mistake to make. This is why I always tell everyone I’m teaching or “mentoring” that they are not stupid for getting something wrong: most of the time it’s just physical and mental tiredness.
The error, in case you’re tired, is on the feedback line. An LDO’s FB pin is supposed to receive a certain voltage, almost always lower than OUT, so that the IC can adjust OUT to keep it stable inside a certain window.
To do so, a voltage divider is used, and that’s clearly what the schematic is missing: FB IS WIRED DIRECTLY TO OUT (it draws almost no current). The TPS therefore sees the full OUT voltage on FB, decides it is too high, and lowers OUT until FB matches its internal reference. With no divider in between, OUT can only settle at the reference voltage itself, which is also the lowest output the part can regulate to: right at 0.8V.
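The divider arithmetic behind this mistake can be sketched in a few lines. This is only an illustration: the 0.8 V reference is assumed from the voltage we measured with FB tied to OUT, the function names are made up, and the 10 kΩ bottom resistor is a hypothetical value, not the one used on the board.

```python
V_REF = 0.8  # volts; assumed FB reference, equal to the 0.8 V measured with FB tied to OUT


def ldo_vout(r_top: float, r_bottom: float, v_ref: float = V_REF) -> float:
    """Output voltage of an adjustable LDO with a resistive feedback divider.

    r_top sits between OUT and FB, r_bottom between FB and GND.
    """
    return v_ref * (1 + r_top / r_bottom)


def top_resistor_for(v_out: float, r_bottom: float, v_ref: float = V_REF) -> float:
    """Top resistor needed to reach v_out for a chosen bottom resistor."""
    return r_bottom * (v_out / v_ref - 1)


if __name__ == "__main__":
    r_bottom = 10e3  # hypothetical bottom resistor, ohms

    # FB wired straight to OUT is the same as a 0 ohm top resistor.
    print(f"FB shorted to OUT: {ldo_vout(0.0, r_bottom):.2f} V")

    r_top = top_resistor_for(1.2, r_bottom)
    print(f"Top resistor for 1.2 V: {r_top:.0f} ohm")
    print(f"Check: {ldo_vout(r_top, r_bottom):.2f} V")
```

With a 0 Ω top resistor the formula collapses to the reference voltage, which is exactly what the board did.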
This issue was promptly fixed with a little PCB surgery and two solder points.
LESSON: DRC can’t check topology. Check every pin one by one when doing the schematic and follow the current (also, look at the Datasheet and at all the application diagrams you can find).
Everything powers up, but, hey, nope..
Even if your calculations are right and your LDO is putting out the right voltage, it doesn’t mean your LOAD HAS THE RIGHT VOLTAGE. That’s why sense lines exist, and that’s why our ASICs didn’t power up even after the 0.8V to 1.2V fix: we were still missing 70mV.
What? Yeah, 70mV is easy to lose when we’re talking about 2A flowing through a PCB, and if you have a thin enough plane or very thin traces and wires, it’s almost certain that you’re going to see those millivolts “disappear”.
What you usually do to avoid these issues is to connect a sense line to your LDO, letting it know how much it should compensate the OUT pin for resistive drops towards the load.
Otherwise, you can simulate how much voltage drop your components are going to see and compensate for it yourself (this is called Power Delivery Network analysis).
Unfortunately, it wasn’t done here: my PDN analyzer license had expired earlier that month, leaving me with just a tired brain to work with.
Lesson: ALWAYS consider what is going to happen to your voltage when long planes and lines are being used (voltage drops) and to your current density when thin traces are your design choice (higher current density means higher heat generated due to Joule effect and higher voltage drop).
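Even without a PDN tool, a back-of-the-envelope check catches most of this. The sketch below uses Ohm’s law and the resistance of a rectangular copper conductor; the trace geometry (100 mm long, 1 mm wide, 35 µm thick) is a hypothetical example, not the actual layout, and the resistivity is the usual room-temperature value for copper.

```python
RHO_CU = 1.72e-8  # ohm*m, copper resistivity near 20 degC


def trace_resistance(length_m: float, width_m: float, thickness_m: float,
                     rho: float = RHO_CU) -> float:
    """DC resistance of a rectangular trace: R = rho * L / (w * t)."""
    return rho * length_m / (width_m * thickness_m)


def voltage_drop(current_a: float, resistance_ohm: float) -> float:
    """Ohm's law: V = I * R."""
    return current_a * resistance_ohm


def joule_heating(current_a: float, resistance_ohm: float) -> float:
    """Power dissipated in the conductor: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm


# The 70 mV we lost at 2 A corresponds to this much resistance in the path.
path_resistance = 0.070 / 2.0

if __name__ == "__main__":
    print(f"Path resistance for 70 mV at 2 A: {path_resistance * 1e3:.0f} mohm")

    # Hypothetical trace: 100 mm long, 1 mm wide, 35 um (1 oz) copper.
    r = trace_resistance(0.100, 1e-3, 35e-6)
    print(f"Trace resistance: {r * 1e3:.1f} mohm")
    print(f"Drop at 2 A: {voltage_drop(2.0, r) * 1e3:.1f} mV")
    print(f"Heat at 2 A: {joule_heating(2.0, r) * 1e3:.0f} mW")
```

A few tens of milliohms is all it takes, which is why a single long, narrow trace can eat the whole margin of a 1.2V rail.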
“Events” is plural (Mechanics)
Another unfortunate issue that affected this board was the absence of a dynamic 3D assembly analysis, meaning that in the available time, no check was done to prove that the final assembly was feasible. I didn’t skip it lightly: I had checked all the components on other, already made boards, but one was “different”.
At some point later in the assembly phase, I received a call and one explicit picture: I had missed something obvious that was very well hidden on a previous board I had checked.
Yeah, again, it’s obvious: had I had access to a spare connector I would have noticed it on the spot, prior to submission, but I didn’t. One side of the ERF8 edge-mount connector has a plastic piece extending into the PCB, requiring a little cutout in the latter.
Solved by hand cutting that part of the connector.
Lesson: ALWAYS buy small quantities of components before producing your PCB and sending it to assembly. Try them on a prototype board and/or simulate their insertion and the volume they occupy on your future PCB (you can use a piece of stiff paper to mock it up, or go ahead with 3D CAD).
That nice thing called thermodynamics (Assembly)
Sometimes a good connection is a bad connection. I learned it the hard way.
Thermals (connections between large Cu planes and component pads, shaped to prevent heat from flowing away from the part while soldering) are always used when you want to hand solder boards with a big thermal mass. Otherwise you end up warming the whole PCB instead of only that little pad you want to solder on (to the point of making the operation almost impossible with a single soldering iron).
Another nice image that I received from the assembly house is the following. You can see that there are two things here they wanted to draw my attention to (unfortunately they wrote in Italian):
- Holes are too big and without solder paste on the top layer (actually they were sized just right by looking at the datasheet);
- The leftmost pad, electrically connected to GND, was connected to a very large Cu plane DIRECTLY (I’ve used a copper pour on Altium);
What could possibly happen? Well, of those 40 mini-HDMI connectors per board, lots of them had pins shorted together.
Let’s break it down: components are soldered by placing them on the PCB and then passing the board through what is called a “Reflow Oven”. Inside it, the temperature rises following a certain cycle until the final assembled state is reached. Each phase serves a specific purpose, but today we’re only going to focus on their influence on this matter.
After the board has gone through PREHEAT/SOAK, the final ramp to REFLOW soldering starts. If the board has a high enough thermal mass, like Theta-Carrier, it will not be able to keep up with the temperature profile. Thus, when big power planes sit next to small Cu details, the latter will reach reflow temperature faster, meaning their solder will melt sooner.
What happens then is almost black magic. Pins start to reflow, but the Cu plane doesn’t, and if Murphy is looking at you closely enough, your plane will reach reflow temperature when the small pads are already approaching the cooling phase, leading to an imbalance between the forces acting on the two pin groups. Forces? Yes: due to surface tension, once solder becomes liquid, it will try to pull whatever component sits on that pad towards the pad’s center.
The result? All those beautiful mini-HDMI were pulled towards their pin 1 side (GND), leading to massive shorts on all pins as shown in the following picture.
Lesson: Never merge power pins with their planes, especially if you have a very high thermal mass close enough to cause problems. Never enlarge solder mask opening or pin copper ONLY ON ONE SIDE OF YOUR COMPONENT. If you change something on the pads, it MUST be done on ALL pads, or at least in a weighted and smart way.
Finally, the board arrives: let’s test our front end
Once the board is here we can start using it to test our front-end board (this front-end board). To do so we have to carry out a procedure called “Wire Bonding”, a way of connecting two different substrates by using thin wires (25 µm in diameter) and thermo-compression, shown in the picture.
The issue here is rather hidden in how our machine operates. To do wire bondings, or to do any kind of operation on a PCB or whatever you are working on, you need clearance. You literally need volume/space into which you can move yourself or tools needed to finish the required operation.
As you can see in the CAD drawing, we used to have a rather simple 2D model to highlight how much we have to stay away from the bonding point with our components.
Unfortunately, that nice picture only shows clearance constraints NEAR the bonding point, NOT FAR AWAY (3-4 cm or more).
I think you’ve noticed a 3D sectioned image and a picture of those big inductors I selected right at the start of this story. This is why I posted them.
On the left, you can see highlighted in blue all the components at risk of colliding with the bonding head structure, namely parts taller than 5mm (from the PCB surface) and far away from the bonding area. On the right, the two huge inductors. What’s the final result? Bonding cannot be performed because the head support collides with PCB components.
As a result, we had to move our front-end test board away from the collision point. Luckily, the show-stopping collision was taken off the table with a shift of only about 3mm. Had it needed more than 5mm, wire bonding would have been almost impossible (in that case we would have removed the two inductors).
Lesson: If you have to perform very delicate operations on your PCB, SIMULATE THEM WITH 3D models of your tools. Right now we have the full bonding head under 3D CAD design so that we can place it on the PCB Altium project and clearly see these issues before committing to PCB manufacturing and Assembly.
We have signals, but, hey, what the…. (Signal Recovery)
Propagating one signal is “easy”: you have to route it in an impedance-controlled fashion and be aware of a few signal integrity concepts. But if you want to propagate 10 or even 100 SLVS lines, you had better know what to do and what to check.
One of the main issues of synchronous digital communication is that you have to wait for the right time to sample your signal. Usually that is when the CLK (clock) signal is either HIGH, LOW or changing state.
In the following picture, you can see an example with two signals. We are sampling at CLK=LOW, thus both SIG1 and SIG2 MUST CHANGE LOGIC STATE WHILE CLK=HIGH.
You can see from the graph on top that both of them have critical transitions right at the borderline of our sampling window, but they’re still fine.
What happens then at the receiver end when SIG2 is routed slightly longer or shorter than SIG1 and CLK? Yup, it will arrive with a certain delay, and if that delay is high enough, the logic level transition will happen INSIDE OUR SAMPLING WINDOW!
With tens of cables and almost 14,000 tracks on Theta-carrier alone, this is exactly what happened, resulting in corrupted data and the need to fix the issue by slightly extending one signal path (thanks to all those jumpers I had placed) and reversing the clock polarity.
Lesson: if you have to sample multiple signals at the same time, make sure that their track length is almost the same and that they are routed on very similar substrates, otherwise the propagation speed will change and you’ll end up having some of them delayed just too much for your system to recover them.
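To get a feeling for how much length mismatch matters, here is a rough sketch that turns a length difference into timing skew and compares it with the margin available around the sampling window. The effective dielectric constant (4.0), the trace lengths and the 50 ps margin are hypothetical values chosen for illustration; they are not measurements from Theta-Carrier.

```python
C0 = 299_792_458.0  # speed of light in vacuum, m/s


def propagation_delay(length_m: float, eps_eff: float) -> float:
    """Delay of a trace: t = L * sqrt(eps_eff) / c0."""
    return length_m * (eps_eff ** 0.5) / C0


def skew(length_a_m: float, length_b_m: float, eps_eff: float) -> float:
    """Absolute arrival-time difference between two traces on the same substrate."""
    return abs(propagation_delay(length_a_m, eps_eff)
               - propagation_delay(length_b_m, eps_eff))


def violates_window(skew_s: float, margin_s: float) -> bool:
    """True when the skew pushes a transition past the available timing margin."""
    return skew_s > margin_s


if __name__ == "__main__":
    eps_eff = 4.0     # hypothetical effective dielectric constant
    margin = 50e-12   # hypothetical timing margin, seconds

    # SIG2 routed 10 mm longer than SIG1 (hypothetical lengths).
    s = skew(0.110, 0.100, eps_eff)
    print(f"Skew: {s * 1e12:.1f} ps")
    print(f"Transition lands in the sampling window: {violates_window(s, margin)}")
```

The same functions also show why the substrate matters: two traces of equal length on materials with different effective dielectric constants still arrive at different times.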
A happy ending
In the end, everything failed, and everything worked. This is not a story about a disaster that could not be recovered, which matters because in research we don’t usually get respins as in other industries: you have to get it right the first time.
This is a story about some of the many parameters you have to keep in mind and to check in your workflow to obtain a first version, first-time success in your next design.
Building on this story, we’re developing computer-aided methods to check everything in our work packages: thermal simulation to avoid balls of fire, signal integrity tools to avoid corrupted data at some end of the PCB, and 3D models of our machines to animate our operations.
And on top of that, a little reminder that everyone can fail at something, especially if he or she is working for long hours at the same thing.
Don’t cringe at your “failures”. As you can see, we are all human and, frankly, I’ve learned more from this bad PCB (and from trying to fix everything) than from many of the other well-done ones.