If you want your code to be talkative then don’t make it eat exceptions!

Garvit Gupta
Mar 2, 2022

By “talkative code” I simply mean code that explains why it failed, through proper logs, exceptions, alerts, etc. I did not coin the term; I probably read it somewhere, but I can't recall where.

Recently I was helping a colleague debug an issue he hit while trying to run a service locally. The log on his terminal said, in essence, that a worker was unresponsive.

To give some context, this error comes from a service that is responsible for generating PDFs. A worker can be thought of as an independent program that takes data, creates a PDF, and saves it to disk (the PDF is later uploaded to cloud storage). A worker can become unresponsive for several reasons: insufficient memory, an attempt to generate a very large PDF, etc.

Now back to the error message. Having faced this error before, I knew a few things that might cause it, so we started checking them one by one:

  1. Our first guess was an incompatible Node.js version: the service runs on Node 10 in production, but we realized we were trying to run it on Node 12. We switched the Node version, but the error stayed the same.
  2. Then we tried some random things: deleting and re-installing node modules multiple times, restarting the service (always hoping some magic would fix the issue), and restarting the service in a new terminal (thinking that the current terminal might be possessed by evil :p).
  3. Then we tried running the service in a Docker container using the production Docker configuration.
  4. After an hour or so our container started, but not successfully. It failed with a different error, one related to the underlying architecture of the machine.
  5. This rang a bell: we had always run the service on x86 architecture (Intel chip), but this time we were trying to run it on ARM architecture (Apple’s M1 chip).
  6. We spent the next few hours trying to make the container run by using an x86 base image of Ubuntu, but nothing worked. Same error every time.
  7. Then we went back to running the application without a container, redoing some of our previous steps, but nothing changed.
  8. Finally, we were devising an unproductive and slow workaround to run the code (spoiler: we were planning to deploy the code on dev after every change) when a bulb flickered ON in my head. 💡
  9. I told my colleague to run a command in his terminal and boom, the issue was fixed.

Before revealing the magic command, let me describe the code that was throwing that error. The logic is simple:

  1. The generatePdf function tries to generate a PDF and throws a WORKER_UNRESPONSIVE error if it is unable to create the PDF.
  2. The driver code calls the generatePdf function, catches any error it throws, and “assumes” that the function can only throw if a worker is unresponsive.
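
Here is a minimal sketch of the pattern described above. It is not the service's actual code: the function bodies, the `driver` name, the file name, and the synchronous file write standing in for real PDF rendering are all assumptions made for illustration. Only `generatePdf` and `WORKER_UNRESPONSIVE` come from the real code.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for the real PDF generation: it only writes bytes to
// disk, which is enough to reproduce the kind of failure discussed here.
function generatePdf(data, outDir) {
  const target = path.join(outDir, 'report.pdf'); // file name is an assumption
  fs.writeFileSync(target, data); // throws if outDir does not exist
  return target;
}

// Hypothetical driver showing the anti-pattern: every failure is reported as
// WORKER_UNRESPONSIVE and the original error is thrown away.
function driver(data, outDir) {
  try {
    return generatePdf(data, outDir);
  } catch (err) {
    // `err` is never logged or attached, so its message is lost here.
    throw new Error('WORKER_UNRESPONSIVE');
  }
}

// Usage: point the driver at a directory that does not exist.
const missingDir = path.join(os.tmpdir(), `no-such-dir-${process.pid}-${Date.now()}`);
try {
  driver('fake pdf bytes', missingDir);
} catch (err) {
  console.log(err.message); // WORKER_UNRESPONSIVE
}
```

Nothing in the message that reaches the caller hints at the file system, which is exactly what sent us down the wrong path.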

Now, there is an issue in this logic (there may be several, such as catching and throwing the Error superclass, but for the sake of this discussion we will consider only one): the error thrown by generatePdf is completely swallowed by the calling code, so there is no way of knowing what the original error was. We assumed that generatePdf can only throw if a worker is unresponsive. There are two problems with this:

  1. We think we completely understand every situation in which generatePdf can fail, so we catch the failures and handle them. But this is seldom the case; we can rarely anticipate all the scenarios in which a piece of code will throw an exception.
  2. We have overwritten the error before throwing it again. The original error is lost forever.

A very simple solution is to always pass along a reference to the original error when re-throwing a caught error. Or, if you don’t want to expose internal technical details of your code to the client, log the original error and throw a new error with a more user-friendly message.
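
Both options can be sketched roughly as follows. This is an illustration under assumptions, not the fix we actually shipped: `driverWithCause`, `driverWithLog`, the error messages, and the simplified `generatePdf` are hypothetical names and bodies.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for the real PDF generation (assumption).
function generatePdf(data, outDir) {
  const target = path.join(outDir, 'report.pdf');
  fs.writeFileSync(target, data); // throws if outDir does not exist
  return target;
}

// Option 1: re-throw, but keep a reference to the original error.
function driverWithCause(data, outDir) {
  try {
    return generatePdf(data, outDir);
  } catch (err) {
    const wrapped = new Error(`PDF generation failed: ${err.message}`);
    wrapped.cause = err; // plain property assignment, works on old Node versions too
    throw wrapped;
  }
}

// Option 2: log the original error, then throw a user-friendly one.
// `log` is injectable only to keep the sketch easy to try out.
function driverWithLog(data, outDir, log = console.error) {
  try {
    return generatePdf(data, outDir);
  } catch (err) {
    log('generatePdf failed:', err); // full details stay in our own logs
    throw new Error('Could not generate the PDF, please try again later.');
  }
}

// Usage: the original failure is still reachable through `cause`.
const missingDir = path.join(os.tmpdir(), `no-such-dir-${process.pid}-${Date.now()}`);
try {
  driverWithCause('fake pdf bytes', missingDir);
} catch (err) {
  console.log(err.cause.code); // ENOENT
}
```

Newer Node.js versions (16.9 and later, as far as I know) also accept `new Error(message, { cause: err })`, which does the same attachment in one step. I set the property by hand here because the service in this story ran on Node 10.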

Now the magic command that fixed the issue: mkdir pdf-out. That’s it. The actual failure happened while writing the PDF to disk: the code expects the pdf-out directory to already exist, and it was missing. After we added a log for the original error, the message we got pointed straight at the problem.

The error clearly mentions that the directory is missing. The issue that took several hours to debug would have taken less than 5 minutes if the error handling had been proper.

The issue here wasn’t just that the code hid the original error; it also reported a completely unrelated error, which had us banging our heads for hours trying to fix something that wasn’t broken in the first place.

Not swallowing the original error seems like an obvious thing to do, and you might think that this mistake should be very rare. But believe me, I have seen this pattern far too often to call it rare.

PS: The specific missing-directory error I described can be solved in other, better ways, for example by having the code create the directory if it is missing. But the purpose of this post is not to discuss how to solve the issue but how we could have identified it sooner.
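
For completeness, creating the directory when it is missing could look something like this. It is a small sketch, assuming a Node.js version that supports the `recursive` option of `fs.mkdirSync` (10.12 or later); `writePdf` and the temporary paths are hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: make sure the output directory exists before writing.
function writePdf(data, outDir, fileName) {
  // With { recursive: true }, missing parent directories are created and
  // an already existing directory is not treated as an error.
  fs.mkdirSync(outDir, { recursive: true });
  const target = path.join(outDir, fileName);
  fs.writeFileSync(target, data);
  return target;
}

// Usage: write into a pdf-out directory that does not exist yet.
const outDir = path.join(os.tmpdir(), `pdf-demo-${process.pid}-${Date.now()}`, 'pdf-out');
const written = writePdf('fake pdf bytes', outDir, 'report.pdf');
console.log(fs.existsSync(written)); // true
```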
