[Simon Tatham, 2023-07-06]
Computers give a lot of error messages. They’re often misleading, or ambiguous, or difficult to understand. So it’s often tempting not to really try: to ignore all the details mentioned in the error message, and focus only on the plain fact that it’s an error!
But there’s a lot of juice to be squeezed out of error messages if you know what to look for. It’s worth learning the skill of extracting all the detailed information you can. That gives you a head start on debugging problems yourself, and a better chance of writing a good bug report for somebody else – and perhaps even a better chance of deciding which of those things to do.
In this article, I’ll try to give an overview of the kind of things you can tell from the details of an error message – right down to things as trivial as the punctuation.
Disclaimer: this article is adapted from a talk I gave to my colleagues in 2022. So it reads a little more like somebody ranting out loud than my usual writings. Also, its original target audience was about 30 very specific people. But I hope it’s useful to other people too!
When a computer prints an error message, what’s it telling you?
Most obviously, it’s saying: there was an error. That is, something went wrong. In other words, whatever I just tried to do didn’t work.
Just occasionally, that really is the only thing that the message communicates. One embedded device I own literally prints a chirpy dialog box saying ‘Something went wrong! [OK]’, and that’s all you can ever find out about the problem.
But usually a program will at least try to tell you more detail than that:
Any or all of this information can help narrow down what the problem is, and what to do about it. But I’ve often seen people – even technically competent ones – apparently just ignore all the helpful details in the error message, and try to solve the problem without them. It’s as if they thought that would be an extra challenge. Or as if they hadn’t even noticed that the error message said anything more specific than ‘Something went wrong!’
Try not to do that. It only makes your own life harder. It really is worth reading the error message in detail, and understanding as much as you can about it.
If you’re trying to solve the problem yourself, this can often save you from wasting an hour pursuing a wrong idea of what the problem is. For example, by spotting immediately that if that had been the problem then you’d have expected a different error message.
Even if your plan is to report the problem to someone else, it can still be worth understanding something about the error message, because it gives you hints about what other things you should mention in your report. (For example, if the error message said ‘no such file’, it’s probably helpful to mention what file might be involved, if you can possibly find out.)
In particular, if you have a long log file (say, from a CI job) containing multiple errors, an understanding of what they mean will help you decide whether this particular error message is the useful one, or whether you should instead look further up or down in the log for a more relevant message from another subprogram.
Even better, if you can recognise the error as one that indicates a temporary problem, you may be able to avoid having to do anything at all!
This article is divided into two main sections: facts and advice. In the facts section, I’ll discuss some general principles of what kinds of errors exist, how programs typically report them, and what some specific messages mean. In the advice section, I’ll apply that knowledge to give some recommendations about what to do about various kinds of message, such as where (or whether) to look for more information.
This article will focus mostly on general-purpose error messages: the kinds of error that lots of different programs are likely to have to report. In particular, I’ll discuss errors reported by operating systems (both Unix and Windows) when a program tries to interact with them to manipulate files or directories or processes: ‘I tried to open this file, or run that program, and something went wrong.’ I’ll also discuss networking errors in particular, because these days, more or less everything involves networking.
But I won’t go into detail about the error messages generated by any particular program. Often those can be complicated in their own right. In particular, untangling complicated compiler errors could easily fill a whole separate article!
Even for the general error messages I’ll be discussing, I won’t have time or space to describe absolutely everything you’ll need to know. I only have room here to give a general idea of how things are, suggest what kinds of thing to look out for, and recommend you start keeping notes of your own. So this article won’t teach you all by itself to be an expert error-message-ologist. But it might show you how to start building up the experience to turn yourself into one.
In this section I’ll discuss some typical kinds of error condition that often happen, and the general conventions for reporting them to the user, so you know what general shape of message to expect in this kind of situation.
I’ll cover three main categories:
When a Unix process interacts with the operating system, failures are usually reported using a system of numeric codes. These are referred to as errno values (after the name of the variable that the error code is left in). Each errno value has a symbolic name inside the program, always in capitals and beginning with ‘E’, such as ENOENT. There’s a standard system function (‘strerror’) which translates each one into a short phrase intended to be printed to the user. Most simple Unix programs will print that string as part of their error messages, so it’s worth recognising the common ones and knowing what they mean.
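As a rough illustration of that three-way relationship (number, symbolic name, translated phrase), here is a small Python sketch using the standard errno module and os.strerror. The helper name describe_errno is invented for this example, and the translated phrases come from the system’s C library, so the exact wording may differ between platforms.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, strerror translation) for a numeric errno value.

    describe_errno is a hypothetical helper written for this illustration.
    """
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


if __name__ == "__main__":
    # Print number, symbolic name and translation for a few common codes.
    for code in (errno.ENOENT, errno.EISDIR, errno.ENOTDIR,
                 errno.EACCES, errno.EPERM):
        name, text = describe_errno(code)
        print(f"{code:3d} {name:8s} {text}")
```

Running it on your own machine is a quick way to see which phrase goes with which symbolic name there.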
There isn’t enough space here to list all the errno values, but here are some of the most important:
ENOENT, translated as ‘No such file or directory’, means that you – or rather, whichever program received this error – tried to access a file or directory that simply did not exist at all.
EISDIR, translated as ‘Is a directory’, means you tried to do a file-like operation to a directory, such as deleting it using rm instead of rmdir, or trying to write data to it.
ENOTDIR, translated as ‘Not a directory’, means the opposite: you tried to do a directory-like operation to a file, such as looking up a file inside it, or trying to cd into it.
EACCES, translated as ‘Permission denied’, means that some file or directory you tried to access exists, but the Unix permission bits mean that you are not allowed to access it. Some other user on the same system probably can. In particular, the root user can almost certainly access the file.
EPERM, translated as ‘Operation not permitted’, is confusingly similar to EACCES, and also means that you tried to do something that your particular user id doesn’t have permission to do. But EACCES is about the permissions on files or directories, and EPERM is about everything else: all the other situations in which ‘you need to be root to do that’, such as configuring a network interface, or killing another user’s process.
Unfortunately, a typical file access operation only reports failure in the form of a single code of this type. That doesn’t always tell you everything you needed. In particular, if you tried to access a file via a path of more than one subdirectory, like foo/bar/baz/quux.txt, then the error code might refer to any of the steps on that path, and nothing will tell you which. ENOENT might happen because the actual file quux.txt is missing, or because the directory foo didn’t exist in the first place, or anything in between. EACCES might happen because you don’t have the right to read quux.txt in particular – or because you don’t even have the right to look inside the containing directory to find out whether a file of that name exists.
An even more annoying example is that if you’re trying to rename a file, the error code might refer to the source or the destination, and won’t tell you which! For example, if ‘mv foo.txt baz/quux.txt’ reports ‘No such file or directory’, that might mean that the file foo.txt doesn’t exist to be moved anywhere, or that the directory baz doesn’t exist for you to move it into.
(An even more confusing case is that you can get ENOENT even when you’re trying to create a file! Your instinct is to think “Of course it doesn’t already exist, that’s why I’m trying to create it.” But ENOENT can occur if the directory you’re trying to create it in doesn’t exist.)
So, if you see an error message like this, the first thing to be aware of is that it might mean more than one detailed thing. If you need to know which step on a directory path is the problem, or whether an error referred to the source or destination file, you’ll have to poke around yourself after receiving the error report.
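One way to do that poking around is to check each step of the path in turn. The Python sketch below first shows ENOENT turning up when creating a file inside a directory that doesn’t exist, and then uses a helper to find the first path component that is missing. The helper name first_missing_component is made up for this example, and it has a known blind spot: a directory you have no right to look inside will make everything below it appear to be missing.

```python
import errno
import os
import tempfile


def first_missing_component(path):
    """Return the shortest prefix of `path` that does not exist, or None.

    Hypothetical helper: walks the path one component at a time, so it can
    say which step of a path like foo/bar/baz/quux.txt an ENOENT refers to.
    Components hidden behind an unreadable directory also look missing.
    """
    path = os.path.normpath(path)
    prefix = os.sep if path.startswith(os.sep) else ""
    for part in path.split(os.sep):
        if not part:
            continue
        prefix = os.path.join(prefix, part)
        if not os.path.lexists(prefix):
            return prefix
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        os.mkdir(os.path.join(top, "foo"))
        target = os.path.join(top, "foo", "bar", "quux.txt")
        try:
            # Creating the file fails with ENOENT, because the directory
            # foo/bar does not exist.
            open(target, "w").close()
        except OSError as e:
            print(errno.errorcode[e.errno],
                  "-> first missing step:", first_missing_component(target))
```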
Several error codes in this system have confusing or unclear symbolic names, or translations, or both. For example, even the ‘ent’ in ENOENT isn’t completely obvious. (It’s meant to indicate that there was no entry in the containing directory with the name you specified.)
It’s confusing that EACCES goes with the phrase ‘Permission denied’ and not ‘Access denied’ – if you know that there are things called both EACCES and EPERM, you might reasonably expect the word ‘permission’ in the translation to indicate that the error code was EPERM! But no, both messages use the word ‘permission’ or ‘permitted’, and you have to spot the difference between the detailed wordings ‘Permission denied’ and ‘Operation not permitted’.
Some of these messages and their translations are very vague, when they could and should be much more specific. For example, if you see EKEYEXPIRED / ‘Key has expired’ when trying to access a file, then the ‘key’ in question almost certainly refers to a Kerberos ticket – so you’d only expect to see it if your organisation is using Kerberos at all. It doesn’t refer to any kind of key that an application program might have been dealing with on purpose, such as an SSH key (or a key on the keyboard).
(This is particularly confusing if you get ‘Key has expired’ when you were trying to do something with an SSH key! In that situation the SSH key itself is probably fine – or, at least, you have no evidence otherwise right now – but your Kerberos setup is currently too confused for SSH to find it.)
A few errno messages are outright misleading. For example, ETXTBSY / ‘Text file busy’ has nothing to do with what ordinary people think of as a text file. It means that an executable file was already being run when a program tried to write to it, or was in the middle of being written when something tried to run it.
(And even that is only slightly strange compared to the notorious historical message ‘Not a typewriter’, which has had nothing to do with typewriters for decades! Linux has now reworded this as ‘Inappropriate ioctl for device’, which is at least accurate, though still not especially helpful. But its symbolic code is still ENOTTY, because those are harder to change.)
I won’t try to write a complete list of all the subtleties of the Unix error code system, or all the situations in which one of these errors can occur at a surprising moment. That would take too much space, too much time, and I’d surely get half of it wrong myself. So all I can say is: be aware in general that this kind of confusion exists, and as you gain experience, build up your own list of things that can catch you out. Everyone’s list will be different!
When a Unix command-line tool reports an operating system error, the message will ideally include four pieces of information. If you’re lucky, you can hope to see:
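The name of the program reporting the error.
The thing the program was trying to operate on, such as a file name.
The operation it was trying to perform on that thing.
The translation of the errno code, indicating in what way that operation did not succeed.
Here are a couple of examples: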
$ ls /root
ls: cannot open directory '/root': Permission denied
The program has printed its own name: ‘ls’. It says what it was trying to operate on: the directory /root. It says what it was trying to do to that directory: open it (in order to start reading a list of filenames out of it, which is ls’s job). And it shows that that operation failed with the errno value EACCES: the file permission bits on the directory /root are set so that this user does not have the rights to list its contents. (Not surprisingly, because it’s the root user’s home directory, and this command was run by an unprivileged user.)
$ ssh wibble.example.com
ssh: Could not resolve hostname wibble.example.com: Name or service not known
In this example, the program has printed its name, ‘ssh’; it’s trying to operate on the network host name ‘wibble.example.com’; what it’s trying to do is to resolve that name (which means translating it into a numeric IP address that it would then have tried to make a connection to). And the error translation is ‘Name or service not known’, which means that hostname doesn’t exist at all. (Again, not surprisingly, because I just made it up.)
However, you’re not always as lucky as that. Many programs forget to print at least one of these pieces of information. Sometimes this is pure laziness (the program’s error reporting code was written in a hurry, and there isn’t enough internal ‘plumbing’ inside the program to get all the right pieces of information to the place where the error message is constructed). Often it’s because the programmer used a standard system library function called perror to report the error, which prints text of your choice followed by the translation of the errno value, but makes it outright difficult to get all three of the other useful components into the prefix text.
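For comparison, here is a minimal Python sketch of a message that does carry all four pieces. It is not how any particular tool builds its messages: the function name report_os_error and the demo path are invented for illustration, and it leans on the fact that Python’s OSError records both the file name and the strerror translation.

```python
import os
import sys


def report_os_error(action, exc, prog=None):
    """Format "prog: action 'object': translation" from an OSError.

    `action` is a short phrase such as "cannot open directory"; the object
    name and the errno translation are taken from the exception itself.
    This is a hypothetical helper, not part of any standard library.
    """
    prog = prog or os.path.basename(sys.argv[0]) or "program"
    target = exc.filename if exc.filename is not None else "(unknown object)"
    return f"{prog}: {action} '{target}': {exc.strerror}"


if __name__ == "__main__":
    try:
        # The path is made up, so this is expected to fail with ENOENT.
        os.listdir("/nonexistent-directory-for-demo")
    except OSError as e:
        print(report_os_error("cannot open directory", e, prog="demo"),
              file=sys.stderr)
```

The point of the design is that the caller only has to supply the one thing the exception cannot know, namely what the program was trying to do.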
One thing that you almost never find out from this type of error message is why the program was trying to do that operation to that object! In the cases I’ve shown above, it’s pretty obvious, but sometimes that can be the biggest mystery. I’ll come back to that theme in the advice section.
Windows has a similar system to Unix of error numbers with symbolic names and text translations. Generally both the names and the descriptions are a bit more verbose than Unix.
Here’s a link to MS’s documentation of the full list of error codes. They’re divided into a few sub-pages according to their numeric values, so it may help to know that most of the common errors relating to processes and files are likely to have small numbers and can be found in the 0–499 subpage. On the other hand, networking errors mostly have values just above 10,000, so those are likely to be in the 9000–11999 subpage.
Here are some typical examples of Windows error codes and their translations:
ERROR_FILE_NOT_FOUND = ‘The system cannot find the file specified.’
ERROR_PATH_NOT_FOUND = ‘The system cannot find the path specified.’
ERROR_ACCESS_DENIED = ‘Access is denied.’
ERROR_OPEN_FAILED = ‘The system cannot open the device or file specified.’
ERROR_DIRECTORY = ‘The directory name is invalid.’
ERROR_SHARING_VIOLATION = ‘The process cannot access the file because it is being used by another process.’
Unfortunately, just like the Unix error code system, some of the Windows codes are also confusing, or vague, or hard to make sense of, or you don’t get the one you’d expect in a particular situation. In particular, although Windows has more of these codes than Unix, and typically puts more words in the translations, that doesn’t mean you get more useful information: the messages are often surprisingly non-specific, and the same code can be reused for things you’d like to be able to tell apart.
For example, you get ERROR_ACCESS_DENIED if you try to read a file you’re not allowed to open – but you also get it if you attempt a file-like operation on a directory. (Unix would have told you the difference, with EACCES versus EISDIR.)
If you attempt a directory-like operation on a file (such as trying to cd into it), ERROR_DIRECTORY seems to be the error code you get. So ‘the directory name is invalid’ should be taken to mean ‘there is something of that name, but it’s not a directory’. If you try to cd into a directory that doesn’t exist at all, you get ERROR_PATH_NOT_FOUND.
(On the other hand, this means Windows has separate codes for a file not existing and a directory not existing, namely ERROR_FILE_NOT_FOUND and ERROR_PATH_NOT_FOUND, where Unix uses ENOENT for both.)
ERROR_OPEN_FAILED is especially odd. It tells you Windows couldn’t open something, but not why not. If the file wasn’t there, or you didn’t have permission to read it, you’d expect a more specific code like ERROR_FILE_NOT_FOUND or ERROR_ACCESS_DENIED. So what kind of failure might cause ERROR_OPEN_FAILED? Apparently it can happen when a virus scanner interferes with opening the file in some way – but the error is so vague that you’d never have guessed that just from the text.
Unfortunately, Windows doesn’t have Unix’s strong convention to report a few useful pieces of information along with the error code. A common behaviour of a Windows command-line tool is to just print the text translation of the error code, without any other information:
C:\Users\User>type wibble.txt
The system cannot find the file specified.
C:\Users\User>reg query HKCU\Software\Nonexistent\Thingy
ERROR: The system was unable to find the specified registry key or value.
This is particularly awkward if a long batch script is running lots of subcommands in sequence, because you don’t even get to find out which one failed. Suppose your batch file runs, and something prints ‘The process cannot access the file because it is being used by another process’. Not only do you have no idea what file couldn’t be accessed – you don’t even know which process failed to access it!
I wish I could give some advice for getting round this problem, and actually finding out which program gave an error on Windows. But the best I can say is: be aware that this is a problem, and keep your mind open to the possibility that the program giving the error might not be the same one it looks as if it is.
When one Unix program runs another, and the child process terminates, its parent receives some information about whether things went well. It’s useful to know how this works, so you can interpret the things the parent program might print afterwards.
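On Unix, a parent process can find out which of these two things happened to a child that has stopped running:
The child exited on purpose, returning a status code to its parent.
The child was terminated by a signal.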
The status code returned by a terminating program must fit in an 8-bit integer; that is, its possible values are between 0 and 255 inclusive. Status code 0 is invariably used to indicate that the program ran successfully. Non-zero status codes indicate failure of some kind.
A lot of programs don’t use the whole space of failure codes, and simply return 0 for success or 1 for any kind of failure at all. On the other hand, some programs will use different nonzero values to signal different kinds of failure, but there’s no particularly strong convention about how that should work.
One obvious approach would be to use the size of the number to indicate the size of the disaster: if 0 means nothing went wrong, perhaps 1 might mean something small went wrong, 2 might mean a bigger problem, 3 a bigger one still, and so on? I’m sure some programs do organise their error codes in that way. But some do things completely differently. For example, fsck returns an error status that you have to interpret in binary: each of its 8 bits indicates a different type of failure, and it can tell you which combination of problems occurred!
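To make the ‘interpret it in binary’ idea concrete, here is a short Python sketch that splits an fsck-style status into its separate bits. The table of bit meanings is my recollection of the Linux fsck(8) manual page, so treat it as an assumption and check the man page on your own system; the function name decode_fsck_status is invented for this example.

```python
# Bit meanings as I recall them from the Linux fsck(8) manual page;
# this table is an assumption, so verify it against your own system.
FSCK_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking cancelled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Split an fsck exit status into the list of conditions it encodes."""
    return [text for bit, text in FSCK_BITS.items() if status & bit]


if __name__ == "__main__":
    # 5 = 4 + 1: two separate conditions reported in one status value.
    for line in decode_fsck_status(5):
        print(line)
```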
The second cause of process termination is signals. Just like errno codes, Unix signals each have a numeric value, a symbolic name and a standard text translation. Unlike errno values, signal numbers are quite standard between versions of Unix, and also much more likely to be visible in log files (see the next section). So it’s useful to know how to look them up. In bash, at least, the command ‘kill -l’ will list all the signals and their numbers.
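If you’d like to see the two kinds of termination side by side, the following Python sketch runs one child that exits on purpose and one that is killed by a signal. It relies on the documented behaviour of Python’s subprocess module, which reports ‘terminated by signal N’ as a returncode of -N; the helper name describe_termination is made up for this example.

```python
import signal
import subprocess
import sys


def describe_termination(returncode):
    """Explain a subprocess returncode as an exit status or a signal.

    Python's subprocess module reports 'killed by signal N' as -N, and a
    deliberate exit as the non-negative status code.
    """
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-returncode} ({name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # A child that exits on purpose with status 3.
    exited = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(describe_termination(exited.returncode))

    # A child that sends SIGTERM to itself.
    killed = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"])
    print(describe_termination(killed.returncode))
```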
Some signals just mean that the program crashed: it got its internal state confused, and tried to do something so badly wrong that the system had no way to continue running it. All of the following signals are crashes of one kind or another:
SIGSEGV = ‘Segmentation fault’: the program tried to access memory at an address where there wasn’t any.
SIGBUS = ‘Bus error’: the program tried to access memory at an address that didn’t even make sense in some way.
SIGILL = ‘Illegal instruction’: the program tried to run something that the CPU didn’t recognise as valid machine code.
SIGFPE = ‘Floating point exception’: the program tried to divide by zero, or something similar.
If you’re not planning to actually debug the crashing program yourself, then the differences between these signals probably aren’t very important. But if you report the crash to somebody else, it’s worth mentioning which of these signals it was, in case that’s a useful clue.
(One signal name I’ll call out here for being confusing is SIGFPE, which need not refer to floating point at all. An integer division by zero can perfectly well cause that signal. In fact, that’s actually more likely!)
Other signals indicate that something happened to the process from outside. Some examples:
SIGINT = ‘Interrupt’: usually generated by pressing Ctrl-C in the terminal the program was running in.
SIGTERM = ‘Terminated’: usually generated by using the kill command in the default way.
SIGKILL = ‘Killed’: often also generated using kill, and sometimes by other causes such as the system being low on memory.
SIGPIPE = ‘Broken pipe’: this program was writing output to a pipe, and the program at the receiving end of the pipe terminated before reading it all.
So if you see any of the first three of those messages in a log file, it probably doesn’t mean there’s anything wrong with the program that received the signal. If you’re unhappy that someone killed your process, you’ll need to hunt down the killer!
(In some situations the killer may be another piece of software, of course. CI systems will sometimes run a long job, and send it SIGTERM or SIGKILL if it runs for too long. If you’re lucky, the program that sent the signal will have left a message somewhere in its log file confessing to the crime and giving its reasons.)
The last of these signals, SIGPIPE, is a bit different. That one isn’t sent on purpose by a human. It happens when you pipe the output of one program into another, and the receiving program dies (for any reason at all) while the sending program hasn’t finished writing its output yet. This stops the sending program from wasting lots of effort on generating the rest of its output that nobody is listening to. So if you see ‘Broken pipe’ in a log, you should probably look for why the receiving process died, because that’s the real cause of the problem.
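Here is a small Python sketch that provokes SIGPIPE deliberately, assuming the usual Unix tools yes and head are installed. In this demo the receiving program doesn’t crash; head simply exits normally after reading one line, which is enough to leave yes writing into a pipe that nobody is reading. The function name run_broken_pipeline is invented for this example.

```python
import signal
import subprocess


def run_broken_pipeline():
    """Run the equivalent of `yes | head -n 1` and report both outcomes.

    Returns (writer returncode, reader returncode, reader output).
    Assumes the standard Unix tools `yes` and `head` are on the PATH.
    """
    writer = subprocess.Popen(["yes"], stdout=subprocess.PIPE)
    reader = subprocess.Popen(["head", "-n", "1"],
                              stdin=writer.stdout, stdout=subprocess.PIPE)
    # Close the parent's copy of the pipe, so that `head` is the only
    # reader and `yes` really does get SIGPIPE once `head` has gone.
    writer.stdout.close()
    output, _ = reader.communicate()
    writer.wait()
    return writer.returncode, reader.returncode, output


if __name__ == "__main__":
    writer_rc, reader_rc, output = run_broken_pipeline()
    print("reader exit status:", reader_rc)
    if writer_rc == -signal.SIGPIPE:
        print("writer was terminated by SIGPIPE")
```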
Sometimes, when one program runs another, there’s an intermediate process in between. For example, when make runs a compiler, it doesn’t do it directly. Instead, make runs sh, and that runs the compiler.
In this situation, the intermediate process will try to pass on the exit status of the child. So if the child process exits on purpose with a particular status code, the intermediate process will exit with the same code, so its parent make can find out whether the operation succeeded, in the same way it would find it out without the sh in the middle.
But what happens if the subprogram terminates due to a signal?
As I described in the previous section, exit statuses and signals are completely separate on Unix: you can always tell which one happened to a subprocess of yours. So if an intermediate process like sh wanted the parent to receive exactly the same notification as if it hadn’t been in the way, it would have to do that by deliberately killing itself with the same type of signal.
That is possible, using the kill() system call. But it’s not generally considered a good idea, and intermediate processes don’t normally do it. Instead, the convention is for the intermediate process to terminate with an exit status obtained by adding 128 to the signal number.
For example, suppose you see a message along the lines of ‘make: *** [… ] Error 139’. This means that a process run by make has crashed with SIGSEGV. How do we know that? Because:
139 is bigger than 128, so it probably indicates a signal.
Subtract 128 from 139 to get the signal number, 11.
Look up signal 11 in the output of kill -l, and find that it’s SIGSEGV.
Of course, not all programs follow the convention that exit status 128 + n means a subprocess terminated with signal n. As I mentioned in the previous section, fsck returns an exit status in which all 8 bits indicate different failures. So if fsck exited with status 139, it would mean something entirely different!
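The subtraction is easy to automate. This Python sketch applies the ‘128 + signal number’ convention to an exit status and looks the result up in the standard signal module instead of kill -l. The function name interpret_shell_status is invented for this example, and the answer is only ever a guess, for exactly the reason just given: the program might be using that value for something else.

```python
import signal


def interpret_shell_status(status):
    """Guess what an exit status passed on by a shell means.

    Applies the '128 + signal number' convention. This is only a guess,
    since a program is free to use such values for its own purposes.
    """
    if status > 128:
        try:
            name = signal.Signals(status - 128).name
            return f"probably killed by {name}"
        except ValueError:
            pass  # no signal with that number on this system
    return f"exit status {status}"


if __name__ == "__main__":
    for status in (1, 130, 139):
        print(status, "->", interpret_shell_status(status))
```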
As usual, things are a bit different on Windows.
On Windows, a process can choose a full 32-bit exit code to return on purpose to its parent. On the other hand, the space of deliberate exit codes is shared with the space of status values indicating crashes and interrupts. If a process terminates for a reason that would generate a signal on Unix, then its parent will still just receive a 32-bit exit code, and will have to look at the particular code to decide whether it’s likely to be a deliberate or involuntary termination.
Similarly to Unix, the usual values returned as deliberate exit codes on Windows are still normally small non-negative integers, with 0 meaning success and nonzero meaning an error. Perhaps on Windows there’s slightly more of a convention that a larger number means a worse failure. (Because cmd.exe lets you test exit codes with the construction ‘IF ERRORLEVEL n’, and it really means that the exit code is at least the number specified.)
If the process crashes or is interrupted, the return status will be taken from MS’s system of NTSTATUS codes. These codes are used for a lot of purposes, many having nothing to do with process exit. But among them are some values that can be generated when a child process exits. These values are typically a bit more than 0xC0000000. Some examples:
Some examples:
0xC0000005 = STATUS_ACCESS_VIOLATION is the Windows analogue of SIGSEGV, indicating that the process crashed trying to access nonexistent memory.
0xC000001D = STATUS_ILLEGAL_INSTRUCTION is the Windows analogue of SIGILL, indicating that the process crashed trying to execute invalid machine code.
0xC000013A = STATUS_CONTROL_C_EXIT is the Windows analogue of SIGINT, indicating that the process was manually interrupted by someone pressing Ctrl-C.
Because these exception codes occupy the same space of numbers that a process can return deliberately, an intermediate process such as cmd.exe can propagate one unchanged to its own parent if it wants to. So there’s no need on Windows for the awkward Unix system of adding 128 to the signal number and hoping your caller recognises that that’s what happened.
However, in my experience, cmd.exe won’t propagate all of these values unchanged. For example, I’ve seen it pass on STATUS_ACCESS_VIOLATION when its last subprocess returned that, but as far as I can tell, if its subprocess exits with STATUS_CONTROL_C_EXIT, cmd.exe discards that status value and returns success to its own parent.
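Since deliberate exit codes and crash codes share one number space on Windows, telling them apart is a matter of looking at the value. The Python sketch below does that for the three codes mentioned above. The function name describe_windows_exit is invented for this example, the table is deliberately tiny, and the handling of negative numbers assumes that some tools display the 32-bit code as a signed integer (so 0xC0000005 shows up as -1073741819).

```python
# The three NTSTATUS values named in the text; a real table would be longer.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}


def describe_windows_exit(code):
    """Classify a Windows exit code as deliberate or an NTSTATUS value.

    Negative inputs are mapped back to the unsigned 32-bit form first,
    on the assumption that the code was displayed as a signed number.
    """
    unsigned = code & 0xFFFFFFFF
    if unsigned in NTSTATUS_NAMES:
        return f"0x{unsigned:08X} ({NTSTATUS_NAMES[unsigned]})"
    if unsigned >= 0xC0000000:
        return f"0x{unsigned:08X} (looks like an NTSTATUS error, not in this table)"
    return f"{unsigned} (probably a deliberate exit code)"


if __name__ == "__main__":
    for code in (0, 1, -1073741819, 0xC000013A):
        print(describe_windows_exit(code))
```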
Networking is complex. Not just in the general sense of ‘difficult to understand’, but in the more specific sense of ‘composed of lots of parts’.
A networking system is made of lots of separate components, layered on top of each other. So it’s horribly easy to spend ages debugging the wrong one, and trying to solve a problem you don’t actually have. Therefore, the first task of a networking error message is to try to indicate which layer has had a problem – and the first task of the person reading it is not to ignore that information!
As an example, I’ll discuss the stages of setting up an SSH connection. (If you’re a web-based person who doesn’t use SSH, then sorry about that choice of example. I do realise that HTTP is far more widely used than SSH, but it doesn’t make a good example for this purpose, because web browsers are exceptionally bad at reporting the details of network errors. I think SSH is one of the most commonly used protocols where reading the error messages is useful.)
When you make an SSH connection in the typical way, the SSH client and operating system between them must do all of the following things, any of which can fail and give an error message:
This list looks complete, but one complication is that the very first step (‘turn this hostname into a numeric address’) typically also requires connecting to a server on the network – and then again sometimes it doesn’t. So if the whole network is down, the evidence of that might be that the DNS step fails, or that the later connection step fails, depending on your particular network setup.
So, what error messages might you see from all of this? In rough order:
The DNS said your hostname doesn’t exist. Maybe look for a typo in the hostname, or go back to the person who gave you the hostname.
The DNS didn’t deliver a conclusive answer, for some reason. Maybe this means the DNS server is down; maybe it’s up, but can’t contact some other DNS server it needs; alternatively, maybe this is your first clue that your entire machine is cut off completely from the rest of the network and you can’t successfully send network packets anywhere.
We found the destination machine’s address, but couldn’t get any reply from sending even a single packet to it. Maybe this means the network is down between you and the destination (but, if you’ve got this far, maybe not between you and the DNS server we talked to already). Or maybe the destination machine itself is down.
The machine sent a packet back (so it probably is there), but it isn’t accepting connections to the SSH port in particular. Maybe sshd is not running on the destination machine.
‘Connection refused’ does not mean that the server is refusing logins to you in particular, i.e. that your particular user account is unauthorised. This message occurs when you haven’t told the server who you want to log in as, so it can’t possibly base a decision on that!
This probably means that authentication failed: you tried to prove to the server who you are, and it didn’t believe you.
If you’re sure you have an account on the server, then this is the moment to check your login details are right: are you using the right username, the right password, and/or the right SSH authentication key?
Alternatively, maybe you don’t have an account on this server – in which case perhaps this is the moment to ask the server administrator nicely if you can have one.
ERROR_ACCESS_DENIED)
This is more likely to mean that you successfully logged in to the server, but then the command you told the server to run (perhaps on the ssh command line) suffered some kind of a failure that had nothing to do with networking – for example, maybe you asked the server to access a file that your account doesn’t have the right to access.
In other words, this is a failure of authorisation, as opposed to authentication: the server knows perfectly well who you are, but doesn’t believe that person is allowed to do the thing you’re attempting.
If you believe you should be allowed to do the thing, this is the moment to ask the server administrator why you can’t.
One other possibility to bear in mind for most of these problems is that you might simply be connecting to the wrong host, or to the wrong port number on the same host. (Some computers run two separate SSH servers on different port numbers, with entirely different sets of valid users.) If you are connecting to entirely the wrong place, then you might see any of these errors, depending on how closely the wrong server resembles the right one: the wrong host might not exist, or be down, or not running an SSH server, or not believe in your user account – or, even more confusingly, it might believe in your user account but have a different idea of what that account is allowed to do.
So, no matter which of these steps went wrong, it’s worth considering the possibility that the server you actually wanted is not the one you’re currently connecting to! If you’ve entered completely the wrong host name or port number, then it doesn’t really matter which step the wrong server rejected you at. The fix is the same: try the right server instead.
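To see the first two layers fail separately, here is a Python sketch that does the name lookup and the TCP connection as two distinct steps and reports which one went wrong. It stops before anything SSH-specific (no authentication, no commands), it only tries the first address the lookup returns, and the function name probe_tcp and the wording of its results are invented for this example.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try name resolution, then a TCP connection; say which step failed.

    A sketch only: it tries just the first address returned by the lookup,
    and goes no further than establishing the connection.
    """
    try:
        addrs = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as e:
        return f"name resolution failed: {e.strerror}"
    family, socktype, proto, _, sockaddr = addrs[0]
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        try:
            s.connect(sockaddr)
        except ConnectionRefusedError:
            return "connection refused: host answered, but nothing is listening on that port"
        except (socket.timeout, TimeoutError):
            return "connection timed out: no reply from the host"
        except OSError as e:
            return f"connection failed: {e.strerror}"
    return "connected"


if __name__ == "__main__":
    # wibble.example.com is the made-up host name used earlier in the text.
    print(probe_tcp("wibble.example.com", 22))
```

Keeping the lookup and the connection in separate try blocks is what lets the result name the layer, which is the same information a good error message is trying to give you.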
I said above that web browsers produce less useful error messages than SSH clients. This is true in the early stages of connection setup. Firefox, for example, prints exactly the same message for both ‘connection refused’ and ‘connection timed out’.
But once you’ve got through to a web server, things get a bit better, because HTTP itself will deliver some more useful error messages that let you tell the difference between types of failure:
404 Not Found
This is the HTTP analogue of ENOENT: the URL you asked for just doesn’t exist at all.
403 Forbidden
The HTTP analogue of EACCES: the URL is private, and the web server doesn’t have you on the list of people who can access it.
But, typically, the web server at least thinks it does have some idea of who you are. So if you logged in before getting this message, this probably doesn’t mean your login has failed.
401 Unauthorized
The HTTP analogue of ‘Access denied’ at login time: you didn’t successfully convince the web server that you were you.
3xx redirections (such as 301 or 302)
This is usually not seen as a final error message. When an ordinary browser receives this message, it should automatically follow the pointer to some other page and return that instead.
But some command-line tools don’t do this in all situations. For example, if you’re downloading something using curl without the -L option, then it might generate this error as a reason why it can’t do what you asked. (The solution may well be to add -L to your command line and try again.)
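As a rough illustration of how these codes separate different kinds of failure, here is a minimal Python sketch. It assumes the codes in question are 404, 403, 401 and the 3xx redirections; the function name explain_status and the wording of the hints are made up for this example, and only the standard reason phrases come from Python's http module.

```python
from http import HTTPStatus

def explain_status(code: int) -> str:
    """Return a one-line hint for an HTTP status code (illustrative wording only)."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for codes Python doesn't know
    if code == 404:
        hint = "the URL does not exist (compare ENOENT)"
    elif code == 403:
        hint = "the server refuses to let you see this URL (compare EACCES)"
    elif code == 401:
        hint = "your login was not accepted (compare 'Access denied' at login time)"
    elif 300 <= code < 400:
        hint = "redirection: follow the pointer, for example with curl -L"
    else:
        hint = "not covered by this sketch"
    return f"{code} {phrase}: {hint}"

for code in (404, 403, 401, 302):
    print(explain_status(code))
```

The point is only that the number carries real information: the same "it didn't work" outcome leads to four different next steps.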
Right, I’ve finished talking about what error messages mean. Now I’ll give some advice about what to do when you see one.
Mostly, these will be about investigation: how to convert the visible error message into an understanding of what’s really gone wrong.
The first step in investigating is to think of some questions you want answers to. Some are obvious, like ‘I wonder what file it means when it says “file not found?”’. Some might have become obvious after you’ve read the previous half of this article, like ‘I wonder whether the program that generated this message is the same one I think it is?’
Some that perhaps aren’t obvious are:
If you see an error message making a claim about anything you can independently observe, my first piece of advice is to go and look for yourself, and try to confirm the error message’s claim.
In my experience, people often find this counterintuitive. They look surprised when I ask it during a pair-debugging session. If the computer has just told them some file isn’t there (for example), they assume without question that it’s true, and move straight on to considering how it might have happened, or what to do about it.
But I’ve always found it’s worth going and looking for yourself. Mostly not because the computer is lying. (Although that does occasionally turn out to be the cause, and when it does, it’s good to find out sooner rather than later!) But mostly because, in the course of going and looking for yourself, it’s surprisingly common to find out something else on the way that gives you a clue to why the file isn’t there (or whatever the problem is).
Here are a few examples.
Go and look for yourself: tab-complete your way down the pathname in the error message, and see whether you think that file or directory exists.
For a start, if the pathname involved multiple components (like /usr/lib/foo/bar/baz/quux.txt), this will at least tell you which component doesn’t exist.
When you get to the directory where the file isn’t (or where some containing directory isn’t), look at what is in that directory. You never know what that might turn up. You might find a file spelled very similarly to the one you were expecting (aha, so the problem is just a typo!) Or you might find that everything in the directory is named in upper case rather than lower case (aha, is this program being run in a case-sensitive environment for the first time?) Or you might simply recognise the contents of the directory as a set of files you’ve seen before, and realise that that is clearly not where you’d expect to find the file mentioned in the pathname.
Another possibility is that you’ll find out that the file does exist, right where you thought it was. In that case, if the pathname is relative to the current working directory, perhaps the problem is that the program that failed wasn’t in the directory you thought it was in? And then you know you’re trying to solve a different problem: not ‘who deleted my file?’ but ‘why is this program running in the wrong directory?’
An even more confusing case is if the error message came from a different computer – for example, a program was run remotely via ssh in the middle of a script. Then the file might exist here, where you’re looking, but not there where the program was running.
In the very worst case, if you find that absolutely everything on the whole pathname looks exactly as you expect but just that one file is missing … then at least you’ve ruled out all those other causes of failure, and you can focus wholeheartedly on ‘who deleted my file, or what happened to the thing that was supposed to have created it?’.
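If tab-completion isn't convenient, the same walk down the pathname can be sketched in a few lines of Python. This is only an illustration: the function name first_missing_component is invented here, and it checks the machine and working directory where you run it, which (as noted above) may not be where the failing program ran.

```python
from pathlib import Path

def first_missing_component(pathname):
    """Return the shortest prefix of pathname that doesn't exist, or None if it all exists."""
    path = Path(pathname)
    # path.parents runs from the immediate parent up to the root, so reverse it
    # to walk downwards from the top, finishing with the full path itself.
    for prefix in [*reversed(path.parents), path]:
        if not prefix.exists():
            return prefix
    return None

print(first_missing_component("/usr/lib/foo/bar/baz/quux.txt"))
```

Once you know which component is missing, listing the directory just above it is the next thing to try.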
PATH?
Go and look for yourself. Print out your PATH; identify which directory on it you expected the thing to be in; list that directory to check that it’s there.
You might start by noticing that the directory containing the program isn’t on your PATH, so now the question is not ‘where’s my program gone?’ but ‘why was my PATH not set up the way I expected?’
Or you might find that the program file is present under the right name, but isn’t marked as executable – so maybe something went wrong with the installation process that put it there.
Or (on Unix) you might find that it’s a broken symlink: perhaps you symlinked a program into your bin directory rather than copying it, and then blew away the directory where it really lived, without remembering that that would cause a problem.
Or – similarly to the previous example – you might recognise the contents of the directory as clearly the wrong thing. ‘Oh, oops, I added /opt/program on my PATH, but all the binaries live one level lower down in /opt/program/bin, guess I should have used that instead.’
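The checks described above (is it there, is it executable, is it a broken symlink) can be sketched in Python as follows. The function name diagnose_on_path and the wording of its findings are invented for this example; it is a simplified Unix-flavoured sketch and doesn't try to reproduce everything a real shell does when searching PATH.

```python
import os

def diagnose_on_path(program, path=None):
    """Report what each directory on PATH holds under the name `program`."""
    if path is None:
        path = os.environ.get("PATH", "")
    findings = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, program)
        if not os.path.lexists(candidate):
            continue  # nothing by that name in this directory
        if os.path.islink(candidate) and not os.path.exists(candidate):
            findings.append((candidate, "broken symlink"))
        elif not os.access(candidate, os.X_OK):
            findings.append((candidate, "present but not executable"))
        else:
            findings.append((candidate, "executable"))
    return findings

print(diagnose_on_path("ls"))
```

An empty result means the name isn't in any directory on PATH at all, which points back at the question of how PATH was set up.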
Go and look for yourself. Open up the script file in a text editor, or just in less if that’s easier; have a look at the specified location in the file; see what you can see.
This is worth doing even if you don’t speak the language that the script is written in. It’s very easy to assume that you won’t be able to make sense of it, and not even bother looking. But it’s worth looking anyway, because there might very well turn out to be a problem so obvious that you don’t need to speak the language.
For example, you might be able to recognise that the file in question is completely the wrong kind of thing. Maybe it was supposed to be Python source code but it’s English text, or Perl, or binary gibberish, or HTML.
Another possibility is that the file has been truncated for some reason, perhaps because a disk filled up, or because a download was interrupted. If the program code in the file ends half way along a line, or in the middle of a pair of brackets or braces, then you might be able to recognise that even if you don’t know anything about the language at all. You only need to know that it has brackets or braces (or perhaps infer it from what you can see of the code) to guess that perhaps if the file ends in the middle of one that isn’t a good thing.
You might even find that the file contains another error message. For example, perhaps the program that generated the source file failed, and wrote its error message directly into the output file rather than sending it to a separate error channel. (Command-line programs shouldn’t do that, but sometimes they do anyway.)
And so on. If you see any error message that makes a claim you can observe directly, go and look for yourself! Even if you think you won’t be able to understand what you’re looking at. You never know.
The reason why it’s a bad idea to say ‘I won’t go and look because I don’t expect to understand it anyway’ is that that’s only what would happen if everything is working as expected. But you’re investigating an error message, so you already know that something isn’t doing what you expect. And that’s why it’s worth having a quick look in a shell script even if you don’t speak shell, or similar: what did happen might turn out to be easier to understand than what should have happened.
Sherlock Holmes famously said that it’s easy to reason forwards from a cause to likely effects, but harder to reason backwards from an effect to its likely causes.
A lot of debugging is based on Holmes’s hard backwards reasoning, especially when all you have to go on is a confusing error message. But once you’ve formed a theory about what the problem is, don’t forget to double-check the easy part, by asking yourself what effects you’d expect from the cause you have in mind. Specifically, ask yourself:
It sounds obvious, but it’s a surprisingly easy step to forget! For example, perhaps you already had some possible problem in mind before you even ran the program that failed. When it did fail, it was easy to assume that it had failed in the way you expected – and, if it hadn’t, to spend half an hour looking in the wrong place, when the message telling you otherwise was on the screen all along.
Another way this can go wrong is that you form a theory of the problem and then find you’re not quite sure what error message your theory predicts. In that case, one thing you can try is to provoke the same error condition on purpose and see what error you end up with.
For example, supposing your theory is that a file has the wrong permissions, but you can’t remember exactly what error you get in that situation. Then you could try using some easy tool like ls to access a file you know has the wrong permissions, and compare that error message with the one you actually saw.
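Here is one way to provoke an error on purpose from Python and see exactly what the OS reports. It's a sketch under assumptions: the helper name provoke is invented, and it uses a missing file rather than a permissions problem, because a missing file behaves the same whoever runs it (root can usually read files that an ordinary user can't). To test a permissions theory, point the same helper at a file you know you can't read.

```python
import errno
import os
import tempfile

def provoke(action):
    """Run action() and return (symbolic errno name, standard message), or None on success."""
    try:
        action()
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno)), os.strerror(exc.errno)
    return None

# Deliberately open a file that cannot exist, inside a fresh empty directory.
with tempfile.TemporaryDirectory() as scratch:
    missing = os.path.join(scratch, "no-such-file")
    result = provoke(lambda: open(missing).close())

print(result)
```

Comparing the printed name and message with the error you originally saw tells you whether your theory predicts the right thing.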
This technique doesn’t always work. Sometimes your prediction of the expected error message will be wrong. Software has bugs, and sometimes even its error reporting has bugs; systems are complicated, and sometimes you didn’t understand as well as you thought.
But you only get better with practice! It’s worth getting into the habit of trying to predict the errors you expect, even if you don’t always get it right. When you get it wrong, it’s a learning experience, and you find out more about how things really work.
Programs are (mostly) good at reporting what they tried to do, and what went wrong when they tried it. But they’re often less good at reporting why.
Of course, sometimes this is obvious, and clearly not the problem. If ‘cat file’ reports that it failed to open file, you don’t need it to explain why it wanted to.
But sometimes it’s not obvious at all. A program might report an error reading some obscure file such as /etc/foo.conf, and your first question might be ‘Why is this program even trying to read /etc/foo.conf, when I didn’t ask it to do anything involving a foo?’ Or you might find yourself looking at a network error, when you didn’t realise anything you were doing involved making a network connection. Or you might see an error message indicating EPERM, when you didn’t think you were doing anything that ought to need to be root.
If you don’t know why the program was even trying to do the failing thing, then one possibility is that it was a mistake to try it at all. So, before you start trying to arrange for the failing operation not to fail, consider whether that’s the right direction to be heading in: perhaps what you should be doing is arranging for it not to be tried at all.
Beware in particular of the temptation to handle EPERM (or any similar ‘you need to be root’ message) by immediately becoming root and trying the thing again, without first thinking about whether it was what you actually wanted to be doing. If you just try again as root, you risk the failure mode ‘You have now tried extra hard to do a wrong and dangerous thing, and this time, succeeded.’
Computers critique spelling and punctuation in your code all the time. So it seems only fair to return the favour when you’re reading their error messages. And it’s actually worthwhile: even the seemingly irrelevant details of an error message can be useful clues!
If you’re used to reading text written directly by humans, this won’t be second nature. Humans rephrase the same concept in different ways all the time – perhaps even on purpose, to avoid sounding too repetitive. So you get used to ignoring the differences of wording, and focusing on the meaning.
But a computer program has no fear of sounding repetitive, and will use a fixed printf format string (or equivalent) for any given type of error message. So spelling, wording and even punctuation can be useful clues, because you can reasonably expect them to stay consistent. Even a missing comma can be a clue that the error message isn’t the one you thought it was!
(On the other hand, you can’t depend on different programs all reporting the same error in the same way. If they’re all producing the standard translation strings of OS error codes, then that part will be consistent, but any part of the message made up by the program author can easily differ from another program reporting the same type of problem. The consistency you can expect is very specific.)
In particular, there are a lot of phrases with similar sense but different details, which can indicate wildly different error conditions. For example, ‘Connection refused’, ‘Permission denied’ and ‘Access denied’ are all English phrases with the same basic meaning of ‘I refuse to let you do what you want’ – but in the usual contexts of software, they’re all talking about different things you wanted, so it matters a great deal which of them you saw.
An even tinier example is the difference between ‘Access denied’, which (for example) an SSH client might print when the server refuses your login details, and ‘Access is denied’, which is the Windows standard translation of ERROR_ACCESS_DENIED and indicates either a file permissions problem, or treating a directory as a file by mistake.
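To make the point about exact wording concrete, here is a small Python sketch that matches phrases verbatim rather than by meaning. The table and the function name likely_source are invented for illustration; the hints summarise the usual sources of each phrase, and a real program could of course print any of these strings for its own reasons.

```python
# Exact phrases, mapped to where each one most often comes from.
PHRASES = {
    "Connection refused": "network: nothing accepted the connection on that port",
    "Permission denied": "OS error EACCES: file permissions prevented the operation",
    "Access denied": "login rejected, for example by an SSH server",
    "Access is denied": "Windows ERROR_ACCESS_DENIED: file permissions, or a directory used as a file",
}

def likely_source(message):
    """Return the hint for whichever known phrase appears verbatim in message, else None."""
    for phrase, hint in PHRASES.items():
        if phrase in message:  # deliberately case-sensitive and exact
            return hint
    return None

print(likely_source("open (13: Permission denied)"))
print(likely_source("Access is denied."))
```

The match is deliberately strict: one extra word ('is') selects a completely different row of the table.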
A piece of wording can be a vital clue even of which program generated the error. For example, I once ran a program written in the scripting language Tcl, and it failed with a syntax-error message. I didn’t speak Tcl, so if there had been a genuine error in the program it would have been hard work to figure it out. But I happened to spot that the wording of the syntax error looked very familiar – in fact, suspiciously like a particular error I was used to seeing from bash … aha! The problem was that the Tcl script was being fed to an interpreter for entirely the wrong language.
On another occasion, I saw someone get an error dialog box from a GUI application, in which the application’s own name was spelled without a capital letter. The application was developed by a large company, well organised enough that I doubted they would have misused their own trademark in that way – and that was enough of a clue to track the error down to a third-party plugin rather than the application itself.
Sometimes you don’t just get one error message at a time. You get an enormous log file containing messages from lots of programs running in a huge script. An example might be a software build log, or a log of a test run from a CI system.
If that job as a whole fails, then probably somewhere in the log file will be an error message. Or rather, at least one error message.
In many cases, there will be more than one error message. This can happen in a complicated web of interoperating programs for lots of reasons. A failure in one program can cause a knock-on failure in another; test harness programs will often try to helpfully re-summarise errors from other programs; a test suite might deliberately continue running after an error to try to report as many different useful things as possible; some errors aren’t fatal, and a program can carry on running in spite of them.
So the next problem is: out of all these error messages, which one is the useful one for actually getting to the bottom of what went wrong?
Obviously, you’ll need to figure this out if you intend to debug the problem yourself. But it’s also a useful thing to do if you’re going to report the problem to somebody else. You could just send your local expert a link to the entire log and say ‘please sort this out, kthxbye’, but if you’re more polite than that, you’d want to at least try to pick out the relevant error message, to save them some of the effort.
Some types of error message report that another program has exited with a failure status. In that situation, the other program might well have printed an error of its own before exiting. So error messages of this type function as ‘signposts’ in the sense that they point at another error message: they don’t indicate the cause of failure by themselves, they just tell you where to look next.
For example, suppose you see a message such as ‘make: *** [foo] Error 1’. make is a program that runs other programs one by one (usually to build a piece of software); this error is telling you that it couldn’t build a component called foo, because it tried to do that by running a subprogram of some kind, which exited with status 1.
As we discussed earlier, exit status 1 means the subprogram deliberately reported failure. And if a program does that, it probably printed an error message just beforehand, saying why. So the useful message is not ‘Error 1’, but whatever the subprogram printed just before exiting with status 1. For example, it might have printed a message saying a file wasn’t found, or that some source code had a syntax error.
(However, if the message had said ‘make: *** [foo] Error 139’, that would have been a different matter: as we also discussed earlier, this is more likely to mean that the subprogram crashed with a segmentation fault, in which case it was probably caught by surprise and didn’t have time to gasp out a last communication at all.)
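The difference between ‘Error 1’ and ‘Error 139’ can be sketched as a small Python function. This assumes the common Unix shell convention that a status of 128 + N means the process was killed by signal N; the function name describe_exit_status and its wording are made up for the example, and signal numbers can differ between systems.

```python
import signal

def describe_exit_status(status):
    """Interpret an exit status as reported by a shell or by make."""
    if status == 0:
        return "success"
    if status > 128:
        # Shell convention: 128 + N usually means 'killed by signal N'.
        try:
            name = signal.Signals(status - 128).name
        except ValueError:
            name = f"signal {status - 128}"
        return f"probably killed by {name}; it may not have printed anything"
    return "deliberate failure; look for the message printed just before"

print(describe_exit_status(1))
print(describe_exit_status(139))
```

On Linux, 139 is 128 + 11, and signal 11 is SIGSEGV, which is where the segmentation fault reading comes from.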
Another example of an error message that’s usually a signpost on the route rather than the final answer is the Python exception subprocess.CalledProcessError, because that too indicates that Python ran another program and expected it to succeed, and instead it reported failure.
Typically the full message will look like something along the lines of ‘subprocess.CalledProcessError: Command 'foo' returned non-zero exit status 2’. Again, a small exit status like 1 or 2 suggests that the command probably reported failure on purpose, so you should look for the message it printed before doing so, which will be further up the log file, before the subprocess.CalledProcessError.
It’s easy to mistake the CalledProcessError for the root cause of the problem, especially because Python exceptions come with a large detailed traceback showing exactly where in the Python program something went wrong. When the problem genuinely is a bug in the Python script, that traceback is often exactly what somebody needs to know to fix it. But in the particular case of CalledProcessError, it’s misleading: if you send somebody a problem report consisting of just the traceback, they’re likely to be frustrated that you left out the most important thing, which was immediately before the traceback started!
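The following Python sketch shows where the useful message lives relative to the CalledProcessError. The failing child here is a stand-in: a one-line Python command that prints a made-up complaint about a hypothetical input.txt and exits with status 2. The helper name run_and_explain is also invented for this example.

```python
import subprocess
import sys

def run_and_explain(argv):
    """Run argv; on failure, return the exit status and the child's own stderr text."""
    try:
        subprocess.run(argv, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        # The exception only says 'returned non-zero exit status N'.
        # The child's explanation of WHY is in what it wrote to stderr.
        return exc.returncode, exc.stderr.strip()
    return 0, ""

# Hypothetical failing child: prints a reason to stderr, then exits with status 2.
child = [sys.executable, "-c",
         "import sys; sys.stderr.write('input.txt: syntax error on line 3\\n'); sys.exit(2)"]

print(run_and_explain(child))
```

In a real log the child's stderr usually isn't captured like this; it simply appears in the output just above the traceback, which is why that's the place to look.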
(Of course, this rule doesn’t give the right answer every time. Usually when a Python script runs a subprogram the interesting question is ‘Why didn’t the subprogram succeed?’ – but just occasionally, as we’ve just discussed, it’s unsurprising that the subprogram failed, and the better question is ‘Why was it trying that in the first place?’. And for that, the traceback might be more likely to help.)
Sometimes, one subprogram of a large job can fail in a way that doesn’t bring the whole job to a halt – but whatever the subprogram was trying to do is not fully done, and that can cause other programs to fail later in the job.
In particular, in build and test runs, a lot of the interoperating programs will be consuming each other’s output. So an error that prevents an output file from being created at all, or from being created correctly, can easily lead to the consuming program reporting an error message about that file if it runs at all. But the real problem is not that the consumer got confused by the nonexistent or wrong file – it’s that the file was nonexistent or wrong in the first place.
So another skill of finding the most useful error message is to identify things that look ‘cascadey’, and treat them as an indication to look further up the log file.
For example, errors of the general ‘file not found’ type, such as ‘No such file or directory’ or ‘The system cannot find the file specified’, might indicate that something earlier in the same script should have written that file, and might have printed a message saying why it didn’t. Not always – it will depend on what the file in question is – but it can be worth looking.
If your log file is from a software build process, and it contains a compile error in a source file in the build directory rather than in the original source directory, that might also be a cascade failure. Source files in the build directory are likely to be auto-generated by an earlier build step – so the auto-generating tool might have said something helpful at the point where it generated this file.
One particularly common case is that a program consuming a file reports that it’s unexpectedly completely empty. This can easily happen by accident, in more than one way.
One way that a file can be accidentally empty is because the disk filled up while some earlier program was trying to write it. If you’re lucky, there will be an earlier error message mentioning that the disk was full – although there may not be, because a lot of programs forget to check that the write() system call (or equivalent) succeeded. But even if there’s no error message, it’s worth bearing in mind. One clue to a disk-full problem might be that lots of files in the same run seem to be mysteriously empty. In that case, definitely check how much free space there is on the disk in question!
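Both checks (are lots of files mysteriously empty, and how full is the disk) are easy to sketch in Python. The function names empty_files and free_fraction are invented for this example, and what counts as ‘too many empty files’ or ‘too little free space’ is a judgment left to the reader.

```python
import os
import shutil

def empty_files(directory):
    """List regular files under directory whose size is zero."""
    found = []
    for root, _dirs, names in os.walk(directory):
        for name in names:
            full = os.path.join(root, name)
            if os.path.isfile(full) and os.path.getsize(full) == 0:
                found.append(full)
    return sorted(found)

def free_fraction(directory):
    """Fraction of the filesystem holding directory that is still free."""
    usage = shutil.disk_usage(directory)
    return usage.free / usage.total

print(len(empty_files(".")), "empty files;", f"{free_fraction('.'):.1%} of the disk free")
```

Bear in mind that the disk may have been full at the time of the failure and have free space again by the time you look.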
Another way for a program to generate an empty file by mistake is if it opens the output file prematurely, before finding out some reason why it can’t generate the data to write to it. This in turn can happen for more than one reason. A common one is I/O redirection, i.e. running a command of the form ‘command > output_file’, because the shell or command processor handles redirections by creating the output file first, and then running the subprogram. So if the subprogram runs into any kind of problem before writing anything to its output, the output file is still there, just empty.
Another type of premature output-file creation happens in Python, if a program uses the argparse module to process its command line, and in particular uses argparse.FileType for the option that specifies an output file. This will cause argparse to actually create and open the output file during parsing of the command-line options. So if the program then discovers a syntax error in its input (for example) and exits with an error, it’s too late – the output file already exists, without any data in it.
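The argparse behaviour is easy to see for yourself. In this small sketch, the option name --output and the file name result.txt are arbitrary choices for the demonstration; the output file is created in a temporary directory so that nothing is left behind.

```python
import argparse
import os
import tempfile

parser = argparse.ArgumentParser()
parser.add_argument("--output", type=argparse.FileType("w"))

with tempfile.TemporaryDirectory() as scratch:
    out_path = os.path.join(scratch, "result.txt")
    args = parser.parse_args(["--output", out_path])
    # Nothing has been written yet, but the file is already there.
    created_during_parsing = os.path.exists(out_path)
    size_so_far = os.path.getsize(out_path)
    args.output.close()

print(created_during_parsing, size_so_far)
```

If the program had exited with an error right after parse_args, a zero-length result.txt would have been left for some later step to trip over.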
In both of the previous subsections, the common theme has been: recognise that sometimes an error message indicates that you should be looking for a more interesting one earlier in the file.
So an obvious approach is to just look for the earliest error message you can find at all! Then it can’t possibly be a cascade failure or a re-summarisation of an earlier error message.
This often works, but (as usual) not always. One reason it can go wrong is that some error messages are not fatal! Sometimes a program will print an error message and then keep going, because the thing that failed wasn’t very important, or because the program already had a good plan for dealing with the problem. So if you focus your attention on the earliest error message, you might find that it’s still not the really important one.
One example of this occurs in build tools such as cmake, which sometimes start off by performing tests to see what facilities are available on the build machine. cmake, for example, might print things like this:
-- Looking for strerror_s
-- Looking for strerror_s - not found
-- Looking for setenv
-- Looking for setenv - found
-- Performing Test HAVE_STRUCT_STAT_ST_MTIMESPEC_TV_NSEC
-- Performing Test HAVE_STRUCT_STAT_ST_MTIMESPEC_TV_NSEC - Failed
-- Performing Test HAVE_STRUCT_STAT_ST_MTIM_TV_NSEC
-- Performing Test HAVE_STRUCT_STAT_ST_MTIM_TV_NSEC - Success
The words ‘not found’ and ‘Failed’ in these messages look like errors – but they’re perfectly normal. In this snippet (taken from the LLVM compiler project), cmake is checking the system’s build environment for a large range of functions and structure fields that aren’t all expected to exist, and if any one of them doesn’t exist, it’s OK, the software project has a plan for doing without it. (That’s why it bothered to check in the first place – to see if its plan would need to be used.)
A completely different example: I used a test suite once that ran all its subprograms with a deliberate restriction on the amount of memory and CPU they can consume, via the Unix ulimit command. This meant that if the program under test went out of control, it wouldn’t make too much trouble for everything else on the machine. But then we started running the same test suite in a container environment in which processes aren’t allowed to run ulimit to lower their own limits in that way. So the log files were suddenly full of errors saying ‘ulimit: Operation not permitted’. At first nobody noticed, because the test harness proceeded without the safety precaution, and the tests themselves passed. Later, when the test suite actually failed a test, the person reading the log file connected the dots and thought ‘aha! Error message here, failed test there, they must go together’ – but in fact, that ulimit error had nothing at all to do with the test failure.
One way to spot non-fatal errors is to compare your log file against a log file from the same job not failing, if you have one. Any error message you can see in the log that ended up reporting success can’t be a fatal one, and then if you see it in the failing log, it probably isn’t the real problem. In both of the examples above, this technique would have identified the errors as non-fatal, without having to know anything about cmake configuration checks or ulimit.
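A crude version of that comparison can be sketched in Python. Everything here is an assumption made for illustration: the function name, the list of ‘error-looking’ words in ERROR_HINT, and the sample log lines (the test_foo line is invented). Real logs often contain timestamps or process IDs that would need stripping before lines can be compared like this.

```python
import re

# Words that make a log line look like an error; extend to taste.
ERROR_HINT = re.compile(r"error|failed|not found|not permitted", re.IGNORECASE)

def suspicious_only_in_failing(passing_log, failing_log):
    """Return error-looking lines from failing_log that never appear in passing_log."""
    seen_in_passing = {line.strip() for line in passing_log.splitlines()}
    return [
        line.strip()
        for line in failing_log.splitlines()
        if ERROR_HINT.search(line) and line.strip() not in seen_in_passing
    ]

passing = """-- Looking for strerror_s - not found
ulimit: Operation not permitted
all tests passed
"""
failing = """-- Looking for strerror_s - not found
ulimit: Operation not permitted
test_foo: assertion failed
"""
print(suspicious_only_in_failing(passing, failing))
```

The cmake and ulimit lines drop out because the successful run printed them too, leaving only the line that is new in the failing run.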
Here are a few real examples of error messages that are confusing, showcasing the use of the advice in this article.
apt download failure
The first example (with irrelevant details removed) came from the apt command on an Ubuntu Linux system, when a user tried to use it to download a file from the local Ubuntu package repository:
E: Failed to fetch http://long URL/filename.tar.xz Could not open file filename.tar.xz - open (13: Permission denied) [IP: 10.11.12.13 80]
The person who got this error message interpreted it as reporting a network error, because of the initial ‘Failed to fetch [URL]’, and started trying to debug their network connection. But that part of the message is apt’s high-level description of the overall task that couldn’t be completed. Later in the error message, we see a detailed description of what exactly went wrong: ‘open (13: Permission denied)’. That means that the tool attempted the Unix system call open, which opens an ordinary disk file, and got the EACCES error code, meaning that Unix file permissions prevented it. (The ‘13’ is the numerical value of EACCES on Ubuntu.)
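If you see a bare number like that ‘13’ and want to know which error it is, Python's errno module can translate it on the machine you're sitting at. This is just a lookup sketch; errno numbers are platform-specific in general, so strictly the answer applies to the system where you run it.

```python
import errno
import os

code = 13  # the number from 'open (13: Permission denied)'
print(errno.errorcode[code], "-", os.strerror(code))
```

On Linux this names EACCES, matching the ‘Permission denied’ text in the apt message.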
In other words, the problem wasn’t the network at all. The package repository’s web server was functioning perfectly. But after apt started receiving the file data, it tried to create a file on the receiving machine to save the data to, and that failed with a permissions error.
In fact, the problem was that the user had run the apt source command, which downloads files into your current directory, but their current directory was somewhere in /etc, where they couldn’t create files. They’d forgotten to change directory to somewhere more sensible first.
(If the web server really had refused to serve the file, the details would have looked less like a Unix errno report and more like an HTTP error code, perhaps 404 or 403 or some such.)
(Also, this is a good illustration of the principle that you don’t always want to respond to a permission error by using more authority. The actual failed operation was creating a file in /etc, which a normal user can’t do, but root can. But the user definitely didn’t want to become root and try again, because then they would have succeeded in dropping a Debian package file somewhere under /etc, where it might confuse some system program, and where they’d never find it again!)
curl | sh syntax error
My second example involved a user running a command of the form ‘curl some://url | sh’, to download and immediately run a shell script (which they hoped would install a piece of software they wanted to use). This failed, because sh reported a syntax error.
When this user asked for help, our first suggestion was to run the curl command on its own, without ‘| sh’ on the end, so that we could see the actual text curl was piping into sh. The user hadn’t tried this, because they didn’t expect to understand it anyway, because they didn’t speak sh. But this is an illustration of the ‘go and look for yourself’ principle: as soon as we did look at the output of curl, it turned out that curl had output an HTML error document in place of a shell script. And even if you don’t understand shell scripts, you might still be able to tell the difference between a shell script and a piece of HTML.
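Telling HTML from a shell script needs so little knowledge that even a few lines of Python can make a fair guess. This is a deliberately rough heuristic invented for this example (the function name guess_kind and the markers it looks for are assumptions); a human glancing at the text will usually do better.

```python
def guess_kind(text):
    """Very rough guess at what a downloaded 'script' really is."""
    head = text.lstrip()[:200].lower()
    if head.startswith("#!"):
        return "script with a #! line"
    if head.startswith(("<!doctype html", "<html")) or "<head>" in head or "<title>" in head:
        return "HTML document"
    return "unknown"

print(guess_kind("<html><head><title>302 Found</title></head></html>"))
print(guess_kind("#!/bin/sh\necho installing...\n"))
```

Saving the download to a file and looking at it before piping it into sh gives you the same information, and is a good habit anyway.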
The HTML error document itself reported an HTTP error code of 302, meaning it was a redirection. (The user was downloading the shell script via an outdated URL – the website hosting it had been reorganised, and the old URL was automatically redirecting to a more up-to-date one.) So the answer was that the user needed to add the -L option to the curl command line, which makes it follow redirections, which it won’t do by default.
But it makes a difference which HTTP error it was: if it had been 404 rather than 302, then that wouldn’t have helped, and instead, we’d have had to check for typos or paste errors in the URL, or maybe whether the place they’d got the URL from was out of date.
rdiff-backup startup failure
My final example comes from a backup utility I use called rdiff-backup, which can back up one computer on to another by using SSH to connect to the other machine. When SSH fails, rdiff-backup tries to print helpful user-friendly advice about what to do about it, which sometimes goes wrong. For example:
Host key verification failed.
Fatal Error: Truncated header string (problem probably originated remotely)
Couldn't start up the remote connection by executing
ssh -C username@hostname rdiff-backup --server
Remember that, under the default settings, rdiff-backup must be installed in the PATH on the remote system. See the man page for more information on this. This message may also be displayed if the remote version of rdiff-backup is quite different from the local version (2.0.5).
The final paragraph is large, prominent, and friendly-looking,
and suggests some plausible things to
check. rdiff-backup
needs to be installed on both
computers for remote backups to work, so
perhaps rdiff-backup
isn’t installed correctly on the
machine we’re connecting to? Or perhaps it is installed, but is
completely the wrong version? But all of this is guesswork (as you
can tell from the word ‘may’ and the two different suggestions of
what might be wrong). It’s rdiff-backup
trying to give helpful advice to the user, not based on hard facts
about this particular situation, but based on the author’s general
experience of which things people most often get wrong.
All that rdiff-backup really knows is that it ran ssh, and the data stream it received from ssh ended before it had seen anything that looked like a greeting from another copy of rdiff-backup. That’s what it’s reporting in the message ‘Truncated header string’.
But in this case, that’s a cascade failure. Immediately before that, we can see the real cause of the error: the ssh command printed ‘Host key verification failed’. When ssh prints that, it means it gave up on the network connection for security reasons before even trying to transfer data – so in this case the connection was abandoned before even finding out whether rdiff-backup was installed on the other system, or whether it was the right version. All of the user-friendly advice is completely missing the point, and the real error message shows that the problem is totally different.
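As a rough illustration of how a cascade like this arises, here is a Python sketch of a wrapper that starts a transport subprocess. This is not how rdiff-backup is implemented: FAKE_SSH is a hypothetical stand-in for ssh (it just prints the message and exits with status 255, which is what ssh uses for its own errors), and start_remote is a name I made up. The sketch shows a wrapper that looks at the child’s own exit status and message before it starts guessing about the far end.

```python
import subprocess
import sys

# Hypothetical stand-in for ssh: prints an error to stderr and exits
# with status 255 without sending any data on stdout.
FAKE_SSH = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('Host key verification failed.\\n'); sys.exit(255)",
]

def start_remote(command):
    """Run the transport command and report the earliest failure we know of."""
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.returncode != 0:
        # The child itself failed, so its own message is the root cause.
        return f"transport failed (exit {proc.returncode}): {proc.stderr.strip()}"
    if not proc.stdout:
        # The child ran fine but sent nothing: only now is it reasonable
        # to start guessing about the program at the other end.
        return "truncated header: remote program may be missing or mismatched"
    return "ok"

print(start_remote(FAKE_SSH))
```

A wrapper that skipped the first check would see only the empty data stream, and would fall through to the guesswork, which is the situation in the message above.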
Everything I’ve said in the advice half of this article is unreliable!
All of these suggestions of what to investigate, or how to investigate, are good rules of thumb, or good starting points, or likely causes of problems. Sometimes they’re not right. If you think you know better, and that your case is an unusual one that doesn’t conform to my principles, you might be right.
And even if you don’t think you know better, there’s always the chance that your case is still an unusual one, and you just haven’t found that out yet.
For example, sometimes programs report the wrong error, which will cause a serious problem if you’re trying to deduce as much as you can from the details of an error message!
This can happen intentionally, for security reasons. In this article I’ve encouraged you to pay attention to the difference between ‘Permission denied’ and ‘No such file or directory’, or between 403 and 404 in HTTP, because they mean different things. But sometimes they’re deliberately merged: some web servers will deliberately return 404 (thing doesn’t exist at all) even when the real problem is 403 (it exists but you aren’t allowed to access it), in order to avoid leaking information about what things exist. All I can suggest is to remember which web servers can’t be trusted, and treat them with more suspicion.
Reporting the wrong error can also happen by accident. For example, on both Unix and Windows, operating system errors are all written into the same location – so if another error occurs between the real failure and the error message generation, the program can print the wrong error code. Another example is in the previous section, where a program well-meaningly tried to guess how its subprocess had failed, got it wrong, and gave misleading advice.
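Here is a small Python simulation of that accident. The variable last_error is only a stand-in for the real shared error slot (errno on Unix, the value returned by GetLastError on Windows), and all the function names are invented for this sketch; real Python code rarely hits this, because exceptions capture the error code at the moment of failure.

```python
import errno
import os

last_error = 0  # simulated stand-in for the single shared error slot

def open_missing_file():
    """Pretend to fail with 'No such file or directory'."""
    global last_error
    last_error = errno.ENOENT
    return False

def write_log(message):
    """Pretend an unrelated logging step fails with 'Permission denied'."""
    global last_error
    last_error = errno.EACCES  # overwrites whatever was stored before
    return False

def report_buggy():
    if not open_missing_file():
        write_log("open failed")        # clobbers the shared slot
        return os.strerror(last_error)  # describes the wrong failure

def report_careful():
    if not open_missing_file():
        saved = last_error              # capture the code immediately
        write_log("open failed")
        return os.strerror(saved)       # describes the real failure

print("buggy:  ", report_buggy())
print("careful:", report_careful())
```

The buggy version prints the message for the logging failure, even though the thing the user cares about is the missing file.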
I know it sounds as if I’m saying that you have no hope of spotting the right error message. And it’s true that there are no rules that are simple and reliable.
Ultimately, the most reliable method of decoding error messages correctly is to know a lot about the specific programs you’re dealing with, and have lots of experience of how they really report their errors, which ones are dependable, which ones are misleading, and what the confusing ones really mean.
Without that knowledge, you just have to guess – but the rules of thumb in this article might make your guesses right more often.
And it’s worth trying to interpret error messages in as much detail as you can, because that’s a good way to gain the detailed experience that makes you an expert. Do your best to interpret the error message even if you don’t already know the area well; if you get the wrong answer, find out why you were wrong, and improve! It’s always tempting to fling the whole thing at someone more experienced, to get your immediate problem solved faster – but then you don’t learn as much.
This article is adapted from material that I originally wrote for my employer Arm. I reproduce it as a public article with Arm’s kind permission.
With any luck, you should be able to read the footnotes of this article in place, by clicking on the superscript footnote number or the corresponding numbered tab on the right side of the page.
But just in case the CSS didn’t do the right thing, here’s the text of all the footnotes again:
1. This particular case is not strictly speaking an errno code. For historical reasons, the hostname lookup functions report errors using their own separate enumeration of error codes, which aren’t stored in errno or translated using the same functions. But the principle is the same.
2. In that situation I feel as if Windows shouldn’t be using the definite article! It would be more honest for it to print ‘A process cannot access a file’, which at least doesn’t try to give you the impression that you ought to know the details already …
3. There’s more than one reason why it’s a bad idea to propagate the signal literally. One is that it’s inconvenient for the program itself, which won’t get a chance to clean up its own state in the normal way, and will have to make a special effort to do it before calling kill(). Another is that processes dying of particular signals have knock-on effects that aren’t wanted in this case. For example, if a program terminates due to SIGSEGV, the OS might write a core dump file, or even automatically offer to send a bug report to the developer. If the program genuinely crashed, this might be useful – but it’s not so useful if the same things happen because the program deliberately sent itself SIGSEGV.
4. The fact that exit statuses can be greater than 2^31, and that this even happens in practice, means that programmers should beware of storing one in an int – especially if they have the Unix habit of using a negative value to mean ‘no real value here yet’! This caused a very confusing intermittent bug in early versions of pterm.exe.
5. I’m not convinced that Windows reliably reports the difference between permanent and temporary DNS failure. I tested it by turning off my own name server temporarily and I still got ‘Host does not exist’. Then again, the MSDN docs say there are separate error codes, so who knows.
6. In the context where I originally gave this talk, many people were users of the git hosting system ‘Gerrit’, which typically runs a special SSH server on the unusual port number 29418. The server it’s running on is likely to also run an ordinary SSH server on the default port 22, and a common cause of Gerrit login failure is to forget to put the special port number in your git clone command, so that git tries to talk to the wrong one of those servers. Adding to this confusion, some people configure ssh to recognise the Gerrit server name and automatically add the special port number – and then when that person sends a sample command line to somebody without that configuration tweak, it will cause exactly this error …
7. Also, if you’re using sudo in particular, there’s a strong chance of other nasty side effects involving leaving files in locations specific to your user id which are unexpectedly owned by root, to create problems later. sudo kinit is a particular example I’ve seen causing trouble, and any large and complicated GUI program is also a bad bet.
8. In that situation it’s probably useful to also provide the link to the entire log. If you misidentified the relevant message, the recipient will need to go back to the full log file and do their own analysis. As I said in my much older article How To Report Bugs Effectively, it’s fine to attempt a diagnosis in a bug report, but do also include all the raw data that your reasoning was based on.
9. Unix is particularly prone to generating empty files when the disk fills up, because typically Unix filesystem formats have separate limits for the number of files you can make, and the total amount of data in those files. And you basically always hit the data limit before the file limit. So when you try to make a file on a full disk, what happens is that you successfully create a new file of zero size, and don’t get an error until you actually try to write anything to it.
10. At least, this happens if you did the obvious thing of calling parser.add_argument with type=argparse.FileType("w"). My personal habit in my own Python scripts is to set type=opener("w") instead, where opener is a wrapper of my own, returning a lambda which will call the FileType constructor. So I can defer actually opening the output file until I’ve got past all the likely causes of error. But this is unidiomatic and unusual, and most simple Python programs don’t bother.
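For the curious, the opener idea in footnote 10 might look something like the sketch below. This is a guess at the shape of such a wrapper from the description in the footnote, not the actual code; the option name --output and the helpers build_parser and demo are invented for the illustration.

```python
import argparse
import os
import tempfile

def opener(mode):
    """Return an argparse 'type' callable that defers opening the file."""
    def convert(path):
        # Nothing is opened here: we return a zero-argument callable
        # which opens the file only when it is finally called.
        return lambda: argparse.FileType(mode)(path)
    return convert

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", type=opener("w"), required=True)
    return parser

def demo():
    """Show that parsing does not create the file, but calling the opener does."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        args = build_parser().parse_args(["--output", path])
        existed_after_parsing = os.path.exists(path)
        # ... all the checks that might fail would go here ...
        with args.output() as f:  # the file is created only at this point
            f.write("result\n")
        existed_after_writing = os.path.exists(path)
    return existed_after_parsing, existed_after_writing

print(demo())
```

With the plain type=argparse.FileType("w"), the file would already exist (and be empty) as soon as parse_args returned, whether or not the rest of the program went on to succeed.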