DNS And E-mail
Volume Number: 21 (2005)
Issue Number: 7
Column Tag: Programming
Mac In The Shell
by Edward Marczak
Did You Know They're Related?
It is my charge to bring you all things Unix. To help you understand what's going on beneath the GUI.
Sometimes, this involves troubleshooting. While DNS and e-mail no longer play exclusively in the Unix realm,
they certainly spent their childhood there and have some Unixy-like scars to show for it. Not only did they
grow up in the same era, they just happen to be first cousins. This particular column aims to be a tiny
troubleshooting guide.
Family Tree
"DNS and e-mail?" I hear you ask. "They are such different systems, and do completely different things!"
Ah! It's the differences that allow them to complement each other, which is why they are such perfect mates.
E-mail relies on DNS. In fact, most systems nowadays have some reliance on DNS. E-mail just won't get
delivered without DNS. Kerberos not functioning? Check your DNS. Most other systems would just be
cumbersome, as you'd have to specify other systems, not as a name, but as an IP address.
E-mail servers rely on DNS in several ways. Your outbound e-mail server must have access to DNS to look up
mail routing information. To receive e-mail, your Internet-facing DNS servers must be set up correctly to let
others know how to reach you. DNS misconfiguration is a common source of e-mail trouble.
DNS
Life without DNS cumbersome? Well, yeah - this is why DNS exists. Computers like numbers, people like
names. The Domain Name System exists primarily to convert names, like www.apple.com, into an IP address
(number), such as 17.112.152.32. It also helps us convert from number back to name. DNS has been pushed to
do even more than these simple translations.
Bottoms Up!
I'm going to introduce these systems youngest first. Before DNS existed, name-to-address translation
took place by looking up entries in a flat file called 'hosts'. This file still exists, and can still be
used. There was a time when OS X ignored the hosts file, but it is now recognized by default. You can find
it in /etc, where all good Unix boxes store the hosts file. When the Internet was small enough, current hosts
files were copied between machines (actually, from one master machine) via FTP.
Naturally, as the number of machines on the Internet grew, this system just didn't scale. In November 1983, Paul
Mockapetris released RFC 882 and 883 - the Domain Name System. These RFCs have since been superseded (by RFC
1034 and 1035) and augmented, but those two were the birth of DNS.
DNS provides a hierarchical, distributed database. Name servers are the programs responsible for
transferring the database to clients (as appropriate) and to other servers. OS X ships with BIND (Berkeley
Internet Name Domain) as its name server. There are other servers, such as djbdns, MaraDNS, and even
stand-alone DNS hardware such as Bluecat DNS. Tiger server, specifically, runs version 9.2.2 of BIND. While
there are pros and cons to BIND, it is certainly the most popular name server on the Internet.
Enough history, on to the details! Please note that it is outside the scope of this article to dig into
every facet of DNS, and I'll just be hitting on the basics (remember, we still have to get to e-mail!). If
you administer a name server, need to know more about BIND or even walk by a machine running a name server,
you owe it to yourself to pick up "DNS and BIND" by Albitz and Liu, published by O'Reilly. Best. DNS. Book.
Ever. I still have my worn 2nd edition. The only reason I haven't picked it up in a bit is because I
committed it to memory long ago.
The Details, Please!
Most techies (that's us) understand DNS structure. Generic top-level domains, such as 'com', 'mil', 'info'
and 'org' hang just off of the root ("."). Delegated zones live beneath the gTLDs, and subdomains and hosts
follow. When queried about a zone that it operates (or, is authoritative for), the name server will consult its
database. BIND uses a simple flat-file scheme for its database, which is read at name server startup. Each
entry in the database is referred to as a record. There are many different kinds of records, and I won't be
covering them all, just the ones germane to this discussion.
The main record to understand immediately is the A, or "address" record. This provides the core
functionality: it maps a name to an address. An excerpt from Apple's DNS may look something like this (the
"IN" stands for "INternet class" - somewhat of a relic, but still a requirement):
www.apple.com. IN A 17.112.152.32
mail-in3.apple.com. IN A 17.254.13.8
nserver.apple.com. IN A 17.254.0.50
So, when you request www.apple.com in your browser, your system asks DNS to resolve the name, and in turn
is told '17.112.152.32'. (For those of you who know this cold, yes, this is a simplified version of how this
all works). You can use DNS to help transition to a new server by changing the A record. Let's say you run
web services at www.example.com. The A record for www points to 10.0.0.10. One day, you buy the biggest,
shiniest new server. You could set it up as 10.0.0.11, configure it, and once ready, alter DNS to map
www.example.com to 10.0.0.11. People don't have to learn a new name, but they will get routed to the new
server.
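To make the cut-over concrete, here is a minimal sketch in Python. It is not how BIND works internally; it just reads lines in the "name IN A address" form shown above into a lookup table and then swaps the address for www. The function name parse_a_records and the ZONE text are made up for this illustration.

```python
# Hypothetical helper for illustration only: read "name IN A address" lines
# (the format shown in the article) into a simple name -> address table.
def parse_a_records(zone_text):
    records = {}
    for line in zone_text.splitlines():
        fields = line.split()
        # Expect exactly four fields: owner name, class "IN", type "A", address.
        if len(fields) == 4 and fields[1] == "IN" and fields[2] == "A":
            # Drop the trailing root dot so lookups can use the familiar form.
            records[fields[0].rstrip(".").lower()] = fields[3]
    return records


# Assumed zone content, mirroring the www.example.com scenario.
ZONE = """\
www.example.com. IN A 10.0.0.10
"""

table = parse_a_records(ZONE)
old_address = table["www.example.com"]

# The "migration": the name stays the same, only the address behind it changes.
table["www.example.com"] = "10.0.0.11"
```

Anyone asking for www.example.com after the change gets 10.0.0.11 (once cached answers expire), which is the whole trick.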
Didn't You Say Something About Mail?
Move all of that DNS information over in your brain for a moment; I'll tell you when to slide it back into
the center again. E-mail has long been one of my favorite systems. I'm not sure why. It absolutely disturbs
my inner core when an e-mail system is not working. I started off with sendmail (before the m4 macros were
the way to write your cf files), and shifted to Postfix when it was still Secure Mailer. What is seemingly a
very simple thing - moving a text file from point A to point B - actually has a fair amount of complexity.
If you've only ever configured the mail system via the GUI, or come from an Exchange, Kerio or CommuniGate
background, there's a simple issue that you may not know about. The subsystem that is responsible for sending
and receiving mail is completely separate from the subsystem that allows you access to your mail spool. When
someone from the outside sends mail that is destined for your mail server, Postfix is responsible for
receiving that mail via the SMTP protocol, figuring out what to do with it, and either delivering it
personally, or handing it off to an intermediary. When you set up a mail client (or "MUA" - Mail User
Agent) to retrieve mail via the POP protocol ("boooooo!") or the IMAP protocol ("Yeah!"), Cyrus is handling
those requests. When you send an e-mail via your MUA, be it Entourage, Mail.app, Eudora, whatever - once
again, Postfix is responsible for talking to your mail client, queuing and sending the mail to the remote
system. So how does this all tie together?
Once again, imagine you run all services for example.com. When mail destined for mail.example.com is
presented to your OS X Server, Postfix, listening on port 25 for activity, springs into action. Postfix
itself has several subsystems that deal with various stages of accepting, queuing and delivery. However, one
of the first things Postfix must do is decide whether or not this piece of mail belongs on this server!
Is it really for 'example.com'? If so, is there a user by the name requested? Important questions,
right? If the mail truly belongs here, it has to get added to the correct user's mail spool. In OS X's case,
Postfix hands the mail off to Cyrus via LMTP.
Most people have heard of SMTP, the Simple Mail Transfer Protocol. This protocol is responsible for
delivering mail between two systems over a network - typically over a WAN or the Internet. Somewhat lesser
known is LMTP, the Local Mail Transfer Protocol. OS X uses LMTP exactly as designed: to allow two mail
systems residing on the same machine to exchange mail (actually, these two can also be on two separate machines
on the same network). LMTP uses the same semantics as SMTP, but avoids queuing. LMTP must process each
message as it is received. It is also stipulated that LMTP must not run on port 25. In our case,
LMTP exists as a socket in /var/imap/socket.
Once the message is handed off via LMTP, Cyrus is responsible for dropping the message in the proper user's
mail store (specifically, Cyrus' 'deliver' agent does this, although nowadays, 'deliver' is deprecated, and
is just a wrapper for LMTP delivery). When a person commands their e-mail program to "Get New Mail", Cyrus
is responsible for answering the request by accessing the mail store, and answering via the POP or IMAP
protocols.
Uh-oh
You're a freelancer or in-house IT person that's responsible for a mail system, and you get the call: "I'm
not receiving any mail!" What to do? First, find out if it's just the person who called, or if the problem
affects everyone. If you have an account on the system in question, send yourself a test message from your
Gmail account (unless, of course, someone from Google is reading this and you're trying to troubleshoot Gmail.
In that case, use Hotmail). If you receive it, great. It's likely limited to that one person that called.
If you didn't receive it, time to start looking at the logs. Let's stop right here for a second while I
admit something. I love logs. I'm a bit obsessed with them. A future column will touch a bit more on logs,
but let's just remind ourselves how important they are. Logs are the heart of your system. I typically keep
one machine nearby that does nothing but watch logs for me. It's good to glance at so you get used to the
rhythm. That way, when that rhythm gets broken, you can recognize it instantly. That said, you want to
follow /var/log/mail.log right now. I recommend using 'tail -f' in Terminal for this. Send that test message
again. Is there an entry in the log for this?
OK, now's the time to dust off that DNS information. If there's no entry in the mail log, it's likely that
the mail server is just fine; rather, it's DNS that is having an issue. As mentioned, there are several kinds
of DNS records, the A record just being one. The MX record defines a 'mail exchanger' for the domain or
subdomain. If that's not configured properly, mail isn't going to get delivered.
When I type up and send an e-mail, my MUA hands off the message to my e-mail server (or, Mail
Transfer Agent - MTA). The mail server then queues the message and figures out how to deliver said message.
If my message is destined for steven@radiotope.com, the server must first find out where it is to send this
mail. DNS has the answer. This is a three-part conversation that would go something like this:
example server: "Excuse me, DNS for radiotope.com, can you please give me your mail exchanger?"
radiotope DNS: "Sure, it's www.radiotope.com."
example server: "Oh. I've never dealt with that server before, what's its IP address?"
radiotope DNS: "No problem, that's 69.55.224.105"
example: "Thanks!"
example: "Pardon me, www.radiotope.com, I've got some mail for you!"
radiotope mail server: "I'll take that, thank you!"
Notice how much of that conversation happens with the DNS server. Fortunately, the mail server will cache
the results for radiotope.com, and won't have to ask again until the cache TTL (time to live) runs out.
So, if DNS isn't correct, the mail will never hit the proper mail server, and you'll never see it in the log file.
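The conversation above can be sketched in a few lines of Python. This is only a model of the lookup order (MX first, then the A record for the exchanger), not a real resolver: the two dictionaries stand in for DNS, the preference value of 10 is an assumption (the article doesn't give one for radiotope.com), and find_mail_server is a made-up name.

```python
# Stand-ins for DNS data. A real mail server would ask a resolver instead.
# The preference value 10 is assumed for this sketch.
MX_RECORDS = {"radiotope.com": [(10, "www.radiotope.com")]}
A_RECORDS = {"www.radiotope.com": "69.55.224.105"}


def find_mail_server(address, mx_records, a_records):
    """Return (exchanger, ip) for the recipient's domain, or None if lookup fails."""
    # Part one: which domain is this mail for?
    domain = address.rsplit("@", 1)[1].lower()

    # Part two: ask for the domain's mail exchanger (the MX record).
    exchangers = mx_records.get(domain)
    if not exchangers:
        # Simplification: real MTAs have fallback rules when no MX exists.
        return None
    _preference, host = min(exchangers)  # lowest number is tried first

    # Part three: ask for the exchanger's address (the A record).
    ip = a_records.get(host)
    if ip is None:
        return None
    return host, ip
```

Calling find_mail_server("steven@radiotope.com", MX_RECORDS, A_RECORDS) walks the same steps as the dialogue. If either table is wrong or empty, the function comes back with nothing, and that's exactly the case where your mail log stays silent.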
I've had an incredible increase in the amount of mail server troubleshooting that I'm doing for people.
Apple certainly hasn't had a perfect record here. Up until 10.3.4 in the Panther series, people were bitten
with the 'log rolling bug'. That did get fixed. However, we're all still living with the 'reconstruct bug'.
When you check the mail.log file, you may see entries like this:
Jul 5 10:33:38 mailserver deliver[379]:
connect(/var/imap/socket/lmtp) failed: Connection refused
Or, this:
Jul 3 04:49:27 mail postfix/pipe[8805]: 6E8AF9A9F2:
to=<jimbob@example.com>, relay=cyrus, delay=135898, status=deferred
(temporary failure. Command output: couldn't connect to lmtpd: Connection refused_
421 4.3.0 deliver: couldn't connect to lmtpd_ )
On a busy mail server, you'll see a lot of these. They're both commonly associated with a corrupt Cyrus
database. Time to stop mail, and for a little explanation. Stop the mail services:
serveradmin stop mail
Although the issue here is with Cyrus, this command will stop both Postfix and Cyrus. You don't want
things trying to deliver to a corrupt mail store. As mentioned earlier, Postfix hands the message to Cyrus
via LMTP via a socket file. It was also mentioned that LMTP doesn't perform queuing - it is required to
process each message as it's received. In this case, it can't. Either Cyrus isn't running at all, or it has
corruption in its database. To see if Cyrus is running, look for the evidence of 'master' in your process
list:
# ps ax | grep [m]aster
206 ?? Ss 0:00.01 nfsd-master
227 ?? Ss 0:57.27 master
29133 ?? Ss 0:00.05 master
Whaaaa? There's two! For better or worse, both Cyrus and Postfix have their master processes named
"master". You can see which is which with lsof:
lsof | grep master
You should see a group of master processes owned by root, followed by a group owned by cyrus (cyrus-imap on
Tiger). If Cyrus is genuinely not running, watch your system log (/var/log/system.log) and hand crank it:
"/usr/bin/cyrus/bin/master &". Naturally, you can also stop and start all mail services while watching the
log.
Cyrus
Unlike many other IMAP and POP servers, Cyrus takes a different approach to handling mail. Traditional
IMAP servers, such as UW-IMAP, really don't get involved in the delivery process at all. You tell them where
the mail spool is, and they relay that to the user. In some cases, the IMAP server will be responsible for
taking mail out of the system spool and moving it to a spool in the user's home. Not so with Cyrus. Once
Postfix hands off the message to Cyrus, the 'deliver' agent is responsible for getting the mail into the
correct mail queue.
Here's where Cyrus is completely different. Deliver will drop the message into the correct user's mail
store, sitting at /var/spool/imap/user/(username) - this is the default location for OS X, and you can move it
to another partition, so be aware if you have done so. "deliver" is Cyrus' Mail Delivery Agent (MDA). That's
all standard, but Cyrus goes above and beyond this, keeping a database of mailboxes, ACLs (not the Apple
filesystem ACLs - Cyrus sports its own ACL list that can let others have access to your mailboxes. As far as
I know, there is no Apple-official way to change the Cyrus ACLs), seen messages, quotas, etc. The databases
that keep track of this are in BDB format, and live in /var/imap.
Like any database, there's always a chance that it will get corrupted or somehow out-of-sync. Sometimes,
I've seen this happen with a system crash. OK, that makes sense. The disks went down dirty, and files didn't
get closed. Oddly, I've also seen Cyrus go batty with no great explanation (however, I do sometimes show up
on the scene to clean up, and let's face it, people aren't always 100% accurate about the events that led up
to the whole mail system going down). But if you're seeing errors in the logs like the two above, there's a
good chance you have some Cyrus database problems.
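If you'd rather count than eyeball, here's a rough Python sketch that scans log lines for the two failure signatures quoted earlier. The patterns are taken straight from those two sample entries and nothing more, so treat them as assumptions about what your own log looks like. scan_mail_log is a name invented for this example.

```python
import re

# Signatures lifted from the two sample log entries in this article.
LMTP_FAILURE = re.compile(
    r"connect\(/var/imap/socket/lmtp\) failed|couldn't connect to lmtpd"
)
RECIPIENT = re.compile(r"to=<([^>]*)>")


def scan_mail_log(lines):
    """Return (failure_count, set_of_deferred_recipients) for an iterable of log lines."""
    failures = 0
    deferred = set()
    for line in lines:
        if LMTP_FAILURE.search(line):
            failures += 1
        # Postfix notes the recipient and a status on its delivery lines.
        match = RECIPIENT.search(line)
        if match and "status=deferred" in line:
            deferred.add(match.group(1))
    return failures, deferred


# The article's samples, joined back into one line per entry.
SAMPLE = [
    "Jul 5 10:33:38 mailserver deliver[379]: "
    "connect(/var/imap/socket/lmtp) failed: Connection refused",
    "Jul 3 04:49:27 mail postfix/pipe[8805]: 6E8AF9A9F2: "
    "to=<jimbob@example.com>, relay=cyrus, delay=135898, status=deferred "
    "(temporary failure. Command output: couldn't connect to lmtpd: "
    "Connection refused_ 421 4.3.0 deliver: couldn't connect to lmtpd_ )",
]
```

To run it against a live log, open /var/log/mail.log and pass the file object to scan_mail_log; a file iterates line by line, so nothing else is needed.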
Cyrus database problems can come in several flavors, so let's hit the highlights:
Permissions errors
Our good old friend 'permissions' is back! /var/imap should be owned by cyrus:mail, with perms of 755, all
the way down. Same goes for /var/spool/imap, with one exception: /var/spool/imap/user should be a little
more restrictive: 700 for it and its contents.
If Cyrus can't read and write into those folders, it's going to have problems delivering and retrieving
mail. So double-check those folders!
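Double-checking by hand gets old on a big mail store, so here is a small Python sketch that walks a tree and reports directories whose permission bits differ from what you expect. It makes two assumptions worth stating: it only looks at directories (not the files inside them), and it does not check ownership. Both function names are invented for this example.

```python
import os
import stat


def mode_matches(path, expected_mode):
    """True if the permission bits of path equal expected_mode (for example 0o755)."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected_mode


def find_mode_mismatches(top, expected_mode):
    """Walk top and return every directory whose permission bits differ."""
    mismatches = []
    for dirpath, _dirnames, _filenames in os.walk(top):
        # Assumption: only directories are checked, files are left alone.
        if not mode_matches(dirpath, expected_mode):
            mismatches.append(dirpath)
    return mismatches
```

Run as a user that can read the tree, something like find_mode_mismatches("/var/imap", 0o755) gives you a list to investigate. Checking /var/spool/imap the same way will naturally flag the user directory, since that one is meant to be 700; run a second pass on /var/spool/imap/user with 0o700 instead.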
Damaged Socket File
You should really never see this condition. However, this is a variation on the first condition: the
socket could have the wrong permissions assigned (although, that should never change unless you touch them!).
Bonus: they auto create on Cyrus startup. Just nuke 'em from /var/imap/socket with mail services stopped.
They'll get created properly when services are restarted. For the record, a Tiger server at minimum will look
like this:
# ls -l /var/imap/socket
total 0
-rw------- 1 cyrusima mail 0 Jul 2 00:12 imap-1.lock
srwxrwxrwx 1 root mail 0 Jul 2 22:50 lmtp
Database corruption
This seems to be the big one. There are several things that can go a bit batty here. I can't stress the
importance of backups enough. The crummy thing is that to back up Cyrus properly, you need to shut down mail,
and there's no great alternative on OS X. Hot backups here can lead to inconsistent states. Backup, backup,
backup, and test, test, test! I mention this simply because restoring the database from the mail spool is
really imperfect. There are certain things that cannot be inferred from the mail store, so those things get
set back to a default state.
Here's how this works: as Cyrus drops mail into the mail spool (/var/spool/imap) appropriately, it updates
its databases (/var/imap). They're relatively independent - the mail spool can live without the database,
however, that's where Cyrus stores all of the metadata (remember that stuff?), so it needs to be protected.
Cyrus keeps watch over this by check-pointing the database at regular intervals. The tool that rebuilds the
database is called 'reconstruct'. I'll spare you the long story and cut right to the chase: Apple's latest
version of reconstruct is a bit FUBARed. The bad version snuck in under a 'security update'. There's an open
source replacement that corrects the mistake that Apple's version makes. Get it here:
http://www.sussex.ac.uk/Users/jamesg/reconstruct.zip. The (long) explanation for this takes place in this
thread on Apple's support boards:
http://discussions.info.apple.com/webx?7@1022.JKM0abd8V6f.2@.68aec789/9
Is there any harm in running Apple's version? Well, yes. Apple's version will reset the internal
time-stamp of messages to '0' (zero). This won't affect every mail client, but guess which two it will?
Apple's Mail.app and Entourage - the two most popular mail clients on OS X. How do you know if you've been hit
with this? If, after a reconstruct, your mail clients are choking while trying to check for new mail, but
webmail (using the built-in SquirrelMail) is fine, it sounds like this is the issue. Get the replacement from
the URL above, choose the version appropriate for your platform (yes, this currently still exists in 10.4.1).
Back up your original (mv /usr/bin/cyrus/bin/reconstruct /usr/bin/cyrus/bin/reconstruct.apple) and drop the
replacement into the same directory (/usr/bin/cyrus/bin/reconstruct). Set the wheels in motion like this:
1. Start a pot of water boiling. Now is no time to panic, and tea will do you some good. Preferably some green or camomile.
2. Stop all mail services. If you didn't do so above, do so now: serveradmin stop mail
3. Back up /var/spool/imap and /var/imap...just in case
4. Run reconstruct. You must do this as cyrus:
sudo -u cyrus /usr/bin/cyrus/bin/reconstruct -i
The '-i' flag here will rebuild sub folders for each user as well. You should see messages scrolling by as
reconstruct works its way through each user. Depending on the size of the mail store, this could take some
time. Now's the time to pour that cup of tea.
Once that is complete, try firing up the mail system again: serveradmin start mail. On rare occasions,
this won't start Postfix. Sip that tea, and simply type: postfix start. No problem. At this point, mail
should be flowing again. Queued mail, and new mail coming in should be getting delivered, and people should
be able to access their mail. This alone will make them happy enough to ignore this issue: ALL of their mail
will now be marked unread. Flags will be lost. But! Their mail will be back! Make the rounds, then finish
off that tea. Maybe even get a cookie. What the heck.
Redundant Redundancy
A brilliant concept with DNS and mail is that you can specify multiple MX records. You can (and should)
assign them a priority. "0" is the highest priority, with everything else being lower. So, when DNS
specifies:
example.com. IN MX 10 ren.example.com.
example.com. IN MX 20 stimpy.example.com.
"10" is the higher priority server. That's where mail should go. However, if ren.example.com is down, the
mail will get delivered to stimpy.example.com. For best effect, ren and stimpy should be on separate
networks. In different buildings. In different states, if possible. If you run mail services in-house, see
if your ISP will perform backup MX duties for you. Most will. Like backing up data nightly, having a backup
mail server (or two...or three) is more than worth it. Don't walk this tightrope with no net: have a backup
mail server with properly configured MX records.
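Here is the ren-and-stimpy arrangement as a tiny Python sketch: sort the exchangers by their number, then take the first one that answers. The is_reachable callback is a stand-in for "try to connect on port 25", and both function names are made up for illustration. A real sending server has more rules than this, so read it as the idea, not the implementation.

```python
def delivery_order(mx_records):
    """Sort (preference, host) pairs so the lowest preference number comes first."""
    return [host for _preference, host in sorted(mx_records)]


def pick_exchanger(mx_records, is_reachable):
    """Return the first reachable exchanger in preference order, or None."""
    for host in delivery_order(mx_records):
        if is_reachable(host):
            return host
    return None


# The two MX records from the example above, deliberately listed out of order.
MX = [(20, "stimpy.example.com"), (10, "ren.example.com")]
```

With everything up, pick_exchanger(MX, lambda host: True) returns ren.example.com. Take ren down (a callback that rejects ren) and the same call hands back stimpy.example.com, which is the backup doing its job.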
In the previous section, we had to take our mail services out of commission while we repaired the Cyrus
databases. If someone was trying to send us mail at that moment - something that has a very high probability
- without a backup mail server, the sending server can only queue the message and keep retrying, and if we stay down long enough it's going to bounce back to them. We'll have dropped off the face of the
Internet! However, with the backup server in place, the mail will get routed there, and be delivered to our
main mail server once it's back on-line. Brilliant, just brilliant!
Other Troubleshooting
Naturally, for a mail server to receive mail, it must be accessible. I recommend that anyone who does
network troubleshooting have an externally accessible machine - preferably through ssh, although ARD and
Timbuktu will let you get a shell also. This will be used for the purpose of getting an outside view of the
network you're currently working on.
To test SMTP, access your outside machine, get a shell, and type:
telnet mail.example.com 25
Of course, you need to substitute the Fully Qualified Domain Name (FQDN) of your mail server. You should
receive a reply like this:
Trying 192.168.0.18...
Connected to mail.example.com.
Escape character is '^]'.
220 mail.example.com ESMTP Postfix
Type 'quit' and press return to get back to your prompt. If you do not get this greeting, either Postfix
is not running, or something is blocking access. If your external test machine is on a residential network,
it's likely that port 25 is blocked. Make sure you don't let that throw you off. You can have Postfix listen
on both 25 and an alternate port by adding this line to /etc/postfix/master.cf:
8025 inet n - n - - smtpd
Where 8025 is the new port that Postfix will listen to. Save the file, and type:
postfix reload
to restart Postfix and allow it to incorporate this change.
You can test IMAP in a similar way. Again, using telnet, connect to port 143:
# telnet mail.example.com 143
Trying 192.168.0.18...
Connected to mail.example.com.
Escape character is '^]'.
* OK mail.example.com Cyrus IMAP4 v2.2.12-OS X 10.3 server ready
Just like the SMTP test, if you do not see this greeting, either Cyrus is not serving up IMAP for some
reason, or the port is being blocked from you. While I haven't found an ISP that currently blocks IMAP,
perhaps there is an errant firewall rule preventing access. When you're done here, type "01 logout" and press
return.
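If you find yourself doing these telnet checks often, the same test can be scripted. Below is a minimal Python sketch, assuming all you care about is the first line the server sends. read_banner opens a TCP connection and returns that line; the two looks_like helpers only check the prefixes seen in the greetings above ("220" for SMTP, "* OK" for IMAP). All three names are invented for this example.

```python
import socket


def read_banner(host, port, timeout=10.0):
    """Connect to host:port and return the first line the server sends ('' if none)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        data = conn.recv(1024)
    if not data:
        return ""
    return data.decode("ascii", errors="replace").splitlines()[0]


def looks_like_smtp(banner):
    """An SMTP server ready for business greets with reply code 220."""
    return banner.startswith("220")


def looks_like_imap(banner):
    """An IMAP server ready for business greets with an untagged OK."""
    return banner.startswith("* OK")
```

From your outside machine, looks_like_smtp(read_banner("mail.example.com", 25)) stands in for the first telnet test, and port 143 with looks_like_imap covers the second. A refused or timed-out connection raises an exception (an OSError), which tells you the same thing as telnet hanging: something is down or blocked.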
Serendipity
Like this column, e-mail and DNS are completely intertwined. Both have nuances and pitfalls that require a
bit of knowledge. I've been helping people out more and more frequently with mail issues. The sad news is
that OS X's built-in mail system is just about there, but not quite. The problems lie more on the Cyrus side
- specifically due to the way it's implemented on OS X. You'll find plenty of stable Cyrus servers running
on other platforms. Postfix, thankfully, is just a rock. When troubleshooting mail, remember: mail is really
a bag of tricks, and not one single coherent system. There are, of course, other factors that will impact
mail delivery, such as network design, choice of mail host, content filtering schemes, intermediary mail
routers, firewall and/or NAT rules, and more. Many facets of your seemingly distinct pieces of network
equipment are inherently intertwined.
This isn't always the easiest of stuff. Questions? Feel free to send 'em my way at
emarczak@mactech.com.
Ed Marczak owns and operates Radiotope, a technology consulting company that implements mail
servers and mail automation. When not typing furiously, he spends time with his wife and two daughters. Get
your mail on at http://www.radiotope.com