Highly Available/Scalable Sendmail Using Sendmail Clusters on Linux
About the authors
Jay D. Allen works by day on the leading edge of IT for IBM, mostly with Linux. By night, Jay works on the trailing edge of IT, mostly with DEC PDP-11s and other antiques. Contact him at allen5@us.ibm.com.
Peter Bogdanovic has worked as a software engineer and Unix system administrator since 1992. He currently works for IBM's Linux Competency Center. Contact him at bognovic@us.ibm.com.
Clifford White is a solutions engineer for IBM and works for IBM's Linux Competency Center. Contact him at cliffw@us.ibm.com.
Clusters of servers running Sendmail can deliver high performance and high availability at competitive prices. Experienced systems administrators have long used this approach. This article describes our study of ways to achieve a highly available, scalable Sendmail service and quantifies the results. We studied several configurations of Sendmail clusters on Linux and measured their relative performance. We investigated and tested common performance tuning parameters in Sendmail's configuration and in the Linux operating system. We didn't have a shared disk for these tests, so we limited the project's scope to SMTP routing and queueing. This would be a common configuration for a Sendmail cluster on the edge of a private network or as the front end for an internal mail store. While our hardware resources were modest, we believe our results are valuable to system architects who want to implement clusters of Linux-based Sendmail servers, because the relative differences we measured illustrate the relative importance of a Sendmail cluster's design features.
There are many configuration options for Sendmail, LDAP and DNS, but we will consider only those important for this application. Unless otherwise stated, we used stock software and default settings. Of these options, we found a small handful of factors that have dramatic effects on performance or that are required for scalability, like LogLevel and QueueDirectory.
Ultimately, we found that even when Sendmail is correctly configured, all the important factors led us to two facts:
What we found
We evaluated several test scenarios:
For the Load Balancer scenario, we tried both the Alteon 180 appliance and a dedicated Linux server running the balance software. We used a single host to find the optimum Sendmail configuration, stepping through variations in the important configuration factors. Using the results of this testing, we derived an optimized configuration and used it in our different cluster configurations.
DNS round-robin is a straightforward way to multiplex incoming Internet SMTP traffic across machines. In its simplest form, several A records are entered for the one mail server hostname, and each participating Sendmail server is configured to receive mail on behalf of that hostname. When a sender delivers mail to the recipient, it makes a DNS query, and the result contains a list of all the A records for that host. By default, most MTA implementations take the first member of the list. Repeated queries for the same hostname yield a rotating list of IP addresses (this is a feature of BIND/DNS). For example, if the name "www.ibm.com" is looked up on the Internet, the following IP address list is returned:
Name: www.ibm.com
Addresses: 129.42.16.99, 129.42.17.99, 129.42.18.99, 129.42.19.99
Repeating the query returns:
Name: www.ibm.com
Addresses: 129.42.19.99, 129.42.16.99, 129.42.17.99, 129.42.18.99
And again returns:
Name: www.ibm.com
Addresses: 129.42.17.99, 129.42.18.99, 129.42.19.99, 129.42.16.99
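The rotation can be illustrated with a small Python model. It is an idealized sketch, assuming the server rotates its answer by exactly one position per query and that every client connects to the first address returned; the real answers above do not always advance by one position, so this shows the principle rather than BIND's actual algorithm. The class name `RoundRobinZone` is hypothetical.

```python
from collections import Counter, deque


class RoundRobinZone:
    """Idealized round-robin DNS: each answer is the previous one rotated by one."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def query(self):
        answer = list(self._records)
        self._records.rotate(-1)  # the next answer starts with the next address
        return answer


# A records from the www.ibm.com example above.
zone = RoundRobinZone(["129.42.16.99", "129.42.17.99",
                       "129.42.18.99", "129.42.19.99"])

# Most MTAs connect to the first address in the answer.
first_choices = Counter(zone.query()[0] for _ in range(1000))
for address, count in sorted(first_choices.items()):
    print(address, count)
```

Each of the four addresses is the first choice for 250 of the 1,000 simulated lookups, which is the even spread round-robin DNS aims for.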
In Figure 1, we see round-robin DNS at work. All of the external interfaces of the Sendmail servers are directly connected to the Internet and published in DNS. Each machine acts as an SMTP router/buffer, taking mail from the Internet and delivering it to a common mailbox server of some kind on a private network. This sort of arrangement is easy to set up, is inexpensive and, if done correctly, can be trouble free.
Figure 2 illustrates the problem with round-robin DNS. Because of the way DNS works, and because DNS records may be cached, mail agents might still try to contact a host after it fails. At best, mail is delayed while connections time out, then queued and resent after a backoff period.
Figure 1
Figure 2
In recent years, the use of external workload directors has grown. These specialized devices can distribute load intelligently between several servers and handle failover/failback. We did much of our testing with the Alteon 180 switch. It allowed us to create a virtual mail server. Mail arriving for this virtual server is passed round-robin to each of the three real servers. If for some reason one of the Sendmail servers should fail, the Alteon will stop sending new connections to that server (periodically checking to see if it has returned).
One of the benefits of this configuration is that it does not share the weakness of round-robin DNS solutions. Since only one IP address is published to the world, there is no reason to worry about cached entries when you add or remove members of the cluster. Our experience with products like the Alteon 180 and F5 Networks' BIG-IP has been generally positive. However, the equipment can be expensive and certainly adds another point of maintenance. To provide maximum protection and avoid single points of failure, install two or more of the devices.
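The behavior described for the Alteon (round-robin distribution, skipping a failed server, and periodically checking whether it has returned) can be sketched in Python as follows. This is a simplified model, not the actual algorithm of the Alteon 180 or of balance; the server names and the `is_healthy` probe are hypothetical, and a real probe would be something like an SMTP connection test.

```python
from itertools import cycle


class HealthCheckedBalancer:
    """Round-robin over real servers, skipping any that failed a health check."""

    def __init__(self, servers, is_healthy):
        self.servers = list(servers)
        self.is_healthy = is_healthy      # hypothetical probe: name -> bool
        self.up = set(self.servers)       # assume every server starts healthy
        self._ring = cycle(self.servers)

    def run_health_checks(self):
        """Call periodically: mark failed servers down and recovered ones up."""
        self.up = {s for s in self.servers if self.is_healthy(s)}

    def next_server(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.servers)):  # one full pass around the ring is enough
            candidate = next(self._ring)
            if candidate in self.up:
                return candidate
        raise RuntimeError("no healthy mail servers")


status = {"mail1": True, "mail2": True, "mail3": True}
lb = HealthCheckedBalancer(status, lambda name: status[name])
print([lb.next_server() for _ in range(3)])  # ['mail1', 'mail2', 'mail3']
status["mail2"] = False
lb.run_health_checks()
print([lb.next_server() for _ in range(4)])  # ['mail1', 'mail3', 'mail1', 'mail3']
```

New connections skip mail2 only after a health check notices the failure; between checks, a real balancer would also rely on connection failures to detect a dead server.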
Figure 3
Figure 4
Using multiple MX records is a traditional Sendmail/DNS configuration. A hostname in DNS can have one or more mail exchanger (MX) records, each with a preference value (a weighting factor). When other SMTP servers attempt to deliver mail to this host, they first look up its MX records and their preferences. When there is more than one MX record, the host with the lowest preference number is tried first, and the mail is delivered to the first host that answers, in order of lowest preference number.
In the following example, a domain has two MX records, one with a weight of 10 pointed at the Chicago office, and one with a weight of 20 pointed at the New York office. Under normal circumstances, all the mail will go through the Chicago office, flowing over the corporate WAN for internal delivery.
If the Chicago office should fail, or routes to the Chicago office fail, mail will automatically flow via the New York office.
If the Chicago office fails in the middle of an SMTP transaction, that transaction fails (the red line). Because SMTP is transactional, message integrity is assured: the sender will time out, back off, and resend the entire message later. If the Chicago office is still down, mail will automatically flow through New York.
The main benefit of this method is that it is a very mature process that is well documented and understood. It also requires only minor configuration changes and no additional software or hardware. Unfortunately, the workload is not evenly distributed, so in practice, under heavy load, the service may "flap" back and forth. MX records solve availability problems, not scalability problems. The result is that you end up buying twice as much hardware to handle the possibility of a failure during peak traffic loads.
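The MX selection rule can be sketched in Python: sort the MX hosts by preference, lowest first, try each in turn, and leave the message queued for a later retry if none answers. The host names below are hypothetical stand-ins for the Chicago and New York offices, and `try_deliver` is a stub for a real SMTP session. The shuffle among equal preferences follows the SMTP specification (RFC 5321, section 5.1), which requires senders to randomize hosts that share a preference.

```python
import random
from itertools import groupby


def order_mx_hosts(mx_records, rng=random):
    """Return MX hosts in the order a sender should try them.

    mx_records: list of (preference, host) tuples. Lower preference
    values are tried first; hosts sharing a preference are shuffled.
    """
    ordered = []
    by_pref = sorted(mx_records, key=lambda rec: rec[0])
    for _, group in groupby(by_pref, key=lambda rec: rec[0]):
        hosts = [host for _, host in group]
        rng.shuffle(hosts)
        ordered.extend(hosts)
    return ordered


def deliver(mx_records, try_deliver):
    """Try each MX host in order; return the host that accepted the message."""
    for host in order_mx_hosts(mx_records):
        if try_deliver(host):
            return host
    return None  # every MX failed: the sender queues the message and retries later


# Hypothetical records for the Chicago/New York example.
mx = [(20, "mx.newyork.example.com"), (10, "mx.chicago.example.com")]

print(deliver(mx, lambda host: True))                   # mx.chicago.example.com
print(deliver(mx, lambda host: "chicago" not in host))  # mx.newyork.example.com
```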
Figure 5
Figure 6
Real-world solutions tend to be hybrids drawing on all of the technologies described above. For example, Internet mail sent to IBM is delivered to one of three regional centers: Colorado, New York or North Carolina. Through a combination of MX records for failover and round-robin DNS, IBM provides fast, reliable Internet mail. Here are the public DNS records:
us.ibm.com preference = 10, mail exchanger = e22.nc.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e23.nc.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e24.nc.us.ibm.com
us.ibm.com preference = 20, mail exchanger = e1.ny.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e2.ny.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e3.ny.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e4.ny.us.ibm.com
us.ibm.com preference = 20, mail exchanger = e31.co.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e32.co.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e33.co.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e34.co.us.ibm.com
us.ibm.com preference = 20, mail exchanger = e21.nc.us.ibm.com
Repeated DNS queries yield a rotating list of MX records:
us.ibm.com preference = 10, mail exchanger = e2.ny.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e3.ny.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e4.ny.us.ibm.com
us.ibm.com preference = 20, mail exchanger = e31.co.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e32.co.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e33.co.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e34.co.us.ibm.com
us.ibm.com preference = 20, mail exchanger = e21.nc.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e22.nc.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e23.nc.us.ibm.com
us.ibm.com preference = 10, mail exchanger = e24.nc.us.ibm.com
us.ibm.com preference = 20, mail exchanger = e1.ny.us.ibm.com
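Because sending MTAs sort MX records by preference before using them, rotating the answer changes only the order within a preference level, not which hosts are preferred. This Python sketch groups the us.ibm.com records listed above by preference and region and checks that the rotated answer groups the same way. Reading the region from the second label of the host name (`nc`, `ny`, `co`) is an assumption based on the names shown, not a documented convention.

```python
from collections import defaultdict

# The first answer listed above: (preference, mail exchanger).
RECORDS = [
    (10, "e22.nc.us.ibm.com"), (10, "e23.nc.us.ibm.com"),
    (10, "e24.nc.us.ibm.com"), (20, "e1.ny.us.ibm.com"),
    (10, "e2.ny.us.ibm.com"),  (10, "e3.ny.us.ibm.com"),
    (10, "e4.ny.us.ibm.com"),  (20, "e31.co.us.ibm.com"),
    (10, "e32.co.us.ibm.com"), (10, "e33.co.us.ibm.com"),
    (10, "e34.co.us.ibm.com"), (20, "e21.nc.us.ibm.com"),
]

# The second answer above is the same list rotated by four positions.
ROTATED = RECORDS[4:] + RECORDS[:4]


def group_by_preference(records):
    """Map preference -> region -> sorted list of hosts."""
    groups = defaultdict(lambda: defaultdict(list))
    for pref, host in records:
        region = host.split(".")[1]  # assumed: second label names the region
        groups[pref][region].append(host)
    return {pref: {region: sorted(hosts) for region, hosts in sorted(regions.items())}
            for pref, regions in sorted(groups.items())}


assert group_by_preference(ROTATED) == group_by_preference(RECORDS)
for pref, regions in group_by_preference(RECORDS).items():
    print(pref, {region: len(hosts) for region, hosts in regions.items()})
# 10 {'co': 3, 'nc': 3, 'ny': 3}
# 20 {'co': 1, 'nc': 1, 'ny': 1}
```

Nine preference-10 hosts, three in each region, share the load, while one preference-20 host per region is held in reserve.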
| Test Name | Queue Location | LogLevel | SharedMem | Ident | Delivery Mode | Submits/sec |
|---|---|---|---|---|---|---|
| stmpa-def | Disk | 9 | No | Yes | Async | 8.4 |
| smpta-interactive | Disk | 9 | No | Yes | Interactive | 10.0 |
| smpta-noident | Disk | 9 | No | No | Interactive | 9.7 |
| smpta-sharedmem | Disk | 9 | Yes | Yes | Interactive | 9.8 |
| smpta-log4 | Disk | 4 | No | Yes | Interactive | 24.2 |
| smtpa-ramdisk | RAM | 9 | No | Yes | Interactive | 19.4 |
| smtpa-ramdisk-log4 | RAM | 4 | No | Yes | Interactive | 48.4 |
| smtpa-ramlog9 | Disk | 9 | No | Yes | Interactive | 25.8 |
| smtpa-freshlog9 | Disk | 9 | No | Yes | Interactive | 10.1 |
| smtpa-toeleven | RAM | 9 | Yes | No | Interactive | 49.0 |
The percentage of connection errors reported by MailStone for the single-server tests was always 0.0%.
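To put the single-server results in proportion, this Python snippet computes each configuration's throughput relative to the default configuration (stmpa-def, 8.4 submits/sec). The numbers are copied from the table above; the snippet only does the arithmetic.

```python
# Submits/sec from the single-server table above.
RESULTS = {
    "stmpa-def": 8.4,
    "smpta-interactive": 10.0,
    "smpta-noident": 9.7,
    "smpta-sharedmem": 9.8,
    "smpta-log4": 24.2,
    "smtpa-ramdisk": 19.4,
    "smtpa-ramdisk-log4": 48.4,
    "smtpa-ramlog9": 25.8,
    "smtpa-freshlog9": 10.1,
    "smtpa-toeleven": 49.0,
}

baseline = RESULTS["stmpa-def"]
# Print configurations from slowest to fastest with their speedup factor.
for name, rate in sorted(RESULTS.items(), key=lambda item: item[1]):
    print(f"{name:20s} {rate:5.1f} submits/sec  {rate / baseline:4.1f}x")
```

The two fastest configurations, smtpa-toeleven and smtpa-ramdisk-log4, both run at roughly 5.8 times the default rate.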
For the multiple-server tests, we used only the highest-performing combination from the single-server tests and observed how it scaled as servers were added. For all the multiple-server tests, the options are:
| Test Name | Round Robin | Number of Servers | Submits/sec | Connection Errors |
|---|---|---|---|---|
| bigmail2-allram | Alteon | 2 | 95.9 | 43.1% |
| balance2-allram | Balance | 2 | 74.6 | 30.1% |
| bigmail3-allram | Alteon | 3 | 102.8 | 87.8% |
| balance3-allram | Balance | 3 | 103.8 | 81.4% |
Using commodity off-the-shelf computer equipment, we demonstrated that we could build a high-performance, highly available Sendmail service. In our tests, disk I/O was the most significant factor in the Sendmail system's overall performance. Configuring the server not to record logs, or to record logs to a RAM disk, greatly improves performance. The mail queues are the other disk-intensive part of the Sendmail server process. Distributing the queues across multiple directories, putting the queue directories on the fastest available file system, or moving the queues to RAM disks also improves performance significantly. Moving the mail queues to RAM disks is probably not appropriate for many installations, however, because the message delivery integrity of the Sendmail system is compromised if the OS crashes or the server hardware fails.
We were surprised to learn that, for our workload, we could replace a commercial load balancing manager, the Alteon 180, with a dedicated Linux system running the user-land program "balance". The economics of the Linux solution are compelling, but there may be operational advantages to using an appliance.
We saw performance double when we went from one Sendmail server to two, but, curiously, adding a third server had little effect. Our load generation equipment was working hard, so we may have saturated it. We also observed some TCP/IP problems under heavy load that might account for the increases in connection error rate. We recommend using more load generation machines to drive the load above the 10-million-messages-per-day mark.
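Converting submits/sec to daily volume relates these results to the 10-million-messages-per-day mark. This Python arithmetic assumes a measured rate could be sustained around the clock, which our tests did not demonstrate; the rates are the best single-, two-, and three-server results from the tables above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def messages_per_day(submits_per_sec):
    """Daily volume if a submits/sec rate were sustained for 24 hours."""
    return submits_per_sec * SECONDS_PER_DAY


# Best measured rates from the tables above, keyed by number of servers.
best_rate = {1: 49.0, 2: 95.9, 3: 103.8}

for servers, rate in best_rate.items():
    print(f"{servers} server(s): {messages_per_day(rate):,.0f} messages/day, "
          f"{rate / best_rate[1]:.2f}x one server")

# Sustained rate needed to reach 10 million messages per day.
print(f"10M/day needs {10_000_000 / SECONDS_PER_DAY:.1f} submits/sec")
```

At the measured rates, one server corresponds to about 4.2 million messages per day and two servers to about 8.3 million; reaching 10 million per day would take a sustained rate of roughly 116 submits/sec.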
Sendmail is like most traditional Unix programs: it's highly specialized but modular, so it can be easily integrated with other components to make a larger solution. A Sendmail cluster is such a solution, and we need to describe the configuration of these other components. Some components, such as the workload distributor and the network design, add to performance or scalability. Others, like LDAP and rsync, add to the manageability of a Sendmail cluster. Finally, we describe the tools we used to simulate realistic workloads for our test environment.
Feel free to contact the authors if you have questions, want more information, or want copies of the configuration files they used in this study.
DISCLAIMER: The foregoing article is based on laboratory tests undertaken in a laboratory environment. Results in particular customer installations may vary based on a number of factors, including workload and configuration in each particular installation. Therefore, the above information is provided on an AS IS basis. The WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. Use of this information is at user's sole risk.
OpenLDAP installation and configuration notes
Description
The LDAP server provides mailLocalAddress lookups for the Sendmail service. Desired state: fast LDAP responses with a single point of control for updates and administration.
For maximum speed with minimum network traffic, each mail server machine has a local LDAP server. We configured the servers as one master with two slaves. LDAP slaves maintain a read-only copy of the master database, and all updates are pushed from the master server. There was no update traffic in these tests, so replication was not a performance consideration.
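Each address lookup Sendmail makes becomes an LDAP search on the indexed mailLocalAddress attribute. As a hedged illustration, this Python sketch builds such a search filter and escapes the characters that the LDAP string filter syntax (RFC 4515) reserves. The `inetLocalMailRecipient` object class mirrors the filter commonly used with Sendmail's ldap_routing feature, but treat it as an assumption and check it against your own sendmail.mc and schema; the address is a placeholder.

```python
# Characters that must be escaped in an LDAP filter value (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value):
    """Escape a string for safe use inside an LDAP search filter."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def mail_local_address_filter(address):
    """Build an equality filter on mailLocalAddress (object class assumed)."""
    return ("(&(objectClass=inetLocalMailRecipient)"
            f"(mailLocalAddress={escape_filter_value(address)}))")


print(mail_local_address_filter("user@example.com"))
# (&(objectClass=inetLocalMailRecipient)(mailLocalAddress=user@example.com))
print(escape_filter_value("a*b(c)"))
# a\2ab\28c\29
```

Because mailLocalAddress is indexed for equality (index mailLocalAddress pres,eq in the slapd.conf below), each such search is answered from the index rather than by scanning the database.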
Building
OpenLDAP uses the well-known GNU configure script to set up the source package. On our Red Hat 7.1 system (glibc 2.2.2), the package builds fine with the standard GNU recipe:
./configure
make
make install
We installed under the default /usr/local prefix. The configuration files are under /usr/local/etc/openldap, and the LDAP data files and runtime files are under /usr/local/var.
Important Notes
Installation - configuration files
All configuration files should be in the $PREFIX/etc/openldap directory. Here are the annotated files from our installation:
Master slapd.conf
# $OpenLDAP: pkg/ldap/servers/slapd/slapd.conf,v 1.8.8.6 2001/04/20 23:32:43 kurt Exp $
#
# See slapd.conf(5) for details on configuration options.
# This file should NOT be world readable.
#
include /usr/local/etc/openldap/schema/core.schema
include /usr/local/etc/openldap/schema/cosine.schema
include /usr/local/etc/openldap/schema/inetorgperson.schema
include /usr/local/etc/openldap/schema/misc.schema
# LDAP directories use OID syntax to describe things. The syntax is described by these schema files. If the attribute type you need is not listed in these files, you're in a heap of trouble. The default slapd.conf includes only core.schema, so always check this.
# Define global ACLs to disable default read access.
# Do not enable referrals until AFTER you have a working directory
# service AND an understanding of referrals.
#referral ldap://root.openldap.org
pidfile /usr/local/var/slapd.pid
argsfile /usr/local/var/slapd.args
replogfile /usr/local/var/slapd.replog
#Locations for runtime information
# Load dynamic backend modules:
# modulepath /usr/local/libexec/openldap
# moduleload back_ldap.la
# moduleload back_ldbm.la
# moduleload back_passwd.la
# moduleload back_shell.la
#######################################################################
# ldbm database definitions
#######################################################################
database ldbm
suffix "dc=sequent,dc=com"
rootdn "dc=ent,dc=sequent,dc=com"
# These two items define the Root Distinguished name and will be used by all the db utilities.
# Cleartext passwords, especially for the rootdn, should
# be avoided. See slappasswd(8) and slapd.conf(5) for details.
# Use of strong authentication encouraged.
rootpw secret
#We're not very secure
# The database directory MUST exist prior to running slapd AND
# should only be accessible by the slapd/tools. Mode 700 recommended.
directory /usr/local/var/openldap-ldbm
# Indices to maintain
index objectClass eq
index mailLocalAddress pres,eq
index cn pres,eq
index sn pres,eq
index mailroutingaddress pres,eq
# Indexing is always nice to have in a database. LDAP is no exception
replica host=slave3:389
        binddn="cn=Manager,dc=ent,dc=sequent,dc=com"
        bindmethod=simple credentials=secret
replica host=slave5:389
        bindmethod=simple credentials=secret
        binddn="cn=Manager,dc=ent,dc=sequent,dc=com"
#The replica entries point to our slave LDAP servers.
Slave slapd.conf
This is identical to the master, except for replication. Where the master has the replica directive, the slave has a directive indicating who can perform updates:
updatedn "cn=Manager,dc=ent,dc=sequent,dc=com"
Resources
#
attributetype ( 2.16.840.1.113730.3.1.13
        NAME 'mailLocalAddress'
        DESC 'RFC822 email address of this recipient'
        EQUALITY caseIgnoreIA5Match
        SYNTAX 1.3.6.1.4.1.1466.115.121.1.26{256} )
Sendmail Cluster - BIND/DNS Configuration
Sendmail and BIND have a lot in common. Both software packages are de facto standards on the Internet. Both are fast approaching 20 years old, and both are freely available. They also share similar design features, such as a distributed computing model and transaction-oriented client-server protocols. Maybe they share so much because Sendmail and BIND/Domain Name Service (DNS) have always been so tightly integrated. That tight integration is why any high-performance Sendmail solution needs to include a high-performance DNS.
Although there are a few alternatives to BIND, we did not evaluate them. There is a reference to D.J. Bernstein's DNS server below as a prominent example.
DNS Configuration
For DNS servers, newer is better, so use BIND version 9.1.0 or later. All of our testing was done using this version. BIND 9.x servers use a threaded design that makes better use of multiprocessor machines, although all but the busiest sites, or sites that have implemented IPv6/DNSSEC, won't see much benefit. Keep abreast of any changes to BIND, and plan on upgrading at least once a year.
In general, the idea is to get BIND/DNS as close to the Sendmail server as possible. Putting a caching-only name server on each Sendmail server is a good idea: it requires no maintenance and reduces network load by accumulating a cache of records. Alternatively, you can install full DNS slave servers on each Sendmail server. If the Sendmail servers face the Internet, these full BIND slaves can service third-party lookups from the Internet (others looking up your records) as well as the local Sendmail daemon's needs (you looking up someone else's records). This also makes lookups of local names faster (they are in the slaved zone files), but it increases the configuration and maintenance demands.
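The benefit of a caching-only server comes from answering repeat lookups locally until each record's TTL expires. This minimal Python sketch models that behavior; `resolve_upstream` is a hypothetical stand-in for a real recursive query, and the injectable clock exists only so the example can be tested. It illustrates the idea and does not reflect how BIND actually stores or expires its cache.

```python
import time


class TTLCache:
    """Toy DNS cache: answers repeat queries locally until the TTL expires."""

    def __init__(self, resolve_upstream, clock=time.monotonic):
        self.resolve_upstream = resolve_upstream  # hypothetical: name -> (answer, ttl)
        self.clock = clock
        self.entries = {}            # name -> (answer, expiry time)
        self.upstream_queries = 0

    def lookup(self, name):
        now = self.clock()
        cached = self.entries.get(name)
        if cached and cached[1] > now:
            return cached[0]         # still fresh: served from the cache
        # Missing or expired: ask upstream and remember the answer for its TTL.
        answer, ttl = self.resolve_upstream(name)
        self.upstream_queries += 1
        self.entries[name] = (answer, now + ttl)
        return answer


fake_now = [0.0]
cache = TTLCache(lambda name: ("10.0.0.25", 300), clock=lambda: fake_now[0])
for _ in range(100):
    cache.lookup("mail.example.com")
print(cache.upstream_queries)  # 1: the other 99 lookups hit the cache
fake_now[0] = 301.0
cache.lookup("mail.example.com")
print(cache.upstream_queries)  # 2: the record expired and was fetched again
```

In this toy model, 100 lookups within the TTL cost a single upstream query, which is the network saving a local caching-only server provides.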

Periodically clean the cache of the DNS server. The simplest method may be to restart the named process. In theory, this is not required, but it's a safe way to make sure you don't end up with a bunged-up cache.
Keep an eye on the memory footprint of BIND. The local records and cache are all kept in memory, which over time tends to make BIND grow. At about two weeks, the records will be expiring from the cache at about the same rate they are being added, so BIND's memory size should be at a steady state. If you are concerned about limiting memory, use ulimit -m SIZE to set a reasonable memory limit.
For caching servers and slaves, you may want to use forwarders instead of the default, which does recursive root-hint searches for any new zones. Choose one or two dependable DNS servers to forward requests through.
DNS Sizing
While your DNS server may do a staggering number of transactions per day, the protocol is sparse and not too demanding on the server hardware. In terms of raw capacity, most medium-sized companies can serve DNS reliably with a pair of modern uniprocessor workstation-class machines. Install one as the local master server and the other as a slave. DNS is very well documented and comes with excellent logging and statistics controls. Make your best effort at sizing based on peak usage, and periodically gather statistics to check your assumptions. To get a sample statistics dump, add statistics-file "/tmp/named.stats"; to the options section of /etc/named.conf, then run rndc stats (older BIND 8 servers dumped their statistics when sent a SIGILL). The current statistics are written to /tmp/named.stats. Keep these dumps over time, and watch for trends. You may be surprised at how much work the DNS server does!
Operating System Considerations
Red Hat Linux (like other distributions) includes a name service cache daemon, nscd, that brings some of the benefits of a caching name server. It may help to reduce network traffic when it's otherwise impossible to run a full caching name server, but its main use is speeding up NIS+ queries. Unfortunately, only A records are cached; there is no provision for caching MX records. nscd is part of the glibc 2.2.2 source.
In Linux, make sure /etc/nsswitch.conf lists only the name services that are actually in use. The default Red Hat file includes references to NIS+/NIS. Removing unused services simplifies the code path through glibc and, in theory, gives incremental improvements in name lookup speed.
For /etc/resolv.conf, more is not better. List no more than three servers: the default glibc behavior is a sequential search with timeouts and retries, so a slow or dead server early in the list delays every lookup, and glibc ignores any nameserver entries beyond the third (MAXNS) in any case.
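A back-of-the-envelope model shows why a long nameserver list hurts. This Python sketch assumes the simplest possible behavior, in which each listed server waits out the full timeout on every attempt before the next one is tried; glibc's actual retry timing is more involved, so read the numbers as a rough illustration only. The defaults used here, a 5-second timeout and 2 attempts, are the values resolv.conf(5) documents for the `timeout` and `attempts` options, and MAXNS is the documented limit of three nameservers.

```python
MAXNS = 3  # glibc uses at most three nameserver entries from resolv.conf


def worst_case_wait(listed_servers, timeout=5, attempts=2):
    """Seconds spent before giving up if every listed server is unreachable.

    Simplified model: each usable server waits out the full timeout on
    every attempt. Real glibc retry timing differs in detail.
    """
    usable = min(listed_servers, MAXNS)
    return usable * timeout * attempts


for n in range(1, 5):
    print(n, "server(s):", worst_case_wait(n), "seconds")
```

In this model, a fourth listed server adds nothing because glibc never uses it, while each unreachable server near the top of the list adds its timeout to lookups that must get past it.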
Sendmail Considerations
Sendmail's current default DNS behavior is probably reasonable for most installations. It simply follows the same rules the Linux OS follows for looking things up; that is, /etc/resolv.conf and /etc/nsswitch.conf control everything. If you have more exotic needs, take a look at "O ResolverOptions" in the Sendmail configuration documentation. The "O Timeout.resolver.*" options can also be used to tune resolver behavior.
References