A Smart World After All
Volume Number: 21 (2005)
Issue Number: 3
Column Tag: Programming
Source Hound
by Dean Shavit
A Smart World After All
Having a hard drive die on you can be a life-altering experience. I'll never forget the first time it happened to me. It was 1996, and I was sitting, drinking an espresso in a local coffee house, working on an essay for my Ph.D. comprehensive examinations in English, when my PowerBook 520c froze. I didn't think much of it at the time, as I was running System 7.5, which tended to freeze without warning, but when I rebooted, I found myself staring dumbly at the blinking question mark. Hours later, after running Norton Utilities, and seeing that my 160 meg hard drive was now recognized as an 80 meg hard drive, I realized that my essays and research were lost. Sure I had a two-week old backup, but I'd been in a frenzy, working hard to make a deadline, and hadn't thought to keep my backup current. I took it as a sign, and even though I'm occasionally called "Dr. Dean" when I show up on-site to troubleshoot an ailing Mac, there's now no other reason to call me Doctor.
At the time, I wasn't what one would call a "computer professional," though I did some consulting work in the area of Desktop Publishing, training on QuarkXpress and Illustrator, and managed to subsidize my desire to become a poet while suffering as a poor grad student. Though I was quite aware of certain signs of hard disk trouble, such as "sticktion," which would delay startup, or the "whine," which indicated bearings that were about to go, my trusty PowerBook 520c exhibited no signs of any disk problems, not one disk error, not one bad sector. Its hard drive died suddenly, without even a hoarse whisper. I'm really not waxing poetic; back in the mid 90s, the only real clues that a hard drive was dying (other than the occasional -36 disk error or bad sector turned up by a disk utility), were audible (a scratch, a ping, a scrape), so the hyperactive herky-jerky animation of Dr. Peter Norton rubbing his stethoscope over a hard disk platter in Norton Utilities 3 wasn't really that far from the reality of troubleshooting a failing disk.
It was rather amazing how accepting I was of the situation. I turned in my books and moved on to a career in Macintosh technology. I took responsibility for being a dummy and not backing up. I never once thought to ask, how come my Mac didn't alert me that there was a problem with my disk? In 1996, we didn't have such high expectations of our personal computers. We wanted them to work, to print, and sometimes to connect to a network. A cell phone or pager was a luxury, and email accounts were generally bundled with a job, or student enrollment at a University, or an AOL or CompuServe membership.
Besides my life-changing data loss, it seems that 1996 was a pivotal year for hard disks in general. In April of that year, a group of researchers from several drive manufacturers was busy hammering out version 2.0 of SFF-8035i, a standard for hard drive diagnostics which had been proposed the previous year, defining thirty attributes related to performance and reliability that hard disks should track internally. The standards they developed became known as the Self-Monitoring, Analysis and Reporting Technology systems, now referred to as SMART.
Disks Will Die
One of the biggest issues Veterinarians have is that the pets they treat don't have the ability to communicate specific symptoms of disease. Dogs and cats with organ failure generally don't complain about pain, but simply seem to slow down, not run as fast, or jump as high. No matter how perceptive the Vet, or how experienced, I've often heard them exclaim, "I wish the animals could tell me what was wrong." How strange would it be if, like an automobile, the pet had a diagnostic port that would reveal a score about how well their heart or kidneys, or intestines were functioning? Automobiles have such diagnostic ports, and some older (much much older) Macs used to have them as well.
Hard disks are mechanical devices, just like Zip disks, or floppy disks. They have motors. They have platters that rotate at speeds that would make the engines in most cars overheat and explode. They have little arms with magnetic heads that dance around picking up blocks of data. Frankly, it's an amazing testament to the engineering skills of hard disk manufacturers that they are as reliable as they are. With OS X, hard disks will get even more of a workout if there's not enough RAM installed in a machines, due to virtual memory page outs, leading to a premature demise of the startup disk.
Most end-users and far too many system administrators don't realize that current hard disks are keeping an internal log of their own performance, based on the SMART attributes codified and updated with the ATA-3, 4. and 5 standards. Modern SCSI disks also have SMART capabilities as well. Some of the attributes that SMART tracks are: (see table)
SMART attribute implementations still vary, somewhat, by manufacturer, but they have enough in common to, at the very least, give a pass/about to fail status report when queried. Starting with OS X 10.3, Apple's own Disk Utility includes a simple SMART test, which will display a one-line S.M.A.R.T. status report when you select a physical disk:
ID# | Attribute | Description
____|_______________|___________________________________________________________________________
1 | Raw Read | Count of non-corrected read errors. More errors (a smaller value)
| Error Rate | denotes a deteriorating disk surface.
____|_______________|___________________________________________________________________________
2 | Throughput | Throughput (I/O) performance of Hard Disk.
| Performance |
____|_______________|___________________________________________________________________________
3 | Spin Up Time | Spindle spin up average time (from parked to ready).
____|_______________|___________________________________________________________________________
4 | Start/Stop | Cycle count of spindle start and top.
| Count |
____|_______________|___________________________________________________________________________
5 | Reallocated | Count of reallocated sectors. A great indication of a failing disk.
| Sectors Count | Reallocated sectors are marked as "unusable," which is why modern hard
| | disks don't show bad sectors.
____|_______________|___________________________________________________________________________
6 | Seek Error | Count of seek errors. This indicates errors when the heads have a
| Rate | mechanical failure or when the heads positioned over a data block
| | cannot read it due to poor disk surface conditions.
____|_______________|___________________________________________________________________________
7 | Seek Time | If this attribute is lower, it indicates mechanical problems with the
| Performance | heads or disk surface.
____|_______________|___________________________________________________________________________
8 | Power-On | Total of many hours the drive has been powered up.
| Hours |
____|_______________|___________________________________________________________________________
9 | Internal | How hot the temperature sensors say the drive is. Can also be a great
| Temperature | indicator of how hot the inside or a computer is as well.
Figure 1.
The SMART test is also available from the Terminal in OS X, via the diskutil command, which is specific to OS X. First, you have to get the hard drive identifier:
In this case, the hard disk identifier is disk0, where as the volume identifier is disk0s3. Working with the hard disk identifier, we can then query diskutil to get info on the drive, which will include the SMART Status:
minime:~ dean$ diskutil info disk0|grep -i smart
SMART Status: Verified
Note that in piping to the grep command, I use the -i switch to turn off case sensitivity. Diskutil is very useful for shell scripts, such as the one I wrote below:
#!/bin/sh
## This script is designed to get the SMART status of a drive and send an email notice on fail
## to test, change the "good=1" below to "good=0" and you should receive a warning email
## Dean Shavit, MOST Training & Consulting dean@macworkshops.com http://www.macworkshops.com
## This is for OS X Machines with ATA drives only
##
## Step 1: Define a variable for a functional drive by counting the number of SMART Verified Disks
status=`diskutil info disk0|grep -ci verified`
## Step 2: Define a number for comparison against a failed drive
good=1
## Step 3: Define the warning message for the body of the email
warning="houston we have a problem!"
## Step 4: Define a variable for the computer name
box=`/usr/sbin/scutil --get ComputerName`
## Step 5: Define a variable - email address of person to notify
admin="dean@macworkshops.com"
## Step 6: compare current status with good status, if a match, echo, if not, notify
if [ $status == $good ]; then
echo $status
else echo From: $box- $warning!!! > /tmp/houston.txt| mail -s "SMART Alert Report"
$admin < /tmp/houston.txt
fi
done
This script, when run as a cron job, will check the hard disk periodically for SMART status and email you when the drive when the drive has a problem. There are two status codes that Disk Utility can report: "Verified" if none of the SMART attributes exceed their normal thresholds, or "About to Fail" which will appear in red letters, indicating that the drive will fail and should be replaced. Of course, "failed" isn't a status that the drive can report, but is a state! It is important to understand that even if the hard disk passes the SMART test, there's always the chance that the drive might suddenly expire, without ever issuing a warning. The SMART tests are diagnostic tests, and should never be used as a replacement for a good backup strategy. If anything, the advance warnings will help indicate when a drive needs to be replaced, so as to minimize any possible downtime of a server or workstation.
For those who don't want to script or want an easier path to SMART notification, there's a great donationware utility called SMARTReporter downloadable at http://homepage.mac.com/julianmayer that provides a nice Cocoa GUI for accomplishing the same scheduled diagnostic and warning email, without having to use the Terminal. For those consultants or admins who support off-site users, this can provide an invaluable early warning of a drive failure.
SMARTReporter Icon
SMARTReporter's a snap to install, just drag and drop it from the its disk image (.dmg) into your Applications or Utilities folder. When running, it can provide a SMART status indicator on your menu bar:
SMARTReporter Status Indicator
SMARTReporter's preferences allow an admin to change the interval of the check, define an icon set for the menu bar, set up email information, even launch another application if there's a hard disk failure, such as special Applescript Applet to throw up an alert to the user or give instructions on what to do. For email alerts, it can either use the mail information specified, or borrow the Apple Mail.app settings. If I had my druthers, Apple would include much the same functionality in future versions of Disk Utility.
SMARTReporter Preferences
Not So Smart
SMART Reporting in OS X does have its limits, however. SCSI and FireWire disks aren't supported, so if an admin is predisposed (like I am) use SCSI drives for internal RAID mirrors on servers other than Xserves, the diskutil command cannot get the SMART status of an SCSI disk:
host2:~ mostadmin$ diskutil list
/dev/disk1
#: type name size identifier
0: Apple_partition_scheme *17.0 GB disk1
1: Apple_partition_map 31.5 KB disk1s1
2: Apple_Driver_OpenFirmware 512.0 KB disk1s2
3: Apple_Boot_RAID 17.0 GB disk1s3
host2:~ mostadmin$ sudo diskutil info disk1
Device Node: /dev/disk1
Device Identifier: disk1
Mount Point:
Volume Name:
Partition Type: Apple_partition_scheme
Bootable: Not bootable
Media Type: Generic
Protocol: SCSI
Total Size: 17.0 GB
Free Space: 0.0 B
Read Only: No
Ejectable: No
OS 9 Drivers: Yes
Low Level Format: Not Supported
Note that the limitation for SMART reporting to ATA disks is something inherent in the way that OS X treats disks. FireWire disks, though, which always contain ATA or SATA drives running on Firewire bridge boards, don't report SMART status to diskutil either. This can be a major problem for Mac admins, who are increasingly relying on storage devices for second-tier file services or in some cases, low-cost backup devices. In the arena, all enclosures are definitely not created equal, as shown by Granite Digital's Firevue(TM) drive enclosures (http://www.granitedigital.com), which feature an LCD panel that displays the SMART status of the disk housed inside of it. This gives their enclosures quite an advantage over others, considering that OS X doesn't have the ability to get past the bridgeboard to read the SMART status of the ATA drive inside. Look for other enclosure makers to follow suit in the near future.
Smart, And Its Mirror, Trams
Ok, I'm not talking about Disney World here, and the super long trains of trams that shuttle folks back and forth from the Magic Kingdom to the parking lots. I am talking about RAID Mirrors on OS X, however. One of the problems many admins have complained about with OS X and OS X server, is the lack of notification that a RAID Level 1 (mirror) is going to go south. The upside is that if workstation or server in question does have a mirror, the other drive will continue to function until it's replaced, and the mirror's rebuilt. That's why the word SMART is mirrored above (ha).
Setting up a RAID Level 1 (mirror set) is easy in OS X. After booting off the installation CD for OS X or OS X Server, drag the disks you want to mirror into the RAID window in Disk Utility and choose mirror, then create the mirror set. The result will be a RAID disk that appears as two stacked hard drives in drive and volume list on the left hand side.
Mirror Raid Level 1 in Disk Utility
Of course, diskutil is available for those who want to create, repair, check, or destroy a mirror set from the terminal. For example, even though SMART status might not be available for SCSI disks, it's easy to script notification for a SCSI mirror, and any RAID 1 mirror for that matter, all based on output from diskutil.
So what I'm looking for in the script below is simply an OK from both disks that are members or "slices" of the RAID
host2:~ mostadmin$ diskutil checkRAID disk3
RAID SETS
---------
Name: boot
Unique ID: bootf5b49d82471e11d98b06003065be09be
Type: Mirror
Status: Running
Device Node: disk3
----------------------------------------------------------
# Device Node Status
----------------------------------------------------------
0 disk1 OK
1 disk2 OK
----------------------------------------------------------
Mirror set, note how similar it is to the script that checks the SMART status of a single hard drive:
#!/bin/sh
## This script is designed to get the status of a mirror set and send an email notice on fail
## to test, change the "good=2" below to "good=1" and you should receive a warning email
## Dean Shavit, MOST Training & Consulting dean@macworkshops.com http://www.macworkshops.com
## This is for servers using software RAID mirror sets only
##
## Step 1: Define a variable for a functional raid by counting the number of good disks
status=`diskutil checkraid disk3|grep -c OK`
## Step 2: Define a number for comparison against a failed raid
good=2
## Step 3: Define the warning message for the body of the email
warning="houston we have a problem!"
## Step 4: Define a variable for the computer name
box=`/usr/sbin/scutil --get ComputerName`
## Step 5: Define a variable - email address of person to notify
admin="dean@macworkshops.com"
## Step 6: compare current status with good status, if a match, echo, if not, notify
if [ $status == $good ]; then
echo $status
else echo From: $box- $warning!!! > /tmp/houston.txt| mail -s "RAID Alert Report"
$admin < /tmp/houston.txt
fi
done
So, with both the SMART script running as a cron job, along with the RAID script, it's easy to get an advance warning of an impeding single disk failure, or when a RAID Level 1 has a drive fail or go out of sync. It's tougher with or SCSI drives, but if they're used in a RAID Level 1 configuration, the script above can provide the same advance warning that it can for SCSI disks under OS X or OS X Server.
Maxwell Smart
Recently, I was onsite for a couple of months helping a customer upgrade two hundred Macs to OS X. Many of the machines running OS 9 showed signs of hard drive failure, such as frequent boot failures, corrupt directories, slow time to spin up and metallic whining sounds. One day, a department manager's hard drive failed, taking several years of email with it. We sent the drive into DriveSavers (http://www.drivesavers.com) for hardware data recovery, but no dice, the mechanism was too far gone, too much damage done to recover what was needed. It caused quite a stir in the IT department, and a discussion about early warning of hard drive failures. An interesting fact about this customer was that they had several UNIX/LINUX experts on staff who were very interested in learning more about OS X, but interestingly enough, tended to treat an OS X workstation as they would a Linux server. They'd try this command or that command and comment that OS X was very different. I shared my SMART alert script with them, and they found SMARTReporter on Versiontracker. However, they wanted to be able to have a constant real-time log, on each Mac OS X box, of SMART status and more sophisticated scripts that would report back attribute information such as temperature or reallocated (bad) sectors. It was time to go hunting for open-source tools.
The first stop was at the Maxwell web site, http://maxwell.sourceforge.net. Maxwell's a tiny command-line program that queries a hard drive for SMART information and can either report a pass/about to fail condition, just like Disk Utility, or a more comprehensive report of SMART attributes. The source code for Maxwell is less than 100k in size, and is such a simple program, compiling it doesn't require configuration first.
1. To install Maxwell, download the source code from SourceForge, then unpack the tarball into a folder on the Desktop. Open a Terminal window, then navigate to that folder:
2. Issue the following commands as in the example below:
[minime:~/Desktop/maxwell-0.5.1] dean% sudo make install
cc -c -o maxwell.o maxwell.c
cc -o maxwell -framework IOKit -framework CoreFoundation maxwell.o
/usr/bin/install -d -m 755 /usr/local/doc/maxwell
/usr/bin/install -m 644 LICENSE /usr/local/doc/maxwell/LICENSE
/usr/bin/install -m 644 README /usr/local/doc/maxwell/README
/usr/bin/install -d -m 755 /usr/local/bin
/usr/bin/install -m 755 maxwell /usr/local/bin/maxwell
/usr/bin/install -d -m 755 /usr/local/man/man8
/usr/bin/install -m 644 maxwell.8 /usr/local/man/man8/maxwell.8
[minime:~/Desktop/maxwell-0.5.1] dean%
Once installed Maxwell can be run to get a simple pass/fail status by simply invoking it at the command line:
[minime:~] dean% maxwell
Device: TOSHIBA MK8026GAX Reported PASS status
Or, for a more complete SMART report, it can be run with the -r switch:
[minime:~] dean% maxwell -r
BSD Path:/dev/disk0
Serial: 54IK0355T
Model: TOSHIBA MK8026GAX
Firmware: PA002B
Device supports S.M.A.R.T. operations
SMART self-test supported
SMART error logging supported
S.M.A.R.T. operations are enabled
SMART self-test enabled
SMART error logging enabled
Off line collection status is 0
Time to complete Off line Data collection: 3 hours, 20 minutes
Status is GOOD
TEST THRSH VALUE STATUS RAW
--------------------------------------------------------------------------------------
( 1) Raw Read Error Rate 50 100 0x0b00 0
( 2) Throughput Performance 50 100 0x0500 0
( 3) Spin Up Time 1 100 0x2700 1417
( 4) Start/Stop Count 0 100 0x3200 430
( 5) Reallocated Sector Count 50 100 0x3300 0
( 7) Seek Error Rate 50 100 0x0b00 0
( 8) Seek Time Performance 50 100 0x0500 0
( 9) Power-On Hours Count ** 0 94 0x3200 2540
( 10) Spin Retry Count 30 108 0x3300 0
( 12) Device Power Cycle Count 0 100 0x3200 335
(192) Power Off Retract Count 0 100 0x3200 2
(193) Load/Unload Cycle Count 0 78 0x3200 228634
(194) Device Temperature 0 100 0x2200 223339413546
(196) Reallocation Event Count 0 100 0x3200 0
(197) Current Pending Sector Count 0 100 0x3200 0
(198) Off-Line Scan Uncorrectable Sector Count 0 100 0x3000 0
(199) Ultra DMA CRC Error Count 0 200 0x3200 0
(220) Unknown 0 100 0x0200 184
(222) Unknown 0 98 0x3200 977
(223) Unknown 0 100 0x3200 0
(224) Unknown 0 100 0x2200 0
(226) Unknown 0 100 0x2600 241
(240) Unknown 1 100 0x0100 0
--------------------------------------------------------------------------------------
Device temperature is 30 degrees centigrade
--------------------------------------------------------------------------------------
The names listed above may not actually be correct for each test value. The real values
printed may make no sense whatever and the temperature may be some crazy value.
Different device manufacturers do things differently :(.
--------------------------------------------------------------------------------------
Device: TOSHIBA MK8026GAX Reported PASS status
With this much information at hand, it's quite a simple matter to filter this output so that, say, the Reallocated Sector Count test had a RAW value other than zero, then it could trigger an alert email, or if the temperature reaches a certain level, or the spin retry count is greater than zero. The man page for Maxwell also suggests that it was designed precisely for scripting purposes and for integration with cron. While Maxwell is certainly a notch above the simplistic pass/fail test of diskutil, it isn't exactly the comprehensive, enterprise-wide type of solution that my customer was looking for, either. They wanted something more integrated into the operating system, a tool that would keep a log, and send that log back to a central syslog server, where trends in hard disk wear and tear could be monitored over a period of time.
It's A Smart World
"It's a world of laughter, world of cheer, it's a world of hope, and a world of fear." Sound familiar? I can almost hear the little mechanical people singing! Our company currently keeps a few servers at Equinix, a state-of-the-art colocation facility near downtown Chicago. With giant steel doors, handprint readers, bag checks, and multiple security stations, it's not hard to imagine at times you're entering the back room of a Disney World exhibit filled with trade secrets. It's an interesting side note, that over the years, I've seen more and more Xserves and Xserve RAIDs populating the cabinets at Equinix. I'm sure many a pager would go off if one or more of those Xserve RAIDs had a drive failure, or if one of the internal disks inside an Xserve died, triggering an alert by the Server Monitor Application.
But now the Mac Mini, Apple's little darling unveiled at MacWorld, seems to have caught on as a the "server for the rest of us." There's even a hosting service for Mac Minis, called "Macminicolo," http://www.macminicolo.com, that allows Mac owners to "park" their Minis in a cabinet for as little as $29.95 per month. While a colocated Xserve with a dedicated 1U space in a rack at a facility goes for the street price of about $150 per month, including one megabit of bandwidth (200 gb of transfer per month), there's absolutely no doubt that the proprietor of Macminicolo can certain cram a whole lot more Mac Minis into the same cubic space an Xserve would occupy.
Rack of Xserves, Rack of Minis
On a side note, the success of (and need for) of such a colocation facility where "everyone can have their own server," shines a glaring spotlight on one of the great weaknesses of Apple's OS X Server software: the lack of virtualization. In the Linux world (on IBM and Red Hat and Sun solutions), virtualization of admin tools and even the actual processors are a standard way of allowing companies to "rent" the resources of the server, without having to actually "own the box." Whether it's only being able to see your users when you log into a virtualized server, or your file system, or on higher end solutions, your virtual server running on a Logical Partition of the server's resources (LPAR), OS X Server's Admin Tools have absolutely no virtualization capabilities as of this writing. It's going to be quite a while before OS X would allow me to log into an eight or sixteen processor Xserve and administer my LPAR, which would appear a single, dual, or four processor server, with my own environment, file system, list of users and server daemons. Today, Apple's solutions seem to focus on miniaturization, not virtualization.
Asides from the obvious management dilemmas that would face the sysadmins of MacMinicolo, where Mac Minis are smushed into a rack three across and three deep with their power supplies and cooling fans. It seems that the system administrators at Macminicolo have found a cool way to monitor their farm of Mac Minis:
InsideOut monitoring was developed to monitor mac a [sic] based server from the inside out. A headless application that is placed in the startup folder of any OSX machine contacts a master logging server. We monitor factors such as network availability, Hard Drive space, network traffic moved, temperature and the up/down status of specified applications. Customers subscribing to this service can choose to receive email notification if limits or events trigger warnings.
This sounds to me very much like the capabilities of the open-source monitoring software Nagios (http://www.nagios.org), which can log all of the above events and present reports, graphs, and network maps in a web page. Although temperature, application stats, network activity and free hard drive space are all good things to monitor, I would have to make the suggestion that in a situation with multiple Mini Macs with laptop hard drives stuffed like muffins into a rack, SMART status should be the number one indicator monitored, because should a hard drive fail, there'll not only be an unhappy customer who will have lost all of their data if they don't have an offsite backup, but there'll also be an unhappy technician scrambling to pull out a single Mini and replace the hard drive as quickly as possible. If a SMART monitor could give advance warning of a failure, that would make the whole process much easier on everyone.
So, in addition to an enterprise or medium-sized business needing to monitor the SMART status on lots of hard drives in client workstations, it seems that Macminicolo would be another good example of where it would work quite well. However, the solutions I've looked at so far, like diskutil and the script I wrote, SMARTReporter, and Maxwell, are all missing an important capability, none of them have the ability to write a log of SMART events either on the local machine, or to a syslog server on the network, which of course, was the same thing my customer wanted for their Macs.
The Smart Test Of All
Much of the really great open-source software for OS X seems to originate from the Linux community, and such is the case with the smartmontools, http://smartmontools.sourceforge.net, which by all accounts is the top tier solution for SMART monitoring and logging in a non-commercial distribution. Bruce Allen, maintainer of the smartmontools project, explains the origins of the package in an article that appeared in the Linux Journal:
By profession I am a physicist. My research group runs a large computing cluster with 300 nodes and 600 disk drives, on which more than 50TB of physics data are stored. I became interested in SMART several years ago when I realized it could help reduce downtime and keep our cluster operating more reliably. For about a year I have been maintaining an open-source package called smartmontools. . .
In July 2004, Geoff Keating ported the smartmontools to OS X for ATA drive support only, (though the Linux version includes support for ATA/ATAPI-3 to -7 disks and SCSI disk and tape devices) bringing a sophisticated and flexible SMART monitoring suite to OS X for the first time since Granite Digital discontinued their commercial SMARTVue(TM) software nearly two years ago. Smartmontools consists of two main components, a command, smartctl, and a daemon, smartd. While smartctl functions like an improved version of Maxwell with various switches to control the verbosity and depth of the reports and self-tests, it's smartd that really takes the monitoring to a real-time level, where any changes in SMART status, not just errors, are logged to a specified syslog facility.
To install the smartmontools from the source code, make sure you have the latest Xcode tools (Apple Developer Tools) installed, then download the source tarball from http://smartmontools.sourceforge.net, and expand it to a folder on your Desktop, there's many download packages available, but the one we're looking for is called: smartmontools-5.32.tar.gz. Open the Terminal, navigate inside the folder and do the following:
minime:~/Desktop/smartmontools-5.32 dean$ ./configure
You'll see a long list of checking for (various components). When the configure process is finished, type the following:
minime:~/Desktop/smartmontools-5.32 dean$ sudo make install
After a page or two of build feedback, smartmontools is now installed on your system. To start smartd, type the following command:
minime:~/dean$ /usr/local/etc/rc.d/init.d/smartd start
To run smartd automatically on startup, type the following:
minime:~/dean$ sudo ditto /usr/local/etc/rc.d/init.d/ /Library/StartupItems/SMART/
Now you'll need to add a line to the end of your /etc/hostconfig file that reads:
SMART:=-YES-
That will allow the SMART StartupItem to run every time your Mac boots, and you'll also see the reassuring message "Starting SMART disk monitoring" as the daemon loads before the Login Window appears. Now that smartd is running as part of the OS X startup process, the next step's to configure logging:
minime:~/dean$ sudo touch /var/log/smartd.log
This creates an empty logfile to contain the smartd output. The next step is add the line below to the /etc/syslogd.conf file so that smartd knows where to write that output, in this case using the free local3 log facility:
local3.* /var/log/smartd.log
or, if you'd like to send the log to a remote syslog server, then
local3.* @my.syslogserver.domain
and, if you'd like to make sure that the smartd.log file is rotated weekly, modify the /etc/periodic/weekly/500.weekly file, so that the following loop statement has smartd.log in the list of rotated log files.
for i in ftp.log lookupd.log lpr.log mail.log netinfo.log hwmond.log ipfw.log smartd.log; do
Test it with:
minime:/etc/periodic/weekly dean$ sudo 500.weekly
You should see the following line appear:
Rotating log files: ftp.log lookupd.log lpr.log mail.log netinfo.log ipfw.log smartd.log
Lastly, you'll need to modify the /usr/local/bin/etc/smartd.conf file to specify which drives you'd like to monitor, and which tests you'd like to run, as well as an admin notification email address. Remember, in OS X the startup disk is almost always disk0, if the computer has a single physical drive, when the /dev/hda placeholders in the smartd.conf file are meant for other UNIX systems, not OS X.
First, open up the /usr/local/etc/smartd.conf file in your favorite editor. Luckily, the smartd.conf file has an excellent synopsis of what each command and directive accomplishes, so it's pretty self-explanatory:
minime dean$ sudo pico /usr/local/etc/smartd.conf
Find the line that reads:
DEVICESCAN
And edit it so that it reads
#DEVICESCAN
This will tell the configuration file to explicitly use the device you specify, rather than the one it finds on its own.
Find the line that reads:
#/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)
And edit it so that it reads
/dev/disk0 -a -o on -S on -s (S/../.././02|L/../../6/03)
This command will perform short and long self-tests and report full SMART status results to the smartd.log file on a schedule.
Find the line that reads:
#/dev/hdc -H -m admin@example.com
And edit it so that it reads
/dev/disk0 -H -m dean@macworkshops.com
This test will notify you by email if the simple health check (pass/about to fail) shows that the disk is in trouble. Now, restart the smartd process with the following command so that smartd will re-read its configuration file:
minime:~/dean$ sudo killall -HUP smartd
Now the smartd.log is visible in the Console application located in the /Applications/utilities folder, and we can read the events we just kicked off. Now your Mac has a real-time log of SMART events and a notification that is emailed immediately when the failure occurs, rather than waiting for cron to do its thing.
With log analysis tools such as Oak (http://www.ktools.org/oak.html), smartd logs can be analyzed to project trends of failure that might occur in waves, due to defective hard drives. Remember the Quantum Bigfoot 5 _ inch hard drives from the late 90s? They never were shipped in a Mac, but at the time Compaq put them in tens of thousands of their Deskpro workstations, and nearly every single disk drive had to be replaced. A tool like smartd can help prepare companies for a run of defective hard drives that might die, en masse. Other features of smartd include a database of hard drive SMART attributes (not nearly complete, but you can submit a device report from your hard drive for submission to the database by using smartctl -a to and emailing it). The other "half" of the smartmon tools is the smartctl command, which can be queried through cron scripts or directly from the command line, much like Maxwell.
Console
Everybody Now . . .
I realize that downloading, compiling, and configuring the smartmontools isn't for everyone, but it certainly is a rewarding journey for those administrators who want piece of mind and a logfile, at least until the storm hits. . .but even for consultants who support home users the smartmontools or even a simple script or SMARTReporter can be a real-life saver. Wouldn't it be nice to call a customer and say, "Hey, I'm going to come over with a brand-new hard disk, backup your data, and replace your drive, because it's going to fail . . ." Or better yet, "Backup your data, now, before it's too late!" Or even, "It seems like those Hitachi hard drives are dropping sectors like crazy, maybe we'd better turn up the air conditioners." So, for those who'd like to get started quickly with the smartmontools, I've whipped up a Installer package that puts the smartmontools on your Mac, as well as the requisite StartupItem. However, you'd still need to configure the /etc/hostconfig, /etc/syslog.conf, /usr/local/etc/smartd.conf files and create a smartd.log file yourself. The installer is available along with the scripts in this article at http://www.themachelpdesk.com. Now, I'd like everybody to join me ". . . There's so much that we share, that it's time we're aware, it's a SMART world after all. . ."
In Next Month's "Source Hound"
Steve Jobs, in his keynote address at MacWorld, repeated the following mantra over and over "H.264, H.264, H.264." Sure, H.264 support is with QuickTime 7, but if you want it now, it's available today, not from Apple, but from the deepest trenches of the open-source world, the realm of the Coneheads and the video rogues, where encoders are encoders and QuickTime is QuickTime, and never the twain shall meet, or will they?
Dean Shavit is an ACSA (Apple Certified System Administrator) who loves to use a Mac, but
hates paying for software. Since he's not into breaking the law, his most common response to any cool solution
is: "Does that cost money?" If it does, you can bet he's on the hunt for an Open-Source or freeware
alternative. Besides surfing for hours, following the scent of great source code, he's a partner at MOST
Training & Consulting in Chicago, where he trains system administrators in OS X and OS X Server, conducts
large-scale Mac Deployment and Upgrade projects, and writes for his own website,
http://www.themachelpdesk.com. If you have questions or comments you can contact him: dean@macworkshops.com.