OS X Investigation and Troubleshooting - Part 3
Volume Number: 22 (2006)
Issue Number: 7
Column Tag: Mac In The Shell
OS X Investigation and Troubleshooting - Part 3
The secrets to OS X success
by Edward Marczak
Introduction
Shovels, spades, backhoes, and even your hands: ways to dig. Just like our analog world, OS X provides us many tools with which to dig in. The first two parts of this series introduced several utilities, like tripwire, find and lsbom to check and monitor changes on a file system. We also looked at programs that monitor processes such as top and Spin Control. This month, we'll look at ways to check out specific binaries - particularly ones that are poorly documented. This will take a combination of looking at their images on the file system, and looking at their processes while it's running.
Slinging the Strings
Almost all programs have some static text stored as strings. Fact of the matter is, though, that C, C++, compiled AppleScript, and (especially) Objective C have a massive amount of descriptive text ripe for the plucking. What more aptly-named utility to use than strings.
strings looks for ASCII data in its input, and will output, by default, any ASCII strings that are at least 4 characters. This takes very little explanation once you see it in action. Recently, I had to troubleshoot OS X Server's PasswordService. While there is a man page, it doesn't quite tell you everything. To the shell I go! Running strings on PasswordService certainly yielded a bit of info:
strings /usr/sbin/PasswordService | less
__dyld_mod_term_funcs
__dyld_make_delayed_module_initializer_calls
__dyld_image_count
__dyld_get_image_name
__dyld_get_image_header
__dyld_NSLookupSymbolInImage
__dyld_NSAddressOfSymbol
libobjc
__objcInit
The kernel support for the dynamic linker is not present to run this program.
RunAppThread Created
RunAppThread Deleted
RunAppThread Started
...
(665 lines snipped for brevity)
Paging through the output yielded a few golden nuggets, including this line:
/Library/Preferences/com.apple.passwordserver.plist
Hey! There's no mention of a plist in the man page! Another interesting line jumped out:
External password command rejected because it must be in /usr/sbin/authserver/tools
Looking in /usr/sbin/authserver/tools reveals one program: weakpass. No man page, no real help, just a brief usage statement. I ran it and got this:
./weakpass marczak admin
approved
A quick search lead me to weakpass_edit which does have a man page, and explains what the program does a bit better. Useful for future reference.
Running strings on this one binary yielded a good bit of info. However, be aware that you're not looking at the actual source of a program, so, some of your detective work is simply inference. You may be looking at strings from a section of code that isn't really implemented yet. Frankly, you don't always know for sure precisely what you're looking at.
Sometimes you find funny comments (I'll leave you to look at strings /usr/sbin/AppleFileServer yourself), and sometimes a developer reveals much more than they had hoped. I once was able to map out an entire website and dig into parts of it that the webmaster didn't know he was sharing - all thanks to info derived from running strings on a binary. More importantly, I find strings useful for finding anything hardcoded - you'll find this a surprising amount, instead of an app reading from a preference file. Or, you'll find the hardcoded debug flags that it may be looking for. Back to OS X Server for one more example.
One day, while doing a little battle with OS X Directory Services, I remembered that there is a way to throw DirectoryService into debug mode. While the kill -USR1 trick is documented, that only debugs after one has logged in. Unfortunately, plenty happens while the machine is booting, before logging in is possible. I knew I once heard of a way to get DirectoryService to debug everything...strings to the rescue. A simple:
strings /usr/sbin/DirectoryService
yielded this deep into the output:
/Library/Preferences/DirectoryService/.DSLogDebugAtStart
...
/Library/Preferences/DirectoryService/.DSLogAPIAtStart
Enough to jog my memory! A simple 'touch /Library/Preferences/DirectoryService/.DSLogDebugAtStart' starts filling /Library/Logs/DirectoryService /DirectoryService.debug.log - right from the word go! Very handy.
So, I wouldn't give up strings too easily. But when using it, you have to remember to put the output into some sort of context, and certainly don't use the output to justify anything you can't prove empirically by running the program you're inspecting.
Watching While Running
Often, you'll find yourself wondering, "just what is that app doing right now?" Up until now, this series hasn't really focused on watching code run, outside of watching the resources it's taking. Sometimes, though, the only way to really troubleshoot an app is to watch it self-destruct. Of course, there are utilities that aid in doing just that.
Like top for running processes, sc_usage shows system calls and page statistics as a program makes calls and pages in and out in a top-like style. You must be root to run sc_usage, which takes a pid, process name, or name of an app to execute by using the -E switch. I prefer to use a process id as it's a little more specific. Let's take a look at Apache running locally on my machine as it serves a web page:
# sc_usage 134
httpd 0 preemptions 0 context switches 1 thread 07:11:38
0 faults 0 system calls 0:01:11
TYPE NUMBER CPU_TIME WAIT_TIME
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
System Idle 0:21.670( 0:00.683)
System Busy 0:14.614( 0:00.326)
httpd Usermode 0:00.179
zero_fill 1215 0:00.024 0:00.001
pagein 143 0:00.018 0:01.046
copy_on_write 137 0:00.007 0:00.059
cache_hit 314 0:00.003 0:00.000
read 278 0:00.019 0:19.523
accept 1 0:00.000 0:13.716( 0:01.010) W
stat 286 0:00.038 0:00.494
open 197 0:00.012 0:00.491
close 195 0:00.020 0:00.176
mach_msg_trap 411 0:00.003 0:00.164
access 26 0:00.010 0:00.115
statfs 11 0:00.009 0:00.083
poll 2 0:00.000 0:00.075
getdirentries 141 0:00.130 0:00.063
chdir 2 0:00.000 0:00.004
getattrlist 261 0:00.012 0:00.001
lstat 189 0:00.009 0:00.000
vm_allocate 215 0:00.003 0:00.000
CURRENT_TYPE LAST_PATHNAME_WAITED_FOR CUR_WAIT_TIME THRD# PRI
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
accept //%239/%239105369/footer.tpl.php 4:50.062 0 31
The first block of information shows the process name and some basic statistics, counting from the time you started sc_usage. In the next block, there are rows segregated by type, with the following headings:
Type - The type of system call or, in the last group, the actual call used.
Number - The number of times this call has been made. A number is parenthesis represents the delta between samples.
CPU_Time - The amount of CPU time this process has used.
Wait_time - the amount of time this process has spent waiting.
The final block of information is only displayed once the program starts to make calls, and is displayed from the bottom of the terminal up. The columns in this group are:
Current_type - the system call currently being made
Last_pathname_waited_for - The last pathname referenced by a system call that blocked.
Cur_wait_time - Time a thread has been blocked.
Thrd# - the thread number in question
Pri - the scheduling priority.
The overwhelming majority of the time, when you run sc_usage, you'll have 'mach_msg_trap' in the system call list. Popular system call? Not quite; this shows that a call is blocked and waiting for something to happen.
Watching the Files
Sometimes, troubleshooting - or just your curiosity - leads to a more specific query: which files does this application touch? Either read, create or write; what's it accessing? sc_usage might only give you half of the picture. Enter fs_usage. As you may have guessed, this utility will display file system usage. fs_usage will run with no switches, displaying all file system calls. But, we can show more finesse than that. You can also pass in a process id or process name that will constrain the output to that process only - very wise, as fs_usage generates a huge amount of output. If you use a process name, fs_usage will match all occurrences of that name. Let's watch it in action against the same httpd process from earlier:
# fs_usage 134
20:58:33.660 PAGE_IN A=0x00015000 B=0x2000 0.020402 W httpd
20:58:33.897 PAGE_IN A=0xa0002000 B=0x1000 0.236411 W httpd
20:58:33.991 PAGE_IN A=0x00019000 B=0x1000 0.001612 W httpd
20:58:33.991 PAGE_IN A=0x00014000 B=0x0 0.000008 httpd
20:58:33.991 read F=4 B=0x114 0.000030 httpd
20:58:33.992 PAGE_IN A=0x0004c000 B=0x1000 0.000278 W httpd
20:58:33.992 PAGE_IN A=0x0001a000 B=0x1000 0.000167 W httpd
20:58:33.996 PAGE_IN A=0x0002f000 B=0x1000 0.000322 W httpd
20:58:33.997 PAGE_IN A=0x00017000 B=0x3000 0.000315 W httpd
20:58:33.997 PAGE_IN A=0x00016000 B=0x0 0.000007 httpd
20:58:33.093 stat /www/wheresspot/html 0.095943 W httpd
20:58:33.093 open [ 2] /www/wheresspot/html/.htaccess 0.000227 W httpd
20:58:33.093 statfs 0.000018 httpd
20:58:33.094 statfs /Library/WebServer/Documents 0.000115 httpd
20:58:33.094 statfs /Library/WebServer/CGI-Executables 0.000063 httpd
20:58:33.094 statfs /Users/bdd/Sites 0.000028 httpd
20:58:33.094 statfs /Users/test/Sites 0.000009 httpd
20:58:33.094 statfs /Users/marczak/Sites 0.000009 httpd
20:58:33.094 statfs /Users/supervisor/Sites 0.000009 httpd
20:58:33.094 statfs /www/wheresspot/html 0.000265 W httpd
20:58:33.120 statfs /usr/share/httpd/icons 0.025489 W httpd
20:58:33.161 statfs [ 2] .*\.\.namedfork 0.000035 httpd
20:58:33.186 PAGE_IN A=0x000a2000 B=0x2000 0.024381 W httpd
20:58:33.198 PAGE_IN A=0x0020a000 B=0x1000 0.011685 W httpd
20:58:33.214 stat /www/wheresspot/html/index.html 0.015356 W httpd
20:58:33.214 PAGE_IN A=0x0002d000 B=0x2000 0.000318 W httpd
20:58:33.214 PAGE_IN A=0x0002e000 B=0x0 0.000012 httpd
20:58:33.215 PAGE_IN A=0x00018000 B=0x0 0.000014 httpd
20:58:33.215 stat /www/wheresspot/html/index.html 0.000119 httpd
That's some great output! What's it telling us? The first column is the time that the event occurred. The second column is the actual call or page event. The next column represents the data of the call. On a page fault, the data shows the address, represented by "A=" and, the number of bytes read or written, shown by "B=". A read call is similar, showing the file descriptor read from in "F=", and the number of bytes read. The next column shows the time spent on the call. A "W" next to the time means that the process was scheduled out during this time, and the value includes wait time (like waiting for some 'long' I/O process). The final column shows the process name; useful when you're monitoring multiple processes.
In this example, I particularly like the section where we can really see Apache listening to the default Apple configuration (about 16 lines in starting with the 'statfs' block) - it walks each of the user's "Sites" folders. Of course, there was a lot more output from fs_usage, but we could only look at a small portion here. Also, note that errors are displayed in square brackets. For example, Apache needs to look for an htaccess file, but not every directory will have or need one. Because of this, you'll see the open call return with an error [2] on line 12 in the output.
Naturally, you need to be root to allow fs_usage to dig in where it needs to. I should also point out that fs_usage is sensitive to the size of your terminal. The more columns you have, the more it'll display (to a point). If you always want to enable wide output, regardless of your actual terminal width, use the '-w' flag. Also, note that you can exclude processes with the "-e" switch. By itself, the "-e" switch will exclude the fs_usage program itself.
Watching the Watcher
As handy as sc_usage and fs_usage are, sometimes you just need more insight. When you need to reach into the nuclear arsenal, the first thing you should pull out is ktrace. If you've never used ktrace, please read this entire section before trying on a production machine!
ktrace is the kernel trace facility, and allows kernel tracing of a given process. Unlike the previous utilities demonstrated, ktrace logs its output to a file rather than the screen. By default, this output goes to a file named 'ktrace.out'. Also, if you think fs_usage is verbose, ktrace is even more so. Due to this, ktrace writes its output in a more efficient binary format. A separate utilty, kdump, will read a ktrace file - again, by default, ktrace.out - and display it in "human readable" format. Like most things, once you see it in action, you'll get it immediately. Back to httpd:
# ktrace -p 134
...and that's it. ktrace is silently logging in the background. Once again: ktrace can generate huge amounts of output. If you run this on a production machine, you can fill a disk very quickly, depending on the process you're tracing. Since ktrace launches a kernel process, you won't see any evidence of ktrace running in a 'ps' listing, so you can easily forget about it. Take care with this one! Back to usage...While this is tracing, I'll go load my web page. Then, we need to stop ktrace. While tracing can be stopped on a particular process, it's very rare that I'm tracing more than one process at a time, or that I wouldn't want to stop them all at the same time, so I go for the panic button: ktrace -C (note the capital 'C'). The '-C' switch stops all traces belonging to the current user. Unlike other utilities of its type, you do not have to be root to use ktrace, however, it certainly helps. If you own a process, though, you can ktrace it. For completeness sake, you cancel a specific trace by using the '-p' switch again: ktrace -p 134. Now, to look at our output:
# kdump | less
134 httpd RET accept 4
134 httpd CALL sigaction(0x1e,0xbffff988,0xbffffa00)
134 httpd RET sigaction 0
134 httpd CALL fcntl(0x4,0x2,0x1)
134 httpd RET fcntl 0
134 httpd CALL getsockname(0x4,0xbffffa60,0xbffffa80)
134 httpd RET getsockname 0
134 httpd CALL setsockopt(0x4,0x6,0x1,0xbffffa00,0x4)
134 httpd RET setsockopt 0
134 httpd CALL read(0x4,0x180ee90,0x1000)
134 httpd GIO fd 4 read 320 bytes
"GET / HTTP/1.1\r
Accept: */*\r
Accept-Language: en\r
Accept-Encoding: gzip, deflate\r
Cookie: PHPSESSID=43edcdc80eaadb4a7ed7778acebbd8d6; AWSUSER_ID=awsuser_id1137205739363r9609\r
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko)
Safari/417.9.2\r
Connection: keep-alive\r
Host: ws\r
\r
"
134 httpd RET read 320/0x140
134 httpd CALL sigaction(0x1e,0xbfffd8a8,0xbfffd920)
134 httpd RET sigaction 0
134 httpd CALL stat(0x188eeb0,0x188d500)
134 httpd NAMI "/www/wheresspot/html"
134 httpd RET stat 0
134 httpd CALL open(0x188f0f0,0,0x1b6)
134 httpd NAMI "/www/wheresspot/html/.htaccess"
134 httpd RET open -1 errno 2 No such file or directory
134 httpd CALL sigaction(0xe,0,0xbffff830)
134 httpd RET sigaction 0
134 httpd CALL stat(0x188fba8,0x188f700)
134 httpd NAMI "/www/wheresspot/html/index.html"
134 httpd RET stat 0
134 httpd CALL stat(0x188fa00,0x188f260)
134 httpd NAMI "/www/wheresspot/html/index.html"
134 httpd RET stat 0
134 httpd CALL sigprocmask(0x1,0,0x13d9aa4)
134 httpd RET sigprocmask 0
134 httpd CALL umask(0x3f)
134 httpd RET umask 18/0x12
134 httpd CALL umask(0x12)
134 httpd RET umask 63/0x3f
134 httpd CALL sigprocmask(0x1,0,0x13d9aa4)
134 httpd RET sigprocmask 0
134 httpd CALL setitimer(0x2,0xbfffe7d0,0)
134 httpd RET setitimer 0
134 httpd CALL sigaction(0x1b,0xbfffe6f8,0xbfffe764)
134 httpd RET sigaction 0
134 httpd CALL sigprocmask(0x2,0xbfffe7e0,0)
134 httpd RET sigprocmask 0
134 httpd CALL sigprocmask(0x1,0,0x13d9aa4)
134 httpd RET sigprocmask 0
134 httpd CALL stat(0x901a0ad0,0xbfffe2f8)
134 httpd NAMI "/"
134 httpd RET stat 0
134 httpd CALL lstat(0x1886400,0xbfffe2f8)
134 httpd NAMI "."
134 httpd RET lstat 0
134 httpd CALL chdir(0xbfffe360)
134 httpd NAMI "/www/wheresspot/html"
134 httpd RET chdir 0
134 httpd CALL getattrlist(0xbfffe730,0xa0000494,0xbfffdf34,0x41c,0x1)
134 httpd NAMI "/www"
134 httpd RET getattrlist 0
134 httpd CALL getattrlist(0xbfffe730,0xa0000494,0xbfffdf34,0x41c,0x1)
134 httpd NAMI "/www/wheresspot"
134 httpd RET getattrlist 0
134 httpd CALL getattrlist(0xbfffe730,0xa0000494,0xbfffdf34,0x41c,0x1)
134 httpd NAMI "/www/WheresSpot/html"
134 httpd RET getattrlist 0
134 httpd CALL getattrlist(0xbfffe730,0xa0000494,0xbfffdf34,0x41c,0x1)
134 httpd NAMI "/www/WheresSpot/html/index.html"
This was just a tiny piece of the output! If you compare this ktrace data with the sc_usage display, you'll see many of the same calls, as one would expect. Also, comparing the ktrace output with the fs_usage data shows the same files being opened, and the same errors returned. The ktrace output, though, is much more detailed. Additionally, as files are opened and read, ktrace will actually show the content - look at the http header get passed across in the sample we're using.
As you may expect, ktrace can get even more advanced. First, you can use ktrace to launch a process, just so you get every drop of info, right from the start. Just pass the name on the command line: ktrace SomeApplication. Second, you can instruct ktrace to trace current and/or future descendants of a process, also. Very important stuff! In part two of this series we talked about Activity Monitor and using it to see parent processes - the process that spawned the process you're interested in. Thanks to launchd, SystemStarter and their ilk, this happens a lot under Tiger. ktrace doesn't disappoint. The '-d' switch immediately starts tracing descendants of the process you specify. A nice alternative is the '-i' switch, which will trace only newly created child processes (created after you begin tracing). This is exceptionally handy if you want to trace a process that is a launchd child whose "OnDemand" flag is false - in other words, launchd will keep it alive by restarting it, if it goes away. A perfect example is lookupd: how do you trace this cleanly? Use the '-i' switch:
# ktrace -p 1 -i ; killall lookupd
Here, we trace launchd and whatever it spawns - but we know it's going to re-spawn lookupd pretty soon, because we're going to kill it off. Check the process list with ps, and when lookupd shows up again, stop tracing with ktrace -C. When you look at the results with kdump, you can see the switch from launchd to lookupd:
27414 launchd CALL setsid
27414 launchd RET setsid 27414/0x6b16
27414 launchd CALL sigprocmask(0x3,0xf0101c54,0)
27414 launchd RET sigprocmask 0
27414 launchd CALL setpriority(0,0,0)
27414 launchd RET setpriority 0
27414 launchd CALL execve(0x107d4,0x10644,0x3001e0)
27414 launchd NAMI "/usr/sbin/lookupd"
27414 launchd NAMI "/usr/lib/dyld"
27414 lookupd RET execve 0
27414 lookupd CALL __sysctl(0xbffff9cc,0x2,0xbffff9d4,0xbffff9c8,0x8fe4591c,0xa)
27414 lookupd RET __sysctl 0
27414 lookupd CALL __sysctl(0xbffff9d4,0x2,0x8fe5899c,0xbffffa78,0,0)
27414 lookupd RET __sysctl 0
27414 lookupd CALL __sysctl(0xbffff9cc,0x2,0xbffff9d4,0xbffff9c8,0x8fe45948,0xd)
27414 lookupd RET __sysctl 0
27414 lookupd CALL __sysctl(0xbffff9d4,0x2,0x8fe58998,0xbffffa78,0,0)
27414 lookupd RET __sysctl 0
27414 lookupd CALL open(0x191c,0,0)
27414 lookupd NAMI "/usr/lib/libobjc.A.dylib"
27414 lookupd RET open 3
27414 lookupd CALL fstat(0x3,0xbfffd950)
As I mentioned, ktrace might stay at the bottom of your toolbox for a while, but in certain cases, there is no substitute.
Homework
Over the course of this series, we've looked at many different troubleshooting techniques and utilities. But wait, there's more! More than I can cover, unfortunately. If ktrace gives you happy dreams of watching a program run, how about interaction? I'd be remiss if I didn't mention gdb, the GNU debugger. Available on OS X as part of the Xcode installation, gdb will let you be in control of any application. There are some great documents to get you started, including:
Richard Stallman's, "Debugging with gdb" <http://developer.apple.com/documentation/DeveloperTools/gdb/gdb/gdb_toc.html>
Apple Computer's Technical Note 2032, "Getting Started with gdb" <http://developer.apple.com/technotes/tn/tn2032.html>
Apple Computer, "Using gdb" <http://developer.apple.com/documentation/DeveloperTools/gdb/gdb/gdb_toc.html>
Apple Computer's Technical Note 2030, "Using gdb for MacsBug Veterans" <http://developer.apple.com/technotes/tn/tn2030.html>
Breathe Out
So closes this series on troubleshooting OS X. Hope it transforms the way you attack problems and get answers to questions when dealing with OS X. Despite all of these great tools, there are issues that can be solved more easily and efficiently with a fresh install - and there will always be nasty little things that only the programmers at Apple can deal with effectively. Now, however, there will be a host of new things you have insight to.
I missed recommending a "media of the month" in June, so I need to make up for it this time. A college professor of mine made me read "The Mythical Man Month" by Frederick P. Brooks back in the day. While I really just wanted to dive into some code, I had then, and still do have tremendous respect for this professor, so I begrudgingly read it. It's a text that has stuck with me to this day. It's now available in a 20th Anniversary edition. If you've never read it, I highly recommend it. For those that still do want to 'just geek out', I suppose I should mention some sys-admin-y kind of reading. I've mentioned in the past that you can rule the universe if you know regular expressions, so, let's get everyone back on track. "Mastering Regular Expressions" by Jeffrey E. F. Friedl, published by O'Reilly really will get you on the right track regarding the basics, and all of the things that tend to trip people up with more complex regex.
July! Ready for WWDC? Whether or not you're going to be in San Francisco for the event, I think there's a lot of excitement this year, and you'll be able to live vicariously through MacTech, if you need to! I will be attending, and hope to meet up with readers old and new alike. Send along an e-mail if you'd like to meet up and shake hands, or, if you'd like to introduce a new author for MacTech. In person or in print, see you next month!
References
Just about everything at http://developer.apple.com
The respective source and man page for each utility mentioned.
Ed Marczak owns and operates Radiotope, a technology consulting company. Radiotope helps separate technology issues from policy issues, cool-tech from needed-tech. Guide your decision at http://www.radiotope.com