awk for Data Processing

Volume Number: 22 (2006)
Issue Number: 3
Column Tag: Programming

Mac In The Shell

by Edward Marczak

The Complementary Pattern Processor to sed.

sed and awk are typically mentioned in the same sentence. Each has its own strengths and areas where it is most effective. The past few columns have walked through the power of sed, and I hope everyone has put sed into practice. If sed is so great, why do we need awk? sed is a non-interactive editor. It's powerful for unstructured data: picking out patterns and making changes in that data. awk excels at pulling and manipulating fields in structured data, and generating output formatted as you specify. You'll encounter both types of data as you work, and now you'll have the best and most appropriate tool for each. How can awk help us?

History...Again

When I took one of my very first computer classes, in 7th grade or so, I remember the teacher launching into the history of computing. What?!? History? When are we going to sit down and start typing? Nowadays, I find myself launching into history quite a bit as I write these columns. The benefit is that it frames the present so nicely: we couldn't be where we are now without that history. This is a long-winded way of saying I'm going to describe a little bit about the history of awk!

awk was written at Bell Labs in 1977, shipped with Unix V7 in 1979, and has been part of the standard distribution since. However, there have been a few revisions and versions of awk. Sometimes, these well-meaning versions have extended awk a little here and there. In 1985, the original authors officially revised the language. I can't possibly cover each and every facet of non-standard awk versions. Since this is MacTech, I'm going to cover Lucent awk, version 20040207, the version distributed with OS X, 10.4. This is the version of awk described in "The AWK Programming Language", 1988, by Al Aho, Peter Weinberger, and Brian Kernighan. (Do we see where the name "awk" comes from now?) Be aware that this version matches the POSIX standard of awk. It does not have every one of the extensions that have shown up over the years.

What is it?

The man page for awk says that it is a "pattern-directed scanning and processing language." The first thing to note is that it is a 'real' programming language, with structure. We've seen flow control and looping in bash and sed before. On a basic level, awk auto-constructs the main loop for you: it loops around each line of input. When it reaches EOF, the loop is broken. Like my Calculus I professor used to drill, "You have to know the rules!" Same goes for any programming language. However, rather than launch into a terse description, let's get right to some examples.

Awk me!

Here's an easy one:

$ awk '{print "Got a line"}' some_file.txt

This will print "Got a line" for each line in some_file.txt - there's that loop. This script has one action: run a print statement for each line of input. Besides running awk against a file, you can also pipe data in. Unlike sed, awk does not print input by default. So, to emulate cat, you could simply:

$ awk '{print}' some_file.txt

or

$ ls -l | awk '{print}'

(but using this to emulate cat would be silly). Again, awk really shines when operating on data with a structure. Comma, tab and other delimited formats are ideal - those have obvious structure. However, with enough practice, you'll start to see structure in non-obvious places.

For anyone that really dug into the sed columns, awk's pattern matching will look very familiar:

$ ls -l | awk '/pcap/ {print}'

We pipe the output of 'ls -l' into awk, where awk will jump into action each time it finds a line with 'pcap' on it. Well, we could have done that with 'ls -l *pcap*', right? Well, yes - but stay with me here. What if we didn't want all of the information that comes with 'ls -l'? Or, perhaps, if we wanted to rearrange that info? The output of ls, with the "-l" switch, happens to be very structured. Let's look at a snippet:

drwxr-x---   5 marczak  marczak  170 Jan 18 17:07 tmp
-rw-r-----    1 marczak  marczak  149 Oct 10 15:17 tw.png
-rw-r-----    1 root     marczak 3114 Nov  8 20:00 ts05.pcap

awk will refer to each of the columns as fields - just like a database. The permissions column is field 1, links column is field 2, and so on, up to field 9, in this example, being the file name. If we wanted to rearrange an 'ls' listing, we could use this:

$ ls -l | awk '/pcap/ {print $9,$5,$1}'
cramdump.pcap 15151 -rw-r-----
dhcp.pcap 16422 -rw-r-----
skypecatch.pcap 43421 -rw-r-----
ssldump.pcap 26070 -rw-r-----
testdump.pcap 12716 -rw-r-----
tsnow.pcap 391174 -rw-r-----

This example combines pattern-matching and the field operator. Again, the output of ls is piped to awk, which only acts when the input line matches "pcap". However, we decide to selectively output only the ninth, fifth and first fields.
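If you don't have a directory full of pcap files handy, here is a minimal sketch that feeds awk the three-line listing shown above from a shell variable. The variable name listing is only for illustration. The second command uses awk's standard printf statement, which this column has not covered yet, to line the fields up in columns.

```shell
# Sample 'ls -l' output copied from the listing above, kept in a shell
# variable so the example does not depend on the files you have.
listing='drwxr-x---   5 marczak  marczak  170 Jan 18 17:07 tmp
-rw-r-----    1 marczak  marczak  149 Oct 10 15:17 tw.png
-rw-r-----    1 root     marczak 3114 Nov  8 20:00 ts05.pcap'

# Name, size and permissions for lines that mention "pcap".
printf '%s\n' "$listing" | awk '/pcap/ {print $9,$5,$1}'
# prints: ts05.pcap 3114 -rw-r-----

# The same three fields for every line, padded into columns.
printf '%s\n' "$listing" | awk '{printf "%-12s %6d %s\n", $9, $5, $1}'
```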

Further into the Warren

With sed, we saw that it was good practice to create your script in a separate file - especially if it was a particularly complex script. awk can do the same using the '-f' switch. More conventionally, you may find long awk scripts written like a shell script, utilizing the 'she-bang' notation - #!/usr/bin/awk -f. The trailing -f matters: it tells awk to read its program from the script file rather than treat the file name as the program itself. Just remember to mark the script executable if you do this.
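As a rough sketch of the script-file approach, the following writes a one-rule awk program to a scratch file and runs it with -f. The program and the name pcaps.awk in its comment are made up for illustration, and the input is a single line borrowed from the earlier listing.

```shell
# Scratch file to hold the awk program (created with mktemp so the
# example does not clobber anything).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/usr/bin/awk -f
# pcaps.awk: print the name and size of pcap files from 'ls -l' input
/pcap/ { print $9, $5 }
EOF

# Run the program from the file with -f.
printf '%s\n' '-rw-r-----    1 root     marczak 3114 Nov  8 20:00 ts05.pcap' |
    awk -f "$script"
# prints: ts05.pcap 3114
```

Saved under a real name and marked executable with chmod +x, the same file should also run directly (ls -l | ./pcaps.awk), assuming awk lives at /usr/bin/awk on your system.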

Another important practice, as pointed out with sed, is to comment your script! With awk, it turns out to be even more important, as you should document the expected input format along with code comments. Any routine that relies on structured data is fragile. When the data isn't perfect, it shatters into a million pieces. So, if you're processing a tab-delimited file, you might start your script with these comments:

# thinner.awk
# Remove un-needed data before injecting into mailing database
# Input: tab delimited file with layout:
# first_name, last_name, phone_num, shoe_size, e-mail, e-mail2, favorite_color

This way, when, three years later, the script stops working the way you'd expect, you can compare the input file against what you need.
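To make that concrete, here is one guess at what the body of such a script could look like, wrapped in a shell function so it can be tried from the command line. Which fields survive is my assumption, not part of any real mailing database, and the BEGIN block and the NF and FS variables are explained further down. The useful idea is the second rule: it checks each record against the documented layout and complains instead of silently shattering.

```shell
# Sketch of thinner.awk as a shell function; the choice of fields to
# keep is an assumption made for this example.
thinner() {
    awk '
BEGIN { FS = "\t"; OFS = "\t" }
# Complain (on stderr) about records that do not match the layout.
NF != 7 { print "line " NR ": expected 7 fields, got " NF | "cat 1>&2"; next }
# Keep first_name, last_name and e-mail.
{ print $1, $2, $5 }
'
}

# Two sample records; the second is missing fields on purpose.
printf 'Ann\tLee\t555-0100\t7\tann@example.com\t\tblue\nBob\tRay\t555-0101\n' | thinner
```

The first record comes out as three tab-separated fields, and the second produces a warning on standard error rather than a mangled line.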

awk has some built-in variables that help you move data around. You've seen the field operator - $ - which, I should note, starts at 1. I mean, the first field is actually numbered "1". What happened to programmers counting from zero? The field $0 refers to the entire line of input. A useful built-in that goes along with the field operators is NF.

NF holds the number of fields on the current line. A side-effect is that $NF will always refer to the last field (or, 'column'). We could rewrite the file listing example above like this:

ls -l | awk '/pcap/ {print $NF,$5,$1}'
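The difference between NF and $NF is easy to blur, so here is a tiny sketch on made-up input. NF is a count, $NF is the contents of the last field, and because the field operator accepts any expression, $(NF-1) is the field before that.

```shell
# NF is the number of fields; $NF is the last field itself.
printf 'a b c\nd e\n' | awk '{ print NF, $NF, $(NF-1) }'
# prints:
# 3 c b
# 2 e d
```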

Another important built-in is FS - field separator. Let's look at a very practical OS X use for awk - but we'll need to combine a few concepts to get there. By default, FS is set to a single space character, which awk treats specially: any run of spaces or tabs separates fields, and leading whitespace is ignored. As lines come into awk for processing, it splits them into fields on FS. Unfortunately, this means that a record reading "Name: Catherine O'Hara" is three fields, not two (of course, it's even worse for "James T. Kirk"). You can leave FS alone, making awk split on whitespace. You can also set FS to be any other single character, such as a comma - obviously useful for a simple CSV file. Finally, you can use a regexp and match multiple characters as a separator.
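Here is a quick sketch of all three flavors on made-up input. It uses -F, awk's standard command-line option for setting FS, which is an alternative to assigning FS inside the script.

```shell
# Default FS: runs of spaces separate fields, so this is four fields.
echo 'Name:  James T. Kirk' | awk '{ print NF }'
# prints: 4

# A single character as the separator.
echo 'Kirk,James T.,Captain' | awk -F, '{ print $2 }'
# prints: James T.

# A regular expression: a colon followed by any number of spaces.
echo 'Name:  James T. Kirk' | awk -F': *' '{ print $2 }'
# prints: James T. Kirk
```

With the regexp separator, the whole name finally lands in a single field.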

In addition to pattern matching to find data to process, awk supports two structures that allow for setup and tear-down (aka pre-processing and post-processing). The BEGIN structure runs before any lines are read in. This is ideal for setting variable states before diving in. The END structure runs after all input is processed, and is naturally useful for summing things up. BEGIN is a perfect place to set FS, although FS can even be changed while the script is running.
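As a small sketch of setup and tear-down, the following totals the size column of the three-line 'ls -l' sample from earlier in this column. NR is a standard awk built-in not otherwise covered here: it holds the number of records read so far.

```shell
# The three-line 'ls -l' sample from earlier in the column.
listing='drwxr-x---   5 marczak  marczak  170 Jan 18 17:07 tmp
-rw-r-----    1 marczak  marczak  149 Oct 10 15:17 tw.png
-rw-r-----    1 root     marczak 3114 Nov  8 20:00 ts05.pcap'

printf '%s\n' "$listing" | awk '
BEGIN { total = 0 }                              # before any input
      { total = total + $5 }                     # once per line
END   { print NR " entries, " total " bytes" }   # after the last line
'
# prints: 3 entries, 3433 bytes
```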

So, you're running OS X Server, and want to know who's logged on via AFP. awk to the rescue! Run this:

serveradmin command afp:command = getConnectedUsers | awk 'BEGIN {FS = "="} /name/ { print $NF }'

The output of serveradmin is fed to awk, which sets FS to the equal sign in a BEGIN structure. This simply splits the line in two, based on the input. Then we go on to look for 'name' records, and print out the last field using NF. Let's say that you just wanted to find out if one particular user is connected. awk will let you test a field for a match with the tilde operator ("~"). So, if we're only interested in finding out if "jane" was connected via afp, we can easily do this:

serveradmin command afp:command = getConnectedUsers | awk 'BEGIN {FS = "="} $2 ~ /jane/ { print "Jane is connected!" }'

Of course, you can match any regular expression this way. (Didn't I tell you learning regexp would let you rule the universe?) You can invert the tilde match with an exclamation point:

awk '$2 !~ /barrel/ { print "Not a barrel" }'
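If you're not sitting at an OS X Server, you can still try both operators. The sketch below uses made-up key=value lines in a shell variable. They are NOT real serveradmin output, just something with an equal sign to split on.

```shell
# Made-up key=value lines for illustration; not real serveradmin output.
users='name=jane
name=miguel
ip=10.0.0.5'

# Match: the second field contains "jane".
printf '%s\n' "$users" |
    awk 'BEGIN {FS = "="} $2 ~ /jane/ { print "Jane is connected!" }'
# prints: Jane is connected!

# Inverted match, limited to lines whose first field is exactly "name".
printf '%s\n' "$users" |
    awk 'BEGIN {FS = "="} $1 == "name" && $2 !~ /jane/ { print $2 " is not Jane" }'
# prints: miguel is not Jane
```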

La Regle du Jeu

I mentioned some rules earlier. What are they, and how does that help us? Like sed, awk processes input in a very specific way.

By default, each incoming line is broken into fields, separated by whitespace (runs of spaces or tabs). Lines ("records") are separated by a newline. An awk script is a set of pattern matching rules and actions, with the format:

pattern {action}

Patterns can be one of:

    A regular expression

    A relational expression

    BEGIN

    END

    A pattern range.

The BEGIN pattern runs its action before the first line of input is read. The END pattern runs its action after the last line of input is read and acted upon.

Some other rules about processing: A missing action defaults to "print". A missing pattern always matches. Program lines are terminated by a semicolon or newline. Comments begin with "#" and are not treated as statements. Comments do not need to start at column 1, and will continue until a newline is reached.
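To see the rules in motion, here is a short sketch that uses each kind of pattern once, on four made-up lines of input. It is wrapped in a shell function (the name show_patterns is arbitrary) purely to keep the example tidy.

```shell
show_patterns() {
    awk '
BEGIN           { print "start" }          # before any input
/^a/                                       # regexp, no action: prints the line
$2 > 10         { print $1 " is big" }     # relational expression
/beta/, /gamma/ { print "in range: " $1 }  # pattern range, inclusive
END             { print "done" }           # after all input
'
}

printf 'alpha 1\nbeta 20\ngamma 3\ndelta 40\n' | show_patterns
# prints:
# start
# alpha 1
# beta is big
# in range: beta
# in range: gamma
# delta is big
# done
```

Note that every rule is tested against every line, in order, which is why "beta 20" triggers two of them.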

If you're thinking, "Hey! awk is pretty powerful and simple!" you'd be right. Like many Unix utilities it focuses on one thing, and does it really, really well. In some ways, it's only as complex as you make it. Of course, I've only laid out a fraction of awk's abilities. One more before I leave off.

Variables and Equations

Like every programming language, awk supports variables, and operations on those variables. Variables are case sensitive, but do not need to be declared or initialized. Like PHP, this allows variables to be loosely typed, and awk will choose the context automatically. The following examples do what you'd expect:

# Statements live inside an action; this one runs for every input line.
{
    x = 7
    y = x+3
    a = "Hello, world"
    z = $1   # assign the first field to z
    print "z = " z
    print "a contains " a
    print "x = " x
}

Pretty straight-forward. Variables can be used in the pattern portion of a rule. How about a short example?

BEGIN { FS=":"; x=0 }
$2 ~ /Miguel/ { x = x  + 1 }
END { print "Miguel appears " x " times in the data." }

This fictitious example adds one to "x" each time the second field matches /Miguel/. If I claim that variables don't need to be initialized, why did I in this example? Because the auto-typing can sometimes trip you up. If "x" is not initialized, and there are no matches, "x" is never assigned, and awk decides from the context (string concatenation in the print statement) that "x" is an empty string. This results in the message, "Miguel appears  times in the data." And that's just not very friendly, is it?
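You can watch this happen with a one-line input that contains no Miguel at all. The sketch below also shows an alternative to initializing in BEGIN: adding zero forces numeric context at print time. Either approach works, and the input here is made up.

```shell
# No line matches, so x is never assigned and prints as an empty string.
printf 'a:b\n' |
    awk 'BEGIN { FS=":" } $2 ~ /Miguel/ { x = x + 1 } END { print "Miguel appears " x " times." }'
# prints: Miguel appears  times.

# Adding zero forces numeric context, so the unset x prints as 0.
printf 'a:b\n' |
    awk 'BEGIN { FS=":" } $2 ~ /Miguel/ { x = x + 1 } END { print "Miguel appears " (x + 0) " times." }'
# prints: Miguel appears 0 times.
```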

In Summary...

Glad I didn't try to rush awk into last month's column. The more that you use both sed and awk, the more you see patterns in data, and tend to go back to these utilities. Despite being created in a time when personal computers (or even larger systems) didn't have their own SQL server running locally, or a powerful spreadsheet program at their disposal, sed and awk still have tremendous usefulness. Next month, I'm going to round out a little more about awk, and tie it into OS X.

Speaking of last month's column, I missed it then, but now realize that it marked "Mac in the Shell's" one-year anniversary! I need to thank David Sobsey, Neil Ticktin, and everyone at the magazine for getting me involved, spurring me along, and keeping me interested. Oh, and Dennis - I loved last month's cover! So, I raise my virtual glass in toast to another great year of MacTech! Cheers!

Finally, now that the dust has settled from MacWorld, I do want to say it was a pleasure meeting with many, many MacTech readers! As always, please feel free to comment, suggest and ask questions. See you next month.


Ed Marczak owns and operates Radiotope, a technology consulting company. More tech tips at the blog: http://www.radiotope.com/writing

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.