awk for Data Processing

Volume Number: 22 (2006)
Issue Number: 3
Column Tag: Programming

Mac In The Shell

by Edward Marczak

The Complementary Pattern Processor to sed.

sed and awk are typically mentioned in the same sentence. They both have their own strengths and areas where they are most effective. The past few columns have walked through the power of sed, and I hope everyone has put sed into practice. If sed is so great, why do we need awk? sed is a non-interactive editor. It's powerful for working with unstructured data: picking out patterns and making changes in that data. awk excels at pulling and manipulating fields in structured data, and generating output formatted as you specify. You'll encounter both types of data as you work, and now you'll have the best and most appropriate tools for each. How can awk help us?

History...Again

When I took one of my very first computer classes, in 7th grade or so, I remember the teacher launching into the history of computing. What?!? History? When are we going to sit down and start typing? Nowadays, I find myself launching into history quite a bit as I write these columns. The benefit is that it frames the present so nicely: we couldn't be where we are now without that history. This is a long-winded way of saying I'm going to describe a little bit about the history of awk!

awk was written at Bell Labs in 1977 and shipped with Version 7 Unix in 1979; it has been part of the standard distribution since. However, there have been a few revisions and versions of awk. Sometimes, these well-meaning versions have extended awk a little here and there. In 1985, the original authors officially revised the language. I can't possibly cover each and every facet of non-standard awk versions. Since this is MacTech, I'm going to cover Lucent awk, version 20040207, the version distributed with OS X, 10.4. This is the version of awk described in "The AWK Programming Language", 1988, by Al Aho, Peter Weinberger, and Brian Kernighan. (Do we see where the name "awk" comes from now?) Be aware that this version matches the POSIX standard of awk. It does not have every one of the extensions that have shown up over the years.

What is it?

The man page for awk says that it is a "pattern-directed scanning and processing language." The first thing to note is that it is a 'real' programming language, with structure. We've seen flow control and looping in bash and sed before. On a basic level, awk auto-constructs the main loop for you: it loops around each line of input. When it reaches EOF, the loop is broken. Like my Calculus I professor used to drill, "You have to know the rules!" Same goes for any programming language. However, rather than launch into a terse description, let's get right to some examples.

Awk me!

Here's an easy one:

$ awk '{print "Got a line"}' some_file.txt

This will print "Got a line" for each line in some_file.txt - there's that loop. This script has one action: run a print statement for each line of input. Besides running awk against a file, you can also pipe data in. Unlike sed, awk does not print input by default. So, to emulate cat, you could simply:

$ awk '{print}' some_file.txt

or

$ ls -l | awk '{print}'

(but using this to emulate cat would be silly). Again, awk really shines when operating on data with a structure. Comma, tab and other delimited formats are ideal - those have obvious structure. However, with enough practice, you'll start to see structure in non-obvious places.

For anyone that really dug into the sed columns, awk's pattern matching will look very familiar:

$ ls -l | awk '/pcap/ {print}'

We pipe the output of 'ls -l' into awk, where awk will jump into action each time it finds a line with 'pcap' on it. Well, we could have done that with 'ls -l *pcap*', right? Well, yes - but stay with me here. What if we didn't want all of the information that comes with 'ls -l'? Or, perhaps, if we wanted to rearrange that info? The output of ls, with the "-l" switch, happens to be very structured. Let's look at a snippet:

drwxr-x---   5 marczak  marczak  170 Jan 18 17:07 tmp
-rw-r-----    1 marczak  marczak  149 Oct 10 15:17 tw.png
-rw-r-----    1 root     marczak 3114 Nov  8 20:00 ts05.pcap

awk will refer to each of the columns as fields - just like a database. The permissions column is field 1, links column is field 2, and so on, up to field 9, in this example, being the file name. If we wanted to rearrange an 'ls' listing, we could use this:

$ ls -l | awk '/pcap/ {print $9,$5,$1}'
cramdump.pcap 15151 -rw-r-----
dhcp.pcap 16422 -rw-r-----
skypecatch.pcap 43421 -rw-r-----
ssldump.pcap 26070 -rw-r-----
testdump.pcap 12716 -rw-r-----
tsnow.pcap 391174 -rw-r-----

This example combines pattern-matching and the field operator. Again, the output of ls is piped to awk, which only acts when the input line matches "pcap". However, we decide to selectively output only the ninth, fifth and first fields.
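To try the field operator without depending on what happens to be in your own directory, here is a minimal sketch that feeds awk the three sample lines from the listing shown earlier. The variable name listing is only there for illustration.

```shell
# Sample 'ls -l' style lines, copied from the snippet above.
listing='drwxr-x---   5 marczak  marczak  170 Jan 18 17:07 tmp
-rw-r-----    1 marczak  marczak  149 Oct 10 15:17 tw.png
-rw-r-----    1 root     marczak 3114 Nov  8 20:00 ts05.pcap'

# Print name, size and permissions for lines that contain "pcap".
printf '%s\n' "$listing" | awk '/pcap/ {print $9, $5, $1}'
```

Only the last sample line matches, so this should print "ts05.pcap 3114 -rw-r-----".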

Further into the Warren

With sed, we saw that it was good practice to create your script in a separate file - especially if it was a particularly complex script. awk can do the same using the '-f' switch. More conventionally, you may find long awk scripts written like a shell script, utilizing the 'she-bang' notation, with '#!/usr/bin/awk -f' as the first line. The '-f' matters there: it tells awk to read its program from the script file. Just remember to mark the script executable if you do this.
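As a rough sketch of both approaches, the snippet below writes a one-rule awk program to a file and runs it with -f. The file name first_field.awk and its location in a temporary directory are made up for the example, and the she-bang line assumes awk lives at /usr/bin/awk on your system.

```shell
# Hypothetical script location; any writable path will do.
script="${TMPDIR:-/tmp}/first_field.awk"

# Write a tiny awk program to that file.
cat > "$script" <<'EOF'
#!/usr/bin/awk -f
# Print the first field of every input line.
{ print $1 }
EOF

# Run it with the -f switch.
printf 'alpha beta\ngamma delta\n' | awk -f "$script"

# Mark it executable so it can also be called directly by its path.
chmod +x "$script"
```

The awk -f run should print "alpha" and "gamma" on separate lines. To awk, the she-bang line is just another comment.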

Another important practice, as pointed out with sed, is to comment your script! With awk, it turns out to be even more important, as you should document the expected input format along with code comments. Any routine that relies on structured data is fragile. When the data isn't perfect, it shatters into a million pieces. So, if you're processing a tab-delimited file, you might start your script with these comments:

# thinner.awk
# Remove un-needed data before injecting into mailing database
# Input: tab delimited file with layout:
# first_name, last_name, phone_num, shoe_size, e-mail, e-mail2, favorite_color

This way, when, three years later, the script stops working the way you'd expect, you can compare the input file against what you need.

awk has some built-in variables that help you move data around. You've seen the field operator - $ - which, I should note, starts at 1. I mean, the first field is actually numbered "1". What happened to programmers counting from zero? The field $0 refers to the entire line of input. A useful built-in that goes along with the field operators is NF.

NF references the number of fields on the current line. A side-effect is that $NF will always refer to the last field (or, 'column'). We could rewrite the file listing example above like this:

ls -l | awk '/pcap/ {print $NF,$5,$1}'
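The difference between NF and $NF is easy to see on a couple of made-up lines with different field counts. This is only an illustrative sketch; the input text is invented.

```shell
# NF is the number of fields; $NF is the content of the last field,
# and $(NF-1) is the one before it.
printf 'one two three\nfour five\n' | awk '{print NF, $NF, $(NF-1)}'
```

This should print "3 three two" for the first line and "2 five four" for the second.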

Another important built-in is FS - field separator. Let's look at a very practical OS X use for awk - but we'll need to combine a few concepts to get there. By default, FS is set to a single space character, which awk treats specially: fields are split on any run of spaces or tabs. Unfortunately, this means that a record reading "Name: Catherine O'Hara" is three fields, not two (of course, it's even worse for "James T. Kirk"). You can leave FS alone, making awk split on whitespace. You can also set FS to be any other single character, such as a comma - obviously useful for a simple CSV file. Finally, you can use a regexp and match multiple characters as a separator.
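Here is a small sketch of all three choices. The sample data is invented, and FS is set inside a BEGIN block, which simply runs before any input is read.

```shell
line="Name: Catherine O'Hara"

# Default FS: the name is cut in two, giving three fields.
printf '%s\n' "$line" | awk '{print NF}'

# FS set to a single character, here a comma.
printf 'ann,bob,cy\n' | awk 'BEGIN {FS = ","} {print $2}'

# FS set to a regular expression: a colon followed by one or more spaces.
printf '%s\n' "$line" | awk 'BEGIN {FS = ": +"} {print $2}'
```

The three commands should print "3", "bob" and "Catherine O'Hara" respectively. With the regexp separator the whole name stays together in field 2.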

In addition to pattern matching to find data to process, awk supports two structures that allow for setup and tear-down (aka pre-processing and post-processing). The BEGIN structure runs before any lines are read in. This is ideal for setting variable states before diving in. The END structure runs after all input is processed, and is naturally useful for summing things up. BEGIN is a perfect place to set FS, although FS can even be changed while the script is running.
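A minimal sketch of setup and tear-down working together, using invented comma-separated data. The variable total is hypothetical, and NR is another awk built-in (not covered above) that holds the number of records read so far.

```shell
# Hypothetical comma-separated data: name,amount
data='ann,10
bob,25
cy,7'

# BEGIN sets the separator before any input is read, the middle rule
# runs once per line, and END reports the result after the last line.
printf '%s\n' "$data" | awk '
BEGIN { FS = ","; total = 0 }
      { total = total + $2 }
END   { print NR " records, total " total }'
```

For this data the END rule should print "3 records, total 42".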

So, you're running OS X Server, and want to know who's logged on via AFP. awk to the rescue! Run this:

serveradmin command afp:command = getConnectedUsers | awk 'BEGIN {FS = "="} /name/ { print $NF }'

The output of serveradmin is fed to awk, which sets FS to the equal sign in a BEGIN structure. This simply splits the line in two, based on the input. Then we go on to look for 'name' records, and print out the last field using NF. Let's say that you just wanted to find out if one particular user is connected. awk will let you test a field for a match with the tilde operator ("~"). So, if we're only interested in finding out if "jane" was connected via afp, we can easily do this:

serveradmin command afp:command = getConnectedUsers | awk 'BEGIN {FS = "="} $2 ~ /jane/ { print "Jane is connected!" }'

Of course, you can match any regular expression this way. (Didn't I tell you learning regexp would let you rule the universe?) You can invert the tilde match with an exclamation point:

awk '$2 !~ /barrel/ { print "Not a barrel" }'
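To see both operators side by side, here is a sketch run against a few invented inventory lines. The variable name items and the data are assumptions for the example.

```shell
# Made-up inventory lines: id, container type.
items='1 barrel
2 crate
3 barrel'

# Lines whose second field matches /barrel/ ...
printf '%s\n' "$items" | awk '$2 ~ /barrel/ {print $1 " is a barrel"}'

# ... and lines whose second field does not.
printf '%s\n' "$items" | awk '$2 !~ /barrel/ {print $1 " is not a barrel"}'
```

The first command should report items 1 and 3, the second only item 2.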

La Regle du Jeu

I mentioned some rules earlier. What are they, and how does that help us? Like sed, awk processes input in a very specific way.

By default, each incoming line is broken into fields, separated by whitespace. Lines ("records") are separated by a newline. An awk script is a set of pattern matching rules and actions, with the format:

pattern {action}

Patterns can be one of:

    A regular expression

    A relational expression

    BEGIN

    END

    A pattern range.

The BEGIN pattern runs its action before the first line of input is read. The END pattern runs its action after the last line of input is read and acted upon.
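Regular expressions, BEGIN and END have all appeared in earlier examples. The two pattern types not yet shown, a relational expression and a pattern range, can be sketched like this on invented data:

```shell
# Made-up input: a number and a word on each line.
nums='1 start
2 a
3 stop
4 b'

# Relational expression: act only when field 1 is greater than 2.
printf '%s\n' "$nums" | awk '$1 > 2 {print $2}'

# Pattern range: act from a line matching /start/ through the next
# line matching /stop/, inclusive.
printf '%s\n' "$nums" | awk '/start/,/stop/ {print $1}'
```

The first command should print "stop" and "b"; the second should print 1, 2 and 3, one per line.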

Some other rules about processing: A missing action defaults to "print". A missing pattern always matches. Program lines are terminated by a semicolon or newline. Comments begin with "#" and are not treated as statements. Comments do not need to start at column 1, and will continue until a newline is reached.
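These defaults are easy to check from the command line. A short sketch, with invented input:

```shell
# No action: the default action prints each matching line.
printf 'keep me\ndrop\n' | awk '/keep/'

# No pattern: the action runs for every line. Two statements share one
# line thanks to the semicolon, and the trailing comment is ignored.
printf 'a b\n' | awk '{ n = NF; print n, $2 }  # comment inside the program'
```

The first command should print "keep me" and the second "2 b".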

If you're thinking, "Hey! awk is pretty powerful and simple!" you'd be right. Like many Unix utilities it focuses on one thing, and does it really, really well. In some ways, it's only as complex as you make it. Of course, I've only laid out a fraction of awk's abilities. One more before I leave off.

Variables and Equations

Like every programming language, awk supports variables, and operations on those variables. Variables are case sensitive, but do not need to be declared or initialized. Like PHP, this allows variables to be loosely typed, and awk will choose the context automatically. The following examples do what you'd expect:

{
    x = 7
    y = x+3
    a = "Hello, world"
    z = $1   # assign the first field to z
    print "z = " z
    print "a contains " a
    print "x = " x
}

Pretty straight-forward. Variables can be used in the pattern portion of a rule. How about a short example?

BEGIN { FS=":"; x=0 }
$2 ~ /Miguel/ { x = x  + 1 }
END { print "Miguel appears " x " times in the data." }

This fictitious example adds one to "x" every time that the second field matches /Miguel/. If I claim that variables don't need to be initialized, why did I in this example? Because the auto-typing can sometimes trip you up. If "x" is not initialized, and there are no matches, "x" is never used as a number; when the END rule joins it into the message, awk decides from the context that "x" is an empty string. This results in the message, "Miguel appears  times in the data." And that's just not very friendly, is it?
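The effect is easy to reproduce. In this sketch the input is invented so that nothing matches, and x is deliberately left uninitialized. Adding zero in the END rule is one common way to force a numeric context; it is shown here as an alternative to initializing in BEGIN, not as the column's own method.

```shell
# No line matches, so x is never assigned and prints as an empty string.
printf 'a:b\n' | awk 'BEGIN {FS = ":"} $2 ~ /Miguel/ {x = x + 1} END {print "count=[" x "]"}'

# Adding zero makes awk treat x as a number, so it prints as 0.
printf 'a:b\n' | awk 'BEGIN {FS = ":"} $2 ~ /Miguel/ {x = x + 1} END {print "count=[" (x + 0) "]"}'
```

The first command should print "count=[]" and the second "count=[0]".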

In Summary...

Glad I didn't try to rush awk into last month's column. The more that you use both sed and awk, the more you see patterns in data, and tend to go back to these utilities. Despite being created in a time when personal computers (or even larger systems) didn't have their own SQL server running locally, or a powerful spreadsheet program at their disposal, sed and awk still have tremendous usefulness. Next month, I'm going to round out a little more about awk, and tie it into OS X.

Speaking of last month's column, I missed it then, but now realize that it marked "Mac in the Shell's" one-year anniversary! I need to thank David Sobsey, Neil Ticktin, and everyone at the magazine for getting me involved, spurring me along, and keeping me interested. Oh, and Dennis - I loved last month's cover! So, I raise my virtual glass in toast to another great year of MacTech! Cheers!

Finally, now that the dust has settled from MacWorld, I do want to say it was a pleasure meeting with many, many MacTech readers! As always, please feel free to comment, suggest and ask questions. See you next month.


Ed Marczak owns and operates Radiotope, a technology consulting company. More tech tips at the blog: http://www.radiotope.com/writing

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.