Mac in the Shell: Python Text Parsing

Volume Number: 25
Issue Number: 07
Column Tag: Mac in the Shell

Automating entries through keyword searching

by Edward Marczak

Introduction

We've been covering Python basics over the last several columns. This month, we'll hit something with a little practicality: text processing. While computers are really good with numbers, people are really good with words. More often than not, input from people comes as text. Turns out that Python is pretty good at dealing with text processing and manipulation. Let's have a closer look, shall we?

More To The Story

OK, there's a little bit more to the story. I've dealt with e-mail systems and e-mail processing for a very long time. (Let's just say that I started with sendmail before it used m4, mmmmkay?) Oftentimes, though, we want a program to deal with incoming mail. This may be for the purposes of a mailing list, for auto-response, or to parse the e-mail and then put relevant bits into a database.

E-mail is either really complex or really simple, depending on how you look at it. It's complex because it's got headers and encoding and parts. But it's simple, because it's all text. No matter what all of the pieces are, they're all just human readable text. Fortunately, there are many pre-built libraries that help deal with the complexity, allowing you, the script writer, to focus on the task at hand: processing the parts of the message body that you're interested in. Python's "batteries included" philosophy ensures that a good mail processing library ships as part of the core package.

How is any of this Mac-specific? Well, it isn't. Not directly. However, I just mentioned that Python has, by default (no extra installation required), a good e-mail processing library. Python ships standard with OS X. That's part of the equation solved. Then, there's the issue of receiving the mail in the first place.

Just about every contemporary mail system has a method of taking incoming mail and feeding it to a script. Postfix, which ships with OS X, is no exception. By default, an SMTP (Simple Mail Transfer Protocol) server wants to receive mail, decide if the mail is for a valid user on its system, and then drop that mail in the user's mailbox. That's it. But what about a list server? Well, you take the same SMTP server, but instead of delivering any mail to an end user's mailbox, you hand all mail off to the list server program. The list server program will determine who to deliver this mail to.

This is also similar to server-side anti-spam. All incoming mail is handed off to an anti-spam program. The mail is analyzed, potentially acted upon (read: dropped), and mail is then fed back into the SMTP server for final delivery.

We're not going to do anything so grand here today, but after finishing up, you'll have the groundwork. If you have an OS X machine acting as a mail relay and really want to test/use this, you're going to need to modify some postfix config files directly.

In /etc/postfix/transport, you'll need to first define a transport. Let's say your main mail server is called mail.example.com. If you want to divert mail to a script, have the mail sent to mproc.example.com, and add the following line to /etc/postfix/transport:

mproc.example.com mproc:

This says, "send all mail that arrives for mproc.example.com to the transport named mproc." Once a transport is defined, we also need to tell postfix how to connect the dots between the transport named mproc and our script. That happens in /etc/postfix/master.cf. Add the following entry to the end of the file:

mproc  unix  -       n       n       -       -       pipe
  flags=DRhu user=mproc argv=/usr/bin/mproc.py

This tells postfix that any mail arriving on the mproc transport should be piped to the mproc.py script. This is, of course, assuming that we store our script in /usr/bin as "mproc.py". Adjust as needed.

Of course, we're going to keep it simple: since the text will be piped into the script, it's easy to simulate. The pipe simply delivers the entire message on stdin.

A Text Processing Script

Again, we said that we're really focusing on processing e-mail as it arrives, so, we're going to look for input via stdin (which the pipe above does for us). Other text processing scripts may want to deal with text already in a file or elsewhere. I'll make sure to cover that in a future column, but that's not the goal of today's exercise. Despite 'keeping it simple,' we'll be covering a few new-to-us concepts.

Here's the assignment: currently, stock information arrives via e-mail where a dedicated person reads the mail and inputs the entries into a database. This person could clearly be doing better things, as this can be automated without changing the backend system that is sending the e-mail message (whether that's a person or a machine is immaterial for this article). These messages will have a strict format: category and value, separated by a colon. The body of a message would look like this:

Company: Cartier

Product: Watch

Model: Original Tank

Number: 12324A332

Price: $4,500

Available: Yes

However, there's a problem when parsing an e-mail message: it's never just the body that you receive. It's headers. And MIME parts. Oy. Fortunately, Python's email library has functions to deal with this.

I say, let's dive right in. Here's the code I'm using, which will be followed by an explanation of the program.

Listing 1: e-mail parsing program, epp.py

#!/usr/bin/env python
import email
import re
import sys
from email.Parser import Parser
# The keywords we're looking for
keys = ['Company', 'Product', 'Model', 'Number', 'Price', 'Available']
# Compile each keyword into a regular expression
keysre = {}
for i in keys:
  keysre[i] = re.compile(i)
# Read stdin into a single string
mystdin = sys.stdin.read()
# Create a parser object and parse the input
p = Parser()
ps = p.parsestr(mystdin)
# Examine each message part for an appropriate plain body
for i in ps.walk():
  if i.get_content_subtype() != "plain":
    continue
  plainbody = i.as_string()
# Break message into lines, based on newline char
plainbody = plainbody.splitlines()
for i in plainbody:
  # Look at each key for a match.
  for k in keys:
    if keysre[k].match(i):
      print i
sys.exit(0)

The first thing to notice about the code is its relative brevity: 37 lines in total. As usual, the first few lines simply get us set up: the shebang line and relevant imports, including the Python-supplied email module.

#!/usr/bin/env python
import email
import re
import sys
from email.Parser import Parser

There have been a few times in this column that I've mentioned the importance of regular expressions (RE). Python has good support for RE from the re module:

# The keywords we're looking for
keys = ['Company', 'Product', 'Model', 'Number', 'Price', 'Available']
# Compile each keyword into a regular expression
keysre = {}
for i in keys:
  keysre[i] = re.compile(i)

What is happening here is that we define a list of the keywords we're going to be looking for in the message body. Python regular expressions are compiled into pattern objects before they're used (the module-level functions in re do this for you behind the scenes). Compiling them ourselves, once, lets us reuse the same objects on every line, which is why we define the keysre dictionary. Of course, we could define these objects one at a time, but that's really inelegant and doesn't scale. In the loop, the dictionary is filled with keys that correspond to the words we're going to match, with a value of the compiled RE object.
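As a small illustration of the same idea, here is a sketch written in Python 3 syntax (an assumption on my part; the listing itself targets the Python 2 that ships with OS X). The trailing colon in each pattern and the helper name find_key are my own additions, not part of the listing. They just make the match a little stricter and easier to try out.

```python
import re

# The keywords we're looking for (same list as in the listing)
keys = ['Company', 'Product', 'Model', 'Number', 'Price', 'Available']

# One compiled pattern per keyword. re.escape() guards against keywords
# that contain RE metacharacters; the trailing colon is an added assumption.
keysre = {k: re.compile(re.escape(k) + ':') for k in keys}

def find_key(line):
    """Return the keyword that starts this line, or None (hypothetical helper)."""
    for k in keys:
        # match() only succeeds at the very beginning of the string
        if keysre[k].match(line):
            return k
    return None

print(find_key('Model: Original Tank'))
print(find_key('The Model: is not at the start'))
```

Because match() anchors at the start of the string, a keyword that merely appears somewhere in the middle of a line is ignored.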

# Read stdin into a single string
mystdin = sys.stdin.read()
# Create a parser object and parse the input
p = Parser()
ps = p.parsestr(mystdin)

The first part of this section is pretty simple: assign all of stdin to the variable mystdin. Part of the email library is the email parser object. This object allows an e-mail message, headers, MIME parts and all to be parsed, iterated over and picked apart. We're defining a new parser object and then loading the variable ps with a parsed version of the message that's arriving on stdin.
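To get a feel for what the parser hands back, here is a minimal sketch in Python 3 syntax, where the module is spelled email.parser (lowercase). The message text and addresses are made up for illustration, and a string literal stands in for stdin.

```python
from email.parser import Parser

# A tiny hand-written message; the addresses are invented for this example.
raw = (
    "From: stock@example.com\n"
    "To: mproc@mproc.example.com\n"
    "Subject: Stock update\n"
    "\n"
    "Company: Cartier\n"
    "Product: Watch\n"
)

msg = Parser().parsestr(raw)

# Headers are available with dictionary-style access
print(msg['Subject'])
# A simple message like this one has no MIME sub-parts
print(msg.is_multipart())
# For a non-multipart message, the payload is the body text
print(msg.get_payload())
```

In a real run, the string passed to parsestr() would come from sys.stdin.read(), exactly as in the listing.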

# Examine each message part for an appropriate plain body
for i in ps.walk():
  if i.get_content_subtype() != "plain":
    continue
  plainbody = i.as_string()

This section of the code hands us back the plain part of the message. MIME types are described in two parts, a type and a subtype, such as "text/html". If the message has several parts, we're only interested in the plain one. The conditional tests whether the part's subtype is something other than plain. If so, we continue and go back to the top of the loop. If it is plain, we fall through and assign the entire part, as a string, to the variable plainbody.
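Here is a variation on that loop, sketched in Python 3 syntax against a made-up two-part message. It differs from the listing in two ways, both my own choices: it compares the full content type with get_content_type(), and it uses get_payload(decode=True) instead of as_string(), because as_string() returns the part together with its own headers, while the payload is the body alone.

```python
from email.parser import Parser

# A hand-built multipart message; boundary and addresses are invented.
raw = (
    "From: stock@example.com\n"
    "Subject: Stock update\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "Company: Cartier\n"
    "Price: $4,500\n"
    "--XYZ\n"
    'Content-Type: text/html; charset="us-ascii"\n'
    "\n"
    "<p>Company: Cartier</p>\n"
    "--XYZ--\n"
)

msg = Parser().parsestr(raw)

plainbody = None
for part in msg.walk():
    if part.get_content_type() != "text/plain":
        continue
    # decode=True undoes any base64 / quoted-printable transfer encoding
    # and returns bytes, which we turn back into text.
    payload = part.get_payload(decode=True)
    charset = part.get_content_charset() or "us-ascii"
    plainbody = payload.decode(charset, errors="replace")
    break  # stop at the first plain part

print(plainbody)
```

Starting plainbody at None also gives you something to test for when a message arrives with no plain part at all.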

# Break message into lines, based on newline char
plainbody = plainbody.splitlines()

The splitlines() string method returns a list in which each element is one line of the string, split at line boundaries such as the newline character. Now, we can examine each line in turn:
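A quick sketch (Python 3 syntax, with a made-up body string) shows why splitlines() is a comfortable fit for mail, where line endings may be a mix of "\r\n" and "\n":

```python
# Mail often arrives with CRLF line endings; this sample mixes both styles.
body = "Company: Cartier\r\nProduct: Watch\nModel: Original Tank\n"

# splitlines() recognizes both endings and drops them
lines = body.splitlines()
print(lines)

# split('\n'), by contrast, leaves the '\r' in place and
# produces a trailing empty string after the final newline
print(body.split('\n'))
```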

for i in plainbody:
  # Look at each key for a match.
  for k in keys:
    if keysre[k].match(i):
      print i

As we examine each line, an if statement tests for a match against each of our regular expressions by looping through the keys list and looking up the compiled pattern in the keysre dictionary. If there's a match, we print it out. Naturally, we can take other action here besides printing it out, such as storing it internally, comparing it to some known value or even inserting it into a database. One thing you will likely want to do is to split the matching lines into key/value pairs. The string's split method does this very nicely. For example:

key, value = i.split(':')

The argument to split is the separator to split on. In our case, we know the lines are split by the colon character and that we're expecting back two values. The split method will happily split as many times as needed. In the case where you don't know how many values to expect, you may just want to assign to a list, like so:

values = i.split(':')

From there you can work out how many values were split and returned to you, and what to do with them.
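One wrinkle worth sketching: if a value itself contains a colon, the two-name assignment fails, because split returns more than two pieces. A possible way around that, shown here in Python 3 syntax, is to pass split a maximum number of splits. The helper name parse_line and the "Note" line are hypothetical, added only for illustration.

```python
def parse_line(line):
    """Split 'Key: value' into a (key, value) tuple (hypothetical helper)."""
    # The second argument limits split to the first colon only,
    # so any colons inside the value survive untouched.
    key, value = line.split(':', 1)
    return key.strip(), value.strip()

record = {}
# The 'Note' line is invented to show a value that contains a colon.
for line in ['Company: Cartier', 'Price: $4,500', 'Note: ships in 2:1 packs']:
    key, value = parse_line(line)
    record[key] = value

print(record)
```

Collecting the pairs in a dictionary keeps them handy for whatever comes next, whether that's a comparison or a database insert.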

Finally, we exit the program with a 'clean' exit code:

sys.exit(0)

Running the Program

If you don't happen to have any test e-mail sitting around, I've placed one on the MacTech ftp site, under this month's directory (ftp.mactech.com/src/mactech/volume25_2009/25.07.sit). If you run your own mail server, you can actually just go grab a raw message from the mail spool (your own mail, mind you!).

Since the instructions I gave in the first part allow postfix to send incoming mail through a pipe and to the application, we need a more convenient way to test. The command line makes this easy: just pipe it yourself. Don't forget to mark the program as executable:

chmod 770 epp.py

and then pipe away:

cat /path/to/mits_test_mail | ./epp.py

(or, substitute the ./ with the full path to the program, if needed). If you're using the test mail from the MacTech ftp site, you should see the output you expect: the values that we're matching on, with no headers, MIME clutter, etc. Take a look at the original test mail file to see just how much cruft is being left out.
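If you'd rather generate a test message than download one, the email package can build one for you. The sketch below assumes Python 3.6 or later (for email.message.EmailMessage), so it is a separate helper and not something to run under the Python 2 of the listing. The addresses and the HTML alternative are invented. It prints the raw message, which you can redirect to a file.

```python
from email.message import EmailMessage

# Headers are made up for this example
msg = EmailMessage()
msg['From'] = 'stock@example.com'
msg['To'] = 'mproc@mproc.example.com'
msg['Subject'] = 'Stock update'

# The plain body, in the strict "category: value" format
body = (
    "Company: Cartier\n"
    "Product: Watch\n"
    "Model: Original Tank\n"
    "Number: 12324A332\n"
    "Price: $4,500\n"
    "Available: Yes\n"
)
msg.set_content(body)

# Adding an HTML alternative turns the message into multipart/alternative,
# which gives the parsing script some MIME clutter to skip over.
msg.add_alternative("<p>Company: Cartier</p>\n", subtype='html')

raw = msg.as_string()
print(raw)
```

Save the output to a file and pipe that file into epp.py just as described above.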

Conclusion

This was a bit of a whirlwind tour of several concepts. I'd encourage you to bulk up an application like this by checking for error conditions and then taking appropriate action. Outside of that, though, it's pretty impressive how few dedicated commands are needed to process a well-formed e-mail message. The rest are really just 'nuts and bolts' features of the language.
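As one example of the kind of error check I mean, here is a minimal sketch (Python 3 syntax) that verifies every expected keyword actually turned up. It assumes the matched lines have already been collected into a dictionary. The helper names missing_keys and exit_code_for are hypothetical, and returning 1 for failure is simply one possible convention.

```python
import sys

# The keywords we expect in every message (same list as in the listing)
keys = ['Company', 'Product', 'Model', 'Number', 'Price', 'Available']

def missing_keys(record):
    """Return the expected keywords absent from record (hypothetical helper)."""
    return [k for k in keys if k not in record]

def exit_code_for(record):
    """Return 0 if the record is complete, 1 otherwise (hypothetical helper)."""
    missing = missing_keys(record)
    if missing:
        # Report the problem on stderr so it doesn't mix with normal output
        sys.stderr.write("missing fields: %s\n" % ", ".join(missing))
        return 1
    return 0

# An incomplete record, as might result from a truncated message
record = {'Company': 'Cartier', 'Product': 'Watch'}
print(exit_code_for(record))
```

The value returned could then be handed to sys.exit() in place of the unconditional 0.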

Media of the month: I'd like to think that everyone has some kind of music that they like. Something that reached them, or that reminds them of some period of time. Well, growing up in New York certainly left a musical stamp on me. I just finished "No Wave" by Marc Masters, and I just loved every second of it. I remember the NY scene around that time, but was certainly too young to fully appreciate it. I don't expect everyone to fully enjoy or 'get' No Wave. But sometimes, the best way to enjoy music is by reading about it. So think of the music that inspires you and find the reading material that points out its inspiration. Thanks to Bruce Gerson for inspiring the topic this month.

Next month, we'll expand on some of the concepts covered here and dig deeper into the well that Python has to offer.


Ed Marczak is the Executive Editor of MacTech Magazine. He has written for MacTech since 2004.

 

All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved.