June 96 - MPW Tips And Tricks: Scripted Text Editing
Tim Maroney
The MPW Shell contains a full-strength, high-speed text editor with scripting
capabilities. It's nothing to write love letters with, because it's targeted at
the ASCII format of compiler source files, but it provides the power to
automate complex and repetitive tasks in ASCII text. The key to the system lies
in a few editing-related commands, together with its regular expressions and
selection expressions.
In the MPW Shell, any search command can take one of two kinds of arguments.
The first is a plain string, which matches exactly its contents and nothing
else, using a simple character-by-character match. The other is a regular
expression, which is a pattern that can be recognized by a finite state
machine. You can't parse programming languages with regular expressions, but
you can use them to recognize many patterns, including wildcards, repeating
sequences, and sets of characters.
Regular expressions are bracketed with either slashes or backslashes, for
searching forward or backward respectively. So, for instance, the regular
expression \wombat\ would search backward from the current location for the
string "wombat".
There are about 20 special constructs within regular expressions, all of which
are cryptically described when you execute the command line "Help Patterns"
within the MPW Shell. I'll mention some of the more useful ones here. The
wildcard characters are the question mark (?) and the equivalence symbol (≈,
Option-X). The question mark matches any one character except the end of a
line, while the equivalence symbol matches any number of such characters. For
instance, /w?mb≈t/ would match "wombat" as well as "wambiklort" and "wymbt",
but not "wafkambiliot", nor "wkmb" at the end of a line. Restricted sets of
symbols can be given in brackets; for instance, you can search for alphanumeric
characters with the pattern [a-zA-Z0-9]. The reverse of a set can be specified
with the "not" symbol (¬, Option-L); for instance, /[¬a-z]/ finds any character
except a lowercase letter. The start of a line can be specified with the bullet
symbol (•, Option-8) and the end of a line with the infinity symbol
(∞, Option-5).
These keyboard shortcuts are for American QWERTY keyboards. Other keyboards
have different layouts. For instance, on a direct neural interface keyboard,
think "blue wildebeest" and raise your right ear to type the bullet
symbol.
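For readers who think more easily in a modern regular expression dialect, here is a rough Python analogy for the wildcard example above. It is a sketch only: it uses Python's re module rather than MPW syntax, and it assumes that MPW's question mark corresponds to Python's dot and that the equivalence symbol corresponds to dot-star. The names PATTERN, NOT_LOWER, and matches are mine, not MPW's.

```python
import re

# Assumed translation of the article's wildcard pattern into Python syntax:
#   MPW question mark (any one character except end of line)  ->  .
#   MPW equivalence symbol (any number of such characters)     ->  .*
PATTERN = re.compile(r"w.mb.*t")

# Assumed translation of the negated set: "anything but a lowercase letter".
NOT_LOWER = re.compile(r"[^a-z]")


def matches(line):
    """Return True if the wildcard pattern occurs anywhere in the line."""
    return PATTERN.search(line) is not None


if __name__ == "__main__":
    for sample in ["wombat", "wambiklort", "wymbt", "wafkambiliot", "wkmb"]:
        print(sample, matches(sample))
```

Like the MPW question mark, Python's dot does not match the end of a line by default, which is why the translation carries over fairly directly.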
Repeating patterns can be specified in three ways. Following any pattern with a
plus sign (+) means one or more instances of that pattern; for instance, the
regular expression /[0-9]+/ would match any sequence of digits. An optional
repeating pattern can be similarly specified with an asterisk (*), which means
zero or more repetitions. The rarely seen double angle brackets can be used to
specify exactly how many repetitions of a pattern are allowed. They're typed as
Option-backslash («) and Option-Shift-backslash (») and enclose a
single number to mean exactly that many repetitions, or two numbers separated
by a comma to specify a minimum and maximum number of repetitions, or a single
number followed by a comma to mean at least that many repetitions. For
instance, the pattern /[a-zA-Z]«3,7»/ would find all strings
composed of alphabetical characters and from three to seven letters long.
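The three repetition forms have close cousins in Python's re module, sketched below as an analogy rather than as MPW syntax. I'm assuming that the double angle brackets map onto Python's curly-brace counts; the constant names and the first_match helper are hypothetical.

```python
import re

# Assumed Python equivalents of the repetition forms described above.
DIGITS = re.compile(r"[0-9]+")              # one or more digits
WORD_3_TO_7 = re.compile(r"[a-zA-Z]{3,7}")  # three to seven letters
AT_LEAST_3 = re.compile(r"[a-zA-Z]{3,}")    # at least three letters


def first_match(pattern, text):
    """Return the first substring matched by pattern, or None if there is none."""
    found = pattern.search(text)
    return found.group(0) if found else None


if __name__ == "__main__":
    print(first_match(DIGITS, "line 418:"))
    print(first_match(WORD_3_TO_7, "an extraordinary day"))
    print(first_match(AT_LEAST_3, "an extraordinary day"))
```

Note that in Python a bounded count applied to a longer word matches only its first seven letters; if you need whole words, you have to anchor the pattern on both sides.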
There are a number of ways of "escaping" special characters when you want to
look for something that has special meaning within regular expressions, such as
a question mark or plus sign. You can escape any character with the lowercase
delta (∂, Option-D), or use single or double quotes to escape
strings. To find the string "wombat+", for instance, you'd need to escape the
plus sign: /wombat∂+/.
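The same trap exists in most regular expression dialects. As an illustration only, here is how the escaped plus sign looks in Python, where the backslash plays the role that the escape character plays in MPW. The variable names are mine.

```python
import re

# Unescaped: the plus sign means "one or more of the preceding t".
unescaped = re.compile(r"wombat+")

# Escaped by hand: the backslash makes the plus sign a literal character.
escaped = re.compile(r"wombat\+")

# Escaped automatically: re.escape quotes every special character in a string.
auto_escaped = re.compile(re.escape("wombat+"))

if __name__ == "__main__":
    print(unescaped.search("wombattt").group(0))
    print(escaped.search("a wombat+ b").group(0))
    print(auto_escaped.pattern)
```

Escaping a whole string at once is the safer habit when the search text arrives as a script parameter and you don't know what it contains.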
Finally, one of the most useful constructs consists of a tagged regular
expression. This allows you to associate a number between 0 and 9 with a
pattern that's matched, referring to it later with the "registered" symbol
(®, Option-R) followed by a digit. This is very handy when you're doing
replacements. For instance, you can replace any angle-bracketed string with a
parenthesized string with the following command, which would turn
"<wombat>" into "(wombat)":
Replace /<([¬<>]*)®1>/ (®1)
This searches for any number of characters (except angle brackets) that are
between angle brackets, assigns them the number 1, and then replaces the angle
brackets with parentheses. Note that the syntax of tagged patterns requires the
pattern to be parenthesized.
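Tagged expressions correspond to what Python calls capture groups. The sketch below reproduces the angle-bracket replacement in Python as an analogy; it is not MPW code. I'm assuming that a plain Replace changes only the next match, so the hypothetical helper angle_to_paren defaults to a single replacement.

```python
import re


def angle_to_paren(text, count=1):
    """Turn <text> into (text).

    The parenthesized group captures the characters between the angle
    brackets, and \\1 in the replacement refers back to that group.
    count=1 replaces only the first match; count=0 replaces every match.
    """
    return re.sub(r"<([^<>]*)>", r"(\1)", text, count=count)


if __name__ == "__main__":
    print(angle_to_paren("<wombat>"))
    print(angle_to_paren("a <b> c <d>", count=0))
```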
Many editing commands (such as Replace) can take selection expressions as well
as regular expressions. Selection expressions provide more ways to select text
than the string matching provided by regular expressions. Common selection
expressions include the following:
- The bullet symbol, meaning the start of a file.
- The infinity symbol, meaning the end of a file.
- The current selection, denoted by § (Option-6). This might
have been selected with the mouse or by a Find command. § by
itself indicates the selection in the target window (which I'll explain later),
while pathname:§ means the selection in the file indicated by the
pathname.
- A line number, specified simply as a number.
- The name of a marker, specified by the Mark command.
- A range between two selection expressions, separated by a colon
(:).
The above expressions require no special delimiters (they're not
directional like regular expressions). Regular expressions are actually a kind
of selection expression and are delimited by slash or backslash characters as
usual.
Some character-skipping variants of these options are also provided, such as
the position that's one character after the selection, denoted by following a
selection expression with an uppercase delta (Δ, Option-J). These are
useful in dealing with context; for instance, you may want to select a string
when it's followed by another character, but not include the following
character in the selection. (An example is given later in the Subword script.)
Text emitted by a program like a table generator may be in a known format, such
as a columnar arrangement, in which case skipping a certain number of
characters will take you to the selection you need.
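One way to picture selection expressions is as pairs of character offsets into the file. The Python sketch below models a few of them that way. It is a mental model under my own assumptions, not a description of how the MPW Shell is implemented: every function name is hypothetical, and I'm assuming a selected line includes its line ending.

```python
def start_of_file(text):
    """The insertion point at the very beginning of the text."""
    return (0, 0)


def end_of_file(text):
    """The insertion point at the very end of the text."""
    return (len(text), len(text))


def line_selection(text, number):
    """The selection covering a 1-based line number, line ending included."""
    lines = text.splitlines(keepends=True)
    start = sum(len(item) for item in lines[:number - 1])
    return (start, start + len(lines[number - 1]))


def span(first, second):
    """The range from the start of one selection to the end of another."""
    return (first[0], second[1])


def before(selection):
    """The insertion point just before a selection."""
    return (selection[0], selection[0])


def after(selection):
    """The insertion point just after a selection."""
    return (selection[1], selection[1])


def selected(text, selection):
    """The characters covered by a selection."""
    return text[selection[0]:selection[1]]


if __name__ == "__main__":
    sample = "one\ntwo\nthree\n"
    print(repr(selected(sample, line_selection(sample, 2))))
    print(span(start_of_file(sample), end_of_file(sample)))
```

Seen this way, a range is just the start of one selection paired with the end of another, and the character-skipping variants are small adjustments to one of the two offsets.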
Again, the MPW Shell will give you a terse summary of selection expressions
when you execute the command line "Help Selections". I'm not going to list all
the minor variants here, but feel free to while away the hours in rapturous
contemplation of their mysteries on your own.
The most common editing commands are two that you probably use already: Find
and Replace. Dialogs that stand in for these commands are built into the MPW
Shell and accessible from the Find menu. You can give any selection expression
as a search pattern in either of these dialogs by clicking the Selection
Expression radio button instead of the default Literal button.
The same commands are the basis of most editing scripts. As tools, Find and
Replace take a selection expression as their primary argument. Don't confuse
Find and Search! The Search command puts out its results as text, while Find
actually changes the selection. In addition, Search takes a pattern -- that is,
a regular expression -- while Find takes any selection expression. For example,
to go to the start of a file in a script, you could give the command "Find •",
but not "Search •".
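The difference between the two commands is easier to see side by side. In the Python sketch below, which is an analogy with hypothetical function names rather than MPW code, search reports matching lines as data and leaves everything alone, while find returns the offsets of a new selection.

```python
import re


def search(pattern, text):
    """Like Search: report (line number, line) pairs for matching lines."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if re.search(pattern, line)
    ]


def find(pattern, text, start=0):
    """Like Find: return the (start, end) offsets of the next match, or None."""
    found = re.compile(pattern).search(text, start)
    return found.span() if found else None


if __name__ == "__main__":
    sample = "alpha\nbeta wombat\ngamma\n"
    print(search("wombat", sample))
    print(find("wombat", sample))
```

A script that wants a report calls the first; a script that wants to move around in a file and then edit calls the second.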
Find is the basic navigation command in most editing scripts. For instance, you
can simulate the Select All command in the Edit menu like so:
Find •:∞ # select from start to end of target
The commands File and Open, along with the variables Target and Active, determine
the files your scripts will work on. "File" is actually an alias for the real
command name, Target. The File command opens a file and makes it the target
window -- the window behind the frontmost window. The target window is an
important notion in MPW. It exists so that you can use the Worksheet window to
type commands that affect another window; since the Worksheet would be in
front, the window being affected would need to be behind the Worksheet. During
scripting, you may prefer to use the Open command, which opens a file and makes
it the frontmost window. The target window is referred to as {Target} in
scripts, while the frontmost window is called {Active}. Editing commands work
on the target window if you don't specify a window explicitly.
The Line command may also be used for navigation: it selects the numbered line
in the target window and then brings that window to the front. You probably
know this command already if you use compilers in the MPW Shell, since they put
out error messages in this form:
File "gwork.c"; Line 418 # Syntax error
Executing this command takes you to the line in your code where the error was
detected.
The Position command returns the current position in the target window, as a
line number, a character range, or both. The position could be saved to a
variable for later use as follows, using the backquote mechanism to execute a
command and insert its output inline:
Set SavedLineNumber `Position -l`
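If you're wondering what a line-number position amounts to, it's simply a count of the line endings that precede the selection. Here is a minimal Python sketch of that arithmetic. The function name is hypothetical, and the code assumes newline line endings, whereas MPW files of this era use carriage returns.

```python
def line_number(text, offset, line_ending="\n"):
    """Return the 1-based line number containing the given character offset."""
    return text.count(line_ending, 0, offset) + 1


if __name__ == "__main__":
    sample = "a\nb\nc"
    saved_line_number = line_number(sample, 4)
    print(saved_line_number)
```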
There are dozens of commands pertaining to text editing in the MPW scripting
language. Help on all of them is available in the MPW Shell. The usual
Macintosh text-editing menu commands are available in the MPW scripting
language, including New, Open, Close, Save, Revert, Print, and the standard
Edit menu commands.
StreamEdit is a standalone editing tool that's rich and strange enough to
deserve its own column. It's a structured search and replacement language based
on the UNIX® command sed.
Some simpler standalone editing tools are provided. Sort has a rich function
set and can be used for many text-editing tasks. Canon takes a file of search
and replace strings and applies them to a file. It's used to automate
terminology changes, such as the work that was done to make the Mac OS API use
fewer acronyms and abbreviations when the new Inside Macintosh books were
written. Translate, like the UNIX command tr, maps characters onto other
characters.
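To make the division of labor concrete, here is a loose Python sketch of what the last two tools do. It is an analogy only: the function names are mine, the identifiers in the example are made up, and I'm assuming that Canon replaces whole identifiers rather than substrings.

```python
import re


def canon(text, dictionary):
    """Replace whole identifiers according to a dictionary of old -> new."""
    def swap(found):
        word = found.group(0)
        return dictionary.get(word, word)

    # Visit each identifier exactly once, so replacements never cascade.
    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", swap, text)


def translate(text, source, target):
    """Map each character in source to the character at the same index in target."""
    return text.translate(str.maketrans(source, target))


if __name__ == "__main__":
    print(canon("OldName(x); OldNameX", {"OldName": "NewName"}))
    print(translate("hello", "el", "ip"))
```

The dictionary approach treats each identifier as a unit, which is what makes it safer than a blind search and replace for terminology changes.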
Text indentation can be handled with four tools: Adjust, Align, Entab, and
Format. Adjust shifts a line to the right or left by a specified number of
spaces. Align sets the margin of a range of selected lines to the margin of the
first selected line. Entab converts runs of spaces to tabs, and Format sets the
column width used for tabs in a text document, as well as other settings like
font and size. (These settings are saved in a resource in the file, which many
ASCII text editors can recognize.)
Text-editing scripts often create temporary files, split single files into
multiple files, and perform other file-related tasks. MPW provides commands to
help you manage files. It has commands corresponding to almost all Finder
operations, such as Duplicate, Move, Delete, and NewFolder. There are also some
specialized file commands: FileDiv splits a file into multiple files based on a
byte or line count or on embedded form feed characters inserted during a
previous editing pass; Catenate does the opposite, joining files together.
A text-editing script often takes search and substitution text as parameters on
the command line. A few commands related to parameters are worth a quick
mention here. Echo is handy for concatenating parameters with other text. Quote
is similar to Echo but adds quote marks as needed to preserve the word breaks
in its parameters. MPW scripting requires quotes around any string that is
meant to be a single parameter but contains spaces (which would break the
string into multiple parameters). Echo puts out its arguments in a way that
allows them to be broken up, while Quote preserves the original word breaks by
inserting quotes.
Echo "Richard Loves Pat"
Richard Loves Pat
Quote "Bill Loves Everyone"
'Bill Loves Everyone'
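The same distinction shows up in other scripting environments. As an analogy, Python's shlex module can stand in for both behaviors; note that it follows POSIX shell quoting rules, which I'm assuming are close enough to MPW's for this illustration. The echo and quote functions here are hypothetical stand-ins, not the MPW commands.

```python
import shlex


def echo(*parameters):
    """Join parameters with spaces, losing the original word breaks."""
    return " ".join(parameters)


def quote(*parameters):
    """Join parameters with spaces, quoting any that would otherwise split."""
    return " ".join(shlex.quote(item) for item in parameters)


if __name__ == "__main__":
    print(echo("Richard Loves Pat"))
    print(quote("Bill Loves Everyone"))
    # Re-parsing shows which form survives as a single parameter.
    print(shlex.split(echo("Richard Loves Pat")))
    print(shlex.split(quote("Bill Loves Everyone")))
```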
Here's a script I've found useful for some years. It's called Subword and it
replaces a word by another string everywhere it occurs in the target window.
Set Sep "[¬a-zA-Z_0-9]" # word separators
Find • "{Target}" # start at top of file
Replace -c ∞ ∂
"Δ/{Sep}{1}{Sep}/!1:Δ/{Sep}/" ∂
"{2}" "{Target}"
The selection in this Replace command is probably about as clear as the U.S. tax
code, so allow me to explain. The Δ means one character before the
selection. The !1 means one character past the selection. The colon
denotes everything between the selections (inclusively). So this pattern says,
in a nutshell, select the pattern in the first parameter ({1}) when it's
bracketed by separators, but exclude the separators.
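For comparison, the same idea can be sketched in Python using lookaround assertions, which test for a separator on each side without including it in the match. This is an analogy and not a port of the script: the function name is hypothetical, and unlike the script, it treats the word as literal text rather than as a pattern.

```python
import re

# The same separator set as the script: anything that can't be part of a word.
SEP = r"[^a-zA-Z_0-9]"


def subword(text, word, replacement):
    """Replace word wherever a separator stands on both sides of it."""
    pattern = re.compile(f"(?<={SEP}){re.escape(word)}(?={SEP})")
    # A function replacement keeps backslashes in the new text literal.
    return pattern.sub(lambda found: replacement, text)


if __name__ == "__main__":
    print(subword("x = foo + foobar;\n", "foo", "bar"))
```

Because a separator is required on both sides, a word at the very start or very end of the text is left alone in this sketch.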
Normally I don't use this script directly. I incorporate it into other scripts
as a utility. The bulk of the work of converting between similar languages like
Pascal and C can be done by an editing script, for example. Subword can be used
to convert keywords, as could Canon. I use another script which is essentially
Subword without the separators for changing symbols like equality operators.
Scripts to preconvert between Pascal and C can be found on this issue's CD.
They don't generate compiler-ready text, but I've found that they facilitate a
manual conversion at the rate of hundreds of lines per hour, allowing source
bases in the thousands of lines to be accurately translated in a day or three.
So the next time you're faced with a dull text-processing task, look over the
tools MPW gives you, and see whether you can save yourself a few days of
tedious manual labor!
TIM MARONEY recently changed his Apple badge color from green to white: he's
gone from contract programming to a technical leadership role developing user
interface software. Tim entertains himself in a variety of ways, such as
straining his surgically altered eyeballs on the small print of obscure
footnotes and collectible trading card games, and contorting his limbs in yogic
asanas. He designed the iron crystal that now resides at the core of the earth
and contributed significant ideas to the original (now obsolete) implementation
of Planck-scale gravitational phenomena in the universe.
Thanks to Dave Evans, Scott Fraser, Arno Gourdol, and Alex McKale for reviewing
this column.