TweetFollow Us on Twitter

File-Based Dataflow

Volume Number: 19 (2003)
Issue Number: 3
Column Tag: Mac OS X

File-Based Dataflow

building robust systems without explicit file locking

by Rich Morin

Last month, I discussed "File Change Watcher", a Perl daemon that wakes up periodically, checks for file-system changes, copies files, and gathers metadata. To save space, I ignored a critical issue: how can another process know that a metadata file is "ready" to be read?

Race Conditions

This question comes up in the design of any system where one process is writing a file and another is accessing it. In this type of "race condition", the reading process can overtake the writing process, arriving prematurely at the end of the data (and ignoring any data the other process may write thereafter).

Alternatively, if two processes try to write the same file at the same time, assorted damage can ensue. Depending on the circumstances, changes might get lost, output could get intermingled, etc.

One way to deal with the problem is to use a "lock file". By common agreement, if the lock file is in place, the file it "locks" isn't available for access. An inquiring process can test for this by trying to create the lock file. If the process fails, it just waits for a while before trying again.

The reason that this test works is that the kernel won't let two processes open the same file with exclusive access (O_EXCL). So, the first attempt succeeds and all others fail (until the first process closes the file). The technique works quite well, as long as everyone plays nicely; the BSD side of OSX contains a number of lock files (e.g., /var/spool/lock).

In the general case, lock files (or some equivalent technique) are necessary. If, for instance, two users want to edit a file, you really don't want them doing it at the same time. So, some Unix editors (e.g., the BSD version of vi) implement file locking.

Unfortunately, lock files add complexity and room for error. All of the processes have to honor the lock files; what if an existing program doesn't want to play? Also, if the system crashes, the lock file has to be explicitly removed when things start up again. And, for all of this pain, they only solve one part (simultaneous access) of the file-based dataflow problem set.

Atomic Actions

Fortunately, there are a number of alternatives to lock files. Most of them are based on some kind of "atomic action"; that is, something that can only be done completely or not at all. BSD provides several atomic actions as system calls, including chmod(2), chown(2), link(2), open(2), and rename(2).

These same facilities are available from Perl, but its chmod and chown are only atomic for single file nodes. Corresponding shell commands are available, albeit with some caveats:

  • chgrp(1), chmod(1), chown(1) of a single file system node

  • link(1), but only for hard links

  • mv(1), to another name on the same file system

So, although you can't assume that a process will finish writing before some other process starts to read the output file, you can safely rename(2) an existing file (or mv(1) it to another name on the same file system) without worrying about race conditions.

Because any hard link to a file is simply a directory entry that points to a common inode(5), both the data and file system metadata are shared among a set of hard links. This lets a single atomic action (e.g., chmod) change the status of any number of links.

Finally, Cocoa's Application Kit Framework's NSFileWrapper class includes methods such as writeToFile:atomically:updateFilenames:. In short, a wealth of solutions is at hand.

Emitters and Consumers

The system I'm building is based on a data-flow model, similar to Unix pipelines, but it uses files instead of pipes. The method is far from new; mail transfer agents and print spoolers use variations of it in OSX. I am simply generalizing the idea into a framework for creating sets of file- and time-based tasks.

By specifying that only one program will ever write to a file, I can eliminate the issue of simultaneous writing. This means that I only need to prevent processes from opening or removing files prematurely and make sure that every file gets processed to completion.

The rules below allow the safe (i.e., no race conditions) use of files by any number of "emitters" and "consumers", without the need for lock files.

  • Files are created and written by emitters, read and deleted by consumers. No modification of files is allowed, including appending or read/write usage.

  • Files are created under temporary names (on the destination file system), then "published" (e.g., renamed) for use by consumers.

  • A file can only be used by one consumer, which removes it just before exiting.

    Note: This restriction applies only to consumers. Other programs (e.g., more) may read published files at any time. Also, the consumer is not required to read the file, just remove it.

  • If an emitter is creating files for multiple consumers, a separate link must be created for each consumer. All of the links must then be published in a single, atomic operation. One method uses a common directory:

    • Create a temporary directory.

    • For each file with N consumers, make N-1 links in the directory.

    • Rename the temporary file(s) into the directory, as the Nth file link(s).

    • Rename the temporary directory, publishing all of the links.

      Another method, which can only "protect" a single output file, has the advantage that the output links do not have to be in the same directory:

    • Turn off read access (using chmod) on the temporary file.

    • For a file with N consumers, make N-1 links.

    • Rename the temporary file as the Nth link.

    • Make any (and thereby, every) link readable.

Although these rules might complicate the life of programmers, the resulting programs aren't any more complicated. In fact, they tend to be quite simple: discover an input file, process it (writing any output to temporary files); when you're done, publish (e.g., rename) the output files and remove the input file.

Better yet, the file-discovery and -management details can be handed off to a scheduling daemon, allowing most of the "tasks" (working code) to be written as "filters": read from standard input; write to standard output.

A Data Collection System

Let's apply this to a data collection system and see how it plays out. A ps(1)-monitoring task is supposed to run once a minute, writing a report. Another task reads the report, writing a YAML (www.yaml.org) version. Other tasks produce hourly and daily summaries.

This is a fairly complex set of tasks, but it's only a small fraction of the workload for a full-scale operating system monitor. So, the amount of specification for each task should be as simple as possible. Here's a first cut at a configuration file:

# Every minute, collect raw ps(1) data.
{ ps_raw, type: cron, min: every,
  out: 'ps/raw/$time'
}
# Process the raw ps(1) data.
# Create two output links.
{ ps_rare, type: file, patt: 'ps/raw/*',
  out: ['ps/rare/1.$time',
        'ps/rare/2.$time' ]
}
# Every hour, create a summary.
{ ps_hour, type: cron, hour: every,
  out: 'ps/hour/$time'
}
# Every day, create a summary.
{ ps_day, type: cron, day: every,
  out: 'ps/day/$time'
}

This is fairly concise, but there's a lot of repetition. We could "boil it down" by taking advantage of the fact that it describes a tree of processes:

{ ps_raw, type: cron, min: every,
  out: 'ps/raw/$time',
  { ps_rare, type: file, patt: 'ps/raw/*',
    out: ['ps/rare/1.$time',
          'ps/rare/2.$time' ]
    { ps_hour, type: cron, hour: every,
      out: 'ps/hour/$time'
    },
    { ps_day, type: cron, day: every,
      out: 'ps/day/$time'
} } }

But that only kills off two lines (excluding comments), so it's not a huge win. Also, I'm not convinced that it's as easy to read, modify, etc. If we are willing to let the scheduling infrastructure generate file names for us, however, we can get away with "idioms" like:

{ ps_raw,      type: cron, min:  every,
  { ps_rare,   type: file,
    { ps_hour, type: cron, hour: every },
    { ps_day,  type: cron, day:  every }
} }

That brings the overhead back under control. And, if we need to access a file, we can follow standardized naming rules to find it. The output of ps_hour, for instance, would have a name of the form "ps_hour/1.<time>".

So much for theoretical hand-waving and speculative descriptions. Next month, I'll discuss some daemons that can actually make all of this work.


Rich Morin has been using computers since 1970, Unix since 1983, and Mac-based Unix since 1986 (when he helped Apple create A/UX 1.0). When he isn't writing this column, Rich runs Prime Time Freeware (www.ptf.com), a publisher of books and CD-ROMs for the Free and Open Source software community. Feel free to write to Rich at rdm@ptf.com.

 

Community Search:
MacTech Search:

Software Updates via MacUpdate

Tunnelblick 3.8.2a - GUI for OpenVPN.
Tunnelblick is a free, open source graphic user interface for OpenVPN on OS X. It provides easy control of OpenVPN client and/or server connections. It comes as a ready-to-use application with all... Read more
calibre 4.17.0 - Complete e-book library...
Calibre is a complete e-book library manager. Organize your collection, convert your books to multiple formats, and sync with all of your devices. Let Calibre be your multi-tasking digital librarian... Read more
MacPilot 11.1.4 - $15.96
MacPilot gives you the power of UNIX and the simplicity of Macintosh, which means a phenomenal amount of untapped power in your hands! Use MacPilot to unlock over 1,200 features, and access them all... Read more
Transmission 3.00 - Popular BitTorrent c...
Transmission is a fast, easy, and free multi-platform BitTorrent client. Transmission sets initial preferences so things "just work", while advanced features like watch directories, bad peer blocking... Read more
Doom 3 1.3.1 - First-person shooter acti...
A massive demonic invasion has overwhelmed the Union Aerospace Corporation's (UAC) Mars Research Facility, leaving only chaos and horror in its wake. As one of only a few survivors, you must fight... Read more
Box Sync 4.0.8004 - Online synchronizati...
Box Sync gives you a hard-drive in the Cloud for online storage. Note: You must first sign up to use Box. What if the files you need are on your laptop -- but you're on the road with your iPhone? No... Read more
LibreOffice 6.4.4.2 - Free, open-source...
LibreOffice is an office suite (word processor, spreadsheet, presentations, drawing tool) compatible with other major office suites. The Document Foundation is coordinating development and... Read more
Day One 4.14 - Maintain a daily journal.
Day One is an easy, great-looking way to use a journal / diary / text-logging application. Day One is well designed and extremely focused to encourage you to write more through quick Menu Bar entry,... Read more
MenuMeters 2.0.7 - CPU, memory, disk, an...
MenuMeters is a set of CPU, memory, disk, and network monitoring tools for Mac OS X. Although there are numerous other programs which do the same thing, none had quite the feature set I was looking... Read more
War Thunder 1.97.2.19 - Multiplayer war...
In War Thunder, aircraft, attack helicopters, ground forces and naval ships collaborate in realistic competitive battles. You can choose from over 1,500 vehicles and an extensive variety of combat... Read more

Latest Forum Discussions

See All

SINoALICE, Yoko Taro and Pokelabo's...
Yoko Taro and developer Pokelabo's SINoALICE has now opened for pre-registration over on the App Store. It's already amassed 1.5 million Android pre-registrations, and it's currently slated to launch on July 1st. [Read more] | Read more »
Masketeers: Idle Has Fallen's lates...
Masketeers: Idle Has Fallen is the latest endeavour from Appxplore, the folks behind Crab War, Thor: War of Tapnarok and Light A Way. It's an idle RPG that's currently available for Android in Early Access and will head to iOS at a later date. [... | Read more »
Evil Hunter Tycoon celebrates 2 million...
Evil Hunter Tycoon has proved to be quite the hit since launching back in March, with its most recent milestone being 2 million downloads. To celebrate the achievement, developer Super Planet has released a new updated called Darkness' Front Yard... | Read more »
Peak's Edge is an intriguing roguel...
Peak's Edge is an upcoming roguelike puzzle game from developer Kenny Sun that's heading for both iOS and Android on June 4th as a free-to-play title. It will see players rolling a pyramid shape through a variety of different levels. [Read more] | Read more »
Clash Royale: The Road to Legendary Aren...
Supercell recently celebrated its 10th anniversary and their best title, Clash Royale, is as good as it's ever been. Even for lapsed players, returning to the game is as easy as can be. If you want to join us in picking the game back up, we've put... | Read more »
The Magic Gladiator class arrives in MU...
The Magic Gladiator class is now available in MU Origin 2 following the most recent patch. It also marks the start of Abyss Season 11 and the introduction of Couple Skills and Couple Dungeons. [Read more] | Read more »
The 5 Best Racing Games
With KartRider Rush+ making a splash this past week, we figured it was high time we updated our list of the best mobile racing games out there. From realistic racing sims to futuristic arcade racers (and even racing management games!), check out... | Read more »
KartRider Rush+ Guide - Tips for new rac...
KartRider Rush+ continues to be a surprisingly refreshing and fun kart racer that's entirely free-to-play. The main reason for this is just how high its skill ceiling is. Check out the video above if you're curious to know what top level play looks... | Read more »
KartRider Rush+ might be good, actually?
It's hard to find good racing games on mobile. Most of them are free-to-play, and free-to-play racers generally suck. Even Nintendo couldn't put together a competent Mario Kart game, opting instead for a weird score chaser that resembles--but feels... | Read more »
LifeAfter, NetEase's popular surviv...
A new map will be making its way into NetEase's popular survival game LifeAfter. The map is set to arrive on May 28th and will introduce a volcano that's teetering on the verge of eruption, bringing a host of added challenges to the game. [Read... | Read more »

Price Scanner via MacPrices.net

Apple restocks 2019 MacBook Airs starting at...
Apple has clearance, Certified Refurbished, 2019 13″ MacBook Airs available again starting at $779. Each MacBook features a new outer case, comes with a standard Apple one-year warranty, and is... Read more
Apple restocks clearance Mac minis for only $...
Apple has restocked Certified Refurbished 2018 4-Core Mac minis for only $599. Each mini comes with a new outer case plus a standard Apple one-year warranty. Shipping is free: – 3.6GHz Quad-Core... Read more
Apple’s new 2020 13″ MacBook Airs on sale for...
B&H Photo has Apple’s new 2020 13″ 4-Core and 6-Core MacBook Airs on sale today for $50-$100 off Apple’s MSRP, starting at $949. Expedited shipping is free to many addresses in the US. The... Read more
B&H continues to offer clearance 2019 13″...
B&H Photo has clearance 2019 13″ 4-Core MacBook Pros available for up to $300 off Apple’s original MSRP, with prices starting at $1149. Expedited shipping is free to many addresses in the US. B... Read more
Memorial Day Weekend Sale: Take $300 off thes...
Apple resellers are offering $300 discounts on select 16″ MacBook Pros as part of their Memorial Day Weekend 2020 sales. Prices start at $2099: – 16″ 2.6GHz 6-Core Space Gray MacBook Pro: $2099 at... Read more
Best Memorial Day Weekend 2020 Apple AirPods...
Apple resellers are offering discounts ranging up to $50 off MSRP on AirPods as part of their Memorial Day Weekend 2020 sales. These are the best deals today on various AirPods models. See our... Read more
Memorial Day Weekend Sale: 10″ Apple iPads fo...
Amazon is offering new 10.2″ iPads for $80-$100 off Apple’s MSRP as part of their Memorial Day Weekend 2020 sale, with prices starting at only $249. These are the same iPads sold by Apple in their... Read more
Memorial Day Weekend Sale: 2020 Apple iPhone...
Sprint is offering Apple’s new 2020 64GB iPhone SE for $0 per month for 18 months as part of their Memorial Day Weekend 2020 sale. New line of service and trade-in required. Offer is valid from 5/22/... Read more
Amazon’s popular $100 Apple Watch Series 5 di...
Amazon has Apple Watch Series 5 GPS + Cellular models on sale for up to $100 off Apple’s MSRP today. Shipping is free. These are the same Apple Watch models sold by Apple in their retail and online... Read more
2020 13″ 4-Core MacBook Air on sale for $949,...
Apple reseller Adorama has the new 2020 13″ 1.1GHz 4-Core Space Gray MacBook Air on sale today for $949 shipped. Their price is $50 off Apple’s MSRP, and it’s the lowest price currently available for... Read more

Jobs Board

Essbase Developer - *Apple* - Theorem, LLC...
Job Summary Apple is seeking an experienced, detail-minded Essbase developer to join our worldwide business development and strategy team. If you are someone who Read more
Senior Software Engineer @ *Apple* - Theore...
Job Summary Apple is looking for a seasoned senior software engineer to join our worldwide business development and strategy team. This is an opportunity to lead a Read more
Cub Foods - *Apple* Valley - Now Hiring Par...
Cub Foods - Apple Valley - Now Hiring Part Time! United States of America, Minnesota, Apple Valley Retail Operations Post Date May 18, 2020 Requisition # 119230 Read more
Senior Practice Manager - *Apple* Hill Eye...
Senior Practice Manager - Apple Hill Eye Center Tracking Code 61713 Job Description Schedule & Location: Full Time Days Apple Hill Medical Center General Read more
Retail Sales Consultant - *Apple* Valley -...
Retail Sales Consultant - Apple Valley - $500 Hiring Bonus Apply Now...12-30-2019 Address : 7875 150th St W Location : Apple Valley, MN US Req # : 272753BR Job Read more
All contents are Copyright 1984-2011 by Xplain Corporation. All rights reserved. Theme designed by Icreon.