Dec 95 Challenge

Volume Number:		11
Issue Number:		12
Column Tag:		Programmer’s Challenge

Programmer’s Challenge

By Bob Boonstra, Westford, Massachusetts

Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.

Find Again And Again

This month the Challenge is to write a text search engine that is optimized to operate repeatedly on the same text. You will be given a block of text, some storage for data structures, and an opportunity to analyze the text before being asked to perform any searches against that text. Then you will repeatedly be asked to find a specific occurrence of a given word in that block of text. The prototypes for the code you should write are:

void InitFind(
  char *textToSearch,     /* find words in this block of text  */
  long textLength,        /* number of chars in textToSearch   */
  void *privateStorage,   /* storage for your use              */
  long storageSize        /* number of bytes in privateStorage */
);

long FindWordOccurrence(
                          /* return offset of wordToFind in textToSearch */
  char *wordToFind,       /* find this word in textToSearch    */
  long wordLength,        /* number of chars in wordToFind     */
  long occurrenceToFind,  /* find this instance of wordToFind  */
  char *textToSearch,     /* same parameter passed to InitFind */
  long textLength,        /* same parameter passed to InitFind */
  void *privateStorage,   /* same parameter passed to InitFind */
  long storageSize        /* same parameter passed to InitFind */
);

The InitFind routine will be called once for a given block of textLength characters at textToSearch to allow you to analyze the text, create data structures, and store them in privateStorage. When InitFind is called, storageSize bytes of memory at privateStorage will have been preallocated and initialized to zero.

FindWordOccurrence is to search for words, where a word is defined as a continuous sequence of alphanumeric characters delimited by a non-alphanumeric character (e.g., space, tab, punctuation, hyphen, CR, NL, or other special character). Your code should look for complete words - it would be incorrect, for example, to return a value pointing to the word “these” if the wordToFind was “the”. The wordToFind will be a legal word (i.e., no embedded delimiters). FindWordOccurrence should return the offset in textToSearch of the occurrenceToFind-th instance of wordToFind. It should return -1 if wordToFind does not occur in textToSearch, or if there are fewer than occurrenceToFind instances of wordToFind.
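To make the whole-word rule concrete, here is a minimal, unoptimized reference sketch that rescans the text on every call. It is not a competitive entry. The names NaiveFindWordOccurrence and IsWordChar are hypothetical, and two points the problem statement leaves open are treated as assumptions: matching is case-sensitive, and occurrenceToFind counts from 1.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: nonzero if c can be part of a word. */
static int IsWordChar(char c)
{
  return isalnum((unsigned char) c) != 0;
}

/* Unoptimized reference: walks the text word by word on every call.
   Assumptions: case-sensitive comparison, occurrenceToFind is 1-based. */
long NaiveFindWordOccurrence(const char *wordToFind, long wordLength,
    long occurrenceToFind, const char *textToSearch, long textLength)
{
  long pos = 0;
  long seen = 0;

  if (wordLength <= 0 || occurrenceToFind <= 0)
    return -1;
  while (pos < textLength)
   {
    long start;
    /* skip any run of delimiters */
    while (pos < textLength && !IsWordChar(textToSearch[pos]))
      pos++;
    /* consume one complete word */
    start = pos;
    while (pos < textLength && IsWordChar(textToSearch[pos]))
      pos++;
    /* only a word of exactly wordLength chars can match */
    if (pos - start == wordLength &&
        memcmp(textToSearch + start, wordToFind,
               (size_t) wordLength) == 0)
     {
      if (++seen == occurrenceToFind)
        return start;
     }
   }
  return -1;
}
```

With the text "the these the, the-end", a search for the second "the" returns offset 10, because "these" at offset 4 is not a complete-word match.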

Both the InitFind and the FindWordOccurrence routines will be timed in determining the winner. In designing your code, you should assume that FindWordOccurrence will be called approximately 1000 times for each call to InitFind (with the same textToSearch, but possibly differing values of wordToFind and occurrenceToFind).

There is no predefined limit on textLength - you should handle text of arbitrary length. The amount of privateStorage available could be very large, but is guaranteed to be at least 64K bytes. While the test cases will include at least one large textToSearch with a small storageSize, most test cases will provide at least 32 bytes for each occurrence of a word in textToSearch, so you might want to optimize for that condition.
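The problem leaves the indexing strategy open. One possible layout for privateStorage, sketched here under stated assumptions, is a hash table of word occurrences chained in text order. Everything in the sketch is hypothetical: the names SketchInitFind, SketchFindWordOccurrence, IndexHeader, and WordEntry, the fixed bucket count, and the simplified parameter lists. It also assumes privateStorage is suitably aligned for long, relies on the storage being zeroed as the problem guarantees, and treats matching as case-sensitive with a 1-based occurrenceToFind.

```c
#include <ctype.h>
#include <string.h>

#define kNumBuckets 1024L  /* assumption: fixed table size for the sketch */

typedef struct {
  long offset;  /* start of the word in the text */
  long length;  /* number of chars in the word */
  long next;    /* 1-based index of next entry in chain, 0 = end */
} WordEntry;

typedef struct {
  long numEntries;            /* starts at 0 because storage is zeroed */
  long buckets[kNumBuckets];  /* 1-based entry index, 0 = empty */
} IndexHeader;

static unsigned long HashWord(const char *word, long length)
{
  unsigned long h = 5381;
  long i;
  for (i = 0; i < length; i++)
    h = h * 33 + (unsigned char) word[i];
  return h % kNumBuckets;
}

/* Builds the index. Returns 1 on success, 0 if the storage is too small
   (a caller would then need to fall back to scanning the text). */
int SketchInitFind(const char *text, long textLength,
    void *privateStorage, long storageSize)
{
  IndexHeader *header = (IndexHeader *) privateStorage;
  WordEntry *entries = (WordEntry *) (header + 1);
  long maxEntries;
  long end = textLength;

  if (storageSize < (long) sizeof(IndexHeader))
    return 0;
  maxEntries = (storageSize - (long) sizeof(IndexHeader))
                 / (long) sizeof(WordEntry);
  /* Scan backward: pushing each word onto the head of its chain
     then leaves every chain sorted by ascending text offset. */
  while (end > 0)
   {
    long start;
    while (end > 0 && !isalnum((unsigned char) text[end - 1]))
      end--;
    start = end;
    while (start > 0 && isalnum((unsigned char) text[start - 1]))
      start--;
    if (start < end)
     {
      unsigned long b = HashWord(text + start, end - start);
      WordEntry *e;
      if (header->numEntries >= maxEntries)
        return 0;
      e = &entries[header->numEntries++];
      e->offset = start;
      e->length = end - start;
      e->next = header->buckets[b];
      header->buckets[b] = header->numEntries;  /* 1-based index */
     }
    end = start;
   }
  return 1;
}

long SketchFindWordOccurrence(const char *wordToFind, long wordLength,
    long occurrenceToFind, const char *text, const void *privateStorage)
{
  const IndexHeader *header = (const IndexHeader *) privateStorage;
  const WordEntry *entries = (const WordEntry *) (header + 1);
  long i;

  if (wordLength <= 0 || occurrenceToFind <= 0)
    return -1;
  for (i = header->buckets[HashWord(wordToFind, wordLength)];
       i != 0; i = entries[i - 1].next)
   {
    const WordEntry *e = &entries[i - 1];
    if (e->length == wordLength &&
        memcmp(text + e->offset, wordToFind, (size_t) wordLength) == 0 &&
        --occurrenceToFind == 0)
      return e->offset;
   }
  return -1;
}
```

Each lookup still walks one chain, so a very common word costs time proportional to the chain length. A faster design would likely group the occurrences of each distinct word so the requested instance can be reached directly, and would need a fallback for the small-storage test case.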

Other fine print: you may not change the input pointed to by textToSearch or wordToFind, and you should not use any static storage other than that provided in privateStorage.

This will be a native PowerPC Challenge, scored using the latest CodeWarrior compiler. Good luck, and happy searching.

Programmer’s Challenge Mailing List

We are pleased to announce the creation of the Programmer’s Challenge Mailing List. The list will be used to distribute the latest Challenge, provide answers to questions about the current Challenge, and discuss suggestions for future Challenges. The Challenge problem will be posted to the list each month, sometime between the 20th and the 25th of the month. This should alleviate problems caused by variations in the publication and mailing date of the magazine, and provide a predictable amount of time to work on each Challenge.

To subscribe to the list, send a message to autoshare@mactech.com with the SUBJECT line “sub challenge YourName”, substituting your real name for YourName. To unsubscribe from the list, send a message to autoshare@mactech.com with the SUBJECT line “unsub challenge”.

Note: the list server, autoshare, is set to accept commands in the SUBJECT line, not the body of the message. If you have any problems, please contact online@mactech.com.

Two Months Ago Winner

The Master Mindreader Challenge inspired ten readers to enter, and all ten solutions gave correct results. Congratulations to Xan Gregg (Durham, N.C.) for producing the fastest entry and winning the Challenge.

The problem required you to write code that would correctly guess a sequence of colors using a callback routine provided in the problem statement that returned two values for each guess: the number of elements of the guess where the correct color is located in the correct place in the sequence, and the number of elements where the correct color is in an incorrect place in the sequence. The number of guesses was not an explicit factor in determining the winner, but the time used by the callback routine was included in determining the winner. Participants correctly noted that this made the relative execution time of the guessing routine and the callback routine a factor in designing a fast solution. A couple of entries went so far as to offer their own, more efficient, callbacks. Nice try, but I didn’t use them - the callback in the problem was designed to provide a known time penalty for making a guess, and that was the callback I used in evaluating solutions.
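For readers who did not see the original problem, the two values returned for each guess can be computed along the following lines. This is a sketch only, not the callback from the problem statement: the name ScoreGuess, the explicit answer parameter, and the limit of 16 colors are assumptions, and out-of-range values are simply ignored here, which may differ from the defined behavior of the real callback.

```c
#define kMaxColors 16  /* assumption: colors are numbered 1..16 */

/* Hypothetical scoring routine (not the Challenge's own callback).
   numInCorrectPos: positions where guess and answer agree.
   numInWrongPos:   remaining guess elements whose color occurs
                    elsewhere in the answer, each counted once. */
void ScoreGuess(const unsigned char *answer, const unsigned char *guess,
    unsigned short length,
    unsigned short *numInCorrectPos, unsigned short *numInWrongPos)
{
  unsigned short answerCount[kMaxColors + 1] = {0};
  unsigned short guessCount[kMaxColors + 1] = {0};
  unsigned short correct = 0;
  unsigned short wrong = 0;
  unsigned short i, c;

  for (i = 0; i < length; i++)
   {
    if (guess[i] == answer[i])
      correct++;
    else
     {
      /* tally unmatched colors; out-of-range values are ignored */
      if (answer[i] >= 1 && answer[i] <= kMaxColors)
        answerCount[answer[i]]++;
      if (guess[i] >= 1 && guess[i] <= kMaxColors)
        guessCount[guess[i]]++;
     }
   }
  /* each color contributes the smaller of its two unmatched counts */
  for (c = 1; c <= kMaxColors; c++)
    wrong = (unsigned short) (wrong +
              (answerCount[c] < guessCount[c] ? answerCount[c]
                                              : guessCount[c]));
  *numInCorrectPos = correct;
  *numInWrongPos = wrong;
}
```

For the answer 7 3 6 4 6 and the guess 6 6 3 4 7, this yields one element in the correct position (the 4) and four in wrong positions.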

The callback I supplied had one unanticipated side effect - it permitted callers to supply an out-of-range value for positions in the sequence that they didn’t care about for that guess, and six of the entries took advantage of this loophole. This wasn’t what I had intended, and I gave some thought to giving priority to solutions that did not use the loophole. In the end, however, I decided not to treat these entries any differently, because the problem statement permitted and provided a defined callback behavior for out-of-range guesses. As it turned out, the winning entry and three of the fastest four entries did not use out-of-range guesses.

Xan’s winning code first makes a sequence of guesses to determine how many positions are set to each of the possible colors. He then starts with an initial guess corresponding to these colors and begins swapping positions to determine how the number of correctly placed colors is affected. Separate logic handles the cases where the number of correctly placed colors increased or decreased by 0, 1, or 2, all the while keeping track of which color possibilities have been eliminated for each position. These and other details of Xan’s algorithm are documented in the comments to his code.

The table of results below indicates, in addition to execution time, the cumulative number of guesses used by each entry for all test cases. In general, it shows the expected rough correlation between execution time and the number of guesses, with a significant exception for the second-place entry from Ernst Munter, which took significantly fewer guesses. Ernst precalculated tables to define the guessing strategy for problems of length 5 or less and devised a technique for partitioning larger problems to use these tables. Normally I try to discourage the use of extensive precalculated data, but I decided to allow this entry because the amount of data was not unreasonable, because the tables guided the algorithm but did not precalculate a solution, and because I thought the approach was innovative and interesting. Although including the second-place entry in the article is not possible because of length restrictions, I have included the preamble from Ernst’s solution describing his approach.

Here are the times and code sizes for each of the entries. Numbers in parentheses after a person’s name indicate that person’s cumulative point total for all previous Challenges, not including this one.

Name                 time  guesses  code  data  out-of-range values used?
Xan Gregg (61)        102     4123  1360    16  no
Ernst Munter (90)     109     2880  6264  5480  limited
Gustav Larsson (60)   116     3700   712    40  no
Greg Linden           127     5002   576    16  no
M. Panchenko (4)      146     5391   344    16  yes
Eric Lengyel (20)     176     6456   312    16  yes
Peter Hance           206     6557   336    16  yes
J. Vineyard (42)      228     9933   328    16  no
Ken Slezak (10)       251     6544   808    16  yes
Stefan Sinclair       259    11058   200    16  yes

Top 20 Contestants of All Time

Here are the Top 20 Contestants for the Programmer’s Challenges to date. The numbers below include points awarded for this month’s entrants. (Note: ties are listed alphabetically by last name - there are more than 20 people listed this month because of ties.)

Rank Name Points

1. [Name deleted] 176

2. Munter, Ernst 100

3. Gregg, Xan 81

4. Karsh, Bill 78

5. Larsson, Gustav 67

6. Stenger, Allen 65

7. Riha, Stepan 51

8. Goebel, James 49

9. Nepsund, Ronald 47

10. Cutts, Kevin 46

11. Mallett, Jeff 44

12. Kasparian, Raffi 42

13. Vineyard, Jeremy 42

14. Darrah, Dave 31

15. Landry, Larry 29

16. Elwertowski, Tom 24

17. Lee, Johnny 22

18. Noll, Robert 22

19. Anderson, Troy 20

20. Beith, Gary 20

21. Burgoyne, Nick 20

22. Galway, Will 20

23. Israelson, Steve 20

24. Landweber, Greg 20

25. Lengyel, Eric 20

26. Pinkerton, Tom 20

There are three ways to earn points: (1) scoring in the top 5 of any Challenge, (2) being the first person to find a bug in a published winning solution, or (3) being the first person to suggest a Challenge that I use. The points you can win are:

1st place 20 points

2nd place 10 points

3rd place 7 points

4th place 4 points

5th place 2 points

finding bug 2 points

suggesting Challenge 2 points

Here is Xan’s winning solution:

MindReader

By Xan Gregg, Durham, N.C.

/*  
  I try to minimize the number of guesses without adding too much complexity to the
  code.  First I figure out how many of each color are present in the answer by
  essentially repeatedly guessing all of each color.
  
  Then I figure out the correct positions one at a time starting at slot 0.  I exchange it
  with each other slot (one at a time) until the correct color is found.  When there is a
  change in the numCorrect response from checkGuess I can tell which of the two
  slots caused the change by looking at my remembered information or, if necessary,
  by performing a second guess with one of the colors in both slots.
  
  The “remembered information” includes keeping track of colors that were
  determined (via the numCorrect change) to be wrong before and/or after a swap is made.
  This doesn’t help out too often, but it doesn’t take much time to record compared to
  calling checkGuess.
  
  While the outer loop determines the color of each slot “left-to-right” (0 to n-1), I
  found that indexing the inner loop right-to-left instead of left-to-right increased the
  speed by 30% - 40%.  I wish I understood why!
  
  Oddly, the checkGuess function spends most of its time figuring out the numWrong
  value, which we generally ignore.
*/

typedef void (*CheckGuessProcPtr)(
        unsigned char  *theGuess,
        unsigned short *numInCorrectPos,
        unsigned short *numInWrongPos);

#define kMaxLength 16

#define Bit(color) (1L << (long) (color))


MindReader

void MindReader(unsigned char guess[],
    CheckGuessProcPtr checkGuess,
        unsigned short answerLength,
        unsigned short numColors)
{
  long    prevColorsFound;
  long    colorsFound;
  long    curColor;
  long    i, j;
  long    curCorrect;
  long    numOfColor[kMaxLength + 1];  /* 1-based */
  Boolean isCorrect[kMaxLength];
  long    possibilities[kMaxLength];   /* bit fields */
  long    colorBit1;
  long    colorBit2;
  char    color1;
  char    color2;
  long    delta;
  unsigned short  newCorrect;
  unsigned short  newWrong;
  
  /* first find the correct set of colors */
  colorsFound = 0;
  curColor = 1;
  while (colorsFound < answerLength)
   {
    for (i = colorsFound; i < answerLength; i++)
      guess[i] = curColor;
    (*checkGuess)(guess, &newCorrect, &newWrong);
    prevColorsFound = colorsFound;
    colorsFound = newCorrect + newWrong;
    numOfColor[curColor] = colorsFound - prevColorsFound;
    curColor++;
   }
  
  /* now work on the order */
  for (i = 0; i < answerLength; i++)
   {
    isCorrect[i] = false;
    possibilities[i] = -1;  /* all colors */
   }
  curCorrect = newCorrect;
  /* step through every slot, starting at 0 */
  for (i = 0; curCorrect < answerLength; i++)
   {
    if (isCorrect[i])
      continue;
    color1 = guess[i];
    colorBit1 = Bit(color1);
    /* try swapping slot i with every other open */
    /* slot, starting with the last one */
    j = answerLength;
    nextSubSlot:
    j--;
    if (guess[i] == guess[j])
      goto nextSubSlot;
    if (isCorrect[j])
      goto nextSubSlot;
    color2 = guess[j];
    colorBit2 = Bit(color2);
    if ((possibilities[i] & colorBit2) == 0)
      goto nextSubSlot;  /* no hope here */
    /* swap slots i & j and check result */
    guess[i] = color2;
    guess[j] = color1;
    (*checkGuess)(guess, &newCorrect, &newWrong);
    delta = newCorrect - curCorrect;
    if (delta >= 0)
      if (delta == 0)
       {  /* either both are incorrect OR */
                           /* one is correct and answer[i]==answer[j] */
        guess[i] = color1;
        guess[j] = color2;
        if (numOfColor[color1] == 1)
         {  /* color1 can’t be in both places */
          possibilities[i] &= ~colorBit1;
          possibilities[j] &= ~colorBit1;
         }
        if (numOfColor[color2] == 1)
         {  /* color2 can’t be in both places */
          possibilities[i] &= ~colorBit2;
          possibilities[j] &= ~colorBit2;
         }
       }
      else if (delta == 1)
       {  /* both were wrong, now one is correct */
                        /* find out which is correct */
        curCorrect = newCorrect;
        if ((possibilities[j] & colorBit1) == 0)
         {  /* i must be color2 */
          possibilities[j] &= ~colorBit2;
          numOfColor[color2] -= 1;
          goto nextSlot;
         }
        else if ((possibilities[i] & colorBit2) == 0)
         {  /* j must be color1 */
          isCorrect[j] = true;
          possibilities[i] &= ~colorBit1;
          numOfColor[color1] -= 1;
          color1 = color2;
          colorBit1 = colorBit2;
         }
        else
         {  /* we’ll have to make another guess to */
                        /* see which is correct */
          guess[i] = color1;
          (*checkGuess)(guess, &newCorrect, &newWrong);
          if (newCorrect == curCorrect)
           {  /* j must be color1 */
            possibilities[i] &=
                  (~(colorBit1 | colorBit2));
            isCorrect[j] = true;
            guess[i] = color2;
            numOfColor[color1] -= 1;
            color1 = color2;
            colorBit1 = colorBit2;
           }
          else
           {  /* i must be color2 */
            possibilities[j] &=
                  (~(colorBit1 | colorBit2));
            guess[i] = color2;
            numOfColor[color2] -= 1;
            goto nextSlot;
           }
         }
       }
      else  /* delta == 2 */
       {  /* both were wrong, now both correct */
        isCorrect[j] = true;
        numOfColor[color1] -= 1;
        numOfColor[color2] -= 1;
        curCorrect = newCorrect;
        goto nextSlot;
       }
    else  /* delta < 0 */
      if (delta == -1)
       {  /* one was correct before swap, now neither is */
        guess[i] = color1;
        guess[j] = color2;
        if ((possibilities[i] & colorBit1) == 0)
         {  /* color2 in slot j was correct */
          isCorrect[j] = true;
          numOfColor[color2] -= 1;
          possibilities[i] &= ~colorBit2;
         }
        else if ((possibilities[j] & colorBit2) == 0)
         {  /* color1 in slot i was correct */
          possibilities[j] &= ~colorBit1;
          numOfColor[color1] -= 1;
          goto nextSlot;
         }
        else
         {  /* we’ll have to make another guess to */
                        /* see which was correct */
          guess[j] = color1;
          (*checkGuess)(guess, &newCorrect, &newWrong);
          if (newCorrect == curCorrect)
           {  /* color1 in slot i was correct */
            possibilities[j] &=
                  (~(colorBit1 | colorBit2));
            guess[j] = color2;
            numOfColor[color1] -= 1;
            goto nextSlot;
           }
          else
           {  /* color2 in slot j was correct */
            possibilities[i] &=
                  (~(colorBit1 | colorBit2));
            guess[j] = color2;
            isCorrect[j] = true;
            numOfColor[color2] -= 1;
           }
         }
       }
      else  /* delta == -2 */
       {  /* both were already correct */
        guess[i] = color1;
        guess[j] = color2;
        isCorrect[j] = true;
        numOfColor[color1] -= 1;
        numOfColor[color2] -= 1;
        goto nextSlot;
       }
    goto nextSubSlot;
    nextSlot: ;
   }
  done: ;
}

Alternative Approach (Description Only)

/*
  Problem:
    Find the value of a multidigit code, by a question and answer method.  Each
    question is a guess of the code, the answer is the number of digits that are correct,
    reported as either in correct or wrong positions.

    The challenge is to minimize total time, that is in the first order, keep the number of
    guesses small, since the time to check the guess is included in total time.  But
    spending too much time minimizing the number of guesses is counterproductive.

  Assumptions:
    1. It is OK to guess a color that is not within the range 1 to numColors.  It will not
        be “correct” or “wrong”, but it will also not corrupt the CheckGuess function.

    2. The “opponent” will call with randomly generated correctAnswer codes, and not
        try to defeat the MindReader by learning the solution strategy.

    3. The objective is not to be a true Mindreader, as this could be done by reading into
         the (*checkGuess) code, the address of which is handy.  One would then
         disassemble PowerPC instructions to discover the hidden address of
         correctAnswer.

  Solution:
    It is relatively simple to manually construct solution trees for small N
    (N=answerLength), and make them into a lookup table.

    I have made a table for N=4, and hardcoded the trees for N=2 and N=3.

    The table for N=5 was too large to be done easily by hand, and I wrote a Tree
    Builder program to construct its 246 nodes.  I then hand tuned the 2 smallest parts
    of it.

    I felt that a 246-node tree is about at the limit of what might be tolerable in a static
    array.  The tree for N=6 would have 1400 or so nodes.  There are diminishing
    returns.  Adding the N=5 tree improved the higher splits (2 or 3 splits instead of 3 or
    4), but gained only a few percentage points on the callBack frequency overall.

    To keep the trees manageable, the permutation patterns and the color schemes are
    normalized.

    Now the details:

    Even if numColors > N, there can be at most N distinct colors in the answer, for
    example 5, if answerLength=5.

    And we can arrange a color mapping so that all colors are referred to by index 1, 2,
    3, etc, with the most frequently occurring color labeled #1.

    For N=5, this reduces the possible answers to 7 color schemes, 11111, 11112,
    11122, 11123, 11223, 11234, 12345.

    To solve for N<=5, the function “ProcessSlice()” only needs the color mapping, and
    a list of the colors, suitably sorted.

    For example, the real answer “73646” can be solved by walking the solution tree in
    4 steps, given the color list 6,3,4,7 and the pattern to be found is 42131.  The
    pattern at the root of the tree T11234, is 11234.

    To obtain the pattern information, I “scan” the answer with successive guesses
    (somewhat optimized for answer lengths of 2 to 4, to eliminate some obviously
    unneeded calls to checkGuess).  The basic idea is:

    correctAnswer
    7 3 6 4 6

    Six or seven calls to checkGuess, to build the color and color-frequency lists:

    guess       correct wrong   yetToFind colorList
    1 1 1 1 1   0       0       5         -
    2 2 2 2 2   0       0       5         -
    3 3 3 3 3   1       0       4         3
    4 4 4 4 4   1       0       3         3,4
    5 5 5 5 5   0       0       3         3,4
    6 6 6 6 6   2       0       1         6,3,4
    7 7 7 7 7   1       0       0         6,3,4,7

    The last call back is avoided if the color==numColors occurs in the code.

    Then, using the tree, the correct answer is found with four more calls to checkGuess:
                                 goal   42131
    6 6 3 4 7   1       x   tree code   11234
    6 3 6 7 4   2       x               12143
    4 3 6 6 7   2       x               32114
    4 6 6 7 3   1       x               31142

    7 3 6 4 6                           42131 (no other choice)

    This results in a total of 10 or 11 calls or less to the checkGuess function.

    On average, 10 calls are needed to solve 5-wide answers, when numColors is
    randomly set to a value from 1 to 16.

    For N>5, the size of tree grows very rapidly.  So I decided to split the answer into
    multiple slices, and treat each as separate problems of width 3, 4, or 5:

    6 = 3 + 3
    7 = 4 + 3
    8 = 4 + 4
    9 = 5 + 4
    10 = 5 + 5
    11 = 5 + 6 = 5 + (3 + 3)
    12 = 5 + 7 = 5 + (4 + 3)
    13 = 5 + 8 = 5 + (4 + 4)
    14 = 5 + 9 = 5 + (5 + 4)
    15 = 5 + 10 = 5 + (5 + 5)
    16 = 8 + 8 = (4 + 4) + (4 + 4)

    To create a split, we call checkGuess with guesses of a solid color for the left side,
    and 0s for the right. (e.g. first guess 1 1 1 1 0 0 0 0, to split 8).  As a result,
    correctPos gives the number of 1s in the left slice, and wrongPos, the number of 1s
    in the right slice.  If we get correctPos+wrongPos=4 as an answer, we must call
    again because there might be more than four 1s in the answer;  the guess 0 0 0 0 1 1
    1 1 will do it.

    Performance:

    Overall, I find an almost linear relationship between the total number of call backs
    (CB) and the value of answerLength (AL), approximately CB = AL * 1.26 + 2.84
    when numColors varies randomly from 1 to 16.
*/
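The color normalization that Ernst describes (relabeling colors 1, 2, 3, ... with the most frequent color as #1) can be illustrated with a short sketch. This is not Ernst's code: the name NormalizeColors is hypothetical, the sketch works from a known answer purely to show the mapping (his solution builds the color list from checkGuess responses instead), and breaking frequency ties in favor of the lower color number is an assumption chosen to match the color list 6,3,4,7 in his example.

```c
#define kMaxColors 16  /* assumption: colors are numbered 1..16 */

/* Hypothetical illustration of the normalization step.
   Fills colorList with the distinct colors of answer, most frequent
   first (ties go to the lower color number), and fills pattern with
   each position's 1-based index into colorList.
   Assumes every answer[i] is in the range 1..kMaxColors.
   Returns the number of distinct colors. */
int NormalizeColors(const unsigned char *answer, int length,
    unsigned char *colorList, unsigned char *pattern)
{
  int count[kMaxColors + 1] = {0};
  unsigned char indexOf[kMaxColors + 1] = {0};
  int numDistinct = 0;
  int i, c;

  for (i = 0; i < length; i++)
    count[answer[i]]++;

  /* repeatedly pick the most frequent color not yet listed */
  for (;;)
   {
    int best = 0;
    for (c = 1; c <= kMaxColors; c++)
      if (count[c] > 0 && indexOf[c] == 0 &&
          (best == 0 || count[c] > count[best]))
        best = c;
    if (best == 0)
      break;
    colorList[numDistinct++] = (unsigned char) best;
    indexOf[best] = (unsigned char) numDistinct;
   }

  for (i = 0; i < length; i++)
    pattern[i] = indexOf[answer[i]];
  return numDistinct;
}
```

Applied to the answer 7 3 6 4 6 from the preamble, this produces the color list 6,3,4,7 and the pattern 4 2 1 3 1, the goal pattern the solution tree is asked to find.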
