Dec 95 Challenge
Volume Number: 11
Issue Number: 12
Column Tag: Programmer's Challenge
Programmer's Challenge
By Bob Boonstra, Westford, Massachusetts
Note: Source code files accompanying article are located on MacTech CD-ROM or source code disks.
Find Again And Again
This month the Challenge is to write a text search engine that is optimized to operate repeatedly on the same text. You will be given a block of text, some storage for data structures, and an opportunity to analyze the text before being asked to perform any searches against that text. Then you will repeatedly be asked to find a specific occurrence of a given word in that block of text. The prototypes for the code you should write are:
void InitFind(
  char *textToSearch,     /* find words in this block of text */
  long textLength,        /* number of chars in textToSearch */
  void *privateStorage,   /* storage for your use */
  long storageSize        /* number of bytes in privateStorage */
);
long FindWordOccurrence(
                          /* return offset of wordToFind in textToSearch */
  char *wordToFind,       /* find this word in textToSearch */
  long wordLength,        /* number of chars in wordToFind */
  long occurrenceToFind,  /* find this instance of wordToFind */
  char *textToSearch,     /* same parameter passed to InitFind */
  long textLength,        /* same parameter passed to InitFind */
  void *privateStorage,   /* same parameter passed to InitFind */
  long storageSize        /* same parameter passed to InitFind */
);
The InitFind routine will be called once for a given block of textLength characters at textToSearch to allow you to analyze the text, create data structures, and store them in privateStorage. When InitFind is called, storageSize bytes of memory at privateStorage will have been preallocated and initialized to zero.
FindWordOccurrence is to search for words, where a word is defined as a continuous sequence of alphanumeric characters delimited by a non-alphanumeric character (e.g., space, tab, punctuation, hyphen, CR, NL, or other special character). Your code should look for complete words: it would be incorrect, for example, to return a value pointing to the word "these" if the wordToFind was "the". The wordToFind will be a legal word (i.e., no embedded delimiters). FindWordOccurrence should return the offset in textToSearch of the occurrenceToFind-th instance of wordToFind. It should return -1 if wordToFind does not occur in textToSearch, or if there are fewer than occurrenceToFind instances of wordToFind.
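To make the whole-word rule concrete, here is a minimal baseline sketch of FindWordOccurrence. It is not a competitive entry: it ignores privateStorage and rescans the text on every call. It also rests on assumptions the problem statement does not spell out, namely that matching is case-sensitive, that occurrenceToFind counts from 1, and that "alphanumeric" means whatever the C library's isalnum accepts. The helper IsWordChar is a name introduced only for this illustration.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: nonzero if c may appear inside a word. */
static int IsWordChar(char c)
{
    return isalnum((unsigned char)c) != 0;
}

/* Baseline sketch: linear scan, no use of privateStorage.       */
/* Assumes case-sensitive matching and a 1-based occurrence count. */
long FindWordOccurrence(
    char *wordToFind, long wordLength, long occurrenceToFind,
    char *textToSearch, long textLength,
    void *privateStorage, long storageSize)
{
    long pos = 0;
    long seen = 0;

    (void)privateStorage;   /* unused in this baseline */
    (void)storageSize;

    if (wordLength <= 0 || occurrenceToFind <= 0)
        return -1;

    while (pos < textLength) {
        long start;

        /* skip delimiters preceding the next word */
        while (pos < textLength && !IsWordChar(textToSearch[pos]))
            pos++;
        start = pos;

        /* consume one complete word */
        while (pos < textLength && IsWordChar(textToSearch[pos]))
            pos++;

        /* only a word of exactly the right length can match, */
        /* so "these" is never reported as a hit for "the"     */
        if (pos - start == wordLength &&
            memcmp(textToSearch + start, wordToFind,
                   (size_t)wordLength) == 0) {
            if (++seen == occurrenceToFind)
                return start;
        }
    }
    return -1;
}
```

A faster entry would do this scan once in InitFind and record the word offsets in privateStorage, so that each later call becomes a lookup rather than a rescan.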
Both the InitFind and the FindWordOccurrence routines will be timed in determining the winner. In designing your code, you should assume that FindWordOccurrence will be called approximately 1000 times for each call to InitFind (with the same textToSearch, but possibly differing values of wordToFind and occurrenceToFind).
There is no predefined limit on textLength - you should handle text of arbitrary length. The amount of privateStorage available could be very large, but is guaranteed to be at least 64K bytes. While the test cases will include at least one large textToSearch with a small storageSize, most test cases will provide at least 32 bytes for each occurrence of a word in textToSearch, so you might want to optimize for that condition.
Other fine print: you may not change the input pointed to by textToSearch or wordToFind, and you should not use any static storage other than that provided in privateStorage.
This will be a native PowerPC Challenge, scored using the latest CodeWarrior compiler. Good luck, and happy searching.
Programmer's Challenge Mailing List
We are pleased to announce the creation of the Programmer's Challenge Mailing List. The list will be used to distribute the latest Challenge, provide answers to questions about the current Challenge, and discuss suggestions for future Challenges. The Challenge problem will be posted to the list each month, sometime between the 20th and the 25th of the month. This should alleviate problems caused by variations in the publication and mailing date of the magazine, and provide a predictable amount of time to work on each Challenge.
To subscribe to the list, send a message to autoshare@mactech.com with the SUBJECT line "sub challenge YourName", substituting your real name for YourName. To unsubscribe from the list, send a message to autoshare@mactech.com with the SUBJECT line "unsub challenge".
Note: the list server, autoshare, is set to accept commands in the SUBJECT line, not the body of the message. If you have any problems, please contact online@mactech.com.
Two Months Ago Winner
The Master Mindreader Challenge inspired ten readers to enter, and all ten solutions gave correct results. Congratulations to Xan Gregg (Durham, N.C.) for producing the fastest entry and winning the Challenge.
The problem required you to write code that would correctly guess a sequence of colors using a callback routine provided in the problem statement. The callback returned two values for each guess: the number of elements of the guess where the correct color is located in the correct place in the sequence, and the number of elements where the correct color is in an incorrect place in the sequence. The number of guesses was not an explicit factor in determining the winner, but the time used by the callback routine was included in determining the winner. Participants correctly noted that this made the relative execution time of the guessing routine and the callback routine a factor in designing a fast solution. A couple of entries went so far as to offer their own, more efficient, callbacks. Nice try, but I didn't use them: the callback in the problem was designed to provide a known time penalty for making a guess, and that was the callback I used in evaluating solutions.
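For readers who did not see the original problem, the following is a rough sketch of how a callback with those two return values could be written. It is not the callback used in scoring. The hidden-answer storage (gAnswer, gAnswerLength) and the SetAnswer helper are names invented for this illustration, and the sketch assumes that colors fit in an unsigned char and that sequences are at most 16 elements long.

```c
#include <string.h>

#define kMaxLength 16

/* Hypothetical hidden state; the real callback kept its own. */
static unsigned char  gAnswer[kMaxLength];
static unsigned short gAnswerLength;

/* Hypothetical helper used to install an answer for testing. */
void SetAnswer(const unsigned char *answer, unsigned short length)
{
    if (length > kMaxLength)
        length = kMaxLength;
    memcpy(gAnswer, answer, length);
    gAnswerLength = length;
}

/* Sketch of a scoring callback matching CheckGuessProcPtr. */
void CheckGuess(unsigned char *theGuess,
                unsigned short *numInCorrectPos,
                unsigned short *numInWrongPos)
{
    unsigned short answerCount[256] = {0};
    unsigned short guessCount[256] = {0};
    unsigned short correct = 0;
    unsigned short wrong = 0;
    int i, c;

    for (i = 0; i < gAnswerLength; i++) {
        if (theGuess[i] == gAnswer[i]) {
            correct++;              /* right color, right place */
        } else {
            /* tally unmatched colors on each side */
            answerCount[gAnswer[i]]++;
            guessCount[theGuess[i]]++;
        }
    }

    /* a color is "in the wrong place" as many times as it is */
    /* unmatched in both the answer and the guess              */
    for (c = 0; c < 256; c++)
        wrong += (answerCount[c] < guessCount[c])
                     ? answerCount[c] : guessCount[c];

    *numInCorrectPos = correct;
    *numInWrongPos = wrong;
}
```

In this sketch a guess value that never appears in the answer, such as 0, simply contributes to neither count, which is the kind of behavior the out-of-range guesses discussed in this column rely on.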
The callback I supplied had one unanticipated side effect - it permitted callers to supply an out-of-range value for positions in the sequence that they didn't care about for that guess, and six of the entries took advantage of this loophole. This wasn't what I had intended, and I gave some thought to giving priority to solutions that did not use the loophole. In the end, however, I decided not to treat these entries any differently, because the solution statement permitted and provided a defined callback behavior for out-of-range guesses. As it turned out, the winning entry and three of the fastest four entries did not use out-of-range guesses.
Xan's winning code first makes a sequence of guesses to determine how many positions are set to each of the possible colors. He then starts with an initial guess corresponding to these colors and begins swapping positions to determine how the number of correctly placed colors is affected. Separate logic handles the cases where the number of correctly placed colors increased or decreased by 0, 1, or 2, all the while keeping track of which color possibilities have been eliminated for each position. These and other details of Xan's algorithm are documented in the comments to his code.
The table of results below indicates, in addition to execution time, the cumulative number of guesses used by each entry for all test cases. In general, it shows the expected rough correlation between execution time and the number of guesses, with a significant exception for the second-place entry from Ernst Munter, which took significantly fewer guesses. Ernst precalculated tables to define the guessing strategy for problems of length 5 or less and devised a technique for partitioning larger problems to use these tables. Normally I try to discourage the use of extensive precalculated data, but I decided to allow this entry because the amount of data was not unreasonable, because the tables guided the algorithm but did not precalculate a solution, and because I thought the approach was innovative and interesting. Although including the second-place entry in the article is not possible because of length restrictions, I have included the preamble from Ernst's solution describing his approach.
Here are the times and code sizes for each of the entries. Numbers in parentheses after a person's name indicate that person's cumulative point total for all previous Challenges, not including this one.
Name                 time  guesses  code  data  out-of-range
                                                values used?
Xan Gregg (61)        102     4123  1360    16  no
Ernst Munter (90)     109     2880  6264  5480  limited
Gustav Larsson (60)   116     3700   712    40  no
Greg Linden           127     5002   576    16  no
M. Panchenko (4)      146     5391   344    16  yes
Eric Lengyel (20)     176     6456   312    16  yes
Peter Hance           206     6557   336    16  yes
J. Vineyard (42)      228     9933   328    16  no
Ken Slezak (10)       251     6544   808    16  yes
Stefan Sinclair       259    11058   200    16  yes
Top 20 Contestants of All Time
Here are the Top 20 Contestants for the Programmer's Challenges to date. The numbers below include points awarded for this month's entrants. (Note: ties are listed alphabetically by last name - there are more than 20 people listed this month because of ties.)
Rank Name Points
1. [Name deleted] 176
2. Munter, Ernst 100
3. Gregg, Xan 81
4. Karsh, Bill 78
5. Larsson, Gustav 67
6. Stenger, Allen 65
7. Riha, Stepan 51
8. Goebel, James 49
9. Nepsund, Ronald 47
10. Cutts, Kevin 46
11. Mallett, Jeff 44
12. Kasparian, Raffi 42
13. Vineyard, Jeremy 42
14. Darrah, Dave 31
15. Landry, Larry 29
16. Elwertowski, Tom 24
17. Lee, Johnny 22
18. Noll, Robert 22
19. Anderson, Troy 20
20. Beith, Gary 20
21. Burgoyne, Nick 20
22. Galway, Will 20
23. Israelson, Steve 20
24. Landweber, Greg 20
25. Lengyel, Eric 20
26. Pinkerton, Tom 20
There are three ways to earn points: (1) scoring in the top 5 of any Challenge, (2) being the first person to find a bug in a published winning solution, or (3) being the first person to suggest a Challenge that I use. The points you can win are:
1st place 20 points
2nd place 10 points
3rd place 7 points
4th place 4 points
5th place 2 points
finding bug 2 points
suggesting Challenge 2 points
Here is Xan's winning solution:
MindReader
By Xan Gregg, Durham, N.C.
/*
I try to minimize the number of guesses without adding too much complexity to the
code. First I figure out how many of each color are present in the answer by
essentially repeatedly guessing all of each color.
Then I figure out the correct positions one at a time starting at slot 0. I exchange it
with each other slot (one at a time) until the correct color is found. When there is a
change in the numCorrect response from checkGuess I can tell which of the two
slots caused the change by looking at my remembered information or, if necessary,
by performing a second guess with one of the colors in both slots.
The remembered information includes keeping track of colors that were
determined (via the numCorrect change) to be wrong before and/or after a swap is
made. This doesn't help out too often, but it doesn't take much time to record
compared to calling checkGuess.
While the outer loop determines the color of each slot left-to-right (0 to n-1), I
found that indexing the inner loop right-to-left instead of left-to-right increased the
speed by 30% - 40%. I wish I understood why!
Oddly, the checkGuess function spends most of its time figuring out the numWrong
value, which we generally ignore.
*/
typedef void (*CheckGuessProcPtr)(
  unsigned char *theGuess,
  unsigned short *numInCorrectPos,
  unsigned short *numInWrongPos);
#define kMaxLength 16
#define Bit(color) (1L << (long) (color))
MindReader
void MindReader(unsigned char guess[],
      CheckGuessProcPtr checkGuess,
      unsigned short answerLength,
      unsigned short numColors)
{
  long prevColorsFound;
  long colorsFound;
  long curColor;
  long i, j;
  long curCorrect;
  long numOfColor[kMaxLength + 1];  /* 1-based */
  Boolean isCorrect[kMaxLength];
  long possibilities[kMaxLength];  /* bit fields */
  long colorBit1;
  long colorBit2;
  char color1;
  char color2;
  long delta;
  unsigned short newCorrect;
  unsigned short newWrong;

  /* first find the correct set of colors */
  colorsFound = 0;
  curColor = 1;
  while (colorsFound < answerLength)
  {
    for (i = colorsFound; i < answerLength; i++)
      guess[i] = curColor;
    (*checkGuess)(guess, &newCorrect, &newWrong);
    prevColorsFound = colorsFound;
    colorsFound = newCorrect + newWrong;
    numOfColor[curColor] = colorsFound - prevColorsFound;
    curColor++;
  }

  /* now work on the order */
  for (i = 0; i < answerLength; i++)
  {
    isCorrect[i] = false;
    possibilities[i] = -1;  /* all colors */
  }
  curCorrect = newCorrect;

  /* step through every slot, starting at 0 */
  for (i = 0; curCorrect < answerLength; i++)
  {
    if (isCorrect[i])
      continue;
    color1 = guess[i];
    colorBit1 = Bit(color1);
    /* try swapping slot i with every other open */
    /* slot, starting with the last one */
    j = answerLength;
  nextSubSlot:
    j--;
    if (guess[i] == guess[j])
      goto nextSubSlot;
    if (isCorrect[j])
      goto nextSubSlot;
    color2 = guess[j];
    colorBit2 = Bit(color2);
    if ((possibilities[i] & colorBit2) == 0)
      goto nextSubSlot;  /* no hope here */
    /* swap slots i & j and check result */
    guess[i] = color2;
    guess[j] = color1;
    (*checkGuess)(guess, &newCorrect, &newWrong);
    delta = newCorrect - curCorrect;
    if (delta >= 0)
      if (delta == 0)
      {  /* either both are incorrect OR */
         /* one is correct and answer[i]==answer[j] */
        guess[i] = color1;
        guess[j] = color2;
        if (numOfColor[color1] == 1)
        {  /* color1 can't be in both places */
          possibilities[i] &= ~colorBit1;
          possibilities[j] &= ~colorBit1;
        }
        if (numOfColor[color2] == 1)
        {  /* color2 can't be in both places */
          possibilities[i] &= ~colorBit2;
          possibilities[j] &= ~colorBit2;
        }
      }
      else if (delta == 1)
      {  /* both were wrong, now one is correct */
         /* find out which is correct */
        curCorrect = newCorrect;
        if ((possibilities[j] & colorBit1) == 0)
        {  /* i must be color2 */
          possibilities[j] &= ~colorBit2;
          numOfColor[color2] -= 1;
          goto nextSlot;
        }
        else if ((possibilities[i] & colorBit2) == 0)
        {  /* j must be color1 */
          isCorrect[j] = true;
          possibilities[i] &= ~colorBit1;
          numOfColor[color1] -= 1;
          color1 = color2;
          colorBit1 = colorBit2;
        }
        else
        {  /* we'll have to make another guess to */
           /* see which is correct */
          guess[i] = color1;
          (*checkGuess)(guess, &newCorrect, &newWrong);
          if (newCorrect == curCorrect)
          {  /* j must be color1 */
            possibilities[i] &=
                  (~(colorBit1 | colorBit2));
            isCorrect[j] = true;
            guess[i] = color2;
            numOfColor[color1] -= 1;
            color1 = color2;
            colorBit1 = colorBit2;
          }
          else
          {  /* i must be color2 */
            possibilities[j] &=
                  (~(colorBit1 | colorBit2));
            guess[i] = color2;
            numOfColor[color2] -= 1;
            goto nextSlot;
          }
        }
      }
      else  /* delta == 2 */
      {  /* both were wrong, now both correct */
        isCorrect[j] = true;
        numOfColor[color1] -= 1;
        numOfColor[color2] -= 1;
        curCorrect = newCorrect;
        goto nextSlot;
      }
    else  /* delta < 0 */
      if (delta == -1)
      {  /* one was correct before swap, now neither is */
        guess[i] = color1;
        guess[j] = color2;
        if ((possibilities[i] & colorBit1) == 0)
        {  /* color2 in slot j was correct */
          isCorrect[j] = true;
          numOfColor[color2] -= 1;
          possibilities[i] &= ~colorBit2;
        }
        else if ((possibilities[j] & colorBit2) == 0)
        {  /* color1 in slot i was correct */
          possibilities[j] &= ~colorBit1;
          numOfColor[color1] -= 1;
          goto nextSlot;
        }
        else
        {  /* we'll have to make another guess to */
           /* see which was correct */
          guess[j] = color1;
          (*checkGuess)(guess, &newCorrect, &newWrong);
          if (newCorrect == curCorrect)
          {  /* color1 in slot i was correct */
            possibilities[j] &=
                  (~(colorBit1 | colorBit2));
            guess[j] = color2;
            numOfColor[color1] -= 1;
            goto nextSlot;
          }
          else
          {  /* color2 in slot j was correct */
            possibilities[i] &=
                  (~(colorBit1 | colorBit2));
            guess[j] = color2;
            isCorrect[j] = true;
            numOfColor[color2] -= 1;
          }
        }
      }
      else  /* delta == -2 */
      {  /* both were already correct */
        guess[i] = color1;
        guess[j] = color2;
        isCorrect[j] = true;
        numOfColor[color1] -= 1;
        numOfColor[color2] -= 1;
        goto nextSlot;
      }
    goto nextSubSlot;
  nextSlot: ;
  }
done: ;
}
Alternative Approach (Description Only)
Copyright 1995, Ernst Munter, Kanata, ON, Canada.
/*
Problem:
Find the value of a multidigit code, by a question and answer method. Each
question is a guess of the code, the answer is the number of digits that are correct,
reported as either in correct or wrong positions.
The challenge is to minimize total time, that is, to a first order, to keep the number
of guesses small, since the time to check the guess is included in total time. But
spending too much time minimizing the number of guesses is counterproductive.
Assumptions:
1. It is OK to guess a color that is not within the range 1 to numColors. It will not
be correct or wrong, but it will also not corrupt the CheckGuess function.
2. The opponent will call with randomly generated correctAnswer codes, and not
try to defeat the MindReader by learning the solution strategy.
3. The objective is not to be a true Mindreader, as this could be done by reading into
the (*checkGuess) code, the address of which is handy. One would then
disassemble PowerPC instructions to discover the hidden address of
correctAnswer.
Solution:
It is relatively simple to manually construct solution trees for small N
(N=answerLength), and make them into a lookup table.
I have made a table for N=4, and hardcoded the trees for N=2 and N=3.
The table for N=5 was too large to be done easily by hand, and I wrote a Tree
Builder program to construct its 246 nodes. I then hand tuned the 2 smallest parts
of it.
I felt that a 246-node tree is about at the limit of what might be tolerable in a
static array. The tree for N=6 would have 1400 or so nodes. There are diminishing
returns: adding the N=5 tree improved the higher splits (2 or 3 splits instead of 3 or
4), but gained only a few percentage points on the callback frequency overall.
To keep the trees manageable, the permutation patterns and the color schemes are
normalized.
Now the details:
Even if numColors > N, there can be at most N distinct colors in the answer, for
example 5, if answerLength=5.
And we can arrange a color mapping so that all colors are referred to by index 1, 2,
3, etc., with the most frequently occurring color labeled #1.
For N=5, this reduces the possible answers to 7 color schemes, 11111, 11112,
11122, 11123, 11223, 11234, 12345.
To solve for N<=5, the function ProcessSlice() only needs the color mapping, and
a list of the colors, suitably sorted.
For example, the real answer 73646 can be solved by walking the solution tree in
4 steps, given the color list 6,3,4,7 and the pattern to be found is 42131. The
pattern at the root of the tree T11234, is 11234.
To obtain the pattern information, I scan the answer with successive guesses
(somewhat optimized for answer lengths of 2 to 4, to eliminate some obviously
unneeded calls to checkGuess). The basic idea is:
correctAnswer
7 3 6 4 6
Six or seven calls to checkGuess, to build the color and color-frequency lists:
guess      correct  wrong  yetToFind  colorList
1 1 1 1 1     0       0        5      -
2 2 2 2 2     0       0        5      -
3 3 3 3 3     1       0        4      3
4 4 4 4 4     1       0        3      3,4
5 5 5 5 5     0       0        3      3,4
6 6 6 6 6     2       0        1      6,3,4
7 7 7 7 7     1       0        0      6,3,4,7
The last call back is avoided if the color==numColors occurs in the code.
Then, using the tree, the correct answer is found with four more calls to checkGuess:
guess      correct  wrong  tree code
6 6 3 4 7     1       x    11234
6 3 6 7 4     2       x    12143
4 3 6 6 7     2       x    32114
4 6 6 7 3     1       x    31142
7 3 6 4 6                  42131  (the goal; no other choice)
This results in a total of 10 or 11 calls, or fewer, to the checkGuess function.
On average, 10 calls are needed to solve 5-wide answers, when numColors is
randomly set to a value from 1 to 16.
For N>5, the size of tree grows very rapidly. So I decided to split the answer into
multiple slices, and treat each as separate problems of width 3, 4, or 5:
6 = 3 + 3
7 = 4 + 3
8 = 4 + 4
9 = 5 + 4
10 = 5 + 5
11 = 5 + 6 = 5 + (3 + 3)
12 = 5 + 7 = 5 + (4 + 3)
13 = 5 + 8 = 5 + (4 + 4)
14 = 5 + 9 = 5 + (5 + 4)
15 = 5 + 10 = 5 + (5 + 5)
16 = 8 + 8 = (4 + 4) + (4 + 4)
To create a split, we call checkGuess with guesses of a solid color for the left side,
and 0s for the right. (e.g. first guess 1 1 1 1 0 0 0 0, to split 8). As a result,
correctPos gives the number of 1s in the left slice, and wrongPos, the number of 1s
in the right slice. If we get correctPos+wrongPos=4 as an answer, we must call
again because there might be more than four 1s in the answer; the guess 0 0 0 0 1 1
1 1 will do it.
Performance:
Overall, I find an almost linear relationship between the total number of call backs
(CB) and the value of answerLength (AL), approximately CB = AL * 1.26 + 2.84
when numColors varies randomly from 1 to 16.
*/
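The color normalization in Ernst's description can be illustrated with a short sketch. This is not his code: NormalizeColors is a hypothetical function written for this column, and it takes a known code as input purely to show the mapping, whereas his solver builds the color list from successive solid-color guesses. The sketch assumes colors lie in 1 to 16 and that ties in frequency go to the lower-numbered color, which reproduces the 6,3,4,7 list and the 42131 pattern of his 73646 example.

```c
#define kMaxLength 16
#define kMaxColor  16

/* Hypothetical illustration of the color normalization.       */
/* Fills colorList with the distinct colors of code, most      */
/* frequent first, and pattern with each slot's 1-based index   */
/* into that list. Returns the number of distinct colors, or    */
/* -1 if a color is outside 1..kMaxColor.                       */
int NormalizeColors(const unsigned char *code, int length,
                    unsigned char colorList[],
                    unsigned char pattern[])
{
    int count[kMaxColor + 1] = {0};
    unsigned char label[kMaxColor + 1] = {0};
    int numDistinct = 0;
    int i, c;

    for (i = 0; i < length; i++) {
        if (code[i] < 1 || code[i] > kMaxColor)
            return -1;
        count[code[i]]++;
    }

    /* repeatedly pick the most frequent color not yet labeled; */
    /* the strict '>' lets the lower color number win ties       */
    for (;;) {
        int best = 0;
        for (c = 1; c <= kMaxColor; c++)
            if (count[c] > 0 && label[c] == 0 &&
                (best == 0 || count[c] > count[best]))
                best = c;
        if (best == 0)
            break;
        colorList[numDistinct] = (unsigned char)best;
        label[best] = (unsigned char)++numDistinct;
    }

    /* rewrite every slot as its normalized color index */
    for (i = 0; i < length; i++)
        pattern[i] = label[code[i]];

    return numDistinct;
}
```

Working on normalized patterns is what keeps the lookup trees small: every answer with the same shape, whatever its actual colors, walks the same tree.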