Pages

CodeEval Type Ahead

Given the size of an n-gram in input, and its first n-1 components, we should write a function that returns our predicted possible last words, sorted by their calculated probability (from higher to lower) and then alphabetically. The solution assumes we generate our n-gram against the nursery rhyme Mary had a little lamb that we could safely hard-code in our program, stripping it from all non-words. More details on the CodeEval page problem.

I plan to write a python solution. So, first thing, I write a python test case:
def test_provided_1(self):  # 1
    self.assertEqual('lamb,0.375;teacher,0.250;children,0.125;eager,0.125;rule,0.125', solution(2, 'the'))

def test_lamb(self):
    self.assertEqual('at,0.200;its,0.200;love,0.200;was,0.200;you,0.200', solution(2, 'lamb'))

def test_the_lamb(self):
    self.assertEqual('love,0.333;was,0.333;you,0.333', solution(3, 'the lamb'))

def test_at(self):  # 2
    self.assertEqual('school,1.000', solution(2, 'at'))
1. The given as example. Looking at the Mary & lamb rhyme, the possible resulting two-grams when the first word is 'the' could be 'the lamb', 'the teacher', 'the children', 'the eager', 'the rule'. The most probable one, using 'lamb', has an estimated 0.375 chance. The less probable ones, having the same chance, should be presented in alphabetical order.
2. Just a single two-gram could be generated from 'at'. Its probability should be obviously 1. Notice the format for probability, three decimal are always required.

The most boring part of the problem was converting the rhyme in a list of words, something like
TEXT = ['Mary', 'had', 'a', 'little', 'lamb', 'its', 'fleece', 'was',
# ...
        'teacher', 'did', 'reply']

I found out running the CodeEval tester that I don't have to worry about using some case insensitive comparison. Upper and lower case are as found in the rhyme.

Then I need a dictionary, where I will put all the n-grams in the rhyme
n_grams = {}
I decided to be lazy. Actually, I'd say that in this case being lazy is the only sensible thing to do, if I don't want to create n-grams that are not required by the user. I write instead a function that checks if a give n-gram family has been already generated. If not, it is created and pushed in the dictionary:
def check_n_grams(n):
    if n in n_grams:  # 1
        return

    item = defaultdict(lambda: defaultdict(int))  # 2
    for i in range(len(TEXT) - n):  # 3
        outer = ' '.join(TEXT[i:i + n])  # 4
        intern = TEXT[i + n]
        item[outer][intern] += 1  # 5

    n_grams[n] = item
1. the n_grams for the current n have been already generated, there is nothing to be done here.
2. Let's generate all the n-grams for the passed n. We need here a bit of a perplexing data structure, a dictionary of dictionaries (that is going to be pushed in a dictionary). Moreover, to simplify a bit the following code, I used defaultdicts. I can't pass defaultdict to initialize a defaultdict, because it is not a callable. So I used the clever trick of passing instead a lambda (that is a callable) that returns it.
3. Loop on all the words in TEXT, stopping a tad before the end, see below if you don't have already guessed why. This code spectacularly fails if the user has the insane, but legit, idea of asking for an extremely high n. For production code a check on its size should be mandatory.
4. Generate the current n-gram considering it divided in two parts, the first one, that I am going to store as key in the outer dictionary, is done joining n words starting from the current one on, the second, that the key in the intern dictionary, represents the last word in the n-gram. So, actually, I am storing in this maps all possible (n+1)-grams.
5. Push the n-gram in the dictionary of dictionaries, setting the initial value to one, or increasing it if such combination is already there.

Large part of the job is done. The solution function does not much more than calling the utility function defined above and formatting the output.
def solution(n, given):
    n -= 1  # 1
    check_n_grams(n)  # 2

    target = n_grams[n][given]  # 3

    values = list(target.items())  # 4
    values.sort(key=lambda x: x[0])  # 5
    values.sort(key=lambda x: x[1], reverse=True)  # 6

    results = []  # 7
    population = sum(target.values())
    for value in values:
        results.append('{},{:.3f}'.format(value[0], value[1] / population))  # 8
    return ';'.join(results)
1. As seen in check_n_grams(), I generate in the n_grams dictionary the (n+1)-grams, splitting them in the key-value on the internal dictionary. I felt that this trick made the code more readable, however here I pay the price of it, with this obscure decrement. Again, production code should perform some parameter checking to avoid runtime disasters.
2. Ensure the current n-grams have been generated.
3. Get the internal dictionary. For instance, if I am looking for the 2-grams for 'at', in target I expect to get a dictionary containing just one element, where the key is the last word in the n-gram and value is its hit counters. In this case 'school': 1.
4. I need to do some sorting now, so I convert the items in target to a list, data type sporting a handy function, sort() that is stable, in this way I can call it repeatedly to get the expected effect.
5. First I sort values by the zeroth component, representing the key of the internal dictionary.
6. Then I sort it by the one-th component, the word counter, in reversed mode, so that the most seen would be first.
7. Format the data, put them in this temporary list, join its components and return them.
8. As required, the probability is formatted as a floating number with three decimals.

Full code pushed on GitHub. Both the tester and the actual python script.

No comments:

Post a Comment