Pages

CodeEval Filename Pattern

As input we get a string containing blank separated tokens. The first one is a pattern, the other ones have to be checked against it. Return the matching tokens, or a minus if no matching is found.
This is CodeEval problem #169 and I am going to show how I have written a Python 3 solution for it.

Firstly, I have converted the given samples to test cases:
def test_provided_1(self):
    self.assertEqual('bh1770.h', solution('*7* johab.py gen_probe.ko palmtx.h macpath.py tzp dm-dirty-log.h bh1770.h pktloc faillog.8.gz zconf.gperf'))

def test_provided_2(self):
    self.assertEqual('IBM1008_420.so', solution('*[0123456789]*[auoei]* IBM1008_420.so zfgrep limits.conf.5.gz tc.h ilogb.3.gz limits.conf CyrAsia-TerminusBold28x14.psf.gz nf_conntrack_sip.ko DistUpgradeViewNonInteractive.pyc NFKDQC'))

def test_provided_3(self):
    self.assertEqual('menu_no_no.utf-8.vim', solution('*.??? max_user_watches arptables.h net_namespace Kannada.pl menu_no_no.utf-8.vim shtags.1 unistd_32_ia32.h gettext-tools.mo ntpdate.md5sums linkat.2.gz'))

def test_provided_4(self):
    self.assertEqual('-', solution('*.pdf OldItali.pl term.log plymouth-upstart-bridge rand.so libipw.ko jisfreq.pyc impedance-analyzer xmon.h 1.5.0.3.txt bank'))

def test_provided_5(self):
    self.assertEqual('groff-base.conffiles', solution('g*.* 56b8a0b6.0 sl.vim digctl.h groff-base.conffiles python-software-properties.md5sums CountryInformation.py use_zero_page session-noninteractive d2i_RSAPublicKey.3ssl.gz container-detect.log.4.gz'))

def test_provided_6(self):
    self.assertEqual('46b2fd3b.0 libip6t_frag.so', solution('*[0123456789]* keyboard.h machinecheck 46b2fd3b.0 libip6t_frag.so timer_defs.h nano-menu.xpm NI vim-keys.conf setjmp.h memcg'))
Then I noticed the regular expression grammar is close with the one used in many programming languages, Python included. With a few slight differences.

A dot ('.') in the pattern means just a dot.
A question mark ('?') means a single character whichever, what we use to represent with a dot.
A star ('*') means any number of any characters, without the need of a dot before it.

And, there is one more change that is not given in the problem description but it is caught by the third test case.

The pattern should cover the entire string. So, for instance "*.???" means that a matching string starts with whatever and ends with three unspecified characters after a dot. So basically, it is like the behavior of the python re match() function when is passed a dollar sign ('$') at the end of the string.

Given the changes are so tiny, it doesn't make sense to create a custom regex engine, a much more effective solution is adapting the input pattern so that it obeys to the rules of the standard python regular expressions. So, I have written this function:
def adjust(pattern):
    return re.sub('\*', '.*', re.sub('\?', '.', re.sub('\.', '\.', pattern))) + '$'
Three calls to substitute the characters as expected by python, and then add a dollar at the end of the string.
The more internal call to sub() replace each non-escaped dots to escaped ones. Then each plain question marks become plain dots. And finally each plain stars become the couple dot-star.

Applying this adjustment, the solution is just a matter of regular expression matching:
def solution(line):
    data = line.split()
    pattern = adjust(data.pop(0)) # 1
    result = [name for name in data if re.match(pattern, name)]
    return ' '.join(result) if result else '-'
I used a list comprehension to filter the names in data to keep just the ones that matches the pattern.

This solution has been accepted by CodeEval, and I have pushed test cases and python script to GitHub.

No comments:

Post a Comment