Parsing with Spirit Qi

A nice thing about Spirit is that it has been designed with scalability in mind: even with a limited knowledge of the library we can put it to work, as long as we don't need any fancy feature.

Spirit Qi provides a good number of built-in parsers that can be combined to create our own specific parser. The Spirit tutorial shows how to start from a built-in parser (double_) and end up with a more complex parser that accepts a list of comma-separated floating point numbers.

All the examples in this post share the same structure: we call the Spirit Qi function phrase_parse() on a string containing our input, specifying the parser to apply and the "skip-parser", which matches the elements that may appear in the input sequence but should not interfere with the evaluation (typically, and in this case too, anything considered white space: blank, newline, ...). The only thing that changes from one example to the next is the parser, so I wrote a function that implements the generic behaviour and takes as input, besides the string containing the text to be evaluated, an expression that represents the parser to use:
#include <boost/spirit/include/qi.hpp>
#include <string>

// ...

template<typename Expr>
inline bool genericParse(const std::string& input, const Expr& expr)
{
    std::string::const_iterator first = input.begin();

    bool r = boost::spirit::qi::phrase_parse( // 1.
        first, // 2.
        input.end(),
        expr, // 3.
        boost::spirit::ascii::space // 4.
    );
    if(first != input.end()) // 5.
        return false;
    return r;
}

1. phrase_parse() returns true if the parser matches, even when it matches only the initial part of the input sequence.
2. First two arguments: iterators delimiting the sequence. The first one is passed by reference and is moved forward past the consumed input.
3. Third argument: the parser.
4. Fourth argument: the skip-parser.
5. Here we implement a stricter parsing: we check that there is no trailing leftover.

Parsing a number

Now it is quite easy to implement a function that parses a floating point number:
bool bsParseDouble(const std::string& input)
{
    return genericParse(input, boost::spirit::qi::double_);
}

boost::spirit::qi::double_ is the built-in parser that is used to identify a number that could be stored in a double variable.

I find that test cases are very useful not only to verify the correctness of the code we produce, but also to understand better what existing code actually does. So I have written a bunch of test cases to verify how the above written code behaves. Here is just the first one I have written:
TEST(BSParseDouble, Double)
{
    std::string input("1.21");
    EXPECT_TRUE(bsParseDouble(input));
}

Parsing two numbers

For parsing two floating point numbers we have to create a custom parser:
bool bsParseTwoDouble(const std::string& input)
{
    auto expr = boost::spirit::qi::double_ >> boost::spirit::qi::double_;
    return genericParse(input, expr);
}

Spirit overloads the right shift operator (>>) to convey the meaning of "followed by". So we can read our custom parser as: a double followed by another double. And here is one of the tests I have written for this function:
TEST(BSParseTwoDouble, Double)
{
    std::string input("1.21");
    EXPECT_FALSE(bsParseTwoDouble(input));
}

Parsing zero or more numbers

A postfix star (known as Kleene Star) is the usual way to represent zero or more repetitions of an expression in regular expressions. The problem is that there is no postfix star operator in C++, so that was not a possible choice for the Spirit designers. That's the reason why a prefix star (the overloaded unary operator*) is used instead:
bool bsParseKSDouble(const std::string& input)
{
    return genericParse(input, *boost::spirit::qi::double_);
}

One test I wrote for this function ensures that a sequence of three doubles is accepted; another one checks that a couple of ints surrounded by a few blanks is accepted too:
TEST(BSParseKSDouble, TrebleDouble)
{
    std::string input("1.21 7.44 8.03");
    EXPECT_TRUE(bsParseKSDouble(input));
}

TEST(BSParseKSDouble, BlankIntIntBlank)
{
    std::string input(" 42 33 ");
    EXPECT_TRUE(bsParseKSDouble(input));
}

Parsing a comma-delimited list of numbers

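Finally, the big fish of this post. We expect at least one number, and a comma should be used as delimiter:
bool bsParseCSDList(const std::string& input)
{
    // The expression is passed directly instead of being stored in an auto
    // variable: it contains temporaries (such as char_(',')) that are held
    // by reference and would be gone by the time the parser is used.
    return genericParse(input,
        boost::spirit::qi::double_ >>
        *(boost::spirit::qi::char_(',') >> boost::spirit::qi::double_));
}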

We can read the parser in this way: a double followed by zero or more elements of the expression made by a comma followed by a double.
Actually, we didn't have to explicitly convert the comma character to a parser, since operator >>, having a parser as its right operand, is smart enough to perform the conversion on its own. So we could have written:
return genericParse(input, boost::spirit::qi::double_ >> *(',' >> boost::spirit::qi::double_));
But it was a good way to show the built-in char_ parser.

Since this parser is a bit more interesting, I'd suggest you write a lot of test cases, to check whether your expectations match the actual parsing behaviour. Here are a few of them:
TEST(BSParseCSDList, Empty)
{
    std::string input;
    EXPECT_FALSE(bsParseCSDList(input));
}

TEST(BSParseCSDList, Double)
{
    std::string input("1.21");
    EXPECT_TRUE(bsParseCSDList(input));
}

TEST(BSParseCSDList, DoubleDouble)
{
    std::string input("1.21,7.44");
    EXPECT_TRUE(bsParseCSDList(input));
}

TEST(BSParseCSDList, DoubleDouble2)
{
    std::string input("1.21, 7.44");
    EXPECT_TRUE(bsParseCSDList(input));
}

TEST(BSParseCSDList, DoubleDoubleBad)
{
    std::string input("1.21 7.44");
    EXPECT_FALSE(bsParseCSDList(input));
}
