Skip to the content.

Regular expression to free Alternative, part 2

We defined Regex type to express a regular expression in part 1. Now, how can we get more information from it? For example, how can I get a length of a matched string? The simplest way is defining a new matcher.

match :: Regex -> String -> Maybe Int
match r s = listToMaybe $ mapMaybe f $ matchAlt r s
  where
    f (n, "") = Just n
    f _ = Nothing

matchAlt :: RAlt -> String -> [(Int, String)]
matchAlt (RAlt seqs) s = concat [matchSeq seq s | seq <- seqs]

matchSeq :: RSeq -> String -> [(Int, String)]
matchSeq REmpty s = [(0, s)]
matchSeq (RSeq char alt) s = do
  (n1, s1) <- matchChar char s
  (n2, s2) <- matchAlt alt s1
  pure (n1 + n2, s2)

matchChar :: RChar -> String -> [(Int, String)]
matchChar (RChar rc) (c : cs)
  | rc == c = [(1, cs)]
  | otherwise = []
matchChar _ [] = []

As you can see, these functions now return a list of a pair of Int and String instead of a list of String. Int represents a length of a matched string while String represents a rest of the string.

This works, but we need to define another set of functions to get another information, for example, a matched string itself.

To avoid that, let’s embed information we need into our Regex itself. We’ll add a parameter of a generic type a to RChar to hold this information.

data RChar a = RChar Char a

data RSeq a = REmpty a | RSeq (RChar a) (RAlt a)

newtype RAlt a = RAlt [RSeq a]

type Regex = RAlt

We also update a set of functions to build Regex.

rNever :: RAlt a
rNever = RAlt []

rEmpty :: (Monoid a) => RAlt a
rEmpty = RAlt [REmpty mempty]

rSeq :: RAlt a -> RAlt a -> RAlt a
rSeq (RAlt seqs) alt = RAlt $ concat [rSeq' seq alt | seq <- seqs]
  where
    rSeq' :: RSeq a -> RAlt a -> [RSeq a]
    rSeq' (REmpty _) (RAlt seqs') = seqs'
    rSeq' (RSeq c1 (RAlt seqs')) alt' = [RSeq c1 (RAlt (rSeq' seq' alt')) | seq' <- seqs']

rAlt :: RAlt a -> RAlt a -> RAlt a
rAlt (RAlt seqs1) (RAlt seqs2) = RAlt $ seqs1 <> seqs2

rMany :: (Monoid a) => RAlt a -> RAlt a
rMany r =
  let RAlt seqs = rSeq r (rMany r)
   in RAlt (REmpty mempty : seqs)

These functions are almost identical to the ones in the previous post, but have additional type parameter a. As you can see, I added Monoid a constraint to rEmpty and rMany so that we can build REmpty by passing mempty. We’ll also use Semigroup a constraint to match Regex to concatenate a value stored in RChar in matchSeq.

match :: (Semigroup a) => Regex a -> String -> Maybe a
match r s = listToMaybe $ mapMaybe f $ matchAlt r s
  where
    f (a, "") = Just a
    f _ = Nothing

matchAlt :: (Semigroup a) => RAlt a -> String -> [(a, String)]
matchAlt (RAlt seqs) s = concat [matchSeq seq s | seq <- seqs]

matchSeq :: (Semigroup a) => RSeq a -> String -> [(a, String)]
matchSeq (REmpty a) s = [(a, s)]
matchSeq (RSeq char alt) s = do
  (a1, s1) <- matchChar char s
  (a2, s2) <- matchAlt alt s1
  pure (a1 <> a2, s2)

matchChar :: RChar a -> String -> [(a, String)]
matchChar (RChar rc a) (c : cs)
  | rc == c = [(a, cs)]
  | otherwise = []
matchChar _ [] = []

Finally, we define rChar variants to build RChar with a desired value. For example, you can use this rChar to get length of a matched string. We always bind 1 to RChar because RChar always matches a single character.

rChar :: Char -> Regex (Sum Int)
rChar c = RAlt [RSeq (RChar c 1) rEmpty]

regex :: Regex
regex = (rChar 'a' `rSeq` rChar 'b') `rAlt` (rMany (rChar 'c' `rSeq` rChar 'd') `rSeq` rChar 'e')

You can find that getSum <$> match regex "cdcdcde" returns Just 7.

With rChar' which binds a string containing the character itself to RChar, you can get a matched string itself. This time, we bind [c] to it because RChar matches this string.

rChar' :: Char -> Regex String
rChar' c = RAlt [RSeq (RChar c [c]) rEmpty]

regex' :: Regex
regex' = (rChar' 'a' `rSeq` rChar' 'b') `rAlt` (rMany (rChar' 'c' `rSeq` rChar' 'd') `rSeq` rChar' 'e')

This time, match regex' "cdcdcde" returns Just “cdcdcde”.

Now we can get information from Regex using the same match function if the information is Monoid. But this means that we always concatenate information in RChar. How can we get information with more flexibilities? For example, how can we get a string that matched a part of a regular expression? Let’s see what we can do in the next post.