Skip to the content.

Regular expression to free Alternative, part 3

When you look at matchSeq in the previous post, you’ll find that we use (<>) of Semigroup only at a place. we can make match free from Semigroup if we can remove it. Let’s revisit matchSeq.

matchSeq :: (Semigroup a) => RSeq a -> String -> [(a, String)]
matchSeq (REmpty a) s = [(a, s)]
matchSeq (RSeq char alt) s = do
  (a1, s1) <- matchChar char s
  (a2, s2) <- matchAlt alt s1
  pure (a1 <> a2, s2)

One of the ways to remove (<>) is making a1 or a2 a function. For example, we can call it like a1 a2 instead of combining them like a1 <> a2 if a1 is a function a -> a. Actually, a1 isn’t necessarily a -> a, but can be b -> a if a2 is b. They can be the opposite. We can call it like a2 a1 instead of a1 <> a2 if a2 is b -> a and a1 is b.

Let’s take the latter approach and see how RSeq will look like.

data RSeq a where
  REmpty :: a -> RSeq a
  RSeq :: RChar b -> RAlt (b -> a) -> RSeq a

Now, the first part has b (RChar b), and the second part has b -> a (RAlt (b -> a)). As I wrote above, our new matchSeq will apply a1 to a2 instead of concatenating a1 and a2 using (<>).

matchSeq :: RSeq a -> String -> [(a, String)]
matchSeq (REmpty a) s = [(a, s)]
matchSeq (RSeq char alt) s = do
  (a1, s1) <- matchChar char s
  (a2, s2) <- matchAlt alt s1
  pure (a2 a1, s2)

For example, you can write a regular expression ab like this.

regex :: Regex Int
regex = RAlt [RSeq (RChar 'a' 1) (RAlt [RSeq (RChar 'b' (+ 1)) (RAlt [REmpty id])])]

In this example, match regex returns Just 1.

You can control what it returns by attaching whatever function you like to RSeq. For example, match regex' returns Just "There is b!" by throwing away the value in RChar 'a' 1.

regex' :: Regex String
regex' = RAlt [RSeq (RChar 'a' 1) (RAlt [RSeq (RChar 'b' (const "There is b!")) (RAlt [REmpty id])])]

Again, it’s hard to write these expression directly. Let’s update the combinator functions. rNever, rEmpty and rAlt are simple.

rNever :: RAlt a
rNever = RAlt []

rEmpty :: a -> RAlt a
rEmpty a = RAlt [REmpty a]

rAlt :: RAlt a -> RAlt a -> RAlt a
rAlt (RAlt seqs1) (RAlt seqs2) = RAlt $ seqs1 <> seqs2

infixl 3 `rAlt`

But rSeq is a bit complicated. rSeq will combine RAlt (a -> b) and rAlt a instead of RAlt a and RAlt a.

rSeq :: RAlt (a -> b) -> RAlt a -> RAlt b
rSeq (RAlt (seqs :: [RSeq (a -> b)])) (alt :: RAlt a) = RAlt $ concat [rSeq' seq alt | seq <- seqs]
  where
    rSeq' :: RSeq (a -> b) -> RAlt a -> [RSeq b]
    rSeq' (REmpty (f :: a -> b)) (RAlt (seqs' :: [RSeq a])) = map (f <$>) seqs'
    rSeq' (RSeq (c :: RChar c) (alt1 :: RAlt (c -> a -> b))) (alt2 :: RAlt a) =
      [ RSeq
          (c :: RChar c)
          ((((flip <$> (alt1 :: RAlt (c -> a -> b))) :: RAlt (a -> c -> b)) `rSeq` (alt2 :: RAlt a)) :: RAlt (c -> b))
      ]

infixl 4 `rSeq`

I added type annotations as much as possible to get the idea. After all, the first parameter is a -> b which consists of c and c -> a -> b, and the second parameter is a. We need to build a new structure whose first part is c and the second part is c -> b. It means that we need to flip c -> a -> b to get a -> c -> b, then apply a to get c -> b.

Also, we can build Regex for a character with rChar.

rChar :: Char -> RAlt Int
rChar c = RAlt [RSeq (RChar c 1) (rEmpty id)]

Can we write regex as rChar 'a' `rSeq` rChar 'b' again now? Unfortunately, no. rSeq combines RAlt (a -> b) and RAlt a, but both rChar 'a' and rChar 'b' is RAlt Int. We need to lift rChar 'a' to return (+ 1) instead of 1 by fmapping (+).

regex :: Regex Int
regex = (+) <$> rChar 'a' `rSeq` rChar 'b'

You can also write it with rEmpty.

regex :: Regex Int
regex = rEmpty (+) `rSeq` rChar 'a' `rSeq` rChar 'b'

There is one more change. rMany no longer returns RAlt a, but returns RAlt [a] instead. You can apply any function to it by fmapping the function.

rMany :: RAlt a -> RAlt [a]
rMany r =
  let RAlt seqs = rSeq ((:) <$> r) (rMany r)
   in RAlt (REmpty [] : seqs)

Now, you can write ab|(cd)*e like this.

rChar' :: Char -> RAlt String
rChar' c = RAlt [RSeq (RChar c [c]) (rEmpty id)]

regex :: Regex String
regex = (<>) <$> rChar' 'a' `rSeq` rChar' 'b' `rAlt` ((<>) <$> concat <$> rMany ((<>) <$> rChar' 'c' `rSeq` rChar' 'd')) `rSeq` rChar' 'e'

match regex "ab" returns Just "ab", and match regex "cdcdcde" returns Just "cdcdcde".

Also, we can get a string that matched (cd)*.

regex' :: Regex String
regex' = const (const "") <$> rChar' 'a' `rSeq` rChar' 'b' `rAlt` (const <$> concat <$> rMany ((<>) <$> rChar' 'c' `rSeq` rChar' 'd')) `rSeq` rChar' 'e'

Now, match regex' "ab" returns Just "", and match regex' "cdcdcde" returns Just "cdcdcd".

You might’ve already found this, but RAlt can be an instance of Applicative and Alternative where pure is rEmpty, (<*>) is rSeq, empty is rNever, and (<|>) is rAlt.

data RChar a = RChar Char a deriving (Functor)

data RSeq a where
  REmpty :: a -> RSeq a
  RSeq :: RChar b -> RAlt (b -> a) -> RSeq a

deriving instance Functor RSeq

newtype RAlt a = RAlt [RSeq a] deriving (Functor)

instance Applicative RAlt where
  pure = rEmpty
  (<*>) = rSeq

instance Alternative RAlt where
  empty = rNever
  (<|>) = rAlt

type Regex = RAlt

Also, you can use many instead of rMany. It’ll become a bit shorter to write ab|(cd)*e with these operators.

regex :: Regex String
regex = (<>) <$> rChar' 'a' <*> rChar' 'b' <|> ((<>) <$> concat <$> many ((<>) <$> rChar' 'c' <*> rChar' 'd')) <*> rChar' 'e'

We’ll make it a bit more generic in the next post.