From: Jeff Greif
Subject: regexp instance generator
Date: 
Message-ID: <YJ0mc.24720$Ia6.3747796@attbi_s03>
Does anyone know of an existing function that generates random strings which
match a given regular expression (in any reasonably popular regexp language,
such as emacs' or grep's)?

A sketch for a (slooooow) algorithm:
1.  construct the finite state machine for the regexp,
2.  start it,
3.  randomly pick a state transition from the current state,
4.  shuffle the character set randomly
4.  search the shuffled character set for a match to the transition
constraint (naively linearly, but possibly something better later) and add
the first one found to the string being constructed.
5.  if at the end state, return the constructed string
6.  else go to 3

Jeff

From: gantar
Subject: Re: regexp instance generator
Date: 
Message-ID: <c7a5u1$1ihkt$1@ID-233101.news.uni-berlin.de>
Jeff Greif wrote:
> Does anyone know of an existing function that generates random strings which
> match a given regular expression (in any reasonably popular regexp language,
> such as emacs' or grep's)?
> 
> A sketch for a (slooooow) algorithm:
> 1.  construct the finite state machine for the regexp,
> 2.  start it,
> 3.  randomly pick a state transition from the current state,
> 4.  shuffle the character set randomly
> 4.  search the shuffled character set for a match to the transition
> constraint (naively linearly, but possibly something better later) and add
> the first one found to the string being constructed.
> 5.  if at the end state, return the constructed string
> 6.  else go to 3
> 
> Jeff
> 
> 

This is a slow method indeed. Please note that regular expressions
are a special case of a transformational grammar. This means
a regular expression parser is the upside down mechanism for
*generating* regular expressions.

The regular expression

	a+

can be expressed as the transformational grammar

	START: A
	A -> Aa | a

A program for generating a string matching the regular expression
runs along the line:

	make an 'A'
	loop:
		toss a coin to decide whether to replace the A by
		an Aa or an a
		rinse, repeat

If you don't know about transformational grammars you find
a lot of stuff on the web. The best book about the matter
is "the art of compiler design" by Tom Pittman.

Good luck
Gantar
From: Marco Baringer
Subject: Re: regexp instance generator
Date: 
Message-ID: <m2r7tzz28r.fsf@bese.it>
"Jeff Greif" <······@spam-me-not.alumni.princeton.edu> writes:

> Does anyone know of an existing function that generates random strings which
> match a given regular expression (in any reasonably popular regexp language,
> such as emacs' or grep's)?

as if i didn't have a gazillion other things i should be doing this
morning:

[This assumes you have cl-ppcre lying around (i'm not about to write a
regular expression parser). It only handles alternation, it ignores
registers (and therefore backrefs), character classes, flags
(negative|pasitive)-look-(ahead|behind), if you want to add them it
should be easy. One note: greedy/non greedy repetiton will always
generate the shortest string possible. (ie (string-from-regexp "a*")
will always produce "").]

Examples:

(string-from-regexp "a|b" :case-insensitive-mode t)
==>
"a" or "A" or "b" or "B"

(string-from-regexp "a|b*")
==>
"a" or ""

(string-from-regexp "aab+")
==>
"aab"

Code:

(asdf:oos 'asdf:load-op :arnesi) ;; needed for WHICHEVER and LIST-MATCH-CASE

(asdf:oos 'asdf:load-op :cl-ppcre) ;; needed for PARSE-STRING

(defpackage :ercpp-lc 
  (:use :common-lisp :cl-ppcre :arnesi))

(in-package :ercpp-lc)

(defun string-from-regexp (regexp &key case-insensitive-mode multi-line-mode)
  "Generate a string which would match PARSE-TREE with the
  supplied options."
  (let ((/i case-insensitive-mode)
        (/m multi-line-mode))
    (declare (special /i /m))
    (with-output-to-string (string)
      (declare (special string))
      (generate (cl-ppcre::parse-string regexp)))))

(defun generate (parse-tree)
  "The work hores function of string-from-regexp. Pretend
  parse-tree is a grammar, generate a string specified by
  grammar."
  (etypecase parse-tree
    (character (generate-char parse-tree))
    (string (generate-string parse-tree))
    (t
     (list-match-case parse-tree
       (:VOID ;; epislon, do nothing
        nil)
       (:EVERYTHING
        (generate-everything))
       ((:SEQUENCE . ?rest)
        (dolist (p ?rest)
          (generate p)))
       ((:GROUP . ?rest)
        ;; treat this just like :SEQUENCE
        (generate `(:SEQUENCE ,@?rest)))
       ((:REGISTER ?tree)
        ;; just match ?tree, don't deal with registers for now
        (generate ?tree))
       ((:GREEDY-REPETITION ?min ?max ?tree)
        ;; this could match beteween ?min and ?max (which could be nil
        ;; aka inifinity) times, for simplicity we'll always use ?min.
        (dotimes (i ?min)
          (generate ?tree)))
       ((:ALTERNATION . ?parse-trees)
        (generate (nth (random (length ?parse-trees)) ?parse-trees)))
       (?anything-else (error "Don't know how to handle ~S." ?anything-else))))))

(defun generate-char (char)
  (declare (special /i string))
  (if /i
      ;; case INsensitive. The WHICHEVER macro choses ane of it's
      ;; forms at random and executes it.
      (whichever (write-char (char-upcase char) string)
                 (write-char (char-downcase char) string))
      ;; case sensitive
      (write-char char string)))

(defun generate-string (string)
  (loop
     for char across string
     do (generate-char char)))

(defun generate-everything ()
  (declare (special /s string))
  ;; we assume 7 bit ASCII.
  (if /s
      ;; dot matches new line as well.
      (loop for random-char = (code-char (random 127))
            while (or (char= random-char #\Newline)
                      (char= random-char #\Linefeed))
           finally (write-char random-char string))
      (write-char (code-char (random 127)) string)))

-- 
-Marco
Ring the bells that still can ring.
Forget your perfect offering.
There is a crack in everything.
That's how the light gets in.
     -Leonard Cohen