fbpx

System-Engineering | 14 min read

Regex Engine in JunOS

ngworx Team
August 2020
written by Maciej Zurawski
Senior Network Engineer

Have you ever heard of regular expressions or regex / regexp for short? There is a very high chance that you’ve even used it without knowing.

The basic ‘*’ is often used as a wildcard, having its base in the Kleene closure, formed by the creator of the concept of regular expression. Reading about finite automata or Thompson’s construction algorithm could give you a head start in terms of computational linguistics, in case you are interested in knowing how it is possible to create a mathematical notation of a language. So let’s jump right into it, shall we?

The theory behind it

What is behind the term regular expression? According to the source of all knowledge, to which everyone refers to (but is afraid to admit it):

Regex is a sequence of characters that define a search pattern.

Implementations of regular expressions are done using an abstract computation model: finite-state machines (finite state automata) which takes a string of symbols and is running through an internal state sequence determined by the given string. If the end state is reached it means that the given string matches the defined pattern. There are two types of finite-state automata used for that purpose:

  • non-deterministic (NFA)
  • deterministic (DFA)

regex

DFA will always map a single state to a single symbol in a pattern, which means that its transitions are deterministic. As it would never test the same character twice (no backtracking is required), it would always run in the linear time, as it takes the same amount of time to confirm whether there is a possible match or not at all. I sometimes picture DFA as a DFS/BFS algorithm in terms of how states graph is traversed.

NFA being non-deterministic counterpart can have zero, one or more transitions corresponding to a particular symbol. When there is more than one transition to a given symbol it branches and the branch lives as long as there is a valid transition. NFA can even perform a transition on an empty string. One of the properties of NFA is that it can always be converted into the DFA by just expanding the number of states. For example, powerset construction can be used for that. Both differ in the efficiency, implementation complexity and in what they support (for example non-greedy quantifiers). There is even a slight difference in what would be matched, as DFA will always find the longest possible match furthest to the left (longest-leftmost match), whereas NFA would generally tend to find the first match satisfying the conditions. That means that both DFA and NFA can sometimes match different strings using the same pattern.

Worth noting is that NFA looks more powerful, since it can incorporate more advanced search patterns, but as always, with great power comes great responsibility. There is the possibility to use a tremendous amount of time and resources if a pattern is using too much backtracking. Such extra care should always be invested in pattern optimization and understanding how exactly the engine will handle the tested string. No one wants to end like Cloudflare! There are certain interesting topics of ReDoS (regular expression DoS attacks) or evil regex – patterns which require a lot of unnecessary backtracking.

Engines

In the end, what engines are doing, they are converting regex patterns into either DFA or NFA (or both). There are different flavors of regular expression implementation. Historically we have to thank Ken Thompson (the one and only) since he basically created grep as a part of it’s ED editor (ed man!) and even earlier implementation of regex as an extension of QED. But since the 1970s, time passed and multiple engines surfaced with different syntax and functionalities, and what more importantly with different approaches on how regex should work. Out of that, 3 different types of engines emerged:

  • Traditional NFA (example: PCRE)
  • POSIX NFA (example: GNU Emacs)
  • DFA (examples: awk)

But why is there an additional NFA? The only difference between POSIX and traditional NFA is handling of the longest match. As I already stated, NFA would try to match the first possible substring, which would then satisfy all conditions. Whereas POSIX defines that if there are multiple matches which start at the same position, it is always the longest that must be returned. This makes them slower than traditional NFA. There are of course some applications which use a combination of DFA and NFA. These are usually called Hybrids – Tcl is an example of that.

In the regular expression world, NFA engines are also called regex-driven and DFA engines are text-driven. This means that in the case of NFA, pattern meta-characters are matched one by one against the tested string. So we can say that the pattern controls the flow. In the case of DFA, characters from the string would be scanned one by one and then matched against possible matches in regex. Here we can say that the string (or text) is in the control of the workflow.

Regex at JunOS

Basics

Where we can use regular expressions in JunOS:

  • match, find, except utilities
  • as-path (but there we need to remember that ASN is treated as a single symbol, so it is kind of hybrid between regex and wildcard)
  • replace (we need to remember that we can only replace “values”, so user-defined variables and parameters)
  • syslog filters

We ignore other usages, like inside scripts (Python, SLAX, XSLT) and on the UNIX shell, since they would use either language-specific libraries or OS-specific command-line utilities. We are focusing on the regex engine used by the mgd. Of course, we need to keep in mind that in some places we can use only wildcards, which can seem to some like a regex pattern at first but don’t be fooled. Wildcards can be seen:

  • in apply-path statement
  • configuration groups
  • wildcard range – operations
  • in some show commands (ex. show interface terse xe-* or show interface terse xe-0/0/[0-20])

Always check the documentation for other use cases and try them out to verify what kind of input is accepted and what is the matched output.

The documentation on Juniper’s webpage delivers all the information needed. So let’s test right away what features we can use there: According to it, JunOS uses regular expression engine working as defined in the POSIX 1003.2, which I assume is just POSIX BRE and POSIX ERE. The documentation on the usage of a regular expression is rather slim:

If you need better explanations on regular expressions, the following links contain helpful information.

Related Service See how we help businesses with our network engineering services:
Network Engineering We provide engineering services for over the whole lifecycle process. See more

Let’s try it out, shall we? I will use only “match” since I want to show the only line that matches our pattern. With grep, it would be easier, since I could just specify -o option to visualize what part of the output is matched exactly. (Zwinkern)

{master:0}
mzurawski@qfx5100-02> show version | match junos:    
Junos: 18.4R1.8

Out of that, we can draw the first conclusion: JunOS regular expressions are case insensitive and since we cannot specify a matching mode, there is no possibility to override this. As already established, JunOS uses different implementation.

$ perl -le 'print "match" if $ARGV[0] =~ m/(?i)junos/' 'Junos:'  
match

$ perl -le 'print "match" if $ARGV[0] =~ m/(?c)junos/' 'junos:' 
match

Now to the operational specifics: If we want to use matching groups, character classes or just spaces in a pattern, rather then a literal string, we need to wrap our expression with double-quotes. Actually, I was always thinking that this is true to all operators, like quantifiers or anchors, but I was so surprised when I tried some patterns without quotes, just to have a syntax error messages for this article.

If the regular expression contains any spaces, operators, or wildcard characters, enclose it in quotation marks. So, I personally always wrap everything behind match|expect|find with quotes, just not to hit any unexpected behavior. (Zwinkern) As we can see below syntax error is triggered only when an alteration is grouped with matching group brackets and the whole statement is not wrapped in quotes.

{master:0}
mzurawski@qfx5100-02> show version | match ^Model:?   
Model: qfx5100-48s-6q

{master:0}
mzurawski@qfx5100-02> show version | match "^Model:?"   
Model: qfx5100-48s-6q

{master:0}
mzurawski@qfx5100-02> show version | match "^(Model|Hostname):?" 
Hostname: qfx5100-02
Model: qfx5100-48s-6q

{master:0}
mzurawski@qfx5100-02> show version | match ^(Model|Hostname):?      
error: syntax error:

Supported operations/features

We’ve already seen how to operate and what to expect from the engine itself. I tried to test a little bit and created a simple table for all of you which are not familiar with regex, with some examples and tested it against JunOS CLI. I then compared given matches with what we would expect on the POSIX ERE engine – just to verify if what we expect is true. I used grep or Perl for a pattern matching examples, since as previously stated, within JunOS I would match the whole line, which is not really helpful for what is exactly matched. Click here to jump to the table.

As we can see, there are some similarities to the grep/egrep utility and especially how repeat quantifiers work. If we actually go to the unix shell on Juniper devices, we would see that match utility works identically to the grep/egrep utility there. Since POSIX ERE would not support back-references or lazy quantifiers, there is no point in testing that further. You can take my word for that. (Zwinkern)

Final statement

Let’s wrap this up, the same way we’ve done it in the Wireless series. JunOS engine specifics:

  • It should be POSIX compliant since it is based on POSIX standard – it should always return longest-leftmost match regardless of used FSM (but we should not really care about it, since we would always get a full matching line).
  • It is POSIX ERE.
  • It is case insensitive and matching mode cannot be changed.
  • Most of the shorthands are not supported, so it is better to use “explicit” character classes.
  • Back-references cannot be used

      I saw a lot of hate toward regex on the internet and yes it is kind of unfriendly, not so humanly readable, but still, it gets its job done. Most of the issues with performance are wrong matches returning, execution speed coming from poorly written patterns or not quite understanding how the engine would operate on a given pattern (especially NFA vs DFA). Don’t get me wrong here, I’m not regex fanboy but I think it is a useful tool, especially when used how it was intended to be used. It can really ease some of your everyday chores. There is a great book covering everything important on regex: Mastering Regular Expressions by Jeffrey E. F. Friedl. It is definitely on my to-read book list.

      If you survive through all of this, there is a little trivia awaiting here: You can actually check if a number is a prime, using regular expressions (Lächeln) (in a very inefficient way).

      $ perl -e '(1 x $ARGV[0]) =~ /^.?$|^(..+)\1+$/ || print "$ARGV[0] is prime\n"' 3
      3 is prime

      Table of supported operations/features

      View table

      Construct  Description Example match
      Character
      char

      • Usable in JunOS
      • Implemented in POSIX ERE
      Literal match of the specified character.
      $ echo "MAC: 0000.c914.f829" | grep -Eo 'M'
      M
      wildcard
      .

      • Usable in JunOS
      • Implemented in POSIX ERE
      Wildcard matches any single character.
      $ echo "MAC: 0000.c914.f829" | grep -Eo '.+'
      MAC: 0000.c914.f829
      character class
      [chars]

      • Usable in JunOS
      • Implemented in POSIX ERE
      Character class (or set) will match any single character from a give range or list.

      Most engines will recognize only a few meta-characters inside brackets: closing bracket (), carret ^ and escape character \, so others do not have to be escaped. That we will test later. (Zwinkern)

      $ echo "MAC: 0000.c914.f829" | grep -Eo '[A-Z]+'
      MAC
      negated character class
      [^chars]

      • Usable in JunOS
      • Implemented in POSIX ERE
      Negation of character class, so it should be obvious, it would match any character that it is not a part of the given set.
      $ echo "MAC: 0000.c914.f829" | grep -Eo '[^a-f0-9]+'
      MAC:
      .
      .
      escape character
      \

      • Usable in JunOS
      • Implemented in POSIX ERE
      Used to suppress meta-character special “meaning” so that they are visible to the engine as a literal character.
      $ echo "MAC: 0000.c914.f829" | grep -Eo '\.'
      .
      .
      shorthand character class: digits
      \d
      It is equivalent of [0-9]. If someone likes to play a regex golf he can save 3 characters using this!
      ## grep/egrep does not support shorthand digits ... same as Junos CLI ... interesting
      ## I know I could just use -P option so that grep would use Perl syntax, but nah.
      $ perl -le '@groups = ($ARGV[0] =~ m/(\d+)/g); print $_ for @groups;' 'MAC: 0000.c914.f829'
      0000
      914
      829
      shorthand character class: word character
      \w
      It is equivalent of character set matching ASCII characters [A-Za-z0-9_], some engines would support Unicode characters, but from obvious reasons, it is not the case for JunOS CLI.
      ## In Junos it seems to only use literal 'w'
      $ echo "MAC: 0000.c914.f829" | grep -Eo '\w+'
      MAC
      00a0
      c914
      f829
      negated shorthand character class: word character
      \W
      It is equivalent of character set not matching ASCII characters [^A-Za-z0-9_]
      ## In Junos it seems to only use literal 'w'
      $ echo "MAC: 0000.c914.f829" | grep -Eo '\W+'
      :
      .
      .
      shorthand character class: white space
      \s

      • Usable in JunOS
      Since this follows the same principle as the previous two constructs, this would match any white characters. So it is equivalent of [ \t\n\r\f]. If you are not familiar with the control characters and their escape sequence, this list contains in the exact order: space, tabulator, new line, carriage return and form feed (or a page-break).
      echo "MAC: 0000.c914.f829" | grep -Eo '\s+'
      
      
      ## in this case, it actually matched a space!
      beginning of a string anchor
      ^

      • Usable in JunOS
      • Implemented in POSIX ERE
      Anchors search pattern at the beginning of the string, as the name implies. Normally the longest possible match is returned on any position in the data set.
      $ echo "MAC: 0000.c914.f829" | grep -Eo "^\w+" 
      MAC
      end of a string anchor
      $

      • Usable in JunOS
      • Implemented in POSIX ERE
      And this one will do the same but on the opposite side.
      $ echo "MAC: 0000.c914.f829" | grep -Eo "\w+$"
      f829
      word boundaries anchor
      \bpattern\b
      Allows the “whole words only” search, and it would only match a pattern when it is wrapped with the non-word characters. Usually, word characters are the equivalent of \w, which means non-word characters would be \W. (Zwinkern) There is a negated version which uses meta-character \B.
      $ echo "MAC: 0000.c914.f829" | grep -Eo "\b[0-9]+\b"
      0000
      $ echo "MAC: 0000.c914.f829" | grep -Eo "\b[A-Za-z]+\b"
      MAC
      quantifier: zero or more
      *

      • Usable in JunOS
      • Implemented in POSIX ERE
      Gives a match if the preceding token is repeated zero or more times.
      $ echo "MAC: 0000.c914.f829" | grep -Eo "0*3*"
      0000
      quantifier: one or more
      +

      • Usable in JunOS
      • Implemented in POSIX ERE
      Gives a match if the preceding token is repeated one or more times.
      $ echo "MAC: 0000.c914.f829" | grep -Eo "0+"
      0000
      quantifier: zero or once
      ?

      • Usable in JunOS
      • Implemented in POSIX ERE
      Makes the preceding token optional.
      $ echo "MAC: 0000.c914.f829" | grep -Eo "912?"
      91
      quantifier: repeated
      {m,n}

      • Usable in JunOS (partially, {,n} does not work)
      • Implemented in POSIX ERE (partially, {,n} does not work)
      This is a special case of a repetition quantifier since it will set upper and lower limits. Any parameter can be omitted, to change its behavior. If the parameter is specified, it has to be greater then 0.

      {m,n} – matches token when repeated between m and n times.
      {m} – matches token exactly m times.
      {m,} – matches token when repeated at least m times.
      {,n} – matches token when repeated between 0 and n times.

      $ echo "MAC: 0000.c914.f829" | grep -Eo "0{2}"
      00
      00
      $ echo "MAC: 0000.c914.f829" | grep -Eo "0{,3}"
      000
      0
      alteration
      |

      • Usable in Junos
      • Implemented in POSIX ERE
      Matches either pattern on the left or right side of the alteration meta-character, can be chained into a series of alternatives.
      $ echo "MAC: 0000.c914.f829" | grep -Po "[a-f0-9](91|82)[a-f0-9]"
      c914
      f829

       

       

       

       

      Click here to return to the text.

      ngworx Team
      August 2020
      written by Maciej Zurawski
      Senior Network Engineer

      Most Popular

      Network-Engineering | 8 min read

      Junos upgrade – filesystem is full

      Not enough storage during Junos upgrade (EX2300 and EX3400). An extension of Juniper's article…

      Read more

      Juniper Networks

      Want to learn more about Juniper Networks? Discover their solutions, products, awards, team leaders, partners, training programs, and the latest events by clicking the button below.