generic regex to capture various optional groups

I am looking for a way to select, using regex, lines that contain various groups, some of them optional, and to capture whichever groups are found. After reading here on Stack Overflow and many experiments, I came up with this reasonably general approach:
^.*?(?:.*?(aaa).*?|.*?).*?(xxx).*?(yyy).*?(?:.*?([^ n]+).*?|.*?).*?$
So the general term for optional groups is:
(?:.*?(blabla).*?|.*?)
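To make the idiom concrete, here is a minimal Python sketch that builds a pattern from the general term above. The helper name optional_group, the function capture, and the sample lines are illustrative assumptions; only the terms aaa, xxx and yyy come from the question.

```python
import re

def optional_group(term: str) -> str:
    """Wrap a literal term in the optional-group idiom from the question."""
    return rf"(?:.*?({re.escape(term)}).*?|.*?)"

# aaa is optional, xxx and yyy are mandatory (same roles as in the question).
PATTERN = re.compile(
    r"^.*?" + optional_group("aaa") + r".*?(xxx).*?(yyy).*?$"
)

def capture(line: str):
    """Return the captured groups, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groups() if m else None

if __name__ == "__main__":
    for sample in ("aaa xxx yyy", "xxx yyy", "aaa yyy"):
        print(repr(sample), "->", capture(sample))
```

A missing optional group shows up as None in the tuple, while a missing mandatory group makes the whole line fail to match.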
The above approach has backtracking problems in case of:
Any idea how to create a generic enough regex that can capture optional groups? By generic I mean, as in the example I found, easy to scale up to various group patterns.
Thanks.
Both of these fail with catastrophic backtracking on regex101.com with a few paragraphs of text. The reason isn't hard to figure out, what with your eight different .* wildcards, imagine all the possible permutations of eight different things that can all be anything or nothing... – sweaver2112
Jul 1 at 5:54
@sweaver2112 Yes, there are problems: with large texts; when the pattern contains only optional groups; and when nothing is matched. This can be alleviated with a two-step approach that first applies a pattern without the optional groups. And fortunately, I only have to search in relatively short lines.
– Gigi
Jul 1 at 7:16
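The two-step idea from the comment above could look roughly like this in Python. This is only a sketch under assumptions: the comment does not say how the second step works, so the function name two_step and the choice to look for the optional term only in the text before the mandatory match are mine.

```python
import re

# Step 1: only the mandatory groups, no optional parts.
MANDATORY = re.compile(r"(xxx).*?(yyy)")
# Step 2: the optional term, searched separately.
OPTIONAL_BEFORE = re.compile(r"aaa")

def two_step(line: str):
    """Return (optional, xxx, yyy) or None if the mandatory part is absent."""
    m = MANDATORY.search(line)
    if m is None:
        return None  # cheap rejection, no optional-group backtracking at all
    # Restrict the optional search to the text before the mandatory match.
    before = OPTIONAL_BEFORE.search(line, 0, m.start())
    return (before.group(0) if before else None, m.group(1), m.group(2))

if __name__ == "__main__":
    for sample in ("aaa xxx yyy", "xxx yyy aaa", "aaa only"):
        print(repr(sample), "->", two_step(sample))
```

Lines without the mandatory groups are rejected before any optional term is considered, which is where the original single pattern spends most of its backtracking.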
If the actual problem is why the second pattern doesn't capture the bbbaaa sequence: the .*? after yyy does not advance the position, so the pattern cannot continue because there is a space. Try it like this: ^.*?(?:(aaa)|.*?).*?(xxx).*?(yyy)(?:(.*?)([^ n]+)|.*?).*?$ . I've added capture group 4 for demonstration purposes only. – wp78de
Jul 1 at 8:26
I wonder why you write (?:(aaa)|.*?) when you could just write (?:(aaa)?) instead. If (aaa) is not present, your expression turns into .*?.*? which is nonsense. – JohnyL
Jul 1 at 9:18
1 Answer
A stricter approach could solve the issue. Instead of .*? we use a tempered greedy token that allows everything but the optional search term:
^(?:(?!bla).)*(bla)?(?:(?!bla).)*$
Demo
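As a quick check of how the anchored pattern behaves, here is a small Python sketch. The sample lines are made up for illustration; the regex itself is the one from the answer.

```python
import re

# Tempered greedy token: each (?:(?!bla).) step consumes one character,
# but only where "bla" does not start at that position.
TEMPERED = re.compile(r"^(?:(?!bla).)*(bla)?(?:(?!bla).)*$")

if __name__ == "__main__":
    for sample in ("foo bla bar", "foo bar", "bla x bla"):
        m = TEMPERED.match(sample)
        if m is None:
            print(repr(sample), "-> no match")
        else:
            print(repr(sample), "-> group 1:", m.group(1))
```

One consequence worth knowing: with both anchors in place, a line that contains the term twice does not match at all, because neither tempered token is allowed to step over the second occurrence.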
This performs much better than the original approach (even on large text) and is still easy to extend and maintain; e.g. you can add additional restricted terms to the tempered token with alternations:
(?:(?!bla|blub).)*(bla|blub)?(?:(?!bla|blub).)*
Demo (Note: the start/end anchors were removed)
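Extending the token with more terms is mechanical, so it can be generated. A minimal Python sketch, assuming a hypothetical helper build_tempered that takes a list of literal terms (the helper and first_term are not part of the answer):

```python
import re

def build_tempered(terms):
    """Build the unanchored tempered pattern for a list of literal terms."""
    alt = "|".join(re.escape(t) for t in terms)
    # Same shape as the answer's pattern: filler, optional term, filler.
    return re.compile(rf"(?:(?!{alt}).)*({alt})?(?:(?!{alt}).)*")

def first_term(line, terms):
    """Return the first listed term found in the line, or None."""
    m = build_tempered(terms).match(line)
    return m.group(1) if m else None

if __name__ == "__main__":
    for sample in ("x blub y", "x bla y", "nothing"):
        print(repr(sample), "->", first_term(sample, ["bla", "blub"]))
```

If one term is a prefix of another, the order of the alternation decides which one is captured, so longer terms should probably be listed first.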
This rexegg.com tutorial explains pretty much everything about the greedy token and shows possible variations.
Code Sample
It does not seem to work. Could you show me an example for two capturing groups?
– Gigi
Jul 2 at 9:23
@Gigi I pulled the answer since I am busy and cannot check it right now, but I will come back to it later today.
– wp78de
Jul 2 at 17:09
@Gigi updated it
– wp78de
2 days ago
I am looking for a solution for the case where the order of the optional groups also matters. I have tried this with no success: (?:(?!bla).)*(bla)?(?:(?!bla).)*.*?(?:(?!blub).)*(blub)?(?:(?!blub).)* There may also be repetitions, meaning I have to search for a series of optional groups like: aaa, aaa, bbb, aaa. I know that in such a case, when one or more of the groups are missing, it is hard to tell which is which, but that's another problem.
– Gigi
2 days ago
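One possible reason the attempt above captures nothing for blub: the second tempered token only forbids bla, so it greedily consumes blub before the (blub)? group gets a chance. A hedged Python sketch of one alternative follows, where each filler stops before the next term of interest. The ORDERED pattern and the helper ordered_groups are assumptions of mine, not something proposed in the thread.

```python
import re

# Attempt quoted in the comment above (two independent tempered blocks).
ATTEMPT = re.compile(
    r"(?:(?!bla).)*(bla)?(?:(?!bla).)*.*?(?:(?!blub).)*(blub)?(?:(?!blub).)*"
)

# Assumed alternative: the first filler stops at either term, the second
# filler stops at blub, so each optional group is tried where its term starts.
ORDERED = re.compile(
    r"^(?:(?!bla|blub).)*(bla)?(?:(?!blub).)*(blub)?.*$"
)

def ordered_groups(line: str):
    """Return (bla, blub) captures in order, None for a missing term."""
    m = ORDERED.match(line)
    return m.groups() if m else None

if __name__ == "__main__":
    line = "x bla y blub z"
    print("attempt:", ATTEMPT.match(line).groups())
    print("ordered:", ordered_groups(line))
    print("swapped:", ordered_groups("x blub y bla z"))
```

With the terms in the wrong order, this sketch keeps blub and ignores the later bla. Which term should win in that situation is a design choice, and repeated terms would need further groups built the same way.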
Hi. Please include a tag for one specific engine or language (perl, pcre, c#, etc…). Regular expression questions get better answers if they… show the pattern that isn't working, provide some examples of input text that should match, and also ones that shouldn't match. Describe the desired results, and how the pattern isn't producing them.
– wp78de
Jul 1 at 5:13