RegEx - find a word inside a specific section of a file
I am trying to set up an alarm in a piece of weather software to look at a forecast for my area and tell me if the word "severe" appears in the upcoming forecast. I am looking at the following text file (shortened down a bit):
000
FPUS55 KBOU 301529
ZFPBOU
Zone Forecast Product for Northeast Colorado
National Weather Service Denver/Boulder CO
929 AM MDT Sat Jun 30 2018
COZ042-044-010615-
Northeast Weld County-Morgan County-
including Briggsdale, Grover, Pawnee Buttes, Raymer, Stoneham,
Brush, Fort Morgan, Goodrich, and Wiggins
929 AM MDT Sat Jun 30 2018
.REST OF TODAY...Chance of thunderstorms early in the afternoon.
Thunderstorms likely late in the afternoon. Some thunderstorms
may be severe with large hail. Highs 68 to 74. Northeast winds 10
to 15 mph with gusts to around 25 mph. Chance of thunderstorms 70
percent.
.TONIGHT...Mostly cloudy with a 30 percent chance of
thunderstorms in the evening, then mostly clear after midnight.
Some thunderstorms may be severe. Lows near 50. North winds 10 to
15 mph with gusts to around 25 mph in the evening becoming light.
.SUNDAY...Mostly sunny. Warmer. Highs in the 80s.
.SUNDAY NIGHT...Mostly clear. Lows in the mid to upper 50s. South
winds 10 to 15 mph.
.MONDAY...Mostly sunny. Highs near 90.
.MONDAY NIGHT AND TUESDAY...Partly cloudy with a 10 percent
chance of thunderstorms. Lows near 60. Highs in the lower to mid
90s.
.TUESDAY NIGHT AND Independence Day...Partly cloudy. Lows near
60. Highs in the 90s.
.WEDNESDAY NIGHT AND THURSDAY...Partly cloudy with a 10 percent
chance of thunderstorms. Lows near 60. Highs in the lower to mid
90s.
.THURSDAY NIGHT...Partly cloudy with a 30 percent chance of
thunderstorms. Lows near 60.
.FRIDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs in the lower to mid 90s.
$$
COZ048>051-010615-
Logan County-Washington County-Sedgwick County-Phillips County-
including Crook, Merino, Sterling, Peetz, Akron, Cope,
Last Chance, Otis, Julesburg, Ovid, Sedgwick, Amherst, Haxtun,
and Holyoke
929 AM MDT Sat Jun 30 2018
.REST OF TODAY...Chance of showers and slight chance of
thunderstorms early in the afternoon. Showers likely and chance
of thunderstorms late in the afternoon. Highs in the lower 70s.
North winds 10 to 20 mph. Chance of precipitation 60 percent.
.TONIGHT...Mostly cloudy with a 50 percent chance of
thunderstorms in the evening, then mostly clear after midnight.
Some thunderstorms may be severe. Lows in the lower to mid 50s.
North winds 10 to 15 mph with gusts to around 25 mph in the
evening becoming light.
.SUNDAY...Mostly sunny. Highs in the mid 80s.
.SUNDAY NIGHT...Mostly clear. Lows near 60. South winds 10 to
15 mph.
.MONDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs in the lower 90s. South winds 10 to 15 mph.
.MONDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows near 60.
.TUESDAY...Partly cloudy. Highs in the mid 90s.
.TUESDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows in the lower to mid 60s.
.INDEPENDENCE DAY...Partly cloudy. Highs in the mid 90s.
.WEDNESDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows in the lower to mid 60s.
.THURSDAY...Partly cloudy with a chance of rain showers and
slight chance of thunderstorms. Highs in the lower 90s. Chance of
precipitation 30 percent.
.THURSDAY NIGHT...Partly cloudy with a 30 percent chance of
thunderstorms. Lows in the lower to mid 60s.
.FRIDAY...Partly cloudy. Highs in the lower 90s.
$$
COZ046-010615-
North and Northeast Elbert County Below 6000 Feet/North Lincoln
County-
including Agate, Hugo, Limon, and Matheson
929 AM MDT Sat Jun 30 2018
.REST OF TODAY...Mostly cloudy. Chance of rain showers and slight
chance of thunderstorms early in the afternoon. Chance of
thunderstorms late in the afternoon. Some thunderstorms may be
severe late in the afternoon. Highs in the mid 70s. North winds
15 to 25 mph. Chance of precipitation 40 percent.
.TONIGHT...Mostly cloudy with a 50 percent chance of
thunderstorms in the evening, then partly cloudy after midnight.
Lows around 50. North winds 10 to 20 mph in the evening becoming
light.
.SUNDAY...Mostly sunny. Highs in the lower 80s. South winds 10 to
15 mph in the afternoon.
.SUNDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows in the mid to upper 50s. South winds 10 to
15 mph.
.MONDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs near 90. South winds 10 to 15 mph.
.MONDAY NIGHT...Partly cloudy with a 10 percent chance of
thunderstorms. Lows in the mid 50s to lower 60s.
.TUESDAY THROUGH INDEPENDENCE DAY...Partly cloudy. Highs in the
lower to mid 90s. Lows in the mid 50s to lower 60s.
.WEDNESDAY NIGHT...Mostly cloudy with a 20 percent chance of
thunderstorms. Lows near 60.
.THURSDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs around 90.
.THURSDAY NIGHT...Partly cloudy with a 30 percent chance of
thunderstorms. Lows near 60.
.FRIDAY...Partly cloudy with a 10 percent chance of
thunderstorms. Highs in the upper 80s.
$$
So, I want to look inside the group for Washington County, which is the second section of the above forecast. The phrase "Washington County" will always appear in the heading for my county's section of the forecast, and "$$" will always conclude each section of the forecast. As an example, I have figured out that the RegEx expression
Washington County([\D\S]*?)\${2}
will find all of the text in my portion of the forecast. Then, specifically inside my county's portion of the forecast, I'm interested in the "TONIGHT" forecast period. I have figured out that the RegEx expression
\.TONIGHT[\D\S]*?(?=\s\.)
will find the "TONIGHT" forecast period for all of the forecast sections. And, of course, the RegEx expression
severe
will find all of the instances of "severe" throughout the file. Where I am having trouble is trying to put all three together and get a result only when the word "severe" occurs in the "TONIGHT" forecast period inside the "Washington County" forecast section. When I try putting these all together, I find that RegEx will find the words that I'm looking for, but it will reach out into adjacent forecast sections. Is there a way to make this only search between "Washington County" and the very next instance of "$$" to be sure that I don't spill over into the next forecast section and return a false positive?
Many thanks to anybody that can help me with this. I'm pretty new to RegEx, so I just don't have a good idea for how to limit down the area that I am searching.
The software that I am setting this up in will basically evaluate this alarm as true if the straight RegEx expression returns a result. If the RegEx expression doesn't come back with any matches, then the software will discard this bulletin. If the RegEx expression does return a match, then this bulletin is one that I'm interested in so the software processes this bulletin (emails it out, sends it out via SMS, parses it into a shorter message, etc.). Unfortunately, the manual doesn't tell me exactly what flavor of RegEx I'm using, so I have to tweak some expressions a bit to make them work.
– dialupisbad
Jun 30 at 21:21
3 Answers
You can achieve what you want by using negative lookahead assertions.
For example,
Ab(?!c).
matches Ab followed by any single character other than c, and
Ab((?!c).)+
matches Ab and then keeps matching any character until it hits a c.
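The two toy patterns can be checked quickly. This is only a sketch in Python's re module, chosen for illustration; the variable names and test strings are made up here, and the software in the question may use a different regex flavor.

```python
import re

# "Ab" followed by exactly one character that is not "c"
single = re.compile(r"Ab(?!c).")

# "Ab", then one or more characters, each checked by the lookahead,
# so the match stops just before the first "c"
tempered = re.compile(r"Ab((?!c).)+")

print(single.search("Abd").group(0))           # Abd
print(single.search("Abc"))                    # None
print(tempered.search("Abxyzcdef").group(0))   # Abxyz
```

The second pattern is the one that matters for the forecast: the lookahead is re-tested before every character, so the match can never run past the stop marker.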
In your case, we want to keep matching unless we hit the $$ on a new line at the end of the section. To do that, we can use
Washington County((?!\R\$\$)[\s\S])+
The [\s\S] matches any character, but the (?!\R\$\$) forces it to stop matching if it hits the $$.
Expanding that concept a bit, you can come up with a final expression to match severe only in the .TONIGHT section of your text block:
Washington County((?!\R\$\$)[\s\S])+\R\.TONIGHT((?!\R\.)[\s\S])+severe
It breaks down as follows:
Washington County((?!\R\$\$)[\s\S])+\R\.TONIGHT
Match everything in the Washington County block until we hit the TONIGHT section.
((?!\R\.)[\s\S])+
Keep matching from that point forward until we hit a linebreak followed by a period. That would signify that we're leaving the TONIGHT section. We need this part of the regex to limit the query to only matching in the TONIGHT section and not spilling over beyond it.
severe
Match "severe" in the TONIGHT section.
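To see that the combined expression stays inside one section, here is a rough test in Python. Two assumptions: Python's re module has no \R, so \r?\n stands in for it, and the bulletin text is a shortened, made-up sample built by a hypothetical helper, not the real product.

```python
import re

LINEBREAK = r"\r?\n"  # assumption: stand-in for PCRE's \R, which Python's re lacks

ALARM = re.compile(
    r"Washington County((?!" + LINEBREAK + r"\$\$)[\s\S])+"
    + LINEBREAK + r"\.TONIGHT((?!" + LINEBREAK + r"\.)[\s\S])+severe"
)

def bulletin(today, tonight, next_tonight):
    """Build a tiny, made-up two-section bulletin (hypothetical helper)."""
    return (
        "COZ048>051-010615-\n"
        "Logan County-Washington County-\n"
        f".REST OF TODAY...{today}\n"
        f".TONIGHT...{tonight}\n"
        ".SUNDAY...Mostly sunny.\n"
        "$$\n"
        "COZ046-010615-\n"
        "North Lincoln County-\n"
        f".TONIGHT...{next_tonight}\n"
        ".SUNDAY...Mostly sunny.\n"
        "$$\n"
    )

quiet = "Mostly cloudy."
# "severe" in Washington County's TONIGHT period, wrapped onto a second line
hit = bulletin(quiet, "Some thunderstorms may be\nsevere. Lows near 50.", quiet)
# "severe" only in the following section's TONIGHT period
spill = bulletin(quiet, quiet, "Some thunderstorms may be severe.")
# "severe" only in Washington County's REST OF TODAY period
wrong_period = bulletin("Some thunderstorms may be severe.", quiet, quiet)

print(bool(ALARM.search(hit)))           # True
print(bool(ALARM.search(spill)))         # False
print(bool(ALARM.search(wrong_period)))  # False
```

Only the first bulletin triggers the alarm; the other two contain "severe" but outside the TONIGHT period of the Washington County section.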
You started this well, but at the putting-together part you have to write two more RegEx and replace
[Regex one for the city] [Regex two for the TONIGHT] [RegEx 3 for
severe]
with
[Regex one for the city] [Plus one for Any but no city] [Regex two for
the TONIGHT] [Plus One for Any but new section] [RegEx 3 for severe]
That's a start ...
Spoiler: Maybe the "no city" can be (?!.*\$\$ ).* and the "no new section" can be (?!.*\.[A-Z])
– n3ko
Jun 30 at 18:56
Thanks. I was trying to put my expressions together in the wrong order. The negative lookahead answer together with this order of operations got me up and running.
– dialupisbad
Jun 30 at 21:25
As a practical matter, you can split this file into blocks using \n$$\n as the delimiter. Any of sed, awk, perl, etc. can do that, and then a simple regex against each block will do what you wish.
Example in awk:
awk 'BEGIN {RS="\n\\$\\$\n"} /Washington County/ && /severe/ {print $0}' file
That will print the entire block between the two $$ if that block contains both 'Washington County' and 'severe'.
If you wanted to only print the header of the section (the location) and the particular time with 'severe' in it, you can further subdivide into sections like so:
awk 'BEGIN {RS="\n\\$\\$\n"; FS="\n\\."} /Washington County/ && /severe/ {
print $1; for (i=1;i<=NF;i++) if(match($i, /severe/)) print $i}' file
That prints:
COZ048>051-010615- Logan County-Washington County-Sedgwick
County-Phillips County- including Crook, Merino, Sterling, Peetz,
Akron, Cope, Last Chance, Otis, Julesburg, Ovid, Sedgwick, Amherst,
Haxtun, and Holyoke 929 AM MDT Sat Jun 30 2018
TONIGHT...Mostly cloudy with a 50 percent chance of thunderstorms in
the evening, then mostly clear after midnight. Some thunderstorms may
be severe. Lows in the lower to mid 50s. North winds 10 to 15 mph with
gusts to around 25 mph in the evening becoming light.
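For readers without awk, the same split-then-search idea might look like this in Python. It is only a sketch: the function name, its parameters, and the shortened sample text are invented for illustration, and unlike the awk one-liner it returns the header even when no period mentions the word.

```python
import re

def severe_periods(text, county="Washington County", word="severe"):
    """Return (header, periods containing `word`) for the first section
    that names `county`, or None if no section names it."""
    for block in re.split(r"\n\$\$\n?", text):    # sections end with a line of $$
        if county not in block:
            continue
        header, *periods = re.split(r"\n\.", block)  # periods start with "." on a new line
        return header, [p for p in periods if word in p]
    return None

sample = (
    "COZ048>051-010615-\n"
    "Logan County-Washington County-\n"
    ".REST OF TODAY...Showers likely.\n"
    ".TONIGHT...Some thunderstorms may be\nsevere.\n"
    ".SUNDAY...Mostly sunny.\n"
    "$$\n"
    "COZ046-010615-\n"
    "North Lincoln County-\n"
    ".TONIGHT...Mostly cloudy.\n"
    "$$\n"
)

header, hits = severe_periods(sample)
print(header)
print(hits)
```

Splitting first means each later test is an ordinary substring or regex check on one section, so nothing can leak across the $$ boundary.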
Thanks, but I unfortunately do not have any other programming language available in Weather Message other than the ability to feed it a straight-up RegEx expression.
– dialupisbad
Jun 30 at 21:23
Do you want to replace that word with something else? Is there always a single occurrence? What is your regex flavor? See this PCRE regex demo matching just the word severe.
– Wiktor Stribiżew
Jun 30 at 18:21