Regex: extract URLs and their labels from an HTML file
I have a large HTML file exported from Google Docs. When opened in a text editor, it appears as a single line. It contains many URLs that I need to extract. They appear in this form:
<a class="c13" href="https://www.google.com/url?q=https://example.com/page1/&sa=D&ust=1530382105580000">Text Label One</a>SOME HTML TEXT HERE<a class="c13" href="https://www.google.com/url?q=https://example.com/page2/&sa=D&ust=1530382105719000">Text Label Two</a>
So far I have found this solution:
(?<=https://www\.google\.com/url\?q=)(.*?)(?=&)
which produces:
https://example.com/page1/
https://example.com/page2/
How can I also extract the labels, in this form?
https://example.com/page1/ Text Label One
https://example.com/page2/ Text Label Two
Does this help? regex101.com/r/Bh9AGD/2
– NoobProgrammer
Jul 1 at 6:27
No, I need each URL to appear just once in the output (I see that my "solution" also produces two outputs per URL). And I need the text label in each output. Here is what I have: regex101.com/r/Bh9AGD/1
– Serg
Jun 30 at 18:53
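One possible approach, sketched in Python rather than as a pure editor regex: match each whole anchor once and capture the target URL and the label as two groups, so every link yields exactly one output line. The names `LINK_RE`, `extract_links`, and `sample` are made up for this illustration, and the pattern assumes every link has the shape shown in the question (a double-quoted `href` starting with `https://www.google.com/url?q=`, and a label with no nested `</a>`).

```python
import re

# Group 1: the target URL, i.e. everything after "url?q=" up to the first "&".
# Group 2: the link label, i.e. everything between ">" and the closing "</a>".
LINK_RE = re.compile(
    r'<a\b[^>]*?href="https://www\.google\.com/url\?q=([^&"]+)[^"]*"[^>]*>(.*?)</a>',
    re.DOTALL,
)


def extract_links(html):
    """Return one (url, label) tuple per matching anchor, in document order."""
    return [(url, label.strip()) for url, label in LINK_RE.findall(html)]


# Sample input copied from the question.
sample = (
    '<a class="c13" href="https://www.google.com/url?q=https://example.com/page1/'
    '&sa=D&ust=1530382105580000">Text Label One</a>SOME HTML TEXT HERE'
    '<a class="c13" href="https://www.google.com/url?q=https://example.com/page2/'
    '&sa=D&ust=1530382105719000">Text Label Two</a>'
)

for url, label in extract_links(sample):
    print(url, label)
# https://example.com/page1/ Text Label One
# https://example.com/page2/ Text Label Two
```

If a label contains nested tags such as `<span>`, they will show up in the captured text, and percent-encoded targets would still need decoding (for example with `urllib.parse.unquote`). For messier input, an HTML parser is likely to be more robust than a regex.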