Regex: extract URLs and their labels from an HTML file



I have a large HTML file exported from Google Docs. When opened in a text editor, it appears as a single line. It contains many URLs that I need to extract. They appear in this form:



<a class="c13" href="https://www.google.com/url?q=https://example.com/page1/&amp;sa=D&amp;ust=1530382105580000">Text Label One</a>SOME HTML TEXT HERE<a class="c13" href="https://www.google.com/url?q=https://example.com/page2/&amp;sa=D&amp;ust=1530382105719000">Text Label Two</a>



So far I have found this solution:


(?<=https://www\.google\.com/url\?q=)(.*?)(?=&amp)



It produces:


https://example.com/page1/
https://example.com/page2/



How can I also extract the labels, in this form?


https://example.com/page1/ Text Label One
https://example.com/page2/ Text Label Two
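
One possible approach, sketched here in Python since the question does not name a language: match the whole anchor in one pattern and capture the URL and the label as two groups, instead of using a lookbehind for the URL alone. The names `SAMPLE`, `LINK_RE` and `extract_links` are made up for this illustration. The sketch assumes every link has the Google redirect form shown in the question, that the target URL is followed by `&amp;`, and that the label contains no nested `</a>`. For arbitrary HTML a real parser would be more robust than a regex.

```python
import html
import re

# Sample input copied from the question; a real export would be read from a file.
SAMPLE = (
    '<a class="c13" href="https://www.google.com/url?q=https://example.com/page1/'
    '&amp;sa=D&amp;ust=1530382105580000">Text Label One</a>SOME HTML TEXT HERE'
    '<a class="c13" href="https://www.google.com/url?q=https://example.com/page2/'
    '&amp;sa=D&amp;ust=1530382105719000">Text Label Two</a>'
)

# Group 1: the target URL after "url?q=", up to the first "&amp;".
# Group 2: the link text, up to the closing </a>.
# Note the escaped "." and "?" in the literal part of the pattern.
LINK_RE = re.compile(
    r'<a\b[^>]*?href="https://www\.google\.com/url\?q=([^"&]+)&amp;[^"]*"[^>]*>(.*?)</a>',
    re.DOTALL,
)


def extract_links(page):
    """Return (url, label) pairs, one per distinct URL, in document order."""
    seen = set()
    pairs = []
    for url, label in LINK_RE.findall(page):
        if url in seen:
            continue  # each URL should appear only once in the output
        seen.add(url)
        # Decode entities such as &amp; in the label and trim whitespace.
        pairs.append((url, html.unescape(label).strip()))
    return pairs


for url, label in extract_links(SAMPLE):
    print(url, label)
```

For the sample above this prints:

```
https://example.com/page1/ Text Label One
https://example.com/page2/ Text Label Two
```

In a regex tool that supports replacement, the same two groups could be combined with a substitution such as `$1 $2`.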





No, I need each URL to appear just once in the output (I see that my "solution" also produces two outputs per URL). And I need the text label in each output. Here is what I have: regex101.com/r/Bh9AGD/1
– Serg
Jun 30 at 18:53





Does this help? regex101.com/r/Bh9AGD/2
– NoobProgrammer
Jul 1 at 6:27








