Regex: extract URLs and their labels from an HTML file




I have a large HTML file exported from Google Docs. When opened in a text editor, it appears as a single line. It contains many URLs that I need to extract. They appear in this form:



<a class="c13" href="https://www.google.com/url?q=https://example.com/page1/&amp;sa=D&amp;ust=1530382105580000">Text Label One</a>SOME HTML TEXT HERE<a class="c13" href="https://www.google.com/url?q=https://example.com/page2/&amp;sa=D&amp;ust=1530382105719000">Text Label Two</a>





So far I have found this solution:


(?<=https://www\.google\.com/url\?q=)(.*?)(?=&amp)



This produces:


https://example.com/page1/
https://example.com/page2/



How can I also extract the labels, so that the output looks like this?


https://example.com/page1/ Text Label One
https://example.com/page2/ Text Label Two
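
One way to get that output is to capture the URL and the label as two groups in a single pattern, rather than using lookarounds. The following is only a minimal sketch in Python, assuming every link looks like the sample above (a double-quoted href that starts with the Google redirect prefix, followed by `&amp;`). The names `SAMPLE`, `LINK_RE` and `extract_links` are made up for illustration, and the percent-decoding step is an assumption about how the export may encode the target URL.

```python
import html
import re
from urllib.parse import unquote

# Sample input copied from the snippet in the question.
SAMPLE = (
    '<a class="c13" href="https://www.google.com/url?q=https://example.com/page1/'
    '&amp;sa=D&amp;ust=1530382105580000">Text Label One</a>SOME HTML TEXT HERE'
    '<a class="c13" href="https://www.google.com/url?q=https://example.com/page2/'
    '&amp;sa=D&amp;ust=1530382105719000">Text Label Two</a>'
)

# Group 1: the target URL (everything after "url?q=" up to the first "&amp;").
# Group 2: the link text (everything between the opening tag and "</a>").
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref="https://www\.google\.com/url\?q=([^"&]+)&amp;[^"]*"[^>]*>(.*?)</a>',
    re.DOTALL,
)


def extract_links(text):
    """Return (url, label) pairs in document order, keeping each URL once."""
    seen = set()
    links = []
    for raw_url, raw_label in LINK_RE.findall(text):
        # Assumption: the target may be percent-encoded; unquote leaves
        # plain URLs such as the sample ones unchanged.
        url = unquote(raw_url)
        if url in seen:
            continue
        seen.add(url)
        # Decode entities such as &amp; that may appear in the label.
        links.append((url, html.unescape(raw_label)))
    return links


for url, label in extract_links(SAMPLE):
    print(url, label)
# https://example.com/page1/ Text Label One
# https://example.com/page2/ Text Label Two
```

Labels that contain nested tags (for example a `<span>` inside the link) would come through with the markup intact, so an HTML parser may be a better fit if the export is less regular than the sample.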





No, I need each URL to appear just once in the output (I see that my "solution" also produces two outputs per URL), and I need the text label in each output. Here is what I have: regex101.com/r/Bh9AGD/1
– Serg
Jun 30 at 18:53





Does this help? regex101.com/r/Bh9AGD/2
– NoobProgrammer
Jul 1 at 6:27








