How to remove or extract all hyperlinks from a web page using a regular expression
October 6th, 2010
This trick helps you remove or extract all hyperlinks, along with their text, from a page.
Example Text:
< a href="http://article-stack.com">article-stack< /a>
< a href="http://article-stack.com">article-stack< /a>
< a href="http://article-stack.com">article-stack< /a>< a href="http://article-stack.com">article-stack< /a>
In the sample text above, the last two hyperlinks are on the same line.
Regular Expression
< a [a-zA-Z0-9\=\"\:\.\,\/\- ]*>.*<\/a>
or
< a.*>.*<\/a>
Output
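[0] => Array
(
[0] => < a href="http://article-stack.com">article-stack< /a>
[1] => < a href="http://article-stack.com">article-stack< /a>
[2] => < a href="http://article-stack.com">article-stack< /a>< a href="http://article-stack.com">article-stack< /a>
)
Because .* is greedy, it runs to the last closing tag on a line, so the two hyperlinks on the third line come back as a single match.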
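The greedy behaviour can be reproduced with a short Python sketch. It assumes Python's standard re module and writes the sample as ordinary markup without the space after "<" (that spacing is assumed to be a display workaround in the post). The variable names are illustrative only.

```python
import re

# Sample text from the article, one link per line except the last line,
# which holds two links. Assumption: real markup has no space after "<".
sample = (
    '<a href="http://article-stack.com">article-stack</a>\n'
    '<a href="http://article-stack.com">article-stack</a>\n'
    '<a href="http://article-stack.com">article-stack</a>'
    '<a href="http://article-stack.com">article-stack</a>'
)

# First expression from the article, with the display spacing removed.
greedy = re.compile(r'<a [a-zA-Z0-9\=\"\:\.\,\/\- ]*>.*<\/a>')

# "." does not cross newlines by default, so each line is matched separately,
# and the greedy ".*" stretches to the last "</a>" on that line.
greedy_matches = greedy.findall(sample)

for i, match in enumerate(greedy_matches):
    print(i, match)
```

Four links go in but only three matches come out, because the last match swallows both links on the third line.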
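Improved Regular Expression
< a [a-zA-Z0-9\=\"\:\.\,\/\- ]*>(.[^(<\/a>)])*.<\/a>
Note that [^(<\/a>)] is a character class that excludes the single characters (, <, /, a, > and ), not the string <\/a>. The group therefore consumes the link text two characters at a time, which is why the second array in the output holds "ac", the last pair captured. The pattern fits this sample, but plain link text of even length (for example "home") will not match; a non-greedy .*? in place of the group is more robust.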
Output
[0] => Array
(
[0] => < a href="http://article-stack.com">article-stack< /a>
[1] => < a href="http://article-stack.com">article-stack< /a>
[2] => < a href="http://article-stack.com">article-stack< /a>
[3] => < a href="http://article-stack.com">article-stack< /a>
)
[1] => Array
(
[0] => ac
[1] => ac
[2] => ac
[3] => ac
)
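As a hedged comparison, the sketch below runs the improved expression next to a non-greedy alternative in Python. The non-greedy pattern is a substitute technique, not the article's own expression, and the second link text ("home") is a made-up example chosen because it has an even number of characters. As before, the markup is written without the space after "<".

```python
import re

# Improved expression from the article, with the display spacing removed.
improved = re.compile(r'<a [a-zA-Z0-9\=\"\:\.\,\/\- ]*>(.[^(<\/a>)])*.<\/a>')

# Alternative (not from the article): "[^>]*" skips the attributes and the
# lazy ".*?" stops at the first closing tag. Group 1 is the link text.
lazy = re.compile(r'<a\s[^>]*>(.*?)</a>')

# Two links on one line; "home" is a hypothetical link text of even length.
line = (
    '<a href="http://article-stack.com">article-stack</a>'
    '<a href="http://article-stack.com">home</a>'
)

improved_matches = [m.group(0) for m in improved.finditer(line)]
lazy_texts = lazy.findall(line)

print(len(improved_matches))  # 1: the "home" link is missed
print(lazy_texts)             # ['article-stack', 'home']
```

The lazy version also returns the whole link text in its capture group, rather than only the last two-character chunk.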
You can apply these expressions from a programming language such as Java, awk or PHP, or in any text editor that supports regular-expression search.
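For the removal half of the task, a minimal Python sketch might look like the following. The function names remove_links and unwrap_links and the sample sentence are hypothetical, and the non-greedy pattern with the IGNORECASE and DOTALL flags is an assumption rather than the article's expression. A regular expression like this suits simple, well-formed markup; nested or malformed tags may need a real HTML parser.

```python
import re

# Non-greedy link pattern (an assumption, not the article's expression).
# IGNORECASE also accepts <A HREF=...>; DOTALL lets link text span lines.
LINK = re.compile(r'<a\s[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def remove_links(html: str) -> str:
    """Delete every hyperlink together with its text."""
    return LINK.sub('', html)


def unwrap_links(html: str) -> str:
    """Drop the <a> tags but keep the link text (group 1)."""
    return LINK.sub(r'\1', html)


page = 'See <a href="http://article-stack.com">article-stack</a> for more.'

print(remove_links(page))  # See  for more.
print(unwrap_links(page))  # See article-stack for more.
```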
