URL Regex

  • SpooF
  • ٩๏̯͡๏۶
  • Bronze Member
  • Posts: 3422
  • Loc: Richland, WA

Post 3+ Months Ago

I'm trying to write an expression to remove all URLs from a string. So far I have this:

Code: [ Select ]
(https?:\/\/[.\w]{1,}\/?\S+)


It will match every URL in this string except the one starting with www. It should be a very simple addition to the regex, but I can't seem to figure it out.

Code: [ Select ]
asd sadgf fdgd http://google.com/asd.php?asdasd jdjd http://bit.ly/AdsWG asjjg www.blah.com/gfds http://google.com/


I'm using Ruby, if that makes much of a difference. I've been testing the expression using this tool: http://rubular.com/
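
For reference, here is a rough Ruby sketch of what that pattern does with the sample string. The variable names are only for illustration and are not from any library.

```ruby
# Sample string from the post above.
text = "asd sadgf fdgd http://google.com/asd.php?asdasd jdjd http://bit.ly/AdsWG asjjg www.blah.com/gfds http://google.com/"

# The pattern from the post, unchanged.
url_pattern = /(https?:\/\/[.\w]{1,}\/?\S+)/

# scan returns one array per match because of the capture group,
# so flatten turns the result into a plain list of strings.
matches = text.scan(url_pattern).flatten
puts matches

# Removing the matches leaves the www. URL behind.
stripped = text.gsub(url_pattern, "")
puts stripped
```

The three http:// URLs are matched, while www.blah.com/gfds is left in the string.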
  • SpooF
  • ٩๏̯͡๏۶
  • Bronze Member
  • Posts: 3422
  • Loc: Richland, WA

Post 3+ Months Ago

This works but seems kind of messy:
Code: [ Select ]
(https?:\/\/[.\w]{1,}\/?\S+|w{3}[.\w]{1,}\/?\S+)
It's basically running two expressions, one for http and another for www.
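
One possible way to tidy that up, as a sketch only: factor the two prefixes into a non-capturing group so the shared tail is written once. Note this variant requires the dot after www, which the pattern above does not, so it is an assumption on my part rather than an exact equivalent.

```ruby
text = "asd sadgf fdgd http://google.com/asd.php?asdasd jdjd http://bit.ly/AdsWG asjjg www.blah.com/gfds http://google.com/"

# Either "http://", "https://" or "www." followed by the same tail as before.
url_pattern = /(?:https?:\/\/|www\.)[.\w]{1,}\/?\S+/

# No capture groups here, so scan returns the matched strings directly.
matches = text.scan(url_pattern)
puts matches

# Remove the URLs, then collapse the leftover runs of spaces.
stripped = text.gsub(url_pattern, "").squeeze(" ").strip
puts stripped
```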
  • joebert
  • Fart Bubbles
  • Genius
  • Posts: 13504
  • Loc: Florida

Post 3+ Months Ago

It's definitely easier when you just want to locate and remove the URL, rather than parse it into pieces.

If we take a look here, we can get a simplified syntax for an HTTP URL.

Code: [ Select ]
http://<host>:<port>/<path>?<searchpart>


I'll go ahead and start with that for my pattern, being sure to escape characters meaningful to regular expressions.

Code: [ Select ]
http:\/\/<host>:<port>\/<path>\?<searchpart>


Now I want to catch HTTPS, so I'll add a "zero or one time" s to the protocol.

Code: [ Select ]
https?:\/\/<host>:<port>\/<path>\?<searchpart>


Off the top of my head, I believe <host> can consist of alphanumeric characters, hyphens, and dots.

Looking back at that RFC, that's about right. At least, it was until recently, when internationalized TLDs with non-Latin characters were introduced.
Luckily, even with internationalized TLDs the syntax of a URL still follows that original pattern. So in order to account for them, I'll just have to use a more generic pattern for the <host> than I normally would, and make sure whatever regular expression engine I'm using can work with multi-byte character sets.

"Anything other than a port separator (a colon), a path separator (a forward slash), or any kind of whitespace, three or more times" should do it. It's pretty generic, but with the beginning and end of the pattern there to anchor it, there shouldn't be many false positives, and any misses will err on the side of removal.

Code: [ Select ]
https?:\/\/[^:\/\s]{3,}:<port>\/<path>\?<searchpart>
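
A quick Ruby check of the host class on its own, with the remaining placeholders left off. The sample strings are made up for illustration, and the non-Latin one assumes the script and strings are UTF-8.

```ruby
# Protocol plus the generic host class only; <port>, <path> and
# <searchpart> are left out for this check.
host_pattern = /https?:\/\/[^:\/\s]{3,}/

# String#[] with a Regexp returns the matched text, or nil if nothing matches.
latin = "see http://google.com/asd.php"[host_pattern]
idn   = "see http://пример.рф/путь"[host_pattern]   # non-Latin host
short = "see http://a/"[host_pattern]               # host under three characters

p latin
p idn
p short
```

The class stops at the first colon, slash, or whitespace, so the match ends where the port or path would begin.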


The port is optional and will always consist of digits. So I'll wrap that section in a "zero or one time" sub-pattern, and since ports range from 1-65535 I'll limit the number of digits to 1-5.

Code: [ Select ]
https?:\/\/[^:\/\s]{3,}(:\d{1,5})?\/<path>\?<searchpart>
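
To see the optional port group in action, here is a small sketch in Ruby. Only the part of the pattern that is finished at this step is used, and the hosts are placeholder examples.

```ruby
# Protocol, host, and the optional port group (capture group 1).
port_pattern = /https?:\/\/[^:\/\s]{3,}(:\d{1,5})?/

with_port = port_pattern.match("http://localhost:8080/admin")
without   = port_pattern.match("https://google.com/")

p with_port[0]   # overall match
p with_port[1]   # the port group, including its colon
p without[1]     # nil when the URL has no port
```

The digit limit only checks length, so something like :99999 would still be accepted. For locating URLs to remove, that looseness seems harmless.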


Now, the path separator and path are also optional. However, there isn't always a path when there's a path separator, yet there is always a path separator when there's a path. So I'll wrap them both in a "zero or one time" sub-pattern, and since a question mark is what marks the next section of a URL, I'll use "anything other than whitespace or a question mark, zero or more times" after the path separator.

Code: [ Select ]
https?:\/\/[^:\/\s]{3,}(:\d{1,5})?(\/[^\?\s]*)?\?<searchpart>


The <searchpart> works similarly to the <path>.

Code: [ Select ]
https?:\/\/[^:\/\s]{3,}(:\d{1,5})?(\/[^\?\s]*)?(\?[^\s]*)?
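
As a sketch of how the groups line up at this point, the following Ruby snippet pulls the path and searchpart out of one of the sample URLs from earlier in the thread. The surrounding filler words are made up.

```ruby
# Group 1 is the port, group 2 the path, group 3 the searchpart.
url_pattern = /https?:\/\/[^:\/\s]{3,}(:\d{1,5})?(\/[^\?\s]*)?(\?[^\s]*)?/

full = url_pattern.match("go to http://google.com/asd.php?asdasd now")
p full[0]   # the whole URL
p full[2]   # path, including the leading slash
p full[3]   # searchpart, including the question mark

# A URL with a path separator but no path and no searchpart.
bare = url_pattern.match("http://google.com/")
p bare[2]
p bare[3]
```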


Now, one thing that's not included in that RFC is a mention of the <hash> (http://domain.tld/path?searchpart#hash), probably because the hash is used by the browser only and never actually sent to a server. The <hash> works similarly to the <searchpart>, but either of them can exist without the other being there.

The pattern as-is will catch the <hash> already, but only if there's a question mark before it. Since the part of the pattern catching the <searchpart> is so generic, I can swap out that "\?" with a "question mark or pound symbol" and have it catch a query string and/or a hash.

Code: [ Select ]
https?:\/\/[^:\/\s]{3,}(:\d{1,5})?(\/[^\?\s]*)?([\?#][^\s]*)?
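
A rough Ruby check that the overall match covers a URL with a searchpart, a hash, or both. The URLs reuse the domain.tld example above, and the filler text is made up.

```ruby
url_pattern = /https?:\/\/[^:\/\s]{3,}(:\d{1,5})?(\/[^\?\s]*)?([\?#][^\s]*)?/

urls = [
  "http://domain.tld/path?searchpart#hash",
  "http://domain.tld/path?searchpart",
  "http://domain.tld/path#hash"
]

# Embed each URL in filler text and pull out the overall match.
found = urls.map { |url| "before #{url} after"[url_pattern] }
puts found

# Which group holds the hash when there is no question mark.
no_query = url_pattern.match("http://domain.tld/path#hash")
p no_query[2]
p no_query[3]
```

As far as I can tell, when there is a path but no question mark, the hash ends up inside the path group, because that class only excludes question marks and whitespace. For removal, only the overall match matters, and that covers the full URL in each case.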


--

Luckily, I caught an un-escaped forward slash before I posted this. I don't know if you can use alternate delimiters in Ruby, but if you can use something other than the traditional forward slash when working with a URL and regular expressions, you should. I normally use the pound symbol when working with regular expressions, but since there's one in my pattern this time, I'll use a tilde instead.

Code: [ Select ]
~https?://[^:/\s]{3,}(:\d{1,5})?(/[^\?\s]*)?([\?#][^\s]*)?~
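
On the Ruby side, the %r literal accepts alternate delimiters, so the tilde version can be written almost verbatim. Here is a minimal sketch that applies it to the sample string from the first post. The variable names are just for illustration.

```ruby
text = "asd sadgf fdgd http://google.com/asd.php?asdasd jdjd http://bit.ly/AdsWG asjjg www.blah.com/gfds http://google.com/"

# %r with tilde delimiters, so forward slashes need no escaping.
url_pattern = %r~https?://[^:/\s]{3,}(:\d{1,5})?(/[^\?\s]*)?([\?#][^\s]*)?~

# Remove every match, then collapse the leftover runs of spaces.
stripped = text.gsub(url_pattern, "").squeeze(" ").strip
puts stripped
```

The www. URL from the original question is still left behind, since this pattern requires the protocol.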


© 1998-2014. Ozzu® is a registered trademark of Unmelted, LLC.