Regex to split a line but with a catch

  • spork

Can someone come up with a regular expression that splits up a line into chunks based on the following rules?

For example, this line:
Code:
an example line


should be split into:
Code:
an
example
line


However, if there are values (words) within parentheses, they should be considered one unit, so the following line:
Code:
this (is) an (example of what) i am (talking about)


should be split into:
Code:
this
is
an
example of what
i
am
talking about


And this line:
Code:
(this is all in one big block)


shouldn't be split at all.

I've tried a number of different expressions but none seem to do the trick. Any suggestions, or should I just write a routine to parse the strings without regular expressions?
  • UPSGuy

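Code:
<?php

$input = "this (is) an (example of what) i am (talking about)";
// The U modifier makes every quantifier ungreedy. Each match is either a
// parenthesized group or a word plus the single whitespace character after it.
$pattern = '#(\(.+\)|\w+\s)#U';

preg_match_all($pattern, $input, $match);

foreach ($match[0] as $m) { echo "$m<br/>"; }

/*
output:
this
(is)
an
(example of what)
i
am
(talking about)
*/
?>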
 


That's close, but it doesn't lose the parentheses - I think I can make that happen too, though.
  • UPSGuy

I can't figure out a good way to remove the parens without losing the grouping. Unless you're determined to stick to one-pass simplicity, I would either loop through the resulting array and remove the parens, or do something like the code below and combine $match[1] and $match[2] somehow.

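Code:
<?php

$input = "this (is) an (example of what) i am (talking about)";
// Group 1 captures the text inside parentheses; group 2 captures a bare word
// (plus its trailing whitespace). Only one of the two is non-empty per match.
$pattern = '#\((.+)\)|(\w+\s)#U';

preg_match_all($pattern, $input, $match);

// Dump $match[0] (full matches), then $match[1] and $match[2].
foreach ($match as $m) { foreach ($m as $arr) { echo "$arr<br/>"; } echo "--------<br/>"; }

?>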
  • spork

(This isn't PHP-specific, by the way; it just needs to be PCRE.)
  • Bogey

The following worked for me and has one less loop :D (I basically did what UPSGuy said... "combine" $match[1] with $match[2])
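PHP Code:
<?php
$input = "this (is) an (example of what) i am (talking about)";
$pattern = '#\((.+)\)|(\w+\s)#U';

preg_match_all($pattern, $input, $match);

// Group 1 is an empty string when the bare-word branch matched, so fall back to group 2.
// Compare against '' rather than calling empty(), which would also treat "0" as empty.
foreach ($match[1] as $key => $value) {
   if ($value === '')
      echo "{$match[2][$key]}<br />";
   else
      echo "$value<br />";
}
?>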
  • spork

Thanks guys. I'm just looking for the expression though, as I'm using it in a C# application, not PHP, so the PHP code isn't relevant to me.
  • Bogey

Can't you apply the same logic in C# as here? Or does it not work like that?

And another thing... does C# have an equivalent of PHP's preg_replace_callback? I don't know if this would help you do it in C#... just throwing something out there...
  • spork

C# has its own regex library; it's object-oriented and much, much better.
  • Bogey

spork wrote:
C# has its own regex library; it's object-oriented and much, much better.

Alright...
  • UPSGuy

You should still be able to extract the pattern. Since the U (ungreedy) modifier is PHP-side, replace it with lazy quantifiers, which gives: (\(.+?\)|\w+?\s)
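
As a rough check of that pattern against the three lines from the original question, here is a minimal sketch using Python's `re` module. Python is an assumption made purely for illustration (the thread's code is PHP and the target is C#); the constructs used here should behave the same way in all three. The names `lazy_matches` and `split_groups`, and the second pattern, are hypothetical and not from the thread.

```python
import re

# The pattern from the post above, with lazy quantifiers instead of the U modifier.
LAZY = re.compile(r"(\(.+?\)|\w+?\s)")

# Hypothetical variant (not from the thread): group 1 is the text inside
# parentheses, group 2 is a bare word; no trailing whitespace is required.
GROUPED = re.compile(r"\(([^)]*)\)|(\w+)")


def lazy_matches(line):
    """Return the raw matches of the thread's pattern."""
    return [m.group(0) for m in LAZY.finditer(line)]


def split_groups(line):
    """Return the chunks with the parentheses already dropped."""
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in GROUPED.finditer(line)]


if __name__ == "__main__":
    for line in ("an example line",
                 "this (is) an (example of what) i am (talking about)",
                 "(this is all in one big block)"):
        print(lazy_matches(line))
        print(split_groups(line))
```

One thing this shows: because the word branch requires a whitespace character after the word, the thread's pattern skips a bare word at the very end of the input ("line" in the first example) and keeps the trailing space on the words it does match. The variant avoids both, though `\w+` still splits words containing apostrophes or hyphens.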
  • Rabid Dog

Ok I am going to make a fool of myself here but decided to anyway

Code:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace Little_Utils {
  public class DisplayBlock {

    private List<String> _displayLines = new List<string>();

    private DisplayBlock(){}

    public static DisplayBlock CreateInstance(String inputString){
      DisplayBlock block = new DisplayBlock();
      block.Format(inputString);
      return block;
    }

    private void Format(String input){
      //a parenthesized group, or a word followed by one whitespace character
      String pattern = @"(\(.+?\)|\w+?\s)";
      Regex regEx = new Regex(pattern);
      MatchCollection matches = regEx.Matches(input);

      for (int i = 0; i < matches.Count; i++) {
        //strip the parentheses and the whitespace matched after a word
        String line = matches[i].Value.Replace("(", String.Empty).Replace(")", String.Empty).Trim();
        //add the line directly; String.Format(line) would throw on a stray { or }
        _displayLines.Add(line);
      }
    }

    public override string ToString() {
      //the lines were cleaned in Format; Join adds no trailing new line
      return String.Join(Environment.NewLine, _displayLines.ToArray());
    }

    public String GetLine(int lineNumber){
      //this will throw an exception if the lineNumber exceeds the _displayLine count -1.
      //Catch it here or let it go to the caller.
      return _displayLines[lineNumber];
    }

    public int LineCount{
      get{return _displayLines.Count;}
    }

  }
}
 


Usage

Code:

using System;

namespace Little_Utils {
  class Program {
    static void Main(string[] args) {
      DisplayBlock block = DisplayBlock.CreateInstance("this (is) an (example of what) i am (talking about)");
      Console.WriteLine("Formatted : " + block.ToString());
      Console.WriteLine("Number of Lines : " + block.LineCount);
      //the index is zero-based, so 3 is the fourth chunk
      Console.WriteLine("Get line : " + block.GetLine(3));
    }
  }
}
 


Don't forget the regex namespace reference (System.Text.RegularExpressions). Credit to UPSGuy for the pattern :) Might be overkill but it was fun :D

HTH
  • joebert

PHP seems to use the same syntax for lookahead and lookbehind assertions as noted at this page, which has PCRE written all over it (literally).

Quote:
LOOKAHEAD AND LOOKBEHIND ASSERTIONS

(?=...) positive look ahead
(?!...) negative look ahead
(?<=...) positive look behind
(?<!...) negative look behind

Each top-level branch of a look behind must be of a fixed length.


I don't know if any of the other solutions in this thread have worked, but the following code gave me the result shown after it. :)

PHP Code:
<?php

$str = "this (is) an (example of what) i'm talking about. (I'll do it) This is a post-subject sentence. (and a post-subject comment)";

// Optionally consume "(", then capture either text directly preceded by "("
// (spaces allowed) or a single bare word, then optionally consume ")".
preg_match_all('#(?:\()?((?<=\()[\w\s\'-]+|[\w\'-]+)(?:\))?#', $str, $matches);

print_r($matches);
echo "\n";

?>


Code:
Array
(
    [0] => Array
        (
            [0] => this
            [1] => (is)
            [2] => an
            [3] => (example of what)
            [4] => i'm
            [5] => talking
            [6] => about
            [7] => (I'll do it)
            [8] => This
            [9] => is
            [10] => a
            [11] => post-subject
            [12] => sentence
            [13] => (and a post-subject comment)
        )
 
    [1] => Array
        (
            [0] => this
            [1] => is
            [2] => an
            [3] => example of what
            [4] => i'm
            [5] => talking
            [6] => about
            [7] => I'll do it
            [8] => This
            [9] => is
            [10] => a
            [11] => post-subject
            [12] => sentence
            [13] => and a post-subject comment
        )
 
)


It would probably be easier, and use less memory, to match with a simpler expression and then trim the () from the entries. I just couldn't pass up a chance to fool around with assertions. :D
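
For comparison, the "simple expression, then trim" route might look something like the sketch below. It is written in Python's `re` module as an assumption for illustration, not in the thread's PHP; the pattern and the name `split_and_trim` are hypothetical, and the character class is borrowed from the expression above.

```python
import re

# A simple expression: a whole "(...)" group, or a run of word characters,
# apostrophes and hyphens (the same character class used in the post above).
SIMPLE = re.compile(r"\([^)]*\)|[\w'-]+")


def split_and_trim(text):
    """Match with the simple expression, then strip the wrapping parentheses."""
    chunks = []
    for token in SIMPLE.findall(text):
        if token.startswith("("):
            # Only parenthesized matches start with "(", and they always end with ")".
            token = token[1:-1]
        chunks.append(token)
    return chunks


if __name__ == "__main__":
    sample = ("this (is) an (example of what) i'm talking about. (I'll do it) "
              "This is a post-subject sentence. (and a post-subject comment)")
    for chunk in split_and_trim(sample):
        print(chunk)
```

No assertions are needed here; the trade-off is a second pass over the matches, and nested parentheses are not handled.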
  • Rabid Dog

I can't help but add a quote I once read.

If you use regex to solve a problem you end up with two problems :)
  • Bogey

Rabid Dog wrote:
I can't help but add a quote I once read.

If you use regex to solve a problem you end up with two problems :)

I'll give you a third one... I don't know what those two are :lol:
  • spork

Rabid Dog wrote:
I can't help but add a quote I once read.

If you use regex to solve a problem you end up with two problems :)

:lol:

Oh you guys. Thanks for the help. I've actually taken a different approach to the problem without regular expressions but I appreciate all of the input.

The expression you all used does work, but the requirements of what I'm working on kinda changed... drastically, so I don't need it anymore :)
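
For anyone landing here later who wants the non-regex route: the actual replacement code was never posted, so the following is only a guess at what a hand-written routine could look like, sketched in Python for illustration. The function name `split_chunks` is hypothetical, and nested parentheses are treated as plain text inside the outer group.

```python
def split_chunks(line):
    """Split on whitespace, keeping text inside parentheses as one chunk."""
    chunks = []
    current = []
    in_parens = False
    for ch in line:
        if ch == "(" and not in_parens:
            # Flush a word that runs straight into the paren, e.g. "foo(bar)".
            if current:
                chunks.append("".join(current))
                current = []
            in_parens = True
        elif ch == ")" and in_parens:
            # Close the group and emit it without the parentheses.
            chunks.append("".join(current))
            current = []
            in_parens = False
        elif ch.isspace() and not in_parens:
            # Whitespace outside a group ends the current word, if any.
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        # A trailing word, or the contents of an unclosed group.
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    print(split_chunks("an example line"))
    print(split_chunks("this (is) an (example of what) i am (talking about)"))
    print(split_chunks("(this is all in one big block)"))
```

A single pass with one flag is enough because the rules only distinguish "inside a group" from "outside a group".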
  • Rabid Dog

AHHHHH I thought my code was going to get international exposure!

© 1998-2014. Ozzu® is a registered trademark of Unmelted, LLC.