Roy Tang

Programmer, engineer, scientist, critic, gamer, dreamer, and kid-at-heart.

$pee = preg_replace( '|<p>|', "$1<p>", $pee );

This regular expression is from the WordPress source code (formatting.php, wpautop function). I’m not sure what it does; can anyone help?

Actually I’m trying to port this function to Python…if anyone knows of an existing port already, that would be much better as I’m really bad with regex.

Comments

…?

Actually, it looks like this takes each <p> tag and prepends the previous regular expression’s first match to it (since there’s no capturing group in this one).

However, it seems that this behavior is bad to say the least, as there’s no guarantee that preg_* functions won’t clobber $1 with their own values.

Edit: Judging from Jay’s comment, this regex actually does nothing.

WordPress really calls a variable “pee”?

I’m not sure what the $1 stands for (there are no parentheses in the first parameter?), so I don’t think it actually does anything, but I could be wrong.

It replaces each match of the pattern

"|<p>|" 

by the string

"$1<p>"

The | in a pattern normally causes the regex engine to match either the part on the left side or the part on the right side.

I do not get why it’s used that way because usually it’s for something like “ta(b|p)e”…

As for the $1, I guess it is a variable in the PHP code that gets substituted during the preg_replace, so if $1 = “test”; the replacement will change the

"<p>" 

to

"test<p>"

But I am not sure about the $1.

The preg_replace() function - somewhat confusingly - allows you to use other delimiters besides the standard “/” for regular expressions, so

"|<p>|"

Would be a regular expression just matching

"<p>" 

in the text. However, I’m not clear on what the replacement parameter of

"$1<p>" 

would be doing, since there’s no grouping to map to $1. It would seem like as given, this is just replacing a paragraph tag with an empty string followed by a paragraph tag, and in effect doing nothing.

Anyone with more in-depth knowledge of PHP quirks have a better analysis?

Are there previous matches in the same scope that this $1 could refer to?

I highly recommend the amazing RegexBuddy.

The pipe symbols | in this case do not have the default meaning of “match this or that” but are used as alternative delimiters for the pattern instead of the more common slashes /. This can make sense if you want to match / without having to escape each occurrence (e.g. /(.*)\/(.*)\// is not as readable as #/(.*)/(.*)/#). It seems quite counterproductive to use | instead, though, since that is just another reserved character in patterns.
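For a Python port, the delimiters simply disappear, because Python's re module takes the bare pattern. Here is a minimal sketch of stripping them, assuming a hypothetical helper named split_php_regex (it is not part of WordPress or any library) that only handles delimiters whose opening and closing character are the same:

```python
def split_php_regex(php_pattern):
    """Split a PHP-style delimited regex into (pattern, modifiers).

    Hypothetical helper for illustration only. It handles delimiters where
    the opening and closing character are identical (e.g. / # | !), not
    bracket pairs such as ( ) or { }.
    """
    delimiter = php_pattern[0]
    end = php_pattern.rindex(delimiter)  # last occurrence closes the pattern
    if end == 0:
        raise ValueError("missing closing delimiter")
    return php_pattern[1:end], php_pattern[end + 1:]


print(split_php_regex("|<p>|"))           # ('<p>', '')
print(split_php_regex("#/(.*)/(.*)/#i"))  # ('/(.*)/(.*)/', 'i')
```

Any trailing modifiers (such as i) would still need to be mapped by hand to flags like re.IGNORECASE.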

Normally $1 in the replacement pattern should match the first group denoted by parentheses. E.g if you’ve got a pattern like

"(.*)<p>"

$0 would contain the whole match and $1 the part before the <p>.
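The same grouping behaviour can be tried out in Python, which is the target of the port. This is only an illustrative sketch with a made-up input string; group(0) and group(1) play the roles of PHP's $0 and $1:

```python
import re

match = re.search(r"(.*)<p>", "Hello<p>world")
whole = match.group(0)   # what PHP calls $0
before = match.group(1)  # what PHP calls $1
print(whole)   # Hello<p>
print(before)  # Hello

# In a replacement string, Python writes \1 (or \g<1>) where PHP writes $1.
swapped = re.sub(r"(.*)<p>", r"<p>\1", "Hello<p>world")
print(swapped)  # <p>Helloworld
```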

As the given regex does not declare any groups and $1 is not a valid name for a variable (in PHP4) defined elsewhere, this call seems to replace any occurrences of <p> with <p>?

To be honest, now I’m also quite confused. Just a guess: does another pattern-matching function (preg_match and the like) get called before the given line, so that the $1 is “leaked” from there?

I don’t have very much experience with regex and don’t have a regex testing tool on me at the moment, but after doing some searching and looking at other WordPress source code and comments: is it possible this code removes duplicate paragraph tags and replaces them with a single set of tags?

I believe that line does nothing.

For what it’s worth, this is the previous line, in which $1 is set:

$pee = preg_replace('!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!', "<p>$1</p>$2", $pee);

However, I don’t think that’s worth anything. In my testing, $1 does not maintain a value from one preg_replace to the next, even if the next doesn’t set its own value for $1. Remember that PHP variable names cannot begin with a number (see: http://php.net/language.variables ), so $1 is not a PHP variable. It only means something within a single preg_replace, and in this case the rules of preg_replace suggest it doesn’t mean anything.
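For the Python port, that previous line might look roughly like the sketch below. The function name close_p_before_block_end is made up for illustration, and the sample input is mine, not from WordPress. The point is that \1 and \2 belong to this one call only, just as $1 and $2 do in PHP:

```python
import re


def close_p_before_block_end(pee):
    # Python rendering of the previous wpautop line: \1 and \2 refer only
    # to the groups of this one pattern, for this one call.
    return re.sub(
        r"<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)",
        r"<p>\1</p>\2",
        pee,
    )


print(close_p_before_block_end("<div><p>Hello</div>"))
# <div><p>Hello</p></div>
```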

That said, autop being such a widely-used function makes me doubt my own conclusion that this line is doing nothing. So I look forward to someone correcting me.

The regex simply matches the literal text <p>. The choice to delimit the regex with the vertical bar instead of forward slashes is very unfortunate. It doesn’t change the code, but it makes it harder for humans to read. (It also makes it impossible to use the alternation operator in the regex.)

$1 is not a valid variable name in PHP, so $1 is never interpolated in double-quoted strings. The $1 gets passed to preg_replace unchanged. preg_replace parses the replacement string, and replaces $1 with the contents of the first capturing group. If there is no capturing group, $1 is replaced with nothing.

Thus, this code does the same as:

$pee = preg_replace( '/<p>/', "<p>", $pee );

It’s not correct that this does nothing. The search-and-replace will run, slowing down your software, and eating up memory for temporary copies of $pee.
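For the Python port the question asks about, one detail seems worth checking: as far as I know, Python's re module does not substitute an empty string for a reference to a group that doesn't exist, but raises re.error instead. So a literal translation of "$1<p>" would fail, and the simplest port is probably to drop the empty $1 or skip the line altogether. A small sketch with a made-up input string:

```python
import re

pee = "<div><p>Hello</p></div>"

# Dropping the empty $1 gives a replacement identical to the match,
# so the text comes back unchanged.
unchanged = re.sub(r"<p>", "<p>", pee)
print(unchanged == pee)  # True

# A literal translation of "$1<p>" is rejected, because the pattern
# has no group 1 for \1 to refer to.
try:
    re.sub(r"<p>", r"\1<p>", pee)
    literal_port_ok = True
except re.error:
    literal_port_ok = False

print(literal_port_ok)  # False
```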