I recently wanted to extract all function/method prototypes from PHP files using PCRE. The extraction itself is not a real problem, but for further processing I need every single parameter of each function as a separate element. It would surely be no problem with 2 regexes (a. extract the function prototype, b. extract the parameters), but this would (IMHO) be much more resource intensive than doing it all at once.
Here is a (much smaller than the real one) example which describes the problem (see the extended entry for the real regex I developed):
$string = "(aaa bbb ccc ddd)";
$regex = "/\((?:
([a-z]+)\s*
)+\)/x";
preg_match($regex, $string, $matches);
The output of a var_dump($matches) looks like this:
array(2) {
[0]=>
string(17) "(aaa bbb ccc ddd)"
[1]=>
string(3) "ddd"
}
Where I would expect something like
array(5) {
[0]=>
string(17) "(aaa bbb ccc ddd)"
[1]=>
string(3) "aaa"
[2]=>
string(3) "bbb"
[3]=>
string(3) "ccc"
[4]=>
string(3) "ddd"
}
or
array(2) {
[0]=>
string(17) "(aaa bbb ccc ddd)"
[1]=>
array(4) {
[0]=>
string(3) "aaa"
[1]=>
string(3) "bbb"
[2]=>
string(3) "ccc"
[3]=>
string(3) "ddd"
}
}
Does anyone know a solution for that (I repeat, in 1 regular expression)?
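For background: PCRE keeps only the last iteration of a repeated capturing group, which is why only "ddd" shows up above. One possible single-pattern workaround, offered as a sketch rather than a confirmed answer, is to switch from preg_match to preg_match_all and chain the matches with the \G anchor (which matches where the previous match ended). Note that this simplified pattern does not verify the closing parenthesis.

```php
<?php
$string = "(aaa bbb ccc ddd)";

// Each match starts either at the opening parenthesis or, via \G, exactly
// where the previous match ended; (?!\A) stops \G from matching at offset 0.
$regex = '/(?:\(|\G(?!\A)\s+)([a-z]+)/';

$count = preg_match_all($regex, $string, $matches);

// $matches[1] now holds one entry per repetition: aaa, bbb, ccc, ddd.
print_r($matches[1]);
```

Whether this still counts as "1 regular expression" is debatable, since the looping is done by preg_match_all instead of by a quantifier inside the pattern.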
The original code for extracting functions/methods from PHP files I developed:
$regex = '/(?:function
\s*([a-zA-Z0-9_]+)\s*
\(\s*
(?:
(
\$[A-Za-z0-9_]+
(?:
\s*=\s*
(?: \'[^\']*\' | "[^"]*" | [A-Za-z0-9-.]+ )
)?
)
\s*,*\s*
)*
\)
)/x';
$res = preg_match_all($regex, $in, $matches, PREG_SET_ORDER);
If you have any optimizations, please comment to this entry! Thanks!
Davey
Toby,
I assume this is for the PEAR QA tools we've been discussing lately.
I have mentioned in the past, that I have all the tools for the parsing of classes/functions/args/return values/docblock types info done.
I will happily contribute this.
Anyways, the solution is easy, don't use Regex. Use the tokenizer extension which is specifically for this task!
- Davey
Nico Edtinger
IMHO you're searching for the wrong solution.
First, not everything should be done in one regex. Even the O'Reilly book "Mastering Regular Expressions" has some Perl code around many of its regexes. So you may use two regexes, too. The second regex would be run in a loop many times, but the compiled regex is cached by the PHP extension. And from the regex engine's view it's the same code that gets executed; you just do the loop yourself.
The other problem is your current regex. It would also match functions in comments, in strings, in heredocs, or in HTML between your PHP blocks.
That's why it would IMHO be better to use the tokenizer. It does exactly what you need: split PHP code into "PHP atoms".
Just my 2 eurocents.
b4n
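To illustrate the tokenizer suggestion above, here is a minimal sketch built on token_get_all(). The helper name extract_prototypes() and the sample source are made up for this example, and the helper deliberately ignores closures, by-reference functions, and methods named after reserved words.

```php
<?php
// Hypothetical helper: returns [functionName => [parameter, ...]].
function extract_prototypes(string $source): array
{
    $tokens = token_get_all($source);
    $count  = count($tokens);
    $result = [];

    for ($i = 0; $i < $count; $i++) {
        if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_FUNCTION) {
            continue;
        }
        // Skip whitespace between "function" and the name.
        $j = $i + 1;
        while ($j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_WHITESPACE) {
            $j++;
        }
        // No plain name here (e.g. a closure): not handled in this sketch.
        if ($j >= $count || !is_array($tokens[$j]) || $tokens[$j][0] !== T_STRING) {
            continue;
        }
        $name   = $tokens[$j][1];
        $params = [];
        $depth  = 0;
        // Walk the parameter list, tracking nesting so that default values
        // such as array(1, 2) do not end the list early.
        for ($k = $j + 1; $k < $count; $k++) {
            $tok = $tokens[$k];
            if ($tok === '(') {
                $depth++;
            } elseif ($tok === ')') {
                if (--$depth === 0) {
                    break;
                }
            } elseif (is_array($tok) && $tok[0] === T_VARIABLE && $depth === 1) {
                $params[] = $tok[1];
            }
        }
        $result[$name] = $params;
        $i = $k;
    }
    return $result;
}

// Sample input: the comment and the string must not produce matches.
$code = <<<'PHP'
<?php
// function fake($nope) {}
$s = 'function alsoFake($x) {}';
function foo($a, $b = "x", $c = array(1, 2)) {}
class K { public function bar($z = null) {} }
PHP;

$prototypes = extract_prototypes($code);
print_r($prototypes);
```

Because the tokenizer classifies comments and string literals as their own token types, the two fake declarations in the sample never appear as T_FUNCTION, which is exactly the weakness of the regex approach described above.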
Anonymous
Use explode to separate arguments
Timm Friebe
I'd use ext/tokenizer.
S
http://sean.caedmon.net/token-funcs.phps
not foolproof, but it seems to work for me..
not heavily tested, either
S
Toby
Thanks!
Will try this and check it against alternatives like the Reflection API and some more regex stuff (even if it's just for regex study purposes). :)
Davey
I spoke with Greg Beaver last night and he made me benchmark the tokenizer against the Reflection API, the results were quite astounding:
I used the Convertor.inc file in the phpDocumentor package; it's about 4000 LOC
Tokenizer: 65s
Reflection API: 0.005s
Whilst the Reflection API test didn't do everything my tokenizer-based package does (it was a quick test), the difference in speed is so large that I don't expect it to take anywhere near as long as the tokenizer, even when fully fleshed out.
The only problem is that it will not work in your case. It requires you to tell it the class name, so we'd have to require full package.xml files which show what each package provides (like PEAR_PackageFileManager outputs). This might be something we could require for stable packages however... that they use PEAR_PFM... it's easy enough for us to automate the generation of that script even - I do it for Cerebral Cortex :)
- Davey
Toby
I'm currently playing with the Reflection API, too.
The problem you describe is easily manageable using get_declared_classes(). :)
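A rough sketch of how that could look, assuming it is acceptable to actually load (and therefore execute) the file being inspected: compare get_declared_classes() before and after the include, then reflect on whatever is new. The helper name reflect_file() and the DemoConvertor class are invented for this example.

```php
<?php
// Hypothetical helper: loads a file and lists the parameters of every
// method of every class that the file declares.
function reflect_file(string $file): array
{
    $before = get_declared_classes();
    require_once $file; // note: this runs any top-level code in the file
    $new = array_diff(get_declared_classes(), $before);

    $result = [];
    foreach ($new as $class) {
        $ref = new ReflectionClass($class);
        foreach ($ref->getMethods() as $method) {
            $params = [];
            foreach ($method->getParameters() as $param) {
                $params[] = '$' . $param->getName();
            }
            $result[$class . '::' . $method->getName()] = $params;
        }
    }
    return $result;
}

// Usage with a throwaway file standing in for a real package file.
$file = tempnam(sys_get_temp_dir(), 'refl');
file_put_contents($file, '<?php class DemoConvertor { function walk($element, $depth = 0) {} }');
$protos = reflect_file($file);
unlink($file);

print_r($protos);
```

Unlike the tokenizer or a regex, this only sees classes; plain functions would need a similar before/after comparison, for instance with get_defined_functions().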