I have uploaded three files that should assist with testing regex vs. re, and obtaining profiling data.
timer_regex.py:
This is a "proxy" package that will intercept calls to re.* methods (compile, findall, sub etc.) and log timings for both the re and (if available) regex calls. The data will be written to a file in /tmp named after the currently executing addon, eg. /tmp/plugin.video.iplayer.dat
If the regex package is not installed, then only the timings for re will be recorded.
Place this file in the root directory of the addon to be analysed, eg. /storage/.xbmc/addons/plugin.video.iplayer/timer_regex.py
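To give a feel for the "proxy" idea, here is a minimal sketch of wrapping re functions so that each call is timed. This is not the actual timer_regex.py code: the names `_timed` and `timings` are made up for illustration, it only times re (not regex), it keeps the results in memory rather than writing to /tmp, and it is written for Python 3.

```python
import re
import time

# Hypothetical in-memory log: (method label, pattern, elapsed microseconds).
timings = []

def _timed(label, func):
    """Wrap a module-level re function so that every call is timed."""
    def wrapper(pattern, *args, **kwargs):
        start = time.perf_counter()
        result = func(pattern, *args, **kwargs)
        elapsed_us = (time.perf_counter() - start) * 1e6
        timings.append((label, pattern, elapsed_us))
        return result
    return wrapper

# Drop-in replacements with the same call signatures as the re originals.
findall = _timed("d.findall", re.findall)
search = _timed("d.search", re.search)
sub = _timed("d.sub", re.sub)

matches = findall(r"<title[^>]*>(.*?)</title>", "<title>News</title>")
print(matches)
print(timings)
```

A module built this way can be imported in place of re, because callers use the wrapped functions exactly as they would use the originals.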
importhook.py:
This is a hack to subvert the import process and ensure the addon (and any "helper" addons or scripts, but not system libraries) uses the timer_regex package whenever "re" is imported.
Place this file in the root directory of the addon to be analysed, eg. /storage/.xbmc/addons/plugin.video.iplayer/importhook.py
Each addon usually has a file called default.py, and the call to "import importhook" needs to be added to default.py as the first import, before any other import. This will "hook" calls to "import re" so that "timer_regex" is imported instead of "re".
Also add "importhook.unload()" as the last line in default.py so that the elapsed time for all processing is recorded for analysis. This last line is optional, and can be omitted if you are not interested in the elapsed time analysis. However, having this information available helps give some context/perspective to the other timings, eg. is a 5 second optimisation worthwhile when the overall elapsed time is 2 minutes? If you don't know the elapsed time, it's harder to say whether a potential optimisation is likely to be of any benefit.
The importhook method has been shown to work with iPlayer, SportsDevil, Daily Motion and YouTube addons.
Some addons, such as YouTube, use other addons and scripts to implement their functionality, eg. YouTube has a dependency on script.module.simplejson, script.module.parsedom and script.module.simple.downloader. importhook will automatically ensure these "helper" addons use timer_regex too whenever called by the addon being analysed (there is no need to modify the "helper" addons/scripts).
analysis.sh:
A simple analysis tool. See "analysis.sh -h" for help.
It will display the accumulated details for each regular expression event (ie. each time a method is executed), and will show the most frequent and most expensive regular expression methods used by the addon.
Example:
Code:
rpi512:~ # ./analysis.sh -i /tmp/plugin.video.iplayer.dat -a -f -e -m findall
Method     Freq  regex vs. re  Avg +/- us | re (min/max/avg/total)                      | regex (min/max/avg/total)
c.compile   644    28 vs 616    -313.5941 |  62.9425 /  43292.9993 / 681.6713 / 0.4390s |  99.8974 / 69763.1836 / 995.2654 / 0.6410s
c.findall   619    52 vs 567     -21.0925 |  53.8826 /   4201.8890 /  98.9822 / 0.0613s |  85.8307 /  4757.1659 / 120.0748 / 0.0743s
c.match       2     1 vs 1       -15.1396 | 113.9641 /    254.8695 / 184.4168 / 0.0004s | 136.1370 /   262.9757 / 199.5564 / 0.0004s
c.search    619    44 vs 575     -43.4284 |  45.7764 /   4821.0621 /  71.8811 / 0.0445s |  56.0284 /  4237.8902 / 115.3095 / 0.0714s
d.findall  4945   316 vs 4629   -100.3489 | 117.0635 / 280357.1224 / 296.9627 / 1.4685s | 216.0072 / 32974.0047 / 397.3115 / 1.9647s
d.sub        82    14 vs 68     -489.4821 | 133.9912 /   7480.8598 / 374.0787 / 0.0307s | 171.8998 / 24275.0645 / 863.5608 / 0.0708s
===============================================================================================================================================
TOTAL      6911   455 vs 6456    -0.7783s |                                     2.0443s |                                    2.8226s
ELAPSED TIME less re   : 25.6184s
ELAPSED TIME less regex: 24.8401s
ELAPSED TIME TOTAL     : 27.6627s
PERF LOGGING OVERHEAD  :  9.1602s (included in above elapsed times)
Methods prefixed with c. are compiled patterns. Methods prefixed d. have been called with a pattern requiring compilation.
Top 10 most frequent findall() calls (ranked by descending frequency):
619 d.findall <updated[^>]*>(.*?)</updated>
619 d.findall <title[^>]*>(.*?)</title>
619 d.findall <link rel="self" .*title=".*pisode *([0-9]+?)
619 d.findall <link rel="related" href=".*microsite.*title="(.*?)" />
619 d.findall <id[^>]*>(.*?)</id>
619 d.findall <content[^>]*>(.*?)</content>
619 d.findall <category[^>]*term="(.*?)"[^>]*>
619 c.findall PIPS:([0-9a-z]{8})
530 d.findall <link rel="self" .*title="([0-9]+?)\.
80 d.findall iplayer/categories/(.*?)/list
Top 10 most expensive findall() calls (ranked by regex time):
- 247383.1177 us 280357.1224 us 32974.0047 us d.findall 16 <entry>(.*?)</entry>
- 4199.0280 us 32804.0123 us 28604.9843 us d.findall 0 <\?xml version="[^"]*" encoding="([^"]*)"\?>
+ 15471.9353 us 10276.0792 us 25748.0145 us d.findall 16 <link rel="related" href=".*microsite.*title="(.*?)" />
+ 11581.1825 us 9920.8355 us 21502.0180 us d.findall 16 <link rel="self" .*title=".*pisode *([0-9]+?)
+ 12005.8060 us 7300.1385 us 19305.9444 us d.findall 16 <link rel="self" .*title="([0-9]+?)\.
+ 8353.9486 us 4940.0330 us 13293.9816 us d.findall 16 iplayer/categories/(.*?)/list
+ 2883.1959 us 10088.9206 us 12972.1165 us d.findall 16 <updated[^>]*>(.*?)</updated>
+ 500.9174 us 12237.0720 us 12737.9894 us d.findall 16 <category[^>]*term="(.*?)"[^>]*>
+ 6442.7853 us 5768.0607 us 12210.8459 us d.findall 16 <content[^>]*>(.*?)</content>
+ 237.7033 us 10388.1359 us 10625.8392 us d.findall 16 <title[^>]*>(.*?)</title>
Downloading:
Code:
wget www.nmacleod.com/public/regex/timer_regex.py
wget www.nmacleod.com/public/regex/importhook.py
wget www.nmacleod.com/public/regex/analysis.sh
chmod +x analysis.sh
Then copy the timer_regex.py and importhook.py files into the root of the addon to be analysed.
Enable by top & tailing default.py with:
Code:
import importhook
...
importhook.unload()
Note: Depending on how the addon is written, you may need to indent the last line.
Disable by simply removing the two importhook references from default.py.
Understanding the results
One of the most common issues I've seen is repeated calls to re.compile() for a static pattern in a loop while iterating over data. For instance in iplayer, this can be seen when navigating to Categories -> Childrens, where re.compile() is called over 600 times when it could be called just once.
In fact, as a general rule, wherever possible, addons should be compiling all static patterns and compiling them only once.
What often happens, however, is that patterns are not being compiled explicitly, which means the re/regex package has to look up its internal cache to see if the pattern has been compiled previously. This wastes time, as it can often take longer to find a pattern in the cache than it would to compile the pattern in the first place.
So rather than:
Code:
for string in lots_of_data:
    # re.sub() takes (pattern, replacement, string); "replacement" is a placeholder
    result = re.sub("/some pattern/", replacement, string)
the following code could be used which avoids the compile cache overhead:
Code:
re_pattern = re.compile("/some pattern/")
for string in lots_of_data:
    # "replacement" is a placeholder for whatever should be substituted in
    result = re_pattern.sub(replacement, string)
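If you want to check the difference on your own hardware, something along these lines should do. It is only a sketch: the test data, the empty replacement string and the repeat count are made up, and the result will depend on the Python version and platform.

```python
import re
import timeit

# Made-up test data; the pattern is the placeholder used above.
PATTERN = "/some pattern/"
lots_of_data = ["/some pattern/ in item %d" % i for i in range(1000)]

def uncompiled():
    # Goes through the module-level function, and so the pattern cache, every call.
    return [re.sub(PATTERN, "", s) for s in lots_of_data]

re_pattern = re.compile(PATTERN)

def precompiled():
    # Uses the compiled pattern object directly.
    return [re_pattern.sub("", s) for s in lots_of_data]

t_uncompiled = timeit.timeit(uncompiled, number=20)
t_precompiled = timeit.timeit(precompiled, number=20)
print("re.sub(pattern, ...): %.4fs" % t_uncompiled)
print("compiled.sub(...):    %.4fs" % t_precompiled)
```

Both functions return the same list, so the only thing being compared is how the pattern is reached.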
And rather than repeatedly calling a function which then compiles a static pattern, compile the patterns once for the entire module:
So instead of:
Code:
def series_match(name):
    # match the series name part of a programme name
    seriesmatch = []
    seriesmatch.append(re.compile(r'^(Late\s+Kick\s+Off\s+)'))
    seriesmatch.append(re.compile(r'^(Inside\s+Out\s+)'))
    seriesmatch.append(re.compile(r'^(.*?):'))
    match = None
    for s in seriesmatch:
        match = s.match(name)
        if match:
            break
where series_match() is called 100 times resulting in 300 re.compile() calls, use the following recipe:
Code:
# compiled once, when the module is loaded
re_series_match = [re.compile(r'^(Late\s+Kick\s+Off\s+)'),
                   re.compile(r'^(Inside\s+Out\s+)'),
                   re.compile(r'^(.*?):')]
def series_match(name):
    # match the series name part of a programme name
    match = None
    for s in re_series_match:
        match = s.match(name)
        if match:
            break
and now there are only three calls to re.compile() no matter how many times series_match() is called.
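The 300 vs. 3 figure is easy to confirm with a small counting wrapper. This is just a sketch for illustration: `counting_compile`, the `_slow`/`_fast` function names, the sample programme name and the `return` statements are my additions, not part of the iplayer code.

```python
import re

compile_calls = 0

def counting_compile(pattern):
    """Stand-in for re.compile() that counts how often it is called."""
    global compile_calls
    compile_calls += 1
    return re.compile(pattern)

def series_match_slow(name):
    # Patterns are rebuilt on every call.
    patterns = [counting_compile(r'^(Late\s+Kick\s+Off\s+)'),
                counting_compile(r'^(Inside\s+Out\s+)'),
                counting_compile(r'^(.*?):')]
    for p in patterns:
        match = p.match(name)
        if match:
            return match
    return None

for _ in range(100):
    series_match_slow("Inside Out London")
slow_calls = compile_calls

# Patterns built once at module level.
compile_calls = 0
RE_SERIES_MATCH = [counting_compile(r'^(Late\s+Kick\s+Off\s+)'),
                   counting_compile(r'^(Inside\s+Out\s+)'),
                   counting_compile(r'^(.*?):')]

def series_match_fast(name):
    for p in RE_SERIES_MATCH:
        match = p.match(name)
        if match:
            return match
    return None

for _ in range(100):
    series_match_fast("Inside Out London")
fast_calls = compile_calls

print(slow_calls, fast_calls)  # prints "300 3"
```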
It's unlikely that the above small changes will have a huge performance impact but they should be beneficial over the long run.
The profiling data may also reveal other behavioural aspects worthy of attention and improvement. Not just that, the data may also demonstrate that regex does not, as a rule, outperform the existing re package (either at all, or by a significant margin).
On the basis of this analysis I have submitted patches for iplayer and SportsDevil that attempt to eliminate the worst cases of repeated static pattern compilation.