Bug in scrap.exe
#1
I think I've found a bug in scrap.exe. It might be in XBMC's parsing of scrapers instead, but I believe the problem is in scrap.exe: the two behave differently when "cleaning" expressions (that is, when you do not specify noclean="1"). It was driving me crazy...

I was trying to extract <genre> for the example scraper in "scraper for dummies". The interesting part of $$1 is:
Code:
...<font class = 'titulo3'>Género:</font><br>Terror / Thriller<br><br><font class = 'titulo3'>Nacionalidad:</font>...
the idea is this:
regexp1:
Code:
<RegExp input="$$1" output="\1" dest="9">
    <expression noclean="1">G.nero:(.[^:]*)Nacionalidad:</expression>
</RegExp>

Should store in $$9 this:
Code:
</font><br>Terror / Thriller<br><br><font class = 'titulo3'>

then regexp2:
Code:
<RegExp input="$$9" output="\1/" dest="7">
    <expression noclean="1">&gt;(.[^&lt;&gt;]*)&lt;</expression>
</RegExp>

This should cut out the innermost text and append "/" to it, storing the following in $$7:
Code:
Terror / Thriller/

and finally regexp3:
Code:
<RegExp input="$$7" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
    <expression repeat="yes" trim="1">([^/]*)/</expression>
</RegExp>

This appends the following to $$8:
Code:
<genre>Terror</genre><genre>Thriller</genre>

It actually works in both scrap.exe and xbmc.
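
For anyone who wants to check the three expressions without either tool, here is a rough Python sketch of the same pipeline. It assumes Python's re module treats these particular patterns the same way the scraper's regex engine does, and the names page, buf9, buf7 and buf8 are only illustrative stand-ins for $$1, $$9, $$7 and $$8.

```python
import re

# Fragment of the page buffer ($$1) quoted above.
page = ("<font class = 'titulo3'>Género:</font><br>Terror / Thriller<br><br>"
        "<font class = 'titulo3'>Nacionalidad:</font>")

# regexp1: capture everything between the two labels (noclean, so tags are kept).
buf9 = re.search(r"G.nero:(.[^:]*)Nacionalidad:", page).group(1)

# regexp2: take the innermost text between '>' and '<', then append "/".
buf7 = re.search(r">(.[^<>]*)<", buf9).group(1) + "/"

# regexp3: repeat over every "/"-terminated chunk and trim the whitespace.
buf8 = "".join(f"<genre>{chunk.strip()}</genre>"
               for chunk in re.findall(r"([^/]*)/", buf7))

print(buf8)  # <genre>Terror</genre><genre>Thriller</genre>
```

This only models the regular expressions themselves, not how XBMC or scrap.exe manage their buffers.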

But in my first attempt I forgot to add noclean="1" to both regexp1 and regexp2. That should not work: the output of regexp1 gets cleaned, so the expression in regexp2 does not match anything, and since $$7 is not cleared, regexp3 uses its previous content and generates some random <genre>, or nothing. That is what happens in XBMC, but in scrap.exe it actually worked and gave me correct results!
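
To illustrate why the cleaned variant should fail, here is a small Python sketch. It assumes that "cleaning" amounts to removing anything that looks like an HTML tag; strip_tags is a hypothetical approximation of that step, not XBMC's actual code.

```python
import re

def strip_tags(text):
    # Hypothetical stand-in for "cleaning": drop anything shaped like a tag.
    return re.sub(r"<[^>]*>", "", text)

# What regexp1 captures before any cleaning is applied.
captured = "</font><br>Terror / Thriller<br><br><font class = 'titulo3'>"

# With cleaning, no '>' or '<' is left for regexp2 to anchor on.
buf9_cleaned = strip_tags(captured)
match = re.search(r">(.[^<>]*)<", buf9_cleaned)

print(repr(buf9_cleaned))  # 'Terror / Thriller'
print(match)               # None
```

With no match in regexp2, nothing new is written to $$7, which fits the stale or empty <genre> output seen in XBMC.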

It then occurred to me that, since cleaning should strip all HTML content, omitting noclean="1" in regexp1 should directly return "Terror / Thriller", so this shorter version (dropping regexp2) should do the same:

Code:
<RegExp input="$$7" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
    <RegExp input="$$1" output="\1/" dest="7">
        <expression>G.nero:(.[^:]*)Nacionalidad:</expression>
    </RegExp>
    <expression repeat="yes" trim="1">([^/]*)/</expression>
</RegExp>

In XBMC this works like a charm, but in scrap.exe it returns this:
Code:
<genre><</genre><genre>font><br>Terror</genre><genre>Thriller<br><br><font class = 'titulo3'></genre>

which proves that scrap.exe is not cleaning \1 in the inner regexp.
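
The difference can be reproduced with a short Python sketch that runs the outer expression on a cleaned and an uncleaned capture. As before, strip_tags is only an assumed approximation of cleaning, and genres is a hypothetical helper that mimics the outer regexp with repeat="yes" and trim="1".

```python
import re

def strip_tags(text):
    # Hypothetical stand-in for "cleaning": drop anything shaped like a tag.
    return re.sub(r"<[^>]*>", "", text)

def genres(buf7):
    # Mimics the outer regexp: one <genre> per "/"-terminated chunk, trimmed.
    return "".join(f"<genre>{chunk.strip()}</genre>"
                   for chunk in re.findall(r"([^/]*)/", buf7))

# What the inner regexp captures from $$1 before cleaning.
captured = "</font><br>Terror / Thriller<br><br><font class = 'titulo3'>"

cleaned = genres(strip_tags(captured) + "/")  # inner \1 cleaned first
uncleaned = genres(captured + "/")            # inner \1 left untouched

print(cleaned)    # <genre>Terror</genre><genre>Thriller</genre>
print(uncleaned)
```

The uncleaned result matches the scrap.exe output quoted above character for character, because the "/" inside "</font>" is treated as a separator too.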

In "scraper for dummies" I'm using the first, longer version with regexp 1, 2 and 3 because it works in both cases, so it will not confuse people who try it by hand.
#2
Unfortunately, scrap.exe is out of date, and no longer maintained. The original author lost the sources to his updated build.

This means that the only way to test it reliably at this point is directly from XBMC.
#3
Ok, I will include that info in the wiki for everyone to know.

scrap.exe is still useful enough for some quick tests; I haven't found any other bugs besides this noclean issue.