Updating an Existing Scraper

nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Question  Updating an Existing Scraper Post: #1
Hi

I was trying to use the existing TMDB scraper for my movie collection. Unfortunately, it seems to send the whole file name to the search API, but all my files are named like "<Director Name> - <Title> [part1-2] (<Year>)". So I need to change the regexps a little to match my file names. I found a guide here:
http://wiki.xbmc.org/index.php?title=HOW...s_guide%29

Unfortunately, it is very difficult to understand from that guide how to work with existing scrapers.
Here is an excerpt from the TMDB scraper:
Code:
<CreateSearchUrl dest="3">
  <RegExp input="$$1" output="&lt;url&gt;http://api.themoviedb.org/2.1/Movie.search/$INFO[language]/xml/57983e31fb435df4df77afb854740ea9/\1&lt;/url&gt;" dest="3">
    <RegExp input="$$2" output="+\1" dest="4">
      <expression clear="yes">(.+)</expression>
    </RegExp>
    <expression noclean="1"/>
  </RegExp>
</CreateSearchUrl>
The regexp I need for my files is simple and obvious:
Code:
.+\s-\s(.+)\s(part[1-9]\s)?\(.+\)

I just don't get how the inner regexp in the scraper is connected to the outer one: the inner one works with buffers 2 and 4, and the outer one works with buffers 1 and 3. So to me they shouldn't be related at all...

Furthermore, the inner regexp is (.+), which should match everything, and the output is +\1. Does that mean it just adds "+" in front of the string?

Could you please explain how I can modify the scraper above to use my regexp?
bambi73 Offline
Senior Member
Posts: 165
Joined: Jan 2010
Reputation: 0
Location: Czech Republic
Post: #2
nucleo Wrote: I just don't get how the inner regexp in the scraper is connected to the outer one: the inner one works with buffers 2 and 4, and the outer one works with buffers 1 and 3. So to me they shouldn't be related at all...

Furthermore, the inner regexp is (.+), which should match everything, and the output is +\1. Does that mean it just adds "+" in front of the string?
The inner one is processed first, then the outer one.
As Spiff put it in another thread: expressions are evaluated in a LIFO/depth-first fashion, i.e. dig into the deepest one and evaluate that first.

And you are right, the inner regexp makes no sense: its result is not used, and I have no idea what $$2 contains in CreateSearchUrl. IMHO nothing.
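
To make the evaluation order concrete, here is a toy Python model of it. This is only a sketch, not XBMC's actual implementation: the `evaluate` function, the dictionary layout, the placeholder buffer contents and the `example.invalid` URL are all assumptions made up for illustration, and the outer expression is written out as (.+) instead of being left empty as in the real scraper.

```python
import re

def evaluate(node, buffers):
    """Evaluate a RegExp node depth-first: nested nodes run before their parent."""
    for child in node.get("children", []):
        evaluate(child, buffers)
    match = re.search(node["expression"], buffers.get(node["input"], ""))
    if match:
        # Fill the output template (\1, \2, ...) and store it in the dest buffer.
        buffers[node["dest"]] = match.expand(node["output"])
    return buffers

# Placeholder buffer contents, chosen only for illustration.
buffers = {1: "Crash", 2: "2004"}

# Inner node: reads buffer 2, writes "+<match>" into buffer 4.
inner = {"input": 2, "expression": r"(.+)", "output": r"+\1", "dest": 4}

# Outer node: reads buffer 1, writes the URL into buffer 3.
# Its output never mentions buffer 4, so the inner result goes unused.
outer = {
    "input": 1,
    "expression": r"(.+)",
    "output": r"<url>http://example.invalid/search/\1</url>",
    "dest": 3,
    "children": [inner],
}

evaluate(outer, buffers)
print(buffers[4])  # written by the inner node first
print(buffers[3])  # written by the outer node afterwards
```

The two nodes only "correlate" if the outer output refers to the inner dest buffer (e.g. via $$4). Here it does not, which is why the inner regexp has no visible effect.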

nucleo Wrote: Could you please explain how I can modify the scraper above to use my regexp?

Something like:
Code:
<CreateSearchUrl dest="3">
  <RegExp input="$$1" output="&lt;url&gt;http://api.themoviedb.org/2.1/Movie.search/$INFO[language]/xml/57983e31fb435df4df77afb854740ea9/\1&lt;/url&gt;" dest="3">
    <expression noclean="1">.+?\s-\s(.+)\s(?:part[1-9]\s)?\(.+\)</expression>
  </RegExp>
</CreateSearchUrl>

I added ? to the first .+ to make it lazy; otherwise \s-\s would match the last occurrence, which could be inside the movie name.
BTW, I haven't tested it, so it comes without any warranty ;)
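
The greedy/lazy difference can be checked outside XBMC with a few lines of Python. This is a sketch under assumptions: the file name is made up, and Python's `re` module is used as a stand-in for XBMC's regex engine, which may differ in details.

```python
import re

# Same pattern twice; only the first quantifier differs (.+ vs .+?).
GREEDY = re.compile(r".+\s-\s(.+)\s(?:part[1-9]\s)?\(.+\)")
LAZY = re.compile(r".+?\s-\s(.+)\s(?:part[1-9]\s)?\(.+\)")

# Hypothetical file name whose title itself contains " - ".
name = "Some Director - Alpha - Beta (2004)"

greedy_title = GREEDY.match(name).group(1)
lazy_title = LAZY.match(name).group(1)

print(greedy_title)  # Beta
print(lazy_title)    # Alpha - Beta
```

With the greedy prefix, the director part swallows everything up to the last " - ", so half of the title is lost. With the lazy prefix it stops at the first " - ".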
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #3
Thanks for your reply!

Actually, you also added "?:" before part[1-9]. I'm not sure what it is for; in any case, it doesn't work with or without it. This is the main problem. I had already tried using it the way you suggested, and XBMC always says "Unable to connect to remote server". However, the default scraper works (though it gets the wrong info, because it searches for "Paul Haggis - Crash (2004)" instead of just "Crash").

Unfortunately, I cannot test regexps with ScraperXML, because I'm on Ubuntu. So I just rely on XBMC itself... and the message is not very promising.

BTW, this additional inner regexp is present in every scraper, for example in the IMDB one. But there the output is %20\1 instead of +\1.
%20 and + remind me of spaces in HTTP URLs, but how does the expression work, and why does it place its output into $$4, which is never used? Is it some hidden, undocumented functionality like in the Win API? :)

The manuals say somewhere that XBMC will offer a list of candidates for each file, but it offers me nothing. Can I enable that somehow?
olympia Online
Team-XBMC Member
Posts: 2,381
Joined: May 2008
Reputation: 30
Post: #4
@bambi73
$$2 is the year from the filename, and it is used in the IMDB scraper.

Not sure why it is there in the TMDB scraper. It is indeed not used there, so it is probably a leftover from the past.
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #5
olympia, could you please give an example of where I can place my regexp in the TMDB or IMDB scraper?
bambi73 Offline
Senior Member
Posts: 165
Joined: Jan 2010
Reputation: 0
Location: Czech Republic
Post: #6
nucleo Wrote: Thanks for your reply!

Actually, you also added "?:" before part[1-9]. I'm not sure what it is for; in any case, it doesn't work with or without it. This is the main problem. I had already tried using it the way you suggested, and XBMC always says "Unable to connect to remote server". However, the default scraper works (though it gets the wrong info, because it searches for "Paul Haggis - Crash (2004)" instead of just "Crash").
After olympia's response I got a bit suspicious, and yes, XBMC already removes the year from the file name and passes it in $$2. So you only need:

Code:
.+?\s-\s(.+?)(?:\spart[1-9])?

(?: ) means that this group is non-capturing, so it doesn't produce a replacement string in \2.
EDIT: added ? to the movie name group as well. It should be lazy too, because otherwise "part" will become part of the movie name: if the movie group is greedy, it will always force the part group to match zero times.
Again, not tested, only an armchair guess :)
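
Since the pattern is untested, here is a small Python check of how it behaves. It is a sketch under assumptions: Python's `re` stands in for XBMC's regex engine, the file names are examples with the year already stripped, and the trailing `$` in the second and third patterns is an addition that does not appear in the post.

```python
import re

# Pattern as posted: nothing forces the lazy title group to grow.
UNANCHORED = re.compile(r".+?\s-\s(.+?)(?:\spart[1-9])?")
# Assumed variant with an end-of-string anchor added.
ANCHORED = re.compile(r".+?\s-\s(.+?)(?:\spart[1-9])?$")
# Anchored, but with a greedy title group, for comparison.
GREEDY_TITLE = re.compile(r".+?\s-\s(.+)(?:\spart[1-9])?$")

def title(pattern, name):
    """Return capture group 1 for the first match, or None if nothing matches."""
    m = pattern.search(name)
    return m.group(1) if m else None

plain = "Paul Haggis - Crash"
multi = "Paul Haggis - Crash part1"  # hypothetical multi-part file name

print(title(UNANCHORED, plain))    # C
print(title(ANCHORED, plain))      # Crash
print(title(ANCHORED, multi))      # Crash
print(title(GREEDY_TITLE, multi))  # Crash part1
print(ANCHORED.groups)             # 1, because (?: ) creates no capture group
```

In this model the unanchored pattern captures only the first character of the title, because a lazy group followed by an optional group is satisfied as early as possible. If XBMC's engine behaves the same way, anchoring the expression at the end is one way to get the full title into \1. The greedy variant confirms the point from the EDIT: the title group swallows "part1" and the part group matches nothing.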

nucleo Wrote: Unfortunately, I cannot test regexps with ScraperXML, because I'm on Ubuntu. So I just rely on XBMC itself... and the message is not very promising.

BTW, this additional inner regexp is present in every scraper, for example in the IMDB one. But there the output is %20\1 instead of +\1.
%20 and + remind me of spaces in HTTP URLs, but how does the expression work, and why does it place its output into $$4, which is never used? Is it some hidden, undocumented functionality like in the Win API? :)

The manuals say somewhere that XBMC will offer a list of candidates for each file, but it offers me nothing. Can I enable that somehow?
Turn on debugging in the XBMC settings and you will see the return string of the parser functions in the log. You can exploit this return string to see the value of any buffer in a function; simply do something like:

Code:
<RegExp input="$$1" output="&lt;url&gt;http://akas.imdb.com/find?s=tt;q=\1$$4&lt;/url&gt;  ##1=$$1  ##2=$$2" dest="3">
Because the extra text comes after the closing element, it doesn't hurt the XML parser (at least not visibly ;)), and you see it in the log. Quite simple but useful.

olympia Wrote: @bambi73
$$2 is the year from the filename, and it is used in the IMDB scraper.
Good to know. I have never worked on a movie scraper, so this is new info for me :)
(This post was last modified: 2011-04-09 20:35 by bambi73.)
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #7
Thanks a lot, I followed your advice and enabled logging. It seems that my regexp produces nothing, because according to the logs the output URL contains only the static part, without the generated group \1. So I'll try some simpler regexps; maybe there is a mistake somewhere.
mortstar Offline
Senior Member
Posts: 249
Joined: Aug 2010
Reputation: 3
Post: #8
nucleo Wrote: Thanks a lot, I followed your advice and enabled logging. It seems that my regexp produces nothing, because according to the logs the output URL contains only the static part, without the generated group \1. So I'll try some simpler regexps; maybe there is a mistake somewhere.

Try using ScraperXML to see how your scraper flows.

You can use the test engine to see what is held in the buffer at each stage. You can also test your regex.

nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #9
As I already mentioned in one of my posts, I'm on Ubuntu and I don't know how to run ScraperXML there. I tried to run it under Wine, but without success.
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #10
bambi73 Wrote: EDIT: added ? to the movie name group as well. It should be lazy too, because otherwise "part" will become part of the movie name: if the movie group is greedy, it will always force the part group to match zero times.
Again, not tested, only an armchair guess :)

Just noticed your edit. Yes, you are right. I'm not that strong with regexps and have never used lazy groups.