Updating an Existing Scraper

olympia Offline
Team-XBMC Member
Posts: 2,377
Joined: May 2008
Reputation: 30
Post: #11
Be aware that the title in $$1 in CreateSearchURL is URL encoded, so you have to create the regexp for paul%20haggis%20%2d%20crash in this example.

So something like:
Code:
<expression noclean="1">.+?%20%2d%20(.+?)(?:%20part[1-9]%20)?$</expression>
or thereabout.
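
To see what this expression captures, here is a small check that uses Python's re module as a stand-in for the scraper's regex engine. That is an assumption: XBMC's engine may differ in details, although lazy quantifiers and "$" behave the same way for this pattern. The helper name extract_title and the two "part1" inputs are made up for illustration.

```python
import re

# The expression from the post above, unchanged.
PATTERN = re.compile(r".+?%20%2d%20(.+?)(?:%20part[1-9]%20)?$")


def extract_title(encoded_name):
    """Return capture group 1 (the title part), or None if nothing matches."""
    match = PATTERN.search(encoded_name)
    return match.group(1) if match else None


# The example from the post, then two hypothetical variants with a part suffix.
print(extract_title("paul%20haggis%20%2d%20crash"))
print(extract_title("paul%20haggis%20%2d%20crash%20part1%20"))
print(extract_title("paul%20haggis%20%2d%20crash%20part1"))
```

In this check the optional group only removes "part1" when it is followed by another %20. Without that trailing %20 the suffix stays in the captured title, which is the sort of detail "or thereabout" leaves open for tuning.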
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #12
olympia Wrote:Be aware that the title in $$1 in CreateSearchURL is URL encoded, so you have to create the regexp for paul%20haggis%20%2d%20crash in this example.

So something like:
Code:
<expression noclean="1">.+?%20%2d%20(.+?)(?:%20part[1-9]%20)?$</expression>
or thereabout.

Wow, that's really helpful! Thanks for the tip, I'll try.
nucleo Offline
Junior Member
Posts: 8
Joined: Apr 2011
Reputation: 0
Post: #13
Finally I got everything working the way I want. Many thanks to olympia and bambi73. Without you I would never have sorted this out.

I'm posting my results here in case somebody finds them useful for organizing their own movie collection.

File names should contain the movie title and the year. I tried two formats, and both work perfectly:
1. Sidney Lumet - Dog Day Afternoon (1975).part1.avi
2. Sidney Lumet - Dog Day Afternoon part1 (1975).avi
The third is the obvious one, for a movie that is not split into parts:
3. Sidney Lumet - Dog Day Afternoon (1975).avi

There are 3 very important points I've learned about XBMC scraping:

1. XBMC automatically cuts the year and the file extension from the file name before the scraper even starts working, so
Sidney Lumet - Dog Day Afternoon (1975).avi
becomes
Sidney Lumet - Dog Day Afternoon
This is what reaches the scraper in buffer $$1 (well, not exactly this, see item 3).
Buffer $$2 will in this case contain "1975": the year with the parentheses stripped, again before the scraper starts.

2. XBMC automatically recognizes words like "part[1-9]" and "cd[1-9]", cuts them off, and displays the parts as one item in the movie library. No further action is required from the scraper. Thus
Sidney Lumet - Dog Day Afternoon (1975).part1.avi
and
Sidney Lumet - Dog Day Afternoon (1975).part2.avi
are scraped as one item, which arrives as
Sidney Lumet - Dog Day Afternoon (well, not exactly this, see item 3)
in buffer $$1, before the scraper applies its regular expressions.

3. The contents of $$1 reach the scraper URL-encoded and lower-cased. Thus in our example $$1 will actually contain
sidney%20lumet%20%2d%20dog%20day%20afternoon
Every space is replaced with %20 and the dash with %2d.
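
The three points above can be put together in a rough Python emulation. This is only a sketch of the behaviour observed in this thread, not XBMC's actual code: the function names, the regular expressions for the part markers and the year, and the rule "percent-encode every byte that is not an ASCII letter or digit, with lower-case hex" are all assumptions chosen to reproduce the example strings.

```python
import os
import re


def encode_like_example(text):
    """Percent-encode everything except ASCII letters and digits (assumed rule)."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append("%%%02x" % byte)  # lower-case hex, e.g. "-" -> "%2d"
    return "".join(out)


def simulate_buffers(filename):
    """Return (buffer1, buffer2): an emulation of $$1 and $$2 for a file name."""
    name, _extension = os.path.splitext(filename)          # point 1: drop extension
    name = re.sub(r"[ .](?:part|cd)[1-9]", "", name,
                  flags=re.IGNORECASE)                      # point 2: drop part/cd marker
    year_match = re.search(r"\((\d{4})\)", name)            # point 1: pull out "(1975)"
    year = ""
    if year_match:
        year = year_match.group(1)
        name = name[:year_match.start()] + name[year_match.end():]
    title = " ".join(name.split()).lower()                  # point 3: lower-case
    return encode_like_example(title), year                 # point 3: URL-encode


for example in (
    "Sidney Lumet - Dog Day Afternoon (1975).part1.avi",
    "Sidney Lumet - Dog Day Afternoon part1 (1975).avi",
    "Sidney Lumet - Dog Day Afternoon (1975).avi",
):
    print(simulate_buffers(example))
```

All three file name formats end up with the same pair of values in this emulation, which matches the observation that the parts are scraped as one item.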

I had to modify the default scrapers a little so that they work with my file naming. Here is what I have for now:

1. TMDB scraper (on Ubuntu: ~/.xbmc/addons/metadata.themoviedb.org/tmdb.xml)
Code:
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://api.themoviedb.org/2.1/Movie.search/$INFO[language]/xml/57983e31fb435df4df77afb854740ea9/\1+$$2&lt;/url&gt;" dest="3">
        <expression>.+%20%2d%20(.+)</expression>
    </RegExp>
</CreateSearchUrl>
There was an inner regexp and I removed it, because it does absolutely nothing. I added "+$$2" to the URL so that the search also uses the year. The TMDB public API supports this, so I don't know why the default scraper did not use it. I also used my own regexp to parse file names. I tried ".+?" instead of ".+" as suggested by bambi73, but it turned out to be too "lazy": in my tests it could capture just "D" from "Dog Day Afternoon". I'm sorry if I'm wrong here, because I'm not as strong in regexps as bambi73.
Unfortunately, TMDB appeared not to have information about some of my movies (Woody Allen - Manhatten - what the heck, is it so rare?). That's why I also used another scraper, IMDB.
EDIT: That was my typing mistake. Manhatten should be ManhattAn. And of course TMDB could find a misspelled word too, but anyway I'm happy it was found at all.
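
On the ".+?" versus ".+" question above, one plausible explanation for capturing a single letter is a lazy capture group with nothing after it to force it to grow, for example when the trailing "$" is left off. The sketch below checks this with Python's re module as a stand-in for the scraper's regex engine (an assumption, although greedy and lazy quantifiers behave the same way here). The helper name capture and the second test string with two dashes are hypothetical.

```python
import re

ENCODED = "sidney%20lumet%20%2d%20dog%20day%20afternoon"

GREEDY = r".+%20%2d%20(.+)"            # the expression used in the scraper above
LAZY_UNANCHORED = r".+?%20%2d%20(.+?)"  # lazy group, nothing forces it to grow
LAZY_ANCHORED = r".+?%20%2d%20(.+?)$"   # lazy group, but "$" forces a full match


def capture(pattern, text):
    """Return capture group 1 of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


for label, pattern in (
    ("greedy", GREEDY),
    ("lazy, no $", LAZY_UNANCHORED),
    ("lazy, with $", LAZY_ANCHORED),
):
    print(label, "->", capture(pattern, ENCODED))

# Hypothetical title that itself contains " - ": greedy splits at the last
# separator, lazy (anchored) splits at the first one.
TWO_DASHES = "a%20%2d%20b%20%2d%20c"
print(capture(GREEDY, TWO_DASHES))
print(capture(LAZY_ANCHORED, TWO_DASHES))
```

With a single " - " in the name, the greedy form and the anchored lazy form capture the same title. They only differ when the title itself contains " - ", so which one is preferable depends on the file naming.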


2. IMDB scraper (on Ubuntu: ~/.xbmc/addons/metadata.imdb.org/imdb.xml)
Code:
<CreateSearchUrl dest="3" SearchStringEncoding="iso-8859-1">
    <RegExp input="$$1" output="&lt;url&gt;http://akas.imdb.com/find?s=tt;q=\1$$4&lt;/url&gt;" dest="3">
        <RegExp input="$$2" output="%20(\1)" dest="4">
            <expression clear="yes">(.+)</expression>
        </RegExp>
        <expression noclean="1">.+%20%2d%20(.+)</expression>
    </RegExp>
</CreateSearchUrl>

Again the same regexp, but now there is also an inner one for the year. It was already there and I didn't touch it, though I find it strange to use an inner regexp that just puts "%20" and the year in parentheses into buffer $$4, when you could add "%20($$2)" to the URL directly. Anyway, the scraper works 95% of the time for me, so consider making that change yourself if you need to.
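
To make the comparison between the inner regexp and the direct "%20($$2)" concrete, here is a Python sketch that models the buffers as plain strings. It is an emulation under assumptions, not verified against XBMC: the function names are made up, and it assumes that "(.+)" simply does not match an empty $$2, so $$4 stays empty in that case.

```python
BASE_URL = "http://akas.imdb.com/find?s=tt;q="


def query_with_inner_regexp(title, year):
    """Emulate the scraper as posted: $$4 is filled only if $$2 is non-empty."""
    # "(.+)" needs at least one character, so an empty year leaves $$4 empty
    # (assumed behaviour of the inner RegExp with clear="yes").
    buffer4 = "%20(" + year + ")" if year else ""
    return BASE_URL + title + buffer4


def query_with_direct_year(title, year):
    """Emulate the simplification: append "%20($$2)" to the URL unconditionally."""
    return BASE_URL + title + "%20(" + year + ")"


title = "dog%20day%20afternoon"
print(query_with_inner_regexp(title, "1975"))
print(query_with_direct_year(title, "1975"))
print(query_with_inner_regexp(title, ""))
print(query_with_direct_year(title, ""))
```

When the file name contains a year, both variants produce the same URL in this model. They differ only for a file name without a year, where the direct variant would send an empty "()" in the query. Whether that matters for the search results is not something this sketch can answer.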

If this information is useful and somebody can point me to the corresponding Wiki page, I can add it there. Or somebody can add it for me, if I cannot access that wiki.
(This post was last modified: 2011-04-10 22:58 by nucleo.)
olympia Offline
Team-XBMC Member
Posts: 2,377
Joined: May 2008
Reputation: 30
Post: #14
Actually both (the TMDB year addition and the IMDB inner regexp) are good catches. Thank you for sharing. I will tune the official scrapers accordingly.

Not sure why the year was not added to the TMDB search URL before. One possibility is that the API did not support this yet at the time the scraper was written.