Filmdelta not fetching images
#1
Hi.
I wrote the Filmdelta scraper some years ago. To be perfectly honest, I don't remember much about how it works, or about scraper development at all. Anyway, it hasn't been fetching any images for some time now, and I thought it might be a good idea to fix it before the Frodo release. I hope there is some scraper guru in here who can explain to me what's going wrong...

What happens is that it first tries to fetch the image from themoviedb using the function GetTMDBThumbsById, sending it the Google search for the original title on imdb. Has that function changed in any way, or is it still supposed to work the same way as it did? The regexp for getting the original title from filmdelta still works.

Anyway, the second thing it does is call GetFilmdeltaThumb, which is my fallback function in case GetTMDBThumbsById doesn't return anything. It tries to fetch the low-resolution image that's on filmdelta. That doesn't work either, and I suspect it's because filmdelta changed the inline style attribute on the image. The scraper looks for style='width:px' but the page says style='max-width: 240px; max-height: 240px;'. I don't know if that's true for all films though. Would the best practice here be to go for a wildcard instead? I'm thinking something like style='[^']*'
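
As a rough sketch of what such a wildcard could look like: the img markup, the single-quoted src attribute, the attribute order, and the dest buffer below are all assumptions for illustration, not taken from the real filmdelta page or from the existing scraper.

```xml
<!-- Hypothetical sketch: the img layout and quoting are assumed, not verified against filmdelta. -->
<RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="5+">
    <!-- style='[^']*' accepts any inline style value instead of one fixed width -->
    <expression>&lt;img[^&gt;]*style='[^']*'[^&gt;]*src='([^']*)'</expression>
</RegExp>
```

If the page puts src before style, the two attribute patterns would need to be swapped.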

/Daniel
#2
Ok. I've been struggling for hours now. I really don't get how GetTMDBThumbsById and GetTMDBFanartById are supposed to work. I thought I'd shamelessly steal some code from Spiff's filmbasen scraper, since it also uses these functions. But no matter what I do, I get nothing reasonable out of it. So what am I supposed to send to them? Right now my scraper looks like this: http://pastebin.com/pQnYAzn6 which uses the functions more or less exactly as in Spiff's scraper. So what am I doing wrong?

I'm testing in latest Frodo nightly here. Could it be that they simply are broken?

/Daniel
#3
I'm about to give this up now; I just can't seem to understand how this works. Documentation on how GetTMDBThumbsById and GetTMDBFanartById work seems to be nonexistent. Furthermore, the xbmc wiki says I should use chain functions, which I guess means I should use GetTMDBThumbsByIdChain and GetTMDBFanartByIdChain instead, but I can't find anything about what that means either. What is a chain function anyway?

/Daniel
#4
In order to use those functions you need to import metadata.common.themoviedb.org in your addon.xml:

Code:
<requires>
  <import addon="xbmc.metadata" version="1.0"/>
  <import addon="metadata.common.themoviedb.org" version="2.7.0"/>
</requires>

#5
(2012-11-25, 12:17)scudlee Wrote: In order to use those functions you need to import metadata.common.themoviedb.org in your addon.xml:

I already had those imports; I have successfully used those functions in the past. Mine said version 1.0 though. I don't know what difference that makes?

Anyway, I still don't understand what to send to the functions. Do they simply require the imdb id (including tt) as argument? And what difference does it make whether I use the chain function or the plain function?

/Daniel
#6
Chain functions send the enclosed string to buffer $$1, whereas url functions retrieve the contents of the enclosed url and put that into buffer $$1.

So what happens is you have something like:
Code:
<chain function="GetTMDBThumbsByIdChain">tt12345678</chain>
that calls GetTMDBThumbsByIdChain with the string "tt12345678" in buffer $$1, which it then uses to form something like:
Code:
<url function="ParseTMDBThumbs" cache="tmdb-images-en-tt12345678.json">http://api.themoviedb.org/3/movie/tt12345678/images?api_key=57983e31fb435df4df77afb854740ea9&amp;language=en</url>
which then calls ParseTMDBThumbs with the contents of the url in $$1 (and then spits out the actual thumbs).

Presumably the reason the wiki says to use the chain functions is so that if the url changes, it only needs to be updated in one place, since it's unlikely the id needed will change. (Plus it simplifies the code in your scraper.)

I would assume you'd need to keep the version number of the shared library up to date in order for it to work.
#7
Ok, I feel really stupid now, but I can't seem to get this working no matter what I do. Since the documentation on this is rather poor, I have to look at existing scrapers. One example that does exactly what I want to do (get the thumb from tmdb based on the original title) is the Filmbasen scraper. It does it like this:

Code:
            <RegExp input="$$1" output="&lt;url function=&quot;GetTMDBThumbsById&quot;&gt;http://www.google.com/search?q=\1+%20($$4)+site:imdb.com&lt;/url&gt;" dest="5+">
                <expression noclean="1" encode="1">Originaltittel:&lt;/h5&gt;([^&lt;]*?)&lt;/li</expression>
            </RegExp>

So this scraper actually sends a google search url as argument to GetTMDBThumbsById? Or am I completely missing something here?

/Daniel
#8
I don't see how that can possibly work. By calling GetTMDBThumbsById as a url function instead of a chain, the entire contents of the google search are passed to $$1, and the code for GetTMDBThumbsById is:
Code:
    <GetTMDBThumbsById dest="4">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="4">
            <RegExp input="$$1" output="&lt;url function=&quot;ParseTMDBThumbs&quot; cache=&quot;tmdb-images-$INFO[language]-\1.json&quot;&gt;http://api.themoviedb.org/3/movie/\1/images?api_key=57983e31fb435df4df77afb854740ea9&amp;amp;language=$INFO[language]&lt;/url&gt;" dest="5">
                <expression />
            </RegExp>
            <expression noclean="1" />
        </RegExp>
    </GetTMDBThumbsById>
You can see it's clearly expecting just the id in $$1 - there's not even an expression to ensure it's a valid IMDB or TMDB id, the entire contents is just passed straight to form the url.

Have you actually tested the Filmbasen scraper? If that works, I'm as baffled as you.
#9
(2012-11-25, 18:47)scudlee Wrote: Have you actually tested the Filmbasen scraper? If that works, I'm as baffled as you.

I guess you're spot on about the problem. I just assumed it's a working scraper since it's in the official repository and it's made by master Spiff. So I've really been trying to figure out how it works...

Guess I'll have to find some other example of scraper to look at.

/Daniel
#10
Slowly getting there now. I've put together stuff that I thought would do the trick, but either imdb hates me or I've done wrong again because I get nothing out of it. Snippet:

Code:
            <RegExp input="$$4" output="\1" dest="9">
                <RegExp input="$$1" output="&lt;url&gt;http://akas.imdb.com/find?s=tt&amp;q=\1+(\2)&lt;/url&gt;" dest="4">
                    <expression>&lt;h4&gt;Originaltitel&lt;/h4&gt;[^&lt;]*&lt;h5&gt;([^&lt;]*)&lt;/h5&gt;.*?/filmarkiv/([0-9]*)/</expression>
                </RegExp>
                <expression>/title/(tt[t0-9]*)</expression>
            </RegExp>
            <RegExp input="$$9" output="&lt;chain function=&quot;GetTMDBThumbsByIdChain&quot;&gt;$$9&lt;/chain&gt;" dest="5+">
                <expression/>
            </RegExp>

So. I know the first expression works (the one that extracts original title and year). So the url created to imdb SHOULD become for example http://akas.imdb.com/find?s=tt&q=Terminator+(1984), and from that page the next expression SHOULD be able to extract the ID which should then be sent to GetTMDBThumbsByIdChain. In my debug log I find <chain function="GetTMDBThumbsByIdChain"></chain> though which I interpret as buffer 9 being empty, so I guess my imdb call doesn't work. Am I thinking wrong?

Btw, what is the difference between www.imdb.com and akas.imdb.com? According to my dns they point to the same place...

/Daniel
#11
You need to call a separate function with the url, and then have that function call GetTMDBThumbsByIdChain. You can't do it in one step, because XBMC doesn't process the urls and chains until the end of the current function.

So you'd have:
Code:
...
                <RegExp input="$$1" output="&lt;url function=&quot;CleverFunctionName&quot;&gt;http://akas.imdb.com/find?s=tt&amp;q=\1+(\2)&lt;/url&gt;" dest="5+">
                    <expression>&lt;h4&gt;Originaltitel&lt;/h4&gt;[^&lt;]*&lt;h5&gt;([^&lt;]*)&lt;/h5&gt;.*?/filmarkiv/([0-9]*)/</expression>
                </RegExp>

...

<CleverFunctionName dest="4">
    <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="4">
            <RegExp input="$$1" output="&lt;chain function=&quot;GetTMDBThumbsByIdChain&quot;&gt;\1&lt;/chain&gt;" dest="5">
                <expression>/title/(tt[t0-9]*)</expression>
            </RegExp>
            <expression/>
    </RegExp>
</CleverFunctionName>

(I have no idea on the difference between the imdb urls.)
#12
...but no matter how much you help me I can't seem to get it right. Now my function looks like this (I threw in a call to get tmdb fanart as well):

Code:
    <GetTMDBimages dest="4">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="4">
            <RegExp input="$$1" output="&lt;chain function=&quot;GetTMDBThumbsByIdChain&quot;&gt;\1&lt;/chain&gt;" dest="5">
                <expression>/title/(tt[t0-9]*)</expression>
            </RegExp>
            <expression/>
        </RegExp>
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="4+">
            <RegExp input="$$1" output="&lt;chain function=&quot;GetTMDBFanartByIdChain&quot;&gt;\1&lt;/chain&gt;" dest="5">
                <expression>/title/(tt[t0-9]*)</expression>
            </RegExp>
            <expression/>
        </RegExp>
    </GetTMDBimages>

...and when I try it my log says the following:

Code:
20:29:47 T:139756054759168   DEBUG: CurlFile::Open(0x4ddee00) http://akas.imdb.com/find?s=tt&q=2012+(2009)
20:29:49 T:139756054759168   DEBUG: scraper: GetTMDBimages returned <details>tt1190080</details><details>tt1190080</details>

...so the function gets the id but then just passes it on? What part of your instructions did I misunderstand this time?

/Daniel

#13
Absolutely my bad on this one. Serves me right for typing that code out blind without checking.

The important missing point is that you need noclean="1" on the final <expression/>, otherwise any tags get cleaned out (which is why only the ids remained):
Code:
    <GetTMDBimages dest="4">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="4">
            <RegExp input="$$1" output="&lt;chain function=&quot;GetTMDBThumbsByIdChain&quot;&gt;\1&lt;/chain&gt;" dest="5">
                <expression>/title/(tt[t0-9]*)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="4+">
            <RegExp input="$$1" output="&lt;chain function=&quot;GetTMDBFanartByIdChain&quot;&gt;\1&lt;/chain&gt;" dest="5">
                <expression>/title/(tt[t0-9]*)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetTMDBimages>

The less significant part is that you only need one <details>...</details> (I'm not sure if having two affects anything, but still...)
So something like:
Code:
    <GetTMDBimages dest="4">
        <RegExp input="$$5" output="&lt;details&gt;\1&lt;/details&gt;" dest="4">
            <RegExp input="$$1" output="&lt;chain function=&quot;GetTMDBThumbsByIdChain&quot;&gt;\1&lt;/chain&gt;" dest="5">
                <expression>/title/(tt[t0-9]*)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;chain function=&quot;GetTMDBFanartByIdChain&quot;&gt;\1&lt;/chain&gt;" dest="5+">
                <expression>/title/(tt[t0-9]*)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetTMDBimages>
Should be enough.
#14
(2012-11-26, 22:12)scudlee Wrote: The missing important point is that you need a noclean="1" on the final <expression/> otherwise any tags get cleaned out (hence why just the ids remained.):

Ah, thanks! You're the man! Can anyone tell me why there are no thanks buttons in this forum?

Now everything works for movies with just one word in the title. So I need to replace all spaces in the original title with %20. But I plan to figure this one out myself.

/Daniel

edit: Straightened it out myself (a bit proud there ;-). Now if I just manage to get the version in the official repository updated everyone's happy :-)
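
For anyone reading later, one possible way to handle titles with spaces (not necessarily what ended up in the released scraper) is the encode attribute that the Filmbasen snippet quoted earlier in this thread also uses. The sketch below reuses the expression and the GetTMDBimages function name from the posts above; everything else is an assumption.

```xml
<!-- Sketch only: encode="1" asks XBMC to URL-encode capture group 1 (the title)
     before it is substituted into the output url. -->
<RegExp input="$$1" output="&lt;url function=&quot;GetTMDBimages&quot;&gt;http://akas.imdb.com/find?s=tt&amp;q=\1+(\2)&lt;/url&gt;" dest="5+">
    <expression encode="1">&lt;h4&gt;Originaltitel&lt;/h4&gt;[^&lt;]*&lt;h5&gt;([^&lt;]*)&lt;/h5&gt;.*?/filmarkiv/([0-9]*)/</expression>
</RegExp>
```

Only group 1 is encoded here, since the year in group 2 consists of digits only.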
#15
The new version with working images is now in official repository. Everyone reading this, please test it (version 1.1.0) and tell me if anything is not working!

/Daniel
