XBMC Community Forum
German IMDB scraper, please test it and give feedback - Printable Version

+---- Thread: German IMDB scraper, please test it and give feedback (/showthread.php?tid=75121)



- Nicezia - 2010-06-06 09:11

olympia Wrote:Yes, it's only useful when you have an XBMC-compliant external nfo to import from. In any case, you couldn't scrape this info from anywhere.


Order doesn't matter


Yes, if this option is enabled in XBMC. But obviously this is, again, information that you couldn't scrape from a web site. These tags exist for nfo files because you might not want XBMC to do the extraction from the media file itself, for example when you use an external nfo manager for that purpose and want XBMC to import the data it generates.


I don't mean to correct you here, but I have written a scraper for AEBN that imports the movie series name as a set, and XBMC parses it from the scraper just fine, so set can be used with scraping as well.

Any tag that can be exported from XBMC (do an export on a small library for an example) can be imported by XBMC from a scraper or an nfo.
However, you are quite right that it's best to let XBMC or an external nfo manager handle fileinfo.


- Nicezia - 2010-06-06 09:24

Eisbahn Wrote:- importing up to 6 genres (9 easy possible)


If you used a separate nested RegExp, you could import any number of genres.


For instance, say something is formatted like this:
Code:
<info-genre>Genres: Western, Comedy, Whatever, Else</info-genre>

First copy the part that you need to a buffer (8 in this case):
Code:
<RegExp input="$$1" output="\1" dest="8">
    <expression>&lt;info-genre&gt;Genres: ([^&lt;]*)&lt;/info-genre&gt;</expression>
</RegExp>

Then make that a child of a regular expression that repeatedly finds the comma-separated genres (taking its input from the string you copied to $$8),

so that the end product is something like this

Code:
<RegExp input="$$8" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="whatever buffer you're collecting details in">
    <RegExp input="$$1" output="\1" dest="8">
        <expression>&lt;info-genre&gt;Genres: ([^&lt;]*)&lt;/info-genre&gt;</expression>
    </RegExp>
    <!-- [^,]+ captures one whole genre; the ", " separator is optional so the last genre matches too -->
    <expression repeat="yes" trim="1">([^,]+)(?:, )?</expression>
</RegExp>
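
The two-stage idea can be tried outside XBMC. The following is only a rough Python sketch of the same logic, not how XBMC's scraper parser is implemented; the names page, buffer8, genres and details are made up for illustration.

Code:
```python
import re
from xml.sax.saxutils import escape

page = "<info-genre>Genres: Western, Comedy, Whatever, Else</info-genre>"

# Stage 1: copy the comma-separated list into a buffer (the role of $$8).
outer = re.search(r"<info-genre>Genres: ([^<]*)</info-genre>", page)
buffer8 = outer.group(1) if outer else ""

# Stage 2: match one genre at a time (the role of repeat="yes") and
# strip surrounding whitespace (the role of trim="1").
genres = [m.group(1).strip() for m in re.finditer(r"([^,]+)(?:, )?", buffer8)]

# Wrap every genre in its own tag, as the output attribute does.
details = "".join(f"<genre>{escape(g)}</genre>" for g in genres)
print(details)
```

Because the second stage loops over the buffer, the number of genres is not limited by the number of capture groups in a single expression.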



- vdrfan - 2010-06-06 11:39

Thanks for the good information, Nicezia! Note that as of SVN revision r30825, the way we handle the runtime/duration has changed slightly.

The following VideoInfoTags (streamdetails) are extracted from the actual media file: codec, aspect, width, height and duration (new).
To make sure the runtime is shown properly, the scraper should return the runtime as a number of minutes only (numeric). This value is used when the metadata extraction is disabled and/or we somehow fail to extract it.
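
As a rough illustration of "minutes only", here is a small Python sketch that reduces a scraped string such as "90 min" to its digits. The helper name runtime_minutes is hypothetical, and the sketch assumes the page already states the runtime in minutes; a value like "1 h 30 min" would need extra handling.

Code:
```python
import re

def runtime_minutes(raw: str) -> str:
    """Return the first run of digits in raw, or "" if there is none."""
    match = re.search(r"\d+", raw)
    return match.group(0) if match else ""

print(runtime_minutes("90 min"))
```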


- Nicezia - 2010-06-06 12:01

vdrfan Wrote:Thanks for the good information, Nicezia! Note that as of SVN revision r30825, the way we handle the runtime/duration has changed slightly.

The following VideoInfoTags (streamdetails) are extracted from the actual media file: codec, aspect, width, height and duration (new).
To make sure the runtime is shown properly, the scraper should return the runtime as a number of minutes only (numeric). This value is used when the metadata extraction is disabled and/or we somehow fail to extract it.

Noted, and I will adjust the ScraperXML code accordingly.

@vdrfan, I also noticed there are a few other tags not mentioned anywhere else (country, sorttitle, epbookmark, originaltitle), and that premiered, though taken from the nfo/scraper, doesn't seem to be stored in the database at all (at least in the last version I'm basing my work on, which is from before the add-on merge), so this info is lost when importing the file, if it's even provided.

Are these extra tags deprecated tags that haven't been removed from the code, or added tags (I'm only just now getting to a point where I can read C++ code as well as CSharp)? And is premiered getting lost from the database an oversight?


- Eisbahn - 2010-06-06 23:59

Hi,

Sorry for the late reply, but we had a wonderful day relaxing with my wife and kids. I've done a little bit of work, though, and have some new questions:
What is the content of the tag <outline>? More info than the tagline but not as much as in plot or plotsummary? I couldn't find it... At the moment I put a "short" plot in this tag (the one given on the main overview page of IMDB).
What about <certification>? Is it deprecated, with only MPAA used instead?
Because of different DVD releases, I've got more than one mpaa tag for "The Rock" (IMDB-ID = tt0117500), e.g. 12 years heavily cut, 16 years cut, 18 years uncut (it's not a single instance).
Is <originaltitle> a subset of <sorttitle>, e.g.
Code:
<originaltitle>
    movie A
</originaltitle>
<sorttitle>
    movie A has a nice long name
</sorttitle>
<sorttitle>
    movie A  <!-- is this included or not? -->
</sorttitle>
<sorttitle>
    movie A has a short name as well
</sorttitle>
<sorttitle>
    movie A has a french name
</sorttitle>
I think this depends on the user's preference...

What about the function GetIMDBThumbs? Does it fetch all pics from IMDB, or only the posters (and maybe product)? What are the constants SX, SY, SX$INFO and SY$INFO (or what are these)? Why is the function not repeated (I think users want more than one thumbnail)? I don't know exactly what this function should do. Is it pointing to <http://www.imdb.de/title/tt0499549/mediaindex?refine=poster>? Any help?

How can I call a site without the "&" being cleaned to "&amp;"? At the moment I use a function which removes the &amp; and puts an & into the links :=( The "no HTML clean" tag does not work at all...


ok = my scraper gathers the corresponding info
n/a = not for import use, or no info given on the German IMDB site
stc = still to come => may be implemented in a future release (meaning: I think it's a useless feature...)
Code:
<movie>
ok         <id>tt0432337</id>
ok         <title>Who knows</title>
ok         <originaltitle>Who knows for real</originaltitle>
ok         <sorttitle>Who knows 1</sorttitle>
n/a        <set>Who knows trilogy</set>
ok         <rating>6.100000</rating>
ok         <votes>50</votes>
ok         <year>2008</year>
ok         <top250>0</top250>
ok         <certification>MPAA for different countries</certification>
ok         <mpaa>Not available</mpaa>
ok         <studio>my camera</studio>
stc        <outline>A look at the role of the Buckeye State in the 2004 Presidential Election.</outline>
ok         <plot>A look at the role of the Buckeye State in the 2004 Presidential Election.</plot>
n/a        <tagline></tagline>
ok         <runtime>90 min</runtime>
ok         <thumb>http://ia.ec.imdb.com/media/imdb/01/I/25/65/31/10f.jpg</thumb>   /* Link broken/not working. Status 500 from server, using thumbs from MovieposterDB */
n/a        <playcount>0</playcount>
n/a        <watched>false</watched>
n/a        <filenameandpath>c:\Dummy_Movie_Files\Movies\...So Goes The Nation.avi</filenameandpath>
stc        <trailer></trailer>
ok         <genre></genre>
ok         <credits></credits>
stc        <premiered>single instance/optional</premiered>
n/a        <fileinfo>
n/a           <streamdetails>
n/a              <video>
n/a                 <codec>h264</codec>
n/a                 <aspect>2.35</aspect>
n/a                 <width>1920</width>
n/a                 <height>816</height>
n/a              </video>
n/a              <audio>
n/a                 <codec>ac3</codec>
n/a                 <language>eng</language>
n/a                 <channels>6</channels>
n/a              </audio>
n/a              <subtitle>
n/a                 <language>spa</language>
n/a              </subtitle>
n/a           </streamdetails>
n/a        </fileinfo>
ok         <director>Adam Del Deo</director>
ok         <actor>
ok            <thumb></thumb>
ok            <name></name>
ok            <role></role>
ok         </actor>
       </movie>
Actually most things are working pretty well; only thumbs and pictures are a bit unclear to me.

What format should <premiered> have? A string with the month written out, or a date?

Eisbahn


- Nicezia - 2010-06-07 00:44

Eisbahn Wrote:What about <certification>? Is it deprecated, with only MPAA used instead?
Because of different DVD releases, I've got more than one mpaa tag for "The Rock" (IMDB-ID = tt0117500), e.g. 12 years heavily cut, 16 years cut, 18 years uncut (it's not a single instance).

Certification is still in there; sorry I left it out of my info.

Eisbahn Wrote:What about the function GetIMDBThumbs? Does it fetch all pics from IMDB, or only the posters (and maybe product)? What are the constants SX, SY, SX$INFO and SY$INFO (or what are these)? Why is the function not repeated (I think users want more than one thumbnail)? I don't know exactly what this function should do. Is it pointing to <http://www.imdb.de/title/tt0499549/mediaindex?refine=poster>? Any help?

GetIMDBThumbs only grabs the posters. The actor thumbs are grabbed with the rest of the actor info.

SX$INFO is nothing in itself; however, the $INFO part has meaning. What you left out was [imdbscale]: in its entirety, $INFO[imdbscale] is a placeholder for whatever value the user has selected in the settings for the size of the images to be downloaded (the setting with the id "imdbscale"). $INFO[<settingid>] simply tells the scraper "replace this placeholder with the text selected in the setting with the id <settingid>".
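
The substitution described here can be pictured with a short Python sketch. This is not XBMC's code: expand_info and the settings dictionary are hypothetical, and returning an empty string for an unknown setting id is an assumption made only for this example.

Code:
```python
import re

def expand_info(template: str, settings: dict[str, str]) -> str:
    """Replace each $INFO[id] with settings[id]; unknown ids become ""."""
    return re.sub(
        r"\$INFO\[([^\]]+)\]",
        lambda m: settings.get(m.group(1), ""),
        template,
    )

# The commented-out output string from the stock GetIMDBThumbs function.
template = r"\1_SX$INFO[imdbscale]_SY$INFO[imdbscale]_\2"
print(expand_info(template, {"imdbscale": "512"}))
```

With the setting at 512, the placeholder form expands to the same text as a hard-coded \1_SX512_SY512_\2.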

Eisbahn Wrote:How can I call a site without the "&" being cleaned to "&amp;"? At the moment I use a function which removes the &amp; and puts an & into the links :=( The "no HTML clean" tag does not work at all...

Ampersands should be cleaned up by default (if you're looking at the source code of XBMC, see ScraperParser::ParseExpression, where it is commented "nasty hack #1").

Double the ampersand.

Example: http://foo.com/search.php?q=foo&amp;amp;s=foo2

The effect is that &amp;amp; becomes &amp;
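
The effect of a single decoding pass can be checked with Python's html.unescape. This only illustrates the entity arithmetic; that XBMC's clean-up behaves like exactly one such pass is an assumption here, not something confirmed in this thread.

Code:
```python
from html import unescape

single = "http://foo.com/search.php?q=foo&amp;s=foo2"
doubled = "http://foo.com/search.php?q=foo&amp;amp;s=foo2"

# One decoding pass turns &amp; into a bare &.
print(unescape(single))
# One decoding pass turns &amp;amp; into &amp;, so one level of escaping survives.
print(unescape(doubled))
```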

Eisbahn Wrote:What format should <premiered> have? A string with the month written out, or a date?

Premiered is simply imported/exported as a string, so it has no localization and/or globalization format, and it doesn't really matter.
(But as of now I have no idea IF it's stored in the database, and if it is, no idea WHERE, because looking in video34.db the premiered value seems to be nowhere.)


- spiff - 2010-06-07 10:25

Nicezia Wrote:@vdrfan, I also noticed there are a few other tags not mentioned anywhere else (country, sorttitle, epbookmark, originaltitle), and that premiered, though taken from the nfo/scraper, doesn't seem to be stored in the database at all (at least in the last version I'm basing my work on, which is from before the add-on merge), so this info is lost when importing the file, if it's even provided.

Are these extra tags deprecated tags that haven't been removed from the code, or added tags (I'm only just now getting to a point where I can read C++ code as well as CSharp)? And is premiered getting lost from the database an oversight?

Added. premiered is only used in relation to tvshows. country and sorttitle should be self-explanatory; epbookmark is the episode bookmark in multi-episode files (i.e. where does episode 2 start).


- Eisbahn - 2010-06-07 22:38

Nearly everything works now; only the thumbs from IMDB are not working at all...
In the main scraper I use the following RegExps:
Code:
        <RegExp input="$$2" output="&lt;url cache=&quot;$$2-posters.html&quot; function=&quot;GetIMDBThumbs&quot;&gt;$$3mediaindex?refine=poster&lt;/url&gt;" dest="5+">
            <expression/>
        </RegExp>
        <RegExp input="$$2" output="&lt;url cache=&quot;$$2-product.html&quot; function=&quot;GetIMDBThumbs&quot;&gt;$$3mediaindex?refine=product&lt;/url&gt;" dest="5+">
            <expression/>
        </RegExp>
Resulting in the URLs
Code:
http://www.imdb.com/title/tt0499549/mediaindex?refine=poster
http://www.imdb.com/title/tt0499549/mediaindex?refine=product

The function is
Code:
<GetIMDBThumbs dest="5">
    <RegExp input="$$6" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
        <!--\1_SX$INFO[imdbscale]_SY$INFO[imdbscale]_\2-->
        <RegExp input="$$1" output="\1_SX512_SY512_\2" dest="4">
            <expression repeat="yes" noclean="1,2">&lt;img alt=&quot;&quot; height=&quot;100&quot; width=&quot;100&quot;  src=(.*?)_S.*?(.jpg)&quot;</expression>
        </RegExp>
        <RegExp input="$$4" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="6">
            <expression repeat="yes" noclean="1">(.*?_SX[0-9]+_SY[0-9]+_.jpg)</expression>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</GetIMDBThumbs>
If I do it by hand, I can see nice pics (why the hell should they be crippled to "square format"), e.g. <http://ia.media-imdb.com/images/M/MV5BMTYxMzg0NzYwOV5BMl5BanBnXkFtZTcwMDc3MzEzMw@@._V1._CR0,0,388,388_SX512_SY512_.jpg>. But when I look in XBMC, I see only placeholders (a white "Polaroid" with a black square). What went wrong?
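
One way to see which URL a function like this produces is to replay its two expressions outside XBMC. The Python sketch below does that on a made-up <img> tag; the real IMDB markup may differ, and Python's regex engine is only a stand-in for XBMC's. The names html, stage1, buffer4, stage2 and thumbs are illustrative.

Code:
```python
import re

# Hypothetical <img> tag; the real IMDB markup may differ.
html = ('<img alt="" height="100" width="100"  '
        'src="http://ia.media-imdb.com/images/M/EXAMPLE@@._V1._CR0,0,388,388_SS100_.jpg"')

# Inner expression from GetIMDBThumbs with the XML entities decoded.
stage1 = re.compile(r'<img alt="" height="100" width="100"  src=(.*?)_S.*?(.jpg)"')
buffer4 = "".join(m.expand(r"\1_SX512_SY512_\2") for m in stage1.finditer(html))

# Second expression: pick each rewritten URL out of the buffer.
stage2 = re.compile(r"(.*?_SX[0-9]+_SY[0-9]+_.jpg)")
thumbs = [m.group(1) for m in stage2.finditer(buffer4)]

for url in thumbs:
    print(repr(url))
```

On this sample the captured URL still begins with the opening double quote, because the pattern has no quote between src= and the first group. Whether the same happens inside XBMC with the real page is not confirmed here, but it may be worth checking.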

Eisbahn


- Nicezia - 2010-06-10 21:42

Eisbahn Wrote:(why the hell should they be crippled to "square format"), e.g. <http://ia.media-imdb.com/images/M/MV5BMTYxMzg0NzYwOV5BMl5BanBnXkFtZTcwMDc3MzEzMw@@._V1._CR0,0,388,388_SX512_SY512_.jpg>.

It really isn't "crippled" to square; the image is scaled by IMDB in relation to the width.


- Eisbahn - 2010-06-12 12:24

Nicezia Wrote:It really isn't "crippled" to square; the image is scaled by IMDB in relation to the width.

Hmmm, the original image is <http://www.imdb.de/media/rm3073674240/tt0499549>, and all thumbs are cropped to squares, e.g. <http://ia.media-imdb.com/images/M/MV5BMTYxMzg0NzYwOV5BMl5BanBnXkFtZTcwMDc3MzEzMw@@._V1._CR0,0,388,388_SX512_SY512_.jpg>. But that's not a problem of XBMC or the scraper, it's IMDB.
But the main problem still exists: the images are not shown in XBMC. Is there any way to check which URL is generated by the scraper and used for the pic in XBMC?
However, I think I could release v1.0 this weekend. It gathers nearly all info in a nice format from IMDB and (depending on user preference) covers and plot from partner sites.

Eisbahn