German IMDB scraper, please test it and give feedback

Eisbahn Offline
Junior Member
Posts: 43
Joined: Jun 2010
Reputation: 2
Post: #1
new version online: http://github.com/Eisbahn/IMDb_de-Scraper/zipball/3.0.5

v3.0.4 and v2.0.2 out now: <http://github.com/Eisbahn/IMDb_de-Scraper/>

Hello,

I just finished a first version after some work: a German scraper for IMDb (in German). The current version 1.0.0 is at http://ul.to/v5d9j0, ready for testing. It grabs every available tag; only the trailer is not implemented (because for me it's a useless feature). Please feel free to report bugs or issues.
The latest version 2.0.0 for XBMC v9.11 can be found here:
<http://eisbahn.ohost.de/>

What is _not_ working:
<premiered>premiere date</premiered> is not imported into or exported from XBMC
<aired>???</aired> only for TV shows/series?
<set>???</set> don't know what this is
<artist>???</artist> how does it differ from actor?
<status>???</status> don't know what this is
<certification>age rating for all countries except Germany</certification> is not imported into or exported from XBMC
<sorttitle>alternative film titles</sorttitle> only the first title is imported into or exported from XBMC
<code>???</code> don't know what this is; I think it's the codec, so it makes no sense to import anything into this field
<trailer>Trailer</trailer> pointless for me because the whole DVD is present in XBMC

Any hints about the broken tags are highly welcome.

Code:
<movie>
ok         <id>tt0432337</id>
ok         <title>Who knows</title>
ok         <originaltitle>Who knows for real</originaltitle>
ok         <sorttitle>Who knows 1</sorttitle>
n/a        <set>Who knows trilogy</set>
ok         <rating>6.100000</rating>
ok         <votes>50</votes>
ok         <year>2008</year>
ok         <top250>0</top250>
ok         <certification>MPAA for different countries</certification>
ok         <mpaa>Not available</mpaa>
ok         <studio>my camera</studio>
ok         <outline>A look at the role of the Buckeye State in the 2004 Presidential Election.</outline>
ok         <plot>A look at the role of the Buckeye State in the 2004 Presidential Election.</plot>
ok         <tagline></tagline>
ok         <runtime>90 min</runtime>
ok         <thumb>http://ia.ec.imdb.com/media/imdb/01/I/25/65/31/10f.jpg</thumb>
n/a        <playcount>0</playcount>
n/a        <watched>false</watched>
n/a        <filenameandpath>c:\Dummy_Movie_Files\Movies\...So Goes The Nation.avi</filenameandpath>
stc        <trailer></trailer>
ok         <genre></genre>
ok         <credits></credits>
ok         <premiered>single instance/optional</premiered>
n/a        <fileinfo>
n/a           <streamdetails>
n/a              <video>
n/a                 <codec>h264</codec>
n/a                 <aspect>2.35</aspect>
n/a                 <width>1920</width>
n/a                 <height>816</height>
n/a              </video>
n/a              <audio>
n/a                 <codec>ac3</codec>
n/a                 <language>eng</language>
n/a                 <channels>6</channels>
n/a              </audio>
n/a              <subtitle>
n/a                 <language>spa</language>
n/a              </subtitle>
n/a           </streamdetails>
n/a        </fileinfo>
ok         <director>Adam Del Deo</director>
ok         <actor>
ok            <thumb></thumb>
ok            <name></name>
ok            <role></role>
ok         </actor>
       </movie>

Regards, Eisbahn
(This post was last modified: 2010-09-19 17:00 by Eisbahn.)
Spaggi Offline
Senior Member
Posts: 179
Joined: May 2010
Reputation: 0
Post: #2
Great work! Will test on the weekend :)
Eisbahn Offline
Junior Member
Posts: 43
Joined: Jun 2010
Reputation: 2
Post: #3
Because I'm new to XBMC: which tags are supported/should be provided by a scraper?
Could you give me an overview of mandatory and optional tags?

Regards,

Eisbahn
vdrfan Offline
Team-XBMC Developer
Posts: 2,891
Joined: Jan 2008
Reputation: 8
Location: Germany
Post: #4
Eisbahn Wrote:Because I'm new to XBMC: which tags are supported/should be provided by a scraper?
Could you give me an overview of mandatory and optional tags?

Regards,

Eisbahn

Check out the other scrapers. The imdb.com scraper is pretty feature-complete.

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules
For troubleshooting and bug reporting please make sure you read this first.
donabi Offline
Senior Member
Posts: 297
Joined: Apr 2006
Reputation: 3
Location: germany
Post: #5
Well, that differs very much.
Some users "need" studio tags to have fancy icons in the skin, or the narrator.
Others, like me, just need things like runtime, year, actors, FSK (mpaa) and ONE genre.
The original imdb scraper gets a lot of genre tags, which makes the genre filter useless.

P.S.:
We would like to see you at the German xbmc.de

http://www.xbmcnerds.com - german xbmc community
Eisbahn Offline
Junior Member
Posts: 43
Joined: Jun 2010
Reputation: 2
Post: #6
@vdrfan:
Hmmm, sorry. Is there a spec showing which tags are mandatory/optional? If not, how can I figure out which tags are supported? The imdb.com scraper fetches no info about sound, subtitles or video format (if I looked correctly), yet in several screenshots I could see info about these things... So the answer is "please reverse-engineer it, because everybody can implement tags however they like"? That is a bit counterproductive and looks like quick-and-dirty hacking without any concept. Is this the XBMC style?
What about:
Code:
<details>
    <title></title>
    <year></year>
    <director></director>
    <top250></top250>
    <mpaa></mpaa>
    <tagline></tagline>
    <runtime></runtime>
    <thumb></thumb>
    <credits></credits>
    <rating></rating>
    <votes></votes>
    <genre></genre>
    <actor>
        <name></name>
        <role></role>
    </actor>
    <outline></outline>
    <plot></plot>
</details>

@donabi: Cutting some info away is not a real problem and is done in a few seconds. But gathering all possible things is a bit more complicated. So first I would like to have a scraper which gets all the info.
If you have a description of the allowed tags, please provide it: is the order/sequence relevant, which tags are supported, what format is expected, and so on. If the German board has active members, why not. But to be honest, I think my active work will be over once the scraper is done :=(

@all: Where can I find out which tags are supported by XBMC? Whether the skins show the info doesn't matter at all; I think a "good scraper" should gather as much as possible. For the result of a scraper: is the order/sequence relevant, which tags are supported, what format is expected, and so on? So far all I've done is reverse engineering, but I don't think that's the right way...

Eisbahn
olympia Offline
Team-XBMC Member
Posts: 2,495
Joined: May 2008
Reputation: 32
Post: #7
As for the starting point:
http://lmgtfy.com/?q=xbmc+nfo

I think you are looking for something like the first result?

Other than that, I am not sure you can seriously call it "reverse engineering" to just have a look at which tags other scrapers use.
Eisbahn Offline
Junior Member
Posts: 43
Joined: Jun 2010
Reputation: 2
Post: #8
@olympia:
Great, I found Google. If you know the right words and do not type "scraper, tags, xbmc" or something like that as a newbie, it really works. If my questions are so easy, why are you the only one who answers? I think it's a bit frustrating for both of us: for you as an expert and for me as a new user...
- The set tag is useless for a standalone XBMC because you cannot edit the tag before importing, so a scraper should not use it. Am I right?
- What about the order? Is it relevant? It seems not to be (looking at your nfo and the imdb.com output).
- Is the fileinfo imported by XBMC by analysing the video file on its own, without interaction?

[edit] new version:
- year gets imported if quartal is added, like in "Insomnia (2002)"
- importing up to 6 genres (9 would easily be possible)
- trimming of spaces
=> to come: all tags like in the nfo
[/edit]
(This post was last modified: 2010-06-05 23:08 by Eisbahn.)
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #9
Eisbahn Wrote:@vdrfan:
Hmmm, sorry. Is there a spec showing which tags are mandatory/optional? If not, how can I figure out which tags are supported? The imdb.com scraper fetches no info about sound, subtitles or video format (if I looked correctly), yet in several screenshots I could see info about these things... So the answer is "please reverse-engineer it, because everybody can implement tags however they like"? That is a bit counterproductive and looks like quick-and-dirty hacking without any concept. Is this the XBMC style?
What about:
Code:
<details>
    <title></title>
    <year></year>
    <director></director>
    <top250></top250>
    <mpaa></mpaa>
    <tagline></tagline>
    <runtime></runtime>
    <thumb></thumb>
    <credits></credits>
    <rating></rating>
    <votes></votes>
    <genre></genre>
    <actor>
        <name></name>
        <role></role>
    </actor>
    <outline></outline>
    <plot></plot>
</details>

@donabi: Cutting some info away is not a real problem and is done in a few seconds. But gathering all possible things is a bit more complicated. So first I would like to have a scraper which gets all the info.
If you have a description of the allowed tags, please provide it: is the order/sequence relevant, which tags are supported, what format is expected, and so on. If the German board has active members, why not. But to be honest, I think my active work will be over once the scraper is done :=(

@all: Where can I find out which tags are supported by XBMC? Whether the skins show the info doesn't matter at all; I think a "good scraper" should gather as much as possible. For the result of a scraper: is the order/sequence relevant, which tags are supported, what format is expected, and so on? So far all I've done is reverse engineering, but I don't think that's the right way...

Eisbahn

All tags are optional, but I would say it's best that at least the TITLE is supplied.

Code:
<details>
    <title>single instance/Required</title>
    <id>single instance/optional</id>
    <studio>single instance/optional</studio>
    <year>single instance/optional</year>
    <director>multiple instance/optional</director>
    <top250>single instance/optional</top250>
    <mpaa>single instance/optional</mpaa>
    <tagline>single instance/optional</tagline>
    <runtime>single instance/optional</runtime>
    <thumb>multiple instance/optional</thumb>
    <credits></credits>
    <rating>single instance/optional</rating>
    <votes>single instance/optional</votes>
    <genre>multiple instance/optional</genre>
    <actor>
        <name></name>
        <thumb></thumb>
        <role></role>
    </actor>
    <outline>single instance/optional</outline>
    <plot>single instance/optional</plot>
    <premiered>single instance/optional</premiered>
    <set>multiple instance/optional</set>
    <trailer>multiple instance/optional</trailer>
    <streamdetails>
        <audio>
            <codec></codec>
            <channels></channels>
        </audio>
        <video>
            <codec></codec>
            <height></height>
            <width></width>
        </video>
        <subtitle>
            <language></language>
        </subtitle>
    </streamdetails>
</details>

Of course it goes without saying that actor, audio (inside streamdetails), video (inside streamdetails) and subtitle (inside streamdetails) are multiple instance and optional.
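
To illustrate the single/multiple instance distinction, here is a minimal Python sketch of how a consumer might read such a <details> block. This is not XBMC code: the function name parse_details, the sample data and the "last occurrence wins" rule for single-instance tags are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Tags that may appear more than once, per the list above.
MULTI_TAGS = {"director", "thumb", "genre", "set", "trailer"}

# Hypothetical scraper output, used only for this illustration.
SAMPLE = """<details>
    <title>Who knows</title>
    <year>2008</year>
    <genre>Western</genre>
    <genre>Comedy</genre>
    <director>Adam Del Deo</director>
    <actor>
        <name>Some Actor</name>
        <role>Some Role</role>
    </actor>
</details>"""


def parse_details(xml_text):
    """Collect scraper output into a dict; the order of the tags is irrelevant."""
    root = ET.fromstring(xml_text)
    info = {"actors": []}
    for child in root:
        text = (child.text or "").strip()
        if child.tag == "actor":
            # Each <actor> block carries its own name/thumb/role children.
            info["actors"].append(
                {sub.tag: (sub.text or "").strip() for sub in child}
            )
        elif child.tag in MULTI_TAGS:
            # Multiple-instance tags are collected into a list.
            info.setdefault(child.tag, []).append(text)
        elif len(child) == 0:
            # Single-instance leaf tag: the last occurrence wins (assumption).
            info[child.tag] = text
    return info


details = parse_details(SAMPLE)
print(details["title"], details["genre"])
```

Nested containers such as streamdetails are skipped here on purpose, since that information is better taken from the media file or an nfo manager.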

ScraperXML Open Source Web Scraper Library compatible with XBMC XML Scrapers


I Suck, and if you act now by sending only $19.95 and a self addressed stamped envelop, so can you!

[Image: teamumx_sigline.png]
(This post was last modified: 2010-06-06 11:58 by Nicezia.)
olympia Offline
Team-XBMC Member
Posts: 2,495
Joined: May 2008
Reputation: 32
Post: #10
Eisbahn Wrote:- The set tag is useless for a standalone XBMC because you cannot edit the tag before importing, so a scraper should not use it. Am I right?
Yes, it's only useful when you have an XBMC-compliant external nfo to import from. In any case, you couldn't scrape this info from anywhere.

Eisbahn Wrote:- What about the order? Is it relevant? It seems not to be (looking at your nfo and the imdb.com output).
Order doesn't matter.

Eisbahn Wrote:- Is the fileinfo imported by XBMC by analysing the video file on its own, without interaction?
Yes, if this option is enabled in XBMC. But obviously this is again info that you couldn't scrape from a web site. These tags exist for nfo files, because you might not want XBMC to do the extraction from the media file itself: you may use an external nfo manager for that purpose and want XBMC to import the data it generates.
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #11
olympia Wrote:Yes, it's only useful when you have an XBMC-compliant external nfo to import from. In any case, you couldn't scrape this info from anywhere.


Order doesn't matter.


Yes, if this option is enabled in XBMC. But obviously this is again info that you couldn't scrape from a web site. These tags exist for nfo files, because you might not want XBMC to do the extraction from the media file itself: you may use an external nfo manager for that purpose and want XBMC to import the data it generates.


I don't mean to be correcting you here, but I have written a scraper for AEBN that imports the movie series name as set, and XBMC parses it from the scraper just fine, so set can be used with scraping as well.

Any tag that can be exported from XBMC (do an export on a small library for an example) can be imported by XBMC from a scraper or an nfo...
However, you are quite right that it's best to let XBMC or an external nfo manager handle fileinfo.

(This post was last modified: 2010-06-06 09:48 by Nicezia.)
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #12
Eisbahn Wrote:- importing up to 6 genres (9 easy possible)


If you used a separating nested RegExp, you could import any number of genres.


For instance, say something is formatted like this:
Code:
<info-genre>Genres: Western, Comedy, Whatever, Else</info-genre>

First copy the part that you need to a buffer (8 in this case):
Code:
<RegExp input="$$1" output="\1" dest="8">
    <expression>&lt;info-genre&gt;Genres: ([^&lt;]*)&lt;/info-genre&gt;</expression>
</RegExp>

Then parent that with a regular expression that repeatedly finds the comma-separated genres (getting its input from the string you copied to $$8),

so that the end product is something like this:

Code:
<RegExp input="$$8" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="whatever buffer you're collecting details in">
    <RegExp input="$$1" output="\1" dest="8">
        <expression>&lt;info-genre&gt;Genres: ([^&lt;]*)&lt;/info-genre&gt;</expression>
    </RegExp>
    <expression repeat="yes" trim="1">([^,]+)</expression>
</RegExp>
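
For anyone who wants to try the two-step idea outside of XBMC, here is a rough Python sketch of the same logic: first copy the genre line into a "buffer", then repeatedly match the comma-separated entries. It only mimics the approach and is not how the scraper engine itself runs; the names PAGE, extract_genres and to_genre_tags are made up for this example.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical page fragment, same shape as the example above.
PAGE = "<info-genre>Genres: Western, Comedy, Whatever, Else</info-genre>"


def extract_genres(page):
    """Return every genre found in the <info-genre> line of the page."""
    # Step 1: copy the part we need into a "buffer" (like dest="8").
    match = re.search(r"<info-genre>Genres: ([^<]*)</info-genre>", page)
    if match is None:
        return []
    buffer = match.group(1)
    # Step 2: repeatedly match each comma-separated entry and trim it
    # (like repeat="yes" with trimming).
    return [g.strip() for g in re.findall(r"[^,]+", buffer) if g.strip()]


def to_genre_tags(genres):
    """Wrap each genre in its own <genre> tag, escaping XML specials."""
    return "".join("<genre>%s</genre>" % escape(g) for g in genres)


genres = extract_genres(PAGE)
print(genres)
print(to_genre_tags(genres))
```

Because the second step simply repeats until the buffer is exhausted, there is no fixed upper limit such as 6 or 9 genres.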

(This post was last modified: 2010-06-06 12:08 by Nicezia.)
vdrfan Offline
Team-XBMC Developer
Posts: 2,891
Joined: Jan 2008
Reputation: 8
Location: Germany
Post: #13
Thanks for the good information, Nicezia! Note that as of SVN revision r30825 the way we handle the runtime/duration has changed slightly.

The following VideoInfoTags (streamdetails) are extracted from the actual media file: codec, aspect, width, height and duration (new).
To make sure the runtime is shown properly, make sure the scraper returns the runtime in minutes only (numeric). This value is used if the metadata extraction is disabled and/or we somehow fail to extract it.
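
As a rough illustration of "minutes only (numeric)", here is a small Python sketch that normalises a scraped runtime string such as "90 min" to a plain number. It is not part of XBMC; the helper name runtime_minutes and the accepted input formats are assumptions for this example, and a real scraper would do the equivalent with its own regular expressions.

```python
import re

# Patterns for an hours part ("1h", "2 hours") and a minutes part ("30min", "90 m").
_HOURS = re.compile(r"(\d+)\s*h", re.IGNORECASE)
_MINUTES = re.compile(r"(\d+)\s*m", re.IGNORECASE)


def runtime_minutes(raw):
    """Return the runtime as an integer number of minutes, or None."""
    hours = _HOURS.search(raw)
    minutes = _MINUTES.search(raw)
    if hours is None and minutes is None:
        # No unit at all: accept a bare number as minutes.
        bare = re.fullmatch(r"\s*(\d+)\s*", raw)
        return int(bare.group(1)) if bare else None
    total = 0
    if hours:
        total += 60 * int(hours.group(1))
    if minutes:
        total += int(minutes.group(1))
    return total


for sample in ("90 min", "1h 30min", "90", "n/a"):
    print(repr(sample), "->", runtime_minutes(sample))
```

With this, a value like "90 min" from the sample nfo would be returned as 90.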

Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #14
vdrfan Wrote:Thanks for the good information, Nicezia! Note that as of SVN revision r30825 the way we handle the runtime/duration has changed slightly.

The following VideoInfoTags (streamdetails) are extracted from the actual media file: codec, aspect, width, height and duration (new).
To make sure the runtime is shown properly, make sure the scraper returns the runtime in minutes only (numeric). This value is used if the metadata extraction is disabled and/or we somehow fail to extract it.

Noted, and I will adjust the ScraperXML code accordingly.

@vdrfan, I also noticed there are a few other tags not mentioned anywhere else (country, sorttitle, epbookmark, originaltitle), and that premiered, though taken from the nfo/scraper, doesn't seem to be stored in the database at all (at least in the last version I'm basing my work on, which is from before the add-on merge), so this info is lost when importing the file, if it's even provided.

Are these extra tags deprecated tags that haven't been removed from the code, or added tags? (I'm only just now getting to a point where I can read C++ code as well as C#.) And is premiered getting lost from the database an oversight?

(This post was last modified: 2010-06-06 12:24 by Nicezia.)
Eisbahn Offline
Junior Member
Posts: 43
Joined: Jun 2010
Reputation: 2
Post: #15
Hi,

Sorry for the late reply, but we had a wonderful day, relaxing with my wife and kids. I have done a little bit of work, though, and have some new questions:
What is the content of the <outline> tag? More info than the tagline, but not as much as in plot or plotsummary? I couldn't find it... At the moment I put a "short" plot into this tag (the one given on the main overview page of IMDb).
What about <certification>? Is it deprecated, with only MPAA used instead?
Because of different DVD releases, I get more than one mpaa value (it's not a single instance), e.g. for "The Rock" (IMDb ID = tt0117500): 12 years heavily cut, 16 years cut, 18 years uncut.
Is <originaltitle> a subset of <sorttitle>, e.g.
Code:
<originaltitle>
    movie A
</originaltitle>
<sorttitle>
    movie A has a nice long name
</sorttitle>
<sorttitle>
    movie A  /* is this included or not? */
</sorttitle>
<sorttitle>
    movie A has a short name as well
</sorttitle>
<sorttitle>
    movie A has a french name
</sorttitle>
I think this depends on the user's preference...

What about the function GetIMDBThumbs? Does it fetch all pictures from IMDb, or only the posters (and maybe product)? What are the constants SX, SY, SX$INFO and SY$INFO (or what are they)? Why is the function not repeated (I think users want more than one thumbnail)? I don't know exactly what this function is supposed to do. Does it point to <http://www.imdb.de/title/tt0499549/mediaindex?refine=poster>? Any help?

How can I request a URL without "&" being converted to "&amp;"? Currently I use a function that replaces &amp; with & in the links :=( The "no HTML clean" tag does not work at all...


ok = my scraper gathers the corresponding info
n/a = not for import use, or no info given on the German IMDb site
stc = still to come => may be implemented in a future release (meaning: I think it's a useless feature...)
Code:
<movie>
ok         <id>tt0432337</id>
ok         <title>Who knows</title>
ok         <originaltitle>Who knows for real</originaltitle>
ok         <sorttitle>Who knows 1</sorttitle>
n/a        <set>Who knows trilogy</set>
ok         <rating>6.100000</rating>
ok         <votes>50</votes>
ok         <year>2008</year>
ok         <top250>0</top250>
ok         <certification>MPAA for different countries</certification>
ok         <mpaa>Not available</mpaa>
ok         <studio>my camera</studio>
stc        <outline>A look at the role of the Buckeye State in the 2004 Presidential Election.</outline>
ok         <plot>A look at the role of the Buckeye State in the 2004 Presidential Election.</plot>
n/a        <tagline></tagline>
ok         <runtime>90 min</runtime>
ok         <thumb>http://ia.ec.imdb.com/media/imdb/01/I/25/65/31/10f.jpg</thumb>   /* Link broken/not working. Status 500 from server, using thumbs from MovieposterDB */
n/a        <playcount>0</playcount>
n/a        <watched>false</watched>
n/a        <filenameandpath>c:\Dummy_Movie_Files\Movies\...So Goes The Nation.avi</filenameandpath>
stc        <trailer></trailer>
ok         <genre></genre>
ok         <credits></credits>
stc        <premiered>single instance/optional</premiered>
n/a        <fileinfo>
n/a           <streamdetails>
n/a              <video>
n/a                 <codec>h264</codec>
n/a                 <aspect>2.35</aspect>
n/a                 <width>1920</width>
n/a                 <height>816</height>
n/a              </video>
n/a              <audio>
n/a                 <codec>ac3</codec>
n/a                 <language>eng</language>
n/a                 <channels>6</channels>
n/a              </audio>
n/a              <subtitle>
n/a                 <language>spa</language>
n/a              </subtitle>
n/a           </streamdetails>
n/a        </fileinfo>
ok         <director>Adam Del Deo</director>
ok         <actor>
ok            <thumb></thumb>
ok            <name></name>
ok            <role></role>
ok         </actor>
       </movie>
Currently most things work pretty well; only thumbs and pictures are a bit unclear to me.

What format should <premiered> have? String with month written out, or date?

Eisbahn