Quick Scraper Question (Hope so:))

spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #31
no. i have already explained it and if there is something i absolutely won't do, it is repeat myself
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #32
spiff Wrote:no. i have already explained it and if there is something i absolutely won't do, it is repeat myself


I totally understand your point, but I can't make this work.

Code:
            <!--Poster URL-->
            <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot; cache=&quot;\1.xml&quot; &gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                <expression repeat="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>

    <!--Thumbnail-->
    <GetThumbnailLink clearbuffers="no" dest="6">
        <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
            <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7">
                <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
            </RegExp>
            <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&quot; cache=&quot;\1.xml&quot;&gt;http://www.cinefacts.de/kino/datenbank.html&lt;/url&gt;" dest="7+">
                <expression/>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetThumbnailLink>

    <GetThumbnail clearbuffers="no" dest="5">
        <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8+">
            <expression>href=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
        </RegExp>
        <RegExp input="" output="&lt;details&gt;&lt;/details&gt;" dest="5">
            <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>

    <CollectThumbnails dest="2">
        <RegExp input="$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="2">
            <expression noclean="1"/>
        </RegExp>
    </CollectThumbnails>
</scraper>

GetThumbnailLink gives me valid URLs, so I think something must be wrong after that step.

If I change the dest in GetThumbnail from 8+ to 5+, it outputs two separate posters. I have no clue what else to do. I'd love to make this work too, because I'm nearly finished with the scraper; I even got fanart to work.

Anyway, thanks again for all your help.

Schenk

spiff
Post: #33
hi,

one problem is that you set the cache to .xml in the CollectThumbnails call (\1 is empty there, since that RegExp has no input and no capture). use a literal filename.
other than that i do not see what could be wrong. are you sure the expressions are correct?
perhaps i could have the whole file so i could see what goes wrong?
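
A minimal sketch of what a literal cache name could look like on the CollectThumbnails URL from the code above. The filename cinefacts-thumbs.xml is made up for illustration; any fixed name should serve, since this RegExp has no capture group for \1 to refer to.

```xml
<!-- Sketch only: same chained call as in the posted code, but with a fixed cache name.
     "cinefacts-thumbs.xml" is a hypothetical filename, not one taken from the thread. -->
<RegExp input="" output="&lt;url function=&quot;CollectThumbnails&quot; cache=&quot;cinefacts-thumbs.xml&quot;&gt;http://www.cinefacts.de/kino/datenbank.html&lt;/url&gt;" dest="7+">
    <expression/>
</RegExp>
```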

Schenk2302
Post: #34
Hi spiff,

Thanks, I'll try to play with it a little more.

Yes, I think the regexes are okay.

Here's my file, but with the old poster part.

Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<scraper name="Cinefacts" content="movies" thumb="cinefacts.jpg" language="de">
    <GetSettings dest="3">
        <RegExp input="$$5" output="&lt;settings&gt;\1&lt;/settings&gt;" dest="3">
            <RegExp input="$$1" output="&lt;setting label=&quot;Fanart&quot; type=&quot;bool&quot; id=&quot;fanart&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5+">
                <expression></expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSettings>
    
    <CreateSearchUrl dest="3" SearchStringEncoding="iso-8859-1">
        <RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="8">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\3 (\4)&lt;/title&gt;&lt;url&gt;http://www.cinefacts.de/kino/\1/\2/filmdetails.html&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&gt;&lt;a href=&quot;/kino/([0-9]*)/(.[^\/]*)/filmdetails.html&quot;&gt;[^&lt;]*&lt;b title=&quot;([^&quot;]*)&quot; class=&quot;headline&quot;&gt;[^&lt;]+&lt;/b&gt;&lt;/a&gt;&lt;br&gt;[^&lt;]+&lt;br&gt;+[^0-9]+([^&lt;]*)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>

    <GetDetails dest="3">
        <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;details&gt;\1&lt;/details&gt;" dest="3">
            <!--Title-->
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="5+">
                <expression trim="1" noclean="1">&lt;h1&gt;([^&lt;]*)</expression>
            </RegExp>

            <!--Original Title-->
            <RegExp input="$$1" output="&lt;originaltitle&gt;\1&lt;/originaltitle&gt;" dest="5+">
                <expression>&lt;dt class=&quot;c1&quot;&gt;Originaltitel:&lt;/dt&gt;[^&lt;]*&lt;dd class=&quot;first&quot;&gt;(.[^&lt;]*)&lt;/dd&gt;</expression>
            </RegExp>

            <!--Genre-->
            <RegExp input="$$1" output="\1" dest="4+">
                <expression noclean="1">Genre:([^:]*)Deutschlandstart:</expression>
            </RegExp>
            <RegExp input="$$4" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="5+">
                <expression repeat="yes" noclean="1" trim="1">&gt;*[ A-Za-z]([^&lt;&gt;]*)&lt;/a&gt;</expression>
            </RegExp>
                              
            <!--Director Film-->
            <RegExp input="$$1" output="\1" dest="7+">
                <expression noclean="1">Regie:([^:]*)Buch:</expression>
            </RegExp>
            <RegExp input="$$7" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
                <expression repeat="yes" >&lt;a href=&quot;[^&quot;]*&quot;&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>

            <!--Actors-->
            <RegExp input="$$1" output="\1" dest="7+">
                <expression noclean="1">Darsteller:([^|]*)</expression>
            </RegExp>
            <RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;\2&lt;/role&gt;&lt;/actor&gt;" dest="5+">
                <expression repeat="yes">&gt;([^&lt;&gt;]*)&lt;/a&gt;&lt;/td&gt;+[^&lt;]+&lt;[^&gt;]+&gt; als([ A-Za-z]*)</expression>
            </RegExp>

            <!--Studio-->
            <RegExp input="$$1" output="&lt;studio&gt;\1&lt;/studio&gt;" dest="5+">
                <expression>Studio:([^\.]*)\.</expression>
            </RegExp>

            <!--Year-->
            <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="5+">
                <expression>&lt;/a&gt; ([0-9]*) &lt;/dd&gt;</expression>
            </RegExp>

            <!--MPAA-->
            <RegExp input="$$1" output="&lt;mpaa&gt;\1&lt;/mpaa&gt;" dest="5+">
                <expression>FSK:&lt;/dt&gt;[^&gt;]*&gt;([^&lt;]*)&lt;</expression>
            </RegExp>

            <!--Runtime-->
            <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="5+">
                <expression>L.nge:&lt;/dt&gt;[^&gt;]*&gt;([^&lt;]*)&lt;</expression>
            </RegExp>

            <!--Plot-->
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="5+">
                <expression>KURZINHALT&lt;/h2&gt;&lt;/li&gt;[^&gt;]*&gt;*([^&lt;]*)[&lt;/li&gt;]</expression>
            </RegExp>

            <!--Writers-->
            <RegExp input="$$1" output="\1" dest="6+">
                <expression noclean="1">Buch:([^:]*)Musik:</expression>
            </RegExp>
            <RegExp input="$$6" output="&lt;credits&gt;\1&lt;/credits&gt;" dest="5+">
                <expression repeat="yes" >&lt;a href=&quot;[^&quot;]*&quot;&gt;([^&lt;]*)&lt;/a&gt;</expression>
            </RegExp>

            <!--Poster URL-->
            <RegExp input="$$1" output="&lt;url function=&quot;GetPosters&quot;&gt;http://www.cinefacts.de/kino/film/\1/\2/\3/\4/plakat.html&lt;/url&gt;" dest="5+">
                <expression repeat="yes">&quot;/kino/film/([0-9]*)/([^\/]*)/([^\/]*)/([^\/]*)/plakat.html&quot;\)</expression>
            </RegExp>

            <!--IMDB URL-->
            <RegExp input="$$1" output="&lt;url function=&quot;GetIMDBid&quot;&gt;http://akas.imdb.com/find?s=tt;q=\2 (\1)&lt;/url&gt;" dest="5+">
                <expression>&lt;h1&gt;[^&lt;]*&lt;/h1&gt;[^0-9]*([0-9]*) &lt;/li&gt;[^:]*:&lt;/dt&gt;[^&lt;]*&lt;dd class=&quot;first&quot;&gt;(.[^&lt;]*)&lt;/dd&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>

    <!--Poster-->
    <GetPosters clearbuffers="no" dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="2">
                <expression repeat="yes">href=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetPosters>

    <!--Get IMDB ID-->
    <GetIMDBid dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;&gt;&lt;details&gt;\1&lt;/details&gt;" dest="5">
            <RegExp input="$$1" output="&lt;url function=&quot;GetTMDBId&quot;&gt;http://api.themoviedb.org/2.0/Movie.imdbLookup?imdb_id=\1&amp;amp;api_key=57983e31fb435df4df77afb854740ea9&lt;/url&gt;" dest="2+">
                <expression>/title/([t0-9]*)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetIMDBid>
    <!-- Fanart -->
    <GetTMDBId dest="5">
        <RegExp conditional="fanart" input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetTMDBFanart&quot;&gt;http://api.themoviedb.org/2.0/Movie.getInfo?id=\1&amp;amp;api_key=57983e31fb435df4df77afb854740ea9&lt;/url&gt;&lt;/details&gt;" dest="5">
            <expression>&lt;id&gt;([0-9]*)&lt;/id&gt;</expression>
        </RegExp>
    </GetTMDBId>
    <GetTMDBFanart dest="5">
        <RegExp input="$$2" output="&lt;details&gt;&lt;fanart url=&quot;http://themoviedb.org/image/backdrops&quot;&gt;\1&lt;/fanart&gt;&lt;/details&gt;" dest="5">
            <RegExp input="$$1" output="&lt;thumb preview=&quot;/\1/\2_poster.jpg&quot;&gt;/\1/\2.jpg&lt;/thumb&gt;" dest="2">
                <expression repeat="yes">&lt;backdrop size=&quot;original&quot;&gt;http://www.themoviedb.org/image/backdrops/([0-9]*)/([^\.]*).jpg&lt;/backdrop&gt;</expression>
            </RegExp>
            <expression noclean="1">(.+)</expression>
        </RegExp>
    </GetTMDBFanart>
</scraper>

Thanks

Schenk

spiff
Post: #35
http://trac.xbmc.org/ticket/6485

works for me

Schenk2302
Post: #36
spiff Wrote:http://trac.xbmc.org/ticket/6485

works for me


You're my hero. Thanks so much for all your help!!!

I will now take a close look at it, maybe change some details to improve it, and try to understand everything :)

The reason for this scraper is that the Moviemaze scraper is very good but only lists cinema films, not DVD releases and so on. Okay, that's it for now, and again, thank you very, very much.

Good Night

Schenk

Schenk2302
Post: #37
Hi spiff,

I want to improve the parsing for the IMDb ID. In most cases it works well and gets the right ID, but I'm stuck on one movie that gets the wrong ID.

The movie title is Beverly Hills Chihuahua. This code searches with the parsed original title and year, in this example for
Beverly Hills Chihuahua : South of the Border (AT) (2008)
Code:
            <!--IMDB URL-->
            <RegExp input="$$1" output="&lt;url function=&quot;GetIMDBid&quot;&gt;http://akas.imdb.com/find?s=tt;q=\2 (\1)&lt;/url&gt;" dest="5+">
                <expression>&lt;h1&gt;[^&lt;]*&lt;/h1&gt;[^0-9]*([0-9]*) &lt;/li&gt;[^:]*:&lt;/dt&gt;[^&lt;]*&lt;dd class=&quot;first&quot;&gt;(.[^&lt;]*)&lt;/dd&gt;</expression>
            </RegExp>
            <expression noclean="1"/>

If I type this URL directly into the browser, it finds the IMDb results page and the regex matches.

The problem I have now is that here
Code:
    <!--Get IMDB ID-->
    <GetIMDBid dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;&gt;&lt;details&gt;\1&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;url function=&quot;GetTMDBId&quot;&gt;http://api.themoviedb.org/2.0/Movie.imdbLookup?imdb_id=\1&amp;amp;api_key=57983e31fb435df4df77afb854740ea9&lt;/url&gt;" dest="2+">
                <expression>/title/([t0-9]*)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetIMDBid>

it passes in IMDb ID 0086960, which belongs to Beverly Hills Cop, so the fanart is wrong. The question now is: where does it get this wrong ID, and why? The akas search results only show the right movie.

Thanks for any help

Schenk

Schenk2302
Post: #38
And again another question:

Code:
            <!--IMDB Rating URL -->
            <RegExp input="$$1" output="&lt;url function=&quot;GetIMDBRating&quot;&gt;http://akas.imdb.com/find?s=tt;q=\2 (\1)&lt;/url&gt;" dest="5+">
                <expression>&lt;h1&gt;[^&lt;]*&lt;/h1&gt;[^0-9]*([0-9]*) &lt;/li&gt;[^:]*:&lt;/dt&gt;[^&lt;]*&lt;dd class=&quot;first&quot;&gt;(.[^&lt;]*)&lt;/dd&gt;</expression>
            </RegExp>

            <expression noclean="1"/>
        </RegExp>
    </GetDetails>

    <GetIMDBRating dest="5">
        <RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;url function=&quot;GetRating&quot;&gt;http://www.imdb.com/title/\1&lt;/url&gt;" dest="2+">
                <expression>/title/([t0-9]*)/</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetIMDBRating>

    <GetRating dest="5">
        <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="5+">
            <expression>&lt;b&gt;([0-9.]+)/10&lt;/b&gt;[^&lt;]*&lt;a href=&quot;ratings&quot; class=&quot;tn15more&quot;&gt;([0-9,]+) votes&lt;/a&gt;</expression>
        </RegExp>
                <expression noclean="1"/>
    </GetRating>
</scraper>

Everything works in the scrap test, but rating and votes are not shown in XBMC. Any idea why?

spiff
Post: #39
yes, ALL output from a scraper you want parsed needs the surrounding <details> tags.

and the number of expressions and RegExps doesn't add up.....

Schenk2302
Post: #40
Things could be so easy. Thanks again, it works now!

spiff Wrote:yes, ALL output from a scraper you want parsed needs the surrounding <details> tags.

Maybe my English isn't good enough, but I don't understand what this means:

spiff Wrote:and the number of expressions and RegExps doesnt add up....

Thanks so much for helping out!

spiff
Post: #41
Code:
    <GetRating dest="5">
        <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="5+">
            <expression>&lt;b&gt;([0-9.]+)/10&lt;/b&gt;[^&lt;]*&lt;a href=&quot;ratings&quot; class=&quot;tn15more&quot;&gt;([0-9,]+) votes&lt;/a&gt;</expression>
        </RegExp>
                <expression noclean="1"/>
    </GetRating>

see; one <RegExp>, two <expression>

Schenk2302
Post: #42
spiff Wrote:
Code:
    <GetRating dest="5">
        <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="5+">
            <expression>&lt;b&gt;([0-9.]+)/10&lt;/b&gt;[^&lt;]*&lt;a href=&quot;ratings&quot; class=&quot;tn15more&quot;&gt;([0-9,]+) votes&lt;/a&gt;</expression>
        </RegExp>
                <expression noclean="1"/>
    </GetRating>

see; one <RegExp>, two <expression>

Sorry, I'm too stupid; I can't see where you're trying to lead me :(

spiff
Post: #43
<expression> tags only make sense wrapped in a <RegExp>. the latter one is surely NOT wrapped in a <RegExp> -> it's useless
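
Putting the two points from this thread together (wrap the output in <details>, and give every <expression> a <RegExp> parent), GetRating might look like the sketch below. It assumes the same nesting used in GetTMDBFanart from the full file, with buffer 2 as scratch space; the buffer numbers are an assumption and not something confirmed in the thread.

```xml
<GetRating dest="5">
    <!-- Outer RegExp: wraps whatever the inner RegExp stored in buffer 2 in details tags -->
    <RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5">
        <!-- Inner RegExp: pulls rating and votes out of the fetched page ($$1) into buffer 2 -->
        <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="2">
            <expression>&lt;b&gt;([0-9.]+)/10&lt;/b&gt;[^&lt;]*&lt;a href=&quot;ratings&quot; class=&quot;tn15more&quot;&gt;([0-9,]+) votes&lt;/a&gt;</expression>
        </RegExp>
        <!-- The outer RegExp's single expression; left empty as elsewhere in this thread
             so that the whole input is passed through as \1 -->
        <expression noclean="1"/>
    </RegExp>
</GetRating>
```

Here each <RegExp> has exactly one <expression> child, and nothing sits directly under <GetRating> except the outer <RegExp>.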

Schenk2302
Post: #44
spiff Wrote:<expression> tags only make sense wrapped in a <RegExp>. the latter one is surely NOT wrapped in a <RegExp> -> it's useless

That's what I thought you were trying to tell me, but when I deleted it, parsing didn't work (or maybe it only failed with scrap.exe).

Thanks

Schenk

spiff
Post: #45
any result from scrap.exe is irrelevant