Quick Scraper Question (Hope so:))

Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #21
Thanks spiff, but i can't follow you; maybe i've got a mental block right now :)

This is how it looks right now:

Code:
            <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot;&gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                <expression repeat="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>

    <!--Thumbnail-->
    <GetThumbnailLink clearbuffers="no" dest="6">
        <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;&lt;/details&gt;" dest="6">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
        </RegExp>
    </GetThumbnailLink>



    <GetThumbnail dest="5">
        <RegExp input="$$2" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="2">
                <expression>=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>

I can't find out where to change this.

Cheers

Schenk
spiff Offline
Grumpy Bastard Developer
Posts: 12,176
Joined: Nov 2003
Reputation: 82
Post: #22
why are you repeating in getthumbnaillink? i don't get what you are trying to achieve and hence it is impossible to help you

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #23
spiff Wrote: why are you repeating in getthumbnaillink? i don't get what you are trying to achieve and hence it is impossible to help you

Thanks for the answer, spiff. Let me try to explain.

In GetDetails i output a single url, like ...posters.html

In GetThumbnailLink there are as many output urls as there are cover pages, like ...poster_1.html, ...poster_2.html etc.

i want every one of those output urls to be parsed for its cover, so that i get all the covers.
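
To make the goal concrete outside the scraper format, here is a minimal Python sketch of the two stages. It is only an illustration, not how XBMC runs scrapers: the two patterns are copied from the GetThumbnailLink and GetThumbnail expressions above, while the function names, the `fetch` callback and the sample markup are invented for this example.

```python
import re

# Pattern copied from GetThumbnailLink: links to the individual poster pages.
PAGE_LINK = re.compile(r'<a href="/kino/film/([^"]+)">[^<]*<img')
# Pattern copied from GetThumbnail: the poster image on one such page.
POSTER = re.compile(r'="/kino/plakat/([^"]*)"')

BASE = "http://www.cinefacts.de"


def poster_page_urls(overview_html):
    """Stage 1: one URL per poster page linked from the overview page."""
    return [BASE + "/kino/film/" + path for path in PAGE_LINK.findall(overview_html)]


def poster_image_urls(page_html):
    """Stage 2: poster image URLs found on a single poster page."""
    return [BASE + "/kino/plakat/" + name for name in POSTER.findall(page_html)]


def all_posters(overview_html, fetch):
    """Run stage 2 on every page from stage 1 and merge the results.

    `fetch` is a hypothetical callback that returns the HTML for a URL.
    """
    posters = []
    for url in poster_page_urls(overview_html):
        posters.extend(poster_image_urls(fetch(url)))
    return posters


if __name__ == "__main__":
    # Invented sample markup standing in for the real pages.
    overview = (
        '<a href="/kino/film/123/titel/poster_1.html"><img src="t1.jpg"></a>'
        '<a href="/kino/film/123/titel/poster_2.html"><img src="t2.jpg"></a>'
    )
    pages = {
        BASE + "/kino/film/123/titel/poster_1.html": '<img src="/kino/plakat/p1.jpg">',
        BASE + "/kino/film/123/titel/poster_2.html": '<img src="/kino/plakat/p2.jpg">',
    }
    for poster in all_posters(overview, pages.__getitem__):
        print(poster)
```

The point is the last function: the covers from all poster pages should end up in one merged list, not one separate result per page.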

hope that will make things clearer for you and i really appreciate your help !!!

Thanx

Schenk
spiff Offline
Grumpy Bastard Developer
Posts: 12,176
Joined: Nov 2003
Reputation: 82
Post: #24
aha.

then you want
Code:
<RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot; cache=&quot;some.xml&quot; &gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                       <expression repeat ="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
                </RegExp>
                        <expression noclean="1"/>
                </RegExp>
        </GetDetails>

    <!--Thumbnail-->
        <GetThumbnailLink clearbuffers="no" dest="6">
             <RegExp input="$$7" output="&lt;details&gt\1&gt;&lt;/details&gt;" dest="6">
                    <RegExp input="$$1" output=";&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7+">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
                    <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&gt;&quot></url>
                       <expression/>
                    </RegExp>
        </RegExp>
        </GetThumbnailLink>

    <GetThumbnail  clearbuffers="no" dest="5">
        <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8">
                <expression>=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
        </RegExp>
                <RegExp input="" output="&lt;details&gt;&lt;/details&gt; dest="5">
            <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>
        
        <CollectThumbnails dest="2">
           <RegExp input="$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="">
             <expression noclean="1"/>
           </RegExp>
        </CollectThumbnails>

i'm sure it's full of typos but i'm at work. shows the idea anyways

Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #25
Thanks spiff,

a few questions from me, the stupid boy:

cache="some.xml": what's that?

a few typos are okay, i think i found them. but the empty buffers and the empty dest: are those for real, or are they typos too?

Thanks again, i hope i'll soon be done questioning and bothering you :)

Schenk
spiff Offline
Grumpy Bastard Developer
Posts: 12,176
Joined: Nov 2003
Reputation: 82
Post: #26
okay, the cache thing is actually to hack around a limitation i will lift soonish. you can only run a scraper function on a valid url.

if you set the cache property on a url, we cache to a local file with that name. usually it is used to run several functions on the same page. in this case we just need *some* valid url to run the last function on, and to avoid fetching anything we use the same url as the one we set the cache property on.

the expressions with the empty inputs are there on purpose. we need to return a valid xml from each function call, or the process stops. empty dest is a typo, it should be 2
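
As a rough picture of that flow, here is a toy Python model. It is a sketch under assumptions, not XBMC's implementation: it assumes that dest="8" overwrites buffer 8 while dest="8+" appends to it, and that buffers survive between function calls when clearbuffers="no". The `Buffers` class, the helper names and the sample markup are made up for illustration.

```python
import re


class Buffers:
    """Toy stand-in for the scraper's numbered string buffers (hypothetical)."""

    def __init__(self):
        self.data = {}

    def write(self, dest, text):
        # Assumed semantics: "8+" appends to buffer 8, plain "8" overwrites it.
        if dest.endswith("+"):
            key = int(dest[:-1])
            self.data[key] = self.data.get(key, "") + text
        else:
            self.data[int(dest)] = text

    def read(self, index):
        return self.data.get(index, "")


# Same pattern as the GetThumbnail expression in the thread.
POSTER = re.compile(r'="/kino/plakat/([^"]*)"')


def get_thumbnail(buffers, page_html):
    """Mimics GetThumbnail: stash one <thumb> in buffer 8, return empty details."""
    match = POSTER.search(page_html)
    if match:
        thumb = "<thumb>http://www.cinefacts.de/kino/plakat/%s</thumb>" % match.group(1)
        buffers.write("8+", thumb)
    # Each call still returns valid XML, even though it carries no data.
    return "<details></details>"


def collect_thumbnails(buffers):
    """Mimics CollectThumbnails: wrap everything gathered in buffer 8."""
    return "<details><thumbs>%s</thumbs></details>" % buffers.read(8)


if __name__ == "__main__":
    buffers = Buffers()
    # Invented sample markup, one string per poster page.
    pages = ['<a href="/kino/plakat/a.jpg">', '<a href="/kino/plakat/b.jpg">']
    for page in pages:
        get_thumbnail(buffers, page)
    print(collect_thumbnails(buffers))
```

In this model the per-page calls only collect, and the single final call is the one that emits the thumbs, which is why it needs to be the last function in the chain.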

Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #27
Hey spiff,

here's what i've got now. it didn't work, and i'm not sure i understood the cache thing correctly, so there are probably many mistakes in it. maybe you can see what's wrong:

Code:
            <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot; cache=&quot;http://www.google.de&quot; &gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                       <expression repeat ="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
                </RegExp>
                        <expression noclean="1"/>
                </RegExp>
        </GetDetails>

    <!--Thumbnail-->
        <GetThumbnailLink clearbuffers="no" dest="6">
               <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
                 <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7+">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
                    <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&quot;&gt;&lt;/url&gt;"
                        <expression/>
            </RegExp>
               </RegExp>
        </GetThumbnailLink>

    <GetThumbnail clearbuffers="no" dest="5">
        <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8">
                <expression>=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
                </RegExp>
                 <RegExp input="" output="&lt;details&gt;&lt;/details&gt;" dest="5">
          <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>

        <CollectThumbnails dest="2">
           <RegExp input="$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="2">
             <expression noclean="1"/>
           </RegExp>
        </CollectThumbnails>
</scraper>

Thanx Schenk
spiff Offline
Grumpy Bastard Developer
Posts: 12,176
Joined: Nov 2003
Reputation: 82
Post: #28
Code:
            <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot; cache=&quot;some.xml&quot;&gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                <expression repeat="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>

    <!--Thumbnail-->
    <GetThumbnailLink clearbuffers="no" dest="6">
        <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
            <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7">
                <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
            </RegExp>
            <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&quot; cache=&quot;some.xml&quot; &gt;http://doesnt.matter&lt;/url&gt;" dest="7+">
                <expression/>
            </RegExp>
        </RegExp>
    </GetThumbnailLink>

    <GetThumbnail clearbuffers="no" dest="5">
        <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8+">
            <expression>=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
        </RegExp>
        <RegExp input="" output="&lt;details&gt;&lt;/details&gt;" dest="5">
            <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>

    <CollectThumbnails dest="2">
        <RegExp input="$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="2">
            <expression noclean="1"/>
        </RegExp>
    </CollectThumbnails>
</scraper>

Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #29
Error: Unable to parse GetThumbnailLink.xml
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #30
Hey spiff, me again :)

I couldn't get this working and my head's gonna explode.

I found a new way to parse, so i changed my code. I think it does the same thing but is a lot simpler.

Code:
            <!--Poster URL-->
            <RegExp input="$$1" output="&lt;url function=&quot;GetPosters&quot;&gt;http://www.cinefacts.de/kino/film/\1/\2/\3/\4/plakat.html&lt;/url&gt;" dest="5+">
                <expression repeat="yes">&quot;/kino/film/([0-9]*)/([^\/]*)/([^\/]*)/([^\/]*)/plakat.html&quot;\)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetDetails>

    <!--Poster-->
    <GetPosters clearbuffers="no" dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="2">
                <expression repeat="yes">href=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetPosters>
</scraper>

Poster URL gives me (for example) two valid html pages.

GetPosters gives me two urls but outputs them separately, so only one is available in XBMC. Could you describe once more, for a dumbass like me, what to do to get all covers downloaded?