Quick Scraper Question (Hope so:))

spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #16
you need to set the SearchStringEncoding on the CreateSearchUrl function
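
For illustration, a minimal sketch of where that attribute would go. The encoding value `iso-8859-1`, the `dest` buffer number, and the example.com search URL are assumptions and do not come from this thread; use whatever encoding the target site expects.

```xml
<!-- Sketch only: encoding value, buffer number and URL are placeholders -->
<CreateSearchUrl dest="3" SearchStringEncoding="iso-8859-1">
    <!-- $$1 holds the search string that was typed in -->
    <RegExp input="$$1" output="&lt;url&gt;http://www.example.com/search?q=\1&lt;/url&gt;" dest="3">
        <expression noclean="1"/>
    </RegExp>
</CreateSearchUrl>
```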
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #17
spiff Wrote:you need to set the SearchStringEncoding on the CreateSearchUrl function

As always thanks for this one, Spiff !!!
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #18
Is there an equivalent for the GetDetails section? The plot is displayed with the HTML tags for the umlauts.
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #19
My <GetThumbnailLink dest="5"> outputs as many URLs as there are covers, but my <GetThumbnail dest="5"> only outputs the thumb from the first URL. How do I get all of the URLs processed, so that I get all of the thumbs?

Thanks in advance, and sorry for my poor English :)

Schenk
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #20
<details><thumbs><thumb>..</thumb><thumb>..</thumb></thumbs></details>

also see http://forum.xbmc.org/showthread.php?tid=48643
(This post was last modified: 2009-04-28 22:07 by spiff.)
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #21
Thanks spiff, but I can't follow you. Maybe I have a mental block right now :)

This is how it looks right now:

Code:
<RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot;&gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                       <expression repeat ="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
                </RegExp>

                        <expression noclean="1"/>
                </RegExp>
        </GetDetails>

    <!--Thumbnail-->
        <GetThumbnailLink clearbuffers="no" dest="6">
             <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;&lt;/details&gt;" dest="6">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
        </RegExp>
        </GetThumbnailLink>



    <GetThumbnail dest="5">
        <RegExp input="$$2" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="2">
                <expression>=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>

I can't find out where to change this.

Cheers

Schenk
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #22
why are you repeating in getthumbnaillink? i don't get what you are trying to achieve and hence it is impossible to help you
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #23
spiff Wrote:why are you repeating in getthumbnaillink? i don't get what you are trying to achieve and hence it is impossible to help you

Thanks for the answer, spiff. Let me try to explain.

In GetDetails I output one URL, like ...posters.html

In GetThumbnailLink there are as many output URLs as there are cover pages, like ...poster_1.html, poster_2.html, etc.

I want every one of those output URLs to be parsed for its cover, so that all of the covers are fetched.

I hope that makes things clearer, and I really appreciate your help!

Thanks

Schenk
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #24
aha.

then you want
Code:
<RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot; cache=&quot;some.xml&quot; &gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                       <expression repeat ="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
                </RegExp>
                        <expression noclean="1"/>
                </RegExp>
        </GetDetails>

    <!--Thumbnail-->
        <GetThumbnailLink clearbuffers="no" dest="6">
             <RegExp input="$$7" output="&lt;details&gt\1&gt;&lt;/details&gt;" dest="6">
                    <RegExp input="$$1" output=";&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7+">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
                    <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&gt;&quot></url>
                       <expression/>
                    </RegExp>
        </RegExp>
        </GetThumbnailLink>

    <GetThumbnail  clearbuffers="no" dest="5">
        <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8">
                <expression>=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
        </RegExp>
                <RegExp input="" output="&lt;details&gt;&lt;/details&gt; dest="5">
            <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>
        
        <CollectThumbnails dest="2">
           <RegExp input="$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="">
             <expression noclean="1"/>
           </RegExp>
        </CollectThumbnails>

i'm sure it's full of typos but i'm at work. shows the idea anyways
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #25
Thanks spiff,

A few questions from me:

What is cache="some.xml"?

A few typos are okay, and I think I found them. But the empty input buffers and the empty dest: are those intentional, or are they typos too?

Thanks again. I hope I'll soon be done asking questions and bothering you :)

Schenk
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #26
okay, the cache thing is actually to hack around a limitation i will lift soonish. you can only run a scraper function on a valid url.

if you set the cache property on a url, we cache the page to a local file with that name. usually it is used to run several functions on the same page. in this case we just need *some* valid url to run the last function on, and to avoid fetching anything we reuse the cache name we already set on an earlier url.

the expressions with the empty inputs are there on purpose. we need to return a valid xml from each function call, or the process stops. empty dest is a typo, it should be 2
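
A rough sketch of the idea, showing the markup each function would emit in unescaped form (inside the scraper file it has to be written with `&lt;`, `&gt;` and `&quot;`). The example.com URLs are placeholders, and the exact behaviour described in the comments is an assumption based on the explanation in this thread, not a tested scraper.

```xml
<!-- 1. GetDetails emits a url with a cache name, so the fetched page
        is stored locally under that name -->
<url function="GetThumbnailLink" cache="some.xml">http://www.example.com/posters.html</url>

<!-- 2. GetThumbnailLink emits one url per cover page, plus a final url
        that reuses the cache name so that nothing new has to be fetched -->
<details>
    <url function="GetThumbnail">http://www.example.com/poster_1.html</url>
    <url function="GetThumbnail">http://www.example.com/poster_2.html</url>
    <url function="CollectThumbnails" cache="some.xml">http://doesnt.matter</url>
</details>

<!-- 3. Each GetThumbnail call appends a <thumb> to a shared buffer and
        returns an empty but valid result, so the process keeps going -->
<details></details>

<!-- 4. CollectThumbnails wraps the collected buffer and returns it -->
<details><thumbs><thumb>..</thumb><thumb>..</thumb></thumbs></details>
```

The key design point is that only the last function returns the thumbs; the intermediate calls exist to fill the buffer, which is why they run with clearbuffers="no".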
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #27
Hey spiff,

Here's what I have now. It doesn't work, and I'm not sure I understood the cache property correctly, so there are probably many mistakes. Maybe you can see what's wrong:

Code:
            <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot; cache=&quot;http://www.google.de&quot; &gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                       <expression repeat ="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
                </RegExp>
                        <expression noclean="1"/>
                </RegExp>
        </GetDetails>

    <!--Thumbnail-->
        <GetThumbnailLink clearbuffers="no" dest="6">
               <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
                 <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7+">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
                    <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&quot;&gt;&lt;/url&gt;"
                        <expression/>
            </RegExp>
               </RegExp>
        </GetThumbnailLink>

    <GetThumbnail clearbuffers="no" dest="5">
        <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8">
                <expression>=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
                </RegExp>
                 <RegExp input="" output="&lt;details&gt;&lt;/details&gt;" dest="5">
          <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>

        <CollectThumbnails dest="2">
           <RegExp input="$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="2">
             <expression noclean="1"/>
           </RegExp>
        </CollectThumbnails>
</scraper>

Thanx Schenk
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #28
Code:
        <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnailLink&quot; cache=&quot;some.xml&quot;&gt;http://www.cinefacts.de/kino/film/\1/\2/plakate.html&lt;/url&gt;" dest="5+">
                       <expression repeat ="yes">&lt;a href=&quot;/kino/film/([0-9]*)/([^\/]*)/plakate.html&quot;&gt;</expression>
                </RegExp>
                        <expression noclean="1"/>
                </RegExp>
        </GetDetails>

    <!--Thumbnail-->
        <GetThumbnailLink clearbuffers="no" dest="6">
               <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
                 <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
                    </RegExp>
                    <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&quot; cache=&quot;some.xml&quot; &gt;http://doesnt.matter&lt;/url&gt;" dest="7+">
                        <expression/>
            </RegExp>
               </RegExp>
        </GetThumbnailLink>

    <GetThumbnail clearbuffers="no" dest="5">
        <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8+">
                <expression>=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
                </RegExp>
                 <RegExp input="" output="&lt;details&gt;&lt;/details&gt;" dest="5">
          <expression noclean="1"/>
        </RegExp>
    </GetThumbnail>

        <CollectThumbnails dest="2">
           <RegExp input="$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="2">
             <expression noclean="1"/>
           </RegExp>
        </CollectThumbnails>
</scraper>
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #29
spiff Wrote: (code from post #28, quoted in full)


Error: Unable to parse GetThumbnailLink.xml
Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #30
Hey spiff, me again :)

I couldn't get this working and my head is about to explode.

I found a different way to parse the pages, so I changed my code. I think it does the same thing but is a lot simpler.

Code:
            <!--Poster URL-->
                        <RegExp input="$$1" output="&lt;url function=&quot;GetPosters&quot;&gt;http://www.cinefacts.de/kino/film/\1/\2/\3/\4/plakat.html&lt;/url&gt;" dest="5+">
                <expression repeat="yes">&quot;/kino/film/([0-9]*)/([^\/]*)/([^\/]*)/([^\/]*)/plakat.html&quot;\)</expression>
            </RegExp>
                                <expression noclean="1"/>
        </RegExp>
    </GetDetails>

    <!--Poster-->
    <GetPosters clearbuffers="no" dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="2">
                <expression repeat="yes">href=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetPosters>
</scraper>

Poster URL gives me (for example) two valid HTML pages.

GetPosters gives me two URLs but outputs them separately, so only one is available in XBMC. Could you explain once more what I need to do to get all of the covers downloaded?