Quick Scraper Question (Hope so:))

Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #1
Hi everyone,

I'm trying to write a scraper but I'm stuck on one step.

I use scrap.exe to test my scraper:

The CreateSearchUrl result is okay.

The GetSearchResults result is okay.

The details URL is okay.

But then GetDetails returns nothing, with the error: Unable to parse details.xml

Here's my code:

PHP Code:
<scraper name="TEST" content="movies" thumb="cinefacts.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" language="de">
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="8">
        <RegExp input="$$5" output="<?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?><results>\1</results>" dest="8">
            <RegExp input="$$1" output="<entity><title>\3 \4</title><url>http://www.cinefacts.de/kino/\1/\2/filmdetails.html</url></entity>" dest="5">
                <expression repeat="yes">><a href=&quot;/kino/([0-9]*)/(.[^\/]*)/filmdetails.html&quot;>[^>]*(.[^<]*)</b></a><br>[^>]*[^\t]+\t+[^&nbsp;]+[^0-9]+([^<]+)</expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="3">
        <RegExp input="$$5" output="<details>\1</details>" dest="3">
            <!-- Title -->
            <RegExp input="$$1" output="<title>\1</title>" dest="5+">
                <expression trim="1" noclean="1"><h1>([^<]*)</expression>
            </RegExp>
        </RegExp>
    </GetDetails>
</scraper>

Maybe someone could have a quick look at this and point me in the right direction.

Thanks so much in advance

Schenk
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #2
unfortunately scrap.exe is outdated and we lost the source.

and the reason it does not work is that you are missing the expression for the outermost RegExp in GetDetails, i.e.
Code:
....
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
(This post was last modified: 2009-04-22 22:19 by spiff.)
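
To illustrate the point about the outermost RegExp, here is a minimal Python sketch (not part of XBMC or scrap.exe) that lists every RegExp element lacking a direct expression child. The helper name regexps_without_expression and the trimmed-down scraper string are assumptions made for this example, and the markup inside the output attributes is escaped so that a strict XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, well-formed stand-in for the GetDetails block from post #1.
# The markup inside output="..." is escaped here; this is an assumption made
# so that a strict XML parser can read the example.
SCRAPER = """
<scraper name="TEST" content="movies">
  <GetDetails dest="3">
    <RegExp input="$$5" output="&lt;details&gt;\\1&lt;/details&gt;" dest="3">
      <RegExp input="$$1" output="&lt;title&gt;\\1&lt;/title&gt;" dest="5+">
        <expression trim="1" noclean="1">&lt;h1&gt;([^&lt;]*)</expression>
      </RegExp>
    </RegExp>
  </GetDetails>
</scraper>
"""


def regexps_without_expression(xml_text):
    """Return the input attribute of every RegExp with no direct <expression> child."""
    root = ET.fromstring(xml_text)
    return [
        node.get("input")
        for node in root.iter("RegExp")
        if node.find("expression") is None  # find() only looks at direct children
    ]


missing = regexps_without_expression(SCRAPER)
print(missing)
```

For the block above, the only RegExp reported is the outer one with input="$$5", which is the element that needs the extra expression line.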
Schenk2302 Offline
Post: #3
Hi Spiff,

thanks for your answer, that solved the problem with scrap.exe :)

But now I tried it in XBMC and it doesn't work. I know scrap.exe is outdated, but is there any way to see at which point XBMC gets stuck with my scraper, or better, why it doesn't work? Are there any scraper logs? At this point I have absolutely no clue where to start looking for the error, because with scrap.exe it works fine. Thanks again for any hints or info.

Greetz

Schenk
spiff Offline
Post: #4
my answer depends on which of these applies;
1) you speak c++ and can compile
or
2) you can compile
or
3) neither
Schenk2302 Offline
Post: #5
:D

Maybe 2), more likely 3)

Could you explain why it matters?

Thanks

Schenk
spiff Offline
Post: #6
if 1, i could have gotten away with instructions
2 means i'll have to do a patch for you, which i will do shortly - here it is: http://dureks.dyndns.org:8080/scraperlog.diff
3 means i don't have to do anything

:)
(This post was last modified: 2009-04-22 22:41 by spiff.)
Schenk2302 Offline
Post: #7
spiff Wrote:if 1, i could have gotten away with instructions
2 means i'll have to do a patch for you, which i will do shortly
3 means i don't have to do anything

:)


2 sounds like something I could try
3 makes me cry, because I want that Cinefacts scraper working :)
Schenk2302 Offline
Post: #8
A little side note:

I made a cinefacts.de scraper for MediaPortal, but I have now switched to XBMC and would like to use it here. It was already hard for me to do this in MP; in XBMC I'm getting depressed because it's totally different :)
spiff Offline
Post: #9
heh, different does not mean bad. don't give up, you'll get the hang of it =P
Schenk2302 Offline
Post: #10
Spiff, I know I'm being kind of lazy, but is there a compiled version with your patch to download, or do I really have to compile it on my own? That really scares me :o
spiff Offline
Post: #11
that was the prerequisite for 2)
Schenk2302 Offline
Post: #12
Hey Spiff,

I don't want to waste your time, but I have one question left. I'm getting on with my scraper and the first results look very good. Parsing the genres now works, but the output looks like Action, Thriller, Horror. How do I get rid of the commas?

Thanks in advance

Schenk
spiff Offline
Post: #13
Code:
<RegExp input="$$2" output="\1\2" dest="3">
    <expression noclean="1,2" repeat="yes">(.*?),(.*)</expression>
</RegExp>

also you should use multiple <genre> tags so maybe something like this?
Code:
<RegExp input="$$2" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="3">
    <expression noclean="1" repeat="yes">(.*?),</expression>
</RegExp>
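
As a rough check of how the second pattern behaves, the sketch below uses Python's re module, which is only a stand-in for XBMC's own regular expression engine; the sample string is the one from post #12. It suggests that a pattern requiring a trailing comma never matches the last genre, and it tries one comma-free alternative, [^,]+, which is an assumption and not something confirmed in this thread.

```python
import re

genres = "Action, Thriller, Horror"  # sample output quoted in post #12

# Second pattern from the post above: every match has to end at a comma,
# so the final genre (which has no comma after it) is never captured.
with_comma = re.findall(r"(.*?),", genres)

# Hypothetical alternative: skip leading whitespace, then capture
# everything up to (but not including) the next comma.
without_comma = re.findall(r"\s*([^,]+)", genres)

# Build one <genre> tag per captured item, mirroring the output template.
tags = "".join(f"<genre>{g}</genre>" for g in without_comma)

print(with_comma)
print(without_comma)
print(tags)
```

If the real engine behaves the same way, either the input needs a trailing comma appended before this step, or the expression should not depend on the comma at all.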
Schenk2302 Offline
Post: #14
Another question, sorry in advance:

If my movie title has a German umlaut in it (ä), the scraper can't find the movie, but when I write ae instead of the umlaut it is found. I tried all the encoding settings but can't find the answer.

Here's my code:

PHP Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<scraper name="Cinefacts.de" content="movies" thumb="cinefacts.jpg" language="de">
    
    <CreateSearchUrl dest="3">
        <RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
            <expression noclean="1"/>
        </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="8">
        <RegExp input="$$5" output="<?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?><results>\1</results>" dest="8">
            <RegExp input="$$1" output="<entity><title>\3 (\4)</title><url>http://www.cinefacts.de/kino/\1/\2/filmdetails.html</url></entity>" dest="5">
                <expression repeat="yes">><a href=&quot;/kino/([0-9]*)/(.[^\/]*)/filmdetails.html&quot;>[^<]*<b title=&quot;([^&quot;]*)&quot; class=&quot;headline&quot;>[^<]+</b></a><br>[^<]+<br>+[^0-9]+([^<]*)</td></expression>
            </RegExp>
            <expression noclean="1"/>
        </RegExp>
    </GetSearchResults>

</scraper> 


thanks again for any hints!!!

Schenk
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #15
Schenk2302 Wrote:If my movie title has a German umlaut in it (ä), the scraper can't find the movie, but when I write ae instead of the umlaut it is found. I tried all the encoding settings but can't find the answer.

Try escaping the character with its code, '\xE4'.
I'm not sure whether the regular expression engine supports that, though.
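
For reference, E4 is the ISO-8859-1 byte for ä. Whether cinefacts.de expects that encoding in its search URL is not confirmed in this thread; the following Python sketch only shows how the same title would be percent-encoded under ISO-8859-1 versus UTF-8, which may help when comparing against the URL a browser sends. The title is a made-up example.

```python
from urllib.parse import quote

title = "Männer"  # hypothetical search term containing an umlaut

# Percent-encode the title as ISO-8859-1 (the encoding declared in the scraper).
latin1_query = quote(title, encoding="iso-8859-1")

# Percent-encode the same title as UTF-8 for comparison.
utf8_query = quote(title, encoding="utf-8")

print(latin1_query)  # M%E4nner
print(utf8_query)    # M%C3%A4nner
```

If the site only finds the movie with one of these two forms, that would point to an encoding mismatch between the search string and what the server expects.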