I've never written a scraper, any quick help?

small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #21
I'm currently trying to write the scraper.

Code:
<scraper name="CentralMediaServer" content="movies" thumb="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <CreateSearchUrl dest="3">
          <RegExp input="$$1" output="http://localhost:52026/CMMServer/Default.aspx?s=\1" dest="3">
             <expression noclean="1"/>
          </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="8">
        
    </GetSearchResults>

    <GetDetails dest="3">
           <RegExp input="$$1" output="\1" dest="3">
              <expression noclean="1"/>
           </RegExp>
    </GetDetails>
</scraper>

I have a small problem with GetSearchResults. My tool returns either one result or nothing, so I don't know how to configure GetSearchResults. (Should I make a result list page?)
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #22
I have created a page which returns results, and I am trying to create my first scraper:

Code:
<scraper name="CentralMediaServer" content="movies" thumb="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <CreateSearchUrl dest="3">
          <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
             <expression noclean="1"/>
          </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="8">
        <RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>" dest="8">
            <RegExp input="$$1" output="&lt;entity%gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&lt;td id="result"&gt;(.[^<]*)&lt;/td&gt;</expression>
            </RegExp>
    </RegExp>
    </GetSearchResults>

    <GetDetails dest="3">
           <RegExp input="$$1" output="\1" dest="3">
              <expression noclean="1"/>
           </RegExp>
    </GetDetails>
</scraper>

Aaand it doesn't work :) Any clue?
spiff Offline
Grumpy Bastard Developer
Posts: 12,181
Joined: Nov 2003
Reputation: 82
Post: #23
malformed xml - firefox is your friend.

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #24
Yes, my mistake (again :) )

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1.1" date="2009-08-11" name="CentralMediaServer" content="movies" thumb="" language="en" requiressettings="false">
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>

    <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity%gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;td id="result"&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
      </RegExp>
        </RegExp>
    </GetSearchResults>

  <GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>

This one seems to be well-formed, but it doesn't work :(

I have a problem with $$5 and $$8; I don't understand the buffer system very well...
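
For what it's worth, here is a rough annotated sketch of how the buffers appear to flow in a nested GetSearchResults, using the same layout as the scraper above. The comments reflect an assumed reading of the scraper engine (the inner RegExp runs before the outer one, and an empty expression captures the whole input as \1), so treat them as assumptions rather than documented fact. The entity output is shortened to a title only to keep the sketch readable.

```xml
<GetSearchResults dest="8">
  <!-- Outer RegExp (assumed to run last): reads buffer 5, wraps it in a results element, writes to buffer 8 -->
  <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
    <!-- Inner RegExp (assumed to run first): reads the downloaded page from buffer 1 and writes one entity per match into buffer 5 -->
    <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;/entity&gt;" dest="5">
      <expression repeat="yes">&lt;td id=&quot;result&quot;&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
    </RegExp>
    <!-- Empty expression for the outer RegExp: all of buffer 5 becomes \1 -->
    <expression noclean="1"/>
  </RegExp>
</GetSearchResults>
```

Read this way, `dest="8"` on GetSearchResults just names the buffer whose content is returned, and buffer 5 is only a scratch area between the two RegExp steps.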
spiff Offline
Grumpy Bastard Developer
Posts: 12,181
Joined: Nov 2003
Reputation: 82
Post: #25
while the scraper xml is wellformed, the outputted xml from GetSearchResults is malformed.

please don't wear me out on these sillies. i have enough to do as it is.

small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #26
Sorry, I don't understand "hand holding time is it?". My English is not very good, sorry.

I have changed it to this, and it's no better :) I'm so "ridicule", as we say in French :)

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1.1" date="2009-08-11" name="CentralMediaServer" content="movies" thumb="" language="en" requiressettings="false">
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>

    <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity%gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;td id=&quot;result&quot;&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
      </RegExp>
        </RegExp>
    </GetSearchResults>

  <GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>

Don't worry, I'll find it...
spiff Offline
Grumpy Bastard Developer
Posts: 12,181
Joined: Nov 2003
Reputation: 82
Post: #27
Code:
<RegExp input="$$1" output="&lt;entity%gt;&lt ....
from the very function i mentioned. that % is NOT what you want -> malformed xml outputted.

small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #28
I have tried it with ScraperEditor. I think the results are better, but now I think it's a logic problem, and I can't find what it is...
Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1.1" date="2009-08-11" name="CentralMediaServer" content="movies" thumb="" language="en" requiressettings="false">
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>

    <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;td id=&quot;result&quot;&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
      </RegExp>
        </RegExp>
    </GetSearchResults>

  <GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>
spiff Offline
Grumpy Bastard Developer
Posts: 12,181
Joined: Nov 2003
Reputation: 82
Post: #29
second <RegExp> in GetSearchResults has no matching <expression>

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1.1" date="2009-08-11" name="CentralMediaServer" content="movies" thumb="" language="en" requiressettings="false">
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>

  <GetSearchResults dest="8">
      <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
          <expression repeat="yes">&lt;td id=&quot;result&quot;&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
        </RegExp>
       <expression noclean="1"/> <!-- THIS -->
    </RegExp>
  </GetSearchResults>

  <GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>

small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #30
Damn, it's really hard :)

This URL: http://localhost:52026/CMMServer/GetSear...aspx?s=Apt

gives me this page:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>
    Page sans titre
</title></head>
<body>
    <form name="form1" method="post" action="GetSearchResults.aspx?s=Apt" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUJNzgzNDMwNTMzZGS4WPlPQcYahgiGM0moo21VgV1KzA==" />
</div>

    <div>
    
    </div>

    <table id="Results">
    <tr>
        <td id="result">Apt pupil (1998)</td>
    </tr>
</table>
</form>
</body>
</html>

And now I want to extract "Apt pupil (1998)" to create a URL like:
http://localhost:52026/CMMServer/GetDeta...%281998%29

Hmm, with Scraper Editor the result of my scraper is:

"<results sorted="Yes"></results>"

So I think I'm wrong in my use of \1 in the output of the first RegExp in GetSearchResults... Am I right?
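
One thing that stands out in the scraper above: the inner expression has a single capture group, so only \1 is filled in, while the title in the output uses \2. Below is a minimal sketch of the inner RegExp with \1 used for both the title and the URL, assuming the title text is also what GetDetails expects as its `s` parameter. Whether this alone explains the empty result is not confirmed anywhere in this thread, and the sketch does not URL-encode the title.

```xml
<RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
  <!-- Only one pair of parentheses below, so \1 is the only group available to the output -->
  <expression repeat="yes">&lt;td id=&quot;result&quot;&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
</RegExp>
```

Against the sample page, the group should capture "Apt pupil (1998)", which would then appear both as the title and at the end of the details URL.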