I've never written a scraper; can I get some quick help?

spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #16
easiest for you is to scan a movie or two, then export your library (single file is fine). then look at the format there - it's exactly the same. if you read c++, it's in VideoInfoTag.cpp, ::Save()

wiki is outdated it seems
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #17
OK, I got it!
I will produce the XML by working from the code. Perfect, thanks.
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #18
Ok,

now I am able to generate this response:

Code:
<movie>
<title>Basic</title>
<rating>6,300000</rating>
<year>2003</year>
<top250>0</top250>
<votes>23,325</votes>

<outline>
Une nuit, lors d'un exercice d'entraînement, un ouragan frappe Panama. Six militaires dont l'autoritaire sergent West disparaissent et il ne reste que deux témoins pour raconter ce qu'il s'est passé.
</outline>

<plot>
Une nuit, lors d'un exercice d'entraînement, un ouragan frappe Panama. Six militaires dont l'autoritaire sergent West disparaissent et il ne reste que deux témoins pour raconter ce qu'il s'est passé.
</plot>

<tagline>
On sent que les deux acteurs prennent un réel plaisir à jouer au chat et à la souris. Et nous aussi !
</tagline>
<runtime>1h 38min</runtime>

<thumb>
http://localhost:52026/CMMServer/temp/thumb_fd23a1e2-5df7-4e33-bbb0-ff3db720d3ca.jpg
</thumb>

<fanart>

<thumb>
http://localhost:52026/CMMServer/temp/fanart_fd23a1e2-5df7-4e33-bbb0-ff3db720d3ca.jpg
</thumb>
</fanart>
<mpaa/>
<playcount>0</playcount>
<lastplayed/>
<id>tt0264395</id>
<genre>Action / Thriller</genre>
<credits/>
<director>John McTiernan</director>
<premiered/>
<status/>
<code/>
<aired/>
<studio/>
<trailer/>

<actor>
<name>John Travolta</name>
<role>Hardy</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_5a33487c-78fe-4d4b-911f-8682bdfc3839.jpg
</thumb>
</actor>

<actor>
<name>Connie Nielsen</name>
<role>Osborne</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_4d4eadac-dda5-4748-bdaf-63cc1add370c.jpg
</thumb>
</actor>

<actor>
<name>Samuel L. Jackson</name>
<role>West</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_26259f9f-568a-44b0-9d19-105265742fce.jpg
</thumb>
</actor>

<actor>
<name>Tim Daly</name>
<role>Styles</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_e37fddcd-2e97-46c4-afc5-fb2b38eea759.jpg
</thumb>
</actor>

<actor>
<name>Giovanni Ribisi</name>
<role>Kendall</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_7dbdc481-4980-4d5f-a8c9-3f10671ab9a8.jpg
</thumb>
</actor>

<actor>
<name>Brian Van Holt</name>
<role>Dunbar</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_f362d674-0225-4c17-9a92-ca829459d70f.jpg
</thumb>
</actor>

<actor>
<name>Taye Diggs</name>
<role>Pike</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_82eb9a50-8714-4749-bb7b-69a99dc529e6.jpg
</thumb>
</actor>

<actor>
<name>Dash Mihok</name>
<role>Mueller</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_a58c15a0-1571-4452-be89-020ff1ae6314.jpg
</thumb>
</actor>

<actor>
<name>Cristián de la Fuente</name>
<role>Castro</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_5c2dac13-e899-44ab-9b43-ab4c5e752362.jpg
</thumb>
</actor>

<actor>
<name>Roselyn Sanchez</name>
<role>Nunez</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_545b7858-9054-4af6-8fbf-75d38a247fb4.jpg
</thumb>
</actor>

<actor>
<name>Harry Connick Jr.</name>
<role>Vilmer</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_9cf63a79-6252-495f-8cb5-4fad7e01bc9e.jpg
</thumb>
</actor>

<actor>
<name>Georgia Hausserman</name>
<role>Pilot</role>
<thumb/>
</actor>

<actor>
<name>Margaret Travolta</name>
<role>Nurse #1</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_c06e45d4-4696-4576-a74a-a93a6abb9082.jpg
</thumb>
</actor>

<actor>
<name>Dena Johnston</name>
<role>Nurse #2</role>
<thumb/>
</actor>

<actor>
<name>Nick Loren</name>
<role>Helicopter Pilot</role>

<thumb>
http://localhost:52026/CMMServer/temp/actor_dbcf3ff1-408b-416c-bb7f-92702aeb4e10.jpg
</thumb>
</actor>
<artist/>
</movie>

Next step: scrape it from XBMC. Let's try...
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #19
the surrounding tag needs to be <details>, not <movie>. different semantics for scrapers since they are multi-content.
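For illustration, here is a minimal sketch of a server-side helper that wraps the same fields in `<details>` rather than `<movie>`. It is written in Python only for brevity; the function name `build_details` and its argument layout are hypothetical and not part of the tool discussed in this thread.

```python
import xml.etree.ElementTree as ET

def build_details(title, year, actors):
    """Return a <details> document as a string.

    `actors` is a list of (name, role) pairs. The function name and
    argument layout are hypothetical, chosen only for this sketch.
    """
    root = ET.Element("details")          # root tag a scraper should return
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "year").text = str(year)
    for name, role in actors:
        actor = ET.SubElement(root, "actor")
        ET.SubElement(actor, "name").text = name
        ET.SubElement(actor, "role").text = role
    return ET.tostring(root, encoding="unicode")

xml_text = build_details("Basic", 2003, [("John Travolta", "Hardy")])
print(xml_text)
```

Building the document through an XML library rather than string concatenation also takes care of escaping characters such as `&` in titles.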
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #20
Okay, I'll make the change ;)
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #21
I'm currently trying to write the scraper.

Code:
<scraper name="CentralMediaServer" content="movies" thumb="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <CreateSearchUrl dest="3">
          <RegExp input="$$1" output="http://localhost:52026/CMMServer/Default.aspx?s=\1" dest="3">
             <expression noclean="1"/>
          </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="8">
        
    </GetSearchResults>

    <GetDetails dest="3">
           <RegExp input="$$1" output="\1" dest="3">
              <expression noclean="1"/>
           </RegExp>
    </GetDetails>
</scraper>

I have a small problem with GetSearchResults. My tool returns either one result or nothing, so I don't know how to configure GetSearchResults. (Should I make a result list page?)
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #22
I have created a page which returns results, and I'm trying to create my first scraper:

Code:
<scraper name="CentralMediaServer" content="movies" thumb="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <CreateSearchUrl dest="3">
          <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
             <expression noclean="1"/>
          </RegExp>
    </CreateSearchUrl>

    <GetSearchResults dest="8">
        <RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>" dest="8">
            <RegExp input="$$1" output="&lt;entity%gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&lt;td id="result"&gt;(.[^<]*)&lt;/td&gt;</expression>
            </RegExp>
    </RegExp>
    </GetSearchResults>

    <GetDetails dest="3">
           <RegExp input="$$1" output="\1" dest="3">
              <expression noclean="1"/>
           </RegExp>
    </GetDetails>
</scraper>

Aaand... it doesn't work :) Any clue?
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #23
malformed xml - firefox is your friend.
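Besides opening the file in a browser, a well-formedness check can be scripted. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `check_well_formed` is made up for this example, and the two sample strings are shortened versions of the `output` attribute from the scraper above.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return None if `text` parses as XML, else the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None

# Raw angle brackets inside an attribute value, as in the scraper above.
bad = '<RegExp input="$$5" output="<results>\\1</results>" dest="8"/>'
# The same attribute with the markup escaped as entities.
good = '<RegExp input="$$5" output="&lt;results&gt;\\1&lt;/results&gt;" dest="8"/>'

print(check_well_formed(bad))   # an error message from the parser
print(check_well_formed(good))  # None
```

The error message includes a line and column, which helps locate the offending attribute in a longer scraper file.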
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #24
Yes, my mistake (again :) )

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1.1" date="2009-08-11" name="CentralMediaServer" content="movies" thumb="" language="en" requiressettings="false">
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>

    <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity%gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;td id="result"&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
      </RegExp>
        </RegExp>
    </GetSearchResults>

  <GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>

This one seems to be well-formed, but it doesn't work :(

I have a problem around $$5 and $$8; I don't understand the buffer system very well...
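A rough mental model of the buffers can be sketched in Python. This is a simplification built on assumptions rather than a description of the actual XBMC engine: it assumes that a nested `<RegExp>` is evaluated before its parent, that `input="$$n"` reads buffer n, that `dest="n"` overwrites buffer n, that `repeat="yes"` emits one output per match, and that an empty `<expression/>` behaves like `(.*)`. The helper `run_regexp` is hypothetical.

```python
import re

def run_regexp(buffers, src, dest, pattern, output, repeat=False):
    """Model one <RegExp> step: match `pattern` against buffer `src` and
    write `output` (with \\1 replaced by the first group) to buffer `dest`."""
    matches = list(re.finditer(pattern, buffers.get(src, ""), re.S))
    if not repeat:
        matches = matches[:1]          # without repeat, only the first match counts
    buffers[dest] = "".join(output.replace("\\1", m.group(1)) for m in matches)

buffers = {1: '<td id="result">Apt pupil (1998)</td>'}   # $$1: the fetched page

# Inner step first: $$1 -> $$5, one <entity> per match.
run_regexp(buffers, 1, 5, r'<td id="result">([^<]*)</td>',
           "<entity><title>\\1</title></entity>", repeat=True)
# Outer step second: $$5 -> $$8, wrapping the entities in <results>.
run_regexp(buffers, 5, 8, r"(.*)", "<results>\\1</results>")

print(buffers[8])
```

Under this model, `GetSearchResults dest="8"` would hand back whatever ends up in buffer 8, so the outer step only has something to wrap if the inner step has already filled buffer 5.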
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #25
while the scraper xml is wellformed, the outputted xml from GetSearchResults is malformed.

please don't wear me out on these sillies. i have enough to do as it is.
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #26
Sorry, I don't understand "hand holding time is it?"; my English is not very good.

I have changed it to this, and it's no better :) I'm so "ridicule", as we say in French :)

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1.1" date="2009-08-11" name="CentralMediaServer" content="movies" thumb="" language="en" requiressettings="false">
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>

    <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity%gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;td id=&quot;result&quot;&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
      </RegExp>
        </RegExp>
    </GetSearchResults>

  <GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>

Don't worry, I'll find it...
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #27
Code:
<RegExp input="$$1" output="&lt;entity%gt;&lt ....
from the very function i mentioned. that % is NOT what you want -> malformed xml outputted.
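One way to catch this kind of typo is to decode the entities in the `output` attribute and parse what comes out. The sketch below does that with Python's standard library; the helper name is hypothetical, and the sample strings are shortened versions of the attribute quoted above with a placeholder title.

```python
import html
import xml.etree.ElementTree as ET

def decoded_output_is_well_formed(escaped_output):
    """Decode the entities of an output attribute and try to parse the result."""
    try:
        ET.fromstring(html.unescape(escaped_output))
    except ET.ParseError:
        return False
    return True

# "%gt;" is not an entity, so it survives decoding and breaks the <entity> tag.
typo = "&lt;entity%gt;&lt;title&gt;x&lt;/title&gt;&lt;/entity&gt;"
fixed = "&lt;entity&gt;&lt;title&gt;x&lt;/title&gt;&lt;/entity&gt;"

print(decoded_output_is_well_formed(typo))   # False
print(decoded_output_is_well_formed(fixed))  # True
```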
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #28
I have tried it with ScraperEditor. I think the results are better, but now I think it's a logic problem, and I can't find what it is...
Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1.1" date="2009-08-11" name="CentralMediaServer" content="movies" thumb="" language="en" requiressettings="false">
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>

    <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;td id=&quot;result&quot;&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
      </RegExp>
        </RegExp>
    </GetSearchResults>

  <GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #29
the second <RegExp> in getsearchresults has no matching <expression>

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1.1" date="2009-08-11" name="CentralMediaServer" content="movies" thumb="" language="en" requiressettings="false">
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://localhost:52026/CMMServer/GetSearchResults.aspx?s=\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>

  <GetSearchResults dest="8">
      <RegExp input="$$5" output="&lt;results sorted=&quot;Yes&quot;&gt;\1&lt;/results&gt;" dest="8">
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;'\2'&lt;/title&gt;&lt;url&gt;http://localhost:52026/CMMServer/GetDetails?s=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
          <expression repeat="yes">&lt;td id=&quot;result&quot;&gt;(.[^&lt;]*)&lt;/td&gt;</expression>
        </RegExp>
       <expression noclean="1"/> <!-- THIS -->
    </RegExp>
  </GetSearchResults>

  <GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #30
Damn, it's really hard :)

This URL: http://localhost:52026/CMMServer/GetSear...aspx?s=Apt

gives me this page:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>
    Page sans titre
</title></head>
<body>
    <form name="form1" method="post" action="GetSearchResults.aspx?s=Apt" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUJNzgzNDMwNTMzZGS4WPlPQcYahgiGM0moo21VgV1KzA==" />
</div>

    <div>
    
    </div>

    <table id="Results">
    <tr>
        <td id="result">Apt pupil (1998)</td>
    </tr>
</table>
</form>
</body>
</html>

And now I want to extract "Apt pupil (1998)" to create a url like :
http://localhost:52026/CMMServer/GetDeta...%281998%29

hmm, with Scraper Editor the result of my scraper is :

"<results sorted="Yes"></results>"

So I think I'm using \1 wrongly in the output of the first regexp in GetSearchResults... Am I right?
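The expression can be tried outside XBMC to see what each back-reference holds. The Python sketch below runs it against a shortened copy of the page above and builds the details URL from the `GetDetails?s=` prefix used in the scraper. Python's `re` module is only a stand-in for the regex engine XBMC uses, so this shows the capture groups and nothing more; the thread does not confirm what causes the empty result.

```python
import re
from urllib.parse import quote

page = '<table id="Results"><tr><td id="result">Apt pupil (1998)</td></tr></table>'
pattern = re.compile(r'<td id="result">(.[^<]*)</td>')

match = pattern.search(page)
print(pattern.groups)     # 1: the expression has a single capture group
print(match.group(1))     # Apt pupil (1998): the text available as \1

# Percent-encode the title before putting it in the query string.
url = "http://localhost:52026/CMMServer/GetDetails?s=" + quote(match.group(1), safe="")
print(url)
```

Since the expression defines only one group, the `\2` used for the title in the scraper's output has nothing to refer to; `\1` is the only back-reference that carries text here.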