I've never written a scraper. Can someone give me some quick help?

small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #1
Hi there!

I'm working on a tool that combines XBMC scrapers for a search and then produces an HTML rendering of the result. The idea is to be able to scrape from multiple sources (for example, take the thumb from the imdb scraper, the fanart from the tmdb scraper and the plot from the allocine scraper).

My tool is starting to do the job (it uses my DLL to scrape from the different scrapers) and produces an HTML result. What I want to do now is write a scraper for that result.

Here is a small sample:

With the request: http://localhost:52026/CMMServer/Default.aspx?s=Basic

I get the response:

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>
    CMM Response
</title></head>
<body>
    <form name="form1" method="post" action="Default.aspx?s=Basic+(2003)" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE4MTc3Njc3MzUPZBYCAgMPZBYCAgEPZBYcZg9kFgJmD2QWAmYPDxYCHghJbWFnZVVybAU2L​i90ZW1wL2ZhbmFydF8yZWZmNjc1YS05NzRjLTQxZDgtOTlkNy1lYjM3MWY0MjE0ZmYuanBnZGQCAQ9kF​gJmD2QWAmYPDxYCHwAFNS4vdGVtcC90aHVtYl8yZWZmNjc1YS05NzRjLTQxZDgtOTlkNy1lYjM3MWY0M​jE0ZmYuanBnZGQCAg9kFgJmD2QWAmYPDxYCHgRUZXh0BdQBVW5lIG51aXQsIGxvcnMgZCd1biBleGVyY​2ljZSBkJ2VudHJhw65uZW1lbnQsIHVuIG91cmFnYW4gZnJhcHBlIFBhbmFtYS4gU2l4IG1pbGl0YWlyZ​XMgZG9udCBsJ2F1dG9yaXRhaXJlIHNlcmdlbnQgV2VzdCBkaXNwYXJhaXNzZW50IGV0IGlsIG5lIHJlc​3RlIHF1ZSBkZXV4IHTDqW1vaW5zIHBvdXIgcmFjb250ZXIgY2UgcXUnaWwgcydlc3QgcGFzc8OpLg3vg​I3rqq3vgI1kZAIDD2QWAmYPZBYCZg8PFgIfAQUOSm9obiBNY1RpZXJuYW5kZAIED2QWAmYPZBYCZg8PF​gIfAQURQWN0aW9uIC8gVGhyaWxsZXJkZAIFD2QWAmYPZBYCZg8PFgIfAQXUAVVuZSBudWl0LCBsb3JzI​GQndW4gZXhlcmNpY2UgZCdlbnRyYcOubmVtZW50LCB1biBvdXJhZ2FuIGZyYXBwZSBQYW5hbWEuIFNpe​CBtaWxpdGFpcmVzIGRvbnQgbCdhdXRvcml0YWlyZSBzZXJnZW50IFdlc3QgZGlzcGFyYWlzc2VudCBld​CBpbCBuZSByZXN0ZSBxdWUgZGV1eCB0w6ltb2lucyBwb3VyIHJhY29udGVyIGNlIHF1J2lsIHMnZXN0I​HBhc3PDqS4N74CN66qt74CNZGQCBg9kFgJmD2QWAmYPDxYCHwEFCDFoIDM4bWluZGQCBw9kFgJmD2QWA​mYPDxYCHwFkZGQCCA9kFgJmD2QWAmYPDxYCHwEFcU9uIHNlbnQgcXVlIGxlcyBkZXV4IGFjdGV1cnMgc​HJlbm5lbnQgdW4gcsOpZWwgcGxhaXNpciDDoCBqb3VlciBhdSBjaGF0IGV0IMOgIGxhIHNvdXJpcy4gR​XQgbm91cyBhdXNzaSAh66qt74CN66qtZGQCCQ9kFgJmD2QWAmYPDxYCHwEFBUJhc2ljZGQCCg9kFgJmD​2QWAmYPDxYCHwEFBjIzLDMyNWRkAgsPZBYCZg9kFgJmDw8WAh8BZGRkAgwPZBYCZg9kFgJmDw8WAh8BB​QM2LDNkZAIND2QWAmYPZBYCZg8PFgIfAQUEMjAwM2RkZBzUu/5/2nYny+68afkNsdhQhAhN" />
</div>


    <table id="TableResult" border="0" style="width:133px;">

    <tr id="FanartRow">
        <td id="FanartCell"><img id="FanartImage" src="./temp/fanart_2eff675a-974c-41d8-99d7-eb371f4214ff.jpg" style="border-width:0px;" /></td>
    </tr><tr id="TableRow1">
        <td id="TableCell1"><img id="ThumbImage" src="./temp/thumb_2eff675a-974c-41d8-99d7-eb371f4214ff.jpg" style="border-width:0px;" /></td>
    </tr><tr id="TableRow2">
        <td id="TableCell2"><span id="Plot">Une nuit, lors d'un exercice d'entraînement, un ouragan frappe Panama. Six militaires dont l'autoritaire sergent West disparaissent et il ne reste que deux témoins pour raconter ce qu'il s'est passé.</span></td>
    </tr><tr id="TableRow3">
        <td id="TableCell3"><span id="Director">John McTiernan</span></td>

    </tr><tr id="TableRow4">
        <td id="TableCell4"><span id="Genre">Action / Thriller</span></td>
    </tr><tr id="TableRow5">
        <td id="TableCell5"><span id="PlotOutline">Une nuit, lors d'un exercice d'entraînement, un ouragan frappe Panama. Six militaires dont l'autoritaire sergent West disparaissent et il ne reste que deux témoins pour raconter ce qu'il s'est passé.</span></td>
    </tr><tr id="TableRow6">
        <td id="TableCell6"><span id="Runtime">1h 38min</span></td>
    </tr><tr id="TableRow7">

        <td id="TableCell7"><span id="Studio"></span></td>
    </tr><tr id="TableRow8">
        <td id="TableCell8"><span id="Tagline">On sent que les deux acteurs prennent un réel plaisir à jouer au chat et à la souris. Et nous aussi !</span></td>
    </tr><tr id="TableRow9">
        <td id="TableCell9"><span id="Title">Basic</span></td>
    </tr><tr id="TableRow10">
        <td id="TableCell10"><span id="Votes">23,325</span></td>

    </tr><tr id="TableRow11">
        <td id="TableCell11"><span id="WritingCredit"></span></td>
    </tr><tr id="TableRow12">
        <td id="TableCell12"><span id="Rating">6,3</span></td>
    </tr><tr id="TableRow13">
        <td id="TableCell13"><span id="Year">2003</span></td>
    </tr>
</table>

    

    </form>
</body>
</html>

But I'm not skilled enough to write the scraper (I'm not very fluent with regexps). Can anyone help me?
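
As a starting point for the regexp part, here is a minimal Python sketch (not an XBMC scraper definition) of the kind of pattern that would pull the values out of the response above. It assumes every field sits in a `<span id="...">` element exactly as in the sample; `SPAN_RE` and `extract_fields` are hypothetical names used only for illustration.

```python
import re
from html import unescape

# Hypothetical pattern: capture the id and the text of each <span id="...">...</span>.
# It assumes the markup looks exactly like the sample response (no nested tags).
SPAN_RE = re.compile(r'<span id="([^"]+)">(.*?)</span>', re.DOTALL)


def extract_fields(html: str) -> dict:
    """Return a {span id: text} mapping for every span found in the page."""
    return {name: unescape(value.strip()) for name, value in SPAN_RE.findall(html)}


# Three rows copied from the sample response, including an empty field.
sample = '''
<td id="TableCell3"><span id="Director">John McTiernan</span></td>
<td id="TableCell7"><span id="Studio"></span></td>
<td id="TableCell13"><span id="Year">2003</span></td>
'''

fields = extract_fields(sample)
print(fields["Director"], fields["Year"])
```

The same capture-group idea (one group for the field name, one for its value) is what a scraper regexp would have to express for each field.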
spiff Online
Grumpy Bastard Developer
Posts: 12,186
Joined: Nov 2003
Reputation: 82
Post: #2
uhrr. read the scrapers. pretty much EVERY scraper grabs info from several sites.

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #3
Hi spiff,

Nevertheless, my tool produces this kind of HTML and I need to scrape this result.
spiff Online
Grumpy Bastard Developer
Posts: 12,186
Joined: Nov 2003
Reputation: 82
Post: #4
then it's time to learn regexp ;)

why don't you make your local service output the xbmc xml format? then a scraper is dead simple...

flobbes Offline
Senior Member
Posts: 140
Joined: Mar 2009
Reputation: 0
Post: #5
Don't you think it's a bit odd to come here and ask other people to implement a scraper for you that they won't even benefit from?

Why don't you just start reading up on how to build your own scraper and try the editor that you can find here.

It only took me 1 day to build my own scraper, starting with no knowledge at all.

It's really beginner-friendly, and in the end it was even fun.
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #6
In fact, my little tool has its own database. I have a form to configure the scraping. Then I scrape and store the data in the database. Finally, I have another form that lets the user select a specific fanart and a specific thumb for each movie.

Then I wrote a simple web app that retrieves the data from the database and shows it in a simple page.

Are you saying that XBMC stores media info internally as XML, and are you suggesting that I get this XML?

@flobbes

Hi, and sorry to bother you with my question. It was just to save time in my development. Using the editor is a good idea that I hadn't thought of, thanks.
(This post was last modified: 2009-08-25 17:37 by small_frenchy.)
spiff Online
Grumpy Bastard Developer
Posts: 12,186
Joined: Nov 2003
Reputation: 82
Post: #7
no, i mean have that web page you wrote generate xml instead of html.
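
A minimal sketch of what "generate xml instead of html" could look like, written in Python rather than in the ASP.NET code behind the page. The `<details>` root and the lowercase child element names are assumptions about the XBMC scraper output format and should be checked against an existing scraper; `FIELD_TO_TAG` and `build_details_xml` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the page's field names to XBMC-style element names.
# Verify these tag names against a real scraper before relying on them.
FIELD_TO_TAG = {
    "Title": "title",
    "Year": "year",
    "Director": "director",
    "Genre": "genre",
    "Plot": "plot",
    "PlotOutline": "outline",
    "Tagline": "tagline",
    "Runtime": "runtime",
    "Rating": "rating",
    "Votes": "votes",
}


def build_details_xml(fields: dict) -> str:
    """Serialize the non-empty fields as children of one <details> element."""
    root = ET.Element("details")
    for field, tag in FIELD_TO_TAG.items():
        value = fields.get(field, "")
        if value:  # skip empty fields such as Studio in the sample
            ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_details_xml(
    {"Title": "Basic", "Year": "2003", "Director": "John McTiernan"}
)
print(xml_text)
```

Building the document with an XML library rather than string concatenation means characters such as `&` and `<` in a plot or tagline are escaped automatically.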

small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #8
Hmm, yes, why not, but why would it be easier to scrape?
(OK, I don't actually understand much about scrapers :rolleyes: I've just made a DLL from the XBMC code to scrape, that's all.)
spiff Online
Grumpy Bastard Developer
Posts: 12,186
Joined: Nov 2003
Reputation: 82
Post: #9
because you don't need to scrape that way - you just pass the entire results on

Code:
<GetDetails dest="3">
           <RegExp input="$$1" output="\1" dest="3">
              <expression noclean="1"/>
           </RegExp>
</GetDetails>

and you're done
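
To illustrate the idea of passing the entire result on, here is a small Python simulation. It is not XBMC code, and it rests on an assumption: that the empty `<expression/>` behaves like a capture-everything pattern whose first group is copied to the output by `\1`. The function name `pass_through` is hypothetical.

```python
import re


def pass_through(buffer: str) -> str:
    """Copy the whole input buffer to the output unchanged."""
    # Assumption: an empty expression acts like "(.*)" over the entire buffer,
    # so group 1 (the scraper's \1) is the complete page content.
    match = re.search(r"(.*)", buffer, re.DOTALL)
    return match.group(1)


page = "<details><title>Basic</title><year>2003</year></details>"
print(pass_through(page) == page)
```

Under that assumption, no per-field regexp is needed: if the page already serves the XML, the scraper only has to hand it over.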

small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #10
How can I find out which XML format I have to produce? (I have to go now; I'll look into this tomorrow.)