I've never written a scraper; a quick help?

small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #1
Hi there!

I'm working on a tool which combines XBMC scrapers for a search and then produces an HTML rendering of the result. The idea is to be able to scrape from multiple sources (for example, take the thumb from the imdb scraper, the fanart from the tmdb scraper and the plot from the allocine scraper).

My tool is starting to do the job (it uses my DLL to scrape with the different scrapers) and produces an HTML result. What I want to do now is write a scraper for that result.

Here is a little sample.

With the request: http://localhost:52026/CMMServer/Default.aspx?s=Basic

I get the response:

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>
    CMM Response
</title></head>
<body>
    <form name="form1" method="post" action="Default.aspx?s=Basic+(2003)" id="form1">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE4MTc3Njc3MzUPZBYCAgMPZBYCAgEPZBYcZg9kFgJmD2QWAmYPDxYCHghJbWFnZVVybAU2L​i90ZW1wL2ZhbmFydF8yZWZmNjc1YS05NzRjLTQxZDgtOTlkNy1lYjM3MWY0MjE0ZmYuanBnZGQCAQ9kF​gJmD2QWAmYPDxYCHwAFNS4vdGVtcC90aHVtYl8yZWZmNjc1YS05NzRjLTQxZDgtOTlkNy1lYjM3MWY0M​jE0ZmYuanBnZGQCAg9kFgJmD2QWAmYPDxYCHgRUZXh0BdQBVW5lIG51aXQsIGxvcnMgZCd1biBleGVyY​2ljZSBkJ2VudHJhw65uZW1lbnQsIHVuIG91cmFnYW4gZnJhcHBlIFBhbmFtYS4gU2l4IG1pbGl0YWlyZ​XMgZG9udCBsJ2F1dG9yaXRhaXJlIHNlcmdlbnQgV2VzdCBkaXNwYXJhaXNzZW50IGV0IGlsIG5lIHJlc​3RlIHF1ZSBkZXV4IHTDqW1vaW5zIHBvdXIgcmFjb250ZXIgY2UgcXUnaWwgcydlc3QgcGFzc8OpLg3vg​I3rqq3vgI1kZAIDD2QWAmYPZBYCZg8PFgIfAQUOSm9obiBNY1RpZXJuYW5kZAIED2QWAmYPZBYCZg8PF​gIfAQURQWN0aW9uIC8gVGhyaWxsZXJkZAIFD2QWAmYPZBYCZg8PFgIfAQXUAVVuZSBudWl0LCBsb3JzI​GQndW4gZXhlcmNpY2UgZCdlbnRyYcOubmVtZW50LCB1biBvdXJhZ2FuIGZyYXBwZSBQYW5hbWEuIFNpe​CBtaWxpdGFpcmVzIGRvbnQgbCdhdXRvcml0YWlyZSBzZXJnZW50IFdlc3QgZGlzcGFyYWlzc2VudCBld​CBpbCBuZSByZXN0ZSBxdWUgZGV1eCB0w6ltb2lucyBwb3VyIHJhY29udGVyIGNlIHF1J2lsIHMnZXN0I​HBhc3PDqS4N74CN66qt74CNZGQCBg9kFgJmD2QWAmYPDxYCHwEFCDFoIDM4bWluZGQCBw9kFgJmD2QWA​mYPDxYCHwFkZGQCCA9kFgJmD2QWAmYPDxYCHwEFcU9uIHNlbnQgcXVlIGxlcyBkZXV4IGFjdGV1cnMgc​HJlbm5lbnQgdW4gcsOpZWwgcGxhaXNpciDDoCBqb3VlciBhdSBjaGF0IGV0IMOgIGxhIHNvdXJpcy4gR​XQgbm91cyBhdXNzaSAh66qt74CN66qtZGQCCQ9kFgJmD2QWAmYPDxYCHwEFBUJhc2ljZGQCCg9kFgJmD​2QWAmYPDxYCHwEFBjIzLDMyNWRkAgsPZBYCZg9kFgJmDw8WAh8BZGRkAgwPZBYCZg9kFgJmDw8WAh8BB​QM2LDNkZAIND2QWAmYPZBYCZg8PFgIfAQUEMjAwM2RkZBzUu/5/2nYny+68afkNsdhQhAhN" />
</div>


    <table id="TableResult" border="0" style="width:133px;">

    <tr id="FanartRow">
        <td id="FanartCell"><img id="FanartImage" src="./temp/fanart_2eff675a-974c-41d8-99d7-eb371f4214ff.jpg" style="border-width:0px;" /></td>
    </tr><tr id="TableRow1">
        <td id="TableCell1"><img id="ThumbImage" src="./temp/thumb_2eff675a-974c-41d8-99d7-eb371f4214ff.jpg" style="border-width:0px;" /></td>
    </tr><tr id="TableRow2">
        <td id="TableCell2"><span id="Plot">Une nuit, lors d'un exercice d'entraînement, un ouragan frappe Panama. Six militaires dont l'autoritaire sergent West disparaissent et il ne reste que deux témoins pour raconter ce qu'il s'est passé.</span></td>
    </tr><tr id="TableRow3">
        <td id="TableCell3"><span id="Director">John McTiernan</span></td>

    </tr><tr id="TableRow4">
        <td id="TableCell4"><span id="Genre">Action / Thriller</span></td>
    </tr><tr id="TableRow5">
        <td id="TableCell5"><span id="PlotOutline">Une nuit, lors d'un exercice d'entraînement, un ouragan frappe Panama. Six militaires dont l'autoritaire sergent West disparaissent et il ne reste que deux témoins pour raconter ce qu'il s'est passé.</span></td>
    </tr><tr id="TableRow6">
        <td id="TableCell6"><span id="Runtime">1h 38min</span></td>
    </tr><tr id="TableRow7">

        <td id="TableCell7"><span id="Studio"></span></td>
    </tr><tr id="TableRow8">
        <td id="TableCell8"><span id="Tagline">On sent que les deux acteurs prennent un réel plaisir à jouer au chat et à la souris. Et nous aussi !</span></td>
    </tr><tr id="TableRow9">
        <td id="TableCell9"><span id="Title">Basic</span></td>
    </tr><tr id="TableRow10">
        <td id="TableCell10"><span id="Votes">23,325</span></td>

    </tr><tr id="TableRow11">
        <td id="TableCell11"><span id="WritingCredit"></span></td>
    </tr><tr id="TableRow12">
        <td id="TableCell12"><span id="Rating">6,3</span></td>
    </tr><tr id="TableRow13">
        <td id="TableCell13"><span id="Year">2003</span></td>
    </tr>
</table>

    

    </form>
</body>
</html>

But I'm not strong enough to write the scraper (I'm not very fluent with regexp). Can anyone help me?
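For reference, the selection needed on this page is small. Below is a rough sketch in Python (not XBMC scraper XML) that pulls every `<span id="...">` value and every `<img id="..." src="...">` out of a trimmed copy of the response above. The names `SAMPLE_HTML`, `SPAN_RE`, `IMG_RE` and `extract_fields` are made up for illustration, and the patterns assume the markup keeps exactly the attribute order shown in the sample.

```python
import re

# Trimmed copy of the response above; only the id-tagged cells matter here.
SAMPLE_HTML = """
<td id="FanartCell"><img id="FanartImage" src="./temp/fanart_2eff675a-974c-41d8-99d7-eb371f4214ff.jpg" style="border-width:0px;" /></td>
<td id="TableCell3"><span id="Director">John McTiernan</span></td>
<td id="TableCell6"><span id="Runtime">1h 38min</span></td>
<td id="TableCell7"><span id="Studio"></span></td>
<td id="TableCell9"><span id="Title">Basic</span></td>
<td id="TableCell13"><span id="Year">2003</span></td>
"""

# Non-greedy (.*?) stops at the first closing </span> instead of the last one.
SPAN_RE = re.compile(r'<span id="([^"]+)">(.*?)</span>', re.DOTALL)
# Assumes id comes directly before src, as in the sample page.
IMG_RE = re.compile(r'<img id="([^"]+)" src="([^"]+)"')


def extract_fields(html):
    """Return a dict mapping each element id to its text (spans) or src (images)."""
    fields = dict(SPAN_RE.findall(html))
    fields.update(IMG_RE.findall(html))
    return fields


if __name__ == "__main__":
    for key, value in extract_fields(SAMPLE_HTML).items():
        print(key, "=", value)
```

The same two patterns would be the starting point for the expressions in a real scraper, one per field, but the exact syntax accepted by XBMC's regex engine should be checked before reusing them.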
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #2
uhrr. read the scrapers. pretty much EVERY scraper grabs info from several sites.
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #3
Hi spiff,

Nevertheless, my tool produces this kind of HTML and I need to scrape this result.
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #4
then it's time to learn regexp ;)

why don't you make your local service output the xbmc xml format? then a scraper is dead simple...
flobbes Offline
Senior Member
Posts: 131
Joined: Mar 2009
Reputation: 0
Post: #5
Don't you think it's a bit odd to come here and ask other people to implement a scraper for you that they don't even benefit from?

Why don't you just start reading how to build your own scraper and try the editor that you can find here.

It only took me 1 day to build my own scraper, starting with no knowledge at all.

It's really beginner friendly, and in the end it was even fun.
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #6
In fact, my little tool has its own database. I have a form to configure the scraping. Then I scrape and store the data in the database. Finally, I have another form to let the user select a specific fanart and a specific thumb for each movie.

Then I wrote a simple web app to retrieve the data from the database in a simple page.

What you're saying is that XBMC stores media info internally in XML, and you suggest I get this XML?

@flobbes

Hi, and sorry to bother you with my question. It was just to save time in my dev. It's a good idea to use the editor; I hadn't thought of it, thanks for the idea.
(This post was last modified: 2009-08-25 17:37 by small_frenchy.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #7
no, i mean have that web page you wrote generate xml instead of html.
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #8
Hmm, yes, why not, but why would it be easier to scrape?
(OK, in fact I don't understand much about scrapers. I've just made a DLL from the XBMC code to scrape, that's all)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #9
because you don't need to scrape that way - you just pass the entire results on

Code:
<GetDetails dest="3">
    <RegExp input="$$1" output="\1" dest="3">
        <expression noclean="1"/>
    </RegExp>
</GetDetails>

and you're done
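To make the idea concrete, here is a small Python sketch of what that pass-through amounts to. It assumes (this is my reading of the snippet, not something stated in the thread) that an empty `<expression/>` behaves like `(.*)` over the whole page and that `noclean="1"` leaves the captured text untouched, so `\1` is simply the page as served. The function name `passthrough` and the sample string are hypothetical.

```python
import re


def passthrough(page, expression=""):
    """Return capture group 1 of `expression` applied to `page`.

    Assumption: an empty expression is treated as "(.*)", i.e. it captures
    the entire buffer, so the server's XML is handed on unchanged.
    """
    pattern = expression or "(.*)"
    # DOTALL lets "." cross line breaks so a multi-line document is captured whole.
    match = re.search(pattern, page, re.DOTALL)
    return match.group(1) if match else ""


if __name__ == "__main__":
    served = "<details>\n    <title>Basic</title>\n    <year>2003</year>\n</details>"
    print(passthrough(served) == served)
```

In other words, all the work moves to the web page: if it already serves XML in the expected shape, the scraper has nothing left to select.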
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #10
How can I know which XML format I have to produce? (I have to go now; I will look into this tomorrow.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #11
wiki
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #12
OK, I will read this tomorrow, spiff; thanks for your help. My tool is in early alpha (the UI is very bad for now :)). I will publish it when it is cleaner :)
DaveGee Offline
Senior Member
Posts: 118
Joined: Aug 2009
Reputation: 0
Post: #13
spiff Wrote: then it's time to learn regexp ;)

regex isn't hard... hell no!

M4, now we're talking hard... Try simply reading an 'ordinary' sendmail.cf file and then explain what it's doing... yeah, the people responsible for that must have been on one "very special high" when they wrote that stuff...

Okay, let's get serious!

That's not to say that the above comments are in any way an exaggeration... They're not. They are quite serious and accurate. :D

regex isn't THAT hard so much as it can be picky... not all regex flavors are created equal, and depending on the version adopted by the developer (in this case the XBMC developers) you might see some minor variations in syntax when you're trying to learn it.

I'd say a good generic place to start learning regex is:

http://www.icewarp.com/support/online_ma...030104.htm

It's very basic but it will start your brain thinking in a 'regex' mindset.

It might seem daunting, but little by little you'll begin to see certain character sequences and say "AH, I know what that's searching for", and then again you'll see long (and, I might add, very elegant) regex sequences and say "WTFreak could they possibly be looking for?" lol

Before diving too deep into the tutorial I linked, you might want to first find out which regex variant they have adopted in XBMC and then look for a tutorial based specifically on that variant so you don't learn too many 'wrong' things.

GLuck!

Dave
(This post was last modified: 2009-08-25 20:33 by DaveGee.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #14
heh, true. but you need nowhere near those skills to do simple selections, which is what writing a scraper takes
small_frenchy Offline
Junior Member
Posts: 32
Joined: Jun 2009
Reputation: 0
Post: #15
It's not the first time I've had to fight with regex in my work, but I really hate it :) But thanks DaveGee, I will take a look at your link and make an effort to understand regex a little more :)

so spiff, if I understand correctly, all I have to do is generate this kind of XML:

Code:
<details>
    <title></title>
    <year></year>
    <director></director>
    <top250></top250>
    <mpaa></mpaa>
    <tagline></tagline>
    <runtime></runtime>
    <thumb></thumb>
    <credits></credits>
    <rating></rating>
    <votes></votes>
    <genre></genre>
    <actor>
        <name></name>
        <role></role>
    </actor>
    <outline></outline>
    <plot></plot>
</details>

Am I right? I can't see where to put fanart in this XML...
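As a sketch of what the web page would have to emit, here is a small Python example that builds the `<details>` document above from a plain dictionary using the standard library's `xml.etree.ElementTree`. The function name `build_details`, the dictionary keys and the `actors` list are assumptions made for illustration; the actor name and role are placeholders, since the sample page has no cast. Fanart is left out on purpose, because nothing in the thread so far says where it belongs.

```python
import xml.etree.ElementTree as ET

# Simple text fields, split around <actor> to follow the order of the template above.
FIELDS_BEFORE_ACTORS = (
    "title", "year", "director", "top250", "mpaa", "tagline",
    "runtime", "thumb", "credits", "rating", "votes", "genre",
)
FIELDS_AFTER_ACTORS = ("outline", "plot")


def build_details(movie):
    """Serialize a movie dict into a <details> XML string.

    Missing keys produce empty elements. ElementTree escapes special
    characters (&, <, >) in the text for us.
    """
    details = ET.Element("details")
    for tag in FIELDS_BEFORE_ACTORS:
        ET.SubElement(details, tag).text = movie.get(tag, "")
    for name, role in movie.get("actors", []):
        actor = ET.SubElement(details, "actor")
        ET.SubElement(actor, "name").text = name
        ET.SubElement(actor, "role").text = role
    for tag in FIELDS_AFTER_ACTORS:
        ET.SubElement(details, tag).text = movie.get(tag, "")
    return ET.tostring(details, encoding="unicode")


if __name__ == "__main__":
    print(build_details({
        "title": "Basic",
        "year": "2003",
        "director": "John McTiernan",
        "runtime": "1h 38min",
        "genre": "Action / Thriller",
        "actors": [("Placeholder Name", "Placeholder Role")],  # hypothetical cast entry
    }))
```

The web app in the thread is ASP.NET, so this only shows the shape of the output; the same structure can be produced with whatever XML writer that platform provides.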
(This post was last modified: 2009-08-26 07:46 by small_frenchy.)