I've never write a scraper, a quick help ? - Kodi Community Forum, Development, Scrapers (/showthread.php?tid=56886)
I've never write a scraper, a quick help ? - small_frenchy - 2009-08-25

Hi there!

I'm working on a tool that combines XBMC scrapers for a search and then produces an HTML rendering of the result. The idea is to be able to scrape from multiple sources (for example, take the thumb from the imdb scraper, the fanart from the tmdb scraper and the plot from the allocine scraper). My tool is starting to do the job (it uses my dll to scrape with the different scrapers) and produces an HTML result.

What I want to do now is write a scraper. Here is a little sample. With the request:

http://localhost:52026/CMMServer/Default.aspx?s=Basic

I get the response:

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

But I'm not strong enough to write the scraper (I'm not very fluent with regexp). Can anyone help me?

- spiff - 2009-08-25

uhrr. read the scrapers. pretty much EVERY scraper grabs info from several sites.

- small_frenchy - 2009-08-25

Hi spiff,

Nevertheless, my tool produces this kind of HTML and I need to scrape this result.

- spiff - 2009-08-25

then it's time to learn regexp

why don't you make your local service output the xbmc xml format? then a scraper is dead simple...

- flobbes - 2009-08-25

Don't you think it's a bit odd to come here and ask other people to implement a scraper for you that they don't even benefit from? Why don't you just start reading how to build your own scraper and try the editor that you can find here. It only took me 1 day to build my own scraper, starting with no knowledge at all. It's really beginner friendly and in the end it was even fun.

- small_frenchy - 2009-08-25

In fact, my little tool has its own database. I have a form to configure the scraping. Then I scrape and store the data in a database. Finally, I have another form to let the user select a specific fanart and a specific thumb for each movie.
Then I wrote a simple web app to retrieve the data from the database into a simple page. Are you saying that XBMC stores media info internally as XML, and you suggest I get this XML?

@flobbes Hi, and sorry to bother you with my question. It was just to save time in my dev. It's a good idea to use the editor, I hadn't thought of it, thanks for the idea.

- spiff - 2009-08-25

no, i mean have that web page you wrote generate xml instead of html.

- small_frenchy - 2009-08-25

Hmm, yes, why not, but why would it be easier to scrape? (OK, I don't understand a lot about scrapers; in fact I've just made a dll from the xbmc code to scrape, that's all.)

- spiff - 2009-08-25

because you don't need to scrape that way - you just pass the entire results on

Code:
<GetDetails dest="3">

and you're done

- small_frenchy - 2009-08-25

How can I know the XML format I have to produce? (I have to go now, I will look into this tomorrow.)

- spiff - 2009-08-25

wiki

- small_frenchy - 2009-08-25

OK, I will read this tomorrow. spiff, thanks for your help. My tool is in early alpha (the UI is very bad for now). I will publish it when it is cleaner.

- DaveGee - 2009-08-25

spiff Wrote: then it's time to learn regexp

regex isn't hard... hell no! M4, now we're talking hard... Try simply reading an 'ordinary' sendmail.cf file and then explain what it's doing... yeah, the people responsible for that must have been on one "very special high" when they wrote that stuff...

Okay, let's get serious! That's not to say that the above comments are in any way an exaggeration... They're not. They are quite serious and accurate. regex isn't THAT hard so much as it can be picky... not all regex are created equal, and depending on the version adopted by the developer (in this case the XBMC developers) you might see some minor variations in syntax when you're trying to learn it.
I'd say a good generic place to start learning regex is: http://www.icewarp.com/support/online_manual/203030104.htm

It's very basic but it will start your brain thinking in a 'regex' mindset. It might seem daunting, but little by little you'll begin to see certain character sequences and say "AH, I know what that's searching for", and then again you'll see long (and very elegant, I might add) regex sequences and say "WTFreak could they possibly be looking for". lol

Before diving too deep into the tutorial I linked, you might want to first find out which regex variant they have adopted in XBMC and then look for a tutorial based specifically on that variant, so you don't learn too many 'wrong' things.

GLuck!
Dave

- spiff - 2009-08-25

heh, true. but you need nowhere near those skills to do simple selections, which is what writing a scraper takes

- small_frenchy - 2009-08-26

It's not the first time I have had to fight with regex in my work, but I really hate that. But thanks DaveGee, I will take a look at your link and try to make an effort to understand regex a little more.

So spiff, if I understand correctly, all I have to do is generate this kind of XML:

Code:
<details>

Am I right? I can't see where to put fanarts in this XML...
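
For readers following along, spiff's suggestion (have the local web page emit XML instead of HTML) could look roughly like the sketch below. It is only an illustration under assumptions: the thread names nothing beyond the `<details>` root element, so the child element names (`title`, `year`, `plot`, `thumb`), the `build_details` helper and the sample record are hypothetical and should be checked against the wiki spiff points to. It does not settle where fanart belongs.

```python
import xml.etree.ElementTree as ET


def build_details(movie):
    """Serialize one movie record as a <details> XML string.

    The child element names used here are assumptions for illustration;
    the authoritative list is on the XBMC wiki.
    """
    root = ET.Element("details")
    for field in ("title", "year", "plot", "thumb"):
        value = movie.get(field)
        if value is not None:
            # ElementTree escapes &, < and > in text when serializing.
            ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, standing in for a row from the tool's own database.
movie = {
    "title": "Example Movie",
    "year": 2009,
    "plot": "A short plot with an & in it.",
    "thumb": "http://localhost/thumbs/example.jpg",
}
print(build_details(movie))
```

Building the document with an XML library rather than string concatenation keeps the output well-formed even when a plot or title contains characters such as `&` or `<`, which matters if a scraper is going to pass the result on unchanged.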