ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... - Printable Version

+- Kodi Community Forum (https://forum.kodi.tv)
+-- Forum: Development (https://forum.kodi.tv/forumdisplay.php?fid=32)
+--- Forum: Scrapers (https://forum.kodi.tv/forumdisplay.php?fid=60)
+--- Thread: ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... (/showthread.php?tid=50055)
- Nicezia - 2009-05-24

I guess before I post I'm going to find some way to implement nfoUrl and getEpisodeGuide (seeing as each of those is dependent upon finding the right scraper to use to retrieve the info - currently the scraperParser runs through all processes and returns ALL the info as a complete xml - but I want to be able to do it on demand like XBMC does). I guess I'm going to have to make an object to keep track of all the available scrapers, and be able to run the url against each.

XBMC's Behaviour? - Nicezia - 2009-05-24

How long does XBMC hold the scraper info in cache?

- Nicezia - 2009-05-24

Gamester17 Wrote:@Nicezia, not sure if it is something you want in your ScraperXML library or not but FYI; changeset r20559 commits "Ability to scrape and scan TV Shows into the video library by air-date via TheTVDB.com" to XBMC scraper code and TheTVDB XML scraper, see patch #5143 on trac for more information.

Supported now! - spiff - 2009-05-24

we clear the cache on the start of each scrape, i.e. before calling CreateSearchUrl

- Nicezia - 2009-05-25

Just about got this thing wrapped up (it being a holiday weekend kinda took a little of my time). But all official XBMC scraper support is wrapped up; I just need to write some kind of interface to NfoUrl and EpisodeGuideUrl, and it's in the bag.

- Nicezia - 2009-05-26

Okay, I am sure this is the last question I need to ask. Almost every function in the IMDb scraper is looking for a value in $$2. Is there a standard bit of info that's supposed to go there for each function, or is this dependent upon the scraper? I know for GetSearchResults it's looking for the IMDb id in $$2, and in GetDetails it's looking for the url. The question is: is this what EVERY movie scraper should have in $$2 at those times?

- mkortstiege - 2009-05-26

$$2 holds the year for movies (if available) and artist for music scrapers.

- Nicezia - 2009-05-26

vdrfan Wrote:$$2 holds the year for movies (if available) and artist for music scrapers.
Ok, I know that's the case for CreateSearchUrl and GetSearchResults. After GetSearchResults is run, the buffers are cleared in the IMDb scraper. When GetDetails runs it searches for the html as a string in $$1, the id in $$2, and the actual webpage address in $$3 (as it applies it to the links for cast and posters, although the GetDetails function does not put that info in there by use of RegExp). I just need to know the default info sent to the buffers during GetDetails (or how the default information automatically sent to the buffers is decided) for each scraper type.

- mkortstiege - 2009-05-26

That's why cptspiff is mr. scraper

- Nicezia - 2009-05-26

Here's what I have it doing by default: the entire entity of the selected result is sent back to the GetDetails function, the buffers are cleared and the webpage downloaded to $$1, the id (if it exists in the entity element) is set as $$2, and the url (also from the entity element) is set as $$3. I just need to know if that's a blanket solution for the GetDetails function or if I need to program some routine to make decisions on other stuff to put into the buffers. How's about it, Cpt. Spiff!?

- spiff - 2009-05-26

gotta run now, have a look at the relevant functions in IMDB.cpp. if that isn't enough i'll give a summary later

Getting MediaPortal developers involved with ScraperXML library to use XBMC scrapers - Gamester17 - 2009-05-26

@spiff and Nicezia, please have a look at and reply to MediaPortal's and the Moving Pictures project developers' feedback here: http://forum.team-mediaportal.com/improvement-suggestions-46/suggestion-use-xbmcs-xml-scrapers-http-scraping-35312/

fforde Wrote:I can't speak for the MediaPortal guys, but on the Moving Pictures project, I spent a lot of time looking into the XBMC scraper system before we implemented our own generic Cornerstone Scraper Engine. I have not looked too closely at the new ScraperXML project (although I did take a peek and by the way it is written in Visual Basic, not C#).
But if it works similar to or is based on the older C++ scraper engine for XBMC it has a couple of problems.

For reference; fforde is the main developer of the "Moving Pictures" (movies) plugin for MediaPortal (a plugin used by basically all MediaPortal users, as MediaPortal itself has no native scraper engine): http://code.google.com/p/moving-pictures/

Quote:Moving Pictures 0.7.3

- Nicezia - 2009-05-26

I don't think I'm really qualified to defend the XBMC scraper format. However, every other scraper format I've seen seems to be limited by the need to have programming skills and, if not limited by the type of data that can be gathered by it, provisioned for only one type with no room for expansion and no possibility for drop-in usability... everyone's so proprietary and defensive about their formats. Personally, I started making this because I thought it'd be a good way to add an expandable, non-proprietary info scraper to media manager apps (like the one I'm creating myself) that requires the user only to know regular expressions, which one can learn in the span of an afternoon, and how to put together an xml. There's nothing out there at the moment that has that kind of flexibility; hell, I'm even adding to the already available library of things that this format can scrape. If they don't want flexibility and want to stick to what they know, that's fine with me; if no one else cares to use XBMC's scraper format, I still will, because I see the potential for expansion. I can offer what I have as a standard, but I can't force it on them. And honestly, if I did want to go to bat for XBMC's scraper format, pulling info from the filesystem would be as easy as writing a scraper to do so, if I'm understanding the format properly. In the case of running their queries or whatever it is they wanted to run, at least ScraperXML returns XML-formatted info on which they could run whatever kind of queries they wanted.
They just want to stick with their format, and they are going to be stubborn about it. Even in the case of MeediOS, they just want something to drop in that they don't have to manage; one person said clear as day that they really didn't want to have anything to do with managing the scrapers. So I see a little difficulty in bringing all the open-source HTPCs under the same roof as far as information gathering, but that isn't going to stop my quest to continue to expand on XBMC's scraper capabilities through ScraperXML, and maintain compatibility with XBMC's current scrapers... maybe when I'm done with XBMC scraper compatibility, I'll investigate their format and then add compatibility with it. Maybe the way to bring everyone under the same roof is to support all their proprietary systems in one library, and give the user a choice of how to update their information from online sources.

Quote: 1. The method of outputting results with the XBMC scripting engine is fairly cryptic. The script writer has to basically construct an XML document for the output. And what makes this even worse is this construction is embedded in an existing XML document, which means all special characters must be escaped. This dramatically complicates things, reducing the maintainability of existing scripts and making new scripts much more difficult to write.

XBMC's scraper format requires less of a learning curve than programming. If necessary, I can provide a program that I've written that builds a RegExp block (with expression, and the option to add all information from that block); all you have to do is write your info without worrying about escaping characters, and it spits out the RegExp. The options for both the RegExp block and the expression can be set by clicking their corresponding controls. It's what was originally going to be the scraper editor... I could probably even add predefined regular expressions, with descriptions, to make it easier.

Quote: 2.
The output of the search function is only a text based string and a url for the details page for the movie, tv show, etc. The string is not even consistent, sometimes it is title, sometimes it is title and year, sometimes it is official title, english title, then year. It's just not reliable. This would create difficulties with the auto approval system in Moving Pictures.

Not true, the output of the search function is always a url (or am I wrong)... the output of each function is consistent. I think his concern, like mine, is what is put into each function, and how to make a decision on which is the best match to the input. I can see how that would be a valid concern. But the output of each function is consistent:

CreateSearchUrl: <url>whatever-page-link</url> (I know some output without the <url> tags, but I think this should be standard for this function, and I have made it standard with all the new-type scrapers I'm working on)

GetSearchResults: <results><entity><title/><url/><id/><what/><ever/><else/><you/><want/></entity>........</results>

GetDetails: <details><all/><the/><details/><you/><could/><possibly/><gather/></details>

I don't know why they consider it would be hard to auto-accept a search result based on its <title> or <id> or <url> or <aired> or whatever else; the information is returned as XML, which means it's queryable, even sortable and comparable.

Quote: 3. The way the scraper engine is written, it's just not that extensible. What if someone wanted to add XML parsing tools to be provided to the scraper? What if someone wants to execute XSLT queries on a web page found on the internet? Or what about pulling details from the filesystem rather than a URL? These are not features I particularly care about, but my point is it would not be easy to add these features with the way the XBMC stuff is coded.
The core of the scraper engine would have to be modified, and this would bring risk to existing functionality, because everything as it is, is lumped together in a big class / file.

Valid arguments, I'd say, but then you can't possibly program to allow EVERYTHING to be done, or you run the risk of making it much too complicated for the COMMON user to work with. And let's face it... in the world of HTPC the user is a COMMON user; there are more non-programmers using these apps than there are programmers, and the majority of people, from my point of view, want a smaller learning curve for their HTPC so they spend less time getting it to do what they want and more time enjoying the fact that it does what they want.

- spiff - 2009-05-26

1) see nicezia's answer

2) the nice thing is all is xml driven. we can change anything; in fact i'm very much prepared to change what info is passed into the scraper etc. i haven't put much consideration into generality, since thus far the only thing i've had to worry about is xbmc. just write up what you consider a sane standard, and it will be considered. + since everything is xml, you can add attributes as you see fit without "hurting" other parsers which do not support them.

3) uhm, xbmc can scrape fine from local files. it's an url, not necessarily http. also i'd like to mention that the "big" class in question consists of approximately 500 lines of c++, of which 50 is comments and 200 is "stupid" initialization code etc. as for functionality breaking; again, it's xml. it's a very limited "language" and i have still not had one thing break on me, other than the scrapers themselves, which is part of the nature of web scrapers. that being said, they are free to do whatever they see fit.
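[Editor's note] On the escaping complaint in point 1 and the RegExp-builder tool described above: a minimal Python sketch of how such a helper can take a raw expression and output template and emit an escaped scraper <RegExp> block. This is illustrative only; ScraperXML is a .NET library, and `build_regexp_block` is a hypothetical name, not its API.

```python
from xml.sax.saxutils import escape

def build_regexp_block(dest, input_buf, output_template, expression):
    """Hypothetical helper like the editor tool described above: the
    author writes the output template and regular expression without
    worrying about XML special characters; this escapes them and wraps
    everything in a scraper <RegExp> block."""
    return (
        f'<RegExp input="{input_buf}" output="{escape(output_template)}" dest="{dest}">'
        f'<expression>{escape(expression)}</expression>'
        f'</RegExp>'
    )

# The author writes plain markup; the tool handles the escaping.
block = build_regexp_block("5", "$$1", "<title>\\1</title>",
                           r'<h1 class="header">(.*?)</h1>')
```

The point is the same one made in the thread: the author only needs regular expressions, while the tool guarantees the block stays well-formed inside the surrounding scraper XML.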
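[Editor's note] On point 2: because GetSearchResults returns the `<results><entity>...</entity></results>` XML shown above, a front end can query it and auto-accept an exact match. A small Python sketch (the element names follow the thread's example; `auto_accept` is a hypothetical function, not part of any of the projects discussed):

```python
import xml.etree.ElementTree as ET

# Example of the <results> shape described in the thread.
results_xml = """
<results>
  <entity><title>Blade Runner</title><url>http://example.org/tt0083658</url><id>tt0083658</id></entity>
  <entity><title>Blade Runner 2049</title><url>http://example.org/tt1856101</url><id>tt1856101</id></entity>
</results>
"""

def auto_accept(results_xml, wanted_title):
    """Return the <id> of the entity whose <title> exactly matches
    the query (case-insensitive), or None if nothing matches."""
    root = ET.fromstring(results_xml)
    for entity in root.findall("entity"):
        if entity.findtext("title", "").lower() == wanted_title.lower():
            return entity.findtext("id")
    return None
```

Because the result set is plain XML, the same approach extends to sorting or scoring on any other child element (`<year>`, `<aired>`, etc.).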
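[Editor's note] spiff's point 3 - that the engine sees a URL which need not be http - can be sketched like this in Python (illustrative; `fetch` and `http_get` are hypothetical names, not XBMC's actual code):

```python
from urllib.parse import urlparse

def fetch(url, http_get=None):
    """Resolve a scraper URL: a file:// URL reads the local filesystem
    directly, anything else is delegated to a host-supplied HTTP
    callback. This is how one engine entry point can serve both
    web scraping and local-file scraping."""
    parsed = urlparse(url)
    if parsed.scheme == "file":
        with open(parsed.path, encoding="utf-8") as f:
            return f.read()
    if http_get is None:
        raise ValueError("no HTTP fetcher supplied for " + url)
    return http_get(url)
```

The scraper XML itself never changes; only the URL scheme decides where the bytes come from.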
- Nicezia - 2009-05-27

Alright, I really need a roadmap as to what goes into the buffers during details by default; I'm still a little lost in the C++ code. All I know for sure is that:

Code: CreateSearchURL -- CreateAlbumSearch -- CreateArtistSearch

Code: GetSearchResults -- GetAlbumSearchResults -- GetArtistSearchResults

Code: GetDetails -- GetEpisodeDetails -- GetAlbumDetails -- GetArtistDetails

As soon as I know for sure what's supposed to go in the buffers as a certainty, I can account for it rather than making guesses based on what the scrapers are looking for.
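[Editor's note] For reference, the GetDetails defaults Nicezia described earlier in the thread ($$1 = the downloaded page, $$2 = the entity's id if present, $$3 = its url) can be sketched as follows. Python for illustration; `fill_details_buffers` is a hypothetical name, and the real ScraperXML implementation may differ.

```python
def fill_details_buffers(entity, download):
    """Default buffer setup before running GetDetails, per the thread:
    $$1 = the downloaded details page, $$2 = the entity's id (if any),
    $$3 = the entity's url (applied to relative cast/poster links)."""
    buffers = {1: download(entity["url"])}
    if "id" in entity:
        buffers[2] = entity["id"]
    buffers[3] = entity["url"]
    return buffers

# Example with a stubbed download function instead of a real HTTP fetch:
entity = {"id": "tt0083658", "url": "http://www.imdb.com/title/tt0083658/"}
buffers = fill_details_buffers(entity, lambda url: "<html>stub</html>")
```

Whether this blanket setup is enough for every scraper type is exactly the open question in the post above.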
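[Editor's note] The three function families listed above can be viewed as one table keyed by media type. The function names come from the thread; the grouping and the `details_function` helper are an editor's illustration, not ScraperXML's actual API.

```python
# Hypothetical grouping of the scraper entry points listed above.
SCRAPER_FUNCTIONS = {
    "movie":  ("CreateSearchURL", "GetSearchResults", "GetDetails"),
    "tvshow": ("CreateSearchURL", "GetSearchResults", "GetDetails", "GetEpisodeDetails"),
    "album":  ("CreateAlbumSearch", "GetAlbumSearchResults", "GetAlbumDetails"),
    "artist": ("CreateArtistSearch", "GetArtistSearchResults", "GetArtistDetails"),
}

def details_function(media_type):
    """Return the main details-stage function name for a media type."""
    for name in SCRAPER_FUNCTIONS[media_type]:
        if name.endswith("Details") and not name.startswith("GetEpisode"):
            return name
    raise KeyError(media_type)
```

A table like this is one way a host library could decide which function to run at each stage (search URL, search results, details) without hard-coding per-type logic.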