ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... - Printable Version

+- Kodi Community Forum (https://forum.kodi.tv)
+-- Forum: Development (https://forum.kodi.tv/forumdisplay.php?fid=32)
+--- Forum: Scrapers (https://forum.kodi.tv/forumdisplay.php?fid=60)
+--- Thread: ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... (/showthread.php?tid=50055)
- Nicezia - 2009-05-24

I guess before I post I'm going to find some way to implement nfoUrl and getEpisodeGuide (seeing as each of those is dependent upon finding the right scraper to use to retrieve the info - currently the scraperParser runs through all processes and returns ALL the info as a complete xml - but I want to be able to do it on demand like XBMC does). I guess I'm going to have to make an object to keep track of all the available scrapers, and be able to run the url against each.

XBMC's Behaviour? - Nicezia - 2009-05-24

How long does XBMC hold the scraper info in cache?

- Nicezia - 2009-05-24

Gamester17 Wrote:@Nicezia, not sure if it is something you want in your ScraperXML library or not but FYI; changeset r20559 commits "Ability to scrape and scan TV Shows into the video library by air-date via TheTVDB.com" to XBMC scraper code and TheTVDB XML scraper, see patch #5143 on trac for more information.

Supported now! - spiff - 2009-05-24

we clear the cache on the start of each scrape, i.e. before calling CreateSearchUrl

- Nicezia - 2009-05-25

Just about got this thing wrapped up (it being a holiday weekend kinda took a little of my time). But all official XBMC scraper support is wrapped up; I just need to write some kind of interface to NfoUrl and EpisodeGuideUrl, and it's in the bag.

- Nicezia - 2009-05-26

Okay, I am sure this is the last question I need to ask. Almost every function in the IMDb scraper is looking for a value in $$2. Is there a standard bit of info that's supposed to go there for each function, or is this dependent upon the scraper? I know for GetSearchResults it's looking for the IMDb id in $$2, and in GetDetails it's looking for the url. The question is: is this what EVERY movie scraper should have in $$2 at those times?

- mkortstiege - 2009-05-26

$$2 holds the year for movies (if available) and artist for music scrapers.

- Nicezia - 2009-05-26

vdrfan Wrote:$$2 holds the year for movies (if available) and artist for music scrapers.
Ok, I know that's the case for CreateSearchUrl and GetSearchResults. After GetSearchResults is run, the buffers are cleared in the IMDb scraper. When GetDetails runs it searches for the html as a string in $$1, the id in $$2, and the actual webpage address in $$3 (as it applies it to the links for cast and posters, although the GetDetails function does not put that info in there by use of RegExp). I just need to know the default info sent to the buffers during GetDetails (or how the default information automatically sent to the buffers is decided) for each scraper type.

- mkortstiege - 2009-05-26

That's why cptspiff is mr. scraper

- Nicezia - 2009-05-26

Here's what I have it doing by default: the entire entity of the selected result is sent back to the GetDetails function, the buffers are cleared and the webpage downloaded to $$1, the id (if it exists in the entity element) is set as $$2, and the url (also from the entity element) is set as $$3. I just need to know if that's a blanket solution for the GetDetails function or if I need to program some routine to make decisions on other stuff to put into the buffers. How's about it, Cpt. Spiff!?

- spiff - 2009-05-26

gotta run now, have a look at the relevant functions in IMDB.cpp. if that isn't enough i'll give a summary later

Getting MediaPortal developers involved with ScraperXML library to use XBMC scrapers - Gamester17 - 2009-05-26

@spiff and Nicezia, please have a look at and reply to MediaPortal's and the Moving Pictures project developers' feedback here: http://forum.team-mediaportal.com/improvement-suggestions-46/suggestion-use-xbmcs-xml-scrapers-http-scraping-35312/

fforde Wrote:I can't speak for the MediaPortal guys, but on the Moving Pictures project, I spent a lot of time looking into the XBMC scraper system before we implemented our own generic Cornerstone Scraper Engine. I have not looked too closely at the new ScraperXML project (although I did take a peek and by the way it is written in Visual Basic, not C#).
But if it works similar to or is based on the older C++ scraper engine for XBMC it has a couple of problems.

For reference; fforde is the main developer of the "Moving Pictures" (movies) plugin for MediaPortal (a plugin used by basically all MediaPortal users, as MediaPortal itself has no native scraper engine): http://code.google.com/p/moving-pictures/

Quote:Moving Pictures 0.7.3

- Nicezia - 2009-05-26

I don't think I'm really qualified to defend the XBMC scraper format. However, every other scraper format I've seen seems to be limited by the need to have programming skills and, if not limited by the type of data that can be gathered by it, provisioned for only one type with no room for expansion and no possibility for drop-in usability... everyone's so proprietary and defensive about their formats. Personally, I started making this because I thought it'd be a good way to add an expandable, non-proprietary info scraper to media manager apps (like the one I'm creating myself) that requires the user only to know regular expressions, which one can learn in the span of an afternoon, and how to put together an xml. There's nothing out there at the moment that has that kind of flexibility; hell, I'm even adding to the already available library of things that this format can scrape. If they don't want flexibility and want to stick to what they know, that's fine with me; if no one else cares to use XBMC's scraper format, I still will, because I see the potential for expansion. I can offer what I have as a standard, but I can't force it on them. And honestly, if I did want to go to bat for XBMC's scraper format, pulling info from the filesystem would be as easy as writing a scraper to do so, if I'm understanding the format properly. In the case of running their queries or whatever it is they wanted to run, at least ScraperXML returns XML-formatted info on which they could run whatever kind of queries they wanted.
They just want to stick with their format, and they are going to be stubborn about it. Even in the case of MeediOS, they just want something to drop in that they don't have to manage; one person said clear as day that they really didn't want to have anything to do with managing the scrapers. So I see a little difficulty in bringing all the open-source HTPCs under the same roof as far as information gathering, but that isn't going to stop my quest to continue to expand on XBMC's scraper capabilities through ScraperXML, and maintain compatibility with XBMC's current scrapers... maybe when I'm done with XBMC scraper compatibility, I'll investigate their format and then add compatibility with it. Maybe the way to bring everyone under the same roof is to support all their proprietary systems in one library, and give the user a choice of how to update their information from online sources.

Quote: 1. The method of outputting results with the XBMC scripting engine is fairly cryptic. The script writer has to basically construct an XML document for the output. And what makes this even worse is this construction is embedded in an existing XML document, which means all special characters must be escaped. This dramatically complicates things, reducing the maintainability of existing scripts and making new scripts much more difficult to write.

XBMC's scraper format requires less of a learning curve than programming. If necessary, I can provide a program that I've written that builds a RegExp block (with expression, and the option to add all information from that block); all you have to do is write your info without worrying about escaping characters, and it spits out the RegExp. The options for both the RegExp block and the expression can be set by clicking their corresponding controls. It's what was originally going to be the scraper editor... I could probably even add predefined regular expressions, with descriptions, to make it easier.

Quote: 2.
The output of the search function is only a text based string and a url for the details page for the movie, tv show, etc. The string is not even consistent, sometimes it is title, sometimes it is title and year, sometimes it is official title, english title, then year. It's just not reliable. This would create difficulties with the auto approval system in Moving Pictures.

Not true, the output of the search function is always a url (or am I wrong)... the output of each function is consistent. I think his concern, like mine, is what is put into each function, and how to make a decision on which is the best match to the input. I can see how that would be a valid concern. But the output of each function is consistent:

CreateSearchUrl: <url>whatever-page-link</url> (I know some output without the <url> tags, but I think this should be standard for this function, and I have made it standard with all the new-type scrapers I'm working on)

GetSearchResults: <results><entity><title/><url/><id/><what/><ever/><else/><you/><want/></entity>........</results>

GetDetails: <details><all/><the/><details/><you/><could/><possibly/><gather/></details>

I don't know why they consider it would be hard to auto-accept a search result based on its <title> or <id> or <url> or <aired> or whatever else; the information is returned as XML, which means it's queryable, even sortable and comparable.

Quote: 3. The way the scraper engine is written, it's just not that extensible. What if someone wanted to add XML parsing tools to be provided to the scraper? What if someone wants to execute XSLT queries on a web page found on the internet? Or what about pulling details from the filesystem rather than a URL? These are not features I particularly care about, but my point is it would not be easy to add these features with the way the XBMC stuff is coded.
The core of the scraper engine would have to be modified, and this would bring risk to existing functionality, because everything as it is, is lumped together in a big class / file.

Valid arguments, I'd say, but then you can't possibly program to allow EVERYTHING to be done, or you run the risk of making it much too complicated for the COMMON user to work with. And let's face it... in the world of HTPC the user is a COMMON user; there are more non-programmers using these apps than there are programmers, and the majority of people, from my point of view, want a smaller learning curve for their HTPC so they spend less time getting it to do what they want and more time enjoying the fact that it does what they want.

- spiff - 2009-05-26

1) see nicezia's answer

2) the nice thing is all is xml driven. we can change anything; in fact i'm very much prepared to change what info is passed into the scraper etc. i haven't put much consideration into generality, since thus far the only thing i've had to worry about is xbmc. just write up what you consider a sane standard, and it will be considered. + since everything is xml, you can add attributes as you see fit without "hurting" other parsers which do not support them.

3) uhm, xbmc can scrape fine from local files. it's an url, not necessarily http. also i'd like to mention that the "big" class in question consists of approximately 500 lines of c++, of which 50 is comments and 200 is "stupid" initialization code etc. as for functionality breaking; again, it's xml. it's a very limited "language" and i have still not had one thing break on me, other than the scrapers themselves, which is part of the nature of web scrapers. that being said, they are free to do whatever they see fit.
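[Editor's note] On the escaping complaint in point 1 and the RegExp-builder tool described above: a minimal Python sketch of how such a helper can take a raw expression and output template and emit an escaped scraper <RegExp> block. This is illustrative only; ScraperXML is a .NET library, and `build_regexp_block` is a hypothetical name, not its API.

```python
from xml.sax.saxutils import escape

def build_regexp_block(dest, input_buf, output_template, expression):
    """Hypothetical helper like the editor tool described above: the
    author writes the output template and regular expression without
    worrying about XML special characters; this escapes them and wraps
    everything in a scraper <RegExp> block."""
    return (
        f'<RegExp input="{input_buf}" output="{escape(output_template)}" dest="{dest}">'
        f'<expression>{escape(expression)}</expression>'
        f'</RegExp>'
    )

# The author writes plain markup; the tool handles the escaping.
block = build_regexp_block("5", "$$1", "<title>\\1</title>",
                           r'<h1 class="header">(.*?)</h1>')
```

The point is the same one made in the thread: the author only needs regular expressions, while the tool guarantees the block stays well-formed inside the surrounding scraper XML.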
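[Editor's note] On point 2: because GetSearchResults returns the `<results><entity>...</entity></results>` XML shown above, a front end can query it and auto-accept an exact match. A small Python sketch (the element names follow the thread's example; `auto_accept` is a hypothetical function, not part of any of the projects discussed):

```python
import xml.etree.ElementTree as ET

# Example of the <results> shape described in the thread.
results_xml = """
<results>
  <entity><title>Blade Runner</title><url>http://example.org/tt0083658</url><id>tt0083658</id></entity>
  <entity><title>Blade Runner 2049</title><url>http://example.org/tt1856101</url><id>tt1856101</id></entity>
</results>
"""

def auto_accept(results_xml, wanted_title):
    """Return the <id> of the entity whose <title> exactly matches
    the query (case-insensitive), or None if nothing matches."""
    root = ET.fromstring(results_xml)
    for entity in root.findall("entity"):
        if entity.findtext("title", "").lower() == wanted_title.lower():
            return entity.findtext("id")
    return None
```

Because the result set is plain XML, the same approach extends to sorting or scoring on any other child element (`<year>`, `<aired>`, etc.).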
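[Editor's note] spiff's point 3 - that the engine sees a URL which need not be http - can be sketched like this in Python (illustrative; `fetch` and `http_get` are hypothetical names, not XBMC's actual code):

```python
from urllib.parse import urlparse

def fetch(url, http_get=None):
    """Resolve a scraper URL: a file:// URL reads the local filesystem
    directly, anything else is delegated to a host-supplied HTTP
    callback. This is how one engine entry point can serve both
    web scraping and local-file scraping."""
    parsed = urlparse(url)
    if parsed.scheme == "file":
        with open(parsed.path, encoding="utf-8") as f:
            return f.read()
    if http_get is None:
        raise ValueError("no HTTP fetcher supplied for " + url)
    return http_get(url)
```

The scraper XML itself never changes; only the URL scheme decides where the bytes come from.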
- Nicezia - 2009-05-27

Alright, I really need a roadmap as to what goes into the buffers during details by default; I'm still a little lost in the C++ code. All I know for sure is that:

Code: CreateSearchURL -- CreateAlbumSearch -- CreateArtistSearch

Code: GetSearchResults -- GetAlbumSearchResults -- GetArtistSearchResults

Code: GetDetails -- GetEpisodeDetails -- GetAlbumDetails -- GetArtistDetails

As soon as I know for sure what's supposed to go in the buffers as a certainty, I can account for it rather than making guesses based on what the scrapers are looking for.
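[Editor's note] For reference, the GetDetails defaults Nicezia described earlier in the thread ($$1 = the downloaded page, $$2 = the entity's id if present, $$3 = its url) can be sketched as follows. Python for illustration; `fill_details_buffers` is a hypothetical name, and the real ScraperXML implementation may differ.

```python
def fill_details_buffers(entity, download):
    """Default buffer setup before running GetDetails, per the thread:
    $$1 = the downloaded details page, $$2 = the entity's id (if any),
    $$3 = the entity's url (applied to relative cast/poster links)."""
    buffers = {1: download(entity["url"])}
    if "id" in entity:
        buffers[2] = entity["id"]
    buffers[3] = entity["url"]
    return buffers

# Example with a stubbed download function instead of a real HTTP fetch:
entity = {"id": "tt0083658", "url": "http://www.imdb.com/title/tt0083658/"}
buffers = fill_details_buffers(entity, lambda url: "<html>stub</html>")
```

Whether this blanket setup is enough for every scraper type is exactly the open question in the post above.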
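[Editor's note] The three function families listed above can be viewed as one table keyed by media type. The function names come from the thread; the grouping and the `details_function` helper are an editor's illustration, not ScraperXML's actual API.

```python
# Hypothetical grouping of the scraper entry points listed above.
SCRAPER_FUNCTIONS = {
    "movie":  ("CreateSearchURL", "GetSearchResults", "GetDetails"),
    "tvshow": ("CreateSearchURL", "GetSearchResults", "GetDetails", "GetEpisodeDetails"),
    "album":  ("CreateAlbumSearch", "GetAlbumSearchResults", "GetAlbumDetails"),
    "artist": ("CreateArtistSearch", "GetArtistSearchResults", "GetArtistDetails"),
}

def details_function(media_type):
    """Return the main details-stage function name for a media type."""
    for name in SCRAPER_FUNCTIONS[media_type]:
        if name.endswith("Details") and not name.startswith("GetEpisode"):
            return name
    raise KeyError(media_type)
```

A table like this is one way a host library could decide which function to run at each stage (search URL, search results, details) without hard-coding per-type logic.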