Kodi Community Forum
ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... - Printable Version

+- Kodi Community Forum (https://forum.kodi.tv)
+-- Forum: Development (https://forum.kodi.tv/forumdisplay.php?fid=32)
+--- Forum: Scrapers (https://forum.kodi.tv/forumdisplay.php?fid=60)
+--- Thread: ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... (/showthread.php?tid=50055)



- spiff - 2009-05-27

xbmc/utils/MusicInfoScraper.cpp; ::FindAlbumInfo;
CreateAlbumSearchUrl: $$1 = album title, $$2 = artist title

xbmc/utils/MusicInfoScraper.cpp, ::FindArtistInfo
CreateArtistSearchUrl: $$1 = artist title

xbmc/utils/IMDB.cpp, ::InternalFindMovie:
GetSearchResults: $$1 html, $$2 url of that html

xbmc/utils/MusicInfoScraper, ::FindAlbumInfo
GetAlbumSearchResults: $$1 html

xbmc/utils/MusicInfoScraper, ::FindArtistInfo
GetArtistSearchResults: $$1 html

xbmc/utils/IMDB.cpp, ::InternalGetDetails
GetEpisodeDetails, GetDetails: $$1 html, $$2 entity id (usually imdb id), $$3 url to html

xbmc/utils/MusicAlbumInfo.cpp, ::Load
GetAlbumDetails: $$1 html

xbmc/utils/MusicArtistInfo.cpp, ::Load
GetArtistDetails: $$1 html, $$2 the url-encoded version of the artist name, i.e. what we searched for

now this is a mess and obviously needs some standardization work. open to suggestions, but everything that is currently passed to the scrapers is there because some scraper uses it. so we cannot cut anything out, but we can surely feed more to the others to make it more consistent.
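For readers unfamiliar with the $$N notation in the list above: each function receives its arguments in numbered buffers, and the engine splices them into the scraper's XML expressions. A rough sketch of that substitution step (a hypothetical helper for illustration, not XBMC's actual code):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the $$N buffer substitution described above:
// each numbered buffer is spliced into the expression wherever its
// token appears. Substituting higher numbers first avoids $$1
// swallowing the prefix of $$10 and up.
std::string SubstituteBuffers(std::string expr,
                              const std::vector<std::string>& buffers)
{
    for (size_t i = buffers.size(); i > 0; --i)
    {
        const std::string token = "$$" + std::to_string(i);
        size_t pos = 0;
        while ((pos = expr.find(token, pos)) != std::string::npos)
        {
            expr.replace(pos, token.size(), buffers[i - 1]);
            pos += buffers[i - 1].size();
        }
    }
    return expr;
}
```

So for CreateAlbumSearchUrl, something like `SubstituteBuffers("...album=$$1&artist=$$2", {"Dark+Side+Of+The+Moon", "Pink+Floyd"})` would fill in the album and artist titles (the URL shape here is made up).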


- Nicezia - 2009-05-27

spiff Wrote:xbmc/utils/MusicInfoScraper.cpp; ::FindAlbumInfo;
CreateAlbumSearchUrl: $$1 = album title, $$2 = artist title

xbmc/utils/MusicInfoScraper.cpp, ::FindArtistInfo
CreateArtistSearchUrl: $$1 = artist title

xbmc/utils/IMDB.cpp, ::InternalFindMovie:
GetSearchResults: $$1 html, $$2 url of that html

xbmc/utils/MusicInfoScraper, ::FindAlbumInfo
GetAlbumSearchResults: $$1 html

xbmc/utils/MusicInfoScraper, ::FindArtistInfo
GetArtistSearchResults: $$1 html

xbmc/utils/IMDB.cpp, ::InternalGetDetails
GetEpisodeDetails, GetDetails: $$1 html, $$2 entity id (usually imdb id), $$3 url to html

xbmc/utils/MusicAlbumInfo.cpp, ::Load
GetAlbumDetails: $$1 html

xbmc/utils/MusicArtistInfo.cpp, ::Load
GetArtistDetails: $$1 html, $$2 the url-encoded version of the artist name, i.e. what we searched for

now this is a mess and obviously needs some standardization work. open to suggestions, but everything that is currently passed to the scrapers is there because some scraper uses it. so we cannot cut anything out, but we can surely feed more to the others to make it more consistent.

why not standardize what goes where? or send the query info to the buffers as xml, for instance, sending back the info from GetSearchResults as a whole into $$2:

<entity>
<title/>
<id/>
<url/>
</entity>

that supplies GetDetails with all the info it's looking for. the html could go into $$1 as per usual, but each function could have its own place in $$2 for passed values, and any values a user wanted to carry over from the search page could be passed and stored as well.

now i realize this strategy requires adding an extra nested RegExp to isolate information such as the url used to create the custom function calls in the IMDB scraper. however, one always knows that the url, the artist name, the album title, or the url-encoded search string will be in buffer $$2, and one just needs to know the tag... merely a suggestion.
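The nested-RegExp lookup described above amounts to pulling one tagged field out of the $$2 buffer. In code terms it looks something like this (an illustrative C++ helper rather than scraper XML; the tag names follow the <entity> sketch above):

```cpp
#include <regex>
#include <string>

// Illustrative only: extract the text of one tag (e.g. <url>) from the
// proposed <entity> XML held in buffer $$2. In a real scraper this
// would be a nested <RegExp> in the scraper XML, not compiled code.
std::string ExtractTag(const std::string& buffer, const std::string& tag)
{
    std::regex pattern("<" + tag + ">([^<]*)</" + tag + ">");
    std::smatch match;
    if (std::regex_search(buffer, match, pattern))
        return match[1].str();
    return "";
}
```

Then `ExtractTag(buffer2, "url")` yields the search-result url, `ExtractTag(buffer2, "title")` the title, and so on, regardless of which function filled the buffer.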


- spiff - 2009-05-27

and a good one at that. makes perfect sense!


- Nicezia - 2009-05-27

so would the same be done for the create search functions? for instance
CreateAlbumSearchUrl: $$2 =
Code:
<search>
    <album>Dark+Side+Of+The+Moon</album>
    <artist>Pink+Floyd</artist>
</search>

Then GetAlbumSearchResults: $$2 =
Code:
<searchpage>
   <url post="yes">http://www.allmusic.com/cg/amg.dll?P=amg&amp;SQL=Dark+Side+Of+The+Moon&amp;OPT1=2</url>
</searchpage>

It would be perfectly fine to leave those functions as is, of course, but why not a blanket standard for all functions, leaving no question as to where the info is going to be? In this way $$1 would ALWAYS be the string data gathered from the url; in the case of creating the search, the input field would just move from $$1 to $$2 for all info. $$1 would always be the data to search (be it filesystem or html link) and $$2 would always be info passed from XBMC.

(which is partially what the argument against using the XBMC scrapers as the standard is; though i am not suggesting we cater to those who are against it, it just makes everything more... uniform)


- Nicezia - 2009-05-27

oh, and it would definitely make my game scrapers for Allgame.com easier to design too :)

so far it's an almost direct copy of the Music Scraper for Allmusic.

The functions (if you're interested) are as follows

CreateGameSearchUrl
GetGameSearchResults
GetGameDetails
GetReview

CreatePlatformSearchUrl
GetPlatformSearchResults
GetPlatformDetails
GetGameList

(customFunctions)
GetFanart (not sure this exists yet, but with HTbackdrops looking for game art, it's soon to be available)
GetThumbs (not yet necessary, but i'm going to find a page with higher-resolution thumbs to get covers from)
GetTrailer

so far this is what gets passed to the functions
CreateGameSearchUrl: $$1 game title, $$2 platform (i would also like to pass the year to this function; standard xml in $$2 would be nice)
GetGameSearchResults: $$1 html, $$2 the url of the html
GetGameDetails: $$1 html (i would like to pass all the previously gathered info from the search results on to the fanart search and trailer search)

the info for games is as such

Code:
<details>
    <title/>
    <platform/>
    <year/>
    <releasedate/>
    <review/>
    <developer/>
    <genre/>
    <genre/>
    ......
    <thumb/>
    <publisher/>
    <controlscheme/>
    <warnings/>
    <esrb/>
    <id/>
    <otherplatforms/>
    <screenshots>
        <thumb/>
        <thumb/>
        .....
    </screenshots>
    <fanart url="">
        <thumb/>
        <thumb/>
        ....
    </fanart>
</details>

for platform

Code:
<details>
    <title/>
    <id/>
    <releasedate/>
    <developer/>
    <year/>
    <review/>
    <gamelist>
        <title/>
        <title/>
        ......
    </gamelist>
    <thumb/>
</details>



- spiff - 2009-05-27

there is already a scraper for allgame sitting in the gamelibrary branch


- Nicezia - 2009-05-28

ah, i haven't been down that branch. now that i look at it, it doesn't get any platform information; i'll build off its format, but i also want to be able to get information about the program's platform.

who is the best person to talk to about that branch?


- spiff - 2009-05-28

leo2


- Gamester17 - 2009-05-28

http://trac.xbmc.org/browser/branches/gamelibrary/system/scrapers/programs/allgame.xml

Code:
svn checkout https://xbmc.svn.sourceforge.net/svnroot/xbmc/branches/gamelibrary

For more information see:
http://forum.xbmc.org/showthread.php?tid=40715
and:
http://forum.xbmc.org/showthread.php?tid=28953
also:
http://forum.xbmc.org/showthread.php?tid=43766

;)


Release 3.6 on Sourceforge - Nicezia - 2009-05-29

All XBMC scrapers officially supported (even if the workarounds for setting buffers are a bit painful).

One slight problem i haven't been able to work out yet with the AllMusic scraper: though it finds and sets the links properly for the biography and discography of an artist, when the page downloads it's the generic front page for artist search, with no artist info at all.

copying and pasting the link into firefox has the same result (the scraper doesn't set it as a post request, and even in the page it's not sent as post); clicking the same link on the artist's page calls up the proper page. this one is so far beyond me, but i'm still working on it.

another problem (that i have implemented a temporary fix for) is that some results yield an "&" inside the returned values and are therefore unparseable with a validating XML parser such as the XML DOM or Linq. so far i'm just doing a search and replace on the ampersand before parsing the string values as XML, which i consider a really nasty workaround, as it could mangle other entities (a situation i haven't run into yet, since i only use this workaround when parsing URLs for download).
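A slightly less nasty version of that workaround is to escape only the ampersands that do not already begin an entity or character reference. This is just a sketch of the idea, not the library's actual code:

```cpp
#include <regex>
#include <string>

// Sketch: escape a bare '&' only when it is not already the start of a
// predefined XML entity or a numeric character reference, so that
// values like "&amp;" or "&#38;" are left untouched.
std::string EscapeBareAmpersands(const std::string& in)
{
    static const std::regex entity(
        "(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);");
    std::string out;
    for (size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] != '&')
        {
            out += in[i];
            continue;
        }
        std::smatch m;
        const std::string rest = in.substr(i + 1);
        if (std::regex_search(rest, m, entity,
                              std::regex_constants::match_continuous))
            out += '&';        // already a valid reference, keep as-is
        else
            out += "&amp;";    // bare ampersand, escape it
    }
    return out;
}
```

That way an already-escaped value round-trips unchanged instead of turning into `&amp;amp;`.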

My actual parsing functions for the scrapers are pretty solid; however, i think the reason i have to implement these nasty workarounds is some incorrectly implemented or missing utility functions that XBMC uses. so i'm going to spend the weekend brushing up on C++ and see if i can't follow the flow of XBMC's scraper process (maybe even run it through the debugger and step through process by process until i figure out exactly where my utility code differs from XBMC's).

Not exactly happy with this release, as it's got some really nasty coding in it that i couldn't avoid til i figure out a better way. thought i'd throw it up to see if anyone else could help figure it out (though no one else seems interested in helping with this project).


Lmfao - Nicezia - 2009-05-30

I finally boned up enough on my C++ to be able to follow the flow of XBMC's ScraperParser, and now i've run into 3 curiosities:

  1. A reference to a "clear" element (in the RegExp element - i know there's a clear attribute on the expression, but i've never seen a clear element used)
  2. A reference to an "optional" attribute (on the expression element)
  3. A reference to a "compare" attribute (on the expression element)

And

Correct me if i'm wrong, but it looks to me from the source code that clean cleans the buffers, not the regular expression match strings.

that might be my whole problem, as i'm cleaning the matches before applying them to the output instead of cleaning the input before matching the regular expression.
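In code terms the difference looks something like this (toy helpers, not XBMC's actual functions): cleaning the whole buffer first lets a simple expression match across text that used to be split by markup.

```cpp
#include <regex>
#include <string>

// Toy illustration of the order of operations: the clean pass strips
// markup from the whole input buffer, and the expression is matched
// against the cleaned text afterwards. Matching first and then
// cleaning the captures would miss text interrupted by tags.
std::string CleanBuffer(const std::string& html)
{
    return std::regex_replace(html, std::regex("<[^>]*>"), "");
}

std::string MatchTitle(const std::string& cleaned)
{
    std::smatch m;
    if (std::regex_search(cleaned, m, std::regex("Title: (.*)")))
        return m[1].str();
    return "";
}
```

For example, `MatchTitle(CleanBuffer("Title: <b>Dark Side</b> of the Moon"))` captures the full title, while matching before cleaning would have to cope with the `<b>` tags inside the capture.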

And one last thing: does the trim option remove ALL whitespace, leading & trailing whitespace, just leading whitespace, or just trailing whitespace?

Don't answer just yet, i'm still examining the XBMC source, so i might figure out the answer myself.


Allright everything figured out now - Nicezia - 2009-05-31

Version 4.0 should be up sometime before midnight. i've figured everything out now and just finished testing all the changes; as soon as i optimize the code and add logging, all XBMC features and everything that XBMC scrapes will be fully supported. of course this, too, is not a drop-in solution for the older versions.


- DonJ - 2009-05-31

Good to hear!


- Nicezia - 2009-05-31

Good to actually be at this point finally. the scrapers-for-dummies tutorial was a little misleading as to when and where things happen, and a little bit wrong on what some attributes actually do (though close enough that a working scraper can be made by following it), but once i got to the point where i could read the XBMC source code, it all became clear.

Next step is letting other people mess around with it, make suggestions, and point out bugs.

I've tried to make it as easy as possible to deal with: each type of scraper takes only two commands, one to retrieve the search results and one to retrieve the details (save for the tv scrapers, which take three: two to get info about the series and one to get episode details, with an optional episode-list update to retrieve new episode info).

Later I'm going to add a relevance feature so that results can be auto-selected, but i need a minute's break from this library :)

I'm looking forward to suggestions and criticism, and i'm getting started on a scraper tester tomorrow (though that's partially done in the program i developed to test the library), for both .net and mono.


- Nicezia - 2009-05-31

Still one question about the trim option: i still can't figure out if it removes ALL whitespace, just leading, just trailing, or leading and trailing spaces.
most of my results look okay, but i still am not sure i have the function right; for the moment i have it removing leading and trailing whitespace.
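A minimal version of that leading-and-trailing behavior looks like this (a sketch of the interpretation described above, not XBMC's actual implementation):

```cpp
#include <string>

// Minimal sketch: remove leading and trailing whitespace only,
// leaving interior spacing untouched.
std::string Trim(const std::string& s)
{
    const std::string ws = " \t\r\n";
    const size_t begin = s.find_first_not_of(ws);
    if (begin == std::string::npos)
        return "";
    const size_t end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}
```

If trim turned out to mean collapsing ALL whitespace instead, interior spaces would also have to be handled, so the two readings give visibly different results on strings like "Pink  Floyd".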