ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work...
+--- Thread: ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... (/showthread.php?tid=50055)
- spiff - 2009-05-16 00:37
1) remove html tags
2) replace html entities such as &amp; etc
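A minimal sketch of those two steps, written in Python purely for illustration (the library discussed in this thread is C#, and XBMC's own implementation is not shown here). The name `clean_text` and the simple tag pattern are assumptions, not anything taken from the XBMC source:

```python
import html
import re

# Assumed, deliberately simple pattern: anything between '<' and '>' counts as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Step 1: strip HTML tags. Step 2: decode entities such as &amp;."""
    without_tags = TAG_RE.sub("", raw)
    return html.unescape(without_tags).strip()


print(clean_text("<b>Tom &amp; Jerry</b>"))
```

Doing the tag removal first means that text which only looks like a tag after decoding (for example `&lt;b&gt;`) is kept rather than stripped.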
- Nicezia - 2009-05-16 00:56
Yeah, I found those functions in the XBMC source and fixed that, then realized that it was replacing with the buffers from before the custom functions started running, instead of the copied buffer state.
- Nicezia - 2009-05-16 09:50
OK, there's some logic in the IMDB scraper's custom functions that I must admit I don't understand. Everything starts out fine: it's appending all the information to $$5. Then all of a sudden it clears $$5 and puts new information over it, which loses all the previous information that was being chained together to create one XML element (details). That would be fine if I were processing the information individually as it came, but if I do that I get repeated information in the output. So I tried letting it go through all the custom functions and then adding the data to what I already have (the non-custom-function references), and once we get to GetIMPALink it all gets cleared out by the data that function gets.
So what happens is I start out getting all the information as one big details element, then when it gets to GetIMPALink it deletes everything it was storing and replaces it with the IMPALink. So I'm guessing I have to catch the information after every custom function has run and sort through it to see whether I already have it in my details before adding it, but I feel the need to ask the reason for this. Since you already started to chain the information to be sent back to the <details> element, why all of a sudden delete it all and start over? Why not either continue chaining it or send the pieces back individually, one or the other?
not questioning your programming, just trying to find out if there's some reason for this that i don't understand (i don't see a problem in coding around this behavior though)
- spiff - 2009-05-16 12:44
this perhaps weird logic is due to this;
after each function has returned we call videoinfotag.load() in xbmc. this loads the <details> block into our class. once that is done, there is no need for the information in the buffers any longer and we might as well clear them.
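A rough Python sketch of that load-then-clear cycle, for illustration only. The names `run_functions`, `buffers` and `info` are hypothetical; the real videoinfotag.load() lives in XBMC's C++ code and is not reproduced here. Each function is assumed to return one `<details>` block as a string:

```python
import xml.etree.ElementTree as ET


def run_functions(functions, buffers):
    """Run each function, merge its <details> output, then clear the buffers."""
    info = {}
    for func in functions:
        xml_text = func(buffers)  # assumed: every function returns a <details> block
        for child in ET.fromstring(xml_text):
            # Merge into what has been collected so far instead of overwriting it.
            info.setdefault(child.tag, []).append(child.text or "")
        buffers.clear()  # once loaded, the buffer contents are no longer needed
    return info
```

The point of the sketch is that the accumulated result lives outside the buffers, so clearing them after each function loses nothing.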
- Nicezia - 2009-05-16 21:14
All right, but I think I may have been wrong about not seeing a problem coding around it. Apparently the custom functions in the IMDB scraper create custom function calls, and the custom function calls that already exist after the getdetails function has finished need info from the functions invoked by those newly created calls in order to progress. The problem I'm having is that calling the custom function calls inside a custom function call causes an infinite loop until the stack overflows. In XBMC this may be easy to handle by just collecting the gathered info every time a custom function ends and then clearing it, but coding that kind of behaviour into a library that just returns a collection of info to a program is a somewhat daunting task for my amateur skills. The logic just doesn't lend itself easily to portability, so at this point I can either diverge from XBMC's logic, put this aside until I have enough programming skill to compensate for it, or get a more skilled programmer than myself (which is just about anybody) on board to program around it.
Of course, less complicated scrapers that only use one page to gather final details work perfectly. Even those with custom function calls that don't create further custom function calls work perfectly. It's just the IMDB scraper that I have trouble programming around.
Seriously, if anybody has some awesome VB.NET skills and knows how the XBMC scrapers work, i would appreciate some help with this problem. This library is a single function away from being complete.
There's a release up on the Sourceforge site with the code that i have so far... If anyone would like to take a crack at it, the code in svn is not the same code, as svn for some reason keeps screwing up on me.
- Nicezia - 2009-05-17 03:08
OK, so I got to thinking about it, and I guess what's going to have to be done here is to make a class for each media type and pass the information along as I get it, the way XBMC does. That will at least solve my problem of how the custom functions handle info. However, I still can't get my head wrapped around how to solve the problem of a custom function creating another custom function call that needs to be run BEFORE any of the custom functions that already exist, so that the information in the buffer is correct when it's called for in the IMDB scraper. Recursion would be my guess, but unlike the recursion with the RegExps this is a dynamically created recursion, and that's too much for me to handle.
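One possible way to get that ordering without recursion is an explicit work list where newly created calls are pushed to the front. The Python sketch below is only an illustration of that idea, not how XBMC does it; `run_chain` and `run_call` are hypothetical names, and `run_call` is assumed to return the list of calls that the function it ran has created:

```python
from collections import deque


def run_chain(initial_calls, run_call):
    """Process calls so that calls created by a function run before the remaining ones."""
    pending = deque(initial_calls)
    order = []
    while pending:
        call = pending.popleft()
        order.append(call)
        new_calls = run_call(call)  # hypothetical: calls created by this function
        # Push to the front, preserving their relative order.
        pending.extendleft(reversed(new_calls))
    return order


created = {"A": ["A1", "A2"], "A1": ["A1a"]}
print(run_chain(["A", "B"], lambda c: created.get(c, [])))
```

Because the loop is iterative, a long chain cannot overflow the stack; a function that keeps creating calls would still need its own limit.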
- Nicezia - 2009-05-18 04:03
OK, maybe it's just me, but is this function creating a call for itself? Why?
This is the reason i'm getting an infinite loop
- spiff - 2009-05-18 07:37
it is conditionally creating a link to itself if that regexp has a match. reason is probably a splash/ad/commercial or the likes that pops up at times.
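A hedged Python sketch of that conditional self-link, for illustration only. The refresh pattern, the `fetch` callable and the `max_hops` cap are all assumptions and are not taken from the IMDB scraper; the idea is that the function repeats only while the freshly fetched page still matches:

```python
import re

# Assumed pattern for a splash/refresh page; the scraper's real regexp is not shown here.
REFRESH_RE = re.compile(r'http-equiv="refresh"', re.IGNORECASE)


def fetch_past_interstitial(fetch, url, max_hops=5):
    """Re-fetch the URL while the response is still a refresh page, up to max_hops times."""
    for _ in range(max_hops):
        page = fetch(url)  # test the new response each time, never a stale buffer
        if not REFRESH_RE.search(page):
            return page
    raise RuntimeError("still on the refresh page after %d attempts" % max_hops)
```

If the check is run against an old buffer rather than the new response, the match never goes away and the self-link repeats forever, which is why the sketch also caps the attempts.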
- Nicezia - 2009-05-18 11:07
hmmm, interesting work around
kinda leaves me stumped, though, as to how to compensate.
Other than not being able to compensate for that one function, my library is solid: it works with every other scraper. For now, until I can figure it out, I'm rewriting an IMDB scraper without the IMPA posters so that it will work in my library (since IMDB is the big one that everybody wants), and then I'm finally releasing a (near feature-complete) version.
- spiff - 2009-05-18 11:39
why are you sold? that expression should only output to the xml IF the regexp matches. and it should only match IF we are on that REFRESH page.