ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work...

Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #46
2 more questions and this library is ready for a point release (at least as far as movies go; coding for other content will be a breeze once I nail down movies).


Are custom functions run A) before that info is put into the buffer, or B) after the main function is complete?

When a custom function runs with clearbuffers="no", does it use A) a copy of the main buffers or B) a copy of the local custom buffers?
(This post was last modified: 2009-05-15 02:14 by Nicezia.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #47
1) they are run after the main function has finished
2) it uses the state of the buffers when the function is called. you process the returned xml sequentially (i.e. the <url function="foo"> bits), so the first custom function is called with the state after the main function has finished, the second one with the state after the first custom function (if the first custom function has clearbuffers="no"), and so on.
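
A minimal sketch of that sequencing, under assumptions: the buffers are modelled as a plain dict, each custom function is a callable paired with its clearbuffers flag, and the clear is taken to happen when a function finishes (which is how the answer above reads). The names `run_chain`, `first` and `second` are hypothetical and are not part of XBMC or ScraperXML.

```python
def run_chain(main_state, custom_functions):
    """Run custom functions in the order they appear in the returned XML.

    custom_functions is a list of (func, clearbuffers) pairs, where
    clearbuffers is True for clearbuffers="yes" and False for "no".
    Returns the output of every function, in order.
    """
    state = dict(main_state)  # buffer state left behind by the main function
    results = []
    for func, clearbuffers in custom_functions:
        results.append(func(state))  # the function reads and writes the buffers
        if clearbuffers:
            state = {}  # later functions in the chain never see this state
    return results


def first(buffers):
    buffers[5] = buffers.get(5, "") + "<title>A</title>"
    return buffers[5]


def second(buffers):
    buffers[5] = buffers.get(5, "") + "<year>2009</year>"
    return buffers[5]
```

With `[(first, False), (second, True)]` the second function appends to what the first one left in buffer 5; with `[(first, True), (second, True)]` it starts from empty buffers.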
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #48
So only the first custom function to run in a sequence has the option to copy the buffer state? If a function running after the first one clears the buffers, then that buffer state is completely lost to any other custom functions in the chain, right?
(This post was last modified: 2009-05-15 10:49 by Nicezia.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #49
yes
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #50
Oops, I feel pretty stupid now.

I found my ACTUAL problem:
when running custom functions I forgot to account for the RIGHT buffers to replace.
I never told it to replace with the local buffers, so it was still trying to use the last buffer state from before the custom function ran.
(This post was last modified: 2009-05-16 00:38 by Nicezia.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #51
1) remove html tags
2) replace html chars such as &amp; etc
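
Those two steps could be sketched like this. It is an illustration using Python's standard library, not XBMC's actual cleaning code, and the helper name `clean` is hypothetical.

```python
import html
import re

# Matches anything that looks like an HTML tag.
TAG_RE = re.compile(r"<[^>]*>")


def clean(text):
    """Strip HTML tags, then decode character references such as &amp;."""
    without_tags = TAG_RE.sub("", text)
    # Decoding comes second, so an escaped "&lt;b&gt;" is kept as text
    # instead of being turned into a tag and stripped.
    return html.unescape(without_tags).strip()
```

For example, `clean("<b>Tom &amp; Jerry</b>")` gives `Tom & Jerry`.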
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #52
Yeah, I found those functions in the XBMC source and fixed that, then realized that it was replacing with the buffers from before the custom functions started running instead of the copied buffer state.
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #53
OK, there's some logic I must admit I truly don't understand. With the IMDB scraper's custom functions, everything starts out fine: it's appending all the information to $$5. Then all of a sudden it is told to clear $$5 and put new information over it, which loses all the previous information that was being chained together to create one XML element (details). That would be fine if I were processing the information individually as it came, but if I do that I get repeated information in the output. So I tried letting it go through all the custom functions and then adding the data to what I already have (the non-custom-function references), and then once we get to GetIMPALink it all gets cleared out by the data that function gets.

So what happens is I start out getting all the information as one big details element, then when it gets to GetIMPALink it deletes everything it was storing and replaces it with the IMPALink. So I'm guessing I have to catch the information after every custom function has run, and sort through it to see if I already have it in my details before adding it. But I feel the need to ask the reason for this. Since you had already started to chain the information to be sent back in the <details> element, why all of a sudden delete it all and start over? Why not either continue chaining it or send the pieces back individually, one or the other?

Not questioning your programming, just trying to find out if there's some reason for this that I don't understand (I don't see a problem in coding around this behavior, though).
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #54
this perhaps weird logic is due to this:

after each function has returned we call videoinfotag.load() in xbmc. this loads the <details> block into our class. once that is done, there is no need for the information in the buffers any longer and we might as well clear them.
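
A rough sketch of that load-then-clear pattern, for a library that has to hand back one collected result. Everything here is an assumption made for illustration: `VideoInfo` is a hypothetical stand-in for the class that `videoinfotag.load()` fills in XBMC, and each function is modelled as a callable that returns a complete <details> block.

```python
import xml.etree.ElementTree as ET


class VideoInfo:
    """Hypothetical collector for scraped fields."""

    def __init__(self):
        self.fields = {}

    def load(self, details_xml):
        # Merge every child element of <details> into the collected fields.
        for child in ET.fromstring(details_xml):
            self.fields.setdefault(child.tag, []).append(child.text or "")


def run_functions(functions, info):
    buffers = {}
    for func in functions:
        details_xml = func(buffers)  # each function returns a <details> block
        info.load(details_xml)       # the data now lives outside the buffers
        buffers.clear()              # so nothing is lost by clearing them
```

Because the collector is filled after every function, a later function that overwrites a buffer cannot wipe out what earlier functions produced.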
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #55
All right, but I think I may have been wrong about not seeing a problem coding around it. Apparently the custom functions in the IMDB scraper create further custom function calls, and the custom function calls that already exist after the getdetails function has finished need info from the functions invoked by those newly created calls in order to progress. The problem I'm having is that calling custom function calls from inside a custom function call causes an infinite loop until the stack overflows. In XBMC this may be easy to handle by just collecting the gathered info every time a custom function ends and then clearing it, but coding that kind of behaviour into a library that just returns a collection of info to a program is a somewhat daunting task for my amateur skills. The logic just doesn't lend itself easily to portability. So at this point I either diverge from XBMC's logic, put this aside until I have enough programming skill to compensate for it, or get a more skilled programmer than myself (which is just about anybody) on board to program around it.

Of course, less complicated scrapers that only use one page to gather final details work perfectly. Even those that have custom function calls that don't create further custom function calls work perfectly. It's just the IMDB scraper that I have trouble programming around.

Seriously, if anybody has some awesome VB.NET skills and knows how the XBMC scrapers work, I would appreciate some help with this problem. This library is a single function away from being complete.

There's a release up on the SourceForge site with the code that I have so far, if anyone would like to take a crack at it. The code in svn is not the same code, as svn for some reason keeps screwing up on me.
(This post was last modified: 2009-05-16 23:42 by Nicezia.)
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #56
OK, so I got to thinking about it, and I guess what I'm going to have to do is make a class for each media type and pass the information along as I get it, the way XBMC does. That will at least solve my problem of how the custom functions handle info. However, I still can't wrap my head around how to solve the problem of a custom function creating another custom function call that needs to be run BEFORE any of the custom functions that already exist, so that the information in the buffer is correct when it's called for in the IMDB scraper. Recursion would be my guess, but unlike the recursion with the RegExps this is a dynamically created recursion, and that's too much for me to handle.
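
One way to picture that "dynamically created recursion" is a depth-first walk over the returned XML: whenever a <url function="..."> element turns up, run that function straight away and walk its output before moving on to the remaining elements. The sketch below is only an illustration under assumptions, not how XBMC or ScraperXML is written; `process`, `functions`, `fetch` and `max_depth` are hypothetical names, and the depth limit is an added safety net against runaway nesting.

```python
import xml.etree.ElementTree as ET


def process(details_xml, functions, fetch, collected, depth=0, max_depth=10):
    """Walk a <details> block, running <url function="..."> calls as found.

    functions maps a function name to a callable that takes the fetched page
    and returns a new <details> block; fetch maps a URL to page text.
    """
    if depth > max_depth:
        raise RecursionError("function calls nested too deeply")
    for child in ET.fromstring(details_xml):
        name = child.get("function")
        if child.tag == "url" and name:
            page = fetch(child.text)
            # Recurse right away, so calls created by this function run
            # before the remaining elements of the current block.
            process(functions[name](page), functions, fetch, collected,
                    depth + 1, max_depth)
        else:
            collected.append((child.tag, child.text))
```

Calls created by a function are handled before the calls that were already waiting, which is the ordering described above.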
(This post was last modified: 2009-05-17 03:59 by Nicezia.)
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #57
Code:
<GetIMPAPosters clearbuffers="no" dest="5">
    <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetIMPAPosters&quot;&gt;http://www.impawards.com/\1&lt;/url&gt;&lt;/details&gt;" dest="5">
        <expression clear="yes">&lt;meta http-equiv=&quot;REFRESH&quot; content=&quot;0;URL=[^/]*/([^&quot;]*)&quot;&gt;</expression>
    </RegExp>
    <RegExp input="$$1" output="\1" dest="4">
        <expression clear="yes" noclean="1">value=&quot;/([^&quot;]*)/[^&quot;]*\.html&quot;&gt;</expression>
    </RegExp>
    <RegExp input="$$1" output="&lt;thumb&gt;http://www.impawards.com/$$4/posters/\1&lt;/thumb&gt;" dest="8+">
        <expression clear="yes" noclean="1">&lt;img SRC=&quot;posters/([^&quot;]*)&quot;</expression>
    </RegExp>
    <RegExp input="$$1" output="&lt;thumb&gt;http://www.impawards.com/$$4/posters/\1&lt;/thumb&gt;" dest="9+">
        <expression clear="yes" repeat="yes" noclean="1">thumbs/imp_([^&gt;]*ver[^&gt;]*.jpg)&gt;</expression>
    </RegExp>
</GetIMPAPosters>

OK, maybe it's just me, but is this function creating a call to itself? Why?
This is the reason I'm getting an infinite loop.
(This post was last modified: 2009-05-18 04:09 by Nicezia.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #58
it is conditionally creating a link to itself if that regexp has a match. reason is probably a splash/ad/commercial or the likes that pops up at times.
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #59
Hmmm, interesting workaround.
It kinda leaves me stumped, though, as to how to compensate.


Other than not being able to compensate for that one function, my library is solid. It works with every other scraper. Right now, until I can figure it out, I'm rewriting an IMDB scraper without the IMPAposters so that it will work in my library (since IMDB is the big one that everybody wants), and then I'm finally releasing a (near-feature-complete) version.
(This post was last modified: 2009-05-18 11:15 by Nicezia.)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #60
why are you sold? that expression should only output to the xml IF the regexp matches. and it should only match IF we are on that REFRESH page.
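
To see why the self-call should end on its own, here is a small sketch that applies the first expression of GetIMPAPosters (with the XML entities decoded) using Python's `re` module. The sample pages in the usage note are made up for illustration, `self_call` is a hypothetical helper, and Python's regex engine is only assumed to treat this pattern the way the scraper's engine does.

```python
import re

# First <expression> of GetIMPAPosters, with &lt; &gt; &quot; decoded.
REFRESH_RE = re.compile(
    r'<meta http-equiv="REFRESH" content="0;URL=[^/]*/([^"]*)">'
)


def self_call(page):
    """Return a new GetIMPAPosters call for a redirect stub, else ''."""
    match = REFRESH_RE.search(page)
    if not match:
        return ""  # no match, no output, so no further call is queued
    return ('<details><url function="GetIMPAPosters">'
            'http://www.impawards.com/' + match.group(1) + '</url></details>')
```

A stub page such as `<meta http-equiv="REFRESH" content="0;URL=../2009/star_trek.html">` produces one more call, pointing at the real page. A page without that meta tag produces nothing, so the chain stops there.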