ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work...
#61
I haven't got a clue, I'm still checking all my source code to see why it keeps recreating itself.
I haven't figured out a way to clear the function, and for whatever reason it keeps popping back up.
#62
that must mean that you get a match on that regexp for some reason. looking at the regexps, it might be you not respecting the "clear" keyword on the expression. clear means we shall clear the dest buffer no matter if the expression matches or not. if you do not respect that, we have a chain already stored in $$5 and shit hits the fan
#63
spiff Wrote:that must mean that you get a match on that regexp for some reason. looking at the regexps, it might be you not respecting the "clear" keyword on the expression. clear means we shall clear the dest buffer no matter if the expression matches or not. if you do not respect that, we have a chain already stored in $$5 and shit hits the fan

ok, so the "clear" keyword clears the buffer whether the expression matches or not. the scraper tutorial gave me the wrong impression of what the clear keyword did, but i've found the problem now.
#64
yeah. sorry i haven't even read that tutorial, it was based on some early ramblings of mine, but other than that it's all third party :)
#65
well, i'm glad that got cleared up, i was thinking it was impossible to get around.
anyway, my problem with clear-on-fail was that i was checking for "regexp.match.group.count > 0", not knowing that it creates groups even if the match wasn't successful.
got to get out of the habit of lazy programming!
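
For anyone else who trips over this, here is a rough Python sketch of the "clear" behaviour spiff describes: the dest buffer gets wiped whether or not the expression matches, and the success test is made on the match itself rather than on a group count. The apply_expression helper and the dict-of-buffers layout are made up for illustration, they are not ScraperXML or XBMC code, and the sketch always overwrites instead of appending.

```python
import re

def apply_expression(buffers, dest, pattern, text, output, clear=False):
    """Apply one expression and store the formatted result in buffers[dest].

    Hypothetical helper: buffers is a plain dict of buffer number -> string.
    Returns True if the expression matched, False otherwise.
    """
    if clear:
        # clear="yes": wipe the destination first, match or no match
        buffers[dest] = ""
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        # no match: without clear, whatever was already in dest stays there
        return False
    # expand \1, \2 ... in the output template from the captured groups
    buffers[dest] = match.expand(output)
    return True

buffers = {5: "stale chain from an earlier function"}
apply_expression(buffers, 5, r"(\w+\.jpg)", "no poster here",
                 r"<thumb>\1</thumb>", clear=True)
print(repr(buffers[5]))
```

In C# the equivalent of the "match is None" test would be checking Match.Success, which is the check that was missing in the case described above.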
#66
Code:
<GetIMDBPoster dest="5">
    <RegExp input="$$8$$9$$10$$11" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="5">
        <RegExp input="$$1" output="\1_SX$INFO[imdbscale]_SY$INFO[imdbscale]_\2" dest="6">
            <expression noclean="1,2">&lt;a name=&quot;poster&quot;.*?src=&quot;(.*?)_S.*?(.jpg)&quot;.*?&lt;/a&gt;</expression>
        </RegExp>
        <RegExp input="$$6" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="11">
            <expression clear="yes" noclean="1">(.*?_SX[0-9]+_SY[0-9]+_.jpg)</expression>
        </RegExp>
        <expression noclean="1"></expression>
    </RegExp>
</GetIMDBPoster>


okay, another question... this function clears the buffers, yet it asks for info from 4 of the buffers, info created by other functions. so when exactly is clearbuffers supposed to happen?

from this function's behaviour i would guess that at the beginning of a new function it checks the clearbuffers state of the last function, and if clearbuffers was true for the last function it clears the buffers and then sets the state for the current function. am i right?


Code:
Function Process

1. Check clearbuffers
  1.a if clearbuffers="no", leave everything in the buffers intact
  1.b if clearbuffers="yes" or is not set, delete everything from all buffers
2. Either download the specific page referred to, or take the title and year and set that to $$1

3. Parse through the regular expressions
  execution starts from the first RegExp's deepest descendant
    check conditional
      if condition met...
        replace buffers on input
        if clear is set on the expression, the destination is cleared before execution
        replace buffers on expression before compile
        apply expression to text
          check if there are any matches
            if repeat....
            if noclean....
            if trim.....
        apply results to output
          replace buffers on output
          check whether to append or overwrite
      if condition not met
        do not execute regexp
4. Check output for custom function calls
    go over the same process above with custom functions
      check each custom function's output once more for any newly created custom function calls
5. Final results

This is the process of my parser as i have it so far. other than not being sure where and when to clear the buffers, or at what point in each function it reads this setting, i think i have it licked. can you verify?
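
To make step 3 of the process above a bit more concrete, here is a minimal Python sketch of the buffer replacement and the children-first evaluation of nested RegExp elements. Everything in it is an assumption for illustration: the dict-based node layout, the "append" flag, and the fallback to "(.*)" for an empty expression are stand-ins of this sketch, not the real parser. Conditionals, repeat, noclean and trim are left out.

```python
import re

def replace_buffers(text, buffers):
    """Swap every $$N reference for the contents of buffer N (empty if unset)."""
    return re.sub(r"\$\$(\d+)",
                  lambda m: buffers.get(int(m.group(1)), ""), text)

def run_regexp(node, buffers):
    """Evaluate one RegExp node; nested children run first (deepest first)."""
    for child in node.get("children", []):
        run_regexp(child, buffers)
    text = replace_buffers(node["input"], buffers)
    dest = node["dest"]
    if node.get("clear"):
        buffers[dest] = ""  # cleared before execution, match or not
    # assumption of this sketch: an empty expression captures the whole input
    pattern = replace_buffers(node["expression"], buffers) or "(.*)"
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return
    # fill in \1, \2 ... first, then any $$N references in the output
    result = replace_buffers(match.expand(node["output"]), buffers)
    if node.get("append"):
        buffers[dest] = buffers.get(dest, "") + result
    else:
        buffers[dest] = result

# two-level example loosely modelled on the poster function in this thread
poster = {
    "input": "$$6", "output": r"<thumb>\1</thumb>", "dest": 11,
    "expression": r"(\w+\.jpg)", "clear": True,
    "children": [
        {"input": "$$1", "output": r"\1_big\2", "dest": 6,
         "expression": r"(\w+)_S\w*(\.jpg)"},
    ],
}
buffers = {1: "poster_SX90.jpg"}
run_regexp(poster, buffers)
print(buffers[11])
```

The inner node fills buffer 6 from $$1 before the outer node reads $$6, which is the "deepest descendant first" order described in the list.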
#67
spiff Wrote:yes on the first one.

the second one: depends on whether or not the calling function has clearbuffers="no" set. if it isn't set, clear the buffers after function execution. if it is set, do not, and hence the next function should be called with the previous buffer state (excluding the first buffer, which holds the data of course).

already explained. in particular note the AFTER. your stuff looks okay but i only had half an eye to spare
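
A small Python sketch of that timing, in case it helps others writing a parser: the buffers are cleared AFTER a function has run, unless that function has clearbuffers="no". The run_function and new_buffers names, the callable standing in for a scraper function, and the buffer count of 20 are assumptions of this sketch, not XBMC code.

```python
NUM_BUFFERS = 20  # assumed buffer count for this sketch

def new_buffers():
    """Create an empty set of buffers numbered 1..NUM_BUFFERS."""
    return {i: "" for i in range(1, NUM_BUFFERS + 1)}

def run_function(execute, buffers, page_data, clearbuffers=True):
    """Run one scraper function, then clear unless clearbuffers="no".

    execute is a hypothetical callable standing in for the function body.
    """
    buffers[1] = page_data      # the first buffer always holds the new data
    result = execute(buffers)   # the body reads and writes the buffers
    if clearbuffers:
        # clearing happens after execution, so this function still saw
        # whatever an earlier clearbuffers="no" function left behind
        for i in buffers:
            buffers[i] = ""
    return result

def get_poster_url(b):
    b[8] = "poster chain"       # left behind for the next function
    return "first result"

def get_details(b):
    return b[1] + "|" + b[8]    # can still read buffer 8 here

buffers = new_buffers()
run_function(get_poster_url, buffers, "page one", clearbuffers=False)
print(run_function(get_details, buffers, "page two"))
```

This is why a function like GetIMDBPoster above can read $$8 to $$11 even though it clears the buffers itself: the clearing only takes effect once it has finished.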
#68
Just wanted to inform everyone that ScraperXML version 1.0 will hit the svn repository tonight at midnight (Central US time). It has full support for all XBMC Movie Scrapers!

There will also be a dll release for Windows, with a Mono library to follow soon after.
#69
30 minutes late, but it's up
#70
Nicezia Wrote:30 minutes late but its up

Okay, found it !!!
#71
congrats!
#72
I can honestly say i couldn't have done it without you, spiff.

i did find one error in it: the function that i made to clean is getting out all the html entities, but not the nested tags. (found this out because i tried to run a scraper i made with my library on XBMC, the debugging messages for scrapers really help.)

so i'm going to have to rewrite the function to remove tags.

so version 1.1 is soon to be released.

And oh, i rewrote the Excalibur scraper, and it's working perfectly now, as are a few others that weren't working in XBMC for me.
Looking into info on post data now. i haven't run into a scraper that uses it yet, but I might as well go ahead and make provisions for it... spoof is already supported.
#73
allmusic uses post iirc. asiandb is a movie scraper that uses post
#74
Post method is handled now. are there any other methods of downloading pages supported in XBMC that i need to know about?
#75
nope, post and submit (i.e. no post) are the two forms we support