Kodi Community Forum
ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... - Printable Version

+- Kodi Community Forum (https://forum.kodi.tv)
+-- Forum: Development (https://forum.kodi.tv/forumdisplay.php?fid=32)
+--- Forum: Scrapers (https://forum.kodi.tv/forumdisplay.php?fid=60)
+--- Thread: ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... (/showthread.php?tid=50055)



- spiff - 2009-07-15

1) we fill up the buffers starting with 1; the second url is stuffed in 2, and so on. this semantic is valid everywhere - for those functions which are passed e.g. the title in buffer 2, it should be "the buffer after the last url"
2) i don't get what you mean here? nfourl ultimately returns an url?
3) the cure for that one is called 'beer' ;)

snippet from the c++ code:

Code:
// download url's
vector<CStdString> strHTML;
for (unsigned int i=0;i<scrURL.m_url.size();++i)
{
  CStdString strCurrHTML;
  if (!CScraperUrl::Get(scrURL.m_url[i],strCurrHTML,m_http) || strCurrHTML.size() == 0)
    return 0;
  strHTML.push_back(strCurrHTML);
}

// fill buffers
for (unsigned int i=0;i<strHTML.size();++i)
  m_parser.m_param[i] = strHTML[i];
m_parser.m_param[strHTML.size()] = scrURL.m_url[0].m_url;
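
on the c# side the same fill order might look roughly like this - a sketch only, with the parameter dictionary standing in for the $$N buffers rather than ScraperXML's actual API. note that the buffer after the last downloaded page gets the first url itself.

Code:
using System.Collections.Generic;
using System.Net;

// hypothetical C# mirror of the buffer-filling above; 'parameters' stands in
// for the $$N buffers and is not ScraperXML's actual API
static bool FillBuffers(IList<string> urls, IDictionary<int, string> parameters)
{
    List<string> html = new List<string>();
    using (WebClient client = new WebClient())
    {
        foreach (string url in urls)
        {
            string page = client.DownloadString(url);   // fetch each <url> in order
            if (string.IsNullOrEmpty(page))
                return false;                           // bail out like the C++ does
            html.Add(page);
        }
    }

    // $$1..$$N hold the downloaded pages, in <url> order
    for (int i = 0; i < html.Count; i++)
        parameters[i + 1] = html[i];

    // the buffer after the last page gets the first url itself
    parameters[html.Count + 1] = urls[0];
    return true;
}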

btw, you want to have a look at http://trac.xbmc.org/changeset/21675


- xyber - 2009-07-15

Nicezia Wrote:Going to update svn again tomorrow, adding functions to return the MediaTag objects rather than strings; in this way it can be directly integrated into Media Manager programs.

The functions that return strings will still be there, and it will be the program's choice which to use.

Yay :D I wanted to request this but was happy enough with the string.

On the other point, I am using TechNuts.MediaTags as I work in C#. Will do a clean grab off svn when you've updated and check those tags again.

[edit] Almost done with the TV section; then I just need to clean up a bit and will release the app. If you want access to the source I'll send it to you.


- Nicezia - 2009-07-16

spiff Wrote:2) i don't get what you mean here? nfourl ultimately returns an url?

there are some that return a string,
some that return a <url> element,
and some that return a <url> element and an <id> element (mind you, not enclosed in a parent tag either, just as siblings, which makes it difficult to handle with the .Net xml handlers)

spiff Wrote:1) we fill up the buffers starting with 1; the second url is stuffed in 2, and so on. this semantic is valid everywhere - for those functions which are passed e.g. the title in buffer 2, it should be "the buffer after the last url"

every time i think i've got this stuff pegged there's something new that i have to completely redesign for..... well, not completely redesign; in fact, with the way i have things now, there's only a small section of each Content handler that needs to be redesigned.

However, with multiple urls, which one do you use for the url value if there needs to be a url value in the buffer?

there are a whole lot of things that are too often variable... i kinda think things could have a more rigid standard and still be flexible (and this would make the code more portable as well), but that would be too easy and i would have been done with this thing ages ago.. then what would i do?

ideas to consider?
1) For the multiple url's i think it would be a better idea to run the GetDetails function against each (i.e. download one page and run GetDetails, download the next page and run GetDetails again, and so forth), so that the only buffer needed for html is $$1, then run the custom functions after all the referenced urls are finished.

Edit: I just thought of a few complications this might cause if the different pages have matches for something they shouldn't match. Perhaps have them run as custom functions? In fact, that's what i thought custom functions were for.

2. To make the scraper more flexible: use buffers $$1 and $$2 for storing values only - buffer $$2 holding the item to retrieve from the last (standard, not custom) function.
i.e. CreateSearchUrl's results are passed to $$2 (or, in light of what you just told me about chaining to buffers, the next buffer after the last HTML) for use with GetSearchResults, and the <entity> retrieved from GetSearchResults is passed to $$2 on select to be used with GetDetails.
In this way the user can gather whatever info they want in the GetSearchResults function and reuse it in the GetDetails function; all the necessary url's, id's and titles will also be there for use.

3. add a urlencode option to RegExp or expression so that it will be possible to pull things like the artist name or title and UrlEncode them for fanart and thumbnail searches:

Code:
<RegExp input="$$2" output="\1" urlencode="yes" dest="3">
    <expression clear="yes">&lt;artist&gt;(.+)&lt;/artist&gt;</expression>
</RegExp>
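
engine-side that option would be small to support - a sketch only, assuming the attribute is read straight off the <RegExp> node and applied to the captured text before it is written to the dest buffer (HttpUtility.UrlEncode from System.Web being the obvious .Net choice):

Code:
using System.Web;
using System.Xml;

// sketch: honour a hypothetical urlencode="yes" attribute on a <RegExp> node
// by url-encoding the captured text before it goes into the dest buffer
static string ApplyCapture(XmlElement regExpNode, string captured)
{
    bool urlEncode = regExpNode.GetAttribute("urlencode") == "yes";
    return urlEncode ? HttpUtility.UrlEncode(captured) : captured;
}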

4. Limit NfoUrl returns to either an <entity> element (containing things like id, title, and url elements) or a <url> element, since its value is used directly in GetDetails:

Code:
<NfoUrl dest="4">
    <RegExp input="$$3" output="&lt;entity&gt;\1&lt;/entity&gt;" dest="4">
        <RegExp input="$$1" output="&lt;url&gt;http://www.\1/title/tt\2/&lt;/url&gt;&lt;id&gt;tt\2&lt;/id&gt;" dest="3">
            <expression clear="yes" noclean="1">(imdb.com/)Title\?([0-9]*)</expression>
        </RegExp>
        <expression noclean="1">(.+)</expression>
    </RegExp>
</NfoUrl>

or

Code:
<NfoUrl dest="4">
    <RegExp input="$$1" output="&lt;url&gt;http://www.\1/title/tt\2/&lt;/url&gt;" dest="4">
        <expression clear="yes" noclean="1">(imdb.com/)Title\?([0-9]*)</expression>
    </RegExp>
</NfoUrl>


Most of the problems I'm running into in scraperXML are because there doesn't seem to be an across-the-board standard: each scraper type has different options it needs set in the buffers. By using just $$2 to hold the values each scraper needs as a standard, you can use the same code for all scrapers, except in CreateSearchUrl, where it will still be necessary to pass values specific to each scraper. The urlencode option can handle setting the things that need to be url encoded for custom functions.


- Nicezia - 2009-07-16

sorry about missing my commit yesterday, personal business got in the way.

commit finished now though (however i haven't made demos for the Tag Returns... and i haven't yet finished the TVshow tag returns).

going to have to make a few internal changes with this new information coming to light.


- spiff - 2009-07-17

i'm tired so i'll comment on the rest tomorrow.

nfourl needs to be able to chain. all scraper functions are allowed to chain - everywhere. it's used for e.g. translating imdb id's into ofdb id's, tmdb id's etc. the answer to all of this is recursive code which conditionally calls itself if it finds a <url> element in the xml
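
in c# terms that recursion might look roughly like the sketch below - the scraper-function call is abstracted behind a delegate since this is illustrative, not ScraperXML's real entry points:

Code:
using System;
using System.Net;
using System.Xml;

// sketch of the recursive chaining: run a function, and if its xml output
// contains <url> elements, download each one and re-enter with the new page.
// RunFunction stands in for whatever actually executes a scraper function.
class ChainResolver
{
    public Func<string, string, string> RunFunction;   // (functionName, pageHtml) -> xml output

    public XmlDocument Resolve(string functionName, string pageHtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(RunFunction(functionName, pageHtml));

        foreach (XmlElement url in doc.GetElementsByTagName("url"))
        {
            // a function="..." attribute names the next link in the chain,
            // otherwise re-enter the same function with the downloaded page
            string next = url.HasAttribute("function") ? url.GetAttribute("function") : functionName;
            using (WebClient client = new WebClient())
            {
                XmlDocument chained = Resolve(next, client.DownloadString(url.InnerText));
                // merging 'chained' back into 'doc' is left out of this sketch
            }
        }
        return doc;
    }
}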


- Nicezia - 2009-07-17

spiff Wrote:i'm tired so i'll comment on the rest tomorrow.

nfourl needs to be able to chain. all scraper functions are allowed to chain - everywhere. it's used for e.g. translating imdb id's into ofdb id's, tmdb id's etc. the answer to all of this is recursive code which conditionally calls itself if it finds a <url> element in the xml

I'm already making the changes to make this happen; all i really need to know is, if you have three urls to download and the scraper (for example imdb) needs a url in the buffer, which url is used?

The chaining i can understand, and definitely see the need for it. After i sent all that, i thought up a little function that allows the stupid inflexible .net to parse sibling elements (why this isn't in the native code i will never know). most of the things i suggest pose no real problems anymore.
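
such a helper can be tiny - a sketch only, wrapping the bare siblings in a throwaway root so XmlDocument will accept them (ParseSiblings is just an illustrative name, not anything in ScraperXML):

Code:
using System.Xml;

// sketch: output like "<url>...</url><id>...</id>" has no root element, so
// XmlDocument.LoadXml rejects it; wrapping it in a throwaway root makes the
// siblings parseable
static XmlNodeList ParseSiblings(string fragment)
{
    XmlDocument doc = new XmlDocument();
    doc.LoadXml("<root>" + fragment + "</root>");
    return doc.DocumentElement.ChildNodes;   // the <url> and <id> siblings
}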

won't be updating for a little while, while i sort out a few things on how to get my code working EXACTLY like XBMC's. i keep my XBMC source updated and check it over for changes in the scraper code regularly, and i can almost follow exactly what it says there; it was just the url thing that i didn't understand, because i had no idea what the vector did, but now that i know, that makes it a little easier to understand. now i can't seem to find where it sets the URL value that's needed in the buffer, so i'm lost on that.


- spiff - 2009-07-17

the first one. the multi-url's are used very seldom btw; it's usually more convenient to just do another chain. it's for the few corner cases where you need the info from two pages at the same time.


- Nicezia - 2009-07-17

thanks


- smeehrrr - 2009-07-17

Question about thetvdb.com: Does everyone get frequent timeouts when trying to scrape this site? It only seems to happen on certain titles, so I'm wondering if there's a bug in the scraper, a bug in ScraperXML, or if the site is just unreliable.


- xyber - 2009-07-17

Might be the site. I've gone to the site in my browser when there are problems in the scraper and then get an error on the actual site. I posted the message I normally get a few posts back.


- smeehrrr - 2009-07-17

xyber Wrote:Might be the site. I've gone to the site in my browser when there are problems in the scraper and then get an error on the actual site. I posted the message I normally get a few posts back.

I'm starting to think this is something else. I've noticed that the first few searches I do against the site always work, and then at some point it starts timing out on every call. If I close my process and restart, then I get a few more good searches before it starts timing out again. At any point during this, searches that I do through my browser appear to work fine.

I'm now of the opinion that this is either a ScraperXML bug or a Windows networking bug. I'll do some more investigation and see if I can find anything interesting.


- smeehrrr - 2009-07-17

This looks like it may be a memory leak. I see the commit charge for my process grow on each call to GetDetails, and when it gets above about 300MB all calls to tvdb start to time out. Restarting the process resets the commit charge and I can do more searches until I exceed the same approximate memory usage.

Maybe some object in the TV show parsing needs an explicit Dispose() call? I'll try to narrow this down some more if I get some time later.
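
For what it's worth, the usual shape of an explicit-disposal fix in .Net is a set of using blocks around whatever gets created per call (responses, streams, zip readers) - a sketch of the pattern only, not ScraperXML's actual download code:

Code:
using System.IO;
using System.Net;

// sketch of the using/Dispose pattern; undisposed responses and streams are a
// classic cause of "works for a while, then every call times out"
static string DownloadPage(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();   // everything is disposed even if this throws
    }
}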


- xyber - 2009-07-17

ooh.. sounds like a bad one. Guess it will happen in my app too then :/ Will be scanning my whole TV folder through the weekend. Think I'll run in debug and see what happens.


- smeehrrr - 2009-07-17

The leak is in the Zip library. I'm going to pull that stuff out and replace it with the zip stuff in the .Net framework.


- smeehrrr - 2009-07-18

OK, System.IO.Packaging is just made of FAIL. It can't open the zip files returned from tvdb because they don't contain a [Content_Types].xml file.

I guess I'll go get the source to SharpZipLib and see if I can fix the problem there.
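
In the meantime, reading the plain zip straight off thetvdb with SharpZipLib stays short - a sketch, assuming the ZipFile class from ICSharpCode.SharpZipLib.Zip, with an explicit Close() so nothing lingers between calls:

Code:
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

// sketch: read one entry (e.g. the language xml) out of a thetvdb zip with
// SharpZipLib, closing the ZipFile so its file handle is released promptly
static string ReadEntry(string zipPath, string entryName)
{
    ZipFile zip = new ZipFile(zipPath);
    try
    {
        ZipEntry entry = zip.GetEntry(entryName);
        if (entry == null)
            return null;
        using (Stream stream = zip.GetInputStream(entry))
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
    finally
    {
        zip.Close();   // release the underlying file
    }
}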