ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... - Printable Version
+- XBMC Community Forum (http://forum.xbmc.org)
+-- Forum: Development (/forumdisplay.php?fid=32)
+--- Forum: Scraper Development (/forumdisplay.php?fid=60)
+--- Thread: ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... (/showthread.php?tid=50055)
- Nicezia - 2009-07-17 00:53
spiff Wrote:i'm tired so i'll comment on the rest tomorrow.
I'm already making the changes to make this happen. All I really need to know is: if you have three URLs to download and the scraper (for example IMDb) needs a URL in the buffer, which URL is used?
The chaining I can understand, and I definitely see the need for it. After I sent all that, I thought up a little function that lets the stupid, inflexible .NET parser walk sibling elements (why this isn't in the native code I will never know). Most of the things I suggested pose no real problems anymore.
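For readers wondering what "walking sibling elements" looks like in .NET, here is a minimal sketch of one common approach, using `XmlNode.NextSibling` to visit every element at the same level rather than stopping at the first match. The element and attribute names are made up for illustration and are not from the actual ScraperXML code:

```csharp
using System;
using System.Xml;

class SiblingDemo
{
    static void Main()
    {
        // Hypothetical scraper-style XML with several <RegExp> siblings in a row.
        var doc = new XmlDocument();
        doc.LoadXml("<rules><RegExp expr=\"a\"/><RegExp expr=\"b\"/><note/></rules>");

        // Walk every sibling of the first child; SelectSingleNode alone
        // would only ever hand back the first <RegExp>.
        for (XmlNode node = doc.DocumentElement.FirstChild; node != null; node = node.NextSibling)
        {
            if (node.NodeType == XmlNodeType.Element && node.Name == "RegExp")
                Console.WriteLine(node.Attributes["expr"].Value);
        }
    }
}
```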
I won't be updating for a little while, while I sort out a few things on how to get my code working EXACTLY like XBMC's. I keep my XBMC source updated and check it over for changes in the scraper code regularly, and I can almost follow exactly what it says there. It was just the URL thing that I didn't understand, because I had no idea what the vector did; now that I know, it's a little easier to understand. But I can't seem to find where it sets the URL value that's needed in the buffer, so I'm lost on that.
- spiff - 2009-07-17 00:57
first one. the multi-urls are used very seldom, btw; it's usually more convenient to just do another chain. it's for the few corner cases where you need the info from two pages at the same time
- Nicezia - 2009-07-17 01:03
- smeehrrr - 2009-07-17 10:13
Question about thetvdb.com: does everyone get frequent timeouts when trying to scrape this site? It only seems to happen on certain titles, so I'm wondering if there's a bug in the scraper, a bug in ScraperXML, or if the site is just unreliable.
- xyber - 2009-07-17 18:26
Might be the site. I've gone to the site in my browser when there are problems in the scraper and gotten an error on the actual site as well. I posted the message I normally get a few posts back.
- smeehrrr - 2009-07-17 19:58
xyber Wrote:Might be the site. I've gone to the site in my browser when there are problems in the scraper and gotten an error on the actual site as well. I posted the message I normally get a few posts back.
I'm starting to think this is something else. I've noticed that the first few searches I do against the site always work, and then at some point it starts timing out on every call. If I close my process and restart, I get a few more good searches before it starts timing out again. At any point during this, searches that I do through my browser appear to work fine.
I'm now of the opinion that this is either a ScraperXML bug or a Windows networking bug. I'll do some more investigation and see if I can find anything interesting.
- smeehrrr - 2009-07-17 20:10
This looks like it may be a memory leak. I see the commit charge for my process grow on each call to GetDetails, and when it gets above about 300MB, all calls to tvdb start to time out. Restarting the process resets the commit charge, and I can do more searches until I exceed the same approximate memory usage.
Maybe some object in the TV show parsing needs an explicit Dispose() call? I'll try to narrow this down some more if I get some time later.
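As a side note on the Dispose() theory: in .NET, one common cause of the "first few requests work, then everything times out, but the browser is fine" pattern is web responses that are never disposed, since each undisposed response can hold a pooled connection open. Whether that is what is happening in ScraperXML here is only a guess, but a leak-free download in the .NET of this era would look roughly like this (the method name and URL handling are illustrative, not from the library):

```csharp
using System.IO;
using System.Net;

class DownloadDemo
{
    // Sketch: "using" guarantees Dispose() is called on the response,
    // stream, and reader even if an exception is thrown mid-read, so
    // pooled connections and buffers are released promptly.
    static string Fetch(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```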
- xyber - 2009-07-17 20:38
ooh.. sounds like a bad one. Guess it will happen in my app too then :/ Will be scanning my whole TV folder through the weekend. Think I'll run in debug and see what happens.
- smeehrrr - 2009-07-17 22:09
The leak is in the Zip library. I'm going to pull that stuff out and replace it with the zip support in the .NET framework.
- smeehrrr - 2009-07-18 01:50
OK, System.IO.Packaging is just made of FAIL. It can't open the zip files returned from tvdb because they don't contain a [Content_Types].xml part, which the packaging format requires.
I guess I'll go get the source to SharpZipLib and see if I can fix the problem there.
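For anyone following along: System.IO.Packaging implements the Open Packaging Conventions, which mandate a [Content_Types].xml part, so it rejects plain zips like the ones tvdb serves. SharpZipLib has no such requirement. A minimal extraction sketch with SharpZipLib's `ZipInputStream` might look like this (paths and method name are illustrative only):

```csharp
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

class TvdbZipDemo
{
    // Sketch: read every entry from a plain zip archive, with no
    // expectation of an OPC [Content_Types].xml part.
    static void ExtractAll(string zipPath, string destDir)
    {
        using (var zipStream = new ZipInputStream(File.OpenRead(zipPath)))
        {
            ZipEntry entry;
            while ((entry = zipStream.GetNextEntry()) != null)
            {
                if (entry.IsDirectory) continue;
                string outPath = Path.Combine(destDir, entry.Name);
                Directory.CreateDirectory(Path.GetDirectoryName(outPath));
                using (var outFile = File.Create(outPath))
                {
                    var buffer = new byte[4096];
                    int read;
                    while ((read = zipStream.Read(buffer, 0, buffer.Length)) > 0)
                        outFile.Write(buffer, 0, read);
                }
            }
        }
    }
}
```

The `using` blocks also sidestep the kind of undisposed-stream leak discussed above.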