XBMC Community Forum
ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... - Printable Version

+--- Thread: ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work... (/showthread.php?tid=50055)



- xyber - 2009-07-13 21:02

Nicezia Wrote:new version committed to svn ... shouldn't throw any exceptions, and should log any error instances and the scraper results. (and when i say this i mean the library itself not the test programs as they are only loosely knitted programs to demonstrate usage of the library)

TvShows are working (at least TvRage & tvdb are, have to investigate to see if tv.com & imdb tv scrapers are working properly in xbmc before attempting a fix - but that's work for tomorrow)

Thanks for the update, and thanks for making this lib. I can't imagine how much time I would have to spend doing this myself.

I just did a check on Californication, Heroes and Dexter in XBMC. imdb and tv.com weren't of much use and I ended up with a bad-looking library. TheTVDB scraper had good results.


- xyber - 2009-07-14 19:02

Dunno if you changed this yet but the properties in the TVEpisodeTag class are not set to public.


- Nicezia - 2009-07-14 21:26

xyber Wrote:Dunno if you changed this yet but the properties in the TVEpisodeTag class are not set to public.

you sure you're not playing around with the old VB source (which by the way isn't being updated anymore)? Because all the tags in use have been set to public in TechNuts.MediaTags (the CSharp library).


- Nicezia - 2009-07-14 21:46

spiff Wrote:the joys of not using svn:eol native ;)

Hey spiffster

i need to know a few things

1.) how are multiple url's for functions handled?

2.) can there be some kind of standardization set for NfoUrl. It seems EVERY scraper returns something different. It'd be nice if that could be narrowed down to 2 standard return values. (just a request)

3.) could you go back in time and un-make that comment about me making my little scraper tester program into a library... cause right now i'm about to pull my hair out!! when it was just a program to test scrapers it was easy, but now trying to anticipate all the ways it might be used, Geesh!

Just kidding, of course, i'm actually enjoying programming this, i just wish i could follow c++ better so i could know exactly how everything was being done internally in XBMC (maybe i should look into tinyxml because .Net Xml handlers can be quite inflexible).


- Nicezia - 2009-07-15 01:46

Going to update svn again tomorrow, adding functions to return the MediaTag objects rather than string, in this way it can be directly integrated into Media Manager Programs.

The functions that return strings will still be there and it will be the program's choice which to use.


- spiff - 2009-07-15 10:11

1) we fill up the buffers starting with 1, second url is stuffed in 2 and so on. this semantic is valid everywhere, for those functions which are passed e.g. title in buffer 2, it should be "the buffer after the last url"
2) i don't get what you mean here? nfourl ultimately returns an url?
3) the cure for that one is called 'beer' ;)

snippet from the c++ code;

Code:
// download url's
vector<CStdString> strHTML;
for (unsigned int i=0;i<scrURL.m_url.size();++i)
{
  CStdString strCurrHTML;
  if (!CScraperUrl::Get(scrURL.m_url[i],strCurrHTML,m_http) || strCurrHTML.size() == 0)
    return 0;
  strHTML.push_back(strCurrHTML);
}

// fill buffers
for (unsigned int i=0;i<strHTML.size();++i)
  m_parser.m_param[i] = strHTML[i];
m_parser.m_param[strHTML.size()] = scrURL.m_url[0].m_url;
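
The buffer layout in that snippet can be restated as a small standalone sketch. It assumes that m_param is zero-indexed, so that m_param[0] is what scraper XML calls $$1; the names fillBuffers and bufferAfterLastUrl are made up for illustration and are not XBMC functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds the buffer list for one scraper function call.
// Assumption: index 0 corresponds to $$1, index 1 to $$2, and so on.
// Downloaded pages come first; the first url goes in the slot after them.
std::vector<std::string> fillBuffers(const std::vector<std::string>& pages,
                                     const std::string& firstUrl) {
    assert(!pages.empty());  // the original code bails out on an empty download
    std::vector<std::string> buffers(pages);
    buffers.push_back(firstUrl);
    return buffers;
}

// Name of "the buffer after the last url", e.g. "$$3" for two pages.
std::string bufferAfterLastUrl(std::size_t pageCount) {
    return "$$" + std::to_string(pageCount + 1);
}
```

With two downloaded pages the pages sit in $$1 and $$2 and the extra value lands in $$3; with one page it lands in $$2, which matches the "title in buffer 2" case mentioned above.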

btw, you want to have a look at http://trac.xbmc.org/changeset/21675


- xyber - 2009-07-15 17:33

Nicezia Wrote:Going to update svn again tomorrow, adding functions to return the MediaTag objects rather than string, in this way it can be directly integrated into Media Manager Programs.

The functions that return strings will still be there and it will be the program's choice which to use.

Yay :D I wanted to request this but was happy enough with the string.

On the other point, I am using TechNuts.MediaTags as I work in C#. Will do a clean grab off svn once you've updated and check those tags again.

[edit] Almost done with TV section and then I just need to clean up a bit and will release the app. If you want access to the source I'll send it to you.


- Nicezia - 2009-07-16 22:39

spiff Wrote:2) i don't get what you mean here? nfourl ultimately returns an url?

there are some that return a string
some that return a <url> element
and some that return a <url> element and an <id> element (mind you, not enclosed in a tag either, just as siblings, which makes it difficult to handle with .Net xml handlers)
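
One way to cope with those three shapes is to normalize them into a single structure before the rest of the code sees them. The sketch below is only an illustration: NfoResult, elementText and normalizeNfoUrl are hypothetical names, not part of ScraperXML or XBMC, and the tag matching assumes attribute-free <url> and <id> tags.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for illustration only.
struct NfoResult {
    std::string url;
    std::string id;  // stays empty when the scraper returned no <id>
};

// Copies the text between <tag> and </tag> into out.
// Returns false if the tag is missing or not closed.
bool elementText(const std::string& xml, const std::string& tag, std::string& out) {
    assert(!tag.empty());
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    const std::size_t start = xml.find(open);
    if (start == std::string::npos) return false;
    const std::size_t from = start + open.size();
    const std::size_t end = xml.find(close, from);
    if (end == std::string::npos) return false;
    out = xml.substr(from, end - from);
    return true;
}

// Accepts a bare string, a lone <url>, or sibling <url> and <id> elements.
NfoResult normalizeNfoUrl(const std::string& raw) {
    NfoResult result;
    if (elementText(raw, "url", result.url)) {
        elementText(raw, "id", result.id);  // optional sibling
    } else {
        result.url = raw;  // bare string form
    }
    return result;
}
```

For a strict XML parser, a common workaround for rootless sibling elements is to wrap the fragment in a throwaway root element before parsing it.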

spiff Wrote:1) we fill up the buffers starting with 1, second url is stuffed in 2 and so on. this semantic is valid everywhere, for those functions which are passed e.g. title in buffer 2, it should be "the buffer after the last url"

every time i think i've got this stuff pegged there's something new that i have to completely redesign for... well, not completely redesign; in fact, with the way i have things now, there's only a small section of each content handler that needs to be redesigned.

However, with multiple urls, what do you use for the url value if there needs to be a url value in the buffer?

there are a whole lot of things that are too often variable... kinda think things could have a more rigid standard and still be flexible (and this would make the code more portable as well), but that would be too easy and i would have been done with this thing ages ago.. then what would i do?

ideas to consider?
1) For the multiple url's i think it would be a better idea to run the get details function against each (i.e. download one page then run getdetails, download the next page then run getdetails again, and so on), so that the only buffer needed for html is $$1, then run the custom functions after all urls referred to are finished

Edit: I just thought of a few complications this might cause if the different pages have matches for something they shouldn't match. Perhaps have them run as custom functions? in fact that's what i thought custom functions were for.

2. To make the scraper more flexible: Use buffer $$1 and $$2 for storing values only - buffer $$2 holding the item to retrieve from the last (standard, not custom) function.
i.e. CreateSearchUrl's results are passed to $$2 (or, in light of what you just told me about the chaining to buffers, the next buffer after the last HTML) for use with GetSearchResults; the <entity> to retrieve from GetSearchResults is passed to $$2 on select to be used with GetDetails
in this way the user can gather whatever info they want in the GetSearchResults function and reuse it in the GetDetails function. Also all necessary urls, ids and titles will be there for use

3. add a urlencode option to RegExp or expression so that it will be possible to pull things like artist name or title and UrlEncode them for fanart and thumbnail searches.

Code:
<RegExp input="$$2" output="\1" urlencode="yes" dest="3">
               <expression clear="yes">&lt;artist&gt;(.+)&lt;/artist&gt;</expression>
         </RegExp>

4. Limit NfoUrl returns to either a <entity> Element (containing things like id, title, url elements) or an <url> Element since its value is used directly in GetDetails

Code:
<NfoUrl dest="4">
    <RegExp input="$$3" output="&lt;entity&gt;\1&lt;/entity&gt;" dest="4">
        <RegExp input="$$1" output="&lt;url&gt;http://www.\1/title/tt\2/&lt;/url&gt;&lt;id&gt;tt\2&lt;/id&gt;" dest="3">
            <expression clear="yes" noclean="1">(imdb.com/)Title\?([0-9]*)</expression>
        </RegExp>
        <expression noclean="1">(.+)</expression>
    </RegExp>
</NfoUrl>

or

Code:
<NfoUrl dest="4">
    <RegExp input="$$1" output="&lt;url&gt;http://www.\1/title/tt\2/&lt;/url&gt;" dest="4">
        <expression clear="yes" noclean="1">(imdb.com/)Title\?([0-9]*)</expression>
    </RegExp>
</NfoUrl>


Most of the problems I'm running into in scraperXML come from there not being an across-the-board standard: each scraper type needs different options set to buffers. By using just $$2 to hold the values each scraper needs as a standard, you can use the same code for all scrapers, except in "CreateSearchUrl", where it will still be necessary to pass values specific to each scraper. The urlencode option can handle setting things that need to be url encoded for custom functions.
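
To make the urlencode idea from point 3 concrete, here is a minimal percent-encoding sketch of what such an option might do to a captured value. The function name urlEncode is an assumption for illustration; it is not an existing XBMC or ScraperXML call, and it encodes raw bytes without any charset conversion.

```cpp
#include <cctype>
#include <string>

// Percent-encodes every byte except the unreserved characters
// (letters, digits, '-', '_', '.', '~').
std::string urlEncode(const std::string& text) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(text.size() * 3);
    for (unsigned char c : text) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];    // high nibble
            out += hex[c & 0x0F];  // low nibble
        }
    }
    return out;
}
```

An artist name such as "AC/DC live" would come out as "AC%2FDC%20live", which can then be dropped straight into a fanart or thumbnail search url.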


- Nicezia - 2009-07-16 23:22

sorry about missing my commit yesterday, Personal business got in the way.

commit finished now though (however i haven't made demos for the Tag Returns, and i haven't yet finished the TVshow tag returns)

going to have to do a bit of internal changes with this new information coming to light.


- spiff - 2009-07-17 00:38

i'm tired so i'll comment on the rest tomorrow.

nfourl needs to be able to chain. all scraper functions are allowed to chain - everywhere. it's used for e.g. translating imdb id's into ofdb id's, tmdb id's etc. the answer to all of this is recursive code which conditionally calls itself if it finds a <url> element in the xml
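
The recursion described there might look roughly like the sketch below. It is not the XBMC implementation: Fetch and RunFunction are hypothetical callbacks standing in for the HTTP download and for running a scraper function over a page, the <url> matching assumes a tag without attributes, and the maxDepth guard is an added assumption to stop a scraper that keeps pointing at itself.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical callbacks; neither is an XBMC API.
using Fetch = std::function<std::string(const std::string& url)>;
using RunFunction = std::function<std::string(const std::string& page)>;

// Returns the text inside the first <url>...</url>, or "" if there is none.
std::string firstUrl(const std::string& xml) {
    const std::string open = "<url>";
    const std::string close = "</url>";
    const std::size_t start = xml.find(open);
    if (start == std::string::npos) return "";
    const std::size_t from = start + open.size();
    const std::size_t end = xml.find(close, from);
    if (end == std::string::npos) return "";
    return xml.substr(from, end - from);
}

// If the result carries a <url>, download it, run the scraper function on
// that page, and recurse on the new result. Otherwise the result is final.
std::string resolveChain(const std::string& result, const Fetch& fetch,
                         const RunFunction& run, int maxDepth = 8) {
    assert(maxDepth >= 0);
    const std::string url = firstUrl(result);
    if (url.empty() || maxDepth == 0) return result;
    return resolveChain(run(fetch(url)), fetch, run, maxDepth - 1);
}
```

In the imdb-to-ofdb case this would mean the first result names a lookup url, the page behind it yields either another <url> or the final id, and the loop ends as soon as no <url> is left.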