ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work...

Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #31
Here's an idea of how I want the library to interface with programs:

[Image: Image1.jpg]

As you can see, I want programs to have to do very little of the actual processing of scraper information.
(This post was last modified: 2009-05-11 06:02 by Nicezia.)
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #32
Because I want to account for all the behaviours of the XBMC parser...

There's something I've run across with the TVdb scraper: a RegExp without an input buffer reference. I'm assuming that if there is no input reference, it automatically uses buffer $$1?
nul7 Offline
Posting Freak
Posts: 875
Joined: Oct 2008
Reputation: 14
Post: #33
There is also an issue with the allocine scraper that I didn't know how to handle:

Code:
<RegExp input="$$1" output="&lt;setting label=&quot;Activer les Vignettes d'acteurs&quot; type=&quot;bool&quot; id=&quot;actor&quot; default=&quot;falsetrue&quot;&gt;&lt;/setting&gt;" dest="5+">

I just changed the actual XML file to read "true" instead of "falsetrue" so it could be parsed correctly. Is this intentional, and does it need to be coded around? If so, what exactly are we doing here?
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #34
Yeah, there are also scrapers that use dest="5+" on the first RegExp to be executed. For some reason my recursive code goes over the process twice (I haven't been able to nail down why, although if you looked at the source, there is a place I suspected was the cause), so with the TVdb scraper, for example, the result ends up twice as long as it should be.
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #35
Code:
<details>
  <id>http://akas.imdb.com/title/tt0120601/</id>
  <title>Being John Malkovich</title>
  <year>1999</year>
  <top250 />
  <mpaa>Rated R for language and sexuality.</mpaa>
  <certification>Singapore:R(A)</certification>
  <certification>Portugal:M/16</certification>
  <certification>Philippines:R-18</certification>
  <certification>Brazil:18</certification>
  <certification>Argentina:13</certification>
  <certification>Australia:MA</certification>
  <certification>Belgium:KT</certification>
  <certification>Canada:14A</certification>
  <certification>Canada:G (Québec)</certification>
  <certification>Chile:14</certification>
  <certification>Finland:K-12</certification>
  <certification>France:U</certification>
  <certification>Germany:12 (w)</certification>
  <certification>Hong Kong:IIB</certification>
  <certification>Ireland:15</certification>
  <certification>Italy:T</certification>
  <certification>Netherlands:AL</certification>
  <certification>New Zealand:M</certification>
  <certification>South Korea:18</certification>
  <certification>Spain:13</certification>
  <certification>Sweden:11</certification>
  <certification>Switzerland:16 (canton of Geneva)</certification>
  <certification>Switzerland:16 (canton of Vaud)</certification>
  <certification>UK:15</certification>
  <certification>USA:R (certificate #36965)</certification>
  <certification>Norway:11</certification>
  <certification>Iceland:L (original rating)</certification>
  <certification>Iceland:LH (video rating)</certification>
  <certification>Singapore:M18 (DVD rating)</certification>
  <tagline>Ever wanted to be someone else? Now you can.</tagline>
  <runtime>112 min | Canada:113 min</runtime>
  <rating>7.9</rating>
  <votes>101,110</votes>
  <genre>Comedy</genre>
  <genre>Drama</genre>
  <genre>Fantasy</genre>
  <studio>Gramercy Pictures (I)</studio>
  <outline />
  <plot />
  <url function="GetMoviePlot">
  <details>
  <id>http://akas.imdb.com/title/tt0120601/</id>
  <title>Being John Malkovich</title>
  <year>1999</year>
  <top250 />
  <mpaa>Rated R for language and sexuality.</mpaa>
  <certification>Singapore:R(A)</certification>
  <certification>Portugal:M/16</certification>
  <certification>Philippines:R-18</certification>
  <certification>Brazil:18</certification>
  <certification>Argentina:13</certification>
  <certification>Australia:MA</certification>
  <certification>Belgium:KT</certification>
  <certification>Canada:14A</certification>
  <certification>Canada:G (Québec)</certification>
  <certification>Chile:14</certification>
  <certification>Finland:K-12</certification>
  <certification>France:U</certification>
  <certification>Germany:12 (w)</certification>
  <certification>Hong Kong:IIB</certification>
  <certification>Ireland:15</certification>
  <certification>Italy:T</certification>
  <certification>Netherlands:AL</certification>
  <certification>New Zealand:M</certification>
  <certification>South Korea:18</certification>
  <certification>Spain:13</certification>
  <certification>Sweden:11</certification>
  <certification>Switzerland:16 (canton of Geneva)</certification>
  <certification>Switzerland:16 (canton of Vaud)</certification>
  <certification>UK:15</certification>
  <certification>USA:R (certificate #36965)</certification>
  <certification>Norway:11</certification>
  <certification>Iceland:L (original rating)</certification>
  <certification>Iceland:LH (video rating)</certification>
  <certification>Singapore:M18 (DVD rating)</certification>
  <tagline>Ever wanted to be someone else? Now you can.</tagline>
  <runtime>112 min | Canada:113 min</runtime>
  <rating>7.9</rating>
  <votes>101,110</votes>
  <genre>Comedy</genre>
  <genre>Drama</genre>
  <genre>Fantasy</genre>
  <studio>Gramercy Pictures (I)</studio>
  <outline />
  <plot />
  <url function="GetMoviePlot">$$3plotsummary</url>
  <url cache="$$2-credits.html" function="GetMovieCast">$$3</url>
  <url cache="$$2-credits.html" function="GetMovieDirectors">$$3</url>
  <url cache="$$2-credits.html" function="GetMovieWriters">$$3</url>
  <url cache="$$2-fullcredits.html" function="GetMovieCast">$$3fullcredits</url>
  <url cache="$$2-fullcredits.html" function="GetMovieDirectors">$$3fullcredits</url>
  <url cache="$$2-fullcredits.html" function="GetMovieWriters">$$3fullcredits</url>
  <url cache="$$2-posters.html" function="GetIMPALink">$$3posters</url>
  <url function="GetMoviePosterDBLink">http://www.movieposterdb.com/browse/search?title=0120601</url>
  <url function="GetTrailer">http://akas.imdb.com/video/imdb/vi3585671449</url>
  <url cache="$$2-posters.html" function="GetIMDBPoster">$$3posters</url>
  <url function="GetFanart">http://api.themoviedb.org/backdrop.php?imdb=$$2</url>
  </details>
  plotsummary
  </url>
  </details>

Okay, I'm down to the final little bit of getting my code completely working (all the simpler scrapers work; it's the IMDB one I'm working on now), and this is what I'm coming up with before processing custom functions. So what exactly is <url cache> for? What does it do, or tell XBMC to do?
(This post was last modified: 2009-05-12 13:33 by Nicezia.)
sbass Offline
Junior Member
Posts: 2
Joined: May 2009
Reputation: 0
Post: #36
Nicezia,

I've been following the XMLParser work by Spiff followed by the library work by yourself. I'm very interested in seeing something like this implemented for MeediOS in order to avoid issues of scrapers breaking plugins, etc. My question for you is this (and know that I didn't look at the code yet):

Is there a mechanism for the parser library to just fetch the top search result without returning a list of selections for the user to choose the best match?

The reason I'm asking is that while I personally prefer user-initiated matching, some people like to set up importers to auto-import and then manually correct within the UI if necessary.

Finally, if we end up determining that some particular fields of interest are missing in an XMLScraper file, how are you guys managing modifications or versioning of those scripts?

Thanks.

Shawn
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #37
Yes, I'm working on that as an alternative.

I've run into situations where there is only an EXACT match, so I'm trying to code in the ability to return only the first match.

I think, however, that will be set up as a "Return Only Exact Match" setting; naturally, that would be dependent on supplying the right information for a search.

IMDB especially is bad about not returning the exact match first, so I'm working on a routine that will sort the matches (I don't know if the XBMC scraper system has that, but I plan to implement it in mine).
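
One way such a sorting routine could look, as a rough sketch in Python rather than the library's C#: rank the returned titles by string similarity to the search term, and treat a case-insensitive equal title as the exact match. The function names and the sample result list are made up for illustration; `difflib.SequenceMatcher` is only one of several possible similarity measures.

```python
from difflib import SequenceMatcher


def rank_matches(query, titles):
    """Return titles ordered from most to least similar to the query."""
    def score(title):
        # ratio() is 1.0 for identical strings and lower for weaker matches.
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()
    return sorted(titles, key=score, reverse=True)


def exact_match_only(query, titles):
    """Return the first title equal to the query (ignoring case), or None."""
    for title in titles:
        if title.lower() == query.lower():
            return title
    return None


# Hypothetical search results, in the order a site might return them.
results = ["Being There", "John Malkovich: A Portrait", "Being John Malkovich"]
ranked = rank_matches("being john malkovich", results)
exact = exact_match_only("being john malkovich", results)
```

A "Return Only Exact Match" setting would then hand back `exact`, while the normal path would show `ranked` to the user.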
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #38
cache is a local filename. we cache the url to that file. it is used for speeding up running several functions on the same page
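
A minimal sketch of that idea in Python (not XBMC's or ScraperXML's actual code): the first function that needs a page downloads it and writes it to the cache file, and later functions read the file instead of fetching again. `fetch_cached`, the `download` callback, the example URL and the expanded cache name are all hypothetical.

```python
import os
import tempfile


def fetch_cached(url, cache_name, cache_dir, download):
    """Return the page for url, reusing cache_dir/cache_name when it exists."""
    path = os.path.join(cache_dir, cache_name)
    if os.path.exists(path):
        # An earlier function already fetched this page: read the local copy.
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    page = download(url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(page)
    return page


# Stand-in for a real HTTP fetch so the sketch runs offline.
calls = []

def fake_download(url):
    calls.append(url)
    return "<html>credits page</html>"


with tempfile.TemporaryDirectory() as cache_dir:
    # Three functions share one cache name, as in the <url cache="..."> lines above.
    for function in ("GetMovieCast", "GetMovieDirectors", "GetMovieWriters"):
        page = fetch_cached("http://example.invalid/credits",
                            "tt0120601-credits.html", cache_dir, fake_download)

download_count = len(calls)
```

Here the page is downloaded once and read from disk for the other two functions.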
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #39
sbass, that's the beauty of xml. if a field is available, you parse it. if not, you do not. the scrapers have no clue what they are returning as such, to them all is text.

if these guys get adopted more widely, we should definitely add a version field to them. since it's all xml, again no problem :)
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #40
and yes, we have sorting etc in our system. but that is all external to the scraper stuff
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #41
Yeah, one thing I've definitely learned about the XBMC scrapers is that you can pull all the information in the world... it's actually the PROGRAM that decides what to use.

You can use this to scrape any information you would ever want to!
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #42
spiff Wrote:cache is a local filename. we cache the url to that file. it is used for speeding up running several functions on the same page

Ah ha! That makes a whole lot of sense!

This looks like a job for System.IO.StreamWriter/StreamReader!
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #43
At what point are references to the buffers replaced? After the whole function runs, right after the expression matches are applied to the output, or before the RegExp even runs?

I think that's the key to why I keep getting some scrapers working and breaking others.
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #44
you can use buffers in

1) the input - replaced before running the expression
2) the expressions themselves - replaced before compiling the expressions
3) the output string - replaced after obtaining the replacestring from the regexp
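
The three stages above can be sketched roughly like this, in Python rather than the library's C#. This is an illustration under assumptions, not XBMC's parser: it assumes `\1`-style references in the output stand for regex match groups, that only the first match is used, and that a `+` on dest means "append to the buffer". All function names are hypothetical.

```python
import re


def expand_buffers(text, buffers):
    """Replace every $$N in text with the contents of buffer N (empty if unset)."""
    return re.sub(r"\$\$(\d+)", lambda m: buffers.get(int(m.group(1)), ""), text)


def group_or_empty(match, index):
    """Return match group `index`, or "" if it is missing or did not participate."""
    if index <= match.re.groups:
        return match.group(index) or ""
    return ""


def run_regexp(input_spec, expression, output_spec, dest, buffers, append=False):
    # 1) input: buffers are replaced before running the expression
    source = expand_buffers(input_spec, buffers)
    # 2) expression: buffers are replaced before compiling it
    pattern = re.compile(expand_buffers(expression, buffers))
    match = pattern.search(source)
    if match is None:
        return  # assumption: no match leaves the destination buffer untouched
    # Build the replace string from the match (assumed \1..\9 group references).
    replaced = re.sub(r"\\(\d)",
                      lambda m: group_or_empty(match, int(m.group(1))),
                      output_spec)
    # 3) output: buffers are replaced after obtaining the replace string
    result = expand_buffers(replaced, buffers)
    buffers[dest] = (buffers.get(dest, "") + result) if append else result


buffers = {
    1: "<title>Being John Malkovich</title>",
    2: "title",
    3: "http://akas.imdb.com/title/tt0120601/",
}
run_regexp("$$1", "<$$2>([^<]*)</$$2>",
           "<url>$$3plotsummary</url><title>\\1</title>", 5, buffers)
```

Doing all three replacements in a single pass would expand `$$3` in the output before the match groups are filled in, which is the "all at once" approach this ordering avoids.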
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #45
Thank you. See, I was trying to do it all at once; I didn't know there were three different times. That ought to clear my IMDB problem right up.