ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work...

Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #31
Here's an idea of how I want the library to interface with programs:

[Image: Image1.jpg]

As you can see, I want the programs to have to do very little in the way of actually processing any scraper information.
(This post was last modified: 2009-05-11 06:02 by Nicezia.)
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #32
Because I want to account for all the behaviours of the XBMC parser...

There's something I've run across with the TVdb scraper: a RegExp without an input buffer reference. I'm assuming that if there is no input reference, it automatically uses buffer $$1?
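
A minimal sketch of the behaviour being asked about, assuming (as the question guesses, and not confirmed anywhere in this thread) that a RegExp element with no input attribute reads from buffer $$1. The names `expand_buffers` and `resolve_input` and the list-of-buffers layout are hypothetical, not part of the library or of XBMC.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper: expand $$N references against a list of buffers,
# where buffers[0] holds $$1, buffers[1] holds $$2, and so on.
def expand_buffers(text, buffers):
    def lookup(match):
        index = int(match.group(1)) - 1
        return buffers[index] if 0 <= index < len(buffers) else ""
    return re.sub(r"\$\$(\d+)", lookup, text)

# ASSUMPTION: a missing input attribute is treated as input="$$1".
def resolve_input(regexp_element, buffers):
    raw = regexp_element.get("input", "$$1")
    return expand_buffers(raw, buffers)

buffers = ["page html", "second buffer"]
with_input = ET.fromstring('<RegExp input="$$2" output="\\1" dest="3"/>')
without_input = ET.fromstring('<RegExp output="\\1" dest="3"/>')

explicit = resolve_input(with_input, buffers)     # reads $$2
implicit = resolve_input(without_input, buffers)  # falls back to $$1
```

Keeping the fallback in one place means the rest of the parser never has to care whether the attribute was present.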
nul7 Offline
Posting Freak
Posts: 875
Joined: Oct 2008
Reputation: 14
Post: #33
There is also an issue with the allocine scraper that I didn't know how to handle:

Code:
<RegExp input="$$1" output="&lt;setting label=&quot;Activer les Vignettes d'acteurs&quot; type=&quot;bool&quot; id=&quot;actor&quot; default=&quot;falsetrue&quot;&gt;&lt;/setting&gt;" dest="5+">

Note the default value of "falsetrue". I just changed the actual XML file to read "true" so it could be parsed correctly. Is this intentional, and does it need to be coded around? If so, what exactly are we doing here?
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #34
Yeah, there are also scrapers that use dest="5+" on the first RegExp to be executed. For some reason my recursive code goes over the process twice, and I haven't been able to nail down why (although, if you looked at the source, there was a place I suspected was the cause). So with the TVdb scraper, for example, the output ends up twice as long as it should be.
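
To illustrate why a doubled pass shows up as doubled output, here is a small sketch. It assumes that a trailing `+` on dest means "append to the buffer rather than overwrite it"; `store_result` and the 20-slot buffer list are hypothetical names and sizes chosen for the example.

```python
# Hypothetical helper: write a result into the buffer named by a dest
# attribute. ASSUMPTION: "5" overwrites buffer 5, "5+" appends to it.
def store_result(buffers, dest, value):
    append = dest.endswith("+")
    index = int(dest.rstrip("+")) - 1
    if append:
        buffers[index] += value
    else:
        buffers[index] = value
    return buffers

buffers = [""] * 20

store_result(buffers, "5+", "<details/>")
once = buffers[4]

# An accidental second pass over the same expression appends again.
store_result(buffers, "5+", "<details/>")
twice = buffers[4]

# Without the "+", a second pass would simply overwrite and hide the bug.
store_result(buffers, "6", "<details/>")
store_result(buffers, "6", "<details/>")
overwritten = buffers[5]
```

Under that assumption, an expression that is evaluated twice only becomes visible when its dest appends, which would fit the symptom of output that is exactly twice as long.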
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #35
Code:
- <details>
  <id>http://akas.imdb.com/title/tt0120601/</id>
  <title>Being John Malkovich</title>
  <year>1999</year>
  <top250 />
  <mpaa>Rated R for language and sexuality.</mpaa>
  <certification>Singapore:R(A)</certification>
  <certification>Portugal:M/16</certification>
  <certification>Philippines:R-18</certification>
  <certification>Brazil:18</certification>
  <certification>Argentina:13</certification>
  <certification>Australia:MA</certification>
  <certification>Belgium:KT</certification>
  <certification>Canada:14A</certification>
  <certification>Canada:G (Québec)</certification>
  <certification>Chile:14</certification>
  <certification>Finland:K-12</certification>
  <certification>France:U</certification>
  <certification>Germany:12 (w)</certification>
  <certification>Hong Kong:IIB</certification>
  <certification>Ireland:15</certification>
  <certification>Italy:T</certification>
  <certification>Netherlands:AL</certification>
  <certification>New Zealand:M</certification>
  <certification>South Korea:18</certification>
  <certification>Spain:13</certification>
  <certification>Sweden:11</certification>
  <certification>Switzerland:16 (canton of Geneva)</certification>
  <certification>Switzerland:16 (canton of Vaud)</certification>
  <certification>UK:15</certification>
  <certification>USA:R (certificate #36965)</certification>
  <certification>Norway:11</certification>
  <certification>Iceland:L (original rating)</certification>
  <certification>Iceland:LH (video rating)</certification>
  <certification>Singapore:M18 (DVD rating)</certification>
  <tagline>Ever wanted to be someone else? Now you can.</tagline>
  <runtime>112 min | Canada:113 min</runtime>
  <rating>7.9</rating>
  <votes>101,110</votes>
  <genre>Comedy</genre>
  <genre>Drama</genre>
  <genre>Fantasy</genre>
  <studio>Gramercy Pictures (I)</studio>
  <outline />
  <plot />
- <url function="GetMoviePlot">
- <details>
  <id>http://akas.imdb.com/title/tt0120601/</id>
  <title>Being John Malkovich</title>
  <year>1999</year>
  <top250 />
  <mpaa>Rated R for language and sexuality.</mpaa>
  <certification>Singapore:R(A)</certification>
  <certification>Portugal:M/16</certification>
  <certification>Philippines:R-18</certification>
  <certification>Brazil:18</certification>
  <certification>Argentina:13</certification>
  <certification>Australia:MA</certification>
  <certification>Belgium:KT</certification>
  <certification>Canada:14A</certification>
  <certification>Canada:G (Québec)</certification>
  <certification>Chile:14</certification>
  <certification>Finland:K-12</certification>
  <certification>France:U</certification>
  <certification>Germany:12 (w)</certification>
  <certification>Hong Kong:IIB</certification>
  <certification>Ireland:15</certification>
  <certification>Italy:T</certification>
  <certification>Netherlands:AL</certification>
  <certification>New Zealand:M</certification>
  <certification>South Korea:18</certification>
  <certification>Spain:13</certification>
  <certification>Sweden:11</certification>
  <certification>Switzerland:16 (canton of Geneva)</certification>
  <certification>Switzerland:16 (canton of Vaud)</certification>
  <certification>UK:15</certification>
  <certification>USA:R (certificate #36965)</certification>
  <certification>Norway:11</certification>
  <certification>Iceland:L (original rating)</certification>
  <certification>Iceland:LH (video rating)</certification>
  <certification>Singapore:M18 (DVD rating)</certification>
  <tagline>Ever wanted to be someone else? Now you can.</tagline>
  <runtime>112 min | Canada:113 min</runtime>
  <rating>7.9</rating>
  <votes>101,110</votes>
  <genre>Comedy</genre>
  <genre>Drama</genre>
  <genre>Fantasy</genre>
  <studio>Gramercy Pictures (I)</studio>
  <outline />
  <plot />
  <url function="GetMoviePlot">$$3plotsummary</url>
  <url cache="$$2-credits.html" function="GetMovieCast">$$3</url>
  <url cache="$$2-credits.html" function="GetMovieDirectors">$$3</url>
  <url cache="$$2-credits.html" function="GetMovieWriters">$$3</url>
  <url cache="$$2-fullcredits.html" function="GetMovieCast">$$3fullcredits</url>
  <url cache="$$2-fullcredits.html" function="GetMovieDirectors">$$3fullcredits</url>
  <url cache="$$2-fullcredits.html" function="GetMovieWriters">$$3fullcredits</url>
  <url cache="$$2-posters.html" function="GetIMPALink">$$3posters</url>
  <url function="GetMoviePosterDBLink">http://www.movieposterdb.com/browse/search?title=0120601</url>
  <url function="GetTrailer">http://akas.imdb.com/video/imdb/vi3585671449</url>
  <url cache="$$2-posters.html" function="GetIMDBPoster">$$3posters</url>
  <url function="GetFanart">http://api.themoviedb.org/backdrop.php?imdb=$$2</url>
  </details>
  plotsummary
  </url>
  </details>

Okay, I'm down to the final little bit of getting my code completely working (all the simpler ones work; it's the IMDB one I'm working on now), and this is what I'm coming up with before processing custom functions. So what exactly is <url cache> for? What does it do, or tell XBMC to do?
(This post was last modified: 2009-05-12 13:33 by Nicezia.)
sbass Offline
Junior Member
Posts: 2
Joined: May 2009
Reputation: 0
Post: #36
Nicezia,

I've been following the XMLParser work by Spiff followed by the library work by yourself. I'm very interested in seeing something like this implemented for MeediOS in order to avoid issues of scrapers breaking plugins, etc. My question for you is this (and know that I didn't look at the code yet):

Is there a mechanism for the parser library to just fetch the top search result without returning a list of selections for the user to choose the best match?

The reason I'm asking is that, while I personally prefer user-initiated matching, some people like to set up importers to auto-import and then manually correct within the UI if necessary.

Finally, if we end up determining that some particular fields of interest are missing in an XMLScraper file, how are you guys managing modifications or versioning of those scripts?

Thanks.

Shawn
Nicezia Offline
Fan
Posts: 369
Joined: Nov 2006
Reputation: 0
Location: Montgomery, Alabama
Post: #37
Yes, I'm working on that as an alternative.

I've run into situations where there is an EXACT match only, so I'm trying to code in the ability to return only the first match.

I think, however, that this will be set up as a "Return Only Exact Match" setting; naturally, that would depend on supplying the right information in the search.

IMDB especially is bad about not returning the exact match first, so I'm working on a routine that will sort the matches. (I don't know if the XBMC scraper system has that, but I plan to implement it in mine.)
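
One possible shape for such a sorting routine, as a sketch only: rank the results by how similar each title is to the query so an exact match floats to the top, then optionally keep just the first. It uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure; the function names and the (title, url) tuple layout are assumptions, not anything from the library.

```python
from difflib import SequenceMatcher

# Similarity in the range 0.0 to 1.0, ignoring case.
def similarity(query, title):
    return SequenceMatcher(None, query.lower(), title.lower()).ratio()

# Best matches first; sorted() is stable, so ties keep the site's order.
def sort_matches(query, results):
    return sorted(results, key=lambda r: similarity(query, r[0]), reverse=True)

# Return the top result, or None when exact_only is set and the top
# result's title is not the query itself (case-insensitive).
def best_match(query, results, exact_only=False):
    ranked = sort_matches(query, results)
    if not ranked:
        return None
    top = ranked[0]
    if exact_only and top[0].lower() != query.lower():
        return None
    return top

results = [
    ("Being John Malkovich: Behind the Scenes", "u1"),
    ("Being John Malkovich", "u2"),
    ("Malkovich", "u3"),
]
auto_pick = best_match("being john malkovich", results)
strict_pick = best_match("Being John", results, exact_only=True)
```

An auto-importer could call `best_match` directly, while an interactive program would show the list from `sort_matches` and let the user choose.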
spiff Offline
Grumpy Bastard Developer
Posts: 12,187
Joined: Nov 2003
Reputation: 82
Post: #38
cache is a local filename. we cache the url to that file. it is used for speeding up running several functions on the same page
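
A rough sketch of that caching idea, assuming the cache attribute names a file inside some local cache directory: the first function to request a URL downloads it and writes the file, and later functions with the same cache name read the file instead. `fetch_cached`, the directory layout, and the injected `download` callable are all hypothetical.

```python
import os
import tempfile

def fetch_cached(url, cache_name, cache_dir, download):
    # No cache attribute on the <url> element: always download.
    if not cache_name:
        return download(url)
    path = os.path.join(cache_dir, cache_name)
    # Cache hit: reuse the page a previous function already fetched.
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    # Cache miss: download once and keep a local copy.
    body = download(url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body)
    return body

# Usage with a fake downloader that counts calls instead of using the network.
calls = []
def fake_download(url):
    calls.append(url)
    return "<html>credits page</html>"

cache_dir = tempfile.mkdtemp()
page_url = "http://example.invalid/credits"
cast = fetch_cached(page_url, "example-credits.html", cache_dir, fake_download)
directors = fetch_cached(page_url, "example-credits.html", cache_dir, fake_download)
```

In the IMDB output above, several custom functions share one cache name, so under this reading the credits page would be downloaded once and parsed by each of them.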

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
spiff Offline
Grumpy Bastard Developer
Posts: 12,187
Joined: Nov 2003
Reputation: 82
Post: #39
sbass, that's the beauty of xml. if a field is available, you parse it. if not, you do not. the scrapers have no clue what they are returning as such, to them all is text.

if these guys get adopted more widely, we should definitely add a version field to them. since it's all xml, again no problem :)
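
The "parse it if it is there" approach could look something like the sketch below on the consuming side. It uses Python's standard `xml.etree.ElementTree`; `read_details` and the list of wanted field names are hypothetical, and the sample is a shortened version of the details output posted earlier in the thread.

```python
import xml.etree.ElementTree as ET

# Collect only the fields that are present and non-empty; anything the
# scraper did not return is simply absent from the result.
def read_details(xml_text, wanted):
    root = ET.fromstring(xml_text)
    found = {}
    for name in wanted:
        values = [el.text for el in root.findall(name) if el.text]
        if values:
            found[name] = values
    return found

sample = """<details>
  <title>Being John Malkovich</title>
  <year>1999</year>
  <genre>Comedy</genre>
  <genre>Drama</genre>
  <plot />
</details>"""

info = read_details(sample, ["title", "year", "genre", "plot", "studio"])
```

Here the empty plot element and the missing studio element are both skipped without any special handling, so a program can ask for new fields without breaking on older scrapers.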

spiff Offline
Grumpy Bastard Developer
Posts: 12,187
Joined: Nov 2003
Reputation: 82
Post: #40
and yes, we have sorting etc in our system. but that is all external to the scraper stuff
