ScraperXML (Open Source XML Web Scraper C# Library) please help verify my work...
#31
Here's an idea of how I want the library to interface with programs:

Image

As you can see, I want the programs to have to do very little in the way of actually processing any scraper information.
#32
Because I want to account for all the behaviours of the XBMC parser...

There's something I've run across with the TVdb scraper: a RegExp without an input buffer reference. I'm assuming that if there is no input reference, it automatically uses buffer $$1?
#33
There is also an issue with the allocine scraper that I didn't know how to handle:

Code:
<RegExp input="$$1" output="&lt;setting label=&quot;Activer les Vignettes d'acteurs&quot; type=&quot;bool&quot; id=&quot;actor&quot; default=&quot;falsetrue&quot;&gt;&lt;/setting&gt;" dest="5+">

I just changed the actual XML file to read "true" so it could be parsed correctly. Is this intentional, and does it need to be coded around? If so, what exactly are we doing here?
#34
Yeah, there are also scrapers that use dest="5+" on the first RegExp to be executed. For some reason my recursive code goes over the process twice (I haven't been able to nail down why, although if you looked at the source, there was a place I suspected was the cause), so with the TVdb scraper, for example, the output ends up twice as long as it should be.
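
To illustrate why a double pass doubles the output, here is a minimal Python sketch. It assumes that a trailing "+" on dest means "append to the buffer instead of overwriting it"; the thread does not confirm that, and the `store` helper is hypothetical, not part of ScraperXML or XBMC.

```python
def store(buffers, dest, value):
    """Write value into the buffer named by dest.

    Assumption: a trailing '+' (as in dest="5+") appends to the
    existing buffer contents; a plain number overwrites them.
    """
    append = dest.endswith("+")
    index = int(dest.rstrip("+"))
    if append:
        buffers[index] = buffers.get(index, "") + value
    else:
        buffers[index] = value
    return buffers


buffers = {}
store(buffers, "5+", "<setting/>")  # first pass
store(buffers, "5+", "<setting/>")  # accidental second pass appends again
print(buffers[5])
```

If that assumption holds, any expression with an appending dest that is evaluated twice leaves its output in the buffer twice, which would match the symptom described above.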
#35
Code:
  <details>
  <id>http://akas.imdb.com/title/tt0120601/</id>
  <title>Being John Malkovich</title>
  <year>1999</year>
  <top250 />
  <mpaa>Rated R for language and sexuality.</mpaa>
  <certification>Singapore:R(A)</certification>
  <certification>Portugal:M/16</certification>
  <certification>Philippines:R-18</certification>
  <certification>Brazil:18</certification>
  <certification>Argentina:13</certification>
  <certification>Australia:MA</certification>
  <certification>Belgium:KT</certification>
  <certification>Canada:14A</certification>
  <certification>Canada:G (Québec)</certification>
  <certification>Chile:14</certification>
  <certification>Finland:K-12</certification>
  <certification>France:U</certification>
  <certification>Germany:12 (w)</certification>
  <certification>Hong Kong:IIB</certification>
  <certification>Ireland:15</certification>
  <certification>Italy:T</certification>
  <certification>Netherlands:AL</certification>
  <certification>New Zealand:M</certification>
  <certification>South Korea:18</certification>
  <certification>Spain:13</certification>
  <certification>Sweden:11</certification>
  <certification>Switzerland:16 (canton of Geneva)</certification>
  <certification>Switzerland:16 (canton of Vaud)</certification>
  <certification>UK:15</certification>
  <certification>USA:R (certificate #36965)</certification>
  <certification>Norway:11</certification>
  <certification>Iceland:L (original rating)</certification>
  <certification>Iceland:LH (video rating)</certification>
  <certification>Singapore:M18 (DVD rating)</certification>
  <tagline>Ever wanted to be someone else? Now you can.</tagline>
  <runtime>112 min | Canada:113 min</runtime>
  <rating>7.9</rating>
  <votes>101,110</votes>
  <genre>Comedy</genre>
  <genre>Drama</genre>
  <genre>Fantasy</genre>
  <studio>Gramercy Pictures (I)</studio>
  <outline />
  <plot />
  <url function="GetMoviePlot">
  <details>
  <id>http://akas.imdb.com/title/tt0120601/</id>
  <title>Being John Malkovich</title>
  <year>1999</year>
  <top250 />
  <mpaa>Rated R for language and sexuality.</mpaa>
  <certification>Singapore:R(A)</certification>
  <certification>Portugal:M/16</certification>
  <certification>Philippines:R-18</certification>
  <certification>Brazil:18</certification>
  <certification>Argentina:13</certification>
  <certification>Australia:MA</certification>
  <certification>Belgium:KT</certification>
  <certification>Canada:14A</certification>
  <certification>Canada:G (Québec)</certification>
  <certification>Chile:14</certification>
  <certification>Finland:K-12</certification>
  <certification>France:U</certification>
  <certification>Germany:12 (w)</certification>
  <certification>Hong Kong:IIB</certification>
  <certification>Ireland:15</certification>
  <certification>Italy:T</certification>
  <certification>Netherlands:AL</certification>
  <certification>New Zealand:M</certification>
  <certification>South Korea:18</certification>
  <certification>Spain:13</certification>
  <certification>Sweden:11</certification>
  <certification>Switzerland:16 (canton of Geneva)</certification>
  <certification>Switzerland:16 (canton of Vaud)</certification>
  <certification>UK:15</certification>
  <certification>USA:R (certificate #36965)</certification>
  <certification>Norway:11</certification>
  <certification>Iceland:L (original rating)</certification>
  <certification>Iceland:LH (video rating)</certification>
  <certification>Singapore:M18 (DVD rating)</certification>
  <tagline>Ever wanted to be someone else? Now you can.</tagline>
  <runtime>112 min | Canada:113 min</runtime>
  <rating>7.9</rating>
  <votes>101,110</votes>
  <genre>Comedy</genre>
  <genre>Drama</genre>
  <genre>Fantasy</genre>
  <studio>Gramercy Pictures (I)</studio>
  <outline />
  <plot />
  <url function="GetMoviePlot">$$3plotsummary</url>
  <url cache="$$2-credits.html" function="GetMovieCast">$$3</url>
  <url cache="$$2-credits.html" function="GetMovieDirectors">$$3</url>
  <url cache="$$2-credits.html" function="GetMovieWriters">$$3</url>
  <url cache="$$2-fullcredits.html" function="GetMovieCast">$$3fullcredits</url>
  <url cache="$$2-fullcredits.html" function="GetMovieDirectors">$$3fullcredits</url>
  <url cache="$$2-fullcredits.html" function="GetMovieWriters">$$3fullcredits</url>
  <url cache="$$2-posters.html" function="GetIMPALink">$$3posters</url>
  <url function="GetMoviePosterDBLink">http://www.movieposterdb.com/browse/search?title=0120601</url>
  <url function="GetTrailer">http://akas.imdb.com/video/imdb/vi3585671449</url>
  <url cache="$$2-posters.html" function="GetIMDBPoster">$$3posters</url>
  <url function="GetFanart">http://api.themoviedb.org/backdrop.php?imdb=$$2</url>
  </details>
  plotsummary
  </url>
  </details>

Okay, I'm down to the final little bit of getting my code completely working (all the simpler scrapers work; it's the IMDB one I'm working on now), and this is what I'm coming up with before processing custom functions. So what exactly is <url cache> for? What does it do, or tell XBMC to do?
#36
Nicezia,

I've been following the XMLParser work by Spiff followed by the library work by yourself. I'm very interested in seeing something like this implemented for MeediOS in order to avoid issues of scrapers breaking plugins, etc. My question for you is this (and know that I didn't look at the code yet):

Is there a mechanism for the parser library to just fetch the top search result without returning a list of selections for the user to choose the best match?

The reason I'm asking is that, while I personally prefer user-initiated matching, some people like to set up importers to auto-import and then manually correct within the UI if necessary.

Finally, if we end up determining that some particular fields of interest are missing in an XMLScraper file, how are you guys managing modifications or versioning of those scripts?

Thanks.

Shawn
#37
Yes, I'm working on that as an alternative.

I've run into situations where there is only an EXACT match, so I'm trying to code in the ability to return only the first match.

I think, however, that it will be set up as a "Return Only Exact Match" setting. Naturally, that would be dependent on supplying the right information in a search.

IMDB especially is bad about not returning the exact match first, so I'm working on a routine that will sort the matches (I don't know if the XBMC scraper system has that, but I plan to implement it in mine).
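
A rough idea of what such a sorting routine could look like, sketched in Python with the standard-library `difflib`. The function names and the `exact_only` flag are made up for illustration; this is not the ScraperXML code.

```python
from difflib import SequenceMatcher


def rank_matches(query, titles):
    """Sort titles so the one most similar to the query comes first."""
    def score(title):
        # ratio() is 1.0 for identical strings and lower otherwise
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()
    return sorted(titles, key=score, reverse=True)


def best_match(query, titles, exact_only=False):
    """Return the top-ranked title, or None if nothing qualifies.

    With exact_only=True (a hypothetical "Return Only Exact Match"
    setting), the top title must equal the query, ignoring case.
    """
    ranked = rank_matches(query, titles)
    if not ranked:
        return None
    if exact_only and ranked[0].lower() != query.lower():
        return None
    return ranked[0]


results = ["Being Human", "Being John Malkovich", "John Q"]
print(best_match("being john malkovich", results, exact_only=True))
```

An importer that wants to auto-import could call `best_match` directly, while an interactive program could show the whole ranked list.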
#38
cache is a local filename. We cache the URL to that file. It is used to speed up running several functions on the same page.
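
As a minimal sketch of that idea (Python, not the XBMC implementation): fetch the page once, store it under the cache filename, and let later functions read the file. The `fetch_cached` helper and its `fetch` parameter are hypothetical names.

```python
import os
import urllib.request


def fetch_cached(url, cache_name, cache_dir, fetch=None):
    """Return the page text for url, reusing cache_dir/cache_name if present.

    fetch is an optional callable (url -> text) so the download step can
    be swapped out; by default the page is downloaded with urllib.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u) as response:
                return response.read().decode("utf-8", "replace")

    path = os.path.join(cache_dir, cache_name)
    if os.path.exists(path):
        # A previous function already fetched this page: read the local copy.
        with open(path, encoding="utf-8") as f:
            return f.read()

    text = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

With something like this, GetMovieCast, GetMovieDirectors and GetMovieWriters, which all share a cache name such as $$2-credits.html in the output above, would trigger only one download between them.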
#39
sbass, that's the beauty of XML. If a field is available, you parse it; if not, you don't. The scrapers have no clue what they are returning as such; to them it is all text.

If these scrapers get adopted more widely, we should definitely add a version field to them. Since it's all XML, that's again no problem :)
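
On the consuming side, "parse it if it is there" could look roughly like this Python sketch. `read_details` is a hypothetical helper, and the `version` tag is only the proposed field from above, not something the scrapers emit today.

```python
import xml.etree.ElementTree as ET


def read_details(xml_text, wanted):
    """Collect the text of each wanted tag that is present and non-empty."""
    root = ET.fromstring(xml_text)
    result = {}
    for tag in wanted:
        values = [el.text for el in root.findall(tag) if el.text]
        if values:  # absent or empty fields are simply skipped
            result[tag] = values
    return result


sample = (
    "<details><title>Being John Malkovich</title><year>1999</year>"
    "<genre>Comedy</genre><genre>Drama</genre><plot /></details>"
)
print(read_details(sample, ["title", "genre", "plot", "version"]))
```

The program decides which tags it asks for; anything else in the scraper output is ignored.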
#40
And yes, we have sorting etc. in our system, but that is all external to the scraper stuff.
#41
Yeah, one thing I've definitely learned about the XBMC scrapers is that you can pull all the information in the world... it's actually the PROGRAM that decides what to use.

You can use this to scrape any information you would ever want to!
#42
spiff Wrote: cache is a local filename. we cache the url to that file. it is used for speeding up running several functions on the same page

Ah ha! That makes a whole hell of a lot of sense!

This looks like a job for System.IO.StreamWriter/StreamReader!
#43
At what point are references to the buffers replaced? After the whole function runs, right after the expression's matches are applied to the output, or before the RegExp even runs?

I think that's the key to why I keep getting some scrapers working and breaking others.
#44
You can use buffers in:

1) the input - replaced before running the expression
2) the expressions themselves - replaced before compiling the expressions
3) the output string - replaced after obtaining the replacement string from the regexp
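
The three substitution points above can be sketched like this in Python. It is an illustration of the ordering only, not the XBMC parser: `expand_buffers` and `run_regexp` are hypothetical names, buffers are a plain dict, and `\1`-style references in the output are assumed to refer to capture groups.

```python
import re


def expand_buffers(text, buffers):
    """Replace each $$N reference with the contents of buffer N."""
    return re.sub(r"\$\$(\d+)", lambda m: buffers.get(int(m.group(1)), ""), text)


def run_regexp(input_spec, expression, output_spec, buffers):
    # 1) input: buffer references are replaced before the expression runs
    text = expand_buffers(input_spec, buffers)

    # 2) expression: buffer references are replaced before compiling
    pattern = re.compile(expand_buffers(expression, buffers))
    match = pattern.search(text)
    if match is None:
        return ""

    # Build the replacement string from the capture groups (\1, \2, ...)
    def group_text(m):
        number = int(m.group(1))
        if number > pattern.groups:
            return ""
        return match.group(number) or ""

    replaced = re.sub(r"\\(\d)", group_text, output_spec)

    # 3) output: buffer references are replaced after the replacement
    #    string has been obtained from the regexp
    return expand_buffers(replaced, buffers)


buffers = {1: "<title>Being John Malkovich</title>", 2: "title", 3: "tt0120601"}
print(run_regexp("$$1", "<$$2>([^<]*)</$$2>", r"<name>\1</name><id>$$3</id>", buffers))
```

Doing all three replacements in one pass up front would expand $$N references in the output before the capture groups are filled in, which is one way scrapers could work in some cases and break in others.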
#45
Thank you. See, I was trying to do it all at once; I didn't know there were three different times. That ought to clear my IMDB problem right up.