Settings in scrapers

pko66 Offline
Senior Member
Posts: 190
Joined: Dec 2006
Reputation: 0
Post: #1
I'm studying the IMDB scraper for the 3rd chapter of "scraper for dummies", and I would like to know if some of my first conclusions regarding settings are right. I will try them for real later (right now I have no access to an XBMC installation), but if spiff or somebody could shed some light here it would be greatly appreciated.

The settings for a scraper have two parts: first, they need to be declared to XBMC so the user can modify them (and defaults can be set); second, they need to be used in the actual scraper code.

For the first part, there is a new regexp, "getsettings", that needs to generate an XML structure like this:

<settings>
<setting label=NAME type=TYPE id=ID default=DEFAULT></setting>
</settings>

where (all are quoted strings):
NAME = Name of setting, like "enable full cast credit"
TYPE = Type of setting, one of ("bool"|"labelenum"|"sep"|"text")
ID = Identifier of the setting (just any string? must it be a single word?)
DEFAULT = Default value

Type bool can be "true" or "false"
Type labelenum is a list of strings separated by the "|" symbol
Type sep is purely cosmetic: it simply displays a separator line to tidy things up
Type text allows entering an arbitrary string

- There can be as many <setting> elements as needed
- are there any other "type"s?
- ID is just a string used to identify the setting later in the code.
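
Putting the pieces together, my guess is that a generated settings block would look something like this (labels, ids and defaults are made up for illustration, and the attribute carrying the labelenum choices is an assumption on my part):

```xml
<!-- hypothetical getsettings output; names and ids are invented -->
<settings>
    <setting label="Enable full cast credit" type="bool" id="fullcast" default="true"></setting>
    <!-- assumption: the "|"-separated choices go in a values attribute -->
    <setting label="Poster size" type="labelenum" id="postersize" values="small|medium|large" default="medium"></setting>
    <setting type="sep"></setting>
    <setting label="Extra search terms" type="text" id="extraterms" default=""></setting>
</settings>
```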

For the second part, using the settings in the code, you simply use $INFO[ID], for example, if one ID is "fanart", you can obtain the user-selected option by using $INFO[fanart] (do not use quotes in ID, just the name). $INFO[ID] returns a string with the corresponding value established for the source being scraped.
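
As a sketch of how I think this would be used (untested; the site url, setting id and regexp are invented), a setting value could be interpolated directly into a generated search url:

```xml
<!-- hypothetical: builds a search url that embeds the value of a "fullcast" setting -->
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.example.com/search/$INFO[fullcast]/\1&lt;/url&gt;" dest="3">
        <expression noclean="1">(.+)</expression>
    </RegExp>
</CreateSearchUrl>
```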

Are there any "built in" infos? For example, it would be really useful for some applications to have $INFO(source) to get the source where the movie file is, $INFO(filename), $INFO(path), etc.
pko66 Offline
Senior Member
Posts: 190
Joined: Dec 2006
Reputation: 0
Post: #2
I have some questions about custom functions:

Where can they be used? I see in imdb.xml and other scrapers that they are used in "output", but can they be used in input or expression? I assume not, but I'm not sure.

Also, how do the buffers work when calling custom functions? Are they global? There is a parameter "clearbuffers", but it is not clear to me how it works... do the buffers get erased by default? If so, we are losing info!

I'm studying http://trac.xbmc.org/browser/trunk/XBMC/...oPoisk.xml which seems to be one of the simpler scrapers that actually uses custom functions (although nested! GMPP is called from GMP, and GMPP has clearbuffers="no" but GMP does not).

Also, I'm confused about how exactly the dest buffer is handled in these calls; in the KinoPoisk.ru scraper, buffer 5 is used to generate all the results, but the GMPP function uses it as dest...

My interpretation for now is this:

- Buffers are local.
- $$1 contains the fetched URL.
- dest simply states a placeholder for the output; it can be any buffer, no matter if it is used inside or outside the function (if so, why do we need to specify it?).
- The other buffers are empty by default.
- If we want, we can use a copy of the contents of all the buffers (except buffer 1) as they were at the point when the function was called. For that we insert the option clearbuffers="no" in the function definition; any manipulations we make to those contents will not be preserved when the custom function ends and returns to the main GetDetails.
- Whatever we generate as output will substitute the <url function="CustomFunction">url...</url> structure used to call the custom function

Is that correct? I will perform some tests tomorrow, but if I could have confirmation/corrections before that, it would save time and work...
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #3
pko66 Wrote:For the first part, there is a new regexp, "getsettings" that need to generate a XML structure like this:

<settings>
<setting label=NAME type=TYPE id=ID default=DEFAULT></setting>
</settings>

where (all are quoted strings):
NAME = Name of setting, like "enable full cast credit"
TYPE = Type of setting, one of ("bool"|"labelenum"|"sep"|"text")
ID = Identifier of the setting (just any string? must it be a single word?)
DEFAULT = Default value

Type bool can be "true" or "false"
Type labelenum is a list of strings separated by the "|" symbol
Type sep is purely cosmetic: it simply displays a separator line to tidy things up
Type text allows entering an arbitrary string

- There can be as many <setting> elements as needed
- are there any other "type"s?
- ID is just a string used to identify the setting later in the code.

other relevant ones would probably be
Code:
if (strcmpi(type, "text") == 0 ||
      strcmpi(type, "integer") == 0 || strcmpi(type, "fileenum") == 0)
text is keyboard input, integer a number, fileenum a file dialog

note that these are shared with plugins and scripts (same format for their settings).
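
for example (labels and ids invented here; the fileenum attributes are carried over from the plugin settings format, so treat them as an assumption):

```xml
<setting label="Max results" type="integer" id="maxresults" default="5"></setting>
<!-- fileenum pops a file/folder dialog; the start path presumably goes in values -->
<setting label="Cache folder" type="fileenum" id="cachedir" values="special://temp" default=""></setting>
```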

Quote:For the second part, using the settings in the code, you simply use $INFO[ID], for example, if one ID is "fanart", you can obtain the user-selected option by using $INFO[fanart] (do not use quotes in ID, just the name). $INFO[ID] returns a string with the corresponding value established for the source being scraped.

Are there any "built in" infos? for example, it would be really useful for some applications to have $INFO(source) to get the source where the movie file is, $INFO(filename), $INFO(path) etc
no. but this is a good idea, please add a trac ticket
pko66 Offline
Senior Member
Posts: 190
Joined: Dec 2006
Reputation: 0
Post: #4
I've completed the second chapter with a reference to NfoUrl, but it is really an assumption about how the function really works:

- The text of the URL found inside the nfo file is passed as $$1. I suppose XBMC does some sanity checks to be sure that the string is actually a URL and just one.
- If NfoUrl returns something it is assumed that:
1) It is the correct scraper to use
2) The string returned as result is the URL to be fetched and passed to GetDetails
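
If my assumption is right, a minimal NfoUrl would just validate the nfo text and echo the url back (the IMDB pattern below is only an illustration, not tested):

```xml
<!-- sketch: returns the url untouched when it looks like an IMDB title page, otherwise nothing -->
<NfoUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;\1&lt;/url&gt;" dest="3">
        <expression noclean="1">(http://www\.imdb\.com/title/tt[0-9]+)</expression>
    </RegExp>
</NfoUrl>
```

If the expression does not match, buffer 3 stays empty and, as I understand it, XBMC would treat this scraper as not applicable.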

By the way, the scraper for culturalia used as an exercise is mostly finished (well, there are a few more tests to do, and I want to store the country of production in the videodb as the filmaffinity scraper does). It is a little better than the current one in XBMC (no surprise there, since I used that scraper as a template and then improved on it): the current one does not fetch thumbnails (culturalianet introduced download protection that needs spoofing to work around) and had a little bug (it stored multiple genres as <genre>aventura / comedia</genre> instead of <genre>aventura</genre><genre>comedia</genre>). I think the new one should, when finished, replace the one in the trunk; how can that be done?
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #5
pko66 Wrote:I have some questions about custom functions:

Where can they be used? I see in imdb.xml and other scrapers that they are used in "output", but can they be used in input or expression? I assume not, but I'm not sure.
they can currently be used everywhere u output results as well as in createsearchurl and getsettings. you invoke the custom function foo by having a <url function="foo">someurl</url> in the returned xml.
Quote:Also, how do the buffers work when calling custom functions? are they global? there is a parameter "clearbuffers" but it is not clear for me how it works... do the buffers get erased by default? if so, we are losing info!
correct, by default we clear the buffers between function calls. however, we are not losing info. the returned xml is parsed after each function call, so anything we generate will be parsed and stored. clearbuffers="no" should only be used when u need to pass information between functions.
Quote:Also I'm confused as how exactly is dest buffer handled in these calls; in KinoPoisk.ru scraper, buffer 5 is used to generate all the results, but the GMPP function uses it as dest...
the dest is the buffer we will try to parse for results after a function has been evaluated. this includes looking for the <url function=..> stuff
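
so a custom function declares both up front, e.g. (function name, buffer number and regexp invented for illustration):

```xml
<!-- dest="5": buffer 5 is parsed for results (including nested <url function=..>)
     after this function runs; clearbuffers="no" keeps the caller's buffers visible -->
<GetExtraDetails dest="5" clearbuffers="no">
    <RegExp input="$$1" output="&lt;details&gt;&lt;tagline&gt;\1&lt;/tagline&gt;&lt;/details&gt;" dest="5">
        <expression>&lt;span id="tagline"&gt;([^&lt;]*)&lt;/span&gt;</expression>
    </RegExp>
</GetExtraDetails>
```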
Quote:My interpretation for now is this:

- Buffers are local.
- $$1 contains the fetched URL.
$$2 contains url number two, and so on, if you specify more urls. you can do

<url function="foo"><url>someurl</url><url>someotherurl</url></url>

if you want several urls fetched before a function call.
Quote:- dest simply states a placeholder for the output, can be any buffer no matter if it is used inside or outside the function (if so, why do we need to specify it?).
not correct. see further up.
Quote:- The other buffers are empty by default.
- We can if we want use a copy of the contents of all the buffers (except buffer 1) as they were at the point when the function was called. For that we insert the option clearbuffers="no" in the function definition; the manipulations we may do to that contents will not be preserved when the custom function ends and returns to the main GetDetails.
it won't return to getdetails (well, technically it does, but only to realize it's done). getdetails is already evaluated.
Quote:- Whatever we generate as output will substitute the <url function="CustomFunction">url...</url> structure used to call the custom function
?
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #6
pko66 Wrote:I've completed the second chapter with a reference to NfoUrl, but it is really an assumption about how the function really works:

- The text of the URL found inside the nfo file is passed as $$1. I suppose XBMC does some sanity checks to be sure that the string is actually a URL and just one.
- If NfoUrl returns something it is assumed that:
1) It is the correct scraper to use
2) The string returned as result is the URL to be fetched and passed to GetDetails
correct

Quote:I think the new one should substitute when finished the one in the trunk, how can that be done?
submit a patch on trac
pko66 Offline
Senior Member
Posts: 190
Joined: Dec 2006
Reputation: 0
Post: #7
It seems I'm using much of your time today, thanks for your answers!

spiff Wrote:they can currently be used everywhere u output results as well as in createsearchurl and getsettings. you invoke the custom function foo by having a <url function="foo">someurl</url> in the returned xml.

Great! Maybe I can use it in GetSearchResults in my hybrid culturalia/IMDB scraper to give the user the choice of whether or not to use IMDB... So, when the scraper gets the wrong IMDB movie, it can be corrected. The main problem I see is that it will take a long time just to generate the search url; for each possible movie its page must be fetched, and with the data obtained a search in IMDB must be executed. That's ten additional web operations for just 5 possible movies... Is it possible for a scraper to know if it is being called in "automatic" mode? By "automatic" I mean when the "scan for new content" option is invoked.

spiff Wrote:the dest is the buffer we will try to parse for results after a function has been evaluated. this includes looking for the <url function=..> stuff

Ok, I had a misconception there...

spiff Wrote:it won't return to getdetails(well technically it does, but only to realize it's done). getdetails is already evaluated.

Oh, I see, I hadn't understood it correctly...

spiff Wrote:?

mmm... I think my interpretation of how things work was wrong again, but to be sure, look at this example; I was thinking something like this should work:

- a function is called, let's say GetResults, all the RegExp in it get evaluated and finally output goes to, say, buffer 10
- Buffer 10 is then inspected, let's assume it is:

Code:
...whatever...
<title>badmovie</title>
<plot>some plot</plot>
<thumb>http://myfiles.com/file.jpg</thumb>
<url function="MyFunction">http://MyFunction.com/hello</url>
<actor>
  <name>Juan Nadie</name>
  <role><url function="GetRole">http://MovieRoles.com/role.php?Juan Nadie&movie=badmovie</url></role>
</actor>
...more whatever...

MyFunction returns, say, <tagline>some tagline</tagline>, and GetRole returns just "John Doe", so the result is:

Code:
...whatever...
<title>badmovie</title>
<plot>some plot</plot>
<thumb>http://myfiles.com/file.jpg</thumb>
<tagline>some tagline</tagline>
<actor>
  <name>Juan Nadie</name>
  <role>John Doe</role>
</actor>
...more whatever...

That was what I thought; I now think that is wrong, and that what really happens is that the results get interpreted, but all the <url function=...> elements get stripped from the results and their output gets appended to the original result, so in this case the results are:
Code:
...whatever...
<title>badmovie</title>
<plot>some plot</plot>
<thumb>http://myfiles.com/file.jpg</thumb>
<actor>
  <name>Juan Nadie</name>
  <role></role>
</actor>
...more whatever...
<tagline>some tagline</tagline>
John Doe

which is not what I initially thought...

spiff Wrote:submit a patch on trac

Ok, I'll need to learn how to do that :-) Can anyone who wants to submit a patch?

spiff Wrote:other relevants would probably be
[...]
note that these are shared with plugins and scripts (same format for their settings).

I have not yet seen anything about script/plugin development; I will look into it.

I'll add documentation about that to the wiki (although I think fileenum is not very useful in scrapers, since there seems to be no capability to access the filesystem, neither in the scraper itself nor in URL evaluation, e.g. to submit a file to a site for identification or something like that).

spiff Wrote:no. but this is a good idea, please add a trac ticket

Ok, will do. I think they could be used for a number of things: creating custom genres based on directories, or putting into the db (for example, in less important fields like tagline or outline) a notice like "turn on the secondary server to watch this movie"... also, with the data available, people may think of cool things to do with it through scrapers.