Scraper functions - How do they work?

MKay Offline
Member
Posts: 81
Joined: Jan 2010
Reputation: 10
Post: #1
Hi,

I have some general questions about how scraper functions work. It would be nice if someone could help me. :)
I tried to analyse the cinefacts scraper to understand scraper functions. Here is the interesting part of it:
Code:
<NfoUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.cinefacts.de/kino/\1/\2/filmdetails.html&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;" dest="3">
        <expression clear="yes" noclean="1">(cinefacts.de/kino/)([0-9]*)/(.[^\/]*)/filmdetails.html</expression>
    </RegExp>
    <RegExp input="$$1" output="&lt;details&gt;&lt;url cache=&quot;tt\2&quot; function=&quot;GetByIMDBId&quot;&gt;http://www.imdb.com/title/tt\2/externalreviews&lt;/url&gt;&lt;id&gt;tt\2&lt;/id&gt;&lt;/details&gt;" dest="3+">
        <expression>(imdb.com/title/tt)([0-9]*)</expression>
    </RegExp>
    <RegExp input="$$1" output="&lt;details&gt;&lt;url cache=&quot;tt\2&quot; function=&quot;GetByIMDBId&quot;&gt;http://www.imdb.com/title/tt\2/externalreviews&lt;/url&gt;&lt;id&gt;tt\2&lt;/id&gt;&lt;/details&gt;" dest="3+">
        <expression>(imdb.com/)Title\?([0-9]+)</expression>
    </RegExp>
</NfoUrl>
<GetByIMDBId dest="3">
    <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetCinefactsDetailsLink&quot;&gt;http://www.cinefacts.de/kino/\1&lt;/url&gt;&lt;id&gt;$$2&lt;/id&gt;&lt;/details&gt;" dest="3+">
        <expression noclean="1">&lt;a href=&quot;http://www.cinefacts.de/kino/([^&quot;]*)&quot;</expression>
    </RegExp>
</GetByIMDBId>
<GetCinefactsDetailsLink dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.cinefacts.de\1&lt;/url&gt;&lt;id&gt;$$2&lt;/id&gt;" dest="3+">
        <expression>&lt;a href=&quot;([^&quot;]*)&quot;&gt;Filmdetails&lt;/a&gt;</expression>
    </RegExp>
</GetCinefactsDetailsLink>
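To make the mechanics of a single step concrete, here is a minimal Python sketch of how one <RegExp> element could be evaluated: the input buffer ($$1) is matched against the expression, and the backreferences (\1, \2) in the output template are expanded into the destination buffer. The sample buffer contents and this tiny evaluator are my own illustration, not XBMC's actual implementation.

```python
import re

# Illustrative buffers: $$1 holds the text being scanned, dest="3" is
# where the RegExp writes its output. Contents are made up for the demo.
buffers = {1: "see http://www.imdb.com/title/tt0137523/ for details", 3: ""}

# expression and output taken from the 2nd RegExp inside NfoUrl above
expression = r"(imdb.com/title/tt)([0-9]*)"
output = (
    '<details><url cache="tt\\2" function="GetByIMDBId">'
    'http://www.imdb.com/title/tt\\2/externalreviews</url>'
    '<id>tt\\2</id></details>'
)

m = re.search(expression, buffers[1])   # input="$$1"
if m:
    buffers[3] = m.expand(output)       # \1, \2 expanded, written to dest="3"

print(buffers[3])
```

The printed result is exactly the <details> fragment quoted below, with \2 replaced by the captured IMDb number. The "+" in dest="3+" would append to the buffer instead of overwriting it.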
So the 3rd RegExp inside NfoUrl would output something like this:
Code:
<details><url cache="tt\2" function="GetByIMDBId">http://www.imdb.com/title/tt\2/externalreviews</url><id>tt\2</id></details>
So the GetByIMDBId-function is called which itself calls the GetCinefactsDetailsLink-function.

My first question is: how is the output of a function inserted into the "parent output"?
Example (GetByIMDBId calls GetCinefactsDetailsLink ):
I could imagine that the url-tag in GetByIMDBId will be replaced with the output of GetCinefactsDetailsLink.
Then the output would be something like
Quote:
<details><url>http://www.cinefacts.de\1</url><id>$$2</id><id>$$2</id></details>
But what happens with the output of GetByIMDBId? The 3rd RegExp in NfoUrl outputs a <details>-tag, and GetByIMDBId outputs a <details>-tag, too.
So what would the output look like? "<details><details><url>..."?
Or must the <url function="">-tag reside in another element (here, the <details>) which is then replaced with the output of the function call?
In that case NfoUrl would output something like "<url></url><id></id><url></url><id></id><url></url><id></id>".

Best Regards,
MKay

:eek2: AWX - Ajax based Webinterface for XBMC (Dharma Beta 2)
SorryGoFish Offline
Member
Posts: 51
Joined: Sep 2012
Reputation: 0
Post: #2
I found myself wondering the same thing. This scraper formulation is super cool. I'd love it if somebody could explain the paradigm to new XBMC hackers.

@MKay - in the last (almost) year, did you come to an understanding? Google brought me (and maybe others) here. Thanks.

Edit: The Wiki is pretty good: http://wiki.xbmc.org/index.php?title=Scrapers
(This post was last modified: 2013-02-06 16:14 by SorryGoFish.)
MKay Offline
Member
Posts: 81
Joined: Jan 2010
Reputation: 10
Post: #3
Wow, this is a pretty old topic. :D

Some background:
Last year I wrote a PHP-based scraper tool but paused development because I was short of time. :)
So I haven't touched the code for months. (But the tool still looks great. :P)

Just for you I took a look at the 700+ line source. :P It took a while to look through, but I think I found the important parts.
This is how I implemented it (and at the time it worked well, e.g. with the TheTVDB scraper):

When a "url function" (let's call it F) is called, it gets passed a reference to a result array.
The result of F is appended to that array, then cascading function calls are resolved: if F calls a function G, then G is executed and its result is appended to the same array.
After all functions have been called, the outputs are merged.

For example, after executing "GetDetails" (with cascading function calls F and G), the results of all functions are stored in the result array:
Code:
result[0] = result of F, e.g. <details>...A...</details>
result[1] = result of G, e.g. <details>...B...</details>

These results are merged by their first node name, so the final result will be: <details>...A...B...</details>
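That merge step can be sketched in a few lines of Python. The fragment contents and the merge_results helper are my own illustration of the behaviour MKay describes, not his actual PHP code; real scraper code would use a proper XML parser rather than regexes.

```python
import re

# Outputs collected from cascading function calls, as in the example above.
results = [
    "<details><title>A</title></details>",  # result of F
    "<details><plot>B</plot></details>",    # result of G
]

def merge_results(results):
    """Merge fragments sharing the same root node name into one element."""
    # Take the root node name from the first fragment, e.g. "details".
    root = re.match(r"<(\w+)>", results[0]).group(1)
    # Concatenate the inner content of every fragment with that root.
    inner = "".join(
        re.match(r"<%s>(.*)</%s>$" % (root, root), r, re.S).group(1)
        for r in results
    )
    return "<%s>%s</%s>" % (root, inner, root)

print(merge_results(results))
# -> <details><title>A</title><plot>B</plot></details>
```

So the two <details> fragments collapse into a single <details> element whose children are the concatenated children of both, matching the "<details>...A...B...</details>" result above.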

That's my interpretation of my own scraper code. ;) As I said, it has been a long time since I touched it. :)

(This post was last modified: 2013-02-06 20:20 by MKay.)
SorryGoFish Offline
Member
Posts: 51
Joined: Sep 2012
Reputation: 0
Post: #4
Oh, right. I saw your discussion somewhere about your efforts. Thanks for the run-through, and your time.