Scraper functions - How do they work? - Printable Version

Scraper functions - How do they work? - MKay - 2012-02-04

Hi,

I have some general questions about how scraper functions work. It would be nice if someone could help me :)

I tried to analyse the cinefacts scraper to understand scraper functions. Here is the interesting part of it:

Code:
<NfoUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.cinefacts.de/kino/\1/\2/filmdetails.html&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;" dest="3">
        <expression clear="yes" noclean="1">(cinefacts.de/kino/)([0-9]*)/(.[^\/]*)/filmdetails.html</expression>
    </RegExp>
    <RegExp input="$$1" output="&lt;details&gt;&lt;url cache=&quot;tt\2&quot; function=&quot;GetByIMDBId&quot;&gt;http://www.imdb.com/title/tt\2/externalreviews&lt;/url&gt;&lt;id&gt;tt\2&lt;/id&gt;&lt;/details&gt;" dest="3+">
        <expression>(imdb.com/title/tt)([0-9]*)</expression>
    </RegExp>
    <RegExp input="$$1" output="&lt;details&gt;&lt;url cache=&quot;tt\2&quot; function=&quot;GetByIMDBId&quot;&gt;http://www.imdb.com/title/tt\2/externalreviews&lt;/url&gt;&lt;id&gt;tt\2&lt;/id&gt;&lt;/details&gt;" dest="3+">
        <expression>(imdb.com/)Title\?([0-9]+)</expression>
    </RegExp>
</NfoUrl>

<GetByIMDBId dest="3">
    <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetCinefactsDetailsLink&quot;&gt;http://www.cinefacts.de/kino/\1&lt;/url&gt;&lt;id&gt;$$2&lt;/id&gt;&lt;/details&gt;" dest="3+">
        <expression noclean="1">&lt;a href=&quot;http://www.cinefacts.de/kino/([^&quot;]*)&quot;</expression>
    </RegExp>
</GetByIMDBId>

<GetCinefactsDetailsLink dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.cinefacts.de\1&lt;/url&gt;&lt;id&gt;$$2&lt;/id&gt;" dest="3+">
        <expression>&lt;a href=&quot;([^&quot;]*)&quot;&gt;Filmdetails&lt;/a&gt;</expression>
    </RegExp>
</GetCinefactsDetailsLink>

So the 3rd RegExp inside NfoUrl would output something like this:

Code:
<details><url cache="tt\2" function="GetByIMDBId">http://www.imdb.com/title/tt\2/externalreviews</url><id>tt\2</id></details>
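To make the placeholders concrete, here is a rough Python sketch of how the third RegExp could be applied to a line from an .nfo file. It assumes that \1, \2, ... in the output attribute are filled from the capture groups of the expression. The helper names (expand, apply_regexp) and the sample URL are made up for illustration; this is not the actual scraper engine.

```python
import re

# Expression and output template of the third RegExp in NfoUrl
# (the template is shown with its XML entities already decoded).
EXPRESSION = r"(imdb.com/)Title\?([0-9]+)"
OUTPUT_TEMPLATE = (
    r'<details><url cache="tt\2" function="GetByIMDBId">'
    r"http://www.imdb.com/title/tt\2/externalreviews</url>"
    r"<id>tt\2</id></details>"
)


def expand(template, match):
    """Replace \\1..\\9 in the template with the matching capture groups."""
    return re.sub(
        r"\\(\d)",
        lambda m: match.group(int(m.group(1))) or "",
        template,
    )


def apply_regexp(expression, template, text):
    """Return the expanded template, or an empty string if nothing matches."""
    match = re.search(expression, text)
    return expand(template, match) if match else ""


# Hypothetical .nfo content, used only as sample input.
print(apply_regexp(EXPRESSION, OUTPUT_TEMPLATE, "http://www.imdb.com/Title?0133093"))
```

With that sample input, every \2 becomes 0133093, so the url tag points at http://www.imdb.com/title/tt0133093/externalreviews and names GetByIMDBId as the function to run on the downloaded page.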

So the GetByIMDBId function is called, which in turn calls the GetCinefactsDetailsLink function.

My first question is: how is the output of a function inserted into the "parent output"?
Example (GetByIMDBId calls GetCinefactsDetailsLink):
I could imagine that the url tag in GetByIMDBId is replaced with the output of GetCinefactsDetailsLink.
Then the output would be something like

Quote:<details><url>http://www.cinefacts.de\1</url><id>$$2</id><id>$$2</id></details>

But what happens to the output of GetByIMDBId? The 3rd RegExp in NfoUrl outputs a <details> tag, and GetByIMDBId outputs a <details> tag, too.
So what would the output look like? "<details><details><url>..."?
Or must the <url function=""> tag reside in another element (here, <details>) that is replaced with the output of the function call?
In that case NfoUrl would output something like "<url></url><id></id><url></url><id></id><url></url><id></id>".

Best Regards,
MKay

RE: Scraper functions - How do they work? - SorryGoFish - 2013-02-06

I found myself wondering the same thing. This scraper formulation is super cool. I'd love it if somebody could explain the paradigm to new XBMC hackers.

@MKay - in the last (almost) year, did you come to an understanding? Google brought me (and maybe others) here. Thanks.

Edit: The Wiki is pretty good: http://wiki.xbmc.org/index.php?title=Scrapers

RE: Scraper functions - How do they work? - MKay - 2013-02-06

Wow, this is a pretty old topic :D

Some background:
Last year I wrote a PHP-based scraper tool but paused development because ... I was short of time :)

So I haven't touched the code for months. (But the tool still looks great :P)

Just for you I took a look at the 700+ lines of source :P It took a while to look through, but I think I found the important parts.
This is how I implemented it (and at that time it worked well, e.g. with the TheTVDB scraper):

When a "url function" (let's call it F) is called, it is passed a reference to a result array.
The result of F is appended to that array. Then cascading function calls are resolved: for example, if F calls a function G, then G is executed and its result is appended to the same result array.
After all functions have been called, the outputs are merged.
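A minimal Python sketch of that flow, not the original PHP code. The name run_function and the convention that each function returns its own output plus the names of the functions it calls are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shape: a scraper function returns its output fragment and
# the names of the functions requested by its <url function="..."> tags.
ScraperFunction = Callable[[], Tuple[str, List[str]]]


def run_function(
    name: str,
    functions: Dict[str, ScraperFunction],
    results: List[str],
) -> None:
    """Run one function, append its output, then resolve cascading calls."""
    output, called = functions[name]()
    results.append(output)  # every function writes to the same shared list
    for callee in called:
        run_function(callee, functions, results)


# F calls G; both append to the same result list.
functions = {
    "F": lambda: ("<details>...A...</details>", ["G"]),
    "G": lambda: ("<details>...B...</details>", []),
}
results: List[str] = []
run_function("F", functions, results)
print(results)
```

The list is passed by reference, so results end up in call order no matter how deep the chain goes.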

For example, after executing "GetDetails" (with cascading function-calls F and G) all results of the functions are stored in the result array:

Code:
result[0] = result of F, e.g. <details>...A...</details>
result[1] = result of G, e.g. <details>...B...</details>

These results are merged by their first node name, so the final result will be: <details>...A...B...</details>
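The merge step could look roughly like this in Python, using the standard xml.etree.ElementTree module. It is only a sketch of the idea, not the original PHP code: merge_results is a made-up name, the title and plot tags are placeholder content, and text sitting directly inside the root of a later fragment is ignored here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def merge_results(results: List[str]) -> List[str]:
    """Merge XML fragments that share the same root (first node) name."""
    merged: Dict[str, ET.Element] = {}  # root tag -> merged element
    for fragment in results:
        root = ET.fromstring(fragment)
        target = merged.get(root.tag)
        if target is None:
            merged[root.tag] = root  # first fragment with this root name
        else:
            target.extend(list(root))  # append the children of later ones
    return [ET.tostring(element, encoding="unicode") for element in merged.values()]


# Placeholder fragments standing in for the results of F and G.
results = [
    "<details><title>A</title></details>",
    "<details><plot>B</plot></details>",
]
print(merge_results(results))
```

Both fragments have details as their first node, so they collapse into a single details element holding the children of both.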

That's my interpretation of my actual scraper code ;)

As I said ... it has been a long time since I touched the code :)

RE: Scraper functions - How do they work? - SorryGoFish - 2013-02-07

Oh, right. I saw your discussion somewhere about your efforts. Thanks for the run-through, and your time.