What is the input $$2 for CreateSearchUrl? (can't find documentation)
#1
I'm trying to understand how the scrapers work so I'm looking at imdb.xml.

It contains the following code:
Code:
<CreateSearchUrl dest="3" SearchStringEncoding="iso-8859-1">
    <RegExp input="$$1" output="&lt;url&gt;http://akas.imdb.com/find?s=tt;q=\1$$4&lt;/url&gt;" dest="3">
        <RegExp input="$$2" output="%20(\1)" dest="4">
            <expression clear="yes">(.+)</expression>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</CreateSearchUrl>

From what I've read, the inner RegExp executes before the outer RegExp. (and siblings execute in document order, top-down)

What I don't understand is that the inner RegExp has an input of $$2. Where is that coming from? I thought $$1 was the only valid input and it contained the file name.

Thanks for any help!
Reply
#2
$$2 is the year, if we managed to extract one from the filename. documentation is only available in the form of the source code, so no wonder you couldn't find any.
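To make the buffer flow concrete, here is a rough Python sketch of what that CreateSearchUrl block appears to do. It is not the XBMC code: `build_search_url` and the `buffers` dict are made-up names, and it assumes that an empty `<expression/>` behaves like `(.*)` and that `$$1` holds the title while `$$2` holds the year (or is empty).

```python
import re


def build_search_url(title, year=""):
    """Simulate the nested RegExp elements of the imdb.xml CreateSearchUrl block."""
    # Hypothetical stand-in for the scraper buffers: $$1 = title, $$2 = year.
    buffers = {1: title, 2: year, 3: "", 4: ""}

    # Inner RegExp runs first: input $$2, expression (.+), output "%20(\1)", dest 4.
    # clear="yes" means buffer 4 is emptied even if the expression does not match.
    buffers[4] = ""
    inner = re.search(r"(.+)", buffers[2])
    if inner:
        buffers[4] = "%20(" + inner.group(1) + ")"

    # Outer RegExp: input $$1, empty expression (assumed equivalent to "(.*)"),
    # output template uses \1 plus the contents of buffer 4 ($$4), dest 3.
    outer = re.search(r"(.*)", buffers[1])
    buffers[3] = (
        "<url>http://akas.imdb.com/find?s=tt;q="
        + outer.group(1)
        + buffers[4]
        + "</url>"
    )
    return buffers[3]
```

With no year, buffer 4 stays empty and the URL contains only the title; with a year, the `%20(year)` suffix is appended after the title.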
Reply
#3
Ahhh, thanks!

Can you point me in the right direction for the source where it filters the filename and then fills in $$1 and $$2, etc?

I'm looking at xbmc/utils/ScraperParser.cpp but that doesn't seem to be the correct place...
Reply
#4
xbmc/utils/IMDB.cpp. in general a simple grep for the scraper function names will be a good trail to follow.
Reply
#5
Great, thanks a ton!
Reply
#6
Also, if you would like to write up some documentation, please feel free to start. I'm more than happy to help proof it etc.
Reply
#7
That sounds good, I might give it a try.

Truth be told, I'd like to write an XBMC/XML scraper library that utilizes the XBMC scrapers. Basically I'd like to create something like the ScraperXML project but with much better documentation and better organized.

Creating some more documentation for the scraper format would definitely be a good start!
Reply
#8
I'm making good progress but I have a question about the repeat attribute.

Let's take the following very simple example:
Code:
<RegExp input="$$1" output="\1" dest="5+">
    <expression noclean="1" repeat="yes">name=&quot;([^&quot;]*)&quot;</expression>
</RegExp>

Suppose the expression matches five times.

This is how I think it works:

1) Execute regular expression. Create output string like normal.
2) Repeat for each match, appending output strings together and finally storing the complete result in the destination buffer.

Does that sound right?

I think what's a little confusing is that it seems the repeat attribute should be on the RegExp tag and not the expression tag.

How does the clear tag work?
Reply
#9
repeat does the following;

1) run expression
2) if no match goto 6
3) extract the selections, append them to the output buffer
4) remove the matchlen from the input buffer
5) goto 1
6) done


clear="yes" will clear the output buffer even if the expression fails. the default is to leave the buffer alone if the expression doesn't match.
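For anyone reimplementing this, the steps above and the clear behaviour can be sketched in Python roughly as follows. This is only an interpretation of the description, not a port of ScraperParser.cpp: `run_regexp` and its parameters are hypothetical, "remove the matchlen from the input buffer" is modelled by moving a search position past each match, and the guard against zero-length matches is an addition of the sketch.

```python
import re


def run_regexp(pattern, template, input_buf, dest_buf="",
               repeat=False, clear=False, append=False):
    """Return the new contents of the destination buffer.

    template uses \\1-style group references, like the output attribute.
    append=True models a dest such as "5+" (append instead of overwrite).
    """
    regex = re.compile(pattern)
    output = ""
    matched = False
    pos = 0
    while True:
        match = regex.search(input_buf, pos)   # 1) run expression
        if match is None:                      # 2) no match: done
            break
        matched = True
        output += match.expand(template)       # 3) append selections to output
        if not repeat:
            break
        if match.end() == match.start():       # sketch-only guard: avoid looping
            break                              #    forever on an empty match
        pos = match.end()                      # 4) skip past the matched text

    if not matched:
        # Default: leave the buffer alone; clear="yes" empties it instead.
        return "" if clear else dest_buf
    return dest_buf + output if append else output
```

Under these assumptions, an expression that matches five times with `repeat="yes"` and `output="\1"` yields the five captured values concatenated in order.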
Reply
#10
I think I follow but what is matchlen?
Reply
#11
it's the length of the regular expression match. we advance the input buffer by matchlen chars. bad grammar up there.
Reply
#12
Gotcha. It sounds like you're just manually looking for matches (by removing any previous content in the buffer).

One last question! (or so I hope!)

Does a chain function execute and then replace its <chain> tag with its output?

For example, here's the IMDB - GetDetails output:
Code:
<details>
  <id>tt1014759</id>
  <originaltitle>Alice in Wonderland </originaltitle>
  <chain function="GetIMDBAKATitlesById">tt1014759</chain>
  <...>
</details>

The chain function creates this output:
Code:
<details>
  <url cache="tt1014759-credits.html" function="ParseIMDBAKATitles">http://akas.imdb.com/title/tt1014759</url>
</details>

If I simply replace the original <chain> element with this new output it will look like this:
Code:
<details>
  <id>tt1014759</id>
  <originaltitle>Alice in Wonderland </originaltitle>
  <details>
    <url cache="tt1014759-credits.html" function="ParseIMDBAKATitles">http://akas.imdb.com/title/tt1014759</url>
  </details>
  <...>
</details>

Is that right? That seems like a nested <details> tag. However, other chain functions generate different output that may or may not include an outer <details> tag.

Thanks so much for the help again.
Reply
#13
not how it works.

each function returns a <details> block. after each function return we parse the xml.
the parsing is done with the convention that anything that's additive (i.e. you can have several tags like <genre>) is added, anything else which was previously returned is replaced. then, if we find a <chain> or an <url> in the returned xml, we perform the requested function call.
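A minimal Python sketch of that merge convention, for illustration only. The names `merge_details` and `ADDITIVE_TAGS` are invented here, and the set of additive tags is an assumption (only `<genre>` is named above); the real list lives in the XBMC source.

```python
import xml.etree.ElementTree as ET

# Assumption: only <genre> is listed because it is the one example given above.
ADDITIVE_TAGS = {"genre"}


def merge_details(result, xml_text):
    """Merge one returned <details> block into result (dict of tag -> list of values).

    Returns the follow-up calls requested by <chain> and <url function="...">
    elements as (tag, function name, text) tuples, in document order.
    """
    pending = []
    root = ET.fromstring(xml_text)
    for child in root:
        text = child.text or ""
        if child.tag in ("chain", "url") and child.get("function"):
            # Not stored as data: the caller performs the requested function call.
            pending.append((child.tag, child.get("function"), text))
        elif child.tag in ADDITIVE_TAGS:
            # Additive tags accumulate across function returns.
            result.setdefault(child.tag, []).append(text)
        else:
            # Anything else replaces what an earlier function returned.
            result[child.tag] = [text]
    return pending
```

Each function's `<details>` block is merged into the same `result`, so the `<chain>` element never has to be replaced in place and no nested `<details>` is produced.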
Reply
#14
Thanks!

I think I have pretty much everything working except I noticed a big problem.

Why do certain functions return invalid XML?

For example, in movieposterdb.xml (inside metadata.common.movieposterdb.com), function GetMoviePosterDBThumbs:
Code:
<GetMoviePosterDBThumbs dest="6">
    <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetMoviePosterDBLink&quot;&gt;http://www.movieposterdb.com/browse/search?type=movies&amp;query=\1&lt;/url&gt;&lt;/details&gt;" dest="6">
        <expression>tt([t0-9]*)</expression>
    </RegExp>
</GetMoviePosterDBThumbs>

If you HTML decode the output you get this:
Code:
<details><url function="GetMoviePosterDBLink">http://www.movieposterdb.com/browse/search?type=movies&query=\1</url></details>

That ampersand before query is invalid XML. It almost seems like the ampersand should be double-escaped but this works in XBMC so I'm guessing it's doing something to get around that?

Also, I'm guessing this could happen if a plot or actor had a special XML character in their name?
Reply
#15
tinyxml is forgiving on &'s but you are correct - the library should encode those &'s. will nag the relevant parties.

it may happen, but the scraper authors have the tools to avoid it.
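If you are parsing scraper output with a stricter XML parser than tinyxml, a small Python sketch like the one below shows the difference. `wrap_url` and `is_well_formed` are hypothetical helpers written for this post, not part of XBMC; the escaping itself uses the standard library's `xml.sax.saxutils`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def wrap_url(url, function):
    """Build a <details><url .../></details> block with the URL text XML-escaped."""
    return "<details><url function=%s>%s</url></details>" % (
        quoteattr(function),   # adds the surrounding quotes and escapes the value
        escape(url),           # turns & into &amp;, < into &lt;, > into &gt;
    )


def is_well_formed(xml_text):
    """True if a strict parser (ElementTree/expat) accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

A strict parser rejects the unescaped `&query=` form and accepts the escaped one, returning the original URL as the element text. The same escaping would apply to plots or actor names that contain special XML characters.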
Reply
