
w00dst0ck · 2008-08-19, 11:44

Hi there,

I'm currently developing a scraper for http://www.moviemaze.de. I managed to scrape most of the provided information but have some trouble with the following issues:

Director:

Moviemaze.de HTML
[HTML] <td valign=top class="standard">
Regie:
</td>

<td valign=top class="standard_justify">
Guillermo Del Toro </td>
</tr>
[/HTML]

my regex:

Code:
<RegExp input="$$6" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">

    <RegExp input="$$1" output="\2" dest="6">

    <expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

    </RegExp>

    <expression>([A-Za-z0-9 ,.]+)</expression>

</RegExp>

scrap.exe delivers the right results, but XBMC displays TRIM as the result.
I decided to extract the value in two steps because it's surrounded by tabs.
Any ideas why XBMC displays TRIM?

Actors:

Moviemaze.de HTML
[HTML] <tr>
<td valign=top class="standard">
Darsteller:

</td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane) </td>

</tr>[/HTML]

My regex:

Code:
<RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;\2&lt;/role&gt;&lt;/actor&gt;" dest="5+">

    <RegExp input="$$1" output="\2" dest="2">

        <expression repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>

    </RegExp>

    <expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>

</RegExp>

It works, but it misses the actors wrapped in an a href link, and I haven't managed to find a solution.

German Umlaut [äöüß]:

I can't match words containing German umlauts and similar characters [äöüß].
I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9äÄöÖüÜ ,.]+) without success.

Can somebody please help me?

regards,
w00dst0ck

**spiff** · 2008-08-19, 12:06

if that's a c&p;

Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

actors; try adding the href stuff with a ?, i.e. make it optional
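To illustrate the optional-link idea outside XBMC, here is a minimal Python sketch. The pattern and the names `ACTOR` and `parse_actors` are assumptions made for illustration, not the scraper's actual expression, and Python's `re` module is not the XBMC regex engine, so the syntax may need adapting.

```python
import re

# Sample taken from the moviemaze.de snippet above (shortened).
cell = (
    '<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
    '<a href="/celebs/114/1.html">Heath Ledger</a> (Joker), '
    'Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon) </td>'
)

# Hypothetical pattern: the opening <a ...> and the closing </a> are both
# optional, so linked and unlinked names go through the same expression.
ACTOR = re.compile(
    r'(?:<a href="[^"]*">)?'   # optional opening link tag
    r'([^<>(),]+?)'            # actor name
    r'(?:</a>)?'               # optional closing link tag
    r'\s*\(([^)]*)\)'          # role in parentheses
)

def parse_actors(text):
    """Return (name, role) pairs, trimming stray whitespace."""
    return [(name.strip(), role.strip()) for name, role in ACTOR.findall(text)]

for name, role in parse_actors(cell):
    print(f"{name} as {role}")
```

This sketch only handles entries that have a role in parentheses; actors listed without a role would need the role part to be optional as well.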

umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

w00dst0ck · 2008-08-19, 13:27

thanx for reply!

spiff Wrote:if that's a c&p;

Code:
<expression trim=2">Regie([^"]*)"standard_justify"(.*?)<</expression>

missing a "

Added the missing " but it makes no difference. I also tried without the trim option, but then no information is returned at all.

spiff Wrote:umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

How do I set the encoding of the scraper?
The moviemaze.de page is encoded with iso-8859-1 and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>

**spiff** · 2008-08-19, 13:45

you set the encoding of the scraper xml file using exactly that kinda header as you just pasted
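Why the declared encoding has to match the bytes can be shown with a small Python sketch. It is not XBMC code; the helper `decodes_cleanly` is a hypothetical name used only for this illustration.

```python
# "ü" is the single byte 0xFC in iso-8859-1.
raw = "Heerführer".encode("iso-8859-1")

def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

print(raw)                                  # the bytes as saved on disk
print(decodes_cleanly(raw, "iso-8859-1"))   # matching declaration
print(decodes_cleanly(raw, "utf-8"))        # mismatched declaration
```

If the file header declares one encoding while the file is saved in another, literal umlauts in an expression can end up as different bytes than the ones on the page, which would explain a character class that never matches.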

Gaarv · 2008-08-19, 13:56

w00dst0ck Wrote:Added the missing " but there is no difference. Also tried without the trim option. But that results in no information provided.

Seems you're missing a ">" too after "standard_justify". It's not required for the match, but it will be returned in \2, so that's a problem.
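The effect of the unmatched ">" can be reproduced in a regex tester. The following Python sketch is only an approximation: it assumes the dot should match newlines (hence `re.S`), and the names `original`, `fixed`, and `director` are made up for this example.

```python
import re

# Director snippet from the first post; the tabs before the name are assumed.
page = '''<td valign=top class="standard">
Regie:
</td>

<td valign=top class="standard_justify">
\t\tGuillermo Del Toro </td>
</tr>'''

# As posted: group 2 starts directly after the closing quote,
# so the ">" of the tag ends up inside the capture.
original = re.compile(r'Regie([^"]*)"standard_justify"(.*?)<', re.S)

# Variant that consumes the ">" before the capture group starts.
fixed = re.compile(r'Regie([^"]*)"standard_justify">(.*?)<', re.S)

def director(pattern, text):
    """Return the raw second capture group, or None if nothing matched."""
    match = pattern.search(text)
    return match.group(2) if match else None

print(repr(director(original, page)))  # begins with '>'
print(repr(director(fixed, page)))     # only whitespace around the name
```

With the ">" consumed, the remaining whitespace around the name is what the trim option is meant to remove.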

For umlauts, like spiff already said, use "iso-8859-1", but if you're using character-class matches be sure to include "&", "#", ";" and digits, because umlauts appear as numeric character references in the source:

Heerf & # 2 5 2 ; hrer
Ma & # 2 2 3 ; arbeit

Heerführer
Maßarbeit

I had to space out the code to illustrate; the spaces aren't in the actual source, of course.
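A short Python sketch of the same point, assuming the page really delivers umlauts as numeric character references as shown above. The variable names are illustrative only.

```python
import html
import re

source = "Heerf&#252;hrer, Ma&#223;arbeit"

# A plain alphanumeric class breaks each word apart at the reference.
narrow = re.findall(r"[A-Za-z0-9]+", source)

# Adding "&", "#" and ";" keeps each word in one piece.
wide = re.findall(r"[A-Za-z0-9&#;]+", source)

# Decoding the references afterwards gives the readable text.
decoded = [html.unescape(word) for word in wide]

print(narrow)
print(wide)
print(decoded)
```

Inside XBMC the decoding step is not something this sketch can vouch for; it only shows why the narrower character class stops matching in the middle of a word.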

w00dst0ck · 2008-08-19, 15:32

Gaarv Wrote:Seems you're missing a ">" too after "standard_justify". It's not required for the match, but it will be returned in \2, so that's a problem.

Thanx, that solves the problem.

The problem with the umlaut is solved by using (.*)

Next I will try to solve my href problem...

Gaarv · 2008-08-19, 16:08

I did this quickly; you will have to adapt it or limit where it applies within the whole page, but it gives the idea:

Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))

In the example you gave, this matches all actor names and roles, sometimes with a </a> that needs to be cleaned up.

w00dst0ck · 2008-08-20, 08:51

Hi there and thanx for your help.
I've only one problem left with the actors part.
[HTML] <tr>
<td valign=top class="standard">
Darsteller:

</td>
<td valign=top class="standard_justify" width=100%>
Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz </td>
</tr>[/HTML]

Code:
<!--Actors-->    

    <RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\5&lt;/role&gt;&lt;/actor&gt;" dest="5+">

        <RegExp input="$$1" output="\2" dest="2">

            <expression trim="2" repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>

        </RegExp>

        <expression repeat="yes">(&lt;a href\="[^&gt;]*&gt;)?(.*?)(&lt;/a&gt;)?( \((.*?)\))?, </expression>

    </RegExp>

The 10 tabs in front of the first actor name end up in \2. I've tried to strip them with [\t]{10}?
but I think the XBMC scraper engine doesn't understand \t.
Any idea how I can solve this?

**spiff** · 2008-08-20, 11:47

\\t
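An alternative to matching the tabs explicitly is to consume whitespace outside the capture groups. The Python sketch below shows the idea on the sample above; `ENTRY` and `parse_cell` are hypothetical names, and whether the XBMC engine accepts `\s` in the same way is an assumption that would need checking.

```python
import re

# Cell content from the sample above, shortened; the ten leading tabs are kept.
cell = (
    "\t" * 10
    + "Til Schweiger (Ludo), Nora Tschirner (Anna), Wolfgang Stumph, "
    + "Barbara Rudnik (Lilli), Christian Tramitz </td>"
)

# Whitespace is matched outside the groups, so neither tabs nor the
# trailing space reach the captured name. The role is optional.
ENTRY = re.compile(
    r"\s*([^,(<]+?)"        # actor name, without surrounding whitespace
    r"\s*(?:\(([^)]*)\))?"  # optional role in parentheses
    r"\s*(?:,|<)"           # entry ends at a comma or at the closing tag
)

def parse_cell(text):
    """Return (name, role) pairs; role is '' when none is listed."""
    return [(name, role or "") for name, role in ENTRY.findall(text)]

for name, role in parse_cell(cell):
    print(repr(name), repr(role))
```

This also covers entries such as "Wolfgang Stumph" that have no role, which a pattern with a mandatory parenthesis would skip.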

w00dst0ck · 2008-08-20, 14:23

Found another solution.

Will submit the working scraper at http://trac.xbmc.org/ticket/4563

w00dst0ck · 2008-09-08, 16:38

Need some help again!

Don't know what's wrong. The RegEx works in a regex tester. So it must be the code.

Code:
<!--URL to Trailer-->

    <RegExp input="$$1" output="&lt;url function=&quot;GetTrailerLink&quot;&gt;http://www.moviemaze.de/media/trailer/\1.html&lt;/url&gt;" dest="5+">

        <expression>href=&quot;/media/trailer/(.*?).html&quot; ti</expression>

    </RegExp>

    <expression noclean="1"></expression>

    </RegExp>

</GetDetails>

<!--Trailer-->

    <GetTrailerLink dest="5">

        <RegExp input="$$2" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;&gt;&lt;details&gt;\1&lt;/details&gt;" dest="5+">

            <RegExp input="$$1" output="&lt;trailer urlencoded=&quot;yes&quot;&gt;http://www.moviemaze.de/media/trailer/delivery/\1.mov&lt;/trailer&gt;" dest="2">

                <expression>delivery/(.*?).mov&quot;</expression>

            </RegExp>

            <expression noclean="1"></expression>

        </RegExp>

    </GetTrailerLink>

Is there a way to implement more than one trailer?
The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.

Gamester17 · 2008-09-08, 17:46

w00dst0ck Wrote:Is there a way to implement more than one trailer? The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.

I am guessing that will count as a feature request, please submit a new ticket on trac http://trac.xbmc.org


w00dst0ck · 2008-09-09, 13:45

Sometimes the xbmc.log helps to solve a problem.
Submitted the working version with trailer support to trac.
Also submitted a feature request.

w00dst0ck · 2008-10-21, 10:03

I have the bulk of the fanart support done.

To get the missing imdb number I've created a google wrapper.

Code:
<!--URL to Google and Fanart-->

<RegExp conditional="fanart" input="$$1" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=site:imdb.com+moviemaze+\2+\1&lt;/url&gt;" dest="5+">

<expression>&lt;h2&gt;\(([^,]*), ([0-9]{4})</expression>

</RegExp>

The generated URL is:

Code:
http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight

And it results in:

Code:
INFO: Get URL: http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight

ERROR: Server returned: 400 Bad Request

I've discovered that the spaces in "The Dark Knight" have to be replaced with "+".
But I don't know how to replace that character with a regex. Any ideas about that?

**spiff** · 2008-10-21, 10:28

something along these lines:

<RegExp input="$$1" output="\1+\2" dest="4">
<expression repeat="yes" noclean="1,2">(.*?) (.*)</expression>
</RegExp>
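For comparison, this is how the same query string could be built and checked outside the scraper. It is a Python sketch, not scraper code; `google_query_url` and `plus_join` are hypothetical helpers, and the URL layout is copied from the log output above.

```python
import re
from urllib.parse import quote_plus

def plus_join(text: str) -> str:
    """Replace each run of spaces with a single '+', regex style."""
    return re.sub(r" +", "+", text.strip())

def google_query_url(title: str, year: str) -> str:
    """Build the search URL with spaces encoded as '+'."""
    query = f"site:imdb.com moviemaze {year} {title}"
    # safe=":" keeps the colon in "site:" unescaped, as in the original URL.
    return "http://www.google.com/search?q=" + quote_plus(query, safe=":")

print(plus_join("The Dark Knight"))
print(google_query_url("The Dark Knight", "2008"))
```

Comparing the printed URL with the one in xbmc.log is a quick way to confirm that the regex-based replacement in the scraper produces the intended result.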