Kodi Community Forum

Full Version: moviemaze.de scraper development - help needed
Hi there,

I'm currently developing a scraper for http://www.moviemaze.de. I managed to scrape most of the provided information but have trouble with the following issues:

Director:

Moviemaze.de HTML
[HTML] <td valign=top class="standard">
<span class="fett">Regie:</span>
</td>

<td valign=top class="standard_justify">
Guillermo Del Toro </td>
</tr>
[/HTML]

my regex:
Code:
<RegExp input="$$6" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="6">
    <expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>
    </RegExp>
    <expression>([A-Za-z0-9 ,.]+)</expression>
</RegExp>

scrap.exe delivers the right results, but XBMC displays "TRIM" as the result.
I decided to get the result in two steps because it's surrounded by tabs.
Any ideas why XBMC displays TRIM?


Actors:

Moviemaze.de HTML
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane) </td>

</tr>[/HTML]

My regex:
Code:
<RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;\2&lt;/role&gt;&lt;/actor&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="2">
        <expression repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
    </RegExp>
    <expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>
</RegExp>

It works, but it misses the actors with an href link, and I did not manage to find a solution.


German Umlaut [äöüß]:

I can't match words containing German umlauts [äöüß].
I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9&auml;&Auml;&ouml;&Ouml;&uuml;&Uuml; ,.]+) without success.


Can somebody please help me?

regards,
w00dst0ck
If that's a c&p:
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

actors; try adding the href stuff with a ?, i.e. make it optional

umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)
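
To illustrate the "make the href part optional" suggestion, here is a minimal Python sketch rather than scraper XML. The pattern and the variable names are illustrative assumptions, not the expression used in the final scraper, and the XBMC regex engine may differ in details from Python's `re` module.

```python
import re

# Shortened sample of the actor cell from the moviemaze.de page quoted above.
cell = ('<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
        'Maggie Gyllenhaal (Rachel Dawes), '
        '<a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox)')

# The opening <a ...> and the closing </a> are both wrapped in optional
# non-capturing groups, so linked and unlinked names match the same pattern.
actor_re = re.compile(r'(?:<a href="[^"]*">)?([^<>(),]+?)(?:</a>)?\s*\(([^)]*)\)')

# Strip the whitespace that follows the separating commas.
actors = [(name.strip(), role.strip()) for name, role in actor_re.findall(cell)]

for name, role in actors:
    print(name, "as", role)
```

The same idea carries over to the scraper expression: wrap the link markup in a group followed by `?` and capture only the name and role.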
Thanks for the reply!

spiff Wrote:If that's a c&p:
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

I added the missing ", but it makes no difference. I also tried without the trim option, but then no information is returned at all.


spiff Wrote:umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

How do I set the encoding of the scraper?
The moviemaze.de page is encoded with iso-8859-1 and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
You set the encoding of the scraper XML file using exactly the kind of header you just pasted.
w00dst0ck Wrote:Added the missing " but there is no difference. Also tried without the trim option. But that results in no information provided.

It seems you're missing a ">" too after "standard_justify". It's not required for the match, but it will be returned in \2, so that's a problem.
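
As a rough illustration of why the declared encoding has to match the bytes actually delivered, here is a small Python sketch. It only demonstrates the general byte-level behaviour of iso-8859-1 versus UTF-8; it makes no claim about how XBMC handles the header internally.

```python
# An actor name with an umlaut, encoded the way an iso-8859-1 page delivers it.
raw = "Matthias Schweighöfer".encode("iso-8859-1")

# Decoding with the matching encoding round-trips cleanly.
decoded = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails, because a lone 0xF6 byte
# is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```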

For umlauts, as spiff already said, use "iso-8859-1". But if you're using wildcard matches, be sure to include "&", "#", ";" and digits, because umlauts appear as numeric entities in the page source:

Heerf & # 2 5 2 ; hrer
Ma & # 2 2 3 ; arbeit

Heerführer
Maßarbeit

I had to space out the entities so the forum shows them literally; of course the spaces aren't in the real source.
Gaarv Wrote:Seems you're missing a ">" too after "standard_justify". It's not required for the match, but it will be returned in \2, so that's a problem.
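
A short Python sketch of the point about entities, assuming the page source contains the two words from the example above in their numeric-entity form. The character class is illustrative; `html.unescape` is the standard-library call that turns the entities back into characters.

```python
import html
import re

# The two example words as they appear in the page source (without spaces).
source = "Heerf&#252;hrer Ma&#223;arbeit"

# Adding "&", "#" and ";" (digits are already covered by 0-9) lets the
# character class span the numeric entity instead of stopping in front of it.
word_re = re.compile(r"[A-Za-z0-9&#;]+")
raw_words = word_re.findall(source)

# Decode the entities afterwards to get the readable text.
decoded_words = [html.unescape(word) for word in raw_words]

print(raw_words)
print(decoded_words)
```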

Thanx, that solves the problem.

The problem with the umlaut is solved by using (.*)

Next I will try to solve my href problem...
I did this quickly; you will have to adapt it or limit where it applies on the page, but it gives the idea:

Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))

In the example you gave, this matches all actor names and roles, sometimes with a </a> that needs to be cleaned up.
Hi there and thanx for your help.
I've only one problem left with the actors part.
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz </td>
</tr>[/HTML]


Code:
<!--Actors-->    
    <RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\5&lt;/role&gt;&lt;/actor&gt;" dest="5+">
        <RegExp input="$$1" output="\2" dest="2">
            <expression trim="2" repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
        </RegExp>
        <expression repeat="yes">(&lt;a href\="[^&gt;]*&gt;)?(.*?)(&lt;/a&gt;)?( \((.*?)\))?, </expression>
    </RegExp>

The 10 tabs in front of the first actor name end up in \2. I've tried to strip them with [\t]{10}?,
but I think the XBMC scraper engine doesn't understand \t.
Any idea how I can solve this?
\\t
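
Independent of how the scraper engine wants \t escaped, the leading tabs can also be kept out of the capture group by consuming whitespace in front of it. The following Python sketch shows that idea on a shortened version of the actor cell above. The pattern is an illustrative assumption, not the solution that ended up in the scraper, and it also handles entries without a role and the last entry, which has no trailing comma.

```python
import re

# Shortened actor cell: ten leading tabs, some actors without a role,
# and no comma after the last entry.
cell = ("\t" * 10
        + "Til Schweiger (Ludo), Nora Tschirner (Anna), "
        + "Wolfgang Stumph, Christian Tramitz \t")

# \s* in front of the name group swallows tabs and spaces, the role in
# parentheses is optional, and an entry ends at a comma or at end of input.
entry_re = re.compile(r"\s*([^,(]+?)\s*(?:\(([^)]*)\))?\s*(?:,|$)")

# For entries without a role, the second group comes back as an empty string.
actors = entry_re.findall(cell)

for name, role in actors:
    print(repr(name), repr(role))
```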
Found another solution.

Will submit the working scraper at http://trac.xbmc.org/ticket/4563
Need some help again!

Don't know what's wrong. The RegEx works in a regex tester. So it must be the code.

Code:
<!--URL to Trailer-->
    <RegExp input="$$1" output="&lt;url function=&quot;GetTrailerLink&quot;&gt;http://www.moviemaze.de/media/trailer/\1.html&lt;/url&gt;" dest="5+">
        <expression>href=&quot;/media/trailer/(.*?).html&quot; ti</expression>
    </RegExp>

    <expression noclean="1"></expression>
    </RegExp>
</GetDetails>

<!--Trailer-->
    <GetTrailerLink dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;&gt;&lt;details&gt;\1&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;trailer urlencoded=&quot;yes&quot;&gt;http://www.moviemaze.de/media/trailer/delivery/\1.mov&lt;/trailer&gt;" dest="2">
                <expression>delivery/(.*?).mov&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetTrailerLink>

Is there a way to implement more than one trailer?
The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.
w00dst0ck Wrote:Is there a way to implement more than one trailer? The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.
I am guessing that will count as a feature request, please submit a new ticket on trac http://trac.xbmc.org

Sometimes the xbmc.log helps to solve a problem.
Submitted the working version with trailer support to trac.
Also submitted a feature request.
I have the bulk of the fanart support done.

To get the missing IMDb number, I've created a Google wrapper.
Code:
<!--URL to Google and Fanart-->
<RegExp conditional="fanart" input="$$1" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=site:imdb.com+moviemaze+\2+\1&lt;/url&gt;" dest="5+">
<expression>&lt;h2&gt;\(([^,]*), ([0-9]{4})</expression>
</RegExp>

The generated URL is:
Code:
http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight

And it results in:
Code:
INFO: Get URL: http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight
ERROR: Server returned: 400 Bad Request

I've discovered that the spaces in "The Dark Knight" have to be replaced with "+".
But I don't know how to replace that character with a regex. Any ideas?
Something along these lines:

<RegExp input="$$1" output="\1+\2" dest="4">
<expression repeat="yes" noclean="1,2">(.*?) (.*)</expression>
</RegExp>
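
For comparison, here is what the space-to-plus replacement is meant to achieve, written as a small Python sketch. The function name `build_search_url` is hypothetical; `quote_plus` and `re.sub` are standard-library calls. This only shows the target URL shape, not how the scraper engine performs the substitution.

```python
import re
from urllib.parse import quote_plus


def build_search_url(title: str, year: str) -> str:
    """Build the Google query URL from the thread with the title made URL-safe."""
    # quote_plus turns spaces into "+" and percent-encodes other unsafe characters.
    return ("http://www.google.com/search?q=site:imdb.com+moviemaze+"
            + year + "+" + quote_plus(title))


url = build_search_url("The Dark Knight", "2008")

# The pure regex equivalent of the replacement asked about above.
plain = re.sub(r" ", "+", "The Dark Knight")

print(url)
print(plain)
```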