moviemaze.de scraper development - help needed

  Thread Rating:
  • 1 Votes - 5 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Post Reply
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Question  moviemaze.de scraper development - help needed
Post: #1
Hi there,

im currently developing a http://www.moviemaze.de scraper. I managed to scrap most of the provided informations but have some trouble with the following issues:

Director:

Moviemaze.de HTML
[HTML] <td valign=top class="standard">
<span class="fett">Regie:</span>
</td>

<td valign=top class="standard_justify">
Guillermo Del Toro </td>
</tr>
[/HTML]

my regex:
Code:
<RegExp input="$$6" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="6">
    <expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>
    </RegExp>
    <expression>([A-Za-z0-9 ,.]+)</expression>
</RegExp>

scap.exe delivers the right results, but XBMC displays a TRIM as result.
I decided to get the result in two steps because it's surrounded by tabs.
Any ideas why XBMC displays TRIM?


Actors:

Moviemaze.de HTML
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane) </td>

</tr>[/HTML]

My regex:
Code:
<RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;\2&lt;/role&gt;&lt;/actor&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="2">
        <expression repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
    </RegExp>
    <expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>
</RegExp>

It works, but I miss the actors with a href link and I did not managed to find a solution.


German Umlaut [äöüß]:

I can't grep words with characters like "german umlaute" / [äöüß].
I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9&auml;&Auml;&ouml;&Ouml;&uuml;&Uuml; ,.]+) without success.


Can somebody please help me?

regards,
w00dst0ck
find quote
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #2
if that's a c&p;
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

actors; try adding the href stuff with a ?, i.e. make it optional

umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)
find quote
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #3
thanx for reply!

spiff Wrote:if that's a c&p;
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

Added the missing " but there is no difference. Also tried without the trim option. But that results in no information provided.


spiff Wrote:umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

How do i set the encoding of the scraper?
The moviemaze.de page is encoded with iso-8859-1 and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
find quote
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #4
you set the encoding of the scraper xml file using exactly that kinda header as you just pasted
find quote
Gaarv Offline
Junior Member
Posts: 35
Joined: May 2008
Reputation: 0
Post: #5
w00dst0ck Wrote:Added the missing " but there is no difference. Also tried without the trim option. But that results in no information provided.

Seems you're missing a ">" too after "standard_justify". Its not required to match, but it will be returned in /2 so thats a problem.

For umlaut, like spiff already said use ""so-8859-1" but be sure if you're using wilcard matches to include "&", "#" ";" and numbers because umlaut is displayed as it in the source :

Heerf & # 2 5 2 ; hrer
Ma & # 2 2 3 ; arbeit

Heerführer
Maßarbeit

I had to space out the code to illustrate, of course they aren't needed.


XBMC Linux Ubuntu 8.04 - Antec Fusion Black
Gigabyte MA78GM-S2H - AMD Athlon 64 X2 BE-2350
Corsair 2Go DDRII PC6400 - Samsung Spinpoint 500 Go
Sony Bravia KDL-40W4000 - Logitech Harmony 555
find quote
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #6
Gaarv Wrote:Seems you're missing a ">" too after "standard_justify". Its not required to match, but it will be returned in /2 so thats a problem.

Thanx, that solves the problem.

The problem with the umlaut is solved by using (.*) Nod

Next I will try to solve my href problem...
find quote
Gaarv Offline
Junior Member
Posts: 35
Joined: May 2008
Reputation: 0
Post: #7
I did this rapidly, you will have to adapt ot limit its application in the whole page but it gave the idea

Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))

In the exemple you gave this match all actor names and role, with sometimes a </a> that needs to be cleaned


XBMC Linux Ubuntu 8.04 - Antec Fusion Black
Gigabyte MA78GM-S2H - AMD Athlon 64 X2 BE-2350
Corsair 2Go DDRII PC6400 - Samsung Spinpoint 500 Go
Sony Bravia KDL-40W4000 - Logitech Harmony 555
find quote
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #8
Hi there and thanx for your help.
I've only one problem left with the actors part.
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz </td>
</tr>[/HTML]


Code:
<!--Actors-->    
    <RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\5&lt;/role&gt;&lt;/actor&gt;" dest="5+">
        <RegExp input="$$1" output="\2" dest="2">
            <expression trim="2" repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
        </RegExp>
        <expression repeat="yes">(&lt;a href\="[^&gt;]*&gt;)?(.*?)(&lt;/a&gt;)?( \((.*?)\))?, </expression>
    </RegExp>

The 10 tabs in front of the first actor name are added in \2. I've tried to clear them out with [\t]{10}?
but i think the XBMC scraper engine don't understand \t.
Any idea how I can get solve this?
find quote
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #9
\\t
find quote
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #10
Found another solution.

Will submit the working scraper at http://trac.xbmc.org/ticket/4563
find quote
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #11
Need some help again!

Don't know what's wrong. The RegEx works in a regex tester. So it must be the code. Sad

Code:
<!--URL to Trailer-->
    <RegExp input="$$1" output="&lt;url function=&quot;GetTrailerLink&quot;&gt;http://www.moviemaze.de/media/trailer/\1.html&lt;/url&gt;" dest="5+">
        <expression>href=&quot;/media/trailer/(.*?).html&quot; ti</expression>
    </RegExp>

    <expression noclean="1"></expression>
    </RegExp>
</GetDetails>

<!--Trailer-->
    <GetTrailerLink dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;&gt;&lt;details&gt;\1&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;trailer urlencoded=&quot;yes&quot;&gt;http://www.moviemaze.de/media/trailer/delivery/\1.mov&lt;/trailer&gt;" dest="2">
                <expression>delivery/(.*?).mov&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetTrailerLink>

Is there a way to implement more than one trailer?
The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.
find quote
Gamester17 Offline
Team-XBMC Forum Moderator
Posts: 10,523
Joined: Sep 2003
Reputation: 10
Location: Sweden
Post: #12
w00dst0ck Wrote:Is there a way to implement more than one trailer? The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.
I am guessing that will count as a feature request, please submit a new ticket on trac http://trac.xbmc.org

Wink

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
find quote
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #13
Sometimes the xbmc.log helps to solve a problem.
Submitted the working version with trailer support to trac.
Also submitted a feature request.
find quote
w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #14
I have the bulk part of fanart done.

To get the missing imdb number I've created a google wrapper.
Code:
<!--URL to Google and Fanart-->
<RegExp conditional="fanart" input="$$1" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=site:imdb.com+moviemaze+\2+\1&lt;/url&gt;" dest="5+">
<expression>&lt;h2&gt;\(([^,]*), ([0-9]{4})</expression>
</RegExp>

The generated URL is:
Code:
http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight

And it results in:
Code:
INFO: Get URL: http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight
ERROR: Server returned: 400 Bad Request

I've discover that the spaces in "The Dark Knight" have to be replaced with "+".
But I don't know how to replace that char with an regex. Any ideas about that?
find quote
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #15
something along

<RegExp input=$$1 output="\1+\2" dest="4">
<expression repeat="yes" noclean="1,2">(.*?) (.*)</expression>
</RegExp>
find quote
Post Reply