moviemaze.de scraper development - help needed

w00dst0ck Offline
Junior Member
Posts: 37
Joined: Aug 2008
Reputation: 0
Location: Germany
Post: #1
Hi there,

I'm currently developing a scraper for http://www.moviemaze.de. I've managed to scrape most of the information the site provides, but I'm having trouble with the following issues:

Director:

Moviemaze.de HTML
[HTML] <td valign=top class="standard">
<span class="fett">Regie:</span>
</td>

<td valign=top class="standard_justify">
Guillermo Del Toro </td>
</tr>
[/HTML]

my regex:
Code:
<RegExp input="$$6" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="6">
    <expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>
    </RegExp>
    <expression>([A-Za-z0-9 ,.]+)</expression>
</RegExp>

scrap.exe delivers the right results, but XBMC displays "TRIM" as the result.
I decided to extract the value in two steps because it is surrounded by tabs.
Any ideas why XBMC displays TRIM?


Actors:

Moviemaze.de HTML
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane) </td>

</tr>[/HTML]

My regex:
Code:
<RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;\2&lt;/role&gt;&lt;/actor&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="2">
        <expression repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
    </RegExp>
    <expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>
</RegExp>

It works, but it misses the actors whose names are wrapped in an href link, and I haven't managed to find a solution.


German Umlaut [äöüß]:

I can't match words containing German umlauts and ß, i.e. [äöüß].
I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9&auml;&Auml;&ouml;&Ouml;&uuml;&Uuml; ,.]+) without success.


Can somebody please help me?

regards,
w00dst0ck
spiff Offline
Grumpy Bastard Developer
Posts: 12,180
Joined: Nov 2003
Reputation: 82
Post: #2
if that's a c&p;
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

actors; try adding the href stuff with a ?, i.e. make it optional
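
To illustrate the "make it optional" idea outside XBMC, here is a minimal Python sketch. It is not scraper XML: Python's re module stands in for the scraper's regex engine, the sample string is a shortened copy of the cell from post #1, and the names cell, ACTOR and actors are made up for this example.

```python
import re

# Shortened copy of the "Darsteller" cell quoted in post #1.
cell = (
    '<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
    '<a href="/celebs/114/1.html">Heath Ledger</a> (Joker), '
    'Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon)'
)

# (?:<a [^>]*>)?  optional opening link tag
# ([^<(,]+?)      actor name: anything up to a tag, "(" or ","
# (?:</a>)?       optional closing link tag
# \s*\(([^)]*)\)  role in parentheses
ACTOR = re.compile(r'(?:<a [^>]*>)?([^<(,]+?)(?:</a>)?\s*\(([^)]*)\)')

# Strip the whitespace that the lazy name group picks up after a comma.
actors = [(name.strip(), role.strip()) for name, role in ACTOR.findall(cell)]

for name, role in actors:
    print(f"{name} as {role}")
```

Because both link tags are optional, linked and unlinked names go through the same expression. Actors listed without a role in parentheses would still be skipped by this particular pattern.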

umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
w00dst0ck
Post: #3
Thanks for the reply!

spiff Wrote:if that's a c&p;
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

Added the missing " but it makes no difference. I also tried without the trim option, but then no information is returned at all.


spiff Wrote:umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

How do I set the encoding of the scraper?
The moviemaze.de page is encoded as iso-8859-1, and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
spiff
Post: #4
you set the encoding of the scraper xml file using exactly that kinda header as you just pasted
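
Why the declared encoding matters can be shown with a small Python sketch. This is only an illustration under assumptions: Python's decoding stands in for whatever XBMC does internally, and the names raw, text, wrong and NAME are invented here. The sample name comes from the cast list in post #8.

```python
import re

# Bytes as an ISO-8859-1 encoded page would deliver them.
raw = "Matthias Schweighöfer (Moritz)".encode("iso-8859-1")

# Decoding with the matching charset keeps the umlaut intact.
text = raw.decode("iso-8859-1")

# Decoding with a mismatched charset loses it: the single byte for "ö"
# is not valid UTF-8 and becomes the replacement character U+FFFD.
wrong = raw.decode("utf-8", errors="replace")

# Character class with the German special characters, as tried in post #1.
NAME = re.compile(r"[A-Za-zäÄöÖüÜß ]+")

good_match = NAME.match(text).group(0).strip()
bad_match = NAME.match(wrong).group(0).strip()

print(repr(good_match))
print(repr(bad_match))
```

With the matching charset the class covers the whole name. With the mismatched one the match stops where the umlaut used to be.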

Gaarv Offline
Junior Member
Posts: 35
Joined: May 2008
Reputation: 0
Post: #5
w00dst0ck Wrote: Added the missing " but it makes no difference. I also tried without the trim option, but then no information is returned at all.

It seems you're also missing a ">" after "standard_justify". It's not required for the match, but without it the ">" is returned in \2, and that's the problem.
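
The effect of the missing ">" can be reproduced with a short Python sketch. Several things here are assumptions: Python's re module (with re.S so that "." crosses line breaks) stands in for the scraper engine, the three tabs in front of the name are a guess at the whitespace the forum stripped, and the variable names are invented.

```python
import re

# Director block from post #1; the leading tabs are an assumption.
html = (
    '<td valign=top class="standard">\n'
    '<span class="fett">Regie:</span>\n'
    '</td>\n'
    '\n'
    '<td valign=top class="standard_justify">\n'
    '\t\t\tGuillermo Del Toro </td>\n'
    '</tr>\n'
)

# Original expression: the ">" that closes the <td> tag is not consumed,
# so it ends up at the start of the second group.
without_gt = re.search(r'Regie([^"]*)"standard_justify"(.*?)<', html, re.S)

# Expression with the ">" added after "standard_justify".
with_gt = re.search(r'Regie([^"]*)"standard_justify">(.*?)<', html, re.S)

print(repr(without_gt.group(2)))
print(repr(with_gt.group(2)))
```

Only the first capture begins with ">"; the second begins with the whitespace in front of the name, which the second-stage expression then skips.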

For the umlauts, as spiff already said, use "iso-8859-1", but if you're using wildcard matches, be sure to include "&", "#", ";" and digits, because this is how an umlaut appears in the page source:

Heerf & # 2 5 2 ; hrer
Ma & # 2 2 3 ; arbeit

Heerführer
Maßarbeit

I had to space out the entities to show them here; the spaces are of course not in the real source.
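
A small Python sketch of the point about numeric entities. It is an illustration only: Python's re and html modules stand in for the scraper engine, and the names source, narrow, wide and decoded are made up.

```python
import html
import re

# The two words as they would appear in the page source (no spaces).
source = "Heerf&#252;hrer, Ma&#223;arbeit"

# A plain letter class stops at the "&" that starts the entity.
narrow = re.match(r"[A-Za-z]+", source).group(0)

# Allowing "&", "#", ";" and digits lets the match run through the entity.
wide = re.match(r"[A-Za-z0-9&#;]+", source).group(0)

# Decoding the numeric reference afterwards yields the real character.
decoded = html.unescape(wide)

print(narrow)
print(wide)
print(decoded)
```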


XBMC Linux Ubuntu 8.04 - Antec Fusion Black
Gigabyte MA78GM-S2H - AMD Athlon 64 X2 BE-2350
Corsair 2Go DDRII PC6400 - Samsung Spinpoint 500 Go
Sony Bravia KDL-40W4000 - Logitech Harmony 555
w00dst0ck
Post: #6
Gaarv Wrote: It seems you're also missing a ">" after "standard_justify". It's not required for the match, but without it the ">" is returned in \2, and that's the problem.

Thanks, that solves the problem.

The umlaut problem is solved by using (.*)

Next I will try to solve my href problem...
Gaarv
Post: #7
I did this quickly; you will have to adapt it to limit where it applies within the whole page, but it should give you the idea:

Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))

In the example you gave, this matches all actor names and roles, sometimes with a </a> that needs to be cleaned up.


w00dst0ck
Post: #8
Hi there, and thanks for your help.
I have only one problem left, with the actors part.
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz </td>
</tr>[/HTML]


Code:
<!--Actors-->    
    <RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\5&lt;/role&gt;&lt;/actor&gt;" dest="5+">
        <RegExp input="$$1" output="\2" dest="2">
            <expression trim="2" repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
        </RegExp>
        <expression repeat="yes">(&lt;a href\="[^&gt;]*&gt;)?(.*?)(&lt;/a&gt;)?( \((.*?)\))?, </expression>
    </RegExp>

The 10 tabs in front of the first actor name end up in \2. I've tried to strip them with [\t]{10}?
but I think the XBMC scraper engine doesn't understand \t.
Any idea how I can solve this?
spiff
Post: #9
\\t
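
The doubled backslash is specific to the scraper XML. To prototype the same cleanup outside XBMC, a rough Python sketch could look like the one below. It assumes ten leading tabs as described in post #8, uses a shortened cast list, and splits on commas, which only works as long as no role contains a comma. The names cell, trimmed, entries, ENTRY and actors are invented for the example.

```python
import re

# Assumed cell contents: ten leading tabs, then a shortened cast list.
cell = "\t" * 10 + (
    "Til Schweiger (Ludo), Nora Tschirner (Anna), "
    "Wolfgang Stumph, Christian Tramitz "
)

# Remove the leading tabs, then any trailing whitespace.
trimmed = re.sub(r"^\t+", "", cell).rstrip()

# Split the list on the ", " separators.
entries = re.split(r",\s*", trimmed)

# Name, followed by an optional "(role)".
ENTRY = re.compile(r"^([^(]+?)\s*(?:\(([^)]*)\))?$")

# Actors without a role get None in the second position.
actors = [ENTRY.match(entry).groups() for entry in entries]

for name, role in actors:
    print(name, "-", role)
```

Splitting first also handles the last entry, which has no trailing ", " after it.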

w00dst0ck
Post: #10
Found another solution.

Will submit the working scraper at http://trac.xbmc.org/ticket/4563