Kodi Community Forum
moviemaze.de scraper development - help needed - Printable Version


moviemaze.de scraper development - help needed - w00dst0ck - 2008-08-19

Hi there,

I'm currently developing a scraper for http://www.moviemaze.de. I've managed to scrape most of the information it provides, but I'm having trouble with the following issues:

Director:

Moviemaze.de HTML
[HTML] <td valign=top class="standard">
<span class="fett">Regie:</span>
</td>

<td valign=top class="standard_justify">
Guillermo Del Toro </td>
</tr>
[/HTML]

my regex:
Code:
<RegExp input="$$6" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="6">
    <expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>
    </RegExp>
    <expression>([A-Za-z0-9 ,.]+)</expression>
</RegExp>

scrap.exe delivers the right results, but XBMC displays "TRIM" as the result.
I decided to extract the result in two steps because it's surrounded by tabs.
Any ideas why XBMC displays TRIM?


Actors:

Moviemaze.de HTML
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane) </td>

</tr>[/HTML]

My regex:
Code:
<RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;\2&lt;/role&gt;&lt;/actor&gt;" dest="5+">
    <RegExp input="$$1" output="\2" dest="2">
        <expression repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
    </RegExp>
    <expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>
</RegExp>

It works, but it misses the actors whose names are wrapped in an <a href> link, and I haven't managed to find a solution.


German Umlaut [äöüß]:

I can't match words containing characters like German umlauts [äöüß].
I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9&auml;&Auml;&ouml;&Ouml;&uuml;&Uuml; ,.]+) without success.


Can somebody please help me?

regards,
w00dst0ck


- spiff - 2008-08-19

if that's a c&p;
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

actors; try adding the href stuff with a ?, i.e. make it optional

umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)


- w00dst0ck - 2008-08-19

thanx for reply!

spiff Wrote:if that's a c&p;
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>

missing a "

Added the missing " but there is no difference. Also tried without the trim option. But that results in no information provided.


spiff Wrote:umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

How do I set the encoding of the scraper?
The moviemaze.de page is encoded as iso-8859-1, and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>


- spiff - 2008-08-19

you set the encoding of the scraper xml file using exactly the kind of header you just pasted
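
To see why the declared encoding has to match the page, here is a minimal sketch in Python rather than scraper XML. Python's `re` module is only a stand-in for the scraper's regex engine, and misreading the bytes as UTF-8 is an assumed failure mode for illustration; the thread does not confirm which mismatch actually occurred.

```python
import re

name = "Matthias Schweighöfer"
# How an ISO-8859-1 page stores the name on the wire.
page_bytes = name.encode("iso-8859-1")

# The character class from the first post, umlauts included.
pattern = re.compile(r"[A-Za-z0-9äÄöÖüÜß ,.]+")

# Decoded with the page's real encoding, the class matches the whole name.
good = page_bytes.decode("iso-8859-1")
# Decoded with a wrong assumption (UTF-8 here), the "ö" byte is invalid
# and gets replaced, so the match stops right before it.
bad = page_bytes.decode("utf-8", errors="replace")

print(pattern.match(good).group(0))
print(pattern.match(bad).group(0))
```

The pattern is identical in both cases; only the decoding step differs.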


- Gaarv - 2008-08-19

w00dst0ck Wrote:Added the missing " but there is no difference. Also tried without the trim option. But that results in no information provided.

Seems you're missing a ">" after "standard_justify" too. It's not required for the match, but without it the ">" is returned in \2, so that's a problem.
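
A small sketch of that point, using Python's `re` as a stand-in for the scraper engine. The `re.S` flag (dot matches newlines) and the two tabs before the name are assumptions made so the example runs on its own. It only shows what ends up in \2; it does not explain why XBMC rendered TRIM.

```python
import re

# Director cell from the first post, with tabs before the name.
html = (
    '<span class="fett">Regie:</span>\n</td>\n\n'
    '<td valign=top class="standard_justify">\n'
    '\t\tGuillermo Del Toro </td>\n</tr>'
)

# Original pattern: nothing consumes the ">" that closes the <td ...> tag,
# so it becomes the first character of the second capture group.
without_gt = re.search(r'Regie([^"]*)"standard_justify"(.*?)<', html, re.S)

# Corrected pattern: the ">" is matched outside the capture group.
with_gt = re.search(r'Regie([^"]*)"standard_justify">(.*?)<', html, re.S)

print(repr(without_gt.group(2)))
print(repr(with_gt.group(2)))
```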

For umlauts, as spiff already said, use "iso-8859-1", but if you're using wildcard matches, be sure to include "&", "#", ";" and digits, because this is how an umlaut appears in the source:

Heerf & # 2 5 2 ; hrer
Ma & # 2 2 3 ; arbeit

Heerführer
Maßarbeit

I had to space out the code to illustrate it; of course the spaces aren't needed.
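
The effect of those numeric entities on a character class can be sketched like this. It uses Python's `re` and `html` modules as stand-ins; the scraper engine itself is not involved, and decoding the entities afterwards is an extra step added here for illustration.

```python
import html
import re

# The two example words as they appear in the page source (no spaces).
source = "Heerf&#252;hrer, Ma&#223;arbeit"

# A plain letter class stops at the "&" that starts each numeric entity.
letters_only = re.findall(r"[A-Za-z]+", source)

# Allowing "&", "#", ";" and digits keeps each word in one piece.
with_entities = re.findall(r"[A-Za-z0-9&#;]+", source)

# Turning the entities back into real characters.
decoded = [html.unescape(word) for word in with_entities]

print(letters_only)
print(with_entities)
print(decoded)
```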


- w00dst0ck - 2008-08-19

Gaarv Wrote:Seems you're missing a ">" after "standard_justify" too. It's not required for the match, but without it the ">" is returned in \2, so that's a problem.

Thanx, that solves the problem.

The problem with the umlaut is solved by using (.*)

Next I will try to solve my href problem...


- Gaarv - 2008-08-19

I did this quickly; you will have to adapt it to limit its application within the whole page, but it gives the idea:

Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))

In the example you gave, this matches all actor names and roles, sometimes with a </a> that needs to be cleaned up.


- w00dst0ck - 2008-08-20

Hi there and thanx for your help.
I've only one problem left with the actors part.
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>

</td>
<td valign=top class="standard_justify" width=100%>
Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz </td>
</tr>[/HTML]


Code:
<!--Actors-->    
    <RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\5&lt;/role&gt;&lt;/actor&gt;" dest="5+">
        <RegExp input="$$1" output="\2" dest="2">
            <expression trim="2" repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
        </RegExp>
        <expression repeat="yes">(&lt;a href\="[^&gt;]*&gt;)?(.*?)(&lt;/a&gt;)?( \((.*?)\))?, </expression>
    </RegExp>

The 10 tabs in front of the first actor name end up in \2. I've tried to strip them with [\t]{10}?
but I think the XBMC scraper engine doesn't understand \t.
Any idea how I can solve this?
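
For comparison, the same extraction can be sketched outside the scraper engine. This is not the scraper solution from the thread: `parse_actors` is a hypothetical helper, Python's `re` stands in for the XBMC engine, and the approach (drop the link tags first, then split on commas) assumes that role names never contain commas.

```python
import re

def parse_actors(cell):
    """Return (name, role) pairs from the actors table cell."""
    # Drop <a href=...> and </a> so linked and unlinked names look alike.
    text = re.sub(r"</?a\b[^>]*>", "", cell)
    actors = []
    for entry in text.split(","):
        # strip() also removes the leading tabs and the trailing space.
        entry = entry.strip()
        if not entry:
            continue
        # Name, then an optional "(role)" at the end of the entry.
        m = re.fullmatch(r"([^()]+?)\s*(?:\(([^)]*)\))?", entry)
        if m:
            actors.append((m.group(1), m.group(2) or ""))
    return actors

# Sample built from the two cells quoted in the thread.
cell = (
    '\t\t<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
    'Maggie Gyllenhaal (Rachel Dawes), Wolfgang Stumph, Christian Tramitz '
)
print(parse_actors(cell))
```

Splitting first avoids the need for a trailing ", " after every entry, so the last actor in the cell is not lost.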


- spiff - 2008-08-20

\\t


- w00dst0ck - 2008-08-20

Found another solution.

Will submit the working scraper at http://trac.xbmc.org/ticket/4563


- w00dst0ck - 2008-09-08

Need some help again!

Don't know what's wrong. The RegEx works in a regex tester. So it must be the code.

Code:
<!--URL to Trailer-->
    <RegExp input="$$1" output="&lt;url function=&quot;GetTrailerLink&quot;&gt;http://www.moviemaze.de/media/trailer/\1.html&lt;/url&gt;" dest="5+">
        <expression>href=&quot;/media/trailer/(.*?).html&quot; ti</expression>
    </RegExp>

    <expression noclean="1"></expression>
    </RegExp>
</GetDetails>

<!--Trailer-->
    <GetTrailerLink dest="5">
        <RegExp input="$$2" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;&gt;&lt;details&gt;\1&lt;/details&gt;" dest="5+">
            <RegExp input="$$1" output="&lt;trailer urlencoded=&quot;yes&quot;&gt;http://www.moviemaze.de/media/trailer/delivery/\1.mov&lt;/trailer&gt;" dest="2">
                <expression>delivery/(.*?).mov&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetTrailerLink>

Is there a way to implement more than one trailer?
The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.
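
The two-step lookup in the code above (find the trailer page on the details page, then find the .mov links on the trailer page) can be sketched as follows. Both HTML snippets are invented for illustration, since the thread does not quote the real markup; `BASE` is just a local name, and the dots before `html` and `mov` are escaped here, which the original expressions do not do. Collecting every match shows how several trailer sizes could be picked up, but whether XBMC can offer a choice between them is a separate question.

```python
import re

BASE = "http://www.moviemaze.de/media/trailer/"

# Hypothetical markup; the real moviemaze.de pages may look different.
details_page = '<a href="/media/trailer/1234,the-dark-knight.html" title="Trailer">Trailer</a>'
trailer_page = (
    '<a href="/media/trailer/delivery/abc/small.mov">small</a>'
    '<a href="/media/trailer/delivery/abc/large.mov">large</a>'
)

# Step 1: link from the details page to the trailer page.
page_id = re.search(r'href="/media/trailer/(.*?)\.html" ti', details_page).group(1)
trailer_page_url = f"{BASE}{page_id}.html"

# Step 2: every .mov link on the trailer page, not just the first one.
trailer_urls = [
    f"{BASE}delivery/{path}.mov"
    for path in re.findall(r'delivery/(.*?)\.mov"', trailer_page)
]

print(trailer_page_url)
print(trailer_urls)
```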


- Gamester17 - 2008-09-08

w00dst0ck Wrote:Is there a way to implement more than one trailer? The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.
I am guessing that will count as a feature request, please submit a new ticket on trac http://trac.xbmc.org



- w00dst0ck - 2008-09-09

Sometimes the xbmc.log helps to solve a problem.
Submitted the working version with trailer support to trac.
Also submitted a feature request.


- w00dst0ck - 2008-10-21

I have the bulk of the fanart support done.

To get the missing imdb number I've created a google wrapper.
Code:
<!--URL to Google and Fanart-->
<RegExp conditional="fanart" input="$$1" output="&lt;url function=&quot;GoogleToIMDB&quot;&gt;http://www.google.com/search?q=site:imdb.com+moviemaze+\2+\1&lt;/url&gt;" dest="5+">
<expression>&lt;h2&gt;\(([^,]*), ([0-9]{4})</expression>
</RegExp>

The generated URL is:
Code:
http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight

And it results in:
Code:
INFO: Get URL: http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight
ERROR: Server returned: 400 Bad Request

I've discovered that the spaces in "The Dark Knight" have to be replaced with "+".
But I don't know how to replace that character with a regex. Any ideas?


- spiff - 2008-10-21

something along these lines:

<RegExp input="$$1" output="\1+\2" dest="4">
<expression repeat="yes" noclean="1,2">(.*?) (.*)</expression>
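
For reference, the URL the scraper needs to end up with can be produced like this in Python. This only shows the target format; `google_imdb_url` is a hypothetical helper, and nothing in the thread suggests the scraper engine offers a comparable function, which is why the regex approach above is needed there.

```python
from urllib.parse import quote_plus

def google_imdb_url(title, year):
    """Build the Google search URL with spaces encoded as '+'."""
    query = f"site:imdb.com moviemaze {year} {title}"
    # safe=":" keeps the colon in "site:imdb.com" unescaped.
    return "http://www.google.com/search?q=" + quote_plus(query, safe=":")

print(google_imdb_url("The Dark Knight", 2008))
```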
</RegExp>