2008-08-19, 11:44
Hi there,
I'm currently developing a scraper for http://www.moviemaze.de. I managed to scrape most of the provided information, but I'm having trouble with the following issues:
Director:
Moviemaze.de HTML
[HTML] <td valign=top class="standard">
<span class="fett">Regie:</span>
</td>
<td valign=top class="standard_justify">
Guillermo Del Toro </td>
</tr>
[/HTML]
My regex:
Code:
<RegExp input="$$6" output="<director>\1</director>" dest="5+">
    <RegExp input="$$1" output="\2" dest="6">
        <expression trim=2">Regie([^"]*)"standard_justify"(.*?)<</expression>
    </RegExp>
    <expression>([A-Za-z0-9 ,.]+)</expression>
</RegExp>
scrap.exe delivers the right results, but XBMC displays "TRIM" as the result.
I decided to extract the result in two steps because the name is surrounded by tabs.
Any ideas why XBMC displays TRIM?
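To make the two-step idea easier to follow outside of XBMC, here is a minimal sketch in Python. It only illustrates the logic of the two expressions above; Python's `re` module is not the engine XBMC uses, so it says nothing about why XBMC shows "TRIM". The `HTML` sample and the function name `extract_director` are made up for illustration, and the tabs in the sample are a guess at the page's real whitespace.

```python
import re

# Sample markup based on the Moviemaze.de snippet above.
# Assumption: the tabs around the name mimic the real page.
HTML = (
    '<td valign=top class="standard">\n'
    '<span class="fett">Regie:</span>\n'
    '</td>\n'
    '<td valign=top class="standard_justify">\n'
    '\t\tGuillermo Del Toro\t</td>\n'
    '</tr>\n'
)


def extract_director(html):
    """Hypothetical helper mirroring the two nested RegExp steps."""
    # Step 1: capture everything between "standard_justify" and the next "<".
    step1 = re.search(r'Regie([^"]*)"standard_justify"(.*?)<', html, re.DOTALL)
    if step1 is None:
        return None
    raw = step1.group(2)  # still contains ">", the newline and the tabs

    # Step 2: keep only the first run of name-like characters.
    step2 = re.search(r'([A-Za-z0-9 ,.]+)', raw)
    return step2.group(1).strip() if step2 else None


print(extract_director(HTML))  # Guillermo Del Toro
```

The second step does the trimming on its own here, because tabs and newlines are not part of the character class.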
Actors:
Moviemaze.de HTML
[HTML] <tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>
</td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane) </td>
</tr>[/HTML]
My regex:
Code:
<RegExp input="$$2" output="<actor><name>\1</name><role>\2</role></actor>" dest="5+">
    <RegExp input="$$1" output="\2" dest="2">
        <expression repeat="yes">Darsteller:([^%]*)%>(.*?)</tr</expression>
    </RegExp>
    <expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>
</RegExp>
It works, but it misses the actors whose names are wrapped in an a href link, and I have not managed to find a solution.
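One possible direction, sketched in Python rather than scraper XML: remove the `<a ...>` and `</a>` tags first, so linked and unlinked names look the same, and then match name and role with negated character classes instead of `[A-Za-z \-]`. This is only a sketch under assumptions: `CELL` is a shortened copy of the row above, `extract_actors` is a hypothetical name, and whether the same patterns behave identically in XBMC's regex engine is untested.

```python
import re

# Shortened copy of the "Darsteller" cell from the snippet above.
CELL = (
    '<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
    '<a href="/celebs/114/1.html">Heath Ledger</a> (Joker), '
    'Maggie Gyllenhaal (Rachel Dawes), '
    'Gary Oldman (Lt. James Gordon) '
)


def extract_actors(cell):
    """Hypothetical helper returning (name, role) pairs."""
    # Drop opening and closing <a> tags so every entry is plain text.
    plain = re.sub(r'</?a\b[^>]*>', '', cell)
    # Name: anything except comma and parentheses; role: anything up to ")".
    pairs = re.findall(r'([^,()]+?)\s*\(([^)]*)\)', plain)
    return [(name.strip(), role.strip()) for name, role in pairs]


for name, role in extract_actors(CELL):
    print(name, "->", role)
```

The negated classes also pick up roles such as "Bruce Wayne/ Batman" and "Lt. James Gordon", which contain "/" and "." and would not fit into `[A-Za-z \-]*`.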
German umlauts [äöüß]:
I can't match words that contain German umlauts or ß, i.e. the characters [äöüß].
I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9äÄöÖüÜ ,.]+) without success.
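One thing that might be worth ruling out is an encoding mismatch between the scraper file and the downloaded page. This is a guess, not something confirmed for XBMC or for moviemaze.de. The Python sketch below only shows the general effect: the same character class matches when pattern and text share an encoding, and stops at the first umlaut when the pattern is UTF-8 bytes and the text is ISO-8859-1 bytes. `SAMPLE` and `first_match` are hypothetical names used for illustration.

```python
import re

CHAR_CLASS = "([A-Za-z0-9äÄöÖüÜß ,.]+)"
SAMPLE = "Jürgen Vogel"  # hypothetical sample text


def first_match(pattern, data):
    """Return the first captured group, or None if nothing matches."""
    m = re.search(pattern, data)
    return m.group(1) if m else None


# Pattern and text in the same (Unicode) representation: the whole name matches.
same_encoding = first_match(CHAR_CLASS, SAMPLE)

# Assumed mismatch: pattern stored as UTF-8, page delivered as ISO-8859-1.
# The single byte for "ü" (0xFC) is not among the UTF-8 bytes in the class.
mismatched = first_match(CHAR_CLASS.encode("utf-8"), SAMPLE.encode("iso-8859-1"))

print(same_encoding)  # Jürgen Vogel
print(mismatched)     # b'J'
```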
Can somebody please help me?
regards,
w00dst0ck