Kodi Community Forum (https://forum.kodi.tv) - Forum: Development - Forum: Scrapers - Thread: moviemaze.de scraper development - help needed (/showthread.php?tid=36080)

moviemaze.de scraper development - help needed - w00dst0ck - 2008-08-19

Hi there,

I'm currently developing a scraper for http://www.moviemaze.de. I have managed to scrape most of the information the site provides, but I am having trouble with the following issues.

Director:

Moviemaze.de HTML
[HTML]
<td valign=top class="standard">
<span class="fett">Regie:</span>
</td>
<td valign=top class="standard_justify">
Guillermo Del Toro
</td>
</tr>
[/HTML]

My regex:
Code:
<RegExp input="$$6" output="<director>\1</director>" dest="5+">

scrap.exe delivers the right result, but XBMC displays "TRIM" as the result. I decided to get the result in two steps because it is surrounded by tabs. Any ideas why XBMC displays TRIM?

Actors:

Moviemaze.de HTML
[HTML]
<tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>
</td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane)
</td>
</tr>
[/HTML]

My regex:
Code:
<RegExp input="$$2" output="<actor><name>\1</name><role>\2</role></actor>" dest="5+">

It works, but it misses the actors that are wrapped in an a href link, and I have not managed to find a solution.

German umlauts [äöüß]:

I can't match words that contain German umlauts [äöüß]. I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9äÄöÖüÜ ,.]+) without success.

Can somebody please help me?

Regards,
w00dst0ck

- spiff - 2008-08-19

If that's a c&p:
Code:
<expression trim=2">Regie([^"]*)"standard_justify"(.*?)<</expression>
then it is missing a ".

Actors: try adding the href stuff with a ?, i.e. make it optional.

Umlauts: this might have to do with encoding. Make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding).

- w00dst0ck - 2008-08-19

Thanks for the reply!

spiff Wrote: if that's a c&p

I added the missing " but it makes no difference. I also tried without the trim option, but then no information is returned at all.

spiff Wrote: umlaut might have to do with encoding - make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding)

How do I set the encoding of the scraper? The moviemaze.de page is encoded as iso-8859-1 and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>

- spiff - 2008-08-19

You set the encoding of the scraper XML file using exactly the kind of header you just pasted.

- Gaarv - 2008-08-19

w00dst0ck Wrote: Added the missing " but there is no difference. Also tried without the trim option. But that results in no information provided.

It seems you're missing a ">" too, after "standard_justify". It is not required for the match, but it will be returned in \2, so that is a problem.

For the umlauts, as spiff already said, use "iso-8859-1". If you're using wildcard matches, be sure to include "&", "#", ";" and digits, because the umlauts appear as numeric character references in the source:

Heerf & # 2 5 2 ; hrer Ma & # 2 2 3 ; arbeit
Heerführer Maßarbeit

I had to space out the code to illustrate; the spaces are of course not part of the actual source.

- w00dst0ck - 2008-08-19

Gaarv Wrote: Seems you're missing a ">" too after "standard_justify". Its not required to match, but it will be returned in \2 so thats a problem.

Thanks, that solves the problem. The umlaut problem is solved by using (.*)

Next I will try to solve my href problem...

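The encoding advice above can be illustrated outside the scraper engine. The following is a minimal Python sketch, not scraper XML: it assumes the page bytes are iso-8859-1 (as stated in the thread) and that umlauts may also arrive as numeric character references such as &#252;. The helper name decode_page and the sample bytes are made up for illustration.

```python
import html


def decode_page(raw_bytes, encoding="iso-8859-1"):
    """Decode page bytes, then turn references such as &#252; into characters."""
    return html.unescape(raw_bytes.decode(encoding))


# Hypothetical sample: two words written with numeric references (as in
# Gaarv's post) plus one name carrying a raw iso-8859-1 byte for "ö".
sample = b"Heerf&#252;hrer Ma&#223;arbeit, Matthias Schweigh\xf6fer"
print(decode_page(sample))
```

Once the text is decoded this way, a character class no longer has to list "&", "#", ";" and digits to get past an umlaut, which is the same effect w00dst0ck reached by matching with (.*).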
- Gaarv - 2008-08-19

I did this quickly; you will have to adapt it or limit where it applies on the page, but it gives the idea:
Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))

In the example you gave, this matches all actor names and roles, sometimes with a </a> that needs to be cleaned up.

- w00dst0ck - 2008-08-20

Hi there, and thanks for your help. I have only one problem left with the actors part.

[HTML]
<tr>
<td valign=top class="standard">
<span class="fett">Darsteller:</span>
</td>
<td valign=top class="standard_justify" width=100%>
Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz
</td>
</tr>
[/HTML]

Code:
<!--Actors-->

The 10 tabs in front of the first actor name end up in \2. I've tried to clear them out with [\t]{10}? but I think the XBMC scraper engine doesn't understand \t. Any idea how I can solve this?

- spiff - 2008-08-20

\\t

- w00dst0ck - 2008-08-20

Found another solution. I will submit the working scraper at http://trac.xbmc.org/ticket/4563

- w00dst0ck - 2008-09-08

Need some help again! I don't know what's wrong. The regex works in a regex tester, so it must be the code.

Code:
<!--URL to Trailer-->

Is there a way to implement more than one trailer? The site delivers many trailers in different sizes, and it would be great if it were possible to select the trailer in XBMC.

- Gamester17 - 2008-09-08

w00dst0ck Wrote: Is there a way to implement more than one trailer? The site delivers many trailers in different sizes and it would be great if it's possible to select the trailer in XBMC.

I am guessing that will count as a feature request; please submit a new ticket on trac: http://trac.xbmc.org

- w00dst0ck - 2008-09-09

Sometimes the xbmc.log helps to solve a problem. I submitted the working version with trailer support to trac and also submitted a feature request.

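For the actor rows quoted in the thread, the two ideas raised (make the link optional, and get rid of the leading tabs) can be tried out in a regex tester or a few lines of Python first. This is a rough sketch under assumptions: the pattern below is not the one from the submitted scraper, the sample string is a shortened mix of the two Darsteller cells, and the XBMC scraper engine may treat escapes such as \s or \t differently than Python's re module does.

```python
import re

# Shortened, hypothetical cell content modelled on the rows in the thread;
# the leading tabs mimic the indentation of the page source.
cell = (
    '\t\t\t<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
    'Maggie Gyllenhaal (Rachel Dawes), '
    '<a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), '
    'Wolfgang Stumph'
)

ACTOR = re.compile(
    r"\s*(?:<a [^>]*>)?"    # leading whitespace and an optional opening link tag
    r"([^<,(]+?)"           # actor name
    r"(?:</a>)?"            # optional closing link tag
    r"\s*(?:\(([^)]*)\))?"  # optional role in parentheses
    r"\s*(?:,|$)"           # an entry ends at a comma or at the end of the cell
)


def parse_actors(text):
    """Return (name, role) pairs; role is '' when the page lists none."""
    return [
        (m.group(1).strip(), (m.group(2) or "").strip())
        for m in ACTOR.finditer(text)
    ]


for name, role in parse_actors(cell):
    print(name, "|", role)
```

Consuming the whitespace before the capture group, instead of counting exactly ten tabs, keeps the name clean even if the page's indentation changes.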
- w00dst0ck - 2008-10-21

I have the bulk of the fanart part done. To get the missing IMDb number I've created a Google wrapper.

Code:
<!--URL to Google and Fanart-->

The generated URL is:
Code:
http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight

And it results in:
Code:
INFO: Get URL: http://www.google.com/search?q=site:imdb.com+moviemaze+2008+The Dark Knight

I've discovered that the spaces in "The Dark Knight" have to be replaced with "+", but I don't know how to replace that character with a regex. Any ideas?

- spiff - 2008-10-21

Something along the lines of:
<RegExp input="$$1" output="\1+\2" dest="4">
<expression repeat="yes" noclean="1,2">(.*?) (.*)</expression>
</RegExp>
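To check what the finished query URL should look like, the space-to-plus substitution can be reproduced in Python. This is only a sketch of the intended result, not of how the scraper engine evaluates repeat="yes"; the helper name spaces_to_plus is made up, and the URL prefix is the one from the log line above.

```python
import re
from urllib.parse import quote_plus


def spaces_to_plus(text):
    """Replace each run of spaces with a single '+', ignoring outer spaces."""
    return re.sub(r" +", "+", text.strip())


title = "The Dark Knight"
url = "http://www.google.com/search?q=site:imdb.com+moviemaze+2008+" + spaces_to_plus(title)
print(url)

# The standard library reaches the same result for a plain title
# and also escapes other characters that are unsafe in a query string.
print(quote_plus(title))
```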