How to get runtime out of HTML for scraper?
#1
Hello,

I'm trying to develop my first scraper, for German DVDs, and have run into a small problem:
I can't get the runtime parsed :=(

The HTML source looks something like this:
Code:
[...]
<a class="tn15more inline" href="/title/tt0195234/releaseinfo#akas" onClick="(new Image()).src='/rg/title-tease/akas/images/b.gif?link=/title/tt0195234/releaseinfo#akas';">Mehr ansehen</a>&nbsp;&raquo;

</div>
</div>

<div class="info">
<h5>L&#xE4;nge:</h5>
<div class="info-content">
93 Min
</div>
</div>

<div class="info">
<h5>Land:</h5>
<div class="info-content">
UK

</div>
[...]
What I want to get is the runtime (in German "Länge"), which here is 93 Min.
The corresponding scraper statement is:
Code:
            <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="5+">
                <expression trim="1">&lt;h5&gt;L&amp;#xE4;nge:&lt;/h5&gt;\n&lt;div class=&quot;info-content&quot;&gt;\n[0-9]* Min</expression>
            </RegExp>
As a result I only get <runtime></runtime>. Of course I would like to get something like <runtime>93 Min</runtime>. Where can I define what my result looks like, or better: which parts of my regex are used as the result?

Any help? Regards,

Eisbahn
#2
You're not selecting anything...

You probably mean something à la
Code:
<RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="5+">
    <expression trim="1">&lt;h5&gt;L&amp;#xE4;nge:&lt;/h5&gt;\n&lt;div class=&quot;info-content&quot;&gt;\n([0-9]*) Min</expression>
</RegExp>
Note the added ()'s. That's a selection (a capture group), and it is what \1 in the output refers to.
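If you want to see the difference outside the scraper, here is a minimal sketch in Python. It assumes Python's `re` module treats groups the same way the scraper's regex engine does for this simple pattern, and the variable names are made up for illustration. The sample HTML is the snippet from the first post.

```python
import re

# Sample HTML from the first post, trimmed to the runtime block.
html = (
    '<div class="info">\n'
    '<h5>L&#xE4;nge:</h5>\n'
    '<div class="info-content">\n'
    '93 Min\n'
    '</div>\n'
    '</div>\n'
)

# The part of the pattern both versions share (unescaped from the XML).
prefix = r'<h5>L&#xE4;nge:</h5>\n<div class="info-content">\n'

# Original expression: it matches, but selects nothing.
without_group = re.compile(prefix + r'[0-9]* Min')
# Fixed expression: the parentheses form group 1, i.e. \1.
with_group = re.compile(prefix + r'([0-9]*) Min')

m1 = without_group.search(html)
m2 = with_group.search(html)

# No groups at all in the first match, so there is nothing for \1 to insert.
groups_without = m1.groups()

# Group 1 of the second match holds only the digits.
runtime = m2.group(1)
result = "<runtime>" + runtime + "</runtime>"
print(result)
```

With the parentheses around just the digits, the selection is the number only. To get "93 Min" as asked in the first post, the group would have to include the unit as well, e.g. `([0-9]* Min)`.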
#3
THANKS A LOT!
A perfect and quick answer. That's exactly what I needed.

Eisbahn

P.S. Maybe you'll get a first implementation of the German IMDb scraper this weekend. Unfortunately the sun is out, so be patient :=)
