Help with a complicated RegEx for "Altersfreigabe"/Certification on imdb.de
#1
Hello,

I'm having serious trouble extracting the "Altersfreigabe" (certification) field for my scraper. On the US IMDb every certification is wrapped in a link (href), so the scraper can use the repeat function. The German site only delivers plain text:
Code:
<div class="info">
<h5>Altersfreigabe:</h5>
<div class="info-content">
USA:PG-13 <i>(certificate #45663)</i> | S&#xFC;dkorea:12  | UK:12A  | Norwegen:11  | Irland:12A  | Schweden:11  | Singapur:PG  | D&#xE4;nemark:11  | Brasilien:12  | Finnland:K-13  | Schweiz:12 <i>(canton of Vaud)</i> | Schweiz:12 <i>(canton of Geneva)</i> | Niederlande:12  | Philippinen:PG-13 <i>(MTRCB)</i> | Australien:M  | Portugal:M/6 <i>(Qualidade)</i> | Kanada:14A <i>(British Columbia/Manitoba)</i> | Kanada:G <i>(Quebec)</i> | Kanada:PG <i>(Alberta/Ontario)</i> | Deutschland:12  | Neuseeland:M  | Island:10  | Hong Kong:IIA  | Taiwan:GP <i>(original rating)</i> | Argentinien:13  | Peru:14  | Japan:G  | Mexiko:B  | Taiwan:GP

</div>
</div>
Finding the correct <div>, from USA:PG-13 up to Taiwan:GP with all the certifications inside, is no problem:
Code:
            <RegExp input="$$1" output="&lt;certification&gt;\2 \4&lt;/certification&gt;" dest="5+">
                <expression repeat="yes">&lt;h5&gt;Altersfreigabe:&lt;/h5&gt;\n&lt;div class=&quot;info-content&quot;&gt;\n([^\n]*)?</expression>
            </RegExp>

But how do I separate the individual entries so that they read nicely?

Next problem:
Code:
<div class="info">
<h5>Altersfreigabe:</h5>
<div class="info-content">
Deutschland:0  <i>(free)</i>

</div>
</div>
Code:
<div class="info">
<h5>Altersfreigabe:</h5>
<div class="info-content">
Deutschland:12  | USA:PG-13 <i>(certificate #45663)</i>

</div>
</div>
Both of these variants can be delivered as well: the database may return only one country, and any entry may or may not carry the italic info.

So I have two problems:
- How do I split the countries into a nicely readable form?
- How do I match the different variants the website delivers?
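
For what it's worth, both questions can also be approached by splitting the line on "|" first and then matching each entry on its own. Below is a minimal Python sketch of that idea, assuming the content of the info-content <div> has already been extracted as a single line. The names ENTRY and parse_certifications are hypothetical and not part of the scraper.

```python
import html
import re

# Hypothetical pattern for ONE entry: Country:Rating plus an optional
# <i>(note)</i>. It is an illustration, not the scraper's own expression.
ENTRY = re.compile(r"^\s*([^:|<]+):([^<|]+?)\s*(?:<i>\(([^<]*)\)</i>)?\s*$")

def parse_certifications(line):
    """Return (country, rating, note) tuples from a '|'-separated line."""
    results = []
    # html.unescape turns entities such as S&#xFC;dkorea into Südkorea.
    for chunk in html.unescape(line).split("|"):
        match = ENTRY.match(chunk)
        if match:
            country, rating, note = match.groups()
            results.append((country.strip(), rating.strip(), note))
    return results

sample = "Deutschland:12  | USA:PG-13 <i>(certificate #45663)</i>"
for country, rating, note in parse_certifications(sample):
    print(country, rating, note or "")
```

Splitting first means a single entry and a long list go through the same code path, and the italic info stays optional per entry.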

Regards,

Eisbahn
#2
Code:
(([^/<>|"(\n]+:[^<"\( #\n|:=.]+)[ \n]+(<i>([^<]*)</i>)?)[ \n]
is the lucky regex. I'm beginning to love them, but now it's time for a barbecue ;=)
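
To see what the expression captures, here is a rough Python sketch that runs it over text shaped like the page source above. The function name certifications is made up for illustration. Group 2 holds Country:Rating and group 4 the optional italic note, which appears to be what \2 \4 in the scraper's output attribute refers to. Note that an entry without italic info only matches when at least two whitespace characters follow the rating, which the page provides through the double spaces and the trailing blank line.

```python
import re

# The expression from this post, copied unchanged.
PATTERN = re.compile(r'(([^/<>|"(\n]+:[^<"\( #\n|:=.]+)[ \n]+(<i>([^<]*)</i>)?)[ \n]')

def certifications(block):
    """Yield one readable string per entry, built from groups 2 and 4."""
    for match in PATTERN.finditer(block):
        # Group 2 can carry the space that follows a '|', so strip it.
        entry = match.group(2).strip()
        note = match.group(4)  # None when the entry has no <i>...</i> part
        yield f"{entry} {note}" if note else entry

# Sample shaped like the page source; the trailing blank line matters.
sample = "Deutschland:12  | USA:PG-13 <i>(certificate #45663)</i>\n\n"
print(list(certifications(sample)))
```

HTML entities such as &#xFC; are left untouched by this sketch, so they would still need decoding before display.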

Regards,

Eisbahn