Quick Scraper Question (Hope so:))
#61
spiff Wrote: well, you should know what that does - it is just a regular expression.

that being said; my bad. you want
Code:
<RegExp input="$$6" output="\1%20" dest="7">
  <expression repeat="yes">([^ ]+)</expression>
</RegExp>


Hey spiff,

got it working with your help. the one little problem i have now is the trailing %20:

Der%20letzte%20Zug%20

How do i get rid of the last %20? Another regexp?

Thanks

Schenk
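
One way to drop the trailing %20 could be a second RegExp that captures everything in front of a final %20. This is only a sketch: it assumes buffer 7 holds the encoded string from the regexp above, that buffer 8 is free to use, and that the scraper's regexp engine supports the `$` end anchor.

```xml
<!-- Sketch only: assumes buffer 7 holds "Der%20letzte%20Zug%20" and buffer 8 is unused. -->
<RegExp input="$$7" output="\1" dest="8">
  <!-- The greedy (.*) takes everything up to the last %20, which must sit at the end of the input. -->
  <expression>(.*)%20$</expression>
</RegExp>
```

With the example above, buffer 8 would then hold Der%20letzte%20Zug.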
#62
Okay, now i'm trying to do some cosmetics:

Sometimes the plot on the site i parse is written with real umlauts (ä, ü, ö), like:

Anfänglich hält ...

and sometimes it is written with entity tags, like:

pl&ouml;tzlich

I tried every encoding and noclean combination, but i can't get the second form to display as "plötzlich"; it still shows up as "pl&ouml;tzlich".

Any hints?

Thanks in advance

Schenk
#63
run a replacement regexp. or give me a list of the tags that aren't cleaned/replaced properly and i'll add them to the list.
#64
spiff Wrote: run a replacement regexp. or give me a list of the tags that aren't cleaned/replaced properly and i'll add them to the list.

i think there are only these:

# ä -> &auml;
# Ä -> &Auml;
# ö -> &ouml;
# Ö -> &Ouml;
# ü -> &uuml;
# Ü -> &Uuml;
# ß -> &szlig;

thanks
#65
I tried everything but can't get this to work:

Here's the regexp:
Code:
KURZINHALT&lt;/h2&gt;&lt;/li&gt;[^&gt;]*&gt;+(.*)&lt;/li&gt;

Here's some sample text:
KURZINHALT</h2></li>
<li class="c1">Kult comes back! Starsky &amp; Hutch sind wieder da!<br />
Jetzt erfahren die Buddies ihr Leinwandcomeback: Schrill, laut und jederzeit locker... Die amerikanischen Ausnahme-Comedians BEN STILLER („Verr&uuml;ckt nach Mary“, „Meine Braut, ihr Vater und ich“) und OWEN WILSON („Die Royal Tenenbaums“, „Shanghai Knights“) schl&uuml;pfen in die l&auml;ssigen Outfits der Undercover-Agenten und heften sich an die Fersen des zwielichtigen Gesch&auml;ftsmannes Reese Feldmann (VINCE VAUGHN) und dessen Freundin Kitty (JULIETTE LEWIS). Mit Hilfe ihres gerissenen Informanten Huggy Bear (SNOOP DOGG) und den entz&uuml;ckenden Cheerleadern Staci (CARMEN ELECTRA), Holly (AMY SMART) und Heather (BRANDIE RODERICK) wollen der angeknackste Starsky (Ben Stiller) und Womanizer Hutch (Owen Wilson) der Gerechtigkeit gen&uuml;getun...</li>

The first &amp; is displayed as &, but all the umlaut tags after it are displayed literally, exactly as they appear in the text.

Thanks in advance

Schenk
#66
maybe my question above was stupid again, but why is the &amp; displayed correctly as & while the umlauts are not?
#67
because it is within html tags. i have found the issue but had no time to test
#68
spiff Wrote: because it is within html tags. i have found the issue but had no time to test


Okay, thanks for taking care of it!
#69
hey spiff,

i have found a small cosmetic issue, and i think it's not just my scraper, since i tried ofdb too:

in xbmc, in the search results window, umlauts, "." and ":" are shown fine, but i'm not able to display "&", no matter which setting i try. Examples are Starsky & Hutch and Fast & Furious; they're shown without the & (Starsky Hutch). This happens in the title after parsing, too.

another question from me, sorry:

in the output= attribute, am i allowed to put two different url functions on one line, one after another, because they use the same regexp?


Thanks

Schenk
#70
& is not an allowed char in xml, nor in html.

if the pages hold literal &'s they are not html compatible unless they are in CDATA or verbatim fields.
if they ARE in those fields, the scraper needs to handle the & -> &amp; conversion.

and yes on the question. you can add thousands of fields to the xml at the same time if you so see fit. to the parser it's all just text.
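
A minimal sketch of what two url functions in one output= could look like, reusing the same capture for both. The function names GetFirst and GetSecond, the example.com URLs, and the choice of buffer 5 are hypothetical and only there for illustration; the expression is the one from the moviemaze link pattern, shortened to a single group.

```xml
<!-- Sketch only: GetFirst, GetSecond and the example.com URLs are hypothetical. -->
<RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetFirst&quot;&gt;http://www.example.com/a/\1&lt;/url&gt;&lt;url function=&quot;GetSecond&quot;&gt;http://www.example.com/b/\1&lt;/url&gt;&lt;/details&gt;" dest="5">
    <!-- Both url elements are built from the same capture group \1. -->
    <expression>&lt;a href=&quot;/filme/([0-9]+)/</expression>
</RegExp>
```

Since the output is just text to the parser, the two url elements simply sit next to each other inside the escaped details element.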
#71
Code:
<GetPosterLinkURL dest="5">
    <RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5+">
        <RegExp input="$$1" output="&lt;url function=&quot;GetPosterURL&quot;&gt;http://www.moviemaze.de/filme/\1/\2&lt;/url&gt;" dest="2+">
            <expression>&lt;a href=&quot;/filme/([0-9]+)/([^&quot;]*)&quot;</expression>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</GetPosterLinkURL>

<GetPosterURL dest="5">
    <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetPoster&quot;&gt;http://www.moviemaze.de/media/poster/\1/\2&lt;/url&gt;&lt;/details&gt;" dest="5+">
        <expression>&lt;a href=&quot;/media/poster/([0-9]+)/([^&quot;]*)&quot;</expression>
    </RegExp>
</GetPosterURL>

<GetPoster clearbuffers="no" dest="5">
    <RegExp input="$$1" output=";&lt;thumb&gt;http://www.moviemaze.de/filme/\1/poster_lg\2.jpg&lt;/thumb&gt;;" dest="10+">
        <expression repeat="yes">/([0-9]+)/poster([0-9]+)</expression>
    </RegExp>
</GetPoster>

<GetThumbnailLink clearbuffers="no" dest="6">
    <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
        <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
        </RegExp>
        <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&quot; cache=&quot;film.xml&quot;&gt;http://www.cinefacts.de/kino/datenbank.html&lt;/url&gt;" dest="7+">
            <expression/>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</GetThumbnailLink>

<GetThumbnail clearbuffers="no" dest="5">
    <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8+">
        <expression>href=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
    </RegExp>
    <RegExp input="" output="&lt;details&gt;&lt;/details&gt;" dest="5">
        <expression noclean="1"/>
    </RegExp>
</GetThumbnail>

<CollectThumbnails dest="2">
    <RegExp input="$$10$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="2+">
        <expression noclean="1"/>
    </RegExp>
</CollectThumbnails>

Hey spiff,

this almost works, but i have one problem. let me explain: if there are covers on both cinefacts and moviemaze, all covers are shown. If there is only a cover on cinefacts, that one is shown. But when there's only a cover on moviemaze, none is shown. I think it has something to do with the $$10$$8, but i don't know exactly, and i'd really appreciate it if you could help here.

Thanks so much

Schenk
#72
could anyone help with the above?

thanks in advance

Schenk
#73
why are you concatenating to buffer 2 in the collect function?

i really do not have the time to help now, you do know that new builds log all the scraper output?
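
For reference, the posted CollectThumbnails returns buffer 2 (dest="2") while its RegExp appends to that buffer (dest="2+"), and GetPosterLinkURL also appends to buffer 2. A sketch of a variant that overwrites buffer 2 instead of appending is shown here; nothing in this thread confirms that this change fixes the moviemaze-only case, so treat it as an illustration of the question rather than a tested fix.

```xml
<!-- Sketch only: same function as posted, but the RegExp writes to buffer 2 instead of appending. -->
<CollectThumbnails dest="2">
    <RegExp input="$$10$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="2">
        <!-- An empty expression passes the whole input (buffers 10 and 8) through as \1. -->
        <expression noclean="1"/>
    </RegExp>
</CollectThumbnails>
```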
#74
Thanks for the answer spiff,

no, i didn't know, but i have upgraded now. it shows me the output when both moviemaze and cinefacts have covers. it shows nothing when only moviemaze has one.

for the collect section, that's what you gave me!
#75
maybe, but i never said that the code i give you is exactly correct. i only give pointers to illustrate the concepts.