Quick Scraper Question (Hope so:))

Schenk2302 Offline
Senior Member
Posts: 103
Joined: Feb 2009
Reputation: 4
Post: #61
spiff Wrote:well, you should know what that does - it is just a regular expression.

that being said; my bad. you want
Code:
<RegExp input="$$6" output="\1%20" dest="7">
  <expression repeat="yes">([^ ]+)</expression>
</RegExp>


Hey spiff,

got it working with your help. One small problem remains: the result now ends with a trailing %20:

Der%20letzte%20Zug%20

How do I get rid of the last %20? Another regexp?

Thanks

Schenk
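
The trailing %20 appears because the output template \1%20 is emitted once per match, including the last one. Below is a minimal Python sketch of that behaviour and of a second pass that removes the final separator. It runs outside XBMC and the function names are made up for illustration; it only mirrors the regexp idea, not the scraper engine itself.

```python
import re

def join_like_scraper(title: str) -> str:
    # Mimics the quoted RegExp: each run of non-space characters is
    # emitted followed by %20, so the last word gets one as well.
    return "".join(word + "%20" for word in re.findall(r"[^ ]+", title))

def strip_trailing_separator(encoded: str) -> str:
    # Second pass: drop a single %20 at the very end, if present.
    return re.sub(r"%20$", "", encoded)

encoded = join_like_scraper("Der letzte Zug")
print(encoded)                            # Der%20letzte%20Zug%20
print(strip_trailing_separator(encoded))  # Der%20letzte%20Zug
```

In the scraper this would presumably be one more RegExp pass over buffer 7 that captures everything before a final %20, which is untested here.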
Schenk2302
Post: #62
Okay, now I'm trying to do some cosmetics:

Sometimes the plot on the site I parse is written with literal umlauts (ä, ü, ö), like:

Anfänglich hält ...

Sometimes it is written with HTML entities, like:

pl&ouml;tzlich

I tried all the encoding and noclean settings, but I can't get the second form to display as "plötzlich"; it still shows up as "pl&ouml;tzlich".

Any hints?

Thanks in advance

Schenk
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #63
run a replacement regexp. or give me a list of tags that isn't cleaned/replaced properly and i'll add them to the list.
Schenk2302
Post: #64
spiff Wrote:run a replacement regexp. or give me a list of tags that isn't cleaned/replaced properly and i'll add them to the list.

i think there's only:

# ä -> &auml;
# Ä -> &Auml;
# ö -> &ouml;
# Ö -> &Ouml;
# ü -> &uuml;
# Ü -> &Uuml;
# ß -> &szlig;

thanks
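
A "replacement regexp" for exactly these seven entities can be sketched in Python as a lookup table plus one alternation. This is a standalone illustration with hypothetical names, not the cleaning code XBMC uses.

```python
import re

# The seven entities from the list above, mapped to the characters
# they stand for.
GERMAN_ENTITIES = {
    "&auml;": "ä", "&Auml;": "Ä",
    "&ouml;": "ö", "&Ouml;": "Ö",
    "&uuml;": "ü", "&Uuml;": "Ü",
    "&szlig;": "ß",
}

# One alternation built from the table, so the text is scanned once.
ENTITY_RE = re.compile("|".join(re.escape(e) for e in GERMAN_ENTITIES))

def replace_german_entities(text: str) -> str:
    # Entities that are not in the table (for example &amp;) are left alone.
    return ENTITY_RE.sub(lambda m: GERMAN_ENTITIES[m.group(0)], text)

print(replace_german_entities("pl&ouml;tzlich"))  # plötzlich
```

The match is case sensitive on purpose, so &Auml; and &auml; map to different characters.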
Schenk2302
Post: #65
I tried everything but can't get this to work:

Here's the regexp:
Code:
KURZINHALT&lt;/h2&gt;&lt;/li&gt;[^&gt;]*&gt;+(.*)&lt;/li&gt;

Here's e.g. some text:
KURZINHALT</h2></li>
<li class="c1">Kult comes back! Starsky &amp; Hutch sind wieder da!<br />
Jetzt erfahren die Buddies ihr Leinwandcomeback: Schrill, laut und jederzeit locker... Die amerikanischen Ausnahme-Comedians BEN STILLER („Verr&uuml;ckt nach Mary“, „Meine Braut, ihr Vater und ich“) und OWEN WILSON („Die Royal Tenenbaums“, „Shanghai Knights“) schl&uuml;pfen in die l&auml;ssigen Outfits der Undercover-Agenten und heften sich an die Fersen des zwielichtigen Gesch&auml;ftsmannes Reese Feldmann (VINCE VAUGHN) und dessen Freundin Kitty (JULIETTE LEWIS). Mit Hilfe ihres gerissenen Informanten Huggy Bear (SNOOP DOGG) und den entz&uuml;ckenden Cheerleadern Staci (CARMEN ELECTRA), Holly (AMY SMART) und Heather (BRANDIE RODERICK) wollen der angeknackste Starsky (Ben Stiller) und Womanizer Hutch (Owen Wilson) der Gerechtigkeit gen&uuml;getun...</li>

The first &amp; is displayed as &, but after that all the umlaut entities are displayed verbatim, exactly as they appear in the text (e.g. &uuml; instead of ü).

Thanks in advance

Schenk
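
To check the expression itself outside XBMC, here is a small Python sketch that runs it against a shortened copy of the sample and then decodes the entities with the standard library. The shortened sample, the function name, and the use of re.DOTALL are assumptions made for this sketch; how XBMC's regexp engine treats line breaks is not covered by it.

```python
import html
import re

# Shortened copy of the sample markup from the post.
PAGE = (
    'KURZINHALT</h2></li>\n'
    '<li class="c1">Kult comes back! Starsky &amp; Hutch sind wieder da!<br />\n'
    'BEN STILLER („Verr&uuml;ckt nach Mary“) ...</li>'
)

# The posted expression with its XML escaping undone (&lt; -> <, &gt; -> >).
# re.DOTALL lets (.*) run across the line break after <br />.
PLOT_RE = re.compile(r"KURZINHALT</h2></li>[^>]*>+(.*)</li>", re.DOTALL)

def extract_plot(page: str) -> str:
    match = PLOT_RE.search(page)
    if match is None:
        return ""
    # html.unescape decodes &amp; and &uuml; in the same way.
    return html.unescape(match.group(1))

print(extract_plot(PAGE))
```

In this sketch the expression captures the plot and both kinds of entity decode, which suggests the regexp is not what keeps the umlauts from being converted.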
Schenk2302
Post: #66
Maybe my question above was stupid again, but why is the &amp; displayed correctly as & while the umlauts are not?
spiff
Post: #67
because it is within html tags. i have found the issue but had no time to test
Schenk2302
Post: #68
spiff Wrote:because it is within html tags. i have found the issue but had no time to test


Okay, thanks for taking care !!!
Schenk2302
Post: #69
hey spiff,

I have found some small cosmetic issues, and I think it's not just my scraper, since I tried ofdb too:

In xbmc, in the search results window, umlauts, "." and ":" are shown fine, but I'm not able to display &, no matter which setting I try. Examples are Starsky & Hutch and Fast & Furious; they're shown without the & (Starsky Hutch). This happens in the title after parsing, too.

Another question from me, sorry:

In the output= attribute, am I allowed to put two different url functions on one line, one after another, because they use the same regexp?


Thanks

Schenk
spiff
Post: #70
& is not an allowed char in xml, nor in html.

if the pages hold literal &'s they are not html compatible unless they are in CDATA or verbatim fields.
if they ARE in those fields, the scraper needs to handle the & -> &amp; conversion.

and yes on the question. you can add thousands of fields to the xml at the same time if you so see fit. to the parser it's all just text.
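
The point about the bare & can be reproduced with any XML parser. A minimal Python sketch using only the standard library (the helper name is made up for illustration, and this says nothing about how XBMC's own parser reacts):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(xml_text: str) -> bool:
    # True if the text parses as XML, False on a parse error.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

title = "Starsky & Hutch"
raw = f"<title>{title}</title>"           # bare & inside the element
safe = f"<title>{escape(title)}</title>"  # & written as &amp;

print(is_well_formed(raw))       # False
print(is_well_formed(safe))      # True
print(ET.fromstring(safe).text)  # Starsky & Hutch
```

Once the & is written as &amp;, the parser hands back the original title with the & intact.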
Schenk2302
Post: #71
Code:
<GetPosterLinkURL dest="5">
    <RegExp input="$$2" output="&lt;details&gt;\1&lt;/details&gt;" dest="5+">
        <RegExp input="$$1" output="&lt;url function=&quot;GetPosterURL&quot;&gt;http://www.moviemaze.de/filme/\1/\2&lt;/url&gt;" dest="2+">
            <expression>&lt;a href=&quot;/filme/([0-9]+)/([^&quot;]*)&quot;</expression>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</GetPosterLinkURL>

<GetPosterURL dest="5">
    <RegExp input="$$1" output="&lt;details&gt;&lt;url function=&quot;GetPoster&quot;&gt;http://www.moviemaze.de/media/poster/\1/\2&lt;/url&gt;&lt;/details&gt;" dest="5+">
        <expression>&lt;a href=&quot;/media/poster/([0-9]+)/([^&quot;]*)&quot;</expression>
    </RegExp>
</GetPosterURL>

<GetPoster clearbuffers="no" dest="5">
    <RegExp input="$$1" output=";&lt;thumb&gt;http://www.moviemaze.de/filme/\1/poster_lg\2.jpg&lt;/thumb&gt;;" dest="10+">
        <expression repeat="yes">/([0-9]+)/poster([0-9]+)</expression>
    </RegExp>
</GetPoster>

<GetThumbnailLink clearbuffers="no" dest="6">
    <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="6">
        <RegExp input="$$1" output="&lt;url function=&quot;GetThumbnail&quot;&gt;http://www.cinefacts.de/kino/film/\1&lt;/url&gt;" dest="7">
            <expression repeat="yes" noclean="1">&lt;a href=&quot;/kino/film/([^&quot;]+)&quot;&gt;[^&lt;]*&lt;img</expression>
        </RegExp>
        <RegExp input="" output="&lt;url function=&quot;CollectThumbnails&quot; cache=&quot;film.xml&quot;&gt;http://www.cinefacts.de/kino/datenbank.html&lt;/url&gt;" dest="7+">
            <expression/>
        </RegExp>
        <expression noclean="1"/>
    </RegExp>
</GetThumbnailLink>

<GetThumbnail clearbuffers="no" dest="5">
    <RegExp input="$$1" output="&lt;thumb&gt;http://www.cinefacts.de/kino/plakat/\1&lt;/thumb&gt;" dest="8+">
        <expression>href=&quot;/kino/plakat/([^&quot;]*)&quot;</expression>
    </RegExp>
    <RegExp input="" output="&lt;details&gt;&lt;/details&gt;" dest="5">
        <expression noclean="1"/>
    </RegExp>
</GetThumbnail>

<CollectThumbnails dest="2">
    <RegExp input="$$10$$8" output="&lt;details&gt;&lt;thumbs&gt;\1&lt;/thumbs&gt;&lt;/details&gt;" dest="2+">
        <expression noclean="1"/>
    </RegExp>
</CollectThumbnails>

Hey spiff,

This almost works, but I have one problem. Let me explain: if there are covers on both cinefacts and moviemaze, all covers are shown. If there is only a cover on cinefacts, it is shown. But when there's only a cover on moviemaze, none is shown. I think it has something to do with the $$10$$8, but I don't know exactly, and I would really appreciate it if you could help here.

Thanks so much

Schenk
Schenk2302
Post: #72
Could anyone help with the above?

thanks in advance

Schenk
spiff
Post: #73
why are you concatenating to buffer 2 in the collect function?

i really do not have the time to help now, you do know that new builds log all the scraper output?
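
The question refers to dest="2+" in CollectThumbnails, where the + means the output is concatenated to whatever buffer 2 already holds. Here is a toy Python model of that difference. It is only a sketch: a dict stands in for the numbered buffers, the write helper is hypothetical, and whether buffer 2 really contains leftovers at that point in Schenk's scraper is not established in the thread.

```python
def write(buffers: dict, dest: str, text: str) -> None:
    # Toy model of the dest attribute: "2+" appends, "2" overwrites.
    if dest.endswith("+"):
        slot = int(dest[:-1])
        buffers[slot] = buffers.get(slot, "") + text
    else:
        buffers[int(dest)] = text

DETAILS = "<details><thumbs>...</thumbs></details>"

# Assumption for the demo: buffer 2 still holds text from an earlier step.
appended = {2: "leftover from an earlier step"}
write(appended, "2+", DETAILS)

overwritten = {2: "leftover from an earlier step"}
write(overwritten, "2", DETAILS)

print(appended[2])     # leftover text followed by the details block
print(overwritten[2])  # only the details block
```

If the function is meant to return a single details block, overwriting keeps the result independent of what earlier steps left behind.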
Schenk2302
Post: #74
Thanks for the answer spiff,

no i didn't know but upgraded now. it shows me the output when both moviemaze and cinefacts have covers. it shows nothing when only moviemaze got one.

for the collect section, that's what you gave me !
spiff
Post: #75
maybe, but i never said that the code i give you is exactly correct. i only give pointers to illustrate the concepts.