Filmweb scraper

smuto Offline
Senior Member
Posts: 242
Joined: Sep 2004
Reputation: 2
Post: #16
You know, I don't know why, but a regexp on the text link has never worked for me, and I don't have time for testing.

Filmweb only masks the ID number.
Your link
http://frantic.filmweb.pl/

is the same as this one with the ID
http://www.filmweb.pl/Film?id=1107

or this one
http://www.filmweb.pl/Film,id=1107

Sorry, but I don't plan to update the scraper. I hope nfo files will start to include the ID.

smuto

smuto Offline
Post: #17
i hope i can use my native language in this topic

smuto Offline
Post: #18
help!!

quite simple. output xml in the format

<actor>
<thumb>...</thumb>
<name>something</name>
<role>somethingelse</role>
</actor>

But I don't have the thumb URL in the cast list, so I tried a "url function".

First attempt, without luck, but I like this idea (maybe this should work in the library via "Set Actor Thumb"):

<actor>
<thumb><url function="ActorLink">...</url></thumb>
<name>something</name>
<role>somethingelse</role>
</actor>

Second attempt, also without luck:
<actor>
<name>something</name>
<role>somethingelse</role>
</actor>
<url function="ActorLink">somethinglink</url>

where function="ActorLink" outputs:
<actor>
<name>something</name>
<thumb>...</thumb>
</actor>

I don't know, but maybe I need a matching numerator:

actor$1 -> function="ActorLink$1"
actor$2 -> function="ActorLink$2"

my WIP
filmweb.xml

smuto

spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #19
Don't add the actors at that point.

1) Make sure none of the functions you call clear buffers.
2) Make sure not to destroy the buffer which holds the ID when it enters getdetails (# of HTMLs + 1).
4) Grab the URL and chain once per actor.
5) Use the ID to grab the role from the filmography list.

That should do, no?
smuto Offline
Post: #20
I made a lite version of the scraper for tests:
filmweb_only_actor_test.xml

from scrap.exe for "Goodbye Bafana"

details.xml

ActorLink.xml

Why does ActorLink.xml contain only one (the last) entry, when scrap visits all the URLs from details?

C-Quel Offline
Retired Team-XBMC Member
Posts: 1,375
Joined: Aug 2004
Reputation: 0
Post: #21
Well, it looks like you don't repeat the thumb expression anyway.

spiff Offline
Post: #22
scrap will only show you the last XML that was output.

In XBMC, the actors are pushed to a list for each returned XML.
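
A rough way to picture the difference, sketched in Python rather than in scraper XML. The fragment strings, the `<details>` wrapper and the `collect_actors` helper are made up for illustration and are not XBMC's actual code; the sketch only models "one returned XML per chained call, actors appended to one list".

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragments, one per chained ActorLink call.
returned_xmls = [
    "<details><actor><name>Actor A</name>"
    "<thumb>http://example.invalid/a.jpg</thumb></actor></details>",
    "<details><actor><name>Actor B</name>"
    "<thumb>http://example.invalid/b.jpg</thumb></actor></details>",
]

def collect_actors(xml_fragments):
    """Append the actors found in every fragment to a single list."""
    actors = []
    for fragment in xml_fragments:
        root = ET.fromstring(fragment)
        for actor in root.iter("actor"):
            actors.append({
                "name": actor.findtext("name"),
                "thumb": actor.findtext("thumb"),
            })
    return actors

# Every returned XML contributes to the list (the application's view).
all_actors = collect_actors(returned_xmls)
# A tool that shows only the last output sees a single actor.
last_only = collect_actors(returned_xmls[-1:])
print(len(all_actors), len(last_only))
```

So a single entry in the test tool's output does not by itself mean the other actors were lost.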
smuto Offline
Post: #23
thx a lot

filmweb.xml with actor's thumb - 100% working

But it becomes extremely slow: sometimes, to collect the thumb URLs, the scraper visits more than 20 pages.

So if someone wants to use it, just grab it from here.

one more question

I edited the TheTVDB.com scraper to match Polish strings first:
tvdb-pl.xml
I tried to set the encoding to ISO-8859-2 in the scraper, but without success.

The GUI charset in langinfo.xml:
<charsets>
<gui unicodefont="false">CP1250</gui>
<subtitle>CP1250</subtitle>
</charsets>

polish xbmc language strings are in "utf-8"
polish subtitle are mostly in CP1250

When I change the GUI charset to
<gui>ISO-8859-2</gui>
the tvdb-pl scraper works perfectly.

What is the GUI charset for?
smuto

spiff Offline
Post: #24
If the returned XML is not UTF-8, it is assumed to be in the GUI charset and is converted from that to UTF-8.
Is this the best behaviour? Not sure.
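
A minimal Python sketch of that fallback, to show why the GUI charset matters for a site that serves ISO-8859-2. The `to_utf8` helper and the sample sentence are assumptions for illustration, not XBMC code; only the rule "not UTF-8, so treat as GUI charset and convert" comes from the description above.

```python
def to_utf8(raw: bytes, gui_charset: str) -> bytes:
    """Return raw as UTF-8, assuming gui_charset when raw is not valid UTF-8."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # Fallback: interpret the bytes in the GUI charset, then re-encode.
        return raw.decode(gui_charset, errors="replace").encode("utf-8")

title = "Zażółć gęślą jaźń"           # sample Polish text (hypothetical data)
page = title.encode("iso-8859-2")      # bytes as an ISO-8859-2 site would send them

right = to_utf8(page, "iso-8859-2").decode("utf-8")
wrong = to_utf8(page, "cp1250").decode("utf-8")
print(right == title, wrong == title)
```

CP1250 and ISO-8859-2 agree on most Polish letters but not on all of them (ą, ś and ź sit at different byte values), so with the GUI charset set to CP1250 a few characters would come out wrong, which would be consistent with the scraper only working after the switch to ISO-8859-2.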

as for the scraper being slow - not much we can do about that as long as the site is organized as it is...
(This post was last modified: 2007-12-11 15:27 by spiff.)
smuto Offline
Post: #25
update - add scraper settings

one week for tests before we add this to SVN

filmweb.xml- with settings



I have a problem with the encoding of labels in the scraper file.
[Image: scraper_settings.jpg]
Is there a way to add the "Automatically grab actor thumbs" setting to the scraper settings window?

smuto

C-Quel Offline
Post: #26
Just add a setting to the XML: label="Auto Grab Actor Thumbs" id="autograb" type="bool" default="false"

Duplicate your ActorLink: have one copy with conditional="autograb" (with thumb)

and the other copy of ActorLink with conditional="!autograb", which does not output <thumb></thumb>.

EDIT: line 104, pos 239 change &nbsp to &amp;nbsp;

(This post was last modified: 2008-01-19 13:52 by C-Quel.)
spiff Offline
Post: #27
I don't think it fits as a scraper setting. You see, if you do it in the scraper, it means you won't return the URLs at all. The global setting is whether or not to actually grab the thumbs, not whether or not to grab the URLs. It's a small but important difference: if you disable it at scraper level, you cannot grab them manually either. Hence dual settings make sense to me.
smuto Offline
Post: #28
I'm trying to update the <NfoUrl>.
Can someone help me?

For now I only use the link with the ID:
http://www.filmweb.pl/Film?id=999999

I'm trying to add the link with the movie title to <NfoUrl>:
http://movie.title.filmweb.pl/

this is my wip
PHP Code:
    <NfoUrl dest="3">
        <RegExp input="$$1" output="http://www.filmweb.pl/Film?id=\1" dest="3">
            <expression noclean="1">Film.id=([0-9]*)</expression>
        </RegExp>
        <RegExp input="$$1" output="http://\1.filmweb.pl" dest="3+">
            <expression noclean="1">http://([^\/]+).filmweb.pl</expression>
        </RegExp>
    </NfoUrl>

But the movie-title regexp matches both URLs.
How can I force the scraper to use the ID if it's present?

spiff Offline
Post: #29
Easiest solution (I don't have time to analyze the regexps):

output XML, i.e. <url>theurl</url>

The first url block will take priority.
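
The idea can be sketched in Python, outside the scraper engine. The two patterns are modelled on the filmweb URL forms discussed in this thread (with the dots escaped and `+` instead of `*`, which is my change); `url_blocks` and `pick_url` are hypothetical helpers that only model "emit one url block per match, the first one wins", not how XBMC implements it.

```python
import re

# Rules in priority order: the ID form first, the title form second.
RULES = [
    (re.compile(r"Film.id=([0-9]+)"), "http://www.filmweb.pl/Film?id={}"),
    (re.compile(r"http://([^/]+)\.filmweb\.pl"), "http://{}.filmweb.pl"),
]

def url_blocks(nfo_text):
    """Emit one <url> block per matching rule, in rule order."""
    blocks = []
    for pattern, template in RULES:
        match = pattern.search(nfo_text)
        if match:
            blocks.append("<url>" + template.format(match.group(1)) + "</url>")
    return blocks

def pick_url(nfo_text):
    """Model 'the first url block takes priority'."""
    blocks = url_blocks(nfo_text)
    return blocks[0] if blocks else None

print(pick_url("http://www.filmweb.pl/Film?id=1107"))
print(pick_url("http://frantic.filmweb.pl/"))
```

For the ID link both rules match, but the ID block is emitted first and is the one used; for the title link only the second rule matches.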
smuto Offline
Post: #30
thx a lot - it's working

add as a patch to SVN
