2013-10-26, 20:06
Hello,
I'm learning to write my own scraper. I've been struggling with this for a while now, and reading through other threads didn't yield a solution.
Might I request a little help with this? I must be missing something trivial or obvious...
1) Maybe there's a problem with the "name" attribute? The value is not the same as in the corresponding "addon.xml".
2) As I understand it, if the "CreateSearchUrl" function is correct, the constructed URL should appear in the debug log?
The problem is the following:
Code:
19:41:30 T:139925521422080 DEBUG: VideoInfoScanner: No NFO file found. Using title search for '/share/HDD1_Filmy/CSFD_scraper_test/atv_cerna_labut_1024x576.m4v'
19:41:30 T:139925521422080 DEBUG: FindMovie: Searching for 'atv cerna labut 1024x576' using CSFD movies SYN1 scraper (path: '/root/.xbmc/addons/metadata.csfd.cz', content: 'movies', version: '1.0.0')
19:41:30 T:139925521422080 ERROR: Run: Unable to parse web site
And this is the scraper I got so far:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scraper language="cs" thumb="icon.png" date="2013-10-26" content="movies" framework="1.1" name="SYN1_CSFD_scraper">
  <CreateSearchUrl dest="2">
    <RegExp dest="2" output="<url>http://www.csfd.cz/hledat/?q=\1</url>" input="$$1">
      <expression clear="yes">atv (.*) [0-9]+[xX][0-9]+</expression>
    </RegExp>
  </CreateSearchUrl>
  <GetSearchResults dest="3">
    <RegExp dest="3" output="<?xml version="1.0" encoding="UTF-8" standalone="yes"?><results>\1</results>" input="$$2">
      <RegExp dest="2" output="<entity><title>\2 (\3)</title><url>www.csfd.cz/film/\1/</url></entity>" input="$$1">
        <expression repeat="yes"><a href="/film/([^/]*)/"[^>]*>([^<]+)</a>(?:[\s]*<span[^>]*>[^<]*</span>[\s]*)?</h3>[\s]*<p>[^0-9]*([0-9]{4,4})</p></expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSearchResults>
  <GetDetails dest="2">
    <RegExp dest="2" output="<details>\1</details>" input="$$3">
      <RegExp dest="3+" output="<title>\1</title><sorttitle>\1</sorttitle>" input="$$1">
        <expression><div class="info">(?:[\s]*<[^>]+>[\s]*)+?<h1>[\s]*([^<]+?)[\s]*</h1></expression>
      </RegExp>
      <RegExp dest="3+" output="<originaltitle>\1</originaltitle>" input="$$1">
        <expression><ul class="names">[\s]*<li>[\s]*<[^>]+>[\s]*<h3>([^<]+)</h3></expression>
      </RegExp>
      <RegExp dest="3+" output="<year>\1</year>" input="$$1">
        <expression><p class="origin">[^<]*?, ([0-9]{4,4})</expression>
      </RegExp>
      <RegExp dest="3+" output="<director>\1</director>" input="$$1">
        <expression><h4>Režie:(?:[\s]*<[^>]+>[\s]*)+?<a href="[^"]+">([^<]+)</a></expression>
      </RegExp>
      <RegExp dest="3+" output="<top250>\1</top250>" input="$$1">
        <expression><p class="charts">[\s]+<a href="[^"]+">([0-9]+)\. nejlepší film</a></expression>
      </RegExp>
      <RegExp dest="3+" output="<rating>\1</rating>" input="$$1">
        <expression><div id="rating">[\s]*<h2 class="average">([0-9]{2,2}%)</h2></expression>
      </RegExp>
      <RegExp dest="3+" output="<plot>\1</plot>" input="$$1">
        <expression noclean="1"><div id="plots"[^>]*>(?:[\s]*<[^>]+>[\s]*)+?<h3>[\s]*Obsah[\s]*</h3>(?:[\s]*<[^>]+>[\s]*)+?<img[^>]*>[\s]*(.*?)<span</expression>
      </RegExp>
      <RegExp dest="3+" output="<runtime>\1</runtime>" input="$$1">
        <expression><p class="origin">[^<]*?, ([0-9]{2,3}) min</expression>
      </RegExp>
      <RegExp dest="3+" output="<thumb><url spoof="http://www.csfd.cz">img.csfd.cz/files/images/film/posters/\1/\2/\3.\4</url></thumb>" input="$$1">
        <expression repeat="yes"><div class="image" style="background-image: url\('\\/\\/img\\.csfd\\.cz\\/files\\/images\\/film\\/posters\\/([^\\]+)\\/([^\\]+)\\/([^\\]+)\\\.([a-zA-Z]+)</expression>
      </RegExp>
      <RegExp dest="3+" output="<id>\1</id>" input="$$1">
        <expression><link rel="canonical" href="http://www.csfd.cz/film/([^\-]+)-[^"]+"</expression>
      </RegExp>
      <RegExp dest="3+" output="<genre>\1</genre>" input="$$1">
        <expression><p class="genre">([^/\s]+)</expression>
      </RegExp>
      <RegExp dest="3+" output="<actor><name>\1</name></actor>" input="$$1">
        <expression><h4>Hrají:</h4>[\s]*<span[^>]*>[\s]*<a href="[^"]+">([^<]+)</a></expression>
      </RegExp>
      <expression></expression>
    </RegExp>
  </GetDetails>
</scraper>
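To narrow down question 2 without running a full library scan, the CreateSearchUrl expression can be checked on its own against the search string from the log. The following is only a rough Python sketch. It assumes Python's `re` module behaves like XBMC's regex engine for this simple pattern, and `build_search_url` is a hypothetical helper name, not part of XBMC. The second input is a guess I have not confirmed from the log: that the title might be URL-encoded before it reaches `$$1`.

```python
import re
from urllib.parse import quote

# Expression copied from the CreateSearchUrl block of the scraper.
SEARCH_EXPRESSION = r"atv (.*) [0-9]+[xX][0-9]+"
# Output template of that RegExp; {} stands in for the \1 capture group.
URL_TEMPLATE = "http://www.csfd.cz/hledat/?q={}"


def build_search_url(search_string):
    """Return the URL the expression would build, or None if it does not match."""
    match = re.search(SEARCH_EXPRESSION, search_string)
    if match is None:
        return None
    return URL_TEMPLATE.format(match.group(1))


# Search string exactly as the debug log prints it.
logged = "atv cerna labut 1024x576"
# ASSUMPTION (not confirmed by the log): the string may arrive URL-encoded,
# in which case the spaces become %20.
encoded = quote(logged)

print(repr(logged), "->", build_search_url(logged))
print(repr(encoded), "->", build_search_url(encoded))
```

If the first input produces a URL and the second produces `None`, the literal spaces in the expression would be worth a closer look. If both produce a URL, the expression itself is probably not what triggers "Unable to parse web site".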