Allocine.fr (TV Shows) scraper

The_Dogg Offline
Fan
Posts: 385
Joined: Feb 2004
Reputation: 4
Location: Canada
Post: #1
I'm working on a TV Show scraper for allocine.fr.

I'm down to the episode list, but I have a little problem:

I use the scrap.exe tool to test it, and when the tool gets the links for the episode list, an "&" sign gets lost. Let me show you:

Code:
</status><premiered>
7 Ao¹t 2005</premiered><episodeguide><url>http://www.allocine.fr/series/episodes_gen_csaison=1511&cserie=513.html</url>
<url>http://www.allocine.fr/series/episodes_gen_csaison=2450&cserie=513.html</url></episodeguide></details>
Episodelist URL 1:http://www.allocine.fr/series/episodes_gen_csaison=1511cserie=513.html
Episodelist URL 2:http://www.allocine.fr/series/episodes_gen_csaison=2450cserie=513.html
GetEpisodeListInternal 2 returned :
GetEpisodeList returned :
Error: Unable to parse episodelist.xml

This is the output of the scrap.exe tool.

You can see that inside the <details> tag the URLs are OK:
Code:
<url>http://www.allocine.fr/series/episodes_gen_csaison=1511&cserie=513.html</url>
But where the tool prints "Episodelist URL", the & sign is missing from the link, so the website returns a nearly empty page.
Code:
Episodelist URL 1:http://www.allocine.fr/series/episodes_gen_csaison=1511cserie=513.html

And here is the code from scraper.xml:
Code:
<RegExp input="$$8" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;" dest="5+">    
                <RegExp input="$$2" output="&lt;url&gt;http://www.allocine.fr/series/episodes_gen_csaison=\1&amp;cserie=$$4.html&lt;/url&gt;" dest="8">
                    <expression repeat="yes">&quot;/series/casting_gen_csaison=([0-9]*)&amp;cserie=$$4.html&quot; class=&quot;link1&quot;>[0-9]&lt;/a&gt;</expression>
                </RegExp>
                <expression noclean="1"></expression>
            </RegExp>

I tried replacing the &amp; with a bare &, and I tried doubling it (&amp;&amp; and &&), but the & sign never shows up. When I replace the &amp; with &quot;, the " sign appears exactly where I need it; only &amp; doesn't seem to work.

any help would be appreciated.

The_Dogg

Hardware: Revo 3610 + SSD - Harmony 700 Remote
Software: XBMCBuntu Gotham - Sickbeard - SabNZBd+

The_Dogg Offline
Post: #2
After a little more research I found a way to make the missing & show up.


I had to put
Code:
&amp;amp;

so the resulting scraper code is:

Code:
<RegExp input="$$8" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;" dest="5+">    
                <RegExp input="$$2" output="&lt;url&gt;http://www.allocine.fr/series/episodes_gen_csaison=\1&amp;amp;cserie=$$4.html&lt;/url&gt;" dest="8">
                    <expression repeat="yes">&quot;/series/casting_gen_csaison=([0-9]*)&amp;cserie=$$4.html&quot; class=&quot;link1&quot;>[0-9]&lt;/a&gt;</expression>
                </RegExp>
                <expression noclean="1"></expression>
            </RegExp>
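
For anyone wondering how many times to escape: here is a minimal Python sketch, assuming each XML parse between scraper.xml and the final URL decodes exactly one level of entities. The helper name escape_for_passes is made up for illustration and is not part of XBMC or scrap.exe.

```python
from xml.sax.saxutils import escape, unescape


def escape_for_passes(text: str, passes: int) -> str:
    """Escape XML special characters once per parse the text must survive.

    Hypothetical helper for illustration only.
    """
    for _ in range(passes):
        text = escape(text)
    return text


# The URL separator is parsed twice (scraper.xml, then the returned XML),
# so it needs two levels of escaping.
print(escape_for_passes("&", 2))

# Undoing one level per parse gets the original text back.
print(unescape(unescape(escape_for_passes("a&b", 2))))
```

With two passes the helper produces the same &amp;amp; used in the code above.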

spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #3
The reason is that you are in an XML document, and you return XML. Each time XML is parsed, one level of escaping is decoded, so the & has to be written as &amp; once for every parse it must survive. A bare & is not valid in XML, so it gets stripped.
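
The two parsing passes can be reproduced outside XBMC. The sketch below uses Python's xml.etree.ElementTree as a stand-in for the scraper engine, which is an assumption: ElementTree is strict and rejects a bare &, whereas the thread shows scrap.exe silently dropping it. SCRAPER_XML is a simplified, hypothetical one-element version of the scraper code above, with the $$4 and \1 placeholders already filled in.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Simplified stand-in for the inner <RegExp> element; {amp} marks the spot
# where the ampersand between the two URL parameters is written.
SCRAPER_XML = (
    '<RegExp output="&lt;url&gt;http://www.allocine.fr/series/'
    'episodes_gen_csaison=1511{amp}cserie=513.html&lt;/url&gt;"/>'
)


def url_after_two_parses(amp: str) -> Optional[str]:
    """Return the URL left after both XML parses, or None if pass 2 fails."""
    # Pass 1: loading the scraper file decodes one level of entities.
    output = ET.fromstring(SCRAPER_XML.format(amp=amp)).get("output")
    try:
        # Pass 2: the string the scraper returns is parsed as XML again.
        return ET.fromstring(output).text
    except ET.ParseError:
        # A strict parser refuses the bare & left over from pass 1.
        return None


print(url_after_two_parses("&amp;amp;"))  # URL with the & intact
print(url_after_two_parses("&amp;"))      # None: the & did not survive
```

Written once, the escape is used up by the first parse and the second parse sees a bare &. Written twice, one level is left for the second parse, which turns it into the & the URL needs.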