Lovefilm.se (Swedish) scraper - search uses javascript, can that be bypassed?

filigran Offline
Senior Member
Posts: 192
Joined: Oct 2009
Reputation: 0
Post: #1
Hi!
I've been using imdb and thetvdb.com to scrape my movies/TV shows, but being Swedish, I thought it would be nice to have a Swedish scraper, so I tried to create one for http://www.lovefilm.se . I found the dummies howto and figured out the basics, but I've stumbled upon a problem: their search engine uses JavaScript (at least that's what I've come to think), and when parsing the results using scrape.exe I get nothing. I tried to just wget the page, and saw that there are indeed no results.

Can I, somehow, get the search results to be parsed even though they use JS (or whatever)? I found this post: http://forum.xbmc.org/showpost.php?p=262...stcount=10 which says:
Quote:I notice that they are currently using a javascript search system which prevents you from simply parsing the HTML sent back after a search request (bastards *shakes fist*) which makes it only slightly harder to scrape.

I read that like it's doable? Too bad he doesn't say how.
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #2
searching using google is the only workaround i'm aware of. btw, scrape.exe is utterly outdated, check the scraper editor.
The_Ghost16 Offline
Senior Member
Posts: 102
Joined: Dec 2009
Reputation: 0
Location: Netherlands
Post: #3
With the following url you can find a movie:
http://www.lovefilm.se/movieSearch.do?query=
Just paste the movietitle after query= and you will find the movies.
After that is done you can open the movie and you can scrape the result.
This doesn't look that hard.
filigran Offline
Senior Member
Posts: 192
Joined: Oct 2009
Reputation: 0
Post: #4
spiff Wrote:searching using google is the only workaround i'm aware of. btw, scrape.exe is utterly outdated, check the scraper editor.

Yeah, I saw that on the page. The scraper editor, would that be http://forum.xbmc.org/showthread.php?tid=52929 ? I tried that one using wine, but I needed to install mono through wine, and scrape.exe seemed to work properly for just testing. Guess I'll have to fix the Mono stuff for wine to be sure. I'll try using google search. Thanks.

The_Ghost16 Wrote:With the following url you can find a movie:
http://www.lovefilm.se/movieSearch.do?query=
Just paste the movietitle after query= and you will find the movies.
After that is done you can open the movie and you can scrape the result.
This doesn't look that hard.

Yeah, I know. But doing a search for "batman", i.e. http://www.lovefilm.se/movieSearch.do?query=batman gives me a few results. If I select some results, and check the DOM with firefox, I see them:
PHP Code:
<div id="resultAllMovie">
<div class="resultRow">
<div class="boxContentTiny">
<a href="http://www.lovefilm.se/film/49546-Batman.do"
class="showMovieToolTip" rel="49546">
<strong>Batman</strong> (Action)</a>
</div>

that's the first result.
But checking the source, I get this:
PHP Code:
        <div id="resultAllMovie"></div>
        <div id="pagesMovie"></div>

It's the same if I just wget the page. Am I missing something?
If I search for something that only yields one matching result, like "band of brothers", I get to that movie page directly, and there I can scrape the details. But I need to find the search results too.
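
To make the difference concrete, here is a minimal Python sketch (outside XBMC, just for illustration) that runs the same link pattern over both snippets from this post. The pattern and the helper name `find_results` are assumptions made for this example, not part of any scraper. The raw source yields nothing because the result rows are only filled in after the page's JavaScript has run.

```python
import re

# Both snippets are taken from the post above: the raw page source (what wget
# sees) and the DOM as shown by Firefox after the page's JavaScript has run.
RAW_SOURCE = '<div id="resultAllMovie"></div>\n<div id="pagesMovie"></div>'
RENDERED_DOM = (
    '<div id="resultAllMovie">'
    '<div class="resultRow">'
    '<div class="boxContentTiny">'
    '<a href="http://www.lovefilm.se/film/49546-Batman.do" '
    'class="showMovieToolTip" rel="49546">'
    '<strong>Batman</strong> (Action)</a>'
    '</div>'
)

# Illustrative pattern (an assumption): a film link followed by a bold title.
RESULT_LINK = re.compile(
    r'<a href="(http://www\.lovefilm\.se/film/[^"]+)"[^>]*>\s*<strong>([^<]+)</strong>'
)


def find_results(html):
    """Return (url, title) pairs for every result link found in html."""
    return RESULT_LINK.findall(html)


print(find_results(RAW_SOURCE))    # nothing to scrape in the raw source
print(find_results(RENDERED_DOM))  # the Batman entry from the rendered DOM
```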

Thanks for your replies!
filigran Offline
Senior Member
Posts: 192
Joined: Oct 2009
Reputation: 0
Post: #5
Sorry to bring up this forgotten thread again. I gave up on this since I couldn't find a way to work it out, but I just have to ask:

spiff Wrote:searching using google is the only workaround i'm aware of. btw, scrape.exe is utterly outdated, check the scraper editor.

When you say "searching using google", what exactly do you mean? I thought I knew how to google, but I must be missing something obvious.

EDIT: I assume you mean "site:lovefilm.se/film <keyword>"? I guess that's as close as I can come? Or did you have something else in mind?
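
For reference, a query like that can be turned into a URL as sketched below in Python. The helper name `google_site_search_url` is made up for this example, and only the `q` parameter is used; whether Google returns the same results for a scripted request as for a browser is not something this sketch can confirm.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def google_site_search_url(keyword, site="lovefilm.se/film"):
    """Build a Google search URL restricted to the given site path."""
    query = f"site:{site} {keyword}"
    # urlencode percent-encodes ':' and '/' and turns spaces into '+'.
    return "http://www.google.com/search?" + urlencode({"q": query})


url = google_site_search_url("the dark knight")
print(url)

# Decoding the URL again gives back the query as it would be typed by hand.
print(parse_qs(urlsplit(url).query)["q"][0])
```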

Could a javascript capable scraper be something for the future? Might be a security risk I suppose ... or is it just not possible?
(This post was last modified: 2010-02-02 23:28 by filigran.)
vdrfan Offline
Team-XBMC Developer
Posts: 2,894
Joined: Jan 2008
Reputation: 8
Location: Germany
Post: #6
Why not just use http://www.lovefilm.se/movieSearch.do?qu...keyword>; ?

filigran Offline
Senior Member
Posts: 192
Joined: Oct 2009
Reputation: 0
Post: #7
vdrfan Wrote:Why not just use http://www.lovefilm.se/movieSearch.do?qu...keyword>; ?

Like I said earlier:
filigran Wrote:Yeah, I know. But doing a search for "batman", i.e. http://www.lovefilm.se/movieSearch.do?query=batman gives me a few results. If I select some results, and check the DOM with firefox, I see them:
PHP Code:
<div id="resultAllMovie">
<div class="resultRow">
<div class="boxContentTiny">
<a href="http://www.lovefilm.se/film/49546-Batman.do"
class="showMovieToolTip" rel="49546">
<strong>Batman</strong> (Action)</a>
</div>

that's the first result.
But checking the source, I get this:
PHP Code:
        <div id="resultAllMovie"></div>
        <div id="pagesMovie"></div>

It's the same if I just wget the page. Am I missing something?
If I search for something that only yields one matching result, like "band of brothers", I get to that movie page directly, and there I can scrape the details. But I need to find the search results too.

Am I just being totally fucking dumb here?
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #8
nope. what i mean by search using google is something ala

http://www.google.com/search?hl=en&site=...tnG=Search
filigran Offline
Senior Member
Posts: 192
Joined: Oct 2009
Reputation: 0
Post: #9
spiff Wrote:nope. what i mean by search using google is something ala

http://www.google.com/search?hl=en&site=...tnG=Search

Yeah, that's what I meant, I just posted the search string rather than the url.

I got a bit further now, and I have a scraper that works inside the editor, but not inside XBMC.

If I use the editor and test the scraper, it asks me for a search string, gets a url, gives me a list of search results, and then fetches info for the one I choose. All is well. But inside XBMC I get no results when scanning, and no results when adding manually (hitting 'I' on a movie). Other scrapers work.
The XBMC log says this:
Code:
23:33:01 T:3860 M:450981888   DEBUG: SDLKeyboard: scancode: 23, sym: 105, unicode: 105, modifier: 0
23:33:01 T:3860 M:450981888   DEBUG: CApplication::OnKey: 61513 pressed, action is 11
23:33:01 T:3860 M:450969600   DEBUG: CVideoDatabase::GetMovieId (D:\Documents and Settings\Administrator\Skrivbord\filmer\the dark knight.iso), query = select idMovie from movie where idFile=2
23:33:01 T:3860 M:450945024   DEBUG: No NFO file found. Using title search for 'D:\Documents and Settings\Administrator\Skrivbord\filmer\the dark knight.iso'
23:33:01 T:3860 M:450945024    INFO: Loading skin file: DialogProgress.xml
23:33:01 T:3860 M:450940928   DEBUG: Load DialogProgress.xml: 3.23ms
23:33:01 T:3860 M:450940928   DEBUG: ------ Window Init (DialogProgress.xml) ------
23:33:01 T:3860 M:450940928   DEBUG: Alloc resources: 0.08ms (0.00 ms skin load)
23:33:01 T:2192 M:450609152   DEBUG: thread start, auto delete: 0
23:33:01 T:2192 M:450588672   DEBUG: CIMDB::InternalFindMovie: Searching for 'filmer' using Lovefilm.se scraper (file: 'lovefilm.xml', content: 'movies', language: 'sv', date: '2010-02-04', framework: '1,1')
23:33:01 T:2192 M:450506752   DEBUG: FileCurl::Open(0012D770) http://www.google.com/search?hl=en&q=intitle:filmer+site:lovefilm.se/film&num=100
23:33:01 T:2192 M:450465792    INFO: XCURL::DllLibCurlGlobal::easy_aquire - Created session to http://www.google.com
23:33:01 T:2192 M:450408448   DEBUG: FileCurl::Close(0012D770) http://www.google.com/search?hl=en&q=intitle:filmer+site:lovefilm.se/film&num=100
23:33:01 T:2192 M:450404352   DEBUG: scraper: GetSearchResults returned <results></results>
23:33:01 T:2192 M:450404352   ERROR: CIMDB::Process: Error looking up movie filmer
23:33:01 T:2192 M:450404352   DEBUG: Thread 2192 terminating
23:33:01 T:3860 M:450523136    INFO: Loading skin file: DialogKeyboard.xml
23:33:01 T:3860 M:450916352   DEBUG: Load DialogKeyboard.xml: 21.32ms
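
One detail worth pulling out of this log: the term XBMC searched for is 'filmer', which is the name of the folder holding the file, not 'the dark knight'. The Python sketch below extracts both values from the two relevant log lines so they can be compared. Why XBMC picked the folder name (for example a folder-name lookup setting) is an assumption that nothing in this thread confirms; the helper name `extract` is made up for the example.

```python
import ntpath
import re

# Two lines copied verbatim from the debug log above.
LOG_LINES = [
    r"23:33:01 T:3860 M:450945024   DEBUG: No NFO file found. Using title search for 'D:\Documents and Settings\Administrator\Skrivbord\filmer\the dark knight.iso'",
    r"23:33:01 T:2192 M:450588672   DEBUG: CIMDB::InternalFindMovie: Searching for 'filmer' using Lovefilm.se scraper (file: 'lovefilm.xml', content: 'movies', language: 'sv', date: '2010-02-04', framework: '1,1')",
]

PATH_RE = re.compile(r"Using title search for '(.+)'$")
TERM_RE = re.compile(r"Searching for '([^']*)' using")


def extract(lines):
    """Return (file_path, search_term) as found in the log lines."""
    path = term = None
    for line in lines:
        m = PATH_RE.search(line)
        if m:
            path = m.group(1)
        m = TERM_RE.search(line)
        if m:
            term = m.group(1)
    return path, term


path, term = extract(LOG_LINES)
# ntpath handles Windows-style paths regardless of the OS running this script.
folder = ntpath.basename(ntpath.dirname(path))
file_stem = ntpath.splitext(ntpath.basename(path))[0]

print("searched for:", term)
print("parent folder:", folder)
print("file name:", file_stem)
```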

The results, according to the editor, are:
Code:
<results><entity><url>http://www.lovefilm.se/film/48044-The+Dark+Knight.do</url><title>The Dark Knight DVD</title></entity><entity><url>http://www.lovefilm.se/film/52631-The+Dark+Knight+(Blu-ray)+-+Extramaterial.do</url><title>The Dark Knight (Blu-ray) - Extramaterial</title></entity><entity><url>http://www.lovefilm.se/film/51628-The+Dark+Knight+(Blu-ray).do;jsessionid=DDC3B8E739F803541C84096C18C90991</url><title>The Dark Knight (Blu-ray)</title></entity></results>
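
As a sanity check on that output, here is a small Python sketch that parses the `<results>` string exactly as the editor printed it and lists the entries. Stripping the `;jsessionid=...` suffix from the third URL is an optional clean-up added for this example; the thread does not say whether XBMC needs it.

```python
import xml.etree.ElementTree as ET

# The editor output quoted above, split over several lines for readability.
RESULTS = (
    "<results>"
    "<entity><url>http://www.lovefilm.se/film/48044-The+Dark+Knight.do</url>"
    "<title>The Dark Knight DVD</title></entity>"
    "<entity><url>http://www.lovefilm.se/film/52631-The+Dark+Knight+(Blu-ray)+-+Extramaterial.do</url>"
    "<title>The Dark Knight (Blu-ray) - Extramaterial</title></entity>"
    "<entity><url>http://www.lovefilm.se/film/51628-The+Dark+Knight+(Blu-ray).do;jsessionid=DDC3B8E739F803541C84096C18C90991</url>"
    "<title>The Dark Knight (Blu-ray)</title></entity>"
    "</results>"
)


def list_entities(results_xml):
    """Return (title, url) pairs, with any ;jsessionid=... suffix removed."""
    root = ET.fromstring(results_xml)
    entries = []
    for entity in root.findall("entity"):
        url = entity.findtext("url").split(";jsessionid=", 1)[0]
        entries.append((entity.findtext("title"), url))
    return entries


entries = list_entities(RESULTS)
for title, url in entries:
    print(title, "->", url)
```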

My XML code:
PHP Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1,1" date="2010-02-04" name="Lovefilm.se" content="movies" thumb="lovefilm.png" language="sv">
    <CreateSearchUrl dest="4">
        <!-- I've used both <url>...</url> and like this here. Using <url>...</url> gives an error in the editor, but neither works in XBMC. -->
        <RegExp input="$$1" output="http://www.google.com/search?hl=en&amp;q=intitle:\1+site:lovefilm.se/film&amp;num=100" dest="4">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="6">
        <RegExp input="$$5" output="<results>\1</results>" dest="6">
            <RegExp input="$$1" output="<entity><url>\1</url><title>\2</title></entity>" dest="5">
                <expression repeat="yes"><a href=&quot;(http:\/\/www.lovefilm.se\/film\/.*?)&quot;.*?\)&quot;>(.*?) - Hyr</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="8">
        <RegExp input="$$7" output="<details>\1</details>" dest="8">
            <RegExp input="$$1" output="<title>\1</title><year>\2</year>" dest="7+">
                <expression><h1>.*?>(.*?)<\/span>.*?([0-9]+)\)</expression>
            </RegExp>
            <RegExp input="$$1" output="<rating>\1</rating><votes>\2</votes>" dest="7+">
                <expression>\(([0-9],[0-9])\) \(([0-9]+) röster\)<\/p></expression>
            </RegExp>
            <RegExp input="$$1" output="<originaltitle>\1</originaltitle>" dest="7+">
                <expression>Originaltitel:<\/div>.*?<div class=&quot;mainInfoRowRight&quot;>.*?<strong>(.*?)<\/strong></expression>
            </RegExp>
            <RegExp input="$$1" output="<director>\1</director>" dest="7+">
                <expression>REGISSÖR<\/li>.*?<ul>.*?<li>.*?>(.*?)<\/a><\/li></expression>
            </RegExp>
            <RegExp input="$$1" output="<plot>\1</plot>" dest="7+">
                <expression trim="1"><div id=&quot;description&quot;>(.*?)<\/div></expression>
            </RegExp>
            <RegExp input="$$1" output="<genre>\1</genre>" dest="7+">
                <expression cs="true" repeat="yes" trim="1"><li class=&quot;header&quot;>GENRE</li>.*?<a href=&quot;/category/.*?>(.*?)</a></li></expression>
            </RegExp>
            <RegExp input="$$1" output="<runtime>\1</runtime>" dest="7+">
                <expression><span>.[^ ]*DVD.*?Speltid:.*?<strong>(.*?)\.<\/strong></expression>
            </RegExp>
            <RegExp input="$$1" output="<thumb>\1</thumb>" dest="7+">
                <expression><img src=&quot;(http://static.lovefilm.se/img/cover/movie/huge/.*?)&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
<!--Created with ScraperXml Editor, Author: filigran-->
</scraper> 

My regexes probably suck, but they yield some results in the editor at least.
Is there anything missing, some required field? NfoUrl and stuff, do they have to be there?

Thanks for your help so far!
spiff Offline
Retired Developer
Posts: 12,386
Joined: Nov 2003
Post: #10
no reason to escape the /'es.
filigran Offline
Senior Member
Posts: 192
Joined: Oct 2009
Reputation: 0
Post: #11
spiff Wrote:no reason to escape the /'es.

Yeah, I noticed I did that. I use rubular (rubular.com - great tool btw) to test out my regexes, and it requires escaping; I forgot to remove some, I guess. Anyhoo, removing them made no difference.
I ended up doing this:
Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1,1" date="2010-02-04" name="Lovefilm.se" content="movies" thumb="lovefilm.png" language="sv">
    <CreateSearchUrl dest="4">
        <RegExp input="$$1" output="&lt;url&gt;http://www.google.com/search?q=intitle:\1+site:lovefilm.se/film&amp;num=100&lt;/url&gt;" dest="4">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="6">
        <RegExp input="$$5" output="&lt;results&gt;\1&lt;/results&gt;" dest="6">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;url&gt;\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" clear="yes">a href="(http://www.lovefilm.se/film/[^\.]*.do)".[^&gt;]*&gt;(.[^H]*) Hyr</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="8">
        <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="8">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;year&gt;\2&lt;/year&gt;" dest="7">
                <expression trim="1">&lt;h1&gt;.[^&gt;]*&gt;(.[^&lt;]*)&lt;/span&gt;.[^0-9]*([0-9]+)\)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="7+">
                <expression trim="1">\(([0-9],[0-9])\) \(([0-9]+) r.ster\)&lt;/p&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;originaltitle&gt;\1&lt;/originaltitle&gt;" dest="7+">
                <expression trim="1">Originaltitel:&lt;/div&gt;.[^&lt;]*&lt;div class="mainInfoRowRight"&gt;.[^&lt;]*&lt;strong&gt;(.[^&lt;]*)&lt;/strong&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;director&gt;\1&lt;/director&gt;" dest="7+">
                <expression trim="1">REGISSÖR&lt;/li&gt;.[^&lt;]*&lt;ul&gt;.[^&lt;]*&lt;li&gt;.[^&gt;]*&gt;(.[^&lt;]*)&lt;/a&gt;&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="7+">
                <expression trim="1">&lt;div id="description"&gt;(.[^&lt;]*)&lt;/div&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="7+">
                <expression cs="true" trim="1">&lt;li class="header"&gt;GENRE&lt;/li&gt;.[^&lt;]*&lt;a href="/category/.[^&gt;]*&gt;(.[^&lt;]*)&lt;/a&gt;&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="7+">
                <expression trim="1">&lt;span&gt;.[^ ]*DVD.[^S]*Speltid:.[^&lt;]*&lt;strong&gt;(.[^\.]*)\.&lt;/strong&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="7+">
                <expression trim="1">&lt;img src="(http://static.lovefilm.se/img/cover/movie/huge/.[^"]*)"</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
<!--Created with ScraperXml Editor, Author: filigran-->
</scraper>
With that I finally got it working, to some degree. I think it was the non-greedy .*? matching that XBMC didn't like; once I replaced it with negated character classes like [^<]*, it started working. It finds results and gets the info I want, but the plot is messed up:
[Image: 11w6lgi.jpg]
In the page code there are four tabs before the plot text starts, and those end up as squares. I thought trimming would take care of that, but I guess it only removes spaces? Or am I using it wrong?
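
The effect is easy to reproduce outside XBMC. In the Python sketch below, the page fragment is hypothetical (only the four leading tabs are taken from the description above) and the first pattern is the plot expression from the scraper. The second pattern, which swallows leading whitespace with `\s*` before the capture group, is just one possible workaround; whether XBMC's regex engine treats `\s` the same way, and what `trim="1"` actually strips, are assumptions not confirmed here.

```python
import re

# Hypothetical page fragment: four tabs precede the plot text.
PAGE = '<div id="description">\t\t\t\tBatman faces the Joker.</div>'

# Plot expression as used in the scraper: the tabs end up inside the capture.
PLOT_RE = re.compile(r'<div id="description">(.[^<]*)</div>')
raw_plot = PLOT_RE.search(PAGE).group(1)

# Possible workaround (an assumption): consume leading whitespace before capturing.
PLOT_RE_NO_LEADING_WS = re.compile(r'<div id="description">\s*([^<]*)</div>')
clean_plot = PLOT_RE_NO_LEADING_WS.search(PAGE).group(1)

print(repr(raw_plot))    # still starts with the four tab characters
print(repr(clean_plot))  # tabs are gone
```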

Another issue is that it doesn't find all the results that I do when searching manually with google, for some reason (using the same url). But I'm done trying to fix this. I'll just use the filmdelta.se scraper for now, it works well enough for me.

If you know why it's doing that, and how to fix it, I'd be glad to hear what's causing it though. For future reference.

Thanks for your help!
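
One possible explanation for the missing results, checked here with Python's `re` module rather than inside XBMC: the search expression above requires a `"` directly after `.do`, so links carrying a `;jsessionid=...` suffix are skipped, and `(.[^H]*)` cannot step past a capital H unless it is the very first character of the title. The sample anchors below are hypothetical (the film id 11111 and the `class="l"` attribute are made up); real Google markup may differ, so treat this as a sketch of the pattern's behaviour only.

```python
import re

# Search expression copied from the GetSearchResults section of the scraper.
SEARCH_RE = re.compile(
    r'a href="(http://www.lovefilm.se/film/[^\.]*.do)".[^>]*>(.[^H]*) Hyr'
)

# Hypothetical anchors in the general shape the expression expects.
SAMPLES = [
    # Plain link and title: matches.
    '<a href="http://www.lovefilm.se/film/49546-Batman.do" class="l">Batman - Hyr film</a>',
    # ";jsessionid=" follows ".do", so the closing quote is not where the pattern wants it.
    '<a href="http://www.lovefilm.se/film/51628-The+Dark+Knight+(Blu-ray).do;jsessionid=DDC3" class="l">The Dark Knight (Blu-ray) - Hyr film</a>',
    # Capital H inside the title stops [^H]* before " Hyr" is reached.
    '<a href="http://www.lovefilm.se/film/11111-The+Hangover.do" class="l">The Hangover - Hyr film</a>',
]

matched = [SEARCH_RE.search(sample) is not None for sample in SAMPLES]
for sample, ok in zip(SAMPLES, matched):
    print("match" if ok else "no match", "|", sample[:70])
```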
jojje Offline
Junior Member
Posts: 1
Joined: Apr 2010
Reputation: 0
Post: #12
...
(This post was last modified: 2010-04-20 23:16 by jojje.)
PrimaryMaster Offline
Senior Member
Posts: 171
Joined: May 2009
Reputation: 0
Post: #13
Confused?
