Lovefilm.se (Swedish) scraper - search uses javascript, can that be bypassed?

Lovefilm.se (Swedish) scraper - search uses javascript, can that be bypassed? - filigran - 2010-01-20

Hi!
I've been using imdb and thetvdb.com to scrape my movies/TV shows, but being Swedish, I thought it would be nice to have a Swedish scraper, so I tried to create one for http://www.lovefilm.se . I found the dummies howto and figured out the basics, but I've hit a problem: their search engine uses JavaScript (at least that's what I think), and when I parse the results using scrape.exe I get nothing. I also tried fetching the page with wget, and there are indeed no results in it.

Can I, somehow, get the search results to be parsed even though they use JS (or whatever)? I found this post: http://forum.xbmc.org/showpost.php?p=262584&postcount=10 which says:

Quote:I notice that they are currently using a javascript search system which prevents you from simply parsing the HTML sent back after a search request (bastards *shakes fist*) which makes it only slightly harder to scrape.

I read that like it's doable? Too bad he doesn't say how.

- spiff - 2010-01-20

searching using google is the only workaround i'm aware of. btw, scrape.exe is utterly outdated, check the scraper editor.

- The_Ghost16 - 2010-01-20

With the following url you can find a movie:
http://www.lovefilm.se/movieSearch.do?query=
Just paste the movie title after query= and you will find the movies.
After that is done you can open the movie and you can scrape the result.
This doesn't look that hard.

- filigran - 2010-01-20

spiff Wrote:searching using google is the only workaround i'm aware of. btw, scrape.exe is utterly outdated, check the scraper editor.

Yeah, I saw that on the page. The scraper editor, would that be http://forum.xbmc.org/showthread.php?tid=52929 ? I tried that one using wine, but I needed to install mono through wine, and scrape.exe seemed to work properly for just testing. Guess I'll have to fix the Mono stuff for wine to be sure. I'll try using google search. Thanks.

The_Ghost16 Wrote:With the following url you can find a movie:
http://www.lovefilm.se/movieSearch.do?query=
Just paste the movie title after query= and you will find the movies.
After that is done you can open the movie and you can scrape the result.
This doesn't look that hard.

Yeah, I know. But a search for "batman", i.e. http://www.lovefilm.se/movieSearch.do?query=batman gives me a few results. If I select some of the results and check the DOM with Firefox, I see them:

PHP Code:
<div id="resultAllMovie">
<div class="resultRow">
<div class="boxContentTiny">
<a href="http://www.lovefilm.se/film/49546-Batman.do"
class="showMovieToolTip" rel="49546">
<strong>Batman</strong> (Action)</a>
</div> 

that's the first result.
But checking the source, I get this:

PHP Code:
        <div id="resultAllMovie"></div>
        <div id="pagesMovie"></div> 

It's the same if I just wget the page. Am I missing something?
If I search for something that only yields one matching result, like "band of brothers", I get to that movie page directly, and there I can scrape the details. But I need to find the search results too.

Thanks for your replies!
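
The difference between the served source and the browser's DOM can be reproduced offline. The following is a minimal Python sketch, not part of any scraper: the two strings are the snippets quoted above, while the pattern `RESULT_LINK` and the helper `find_results` are hypothetical names introduced for illustration. It shows that a regex which matches the rendered DOM finds nothing in the markup that wget receives, because the result rows are only added by JavaScript after the page loads.

```python
import re

# Markup as served (what wget or "view source" shows): the container is empty.
served_html = '<div id="resultAllMovie"></div>\n<div id="pagesMovie"></div>'

# Markup after the browser has run the page's JavaScript (as seen in the DOM).
rendered_dom = (
    '<div id="resultAllMovie">\n'
    '<div class="resultRow">\n'
    '<div class="boxContentTiny">\n'
    '<a href="http://www.lovefilm.se/film/49546-Batman.do"\n'
    'class="showMovieToolTip" rel="49546">\n'
    '<strong>Batman</strong> (Action)</a>\n'
    '</div>'
)

# Hypothetical pattern: capture the film URL and the title inside <strong>.
RESULT_LINK = re.compile(
    r'<a href="(http://www\.lovefilm\.se/film/[^"]+)"[^>]*>\s*<strong>([^<]+)</strong>'
)


def find_results(html):
    """Return a list of (url, title) pairs found in the given markup."""
    return RESULT_LINK.findall(html)


print(find_results(served_html))   # nothing to match in the served page
print(find_results(rendered_dom))  # one (url, title) pair from the rendered DOM
```

A scraper that only downloads the page, as scrape.exe and wget do, sees the first string and never the second.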

- filigran - 2010-02-02

Sorry to bring up this forgotten thread again. I gave up on this since I couldn't find a way to work it out, but I just have to ask:

spiff Wrote:searching using google is the only workaround i'm aware of. btw, scrape.exe is utterly outdated, check the scraper editor.

When you say "searching using google", what exactly do you mean? I thought I knew how to google, but I must be missing something obvious.

EDIT: I assume you mean "site:lovefilm.se/film <keyword>"? I guess that's as close as I can come? Or did you have something else in mind?

Could a JavaScript-capable scraper be something for the future? It might be a security risk, I suppose ... or is it just not possible?

- mkortstiege - 2010-02-03

Why not just use http://www.lovefilm.se/movieSearch.do?query=<keyword> ?

- filigran - 2010-02-03

vdrfan Wrote:Why not just use http://www.lovefilm.se/movieSearch.do?query=<keyword> ?

Like I said earlier:

filigran Wrote:Yeah, I know. But a search for "batman", i.e. http://www.lovefilm.se/movieSearch.do?query=batman gives me a few results. If I select some of the results and check the DOM with Firefox, I see them:

PHP Code:
<div id="resultAllMovie">
<div class="resultRow">
<div class="boxContentTiny">
<a href="http://www.lovefilm.se/film/49546-Batman.do"
class="showMovieToolTip" rel="49546">
<strong>Batman</strong> (Action)</a>
</div>

that's the first result.
But checking the source, I get this:

PHP Code:
<div id="resultAllMovie"></div>
<div id="pagesMovie"></div>

It's the same if I just wget the page. Am I missing something?
If I search for something that only yields one matching result, like "band of brothers", I get to that movie page directly, and there I can scrape the details. But I need to find the search results too.

Am I just being totally fucking dumb here?

- spiff - 2010-02-03

nope. what i mean by search using google is something ala

http://www.google.com/search?hl=en&site=&q=batman+site%3Alovefilm.se&btnG=Search
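
A URL of this shape can be built programmatically. This is a minimal Python sketch using only the standard library; the function name `google_site_search_url` and its `site` parameter are made up for illustration, and it keeps only the `hl` and `q` parameters from the URL above. It does not fetch anything, and it makes no claim about what HTML Google returns to a non-browser client.

```python
from urllib.parse import urlencode


def google_site_search_url(title, site="lovefilm.se"):
    """Build a Google search URL restricted to one site (hypothetical helper)."""
    # urlencode percent-encodes the ':' in the site operator and turns spaces into '+'.
    params = {"hl": "en", "q": f"{title} site:{site}"}
    return "http://www.google.com/search?" + urlencode(params)


print(google_site_search_url("batman"))
# http://www.google.com/search?hl=en&q=batman+site%3Alovefilm.se
```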

- filigran - 2010-02-05

spiff Wrote:nope. what i mean by search using google is something ala

http://www.google.com/search?hl=en&site=&q=batman+site%3Alovefilm.se&btnG=Search

Yeah, that's what I meant; I just gave the search string rather than the URL.

I got a bit further now, and I have a scraper that works inside the editor, but not inside XBMC.

If I use the editor and test the scraper, it asks me for a search string, gets a URL, gives me a list of search results, and then fetches info for the one I choose. All is well. But inside XBMC I get no results when scanning, and no results when adding manually (hitting 'I' on a movie). Other scrapers work.
The XBMC log says this:

Code:
33:01 T:3860 M:450981888   DEBUG: SDLKeyboard: scancode: 23, sym: 105, unicode: 105, modifier: 0
33:01 T:3860 M:450981888   DEBUG: CApplication::OnKey: 61513 pressed, action is 11
33:01 T:3860 M:450969600   DEBUG: CVideoDatabase::GetMovieId (D:\Documents and Settings\Administrator\Skrivbord\filmer\the dark knight.iso), query = select idMovie from movie where idFile=2
33:01 T:3860 M:450945024   DEBUG: No NFO file found. Using title search for 'D:\Documents and Settings\Administrator\Skrivbord\filmer\the dark knight.iso'
33:01 T:3860 M:450945024    INFO: Loading skin file: DialogProgress.xml
33:01 T:3860 M:450940928   DEBUG: Load DialogProgress.xml: 3.23ms
33:01 T:3860 M:450940928   DEBUG: ------ Window Init (DialogProgress.xml) ------
33:01 T:3860 M:450940928   DEBUG: Alloc resources: 0.08ms (0.00 ms skin load)
33:01 T:2192 M:450609152   DEBUG: thread start, auto delete: 0
33:01 T:2192 M:450588672   DEBUG: CIMDB::InternalFindMovie: Searching for 'filmer' using Lovefilm.se scraper (file: 'lovefilm.xml', content: 'movies', language: 'sv', date: '2010-02-04', framework: '1,1')
33:01 T:2192 M:450506752   DEBUG: FileCurl::Open(0012D770) http://www.google.com/search?hl=en&q=intitle:filmer+site:lovefilm.se/film&num=100
33:01 T:2192 M:450465792    INFO: XCURL::DllLibCurlGlobal::easy_aquire - Created session to http://www.google.com
33:01 T:2192 M:450408448   DEBUG: FileCurl::Close(0012D770) http://www.google.com/search?hl=en&q=intitle:filmer+site:lovefilm.se/film&num=100
33:01 T:2192 M:450404352   DEBUG: scraper: GetSearchResults returned <results></results>
33:01 T:2192 M:450404352   ERROR: CIMDB::Process: Error looking up movie filmer
33:01 T:2192 M:450404352   DEBUG: Thread 2192 terminating
33:01 T:3860 M:450523136    INFO: Loading skin file: DialogKeyboard.xml
33:01 T:3860 M:450916352   DEBUG: Load DialogKeyboard.xml: 21.32ms

The results, according to the editor, are:

Code:
<results>
    <entity><url>http://www.lovefilm.se/film/48044-The+Dark+Knight.do</url><title>The Dark Knight DVD</title></entity>
    <entity><url>http://www.lovefilm.se/film/52631-The+Dark+Knight+(Blu-ray)+-+Extramaterial.do</url><title>The Dark Knight (Blu-ray) - Extramaterial</title></entity>
    <entity><url>http://www.lovefilm.se/film/51628-The+Dark+Knight+(Blu-ray).do;jsessionid=DDC3B8E739F803541C84096C18C90991</url><title>The Dark Knight (Blu-ray)</title></entity>
</results>
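
One way to sanity-check a `<results>` document like the one above, outside the editor, is to parse it and list each entity's title and URL. This is a minimal Python sketch using the standard library; `list_entities` is a hypothetical helper name, and the input string is the editor output quoted above. It only confirms that the XML is well-formed and that every `<entity>` carries a `<title>` and a `<url>`; it says nothing about how XBMC itself consumes the result.

```python
import xml.etree.ElementTree as ET

# The editor output quoted above, split across lines for readability.
results_xml = (
    "<results>"
    "<entity><url>http://www.lovefilm.se/film/48044-The+Dark+Knight.do</url>"
    "<title>The Dark Knight DVD</title></entity>"
    "<entity><url>http://www.lovefilm.se/film/52631-The+Dark+Knight+(Blu-ray)+-+Extramaterial.do</url>"
    "<title>The Dark Knight (Blu-ray) - Extramaterial</title></entity>"
    "<entity><url>http://www.lovefilm.se/film/51628-The+Dark+Knight+(Blu-ray).do;"
    "jsessionid=DDC3B8E739F803541C84096C18C90991</url>"
    "<title>The Dark Knight (Blu-ray)</title></entity>"
    "</results>"
)


def list_entities(xml_text):
    """Parse a <results> document and return (title, url) pairs."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    return [(e.findtext("title"), e.findtext("url")) for e in root.findall("entity")]


for title, url in list_entities(results_xml):
    print(title, "->", url)
```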

My XML code:

PHP Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1,1" date="2010-02-04" name="Lovefilm.se" content="movies" thumb="lovefilm.png" language="sv">
    <CreateSearchUrl dest="4">
        <!-- I've used both <url>...</url> and like this here. Using <url>...</url> gives an error in the editor, but neither works in XBMC. -->
        <RegExp input="$$1" output="http://www.google.com/search?hl=en&amp;q=intitle:\1+site:lovefilm.se/film&amp;num=100" dest="4">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="6">
        <RegExp input="$$5" output="&lt;results&gt;\1&lt;/results&gt;" dest="6">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;url&gt;\1&lt;/url&gt;&lt;title&gt;\2&lt;/title&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes">&lt;a href=&quot;(http:\/\/www.lovefilm.se\/film\/.*?)&quot;.*?\)&quot;&gt;(.*?) - Hyr</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="8">
        <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="8">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;year&gt;\2&lt;/year&gt;" dest="7+">
                <expression>&lt;h1&gt;.*?&gt;(.*?)&lt;\/span&gt;.*?([0-9]+)\)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="7+">
                <expression>\(([0-9],[0-9])\) \(([0-9]+) röster\)&lt;\/p&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;originaltitle&gt;\1&lt;/originaltitle&gt;" dest="7+">
                <expression>Originaltitel:&lt;\/div&gt;.*?&lt;div class=&quot;mainInfoRowRight&quot;&gt;.*?&lt;strong&gt;(.*?)&lt;\/strong&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;director&gt;\1&lt;/director&gt;" dest="7+">
                <expression>REGISSÖR&lt;\/li&gt;.*?&lt;ul&gt;.*?&lt;li&gt;.*?&gt;(.*?)&lt;\/a&gt;&lt;\/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="7+">
                <expression trim="1">&lt;div id=&quot;description&quot;&gt;(.*?)&lt;\/div&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="7+">
                <expression cs="true" repeat="yes" trim="1">&lt;li class=&quot;header&quot;&gt;GENRE&lt;/li&gt;.*?&lt;a href=&quot;/category/.*?&gt;(.*?)&lt;/a&gt;&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="7+">
                <expression>&lt;span&gt;.[^ ]*DVD.*?Speltid:.*?&lt;strong&gt;(.*?)\.&lt;\/strong&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="7+">
                <expression>&lt;img src=&quot;(http://static.lovefilm.se/img/cover/movie/huge/.*?)&quot;</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
<!--Created with ScraperXml Editor, Author: filigran-->
</scraper> 

My regexes probably suck, but they yield some results in the editor at least.
Is there anything missing, some required field? NfoUrl and stuff, do they have to be there?

Thanks for your help so far!

- spiff - 2010-02-05

no reason to escape the /'es.

- filigran - 2010-02-06

spiff Wrote:no reason to escape the /'es.

Yeah, I noticed I did that. I use rubular (rubular.com - great tool btw) to test my regexes, and it requires escaping; I forgot to remove some, I guess. Anyhoo, removing them made no difference. I got it working though, to some degree.
I ended up doing this:

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper framework="1,1" date="2010-02-04" name="Lovefilm.se" content="movies" thumb="lovefilm.png" language="sv">
    <CreateSearchUrl dest="4">
        <RegExp input="$$1" output="&lt;url&gt;http://www.google.com/search?q=intitle:\1+site:lovefilm.se/film&amp;num=100&lt;/url&gt;" dest="4">
            <expression></expression>
        </RegExp>
    </CreateSearchUrl>
    <GetSearchResults dest="6">
        <RegExp input="$$5" output="&lt;results&gt;\1&lt;/results&gt;" dest="6">
            <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2&lt;/title&gt;&lt;url&gt;\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
                <expression repeat="yes" clear="yes">a href="(http://www.lovefilm.se/film/[^\.]*.do)".[^&gt;]*&gt;(.[^H]*) Hyr</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetSearchResults>
    <GetDetails dest="8">
        <RegExp input="$$7" output="&lt;details&gt;\1&lt;/details&gt;" dest="8">
            <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;year&gt;\2&lt;/year&gt;" dest="7">
                <expression trim="1">&lt;h1&gt;.[^&gt;]*&gt;(.[^&lt;]*)&lt;/span&gt;.[^0-9]*([0-9]+)\)</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;&lt;votes&gt;\2&lt;/votes&gt;" dest="7+">
                <expression trim="1">\(([0-9],[0-9])\) \(([0-9]+) r.ster\)&lt;/p&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;originaltitle&gt;\1&lt;/originaltitle&gt;" dest="7+">
                <expression trim="1">Originaltitel:&lt;/div&gt;.[^&lt;]*&lt;div class="mainInfoRowRight"&gt;.[^&lt;]*&lt;strong&gt;(.[^&lt;]*)&lt;/strong&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;director&gt;\1&lt;/director&gt;" dest="7+">
                <expression trim="1">REGISSÖR&lt;/li&gt;.[^&lt;]*&lt;ul&gt;.[^&lt;]*&lt;li&gt;.[^&gt;]*&gt;(.[^&lt;]*)&lt;/a&gt;&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="7+">
                <expression trim="1">&lt;div id="description"&gt;(.[^&lt;]*)&lt;/div&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="7+">
                <expression cs="true" trim="1">&lt;li class="header"&gt;GENRE&lt;/li&gt;.[^&lt;]*&lt;a href="/category/.[^&gt;]*&gt;(.[^&lt;]*)&lt;/a&gt;&lt;/li&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="7+">
                <expression trim="1">&lt;span&gt;.[^ ]*DVD.[^S]*Speltid:.[^&lt;]*&lt;strong&gt;(.[^\.]*)\.&lt;/strong&gt;</expression>
            </RegExp>
            <RegExp input="$$1" output="&lt;thumb&gt;\1&lt;/thumb&gt;" dest="7+">
                <expression trim="1">&lt;img src="(http://static.lovefilm.se/img/cover/movie/huge/.[^"]*)"</expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>
    </GetDetails>
<!--Created with ScraperXml Editor, Author: filigran-->
</scraper>

and finally got it working, to some degree. I think it was the .*? matching it didn't like. I replaced that and it started working. It finds results, and gets the info I want, but the plot is messed up:

In the page code there are four tabs before the plot text starts, and those end up as squares. I thought trimming would take care of that, but I guess it only removes spaces? Or am I using it wrong?
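
Whatever `trim` does in this XBMC version (not confirmed here), leading tabs can also be kept out of the capture group by the pattern itself. The Python sketch below illustrates the idea on a made-up page fragment; the sample text, the names `NAIVE` and `TIGHT`, and the assumption that the scraper's regex engine supports `\s` are all unverified for XBMC. The first pattern captures the tabs along with the plot, while the second consumes surrounding whitespace outside the group.

```python
import re

# Made-up fragment: four tabs precede the plot text, as described above.
page = '<div id="description">\n\t\t\t\tPlot text goes here.\n\t\t\t</div>'

# Captures everything between the tags, including the leading tabs.
NAIVE = re.compile(r'<div id="description">([^<]*)</div>')

# Consumes whitespace before and after the text outside the capture group.
TIGHT = re.compile(r'<div id="description">\s*([^<]*?)\s*</div>')

naive_plot = NAIVE.search(page).group(1)
tight_plot = TIGHT.search(page).group(1)

print(repr(naive_plot))  # still contains the newline and tabs
print(repr(tight_plot))  # only the plot text

# Alternative clean-up after the fact: collapse any run of whitespace to one space.
collapsed = " ".join(naive_plot.split())
print(repr(collapsed))
```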

Another issue is that, for some reason, it doesn't find all the results that I get when searching manually with Google using the same URL. But I'm done trying to fix this. I'll just use the filmdelta.se scraper for now; it works well enough for me.

If you know why it's doing that and how to fix it, I'd be glad to hear what's causing it, though. For future reference.

Thanks for your help!
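
On the `.*?` point above, one possible explanation (not a confirmed diagnosis of XBMC's regex engine) is that in many engines `.` does not match a newline by default, whereas a negated class such as `[^<]` does. The Python sketch below demonstrates that difference with Python's `re` module on a made-up two-line fragment; the sample markup and the pattern names are illustrative only and are simplified versions of the expressions in the scraper.

```python
import re

# Made-up fragment with a line break between the label and the value.
page = (
    '<div class="label">Originaltitel:</div>\n'
    '<div class="value"><strong>Batman</strong></div>'
)

# Lazy dot: cannot cross the newline unless the DOTALL flag is set.
LAZY_DOT = r'Originaltitel:</div>.*?<strong>(.*?)</strong>'

# Negated classes: [^<] and [^>] match newlines as well.
NEGATED = r'Originaltitel:</div>[^<]*<div[^>]*><strong>([^<]*)</strong>'

lazy_default = re.search(LAZY_DOT, page)
lazy_dotall = re.search(LAZY_DOT, page, re.DOTALL)
negated = re.search(NEGATED, page)

print(lazy_default)          # None: '.' stops at the line break
print(lazy_dotall.group(1))  # Batman
print(negated.group(1))      # Batman
```

If the scraper's engine behaves like the default case here, replacing `.*?` with negated classes would be expected to help on multi-line markup, which is consistent with what was observed above.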

- jojje - 2010-04-20

...

RE: Lovefilm.se (Swedish) scraper - search uses javascript, can that be bypassed? - PrimaryMaster - 2012-10-30

Huh