Rotten Tomatoes Scraper
#1
Hello All,

I'm working on creating a Rotten Tomatoes scraper and I've hit a small hurdle. RT redirects based on your IP address, so I only have access to the Australian site. I'm not sure if there's a proxy I could use to get around it somehow, but I thought I'd ask for some help from some of you US and UK folks.

There will be a setting in the scraper to change which rating (MPAA, OFLC or BBFC) is retrieved, so I mainly need some localized samples of HTML from the RT site. The section of code I am after is:

Code:
<div id="movie_stats">
    <div class="fl">
      <p>
        <span class="label">Australian Rating:</span>
        <span class="content">M <a style="font-weight: normal;" class="movie_rating_reason" id="movie_rating_work" href="javascript:void(0);">[See Full Rating]</a>
            <span class="movie_rating_reason" style="display: none"> Frequent action violence</span>        </span>

      </p><p><span class="label">Runtime:</span> <span class="content">2 hrs 33 mins</span></p><p><span class="label">Genre:</span> <span class="content"><a href="/movie/browser.php?genre=200001">Action/Adventure</a></span></p>    </div>
    <div class="fl">
      <p><span class="label">Australian Theatrical Release:</span><br /><span class="content">Jul 16, 2008 Wide</span></p>                  <p><span class="label">US Box Office:</span> <span class="content"><a href="/m/the_dark_knight/numbers.php">$533,316,061</a></span></p>    </div>

  </div>
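
For anyone checking a saved sample against this structure, here is a minimal Python sketch (not part of the scraper) that pulls the rating label, the rating and the rating reason out of the `movie_stats` block. The function name `parse_movie_stats` and the trimmed `SAMPLE` string are made up for illustration, and the pattern is an assumption based only on the Australian HTML above, so the US and UK pages may need adjustments.

```python
import re

# Trimmed copy of the Australian sample above (illustration only).
SAMPLE = '''<div id="movie_stats">
    <div class="fl">
      <p>
        <span class="label">Australian Rating:</span>
        <span class="content">M <a style="font-weight: normal;" class="movie_rating_reason" id="movie_rating_work" href="javascript:void(0);">[See Full Rating]</a>
            <span class="movie_rating_reason" style="display: none"> Frequent action violence</span>        </span>
      </p>
    </div>
</div>'''

# Hypothetical pattern: label span, rating text, the "[See Full Rating]" link,
# then the hidden span that holds the reason.
STATS_RE = re.compile(
    r'<span class="label">([^<]*Rating):</span>\s*'
    r'<span class="content">([^<]*)<a[^>]*>\[See Full Rating\]</a>\s*'
    r'<span class="movie_rating_reason"[^>]*>([^<]*)</span>'
)

def parse_movie_stats(html):
    """Return (label, rating, reason), or None if the block is not found."""
    match = STATS_RE.search(html)
    if match is None:
        return None
    label, rating, reason = (part.strip() for part in match.groups())
    return label, rating, reason

print(parse_movie_stats(SAMPLE))
```

Matching on a label that ends in "Rating" rather than on "Australian Rating" is a guess at what would carry over to the other localized pages.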

For those wanting a progress report I have Synopsis, Running Time, Year, Director, Actors, Rating (RT Score in %) and Votes all scraping successfully. I also have Australian Ratings and Rating Reason working.

Any help with the HTML samples is appreciated. Thanks.
#2
I found out the computers at work use a Singaporean IP, so I've saved some HTML samples from the UK and US Rotten Tomatoes sites. I'll post back when the scraper is ready for testing.
#3
Okay, I've got a pretty good skeleton of a scraper going here, with no fanart or thumbs yet. I'm slightly at a loss as to how to tackle that, as other scrapers appear to use the IMDB id to fetch artwork. I can do the reverse (i.e., look up an RT movie from its IMDB id with http://www.rottentomatoes.com/alias?type=imdbid&s=[imdb id #]), but RT movie pages do not contain the IMDB id or a way to look it up from within the site. So should I:

1) rewrite the scraper to look up the IMDB id with CreateSearchUrl and call all the RT stuff with functions?

or

2) look up the IMDB title with a call to Google or the like? I think PTGate looks up the IMDB id like this if the movie's page doesn't include it.

Both methods seem less than ideal, but they're the only options I can think of right now.
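
As a small illustration of the reverse lookup mentioned above, this Python sketch builds the RT alias URL from an IMDB id. The helper name `rt_alias_url` is hypothetical, and the id is passed through unchanged because the thread does not say whether RT expects the "tt" prefix.

```python
from urllib.parse import urlencode

# Alias endpoint quoted in the post above.
RT_ALIAS_URL = "http://www.rottentomatoes.com/alias"

def rt_alias_url(imdb_id):
    """Build the RT alias lookup URL for an IMDB id (id is used as given)."""
    return RT_ALIAS_URL + "?" + urlencode({"type": "imdbid", "s": imdb_id})

# "1234567" is a placeholder id, not a real movie.
print(rt_alias_url("1234567"))
```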

Here's what I've done so far anyway:

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper name="Rotten Tomatoes 0.5" date="2009-08-05" content="movies" framework="1.0" thumb="rottentomatoes.png" language="">
  <GetSettings dest="3">
    <RegExp input="$$5" output="&lt;settings&gt;\1&lt;/settings&gt;" dest="3">
      <RegExp input="$$1" output="&lt;setting label=&quot;Location&quot; type=&quot;labelenum&quot; values=&quot;us|au|uk&quot; id=&quot;locality&quot; default=&quot;au&quot;&gt;&lt;/setting&gt;" dest="5">
        <expression></expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;setting label=&quot;Retrieve Classification Reason&quot; type=&quot;bool&quot; id=&quot;classreason&quot; default=&quot;false&quot;&gt;&lt;/setting&gt;" dest="5+">
        <expression></expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSettings>
  <NfoUrl dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression noclean="1">(http://$INFO[locality]\.rottentomatoes\.com/m/[A-Za-z0-9_]*)</expression>
    </RegExp>
  </NfoUrl>
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://$INFO[locality].rottentomatoes.com/search/full_search.php?search=\1" dest="3">
      <expression noclean="1"></expression>
    </RegExp>
  </CreateSearchUrl>
  <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 (\3)&lt;/title&gt;&lt;url&gt;http://$INFO[locality].rottentomatoes.com/m/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;a href=&quot;/m/([^&quot;]*)&quot;&gt;([^&lt;]*).*?([0-9]{4})</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetSearchResults>
  <GetDetails dest="3">  
    <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
      <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;originaltitle&gt;\1&lt;/originaltitle&gt;&lt;year&gt;\2&lt;/year&gt;" dest="8">
        <expression noclean="1" trim="1">&lt;h1.class=&quot;movie_title clearfix&quot;&gt;([\S\s]*)\(([0-9]{4})\)&lt;/h1&gt;[\S\s]*dialog_content clearfix</expression>
      </RegExp>
      
      <RegExp input="$$7" output="&lt;director&gt;\1&lt;/director&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">&lt;p class=&quot;movie_crew_shortened[\S\s]*Director:([\S\s]*)movie_crew_all</expression>
        </RegExp>
        <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
      </RegExp>
      <RegExp conditional="!classreason" input="$$1" output="&lt;mpaa&gt;\1&lt;/mpaa&gt;" dest="8+">
        <expression>&lt;div id=&quot;movie_stats&quot;&gt;[\S\s]*&lt;span class=&quot;content&quot;&gt;([^&lt;]*)[\S\s]*\[See.Full.Rating\]</expression>
      </RegExp>
      <RegExp conditional="classreason" input="$$1" output="&lt;mpaa&gt;\1 \2&lt;/mpaa&gt;" dest="8+">
        <expression>&lt;div id=&quot;movie_stats&quot;&gt;[\S\s]*&lt;span class=&quot;content&quot;&gt;([^&lt;]*)[\S\s]*\[See.Full.Rating\][\S\s]*movie_rating_reason&quot;.style=&quot;display:.none&quot;&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="8+">
          <expression>Runtime:[^0-9]*([^&lt;]*)</expression>
      </RegExp>
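      <!-- Looks like a leftover from another scraper: RT pages are unlikely to contain this culturalianet path, so this probably never matches -->
      <RegExp input="$$1" output="&lt;thumb&gt;&lt;url spoof=&quot;http://www.culturalianet.com&quot;&gt;http://www.culturalianet.com/imatges/articulos/\1-1.jpg&lt;/url&gt;&lt;/thumb&gt;" dest="8+">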
        <expression>imatges/articulos/([0-9]*)-</expression>
      </RegExp>
      <RegExp input="$$7" output="&lt;credits&gt;\1&lt;/credits&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">class=&quot;label&quot;&gt;Screenwriter:&lt;/span&gt;([\S\s]*)Story:&lt;/span&gt;</expression>
        </RegExp>
        <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
        <expression>&lt;li class=&quot;ui-tabs-selected&quot;&gt;&lt;a title=&quot;([0-9]{2,3})</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;votes&gt;\1&lt;/votes&gt;" dest="8+">
        <expression>&lt;p&gt;Reviews Counted: ([0-9]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
        <expression noclean="1">&lt;span.class=&quot;label&quot;&gt;Genre:&lt;/span&gt;.&lt;span class=&quot;content&quot;&gt;&lt;a.href=&quot;/movie/browser.php\?genre=[0-9]*&quot;&gt;([^&lt;]*)</expression>
      </RegExp>
      <RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;&lt;/role&gt;&lt;/actor&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">&lt;span class=&quot;label&quot;&gt;Starring:([\S\s]*)&lt;p class=&quot;movie_cast_all&quot;</expression>
        </RegExp>
        <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
      </RegExp>
      <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
        <expression>&lt;span id=&quot;movie_synopsis_all&quot; style=&quot;display: none;&quot;&gt;([\S\s]*)&lt;a href=&quot;#&quot; id=&quot;movie_synopsis_link</expression>
      </RegExp>
      <expression noclean="1"></expression>
    </RegExp>
  </GetDetails>
</scraper>
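
To see what the GetSearchResults expression captures, here is a hedged Python sketch that applies the same pattern (with the XML entities decoded) to a made-up results snippet. The `SAMPLE` markup and the `search_results` helper are assumptions for illustration; real RT search pages may be laid out differently, and Python's regex engine is only a stand-in for the one XBMC uses.

```python
import re

# Same pattern as the GetSearchResults expression, with XML entities decoded.
RESULT_RE = re.compile(r'<a href="/m/([^"]*)">([^<]*).*?([0-9]{4})')

# Hypothetical search-result markup; real RT pages may differ.
SAMPLE = (
    '<li><a href="/m/the_dark_knight/">The Dark Knight</a> <span>2008</span></li>\n'
    '<li><a href="/m/batman_begins/">Batman Begins</a> <span>2005</span></li>\n'
)

def search_results(html, locality="au"):
    """Return one dict per match, shaped like the scraper's <entity> output."""
    results = []
    for slug, title, year in RESULT_RE.findall(html):
        results.append({
            "title": "%s (%s)" % (title, year),
            "url": "http://%s.rottentomatoes.com/m/%s" % (locality, slug),
        })
    return results

for entry in search_results(SAMPLE):
    print(entry["title"], entry["url"])
```

Because the year group simply takes the first four digits after the title text, a result whose surrounding markup contains some other four-digit number before the year would be mislabelled, which is worth checking against real samples.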
#4
I went with the "fetch the IMDB id from Google" route.
Fanart and thumbs now working, the scraper is definitely getting there. Uses the new "includes" feature of latest SVN builds. If there's anything I've done that's not correct please let me know.

Code:
<?xml version="1.0" encoding="utf-8"?>
<scraper name="Rotten Tomatoes 0.6" date="2009-08-12" content="movies" framework="1.0" thumb="rottentomatoes.png" language="en">
    <include>common/tmdb.xml</include>
    <include>common/movieposterdb.xml</include>
<GetSettings dest="3">
    <RegExp input="$$5" output="&lt;settings&gt;\1&lt;/settings&gt;" dest="3">
        <RegExp input="$$1" output="&lt;setting label=&quot;Location&quot; type=&quot;labelenum&quot; values=&quot;us|au|uk&quot; id=&quot;locality&quot; default=&quot;au&quot;&gt;&lt;/setting&gt;" dest="5">
            <expression></expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;setting label=&quot;Retrieve Classification Reason&quot; type=&quot;bool&quot; id=&quot;classreason&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5+">
            <expression></expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;setting label=&quot;Retrieve Fanart&quot; type=&quot;bool&quot; id=&quot;fanart&quot; default=&quot;true&quot;&gt;&lt;/setting&gt;" dest="5+">
            <expression></expression>
        </RegExp>
        <expression noclean="1"></expression>
    </RegExp>
</GetSettings>
    
<NfoUrl dest="3">
    <RegExp input="$$1" output="\1" dest="3">
        <expression noclean="1">(http://$INFO[locality]\.rottentomatoes\.com/m/[A-Za-z0-9_]*)</expression>
    </RegExp>
</NfoUrl>
<CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://$INFO[locality].rottentomatoes.com/search/full_search.php?search=\1" dest="3">
        <expression noclean="1"></expression>
    </RegExp>
</CreateSearchUrl>
<GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
        <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 (\3)&lt;/title&gt;&lt;url&gt;http://$INFO[locality].rottentomatoes.com/m/\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
            <expression repeat="yes">&lt;a href=&quot;/m/([^&quot;]*)&quot;&gt;([^&lt;]*).*?([0-9]{4})</expression>
        </RegExp>
        <expression noclean="1"></expression>
    </RegExp>
</GetSearchResults>

<GetDetails dest="3">  
    <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
        <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;&lt;originaltitle&gt;\1&lt;/originaltitle&gt;" dest="6">
            <expression noclean="1" trim="1">&lt;h1.class=&quot;movie_title clearfix&quot;&gt;([\S\s]*)\([0-9]{4}\)&lt;/h1&gt;[\S\s]*dialog_content clearfix</expression>
        </RegExp>
        <RegExp dest="8" input="$$6" output="\1">
            <expression noclean="1"></expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;year&gt;\1&lt;/year&gt;" dest="9">
            <expression noclean="1">&lt;h1.class=&quot;movie_title clearfix&quot;&gt;[\S\s]*\(([0-9]{4})\)&lt;/h1&gt;[\S\s]*dialog_content clearfix</expression>
        </RegExp>
        <RegExp dest="8+" input="$$9" output="\1">
            <expression noclean="1"></expression>
        </RegExp>
        <RegExp input="$$7" output="&lt;director&gt;\1&lt;/director&gt;" dest="8+">
            <RegExp input="$$1" output="\1" dest="7">
                <expression noclean="1">&lt;p class=&quot;movie_crew_shortened[\S\s]*Director:([\S\s]*)movie_crew_all</expression>
            </RegExp>
            <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
        </RegExp>
        <RegExp conditional="!classreason" input="$$1" output="&lt;mpaa&gt;\1&lt;/mpaa&gt;" dest="8+">
            <expression>&lt;div id=&quot;movie_stats&quot;&gt;[\S\s]*&lt;span class=&quot;content&quot;&gt;([^&lt;]*)[\S\s]*\[See.Full.Rating\]</expression>
        </RegExp>
        <RegExp conditional="classreason" input="$$1" output="&lt;mpaa&gt;\1 \2&lt;/mpaa&gt;" dest="8+">
            <expression>&lt;div id=&quot;movie_stats&quot;&gt;[\S\s]*&lt;span class=&quot;content&quot;&gt;([^&lt;]*)[\S\s]*\[See.Full.Rating\][\S\s]*movie_rating_reason&quot;.style=&quot;display:.none&quot;&gt;([^&lt;]*)</expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;runtime&gt;\1&lt;/runtime&gt;" dest="8+">
            <expression>Runtime:[^0-9]*([^&lt;]*)</expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;studio&gt;\1&lt;/studio&gt;" dest="8+">
            <expression>&lt;span class=&quot;label&quot;&gt;Studio:&lt;/span&gt;([^&lt;]*)</expression>
        </RegExp>
        <RegExp input="$$7" output="&lt;credits&gt;\1&lt;/credits&gt;" dest="8+">
            <RegExp input="$$1" output="\1" dest="7">
                <expression noclean="1">class=&quot;label&quot;&gt;Screenwriter:&lt;/span&gt;([\S\s]*)Story:&lt;/span&gt;</expression>
            </RegExp>
            <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;rating&gt;\1&lt;/rating&gt;" dest="8+">
            <expression>&lt;li class=&quot;ui-tabs-selected&quot;&gt;&lt;a title=&quot;([0-9]{2,3})</expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;votes&gt;\1&lt;/votes&gt;" dest="8+">
            <expression>&lt;p&gt;Reviews Counted: ([0-9]*)</expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
            <expression noclean="1">&lt;span.class=&quot;label&quot;&gt;Genre:&lt;/span&gt;.&lt;span class=&quot;content&quot;&gt;&lt;a.href=&quot;/movie/browser.php\?genre=[0-9]*&quot;&gt;([^&lt;]*)</expression>
        </RegExp>
        <RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;&lt;/role&gt;&lt;/actor&gt;" dest="8+">
            <RegExp input="$$1" output="\1" dest="7">
                <expression noclean="1">&lt;span class=&quot;label&quot;&gt;Starring:([\S\s]*)&lt;p class=&quot;movie_cast_all&quot;</expression>
            </RegExp>
            <expression repeat="yes" noclean="1">&lt;a.href=&quot;[^&gt;]*&gt;([A-Za-z ]*)</expression>
        </RegExp>
        <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
            <expression>&lt;span id=&quot;movie_synopsis_all&quot; style=&quot;display: none;&quot;&gt;([\S\s]*)&lt;a href=&quot;#&quot; id=&quot;movie_synopsis_link</expression>
        </RegExp>
        <!-- Because we don't know the IMDB id and it's required for many Artwork lookups we search Google -->
        <RegExp dest="8+" input="$$10" output="\1">
            <RegExp dest="10" input="$$12" output="&lt;url function=&quot;GetIMDBfromGoogle&quot;&gt;http://www.google.com/search?q=site:imdb.com\1&lt;/url&gt;">
                <!-- remove spaces and , from title-->
                <RegExp input="$$6" output="+\1" dest="12">
                    <expression repeat="yes">([^ ,]+)</expression>
                </RegExp>
                <!-- add year to search string -->
                <RegExp input="$$9" output="+\1" dest="12+">
                    <expression></expression>
                </RegExp>
                <expression></expression>
            </RegExp>
            <expression noclean="1"></expression>
        </RegExp>

        <expression noclean="1"></expression>
    </RegExp>
</GetDetails>

<GetIMDBfromGoogle clearbuffers="no" dest="3">
    <RegExp input="$$5" dest="3" output="&lt;details&gt;\1&lt;/details&gt;">
        <!-- Store IMDB ID in buffer 12 for Artwork Lookups -->
        <RegExp input="$$1" dest="12" output="\1">
            <expression>www\.imdb\.com/title/(tt[0-9]*)/\s</expression>
        </RegExp>
        <!-- Encapsulate into id tags and append url -->
        <RegExp input="$$12" dest="5" output="&lt;id&gt;www\.imdb\.com/title/\1&lt;/id&gt;">
            <expression></expression>
        </RegExp>
        <!-- TMDB Thumbs using IMDB ID -->
        <RegExp input="$$12" output="&lt;url function=&quot;GetTMDBThumbsByIMDBId&quot; cache=&quot;tmdb-trans-\1.xml&quot;&gt;http://api.themoviedb.org/2.0/Movie.imdbLookup?imdb_id=\1&amp;amp;api_key=57983e31fb435df4df77afb854740ea9&lt;/url&gt;" dest="5+">
            <expression/>
        </RegExp>
        <!-- Fanart using IMDB ID -->
            <RegExp input="$$12" output="&lt;url function=&quot;GetTMDBFanartByIMDBId&quot;&gt;http://api.themoviedb.org/2.0/Movie.imdbLookup?imdb_id=$$12&amp;amp;api_key=57983e31fb435df4df77afb854740ea9&lt;/url&gt;" dest="5+">
                <expression></expression>
            </RegExp>
        <expression noclean="1"></expression>
    </RegExp>
</GetIMDBfromGoogle>
</scraper>
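
The Google step in the listing above can be sketched in Python as follows. This is only an approximation of the idea: it assumes the title and year are plain strings (in the scraper, buffers 6 and 9 also carry the surrounding XML tags), and the helper names `google_imdb_query` and `first_imdb_id` as well as the sample result text are invented for the example.

```python
import re

def google_imdb_query(title, year):
    """Build the site:imdb.com query: '+word' per title word, then '+year'."""
    # Each run of characters that is neither a space nor a comma becomes "+word".
    terms = "".join("+" + word for word in re.findall(r"[^ ,]+", title))
    return "http://www.google.com/search?q=site:imdb.com" + terms + "+" + year

# Same expression as in GetIMDBfromGoogle.
IMDB_ID_RE = re.compile(r"www\.imdb\.com/title/(tt[0-9]*)/\s")

def first_imdb_id(page):
    """Return the first IMDB id found in the page text, or None."""
    match = IMDB_ID_RE.search(page)
    return match.group(1) if match else None

print(google_imdb_query("The Dark Knight", "2008"))

# Hypothetical fragment of a results page; the real markup may differ.
print(first_imdb_id("The Dark Knight (2008) www.imdb.com/title/tt0468569/ - Cached"))
```

Note that the expression requires whitespace right after the trailing slash, so a result link followed directly by a quote or a tag would not match.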
#5
if you have a look at how e.g. filmdelta.xml uses google to translate to an imdb id, you can save yourself the whole GetIMDBfromGoogle function
#6
spiff Wrote:if you have a look at how e.g. filmdelta.xml uses google to translate to an imdb id, you can save yourself the whole GetIMDBfromGoogle function

edit: Ahh, I see what you mean now after having another look. Thanks. However doesn't this mean you cannot make all of the thumbs conditional, only the artwork lookup as a whole?
I modelled mine after the PTGate scraper.

I'm also having difficulty seeing what the cache feature is about. http://forum.xbmc.org/showthread.php?tid...ight=cache explains it a little bit but with the filmdelta scraper I don't know what file it is caching... is this a local file or a file from the 'net?
#7
it's a local file which the url in question is cached to. that way you don't have to query google twice, i.e. once for the thumbs and once for the fanart. by using a cached file the speed impact is minimal.

so if you want the thumbs you call GetTMDBThumbsByIMDBId on the google page, and if you want fanart you call GetTMDBFanartByIMDBId on the same page, setting the cache attribute on the google url so it is not fetched twice. the tmdb include does the same internally for the translation page, so we avoid that extra query as well.

this does not affect conditionality at all; you just make the expressions that add GetTMDBThumbsByIMDBId / GetTMDBFanartByIMDBId conditional, as you would otherwise
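
The caching idea described above can be illustrated with a short Python sketch. It is not how XBMC implements the cache attribute; the `fetch_cached` helper, the file name and the fake fetcher are all assumptions, chosen only to show that a second request for the same page is served from the local file.

```python
import os
import tempfile

def fetch_cached(url, cache_name, fetch, cache_dir):
    """Return the body for url, reusing cache_dir/cache_name when it exists."""
    path = os.path.join(cache_dir, cache_name)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    body = fetch(url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body)
    return body

calls = []

def fake_fetch(url):
    """Stand-in for a network request; records each call."""
    calls.append(url)
    return "<html>result page</html>"

with tempfile.TemporaryDirectory() as cache_dir:
    url = "http://example.invalid/search?q=some+movie"
    # First call (say, for thumbs) hits the "network" and writes the cache file.
    thumbs_page = fetch_cached(url, "google-some-movie.html", fake_fetch, cache_dir)
    # Second call (say, for fanart) is answered from the local file.
    fanart_page = fetch_cached(url, "google-some-movie.html", fake_fetch, cache_dir)

print(len(calls))
```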
#8
Thanks. I'll check it out and have a play round and see what I can work out.
#9
With regard to ratings, is there a way to display them without a decimal point, as I have asked in this thread: http://forum.xbmc.org/showthread.php?tid=56263
Thanks.
#10
Just wanted to throw some support towards this project - I love the idea of having the Tomatometer right in my XBMC interface.

What other info are you pulling from the scraper? Things like Gross $? Reviews?

Not sure how this would fit into the XBMC data schema, but I'm all about having some additional stuff available.

*Has dreams of sorting movies by box office take*

Blacklist
#11
blacklist Wrote:Just wanted to throw some support towards this project - I love the idea of having the Tomatometer right in my XBMC interface.

What other info are you pulling from the scraper? Things like Gross $? Reviews?

Not sure how this would fit into the XBMC data schema, but I'm all about having some additional stuff available.

*Has dreams of sorting movies by box office take*

Blacklist

At the moment: rating, genre, synopsis, actors, votes, classification & reason, and a couple of others. Look through the XML and you can see exactly what I'm grabbing. Reviews are not currently possible, but I plan on making different types of ratings (e.g., user, cream of the crop, etc.) available. There is currently capacity for only one rating, so I'll be making it settable by an option. Also, certain movies do not have complete information, so I may need to use imdb as a fallback if that is somehow possible.
#12
Just wanted to toss some support in too for this scraper. I hope to be able to sort movies by either IMDB or RT rating at some point in the near future. Great work (also on Alaska and UMM ;)
#13
rausch101 Wrote:Just wanted to toss some support in too for this scraper. I hope to be able to sort movies by either IMDB or RT rating at some point in the near future. Great work (also on Alaska and UMM Wink)

Thanks for the support. I've submitted the first fully working version to trac, and at the moment all of the info comes from RT. Extra fields would need to be added to enable multiple ratings, which is more than likely not the way the devs would like to go. I'm not so sure about it either. However, some features I would like to get going in the future are a "tomatometer" type bar for skins and the ability to see if a movie is "fresh", "rotten" or "certified fresh". But one step at a time ;)
#14
redtapemedia Wrote:Thanks for the support. I've submitted the first fully working version to trac, and at the moment all of the info comes from RT. Extra fields would need to be added to enable multiple ratings, which is more than likely not the way the devs would like to go. I'm not so sure about it either. However, some features I would like to get going in the future are a "tomatometer" type bar for skins and the ability to see if a movie is "fresh", "rotten" or "certified fresh". But one step at a time ;)

A "tomatometer" sounds great.

Now that you mention it, I imagine an entirely new XBMC-wide rating system integration that could allow for sorting would probably be a big task, so I'm fine waiting a while for that one :)
#15
Great idea!