Developing an Amazon Movie Scraper

jelockwood
Post: #21
C-Quel Wrote:try this....

http://pastebin.com/m657636a8

old but with cleanup + minor changes should work

Many, many, many thanks for pointing me towards this. I felt I was getting closer to getting the GetSearchResults working but had not yet succeeded. Yours does work.

For your information I found the following results from your unmodified Amazon scraper.

1. It successfully did a CreateSearchUrl
2. It only listed one result using its GetSearchResults
3. When that result was used it only succeeded in filling in "Studio", "Runtime" and obtaining the movie thumbnail.

So far I have 'improved' it by:

1. Listing the entire page of results returned by Amazon (a maximum of twelve results). This was done by adding a repeat command to your GetSearchResults.
2. Filling in the movie "Title", "Year" (the proper film year not the DVD year), and the "Plot".
3. Slightly changing the filename you used for the movie thumbnail to one that I believe will still return a result in the few cases where yours might not. My modified filename always returns the largest available artwork (usually 500 pixels), whereas yours only fetches 500-pixel-tall artwork, and I believe a few DVDs do not have artwork that big.

I have also added/changed the code to do "Directors" and "Actors", but this is not working yet. My current, non-working approach is to use a first regex to get the block listing all the actors or directors, and then a second regex that is supposed to extract the individual names from that block.

My efforts so far can be obtained from the following link

http://homepage.mac.com/jelockwood/.Publ...ustest.zip

As far as I can see, the Amazon product page has no usable information for the MPAA rating (Amazon use a GIF and no text), nor for a tagline, summary, genre, or writer. There might be a way of getting a rating (that is, the reader score) by matching the following text: align="absbottom" alt="4.5 out of 5 stars" height="12". Note: this text occurs several times in a product page, and we would always want only the first occurrence.
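As a rough illustration of the reader-score idea, here is a minimal Python sketch (not the scraper's own XML). Only the alt text comes from the post; the surrounding `<img>` markup and the names `RATING_RE` and `first_rating` are assumptions made for the example.

```python
import re

# Hypothetical page fragment: only the align/alt/height text is from the post,
# the <img> wrapper and the second, lower score are made up for illustration.
html = (
    '<img align="absbottom" alt="4.5 out of 5 stars" height="12">'
    '<img align="absbottom" alt="3.0 out of 5 stars" height="12">'
)

RATING_RE = re.compile(r'alt="([0-9.]+) out of 5 stars"')

def first_rating(page):
    """Return the first star rating found on the page as a float, or None."""
    match = RATING_RE.search(page)  # search() stops at the first occurrence
    return float(match.group(1)) if match else None

print(first_rating(html))
```

Because `search()` returns only the first match, later copies of the same text further down the page are ignored.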

If anyone else would like to help out it would be much appreciated. In particular getting the actor(s) and director(s) working is a priority.

For everyone's benefit this is what the block containing all the actors looks like

Code:
<li> <b>Actors:</b> <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Charlton%20Heston">Charlton Heston</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Edward%20G.%20Robinson">Edward G. Robinson</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Dick%20Van%20Patten">Dick Van Patten</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Chuck%20Connors">Chuck Connors</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Joseph%20Cotten">Joseph Cotten</a></li>

And the Directors block is virtually identical

Code:
<li> <b>Directors:</b> <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Richard%20Fleischer">Richard Fleischer</a></li>
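The two-stage approach described above (one regex to isolate the block, a second to pull the names out of it) could look something like the following Python sketch. The sample markup is copied from the two blocks above with the actor list shortened; the function name `people` is hypothetical, and this is not the scraper XML itself.

```python
import re

# Sample markup taken from the Actors and Directors blocks (actor list shortened).
PAGE = (
    '<li> <b>Actors:</b> '
    '<a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Charlton%20Heston">Charlton Heston</a>, '
    '<a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Edward%20G.%20Robinson">Edward G. Robinson</a></li>\n'
    '<li> <b>Directors:</b> '
    '<a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Richard%20Fleischer">Richard Fleischer</a></li>'
)

# Stage 2 pattern: the visible text of each link inside the block.
NAME_RE = re.compile(r'<a [^>]*>([^<]+)</a>')

def people(page, label):
    """Return the names listed under '<b>label:</b>' as a list."""
    # Stage 1: grab everything between the label and the closing </li>.
    block = re.search(r'<b>' + re.escape(label) + r':</b>(.*?)</li>', page, re.S)
    if block is None:
        return []
    # Stage 2: extract each individual name from that block only.
    return NAME_RE.findall(block.group(1))

print(people(PAGE, "Actors"))
print(people(PAGE, "Directors"))
```

Keeping the second regex confined to the captured block stops it from picking up unrelated links elsewhere on the page.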
jelockwood
Post: #22
I made good progress today and have now managed to also get "Directors", "Actors", "Rating", and "Votes" all working!

I have even partially managed to get MPAA (a.k.a. Certification) working. I say partially because the only text to match in the Amazon HTML code is in lowercase, and I would prefer to return and display it in block capitals, as is more usual. Does anyone have any suggestions on how to use regex to convert to uppercase? Currently it returns "pg" rather than the desired "PG".

With this done the scraper will be pretty much finished in that I will have done as many fields as Amazon provide. I will then do more extensive testing (with more titles), and then make a second version for Amazon UK.
jelockwood
Post: #23
(There does not seem to be a way to edit one's own posts.)

I have now come up with a way of returning MPAA ratings in uppercase. It is rather brute force: I had to write a copy of the regex for each MPAA rating, so there are multiple copies for MPAA, but only one will ever match. Then, rather than using the usual \1, I hard-coded the text to return, e.g. PG-13.
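The brute-force idea can be sketched in Python as a list of one pattern per rating, each paired with hard-coded display text. The `rating-….gif` filenames are purely hypothetical (the thread does not show the actual Amazon markup), as are the names `RATING_RULES` and `certification`. In Python one could simply call `.upper()` on the match; the sketch deliberately mirrors the regex-only approach described in the post.

```python
import re

# One pattern per rating, each paired with the hard-coded text to return.
# Longer names come first so that "pg-13" is not mistaken for "pg".
# The "rating-<name>" strings are hypothetical, not Amazon's real markup.
RATING_RULES = [
    (re.compile(r'rating-nc-17\b'), "NC-17"),
    (re.compile(r'rating-pg-13\b'), "PG-13"),
    (re.compile(r'rating-pg\b'), "PG"),
    (re.compile(r'rating-g\b'), "G"),
    (re.compile(r'rating-r\b'), "R"),
]

def certification(page):
    """Return the display text of the first rule that matches, or None."""
    for pattern, text in RATING_RULES:
        if pattern.search(page):
            return text
    return None

print(certification('<img src="rating-pg-13.gif">'))
```

Supporting another country's certifications would then mean appending more (pattern, text) pairs to the list.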

This now looks as complete as is possible with the information Amazon provide.


Hmm, just had a thought: even when one searches Amazon US it can return UK results (and vice versa), although it will generally list the US results first as better matches. However, if one searches Amazon US and happens to select a UK DVD, one would presumably get a UK certification rather than an MPAA rating. I will be able to 'fix' this by adding yet more hard-coded copies for both types of certification. This will be useful anyway, since I want to do a UK version as well. I will probably leave it as an exercise for the reader to do Germany, France, Australia, etc.
C-Quel
Post: #24
Keep up the good work, nice to see more hands on deck :)

jelockwood
Post: #25
The Amazon US scraper looks done now, and I have converted it to do Amazon UK. This was not as straightforward as it sounds, as there were more differences than you might think, even such minor ones as two spaces where the US one had a single space.

Despite this, I have managed to get all the equivalent fields working as in the US one, with one exception: the "Plot" field. Unfortunately, Amazon UK make this more difficult in two ways. Firstly, the most appropriate block of text is much longer than would be preferred. Secondly, it has HTML formatting mixed in with the text. As far as I can see, the scraper would clean up and remove that HTML formatting, but I am struggling to write a regex that extracts the text, since the embedded HTML makes matching more difficult.

Here is what the raw section looks like:

Code:
<b>Amazon.co.uk Review</b><br />
  While <I>Soylent Green</I> may be one of the many <a href="/exec/obidos/tg/feature/-/298867/${0}">dystopian visions</a> of the future, the film stands out because it's one of the few titles that addresses current environmental issues head on. Adapted from Harry Harrison's novel <I>Make Room, Make Room</I>, it gives us a nightmarish vision of an over-populated, polluted future on the brink of collapse--a vision that gets uncomfortably closer every year. Charlton Heston as police officer Thorn investigates a murder in between suppressing food riots and uncovers the nightmarish truth about Soylent Green, the new foodstuff being sold to the poor. <p> The film neatly combines police procedural with conspiracy thriller. Heston's scenes are counterpointed by more elegiac ones in which the centenarian Edward G Robinson as his friend Sol broods on the world he has outlived--his death in a euthanasia chamber is a gloriously lachrymose moment, which he plays to the hilt. Heston, too, is good as Thorn, a morally equivocal cop who loots the apartments of the victims whose deaths he investigates--he's a man just getting by in an impossible world. <p> <B>On the DVD:</B> <I>Soylent Green</I> on disc comes with a commentary from director Richard Fleischer, the highpoint of which is a memorable description of what it was like to work with the brilliant ailing, entirely deaf Robinson. He is joined by Leigh Taylor-Young whose work on the film as heroine led to years of serious environmentalist commitment. It has a useful contemporary making-of documentary and touching shots of Robinson's 100th birthday party with telegrams from Sinatra and others. The feature itself is presented in anamorphic widescreen with its original mono sound. --<I>Roz Kaveney</I>
  
<br /><br />

(It will be easier to view if you copy and paste it into an editor.)

All I need is a regex for the above and then both Amazon scrapers can be released.

Note: The URL of the original Amazon UK page that the above came from is

http://www.amazon.co.uk/Soylent-Green-Ch...117&sr=1-1
w00dst0ck
Post: #26
This works for me:

Code:
Review</b><br />(.*)--<I
w00dst0ck
Post: #27
I tested with some titles at amazon.co.uk and found that they sometimes use --<i> and sometimes --<I> at the end of the review.

Try this regex and clean it afterwards:
Code:
Review</b><br />(.*?)--<
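To show what "clean it afterwards" might involve, here is a small Python sketch that applies the regex above to a shortened copy of the Soylent Green review and then strips the leftover tags. The helper name `plot` and the tag-stripping pattern are assumptions for the example; `re.S` is needed in Python because the review starts on a new line, and whether the scraper's regex engine needs an equivalent is not something the thread states.

```python
import re

# Shortened copy of the raw Amazon.co.uk review section quoted earlier in the thread.
RAW = (
    '<b>Amazon.co.uk Review</b><br />\n'
    '  While <I>Soylent Green</I> may be one of the many dystopian visions of the '
    'future, it gives us a nightmarish vision of an over-populated, polluted future '
    'on the brink of collapse--a vision that gets uncomfortably closer every year. '
    '<p> The film neatly combines police procedural with conspiracy thriller. '
    '--<I>Roz Kaveney</I>\n'
    '  \n<br /><br />'
)

# Non-greedy capture up to the first "--<", which covers both --<i> and --<I>.
REVIEW_RE = re.compile(r'Review</b><br />(.*?)--<', re.S)
TAG_RE = re.compile(r'<[^>]+>')

def plot(page):
    """Return the review text without HTML tags or author, or None if absent."""
    match = REVIEW_RE.search(page)
    if match is None:
        return None
    text = TAG_RE.sub('', match.group(1))  # drop <I>, <p>, <a ...> and so on
    return ' '.join(text.split())          # collapse newlines and double spaces

print(plot(RAW))
```

Note that a plain "--" inside the review (as in "collapse--a vision") does not end the match, because the terminator requires the following "<".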
jelockwood
Post: #28
w00dst0ck Wrote:Tested with some titles at amazon.co.uk and found out that they use sometimes --<i> and --<I> at the end of the review.

Try this regex and clean it afterwards:
Code:
Review</b><br />(.*?)--<

Sorry for the delay in replying; I have been busy doing other things. Your code above does indeed seem to work. However, before I came back and checked, I had come up with this, which also seems to work:

Code:
<b>Amazon.co.uk Review</b><br />\n  ([^\n]*)

I could strip off the author bit the same way you do, like so:

Code:
<b>Amazon.co.uk Review</b><br />\n  ([^\n]*)--<

However, I would not be surprised if some entries have no author listed, in which case it would fail to match.
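One way around that concern is to make the author part optional. The Python sketch below does so on two made-up one-sentence reviews; the "no author" page is hypothetical (the thread does not confirm such pages exist), and the name `review_line` is an assumption.

```python
import re

# Same idea as the single-line pattern above, but the "--<author" tail is optional.
REVIEW_LINE_RE = re.compile(
    r'<b>Amazon\.co\.uk Review</b><br />\n  '  # heading, newline, two-space indent
    r'([^\n]*?)'                               # review text, as little as possible
    r'(?:--<[^\n]*)?'                          # optional author tail on the same line
    r'(?:\n|\Z)'                               # end of the review line
)

def review_line(page):
    """Return the review text with any trailing author removed, or None."""
    match = REVIEW_LINE_RE.search(page)
    return match.group(1).strip() if match else None

# Made-up sample pages: one with an author credit, one (hypothetical) without.
WITH_AUTHOR = '<b>Amazon.co.uk Review</b><br />\n  A fine film. --<I>Roz Kaveney</I>\n  \n<br /><br />'
NO_AUTHOR = '<b>Amazon.co.uk Review</b><br />\n  A fine film.\n  \n<br /><br />'

print(review_line(WITH_AUTHOR))
print(review_line(NO_AUTHOR))
```

The captured text may still contain HTML tags, so it would need the same cleanup as any other variant.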

Let me know if you see any drawbacks with mine.
w00dst0ck
Post: #29
Do you have an example url for that?
jelockwood
Amazon Scrapers now available to download
Post: #30
I finally got round to setting up a download page for the Amazon video Scrapers I have written (as discussed in this thread).

There is one for Amazon.com and a matching one for Amazon.co.uk. Just to make it clear, these are video scrapers, not music or TV scrapers. You can download them as a Zip file from the link below; the Zip file includes the matching scraper logos as well.

These have been tested on a Mac but should work on all XBMC platforms.

Note: I still recommend that people normally use the standard IMDB scraper first, but if, like me, you have some DVDs which IMDB does not list (e.g. Simpsons Christmas Special), then these scrapers will help.

Here is the download link http://homepage.mac.com/jelockwood/scrapers.html